When you turn on your favorite television show, you know it will soon be interrupted by a commercial about a newly developed “wonder drug” that is touted as the cure for whatever ails you. And inevitably, at the end of the recitation of its amazing benefits comes the litany of awkward side effects.

In the world of electronic discovery, predictive coding has been proclaimed the new wonder drug — a virtual panacea. Most attorneys involved in large-scale document reviews can recite its benefits, gained either from direct experience with the methodology or at least from a presentation given by their favorite ESI vendor.

The active ingredient in this wonder drug is artificial intelligence (AI), introduced into the ESI industry in the late 1990s by companies such as Attenex and DolphinSearch. These new AI-driven search capabilities gave attorneys their first window into the exciting advances technology could bring to their mundane, time-consuming, and expensive document review processes.

Over the years, this wonder drug evolved into what is now known as Technology-Assisted Review (TAR). TAR-based review platforms may include the concept search or classification capabilities found in earlier applications, but add the big-daddy machine learning tool, predictive coding.

There are literally hundreds of articles and white papers espousing the virtues of predictive coding in excruciating detail. They explain the science behind it, the algorithms and statistical sampling methodologies used, and how systems are trained to categorize documents.

What you won’t find is that tell-all list of the ugly “side effects.” Through all the fanfare and noise, it is important to understand that predictive coding is not the cure-all for every ESI project. Every case is different, and no matter how widely praised, predictive coding has its limitations and is not suited for every situation.

It is no wonder that one rarely hears about the downsides of predictive coding. Most of the materials on the internet regarding its usage are prepared by ESI vendors who offer predictive coding in their platforms. Because they have a dog in the fight, they tout the benefits that can be gained from this miracle technology while downplaying or altogether ignoring the drawbacks. Pictures of ESI specialists walking into the sunset after a successful day of teaching the computer how to code documents come to mind.

We, however, intend to show you not only that beautiful TAR sunset, but also that pesky list of limitations so that you can make a well-informed decision about when predictive coding is suitable for your ESI project.

The Good:  Predictive Coding as the Miracle Cure
  1. It saves money.  If used properly, predictive coding can greatly reduce the volume of documents that must be reviewed by humans: attorneys review a statistically valid sample and the system categorizes the rest, saving both time and money. Teams of costly staff attorneys no longer need to lay eyes on every single document. This is the predominant rationale for using predictive coding in the industry today.
  2. It easily handles large volumes.  Predictive coding provides an excellent solution for making decisions about extensive amounts of data that require a simple binary choice: responsive or non-responsive. Once trained properly, the machine can literally categorize millions of documents in a fraction of the time required for manual coding.
  3. It allows for iterative learning.  Recent versions of predictive coding, sometimes called TAR 2.0, enable the machine’s algorithms to learn dynamically from reviewer decisions as the review proceeds. This iterative learning process requires ongoing reviewer input for calibration, but it may lead to better overall results, and the newer TAR 2.0 workflows have also reduced the set-up time required for effective training. (A simplified sketch of this kind of loop appears after this list.)
  4. It provides consistency of coding. Without question, humans are more fallible than machines. Thus, the use of predictive coding provides more consistency in the review process. As explained in the 2016 Tax Court decision Dynamo Holdings Ltd. Partnership v. CIR, “[H]uman review is far from perfect… [I]f two sets of human reviewers review the same set of documents to identify what is responsive, research shows that those reviewers will disagree with each other on more than half of the responsiveness claims.”[1]
  5. It ensures quality control. The statistical calculations used to define the scope and execution of a predictive coding project allow for reproducible measurements of results, such as recall and precision on validation samples, adding to the defensibility of the method.
  6. It is accepted by the courts.  Predictive coding is widely accepted by the courts as an effective and defensible ESI methodology. Since US Magistrate Judge Andrew Peck’s vocal acceptance of the use of machine learning in 2011[2], there have been dozens of cases recognizing it as a proper method of review.[3]
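To make the points above a little more concrete, the short sketch below illustrates, in miniature, the ideas behind items 2, 3, and 5: a binary responsive/non-responsive text classifier, an iterative “review a little, retrain, repeat” loop in the spirit of TAR 2.0, and a recall estimate drawn from a random sample. It is only a toy example built on common open-source components (scikit-learn’s TfidfVectorizer and LogisticRegression) with invented documents; commercial predictive coding engines use their own proprietary algorithms and workflows, so treat this as a conceptual sketch rather than a description of how any particular platform works.

```python
import random

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical review population: each entry is (text, true label), where the
# true label stands in for the answer a human reviewer would give. Labels are
# included here only so the reviewer can be simulated.
corpus = [
    ("pricing agreement signed with acme for q3 widgets", 1),
    ("lunch order for the team meeting on friday", 0),
    ("draft contract terms and rebate schedule for acme", 1),
    ("holiday schedule and parking reminders", 0),
    ("negotiation notes on widget pricing and discounts", 1),
    ("fantasy football league standings", 0),
] * 50  # pretend this is a much larger collection
random.shuffle(corpus)

texts = [text for text, _ in corpus]
truth = [label for _, label in corpus]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

# Seed set: a handful of documents an attorney has already coded by hand.
responsive = [i for i, y in enumerate(truth) if y == 1]
non_responsive = [i for i, y in enumerate(truth) if y == 0]
reviewed = set(responsive[:5] + non_responsive[:5])

# Iterative loop (TAR 2.0 flavor): train on everything coded so far, ask the
# "reviewer" about the documents the model is least sure of, and retrain.
model = LogisticRegression(max_iter=1000)
for _ in range(3):
    rows = sorted(reviewed)
    model.fit(X[rows], [truth[i] for i in rows])
    probabilities = model.predict_proba(X)[:, 1]
    unreviewed = [i for i in range(len(texts)) if i not in reviewed]
    # The ten documents closest to the 0.5 decision boundary go to the reviewer.
    uncertain = sorted(unreviewed, key=lambda i: abs(probabilities[i] - 0.5))[:10]
    reviewed.update(uncertain)

# The trained model then categorizes the remaining population in bulk.
predicted = model.predict(X)

# Quality control: estimate recall from a random sample coded by a human.
sample = random.sample(range(len(texts)), 60)
found = sum(1 for i in sample if truth[i] == 1 and predicted[i] == 1)
relevant = sum(1 for i in sample if truth[i] == 1)
print(f"Estimated recall from the sample: {found / max(relevant, 1):.0%}")
```

Even in this toy setting, the pattern mirrors the workflow described above: humans code a seed set, the model suggests which documents to look at next, and a random sample provides a reproducible quality-control measurement.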
The Bad:  Why Predictive Coding Might Not Be Right for You
  1. Machine decision-making is not transparent. Although humans train the machine for categorization purposes, the “how” is often hidden away deep in a proprietary black box of algorithms and statistical modeling. Certainly, reasonable assumptions can be made about some of the terms the system deemed relevant and related, but these algorithms are not designed for the typical layperson to understand. Unlike keywords, where you can see the actual language searched in the documents, predictive coding is much less transparent.
  2. Setup takes much longer. Predictive coding requires a considerable investment of time to train the system to categorize documents correctly. Training sets can be as large as 2,000 to 20,000 documents, depending on the specifics of the case. While TAR 2.0 somewhat reduces this volume, it requires more manual-review time for the machine learning to be effective. And like anything else, garbage in yields garbage out: if the setup is faulty, the process will produce poor results and the entire exercise must be repeated. Moreover, this up-front investment generally requires the person most knowledgeable about the facts to make the early decisions, often a more senior attorney whose time is both limited and expensive.
  3. The workflow is limited. A big ESI project may have data sources added on a rolling basis from custodians in different positions, departments, or divisions. This often means the original training set does not cover new concepts introduced by those later custodians, and the training must be revisited and the system retrained to account for such changes as the matter evolves.
  4. Machines understand logic only. No matter how intelligent, a computer program is ultimately a logical machine that cannot make nuanced content distinctions. For example, predictive coding cannot readily determine whether a document contains privileged information. Additionally, documents that have limited text or structured data, such as images or database files, are difficult for machines to categorize.
  5. Predictive coding uses only binary coding.  Predictive coding tools are not suitable for specialized-attribute coding meant to define the physical characteristics of a document, such as determining whether it is a contract, letter, facsimile, or email.
  6. Metadata is difficult for predictive coding. Information that may be critical in defining responsiveness, such as date ranges or file names, is challenging to integrate into a predictive coding workflow. This information resides in structured database fields that may or may not be indexed by the predictive coding engine.
  7. It is difficult to filter signal noise. Lengthy documents often cover a multitude of topics, which can adversely affect how the predictive coding algorithm classifies them. For example, a 50-page document may contain one short but critical phrase on page 33 that the computer misses because of the relative noise of the other content in the document. (A small illustration of this effect follows this list.)
  8. It is difficult to create a standardized workflow. Because different vendors may offer different predictive coding tools, the workflows for implementation of a project may vary not only from vendor to vendor, but also from project to project. Predictive coding workflows are somewhat complex in their own right, given the seed sets, training sets, statistics, and other variables. For the inexperienced, this can be confusing and difficult to master.
  9. There is still some legal risk. While most courts accept a predictive coding workflow, some legal risks may still exist depending on the procedures employed and the results obtained. This is often the case when no agreement between the parties exists regarding the details of how predictive coding will be deployed. Thus, the means and methodologies of the workflow must be clearly articulated and stipulated in Rule 26(f) Meet and Confer sessions in order to alleviate objections downstream.
  10. It is difficult to see the big picture.  One critical component missing from predictive coding workflows is the ability to review conversations as a whole. Often, issues regarding who knew what when must be determined by looking at an entire conversation or exchange in context. This ability to examine patterns in email exchanges between certain custodians may be critical to a case. Although many vendors also offer email threading review capabilities, these rarely integrate seamlessly into a predictive coding workflow.
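A quick way to see the dilution problem from item 7 is to look at how a standard open-source text representation weights a key phrase when it stands alone versus when it is buried in boilerplate. The snippet below uses scikit-learn’s TfidfVectorizer with invented text purely as an illustration; it is not how any particular vendor’s engine scores documents, but the underlying principle, that a short passage contributes very little to the overall profile of a long document, is the same.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# One key sentence on its own, and the same sentence buried after 200 sentences
# of unrelated boilerplate (all text is invented for illustration).
key_sentence = "Pricing agreement between the parties was signed in March."
short_doc = key_sentence
long_doc = " ".join(["The committee discussed routine scheduling matters."] * 200)
long_doc = long_doc + " " + key_sentence

vectorizer = TfidfVectorizer()
weights = vectorizer.fit_transform([short_doc, long_doc])
column = vectorizer.vocabulary_["pricing"]

print("weight of 'pricing' in the short document:", weights[0, column])
print("weight of 'pricing' in the long document: ", weights[1, column])
# The weight is dramatically smaller in the long document, so a model relying
# on this feature is far less likely to flag that document as responsive.
```

Running the sketch shows the key term carrying only a small fraction of the weight it has in the short document, which is precisely the signal-to-noise problem described in item 7.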
The End Game

Always seek advice from your ESI specialist before deploying predictive coding. Predictive coding is not right for everyone. Do not take predictive coding if you are allergic to it, i.e., if there are other, more suitable options. The bottom line is this: to take full advantage of predictive coding and all the other analytical tools on the market — such as near-duplicate detection, email threading, entity identification, and others — you need an expert advisor who has worked with these tools and understands which ones are most appropriate for your project. A wide variety of workflows can optimize your time and reduce your client’s expenses. Predictive coding might be your miracle drug, but it takes an expert to know.


[1] Dynamo Holdings Ltd. Partnership, et al. v. Commissioner of Internal Revenue, U.S. Tax Court, July 13, 2016, pp. 7-8. Retrieved from https://www.ustaxcourt.gov/InternetOrders/DocumentViewer.aspx?IndexSearchableOrdersID=204756.

[2] Dale, Chris. “Judge Peck and Predictive Coding at the Carmel EDiscovery Retreat.” EDisclosure Information Project, 4 Aug. 2011, chrisdaleoxford.com/2011/08/02/judge-peck-and-predictive-coding-at-the-carmel-ediscovery-retreat.

[3] “Technology Assisted Review & Predictive Coding — a Library.” IT Law Today, Fenwick & West, 9 Oct. 2018, www.itlawtoday.com/tar-predictive-coding-caselaw/.