Inside the (Chat) Bubble: How Keywords Can Be Used More Effectively in Text Message Discovery

Let’s begin with two seemingly uncontroversial statements.

Statement #1: Lawyers are comfortable using keywords to search discovered content, much more so than advanced methods available within the eDiscovery technology landscape such as machine learning algorithms, TAR 1.0/2.0/3.0, and emerging AI tools.

Statement #2: Using keywords to effectively search text messages is trickier than searching email messages and standard office file types.

What happens when these two statements intersect? Since rulings, case law, and industry standards are presently scarce, what are the unique requirements that attorneys and eDiscovery practitioners should understand and anticipate when using keywords to search text messages?

Although this article focuses on text messages created and received on a mobile phone, “text messages” can be understood as a general proxy for other applications that use chat-style communication with ongoing threaded conversations. Common business applications with chat functionality include Microsoft Teams, Slack, WhatsApp, WeChat, and Signal.

But first, a quick review of an old concept

When discovery was supported by the first generation of electronic database applications such as Summation and Concordance, a pivotal concept emerged regarding what defined a document as a database record and its relationship to related documents. This determination had a profound impact on the ability of case teams to efficiently search the database for relevant information.

When paper documents were scanned and converted to electronic versions such as PDF or TIF formats, the unitization of the scanned pages was critically important. Unitization defines the beginning and the end of the document, which in turn becomes a unique record and a fundamental unit of organization.

There may also have been higher-level groupings of multiple documents that matched the physical or logical organization of the source hardcopies. For instance, a cover letter may reference other documents attached to it (via a paper clip or staple), which together form a higher-level organizational structure, i.e., an attachment range or document family. Further groupings might include folders, boxes, etc., depending on the level of organization that existed with the hardcopy versions of the documents.

When business documents began to exist only in native electronic formats, primarily email, this organizational hierarchy was already predefined and easy to understand. Email messages had attachments, and higher levels of organization could be achieved by creating folder names within a mailbox. The unitization was already established and followed a consistent standard defined primarily by the email application.

That’s all well and good, but what does that have to do with the organization of text messages and its impact on using keywords to search for relevant content?

Plenty, it turns out.

A document is the text message itself (aka the bubble that appears on my phone). That’s easy, right?

Maybe, but maybe not. Unlike email, which naturally adopted the metaphor of paper cover letters referencing attached documents as a standard organizing principle, the text message format is more like a free-form verbal conversation that is not necessarily characterized by a single message, a set of messages or attached files, a defined topic, or even a specific time period.

An individual text message bubble, identified by a keyword search and read in isolation from the individual messages that precede or follow, is likely to be devoid of context that may determine relevancy during a review. By its very nature, assessing the relevance of text messages requires reading the natural flow of the threaded conversation among participants to understand what is being communicated.

What if we define the document unit as the entire threaded conversation between specific participants? That might work, right?

The logic leads naturally to this conclusion, but there are practical considerations that make this definition problematic from an eDiscovery search and review perspective. Consider the following:

1) Threaded conversations of individual text messages, unbound by a predetermined time period, could easily span many weeks, years, or even decades in duration — potentially comprising thousands or even tens of thousands of individual text messages.

2) Unlike an email message, there is no unifying subject line that ties together individual text messages to provide cues on the topic being discussed. Threaded conversations can be numerous, nonlinear, contain intersections and conflations, and can start, stop, and restart unpredictably over the course of a conversation.

From an eDiscovery search and review perspective, threaded conversations are inherently difficult for a third party to assess. It is often impractical to read an entire threaded conversation identified by a keyword search to determine its relevance.

Ok, this feels like the Goldilocks Dilemma where an individual text message bubble is too short, and the threaded conversation is too long. Can we define the document unit as a 24-hour calendar day and call it a day?

Yes, we can, and this is primarily where current practice has settled. But this approach has its own set of challenges:

1) Because the calendar day is an arbitrary construct (as threaded conversations commonly span across multiple days), it may be necessary to assess many calendar days of messages to understand the meaning of what is being communicated.

2) When a keyword search identifies a potentially relevant threaded conversation, it is often difficult to anticipate the range of time or the number of individual messages that may be necessary to ensure that context can be understood. Is a calendar week sufficient? Are 50 individual messages before and after the message identified by a search hit sufficient?

The answer is the age-old lawyerism: It depends. Since the topic(s) of the conversations and the communication styles of the participants vary from situation to situation, it is difficult to create any type of standardized practice.
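The two options discussed above, the calendar-day unit and a fixed number of messages around a search hit, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Message` record, the function names, and the sample conversation are hypothetical, timestamps are treated as already converted to a single time zone, and the substring match stands in for whatever search engine a real platform would use.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Message:
    # Hypothetical record layout; real exports differ by collection tool.
    thread_id: str
    sender: str
    sent_at: datetime  # assumed to be in one consistent time zone
    text: str


def unitize_by_day(messages):
    """Group messages into one unit per (thread, calendar day)."""
    units = defaultdict(list)
    for msg in sorted(messages, key=lambda m: m.sent_at):
        units[(msg.thread_id, msg.sent_at.date())].append(msg)
    return dict(units)


def context_window(thread_messages, keyword, before=2, after=2):
    """For each keyword hit in one thread, return the hit plus its neighbors."""
    ordered = sorted(thread_messages, key=lambda m: m.sent_at)
    windows = []
    for i, msg in enumerate(ordered):
        if keyword.lower() in msg.text.lower():  # simple substring match
            windows.append(ordered[max(0, i - before): i + after + 1])
    return windows


# Made-up conversation that crosses midnight.
thread = [
    Message("t1", "Ana", datetime(2023, 5, 1, 23, 50), "Did you see the draft?"),
    Message("t1", "Ben", datetime(2023, 5, 1, 23, 58), "Yes, the pricing looks off"),
    Message("t1", "Ana", datetime(2023, 5, 2, 0, 3), "Delete it before the audit"),
    Message("t1", "Ben", datetime(2023, 5, 2, 0, 5), "ok"),
]

units = unitize_by_day(thread)
windows = context_window(thread, "audit", before=2, after=1)

for (thread_id, day), msgs in units.items():
    print(thread_id, day, len(msgs), "messages")
for window in windows:
    print([m.text for m in window])
```

In this sample the day-based unit splits one exchange into two documents at midnight, while the message-count window keeps the hit together with the two messages that give it meaning. Neither choice is right in general, which is the "it depends" point made above.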

What does all this unitization business have to do with using keyword search methods to find content of interest?

Understanding how the unitization of text messages is managed in a database has a material impact on how efficiently content can be retrieved and how accurately keyword search results identify relevant content. Some of the challenges include:

1) Defining the context. As mentioned previously, there are many possibilities to consider regarding the unitization of conversation threads to optimize keyword search and the corresponding legal review. Depending on the situation, the range may be a 24-hour span, weeks of conversation, or simply a predetermined set of messages before and after the keyword is found.

2) Proximity searches. How unitization is defined will affect the accuracy of proximity searches (e.g., “Jordan” near20 “champion”). If the technology platform defines text messages as individual documents, the proximity search will be confined to those individual “bubbles,” and will not span the entire conversation.

3) AND / OR operators. Similarly, creating compound statements and nested search syntax with AND / OR operators may also be limited to the text content found in an individual message.

The issues and approaches described above are likely to be heavily influenced by the technology platform and how the underlying search engine is configured to define the unitization of content. It is imperative that practitioners completely understand the search technology’s capabilities and perform documented validation steps to assure the quality of the results.
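A small Python sketch can make the effect of unitization on proximity and AND searches concrete. It is a simplified stand-in, not the behavior of any particular review platform: the `near` function, its word-counting rule, and the sample messages are all assumptions made for illustration, and real engines differ in how they tokenize text and count distance.

```python
import re


def tokens(text):
    """Lowercase word tokens; a deliberately simple tokenizer."""
    return re.findall(r"[a-z0-9']+", text.lower())


def near(text, term_a, term_b, distance):
    """True if term_a and term_b occur within `distance` words of each other."""
    words = tokens(text)
    pos_a = [i for i, w in enumerate(words) if w == term_a.lower()]
    pos_b = [i for i, w in enumerate(words) if w == term_b.lower()]
    return any(abs(a - b) <= distance for a in pos_a for b in pos_b)


def has_all(text, *terms):
    """True if every term appears in the text (a basic AND search)."""
    words = set(tokens(text))
    return all(t.lower() in words for t in terms)


# Made-up messages from a single conversation on a single day.
bubbles = [
    "Who do you think was the best ever?",
    "Jordan, no question",
    "Six rings. A real champion",
]

# Unit = individual bubble: each message is searched in isolation.
bubble_near_hits = [b for b in bubbles if near(b, "Jordan", "champion", 20)]
bubble_and_hits = [b for b in bubbles if has_all(b, "Jordan", "champion")]

# Unit = calendar day: the same messages are searched as one document.
day_unit = "\n".join(bubbles)
day_near_hit = near(day_unit, "Jordan", "champion", 20)
day_and_hit = has_all(day_unit, "Jordan", "champion")

print("bubble-level near hits:", bubble_near_hits)
print("bubble-level AND hits:", bubble_and_hits)
print("day-level near hit:", day_near_hit)
print("day-level AND hit:", day_and_hit)
```

The same three messages produce no hits when each bubble is its own document and a hit when they are indexed as one day-level document. Running a comparison of this kind against a known sample is one simple way to document how a given platform's unitization affects search results.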

Are there any other challenges to address when searching text message content?

Many — and they are varied. For example:

1) Use of informal language. Text messages tend to be much more informal in both grammar and spelling, and often include slang, colloquialisms, acronyms, and emojis. They also include embedded graphics and hyperlinks that are not searchable, and the style of communication is constantly evolving.

2) Difficulty identifying conversation participants. The identity of participants in a text message conversation is typically determined from the associated mobile phone number and, in some limited instances, secondary identifiers such as an Apple ID. Connecting these to a participant’s name can prove challenging because the participant may be stored in contacts under a nickname, be misspelled, or not be identified at all. Emerging technologies have yet to develop a failsafe solution to address these issues.
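As a rough illustration of the participant problem, the Python sketch below normalizes differently formatted phone numbers to one form before looking them up in a name directory, and flags anything it cannot match. Everything in it is an assumption for the example: the function names, the directory, the sample identifiers, and the rule that a bare 10-digit number belongs to country code 1. Production work would more likely rely on a dedicated phone-number parsing library and on information gathered from custodians.

```python
import re


def normalize_identifier(raw, default_country_code="1"):
    """Reduce a participant identifier to a comparable form.

    Email-style identifiers (for example, an Apple ID) are lowercased.
    Phone numbers are reduced to digits; a bare 10-digit number is ASSUMED
    to belong to `default_country_code`, which will be wrong for some data.
    """
    raw = raw.strip()
    if "@" in raw:
        return raw.lower()
    digits = re.sub(r"\D", "", raw)
    if len(digits) == 10:
        digits = default_country_code + digits
    return "+" + digits


def resolve_participants(identifiers, directory):
    """Map each raw identifier to a name, or flag it for follow-up."""
    return {
        raw: directory.get(normalize_identifier(raw), "UNRESOLVED")
        for raw in identifiers
    }


# Hypothetical directory, e.g. built from HR records or custodian interviews.
directory = {
    "+12125550147": "Dana Reyes",
    "d.reyes@example.com": "Dana Reyes",
}

# The same person can appear under several formats in collected data.
seen = [
    "(212) 555-0147",
    "212.555.0147",
    "+1 212 555 0147",
    "D.Reyes@example.com",
    "+44 20 7946 0958",
]

resolved = resolve_participants(seen, directory)
for raw, name in resolved.items():
    print(f"{raw!r} -> {name}")
```

The unresolved entry is the useful part of the output: it gives the case team a concrete list of identifiers to raise in custodian interviews rather than leaving unknown participants buried in the data.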

Popping the Bubbles: A Comprehensive Approach to Identify Relevant Content

There are several opportunities legal teams can leverage to mitigate the risks involved with utilizing keywords to search and identify relevant content in threaded conversations. They involve limiting the universe of text message content by executing data governance best practices and strategically harnessing information gathered from custodian interviews. Here are some best practices to develop powerful keyword strategies:

• Monitor data retention protocols. Leverage enterprise mobile device management technology as much as possible to automate the disposition of content that has aged past its retention period and is not subject to a legal hold.

• Ask questions to pinpoint relevant content. Conduct timely and comprehensive custodian and data steward interviews to understand the relevant time period, conversation participants, topics, and the texting style of the custodian. Use the information to prioritize text message content based upon its relevance to the matter and the anticipated burden of collection, processing, review, and production.

• Utilize collection technology that specializes in text formats. Where possible and appropriate from a risk perspective, utilize collection technology that can limit text message content to a defined date range, participant list, and simple keywords.

• Develop unitization methodology. Thoughtfully consider the appropriate unitization for text messages for the case at hand, with an eye toward executing efficient searching and review. Remember that rendering text messages to a flat image or PDF file can limit your ability to expand or contract threaded conversations. If uncertain, grouping threaded conversations in 24-hour increments is a reasonable place to start, but the ability to navigate easily from the individual text message to the calendar day or to the entire conversation provides the most flexibility.

• Update ESI protocol. Amend your standard ESI protocol language and associated case strategy regarding text message content. Develop specific language for utilizing keywords to identify relevant content. Remember that agreed-upon search protocols for traditional data sources may not work for text message content, particularly regarding proximity limiters and elaborate compound syntax construction.

• Engage a search expert. It is important to utilize experts who are highly knowledgeable about the capabilities of the search technology, required syntax, and nuanced issues related to the configuration of its underlying indexing process. They should also advise on developing a documented approach to executing searches that is backed up by an audit trail of reporting and statistical validation.

As time passes and a new generation of text-savvy workers emerges, it’s not hard to imagine a business world where the use of email significantly wanes and gives way to mobile device apps using text- and chat-style formats. Be prepared to litigate in this new text-bubble world.