In the email debacle that engulfed the presidential election this past week, the question arose whether the government could possibly review 650,000 emails in eight days. With today’s technology, not only is it possible, but those who work in the litigation/practice support business know full well how it might be done. What surprises me, I guess, is that more people don’t realize it’s possible.
My friend Eric Mandel recently wrote on LinkedIn about what could be learned in just 72 hours about the new emails. I commented that I didn’t understand how we in the industry could grasp this while the government could not. Well, apologies to the government: apparently they were doing (had to be doing) what we suspected all along, using technology to churn through the emails to determine whether there was anything of substance there.
We Have the Technology
For the uninitiated: using technology available today (technology, by the way, that has been available for a long time), the government could not only parse through the email on that laptop in no time at all, it could also gain some real insight into the substance of the documents. To conclude otherwise would be to deny science (and that’s not something politicians do, is it?).
How did they do this? As Eric pointed out, the initial question is whether this new tranche of emails contains anything new. One might start by doing some basic filtering and culling: eliminate junk email, group the messages by domain name, and filter or sort by date, author, or recipient. And of course set aside duplicates and any messages that have already been reviewed. I mean, deduplication is something we in the industry do every day on millions and millions of electronic documents. Applying a file hash algorithm to each message file produces a hash value, the digital equivalent of the file’s fingerprint, and comparing those hash values reveals any duplicate files. Heck, near-deduplication technology could even group very similar documents together and highlight the slight differences between them. Deduplication tools are particularly useful on email message files because these files typically contain a unique message ID that helps identify related messages.
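For the curious, here is roughly what hash-based deduplication looks like in code. This is a minimal sketch in Python, not what any particular review platform actually runs: the function names `fingerprint` and `dedupe` are my own, SHA-256 is just one common choice of hash, and treating a shared Message-ID as a duplicate is an assumption I am making for illustration.

```python
import hashlib
from email import message_from_bytes


def fingerprint(raw: bytes) -> str:
    """Return the SHA-256 hex digest of a raw message file."""
    return hashlib.sha256(raw).hexdigest()


def dedupe(raw_messages, already_reviewed=()):
    """Keep the first copy of each message and skip anything already reviewed.

    Two messages count as duplicates if their bytes hash identically or
    if they share a Message-ID header (an assumption for this sketch).
    `already_reviewed` may hold hash values and/or Message-IDs.
    """
    seen = set(already_reviewed)
    unique = []
    for raw in raw_messages:
        keys = {fingerprint(raw)}
        message_id = message_from_bytes(raw)["Message-ID"]
        if message_id:
            keys.add(str(message_id).strip())
        if keys & seen:
            continue  # exact copy, same Message-ID, or previously reviewed
        seen |= keys
        unique.append(raw)
    return unique
```

Feed it the raw bytes of each message file, plus the hashes or message IDs of everything reviewed the first time around, and what comes back is only the material nobody has looked at yet.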
The truth is that it’s fairly easy with the right software to get through 650,000 emails. Processing software extracts the header, message body, and fields of metadata (the To, From, Date, etc.) associated with each email message, and within hours all that information can be loaded to a fully searchable database or document review platform. From there it takes minutes to search, sort, and filter to narrow the files down to those that are relevant.
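To make that concrete, here is a small sketch of the extract-load-search idea using nothing but Python’s standard library. It is a toy, not a review platform: the `emails` table layout and the function names `extract_fields`, `load`, and `search` are hypothetical, it only reads the body of simple single-part messages, and it records just the first From and To address.

```python
import sqlite3
from email import message_from_bytes
from email.utils import parseaddr, parsedate_to_datetime


def extract_fields(raw: bytes) -> dict:
    """Pull the basic metadata fields and body text out of one raw message."""
    msg = message_from_bytes(raw)
    try:
        sent = parsedate_to_datetime(msg["Date"]).isoformat()
    except (TypeError, ValueError):
        sent = None  # missing or unparseable Date header
    # Multipart messages and attachments are skipped in this sketch.
    payload = b"" if msg.is_multipart() else (msg.get_payload(decode=True) or b"")
    return {
        "sender": parseaddr(str(msg.get("From", "")))[1].lower(),
        "recipient": parseaddr(str(msg.get("To", "")))[1].lower(),
        "sent": sent,
        "subject": str(msg.get("Subject", "")),
        "body": payload.decode("utf-8", errors="replace"),
    }


def load(raw_messages, db_path=":memory:"):
    """Load the extracted fields into a SQLite table and return the connection."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS emails "
        "(sender TEXT, recipient TEXT, sent TEXT, subject TEXT, body TEXT)"
    )
    conn.executemany(
        "INSERT INTO emails VALUES (:sender, :recipient, :sent, :subject, :body)",
        [extract_fields(raw) for raw in raw_messages],
    )
    conn.commit()
    return conn


def search(conn, domain: str, term: str):
    """Find messages sent from a given domain whose body mentions a term."""
    cur = conn.execute(
        "SELECT sender, sent, subject FROM emails "
        "WHERE sender LIKE ? AND body LIKE ?",
        (f"%@{domain}", f"%{term}%"),
    )
    return cur.fetchall()
```

Once the fields are in a table like this, filtering by sender domain, date, or keyword is a single query, which is why the narrowing step takes minutes rather than days.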
Not Without Limitations
Obviously, this is a high-level overview of the process, and to be sure, the volume of emails dictates how long it takes to process the files, but even 650,000 emails would be considered a small- to medium-sized project that could be handled in just a few days. Law firms, corporate legal departments and service providers in our industry do this every single day of the week. It should not come as a surprise that the government can do it too.
My guess would be that the laptop in this case contained a high volume of duplicate emails and messages that the government had already seen. So, what was reported as 650,000 emails was probably quickly reduced to a more manageable number. From there, the documents could easily be reviewed in a linear manner, one by one, or additional machine learning, predictive analytics, or conceptual search tools could have been used to reduce the volume even further or to focus on the concepts most relevant to the inquiry.
But the short answer is, yes, it is possible to review 650,000 emails in eight days.