
Reduce, Reuse, Recycle: Data Science for Legal Demystified


Watch it on demand by clicking here.

It may not be immediately obvious how data science as a discipline relates to the complex but typically non-scientific world of ediscovery and legal data. Data science, in its simplest form, is the study of data across a wide array of disciplines and industries. The interesting part is applying it: practitioners have learned to take the concepts of data science and implement machine learning to record, store, and analyze data, and ultimately to extract information that yields knowledge and improves processes. So why can’t data science be just as useful for extracting meaning and improving efficiency when it comes to the ever-growing mountains of legal data? As the legal and ediscovery teams tasked with managing the complexities of legal data are increasingly coming to realize, data science may be the key.


In a recent webinar, I moderated an insightful discussion on data science and how it’s being used to harness legal data. I spoke alongside panelists CJ Mahoney, Senior Attorney at Cleary Gottlieb and pioneer of data science in ediscovery, and Lighthouse’s Karl Sobylak, Senior Product Manager and big data analytics guru. We discussed the evolution of data science and how it applies to legal data, the main components and considerations behind data science, the importance of leveraging the right technology, and the common hurdles to overcome before data science is widely adopted in ediscovery. We also talked about the specific benefits experienced by those who have seen firsthand how data science and analytics can be applied to legal data to substantially reduce time and cost on future matters. Outlined below are the key components, top lessons learned, and common objections when it comes to using data science for legal data:

Key Components

  • Subject-matter experts – So where do you start? A good place to begin is with a data science team, ideally composed of data scientists, engineers, product managers, and software developers. You’ll also need an expert at your firm or company who really understands the process and can communicate effectively with stakeholders and opposing counsel, particularly about the methods used to validate the process and the results.
  • Technology – Once you’ve assembled a team focused on building out a pipeline to manipulate large amounts of data, it’s just as important to choose technology that is legally defensible and can store, analyze, and manipulate the data. Bringing together your subject-matter experts and the technology is the key to running a successful machine learning project as a whole.


Lessons Learned

  • There’s no right size – Although a big data set might seem better suited for training purposes, as there’s a larger amount of previous work product for machine learning to analyze and apply to new data, it’s not all or nothing with data science. In fact, even a small data set can provide valuable information where learnings from old data can be applied to new data and incorporated into the system. See more on the concept of data reuse in my blog here.
  • Be realistic with expectations and goals – For those who have already jumped in and applied data science to their matters, one of the first discoveries is just how much work is involved. Data science isn’t magic; it’s a process that takes time and substantial effort (though much less than a traditional manual approach). Think of data science and analytics as a way to reduce your ediscovery burden, not as a magic button that provides a quick end result.
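
To make the idea of data reuse a little more concrete, here is a minimal sketch of how coding decisions from a prior matter might be applied to documents in a new one, even when the prior data set is small. It uses a tiny naive Bayes text classifier written in plain Python. The function names, the labels ("privileged" and "not_privileged"), and the sample documents are all hypothetical and chosen only for illustration; this is not the tooling discussed in the webinar, and a real project would rely on a validated, defensible platform.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (deliberately simplistic)."""
    return text.lower().split()


def train(coded_docs):
    """Build word and document counts from (text, label) pairs coded on a prior matter."""
    word_counts = {}        # label -> Counter of word frequencies
    doc_counts = Counter()  # label -> number of documents
    vocab = set()
    for text, label in coded_docs:
        doc_counts[label] += 1
        counts = word_counts.setdefault(label, Counter())
        for word in tokenize(text):
            counts[word] += 1
            vocab.add(word)
    return {"word_counts": word_counts, "doc_counts": doc_counts, "vocab": vocab}


def predict(model, text):
    """Return the most likely label for a new document (Laplace-smoothed naive Bayes)."""
    total_docs = sum(model["doc_counts"].values())
    vocab_size = len(model["vocab"])
    best_label, best_score = None, -math.inf
    for label, n_docs in model["doc_counts"].items():
        counts = model["word_counts"][label]
        total_words = sum(counts.values())
        score = math.log(n_docs / total_docs)
        for word in tokenize(text):
            if word in model["vocab"]:  # ignore words never seen in the prior matter
                score += math.log((counts[word] + 1) / (total_words + vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical coded documents from an earlier matter.
prior_matter = [
    ("attorney client privileged legal advice", "privileged"),
    ("legal advice from outside counsel", "privileged"),
    ("lunch menu for friday", "not_privileged"),
    ("quarterly sales report attached", "not_privileged"),
]

model = train(prior_matter)

# Apply what was learned to documents from a new matter.
print(predict(model, "please see legal advice from counsel"))
print(predict(model, "friday lunch menu"))
```

Even with only four prior documents, the classifier has something to say about new ones, which is the point of the "no right size" lesson. The predictions are only a starting point, though, and still need to be validated by reviewers.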


Common Objections

  • Liability of keeping data around indefinitely – Many organizations want to manage costs and protect themselves by implementing aggressive data retention policies. They’re concerned about the potential liability of keeping data longer than otherwise needed just so it can be used for data classification on future matters. But with data anonymization, this is less of a concern, and once a model has been trained on the data, you don’t need to keep all the old data around in perpetuity.
  • Quality concerns on unknown historical data – Another common objection is the fear of using and leveraging historical data that was coded by someone else. There can be concern about the ability to verify the quality of that work product, given the lack of insight into how it was coded. But that’s what validation is for. It’s critical to build in time for testing, adjustments, and further iteration. In addition, if you have enough coded data, the effect of those errors tends to be corrected by machine learning over time.
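
As a rough illustration of what validating historical coding can look like, the sketch below draws a random sample of previously coded documents, compares the historical label with a fresh reviewer's label, and reports agreement, precision, and recall. The function names, the "responsive" label, and the sample data are assumptions made for this example. The sample size and acceptance thresholds for a real matter would need to be set by your own experts.

```python
import random


def draw_sample(doc_ids, k, seed=0):
    """Pick k document IDs at random for re-review (seeded so the draw is repeatable)."""
    rng = random.Random(seed)
    return rng.sample(doc_ids, k)


def validate_sample(pairs, positive="responsive"):
    """Compare (historical_label, reviewer_label) pairs, treating the reviewer as ground truth."""
    tp = sum(1 for hist, rev in pairs if hist == positive and rev == positive)
    fp = sum(1 for hist, rev in pairs if hist == positive and rev != positive)
    fn = sum(1 for hist, rev in pairs if hist != positive and rev == positive)
    agreement = sum(1 for hist, rev in pairs if hist == rev) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else None
    recall = tp / (tp + fn) if (tp + fn) else None
    return {"agreement": agreement, "precision": precision, "recall": recall}


# Hypothetical re-review results: (historical coding, fresh reviewer coding).
reviewed = [
    ("responsive", "responsive"),
    ("responsive", "responsive"),
    ("responsive", "not_responsive"),
    ("not_responsive", "not_responsive"),
    ("not_responsive", "responsive"),
    ("not_responsive", "not_responsive"),
    ("responsive", "responsive"),
    ("not_responsive", "not_responsive"),
]

sample_ids = draw_sample(list(range(1000)), 8)
results = validate_sample(reviewed)
print(sample_ids)
print(results)
```

If the agreement or recall on the sample is lower than your team is comfortable with, that is the signal to adjust and iterate before relying on the historical work product.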

Ultimately, when a team of subject-matter experts and data scientists is assembled and the right technology is adopted, embracing data science and analytics brings measurable benefits in the form of major time and cost reductions. As data science adapts and creates new ways of using machine learning, we’ve come to understand that applying data science to all types and sizes of legal data is not only possible but realistic, and that validation of results is the key to success. To share other data science insights or your experiences with data reuse in ediscovery, please feel free to reach out to me at


About the Author

Executive Director, Global Advisory Services | Erika is an industry expert with over 13 years of experience leading legal services, operations, and consulting projects for law firms and corporations. She has a proven track record in building and growing teams supporting ediscovery, investigation, and compliance functions and leads the team focusing on client technology and business workflow within the Enterprise Technology division of Lighthouse's Global Advisory Services business. Her specialties within the broader Advisory Services team include responsive review expertise with respect to building efficient review workflows; leveraging analytics and automation tools; setting up quality control protocols and procedures; defining production criteria and requirements; ensuring complete, accurate, and timely productions; expert search, development, testing, and validation of linguistic models as well as the execution of those models across the larger data population; designing and implementing strategies for data organization, retrieval, and processing, including workflow; and addressing business challenges with data remediation.

Erika Namnath