Overview
If pressed for an answer, few attorneys would say math is one of their interests. Yet increasingly, math is creeping into the world of document review. As a result, it is more important than ever for attorneys and litigation support professionals to gain a basic understanding of how mathematics can, and should, be applied in litigation.
The term sampling has become ubiquitous within the legal industry, yet very few attorneys truly understand how to apply a sampling methodology in a defensible way. While I applaud anyone doing additional quality control (QC) during a review, applying a random 5% sample in all situations won't always lead to the most defensible outcome.
This QuickCounsel explores these sampling methods to help attorneys and litigation support professionals better understand how sampling and validation techniques can be used in their review process.
Sampling methods
In the modern review environment, sampling is often used both as a method for validating accuracy of document review (QC), and for defining populations for training some of the more advanced "Computer Assisted Review" technologies. Using sampling in litigation enables you to lower costs and improve review accuracy. While there are many different methods available for sampling, those most often found in litigation are Simple Random Sampling (SRS) and Stratified Sampling (SS).
The SRS method is, as the name suggests, the simplest method for sampling and the one most often found in litigation. With SRS, documents are chosen at random and each document has an equal probability of being selected. The advantage of SRS is its simplicity, because it does not require you to create groupings within the review set ahead of time. As we will discuss later, SRS also makes it easier to validate results. The main weakness of the SRS method is that, due to the pure randomness of the selection process, it is more susceptible to sampling error, which effectively means the sample population may not accurately represent the larger universe of documents.
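As a minimal sketch of how this might look in practice (the document IDs, sample size, and seed below are illustrative assumptions, not taken from any particular review platform):

```python
import random

def simple_random_sample(doc_ids, sample_size, seed=None):
    """Draw a simple random sample: every document has an equal
    probability of being selected, without replacement."""
    rng = random.Random(seed)   # a fixed seed makes the draw reproducible for audit purposes
    return rng.sample(doc_ids, sample_size)

# e.g. pull a 384-document QC set (95% confidence level, +/-5% interval)
# from a one-million-document review population
qc_set = simple_random_sample(doc_ids=range(1_000_000), sample_size=384, seed=42)
```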
Stratified Sampling (SS) relies on sampling from different sub-groups within the document population. The exact methods for creating these sub-groups or "strata" may vary, but often rely on some form of metadata-based grouping such as custodian, document type, or date. Users of advanced clustering technologies are often able to leverage those technologies to define potential strata using what is known as Bayesian Classification, a system named after the mathematical pioneer Thomas Bayes. Bayesian Classification analyzes documents by identifying the key concepts they contain and comparing documents to one another based on conceptual similarity. One of the benefits of Bayesian Classification is that it does not rely on a preconceived notion of how documents relate to each other, and it can often group documents that, while conceptually similar, may not be similar in form. For example, an email, an audio recording, and a spreadsheet may all relate to each other, yet not contain similar patterns of text or speech.
Irrespective of whether you are using manual or automated methods for creating these strata groups, Stratified Sampling has a major advantage over Simple Random Sampling in that it reduces sampling error and helps ensure the sample is more representative of the variation in the overall corpus of data. As a result, Stratified Sampling can often allow for the identification of patterns within the various sub-groups that would otherwise not be apparent. The drawback to using Stratified Sampling is that because the population is classified into strata prior to sampling, the results for each stratum must often be validated independently.
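A brief sketch of the manual, metadata-based approach described above (the "custodian" field, the proportional allocation rule, and all names below are assumptions for illustration):

```python
import random
from collections import defaultdict

def stratified_sample(documents, strata_key, total_sample_size, seed=None):
    """Group documents into strata by a metadata field (e.g. custodian or
    document type), then draw a simple random sample within each stratum,
    allocating the total sample size proportionally."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for doc in documents:
        strata[doc[strata_key]].append(doc)

    sample = {}
    for name, docs in strata.items():
        # proportional allocation, with at least one document per stratum
        k = max(1, round(total_sample_size * len(docs) / len(documents)))
        sample[name] = rng.sample(docs, min(k, len(docs)))
    return sample

# e.g. documents = [{"id": 1, "custodian": "Smith", ...}, ...]
# per_custodian_qc = stratified_sample(documents, "custodian", total_sample_size=1066)
```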
As a general rule, Simple Random Sampling is often best used when looking at a population where either it is not feasible to create sub-groups, or where the variable you are trying to measure has limited variation. An example of an ideal population for SRS in the review context would be documents that have been culled using keywords, where you are evaluating a binary criterion, such as whether a document is responsive or not. Conversely, Stratified Sampling may be optimal when working with an un-culled population, where you are evaluating documents that may have various potential outcomes, such as when coding for sub-issues in a case.
Sample size
Determining the appropriate sample size involves defining how many documents need to be selected to generate the statistical sample. As a general rule, the larger the sample size the more accurate the result. However, it is usually not practical or advisable to QC every document in a review or to review 100% of a population to train a Computer Assisted Review technology because it would defeat the purpose of using sampling. Generally speaking, you want to select a sample size that is as small as possible, while maintaining the desired confidence interval and confidence level.
The confidence interval, also commonly known as the margin of error, is a plus-or-minus figure that denotes how much the actual value may differ from the observed result. For example, if my result was 30% responsive with a confidence interval of 3, the actual responsive rate across the population may be as low as 27% (30 minus 3) or as high as 33% (30 plus 3).
The confidence level measures how sure you are that your resulting range is accurate. For example, a 95% confidence level combined with the confidence interval above would indicate that you are 95% certain the documents in the review population are between 27% and 33% responsive. When choosing confidence intervals and levels, it is important to note that the wider the confidence interval you select, the higher the confidence level you can achieve, and vice versa.
Determining the best confidence interval
There are three main factors that are generally used to help determine the best confidence interval to use for a given confidence level: sample size, the percentage of documents responsive, and the overall size of the review population. Generally, the larger the sample size the more you can be sure of the results. This ultimately means as your sample population grows, the confidence interval narrows and your confidence level increases.
Accuracy is also impacted by the percentage of a sample that is identified as responsive, which is known as the response distribution. For example, if 5% of the sample was identified as responsive and 95% as non-responsive, the chance of error would be lower than if the split was 40% and 60%, respectively. Unfortunately, with the exception of the after-the-fact recall analysis discussed later, the responsive rate is not known up front. As a result, we are forced either to rely on an assumption based on prior observations or to assume a worst-case scenario of a 50/50 split between responsive and non-responsive documents.
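For reference, this behavior follows from the standard margin-of-error formula for a proportion (general statistics, not specific to any review platform):

$$ e = z\sqrt{\frac{p(1-p)}{n}} $$

The term $p(1-p)$ is largest when $p = 0.5$, so assuming a 50/50 response distribution produces the widest interval, and therefore the most conservative sample size, for a given sample size $n$ and confidence level (expressed here through the z-score $z$).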
Finally, consideration must be given to the size of the overall review population, or of each stratum when using Stratified Sampling. The good news here is that the size of the population doesn't matter very much, as long as the sample is a small percentage of the review population. Population size is only likely to be a factor when you work with a small population of documents, which is important to keep in mind when working with smaller strata or when targeting something such as a single reviewer's work over a short period of time.
In practice, we can solve for the sample size using the following formula:
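(The formula below is the standard formulation, stated here for completeness; it first computes an unadjusted sample size and then applies a finite-population correction, and it reproduces the values in the table that follows.)

$$ n_0 = \frac{z^2\,p(1-p)}{e^2}, \qquad n = \frac{n_0}{1 + \dfrac{n_0 - 1}{N}} $$

where $z$ is the z-score for the chosen confidence level, $p$ is the assumed response distribution, $e$ is the confidence interval expressed as a proportion, and $N$ is the size of the review population.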
In the real world, there is no need for counsel to apply these formulas by hand. Because they are based on a single mathematical principle, it is usually just a matter of looking up the result. The following table assumes a population of one million documents and a worst-case response distribution of 0.5 (50%).
Results Table: required sample size by confidence interval and confidence level

| Confidence Interval | 90% Confidence Level | 95% Confidence Level | 99% Confidence Level |
|---|---|---|---|
| ±1% | 6,719 | 9,513 | 16,317 |
| ±2% | 1,689 | 2,396 | 4,130 |
| ±3% | 751 | 1,066 | 1,840 |
| ±5% | 271 | 384 | 664 |
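For readers who prefer to compute the numbers directly, the short Python sketch below (function and variable names are illustrative, and only the standard library is used) reproduces the table values:

```python
from statistics import NormalDist
import math

def sample_size(confidence_level, confidence_interval, population, p=0.5):
    """Required sample size for a proportion, with a finite-population correction.

    confidence_level     e.g. 0.95
    confidence_interval  margin of error in percentage points, e.g. 3 for +/-3%
    population           total number of documents in the review population
    p                    assumed response distribution (0.5 is the worst case)
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)   # z-score for the confidence level
    e = confidence_interval / 100.0                            # interval as a proportion
    n0 = (z ** 2) * p * (1 - p) / (e ** 2)                     # unadjusted sample size
    n = n0 / (1 + (n0 - 1) / population)                       # finite-population correction
    return math.ceil(n)

# Reproduces the 95% column of the table above: 9513, 2396, 1066, 384
for interval in (1, 2, 3, 5):
    print(interval, sample_size(0.95, interval, 1_000_000))
```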
Validation
Sampling used in quality control (QC) is typically handled by selecting a percentage of the documents that have been reviewed, or that come out of an automated review algorithm, and having them reviewed again by another reviewer. If the QC reviewers confirm the original coding, that may be sufficient to validate the result. When there are discrepancies, however, simply changing the coding on the sampled subset does not validate the larger result. As a simple example, suppose 1,000 documents were coded in the initial review. From that population, a 5% sample (50 documents) is taken and re-reviewed, and the QC reviewer identifies 5 documents (10% of the QC set) that were coded incorrectly. Let's make the common assumption that the QC reviewer is the final arbiter of what is correct. In many cases, this is where the validation ends and the problems begin. What this analysis shows is that 10% of the QC sample was coded incorrectly, a possible indication that there is a population of documents in the larger universe that are also incorrectly coded and have not been identified by this QC process.
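To make the problem concrete, here is a rough sketch of what the 10% error rate observed in that example implies for the full 1,000-document population. It uses a simple normal-approximation confidence interval, which is a simplification (a sample of 50 with 5 errors is small enough that an exact or Wilson interval would be more appropriate), and the function name and rounding choices are assumptions for illustration:

```python
from statistics import NormalDist
import math

def projected_miscoding(population, sample_size, errors_found, confidence_level=0.95):
    """Project a QC sample's observed error rate onto the full population,
    returning a low/high range of potentially miscoded documents."""
    p_hat = errors_found / sample_size                        # observed error rate
    z = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)  # z-score for the confidence level
    margin = z * math.sqrt(p_hat * (1 - p_hat) / sample_size) # normal-approximation margin
    low, high = max(0.0, p_hat - margin), min(1.0, p_hat + margin)
    return math.floor(low * population), math.ceil(high * population)

# The example from the text: 1,000 reviewed documents, a 50-document QC sample, 5 errors.
print(projected_miscoding(population=1_000, sample_size=50, errors_found=5))
# -> (16, 184): somewhere between roughly 16 and 184 documents may be miscoded,
#    which is exactly the exposure a simple re-coding of the 5 documents leaves unaddressed.
```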
Precision and recall analysis
At its core, sampling is used to estimate characteristics of an entire population by analyzing a subset. In other words, we can say with some level of confidence that the sample is representative of the larger population. What sampling does not do, however, is measure accuracy. To evaluate the output of attorney review, the use of search terms, or any other technique used in discovery to identify relevant material, we must measure precision and recall. For simplicity, the discussion is presented from the perspective of an analysis of responsive documents, but the same method may be applied to any desired coding or classification. Precision and recall analysis is based on being able to measure four possible outcomes. Sometimes known as a confusion matrix, these four possible outcomes are as follows:
True Positive = TP = Correct result
False Positive = FP = Unexpected result
False Negative = FN = Missing result
True Negative = TN = Correct absence of result
Precision, as described in Advanced Data Mining Techniques by David L. Olson and Dursun Delen, effectively measures, in the review context, the percentage of documents coded as responsive that actually are responsive.
Precision = True Positive/(True Positive + False Positive)
This formula in effect measures the accuracy of the set coded as responsive. In the review context, precision is one of the most straightforward measures to capture, since it focuses only on the documents identified as belonging to the set (in this case, based on responsiveness). Recall, conversely, measures the documents identified as responsive as a percentage of all documents in the review set that are actually responsive; in other words, how many of the responsive documents have been found.
Recall = True Positive/(True Positive + False Negative)
Because precision and recall typically have an inverse relationship, the overall effectiveness of a review is often measured using a combined measure known as the F1-score. As explained in Information Retrieval by C. J. van Rijsbergen, the F1-score provides a balanced blending of the two measures, with precision and recall each given equal weight, as follows:
F1 = 2 X ((Precision X Recall)/(Precision + Recall))
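A short sketch tying the three measures together (the confusion-matrix counts below are hypothetical):

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and the F1-score from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Hypothetical review: 80 documents correctly coded responsive (TP), 20 coded
# responsive in error (FP), and 40 responsive documents missed (FN).
p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=40)
print(f"precision={p:.2f}  recall={r:.2f}  F1={f1:.2f}")   # precision=0.80  recall=0.67  F1=0.73
```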
Conclusion
This QuickCounsel was created to give attorneys and litigation support professionals a greater understanding of how sampling and validation techniques can be used in their review process. Unfortunately, understanding the math is only half of the puzzle. We must still grapple with what the result of the analysis tells us. With no established standard for what a satisfactory level of precision or recall actually is, it is important to view these methods through the prism of reasonableness that we already apply to our reviews. These sampling and validation methodologies can also come in handy earlier in the review process, such as during negotiations with opposing parties as part of a Rule 26(f) "meet-and-confer" conference, where they can be used as a method of certification. For example, as part of the production process, you could sample to an agreed-upon confidence interval and level and report on the outcome of that sampled set.
Additional Resources
- The Grossman-Cormack Glossary of Technology-Assisted Review, 2013 Fed. Cts. L. Rev. 7 (2013)
- Meaning Based Coding in Autonomy eDiscovery (HP Autonomy 2013)
- Olson, David L. and Delen, Dursun, Advanced Data Mining Techniques (2008)
- van Rijsbergen, C. J., Information Retrieval (1979)