Minimum Requirements for the Responsible Deployment of Criminal Justice Risk Assessment Tools

Accuracy, Validity, and Bias

What is Accuracy?

Accuracy represents the model’s performance compared to an accepted baseline or predefined correct answer based on the dataset available. Quantitatively, accuracy is usually defined as the fraction of correct answers the model produces among all the answers it gives, so a model that answers correctly in 4 out of 5 cases would have an accuracy of 80%. Notably, models that predict rare phenomena (like violent criminality) can be highly accurate without being useful for their prediction tasks. For example, if only 1% of individuals will commit a violent crime, a model that predicts that no one will commit a violent crime will have 99% accuracy even though it does not correctly identify any of the cases where someone actually commits a violent crime. For this reason and others, the evaluation of machine learning models is a complicated and subtle topic that remains the subject of active research. In particular, note that inaccuracy can and should be subdivided into “Type I” errors (false positives) and “Type II” errors (false negatives), one of which may be more acceptable than the other depending on the context. Most commonly, some of the data used to create the model will be reserved for testing and model tuning. These reserved data provide fresh assessments that help toolmakers avoid overfitting during the process of experimentation. (Overfitting is a statistical problem analogous to learning the answers to all the questions on an exam by heart without having actually understood the principles that made them correct; a model that has overfitted has limited ability to generalize to new data, and thus limited application to the complex and varied real world.)
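
To make the arithmetic above concrete, the following sketch works through the 1% base-rate example with illustrative, synthetic numbers (a hypothetical population of 10,000 people, not data from any real tool), showing how a model that predicts “no” for everyone achieves 99% accuracy while committing a Type II error in every actual case.

```python
# Illustrative sketch with synthetic numbers: a population of 10,000 people,
# 1% of whom go on to commit a violent crime, and a trivial "model" that
# predicts "no violent crime" for every individual.
population = 10_000
base_rate = 0.01

actual_positives = int(population * base_rate)    # 100 people who do commit a violent crime
actual_negatives = population - actual_positives  # 9,900 people who do not

# The trivial model never predicts "yes", so:
true_positives = 0                   # it identifies none of the actual cases
false_positives = 0                  # Type I errors: none, since it never says "yes"
true_negatives = actual_negatives    # every "no" for a non-offender counts as correct
false_negatives = actual_positives   # Type II errors: every actual case is missed

accuracy = (true_positives + true_negatives) / population
print(f"Accuracy: {accuracy:.0%}")                              # 99%
print(f"Type I errors (false positives): {false_positives}")    # 0
print(f"Type II errors (false negatives): {false_negatives}")   # 100
```

Despite its 99% accuracy, this “model” finds none of the cases it exists to find, which is why base rates and the balance between error types matter more than a headline accuracy figure.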

Measuring accuracy involves assessing how well the model predicts outcomes in the test data. To say that a model predicts inaccurately is to say that it gives the wrong answer according to the data, either in a particular case or across many cases.

Since accuracy is focused narrowly on how the tool performs on data reserved from the original data set, it does not address issues that might undermine the reasonableness of the dataset itself (discussed in the section on validity). Indeed, because accuracy is calculated with respect to an accepted baseline of correctness, accuracy fails to account for whether the data used to test or validate the model are uncertain or contested. Such issues are generally taken into account under an analysis of validity. Although accuracy is often the focus of toolmakers when evaluating the performance of their models, validity and bias are often the more relevant concerns in the context of using such tools in the criminal justice system.

What is Validity?

A narrow focus on accuracy can blind decision-makers to important real-world considerations related to the use of prediction tools. With any statistical model, and especially one used in as critical a context as criminal justice risk assessments, it is important to establish the model’s validity, or fidelity to the real world. That is, if risk assessments purport to measure how likely an individual is to fail to appear or to be the subject of a future arrest, then the scores produced should in fact reflect the relevant likelihoods. Unlike accuracy, validity takes into consideration the broader context around how the data was collected and what kind of inference is being drawn. A tool might not be valid because the data used to develop it does not properly reflect what is happening in the real world (due to measurement error, sampling error, improper proxy variables, failure to calibrate probabilities, or other issues). (Calibration is a property of models such that, among the group for which they predict a 50% risk, 50% of cases in fact recidivate. Note that this says nothing about the accuracy of the prediction, because a coin toss would be calibrated in that sense. All risk assessment tools should be calibrated, but there are more specific desirable properties, such as calibration within groups (discussed in Requirement 2 below), that not all tools will or should satisfy completely.)
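
To illustrate what a basic calibration check involves, the following sketch (a minimal example assuming hypothetical arrays predicted_risk and outcomes, not any particular tool’s data) groups individuals by predicted risk and compares each group’s average predicted risk with its observed outcome rate.

```python
import numpy as np

def calibration_table(predicted_risk, outcomes, n_bins=10):
    """Compare mean predicted risk to the observed outcome rate within risk bins.

    predicted_risk: array of model scores in [0, 1]
    outcomes: array of 0/1 observed outcomes (e.g., rearrest within two years)
    """
    predicted_risk = np.asarray(predicted_risk, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    rows = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (predicted_risk >= lo) & (predicted_risk < hi)
        if not in_bin.any():
            continue
        rows.append({
            "risk_bin": f"{lo:.1f}-{hi:.1f}",
            "mean_predicted": float(predicted_risk[in_bin].mean()),
            "observed_rate": float(outcomes[in_bin].mean()),
            "count": int(in_bin.sum()),
        })
    return rows

# In a well-calibrated tool, mean_predicted and observed_rate are close in every bin;
# large gaps mean the scores cannot be read as probabilities at face value.
```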

Separate from data and statistical challenges, a tool might also not be valid because it does not actually answer the correct question. Because validation is always with respect to a particular context of use and a particular task to which a system is being put, validating a tool in one context says little about whether that tool is valid in another context. For example, a risk assessment might predict future arrests quite well when applied to individuals in a pretrial context, but quite poorly when applied to individuals post-conviction, or it might predict future arrest well in one jurisdiction, but not another (Sarah L. Desmarais & Evan M. Lowder, Pretrial Risk Assessment Tools: A Primer for Judges, Prosecutors, and Defense Attorneys, MacArthur Safety and Justice Challenge, Feb. 2019). The issue of cross-comparison applies not only to geography but also to time: it may be valuable to use comparisons over time to help measure the validity of tools, though such evaluations must be corrected for the fact that crime in the United States is presently a rapidly changing (and, on the whole, still rapidly declining) phenomenon. Similarly, different models built from the same data, created with different modeling decisions and assumptions, may have different levels of validity. Thus, different kinds of predictions (e.g., failure to appear, flight, recidivism, violent recidivism) in different contexts require separate validation. Without such validation, even well-established methods can produce flawed predictions. In other words, just because a tool uses data collected from the real world does not automatically make its findings valid.
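
Because validation is context-specific, revalidating a tool means re-running the same evaluation against data gathered in each new setting. The sketch below is a minimal illustration of that idea, assuming a hypothetical already-trained model exposing a scikit-learn-style predict_proba method and hypothetical held-out datasets from different contexts; it is not drawn from any actual tool’s validation protocol.

```python
from sklearn.metrics import roc_auc_score, brier_score_loss

def evaluate_in_context(model, X, y, label):
    """Evaluate a fixed, already-trained model against outcomes observed in one context.

    X, y: features and observed outcomes collected in that specific context
    (e.g., one jurisdiction, one decision point, one time period).
    """
    scores = model.predict_proba(X)[:, 1]
    print(f"{label}: AUC = {roc_auc_score(y, scores):.3f}, "
          f"Brier score = {brier_score_loss(y, scores):.3f}")

# Strong results in one context say little about another; each must be checked separately:
# evaluate_in_context(model, X_pretrial_a, y_pretrial_a, "Pretrial, Jurisdiction A")
# evaluate_in_context(model, X_postconv_a, y_postconv_a, "Post-conviction, Jurisdiction A")
# evaluate_in_context(model, X_pretrial_b, y_pretrial_b, "Pretrial, Jurisdiction B")
```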

Fundamental Issues with Using Group-Level Data to Judge Individuals

A fundamental philosophical and legal question is whether it is acceptable to make determinations about individuals’ liberty based on data about others in their group. In technical communities, making predictions about individuals from group-level data is known as the ecological fallacy. Although risk assessment tools use data about an individual as inputs, the relationship between these inputs and the predicted outcome is determined by patterns in training data about other people’s behavior.

In the context of sentencing, defendants have a constitutional right to have their sentence determined based on what they did themselves rather than on what others similar to them have done. This concern arose in State v. Loomis, where the court prohibited the use of risk scores as the decisive factor in liberty decisions, noting that “[an] offender who is young, unemployed, has an early age-at-first-arrest and a history of supervision failure, will score medium or high on the Violence Risk Scale even though the offender never had a violent offense,” illustrating how the predictions of these tools do not necessarily map onto individual cases.

The ecological fallacy is especially problematic in the criminal justice system given the societal biases that are reflected in criminal justice data, as described in the sections on Requirements 1 and 2. It is thus likely that decisions made by risk assessment tools are driven in part by what protected class an individual may belong to, raising significant Equal Protection Clause concerns.

While there is a statistical literature on how to deal with technical issues resulting from the ecological fallacy, the fundamental philosophical question of whether it is permissible to detain individuals based on data about others in their group remains. As more courts grapple with whether to use risk assessment tools, this question should be at the forefront of debate and discussed as a first-order principle.

What is Bias?

In statistical prediction settings, “bias” has several overlapping meanings. The simplest is that a prediction made by a model errs in a systematic direction, for instance predicting a value that is on average too low, or too high, for the general population. In the machine learning fairness literature, however, the term is used to refer to situations where the predicted probabilities are systematically too high or too low for specific subpopulations. (As a technical matter, a model can be biased for subpopulations while being unbiased on average for the population as a whole.) These subpopulations may be defined by protected class variables (race, gender, age, etc.) or other variables of concern, like socioeconomic class. In this paper, we will primarily use the term “bias” in this narrower sense, which aligns with the everyday use of the term to refer to disparate judgments about different groups of people.

Note that the phenomenon of societal bias (the existence of beliefs, expectations, institutions, or even self-propagating patterns of behavior that lead to unjust outcomes for some groups) is not always the same as, or reflected in, statistical bias, and vice versa. One can instead think of these as an overlapping Venn diagram with a large intersection. Most of the concerns about risk assessment tools involve biases that are simultaneously statistical and societal, though some involve purely societal bias. For instance, if non-uniform access to transportation (a societal bias) causes higher rates of failure to appear for court dates in some communities, the problem is a societal bias, but not a statistical one. Including demographic parity measurements as part of model bias measurement (see Requirement 2) may be one way to detect this, though the best solutions involve distinct policy responses (for instance, providing transportation assistance for court dates or finding ways to improve transit to underserved communities).
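
As a concrete illustration of measuring statistical bias in the narrower sense used here, the following sketch (again relying on hypothetical predicted_risk, outcomes, and group arrays rather than any real tool’s data) compares each subpopulation’s average predicted risk with its observed outcome rate.

```python
import numpy as np

def bias_by_group(predicted_risk, outcomes, group):
    """For each subpopulation, compare mean predicted risk to the observed outcome rate.

    A systematic gap for a particular group (predictions too high or too low on
    average) is statistical bias in the narrower sense used in this paper, and it
    can exist even when the model looks unbiased for the population as a whole.
    """
    predicted_risk = np.asarray(predicted_risk, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    group = np.asarray(group)
    report = {}
    for g in np.unique(group):
        mask = group == g
        mean_pred = float(predicted_risk[mask].mean())
        observed = float(outcomes[mask].mean())
        report[g] = {
            "mean_predicted": mean_pred,
            "observed_rate": observed,
            "gap": mean_pred - observed,   # positive: risk overstated for this group
        }
    return report
```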

Bias in risk assessment tools can come from many sources. Eckhouse et al., for instance, propose a three-level taxonomy of biases (Laurel Eckhouse, Kristian Lum, Cynthia Conti-Cook & Julie Ciccolini, Layers of Bias: A Unified Approach for Understanding Problems with Risk Assessment, Criminal Justice and Behavior, Nov. 2018). Requirement 1 below discusses data bias caused by imperfect data quality, missing data, and sampling bias. Requirement 2 discusses model bias that stems from omitted variable bias and proxy variables. Requirement 3 discusses model bias that results from the use of composite scores that conflate multiple distinct predictions. In combination with concerns about accuracy and validity, these challenges present significant concerns about the use of risk assessment tools in criminal justice domains.