Introduction

Context

Risk assessment instruments are statistical models used to predict the probability of a particular future outcome. Such predictions are accomplished by measuring the relationship between an individual’s features (for example, their demographic information, criminal history, or answers to a psychometric questionnaire) and the outcome of interest, and combining numerical representations of those features into a risk score. Scoring systems are generally created using statistical techniques and heuristics applied to data to determine how each feature contributes to prediction of a particular outcome (e.g., failure to appear in court). These scores are then often used to assign individuals to different brackets of risk. For example, many risk assessment tools assign individuals to decile ranks, converting their risk score into a rating from 1 to 10 that reflects whether they fall in the least risky 10% of individuals (1), the next 10% (2), and so on up to the most risky 10% (10). Alternatively, risk categorization can be based on score thresholds labeled “low,” “medium,” or “high” risk.
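To make these mechanics concrete, the following is a minimal illustrative sketch (in Python, using hypothetical scores and arbitrary cutoffs, not any particular vendor’s method) of how a numeric risk score might be converted into decile ranks and threshold-based “low”/“medium”/“high” categories:

```python
# Illustrative sketch only: converting raw risk scores into decile ranks and
# coarse "low"/"medium"/"high" categories. Scores, cutoffs, and labels here are
# hypothetical and do not reflect any deployed tool.
import numpy as np

rng = np.random.default_rng(0)
scores = rng.uniform(0, 1, size=1000)          # hypothetical predicted probabilities

# Decile rank: 1 = least risky 10% of the reference population, 10 = most risky 10%.
decile_edges = np.quantile(scores, np.linspace(0.1, 0.9, 9))
decile_ranks = np.searchsorted(decile_edges, scores, side="right") + 1

# Threshold-based categories: the cutoffs are policy choices, shown here arbitrarily.
categories = np.select(
    [scores < 0.33, scores < 0.66],
    ["low", "medium"],
    default="high",
)

print(decile_ranks[:5], categories[:5])
```

In practice, decile cutoffs are typically derived from a reference population used to norm the tool, and category thresholds are policy choices rather than statistical necessities.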

Though they are usually much simpler than the deep neural networks used in many modern artificial intelligence systems, criminal justice risk assessment tools are basic forms of AI. Whether this is the case depends on how one defines AI; it would be true under many, but not all, of the definitions surveyed, for instance, in Stuart Russell & Peter Norvig, Artificial Intelligence: A Modern Approach, Prentice Hall, 2010, at 2. PAI considers more expansive definitions, which include any automation of analysis and decision-making otherwise performed by humans, to be most helpful. Some of these tools use heuristic frameworks to produce their scores, though most use simple machine learning methods to train predictive models from input datasets. As such, they present a paradigmatic example of the potential social and ethical consequences of automated AI decision-making.
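As an illustration of the kind of simple machine learning method described above (and not a description of any deployed tool), the following sketch fits a logistic regression to synthetic data; the features, outcomes, and coefficients are entirely hypothetical:

```python
# A minimal sketch of a simple machine learning approach to risk scoring:
# logistic regression predicting a binary outcome (e.g., failure to appear)
# from a handful of features. All data are synthetic and the features are
# hypothetical; real tools differ in inputs, design, and validation.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 5000
X = np.column_stack([
    rng.integers(18, 70, n),        # age (hypothetical feature)
    rng.integers(0, 10, n),         # number of prior arrests (hypothetical feature)
    rng.integers(0, 2, n),          # prior failure to appear (hypothetical feature)
])
# Synthetic outcome with an arbitrary, made-up relationship to the features.
logits = -2.0 + 0.08 * X[:, 1] + 0.9 * X[:, 2] - 0.01 * (X[:, 0] - 30)
y = rng.random(n) < 1 / (1 + np.exp(-logits))

model = LogisticRegression().fit(X, y)
risk_scores = model.predict_proba(X)[:, 1]     # per-person predicted probabilities
print(risk_scores[:5])
```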

The use of risk assessment tools in criminal justice processes is expanding rapidly, and policymakers at both the federal and state level have passed legislation to mandate their use. In California, the recently enacted California Bail Reform Act (S.B. 10) mandates the implementation of risk assessment tools while eliminating money bail in the state, though implementation of the law has been put on hold as a result of a 2020 ballot measure, funded by the bail bonds industry, to repeal it; see https://ballotpedia.org/California_Replace_Cash_Bail_with_Risk_Assessments_Referendum_(2020); Robert Salonga, Law ending cash bail in California halted after referendum qualifies for 2020 ballot, San Jose Mercury News (Jan. 17, 2019), https://www.mercurynews.com/2019/01/17/law-ending-cash-bail-in-california-halted-after-referendum-qualifies-for-2020-ballot/. In addition, a new federal law, the First Step Act of 2018 (S. 3649), requires the Attorney General to review existing risk assessment tools, develop recommendations for “evidence-based recidivism reduction programs,” and “develop and release” a new risk- and needs-assessment system by July 2019 for use in managing the federal prison population. The bill allows the Attorney General to use currently existing risk and needs assessment tools, as appropriate, in the development of this system. This expansion has largely occurred as part of a reform effort grappling with extremely high incarceration rates in the United States, which are disproportionate both to crime rates and to international and historical baselines (see Figures 1-3). Proponents of these tools have advocated for their potential to reduce inefficiencies and costs and to bring rigor and reproducibility to life-critical decisions. Some advocates hope that these changes will reduce unnecessary detention and produce fairer, less punitive decisions than the cash bail system or systems in which human decision-makers such as judges have complete discretion.

Figure 1: Incarceration in the U.S. Relative to OECD and Historical Baselines
Source: Bureau of Justice Statistics; World Prison Brief, Birkbeck, University of London (2015/2016 data). Note: The U.S. 1960 figure includes those in state or federal institutions only.

Figure 2: U.S. State and Federal Incarceration Rates (1925-2014)
Source: Bureau of Justice Statistics & Wikimedia

Figure 3: U.S. State and Federal Incarceration Relative to All Reported Crimes (1970-2014)
Source: National Archive of Criminal Justice Data, Bureau of Justice Statistics & Wikimedia. Note: The punishment rate is calculated based on the number of people incarcerated per year rather than convictions in a given year.

These are critically important public policy goals, but there is reason to believe that these views may be too optimistic. There remain serious and unresolved problems with accuracy, validity, and bias in both the datasets and the statistical models that drive these tools. Moreover, these tools are often built to answer the wrong questions, used in poorly conceived settings, or not subject to sufficient review, auditing, and scrutiny. These concerns are nearly universal in the AI research community and across our Partnership, though views differ on whether the underlying problems could realistically be solved by improvements to the tools.

Scope of this report

This report of the Partnership on AI was written to gather, synthesize, and document the views of the artificial intelligence research community on the use of risk assessment tools in the U.S. criminal justice system. It focuses on the use of these tools in the pretrial context, but many of the concerns identified apply to other risk assessment contexts as well (e.g., parole release and sentencing decisions within the U.S., and the design of risk assessment systems in other countries). The report attempts to answer two questions: What technical and human-computer interface challenges prevent risk assessment tools from being used to inform fair decisions? And with what transparency, auditing, and procedural protections would it be acceptable to use these tools as possible inputs into criminal justice determinations?

Background on PAI

The Partnership on AI is a 501(c)(3) non-profit organization that convenes a coalition of over 80 members, including civil society groups, corporate developers and users of AI, and numerous academic artificial intelligence research labs, to answer important questions about artificial intelligence policy and ethics. This particular report reflects input from conversations that PAI has convened with dozens of its member organizations, as well as with numerous experts on fairness and bias in machine learning and on the U.S. criminal justice system. Though the report should not be taken as stating an official stance of any particular member, it attempts to report views widely held across our membership and the artificial intelligence research community.

PAI’s work on risk assessment tools in the criminal justice system was initially prompted by the passage of Senate Bill 10 (S.B. 10) in California, which would use risk assessment tools in making pretrial detention decisions. The scope of this project has since expanded, with this report addressing not only the S.B. 10 context but also broader concerns with the use of risk assessment tools around the country.

Objectives of the report

An overwhelming majority of the experts consulted by the Partnership agreed that current risk assessment tools are not ready to be used in deciding whether to detain or continue to detain criminal defendants without an individualized hearing. In addition, many of our civil society partners have taken a clear public stance to this effect, and some go further, suggesting that only individual-level decision-making will be adequate for this application regardless of the robustness and validity of risk assessment instruments. See The Use of Pretrial ‘Risk Assessment’ Instruments: A Shared Statement of Civil Rights Concerns, http://civilrightsdocs.info/pdf/criminal-justice/Pretrial-Risk-Assessment-Full.pdf (shared statement of 115 civil rights and technology policy organizations, arguing that all pretrial detention should follow from evidentiary hearings rather than machine learning determinations, on both procedural and accuracy grounds); see also Comments of Upturn; The Leadership Conference on Civil and Human Rights; The Leadership Conference Education Fund; NYU Law’s Center on Race, Inequality, and the Law; The AI Now Institute; Color Of Change; and Media Mobilizing Project on Proposed California Rules of Court 4.10 and 4.40, https://www.upturn.org/static/files/2018-12-14_Final-Coalition-Comment-on-SB10-Proposed-Rules.pdf (“Finding that the defendant shares characteristics with a collectively higher risk group is the most specific observation that risk assessment instruments can make about any person. Such a finding does not answer, or even address, the question of whether detention is the only way to reasonably assure that person’s reappearance or the preservation of public safety. That question must be asked specifically about the individual whose liberty is at stake — and it must be answered in the affirmative in order for detention to be constitutionally justifiable.”) PAI notes that the requirement for an individualized hearing before detention implicitly includes a need for timeliness; many jurisdictions across the U.S. limit detention without a hearing to 24 or 48 hours. Aspects of this stance are shared by some risk assessment tool makers; see Arnold Ventures’ Statement of Principles on Pretrial Justice and Use of Pretrial Risk Assessment, https://craftmediabucket.s3.amazonaws.com/uploads/AV-Statement-of-Principles-on-Pretrial-Justice.pdf.

One objective of this report is to articulate the reasons for this nearly unanimous view among contributors and to help inform a dialogue with policymakers considering the use of these tools. PAI members and the wider AI community have not, however, reached consensus on whether statistical risk assessment tools could ever be improved enough to justify detaining or continuing to detain someone on the basis of their risk assessment score without an individualized hearing. For some of our members, the concerns remain structural and procedural as well as technical. See the Ecological Fallacy section and Baseline D for further discussion of this topic. Regardless of the differing views on these particular issues, this report summarizes the technical, human-computer interface, and governance problems that the community has collectively identified.

Baselines for Comparison

Some of the controversy about risk assessment tools derives from the different baselines against which they are evaluated. Policymakers have many possible baselines they can use in deciding whether to procure and use these tools, including:

  A. Do risk assessment tools achieve absolute fairness? This is unlikely to be achieved by any system or institution, given serious limitations in data as well as unresolved philosophical questions about fairness.
  B. Are risk assessment tools as fair as they can possibly be based on available datasets? This may be achievable, but only in the context of (a) deciding on a specific measure of fairness (one such measure is illustrated in the sketch following this list) and (b) using the best available methods to mitigate societal and statistical biases in the data. In practice, however, given the limitations in available data, this often translates to ignoring biases in the data that are difficult to address.
  C. Are risk assessment tools an improvement over current processes and human decision-makers? Risk assessment tools can be benchmarked against the performance of the processes, institutions, and human decision-making practices in place before their introduction, or against similar systems in other jurisdictions that do not use risk assessment tools. Such evaluations could be based on measurable goals (like better predicting appearance at court dates or recidivism) or on reduced susceptibility to human biases. In this sense, risk assessment tools may not achieve a defined notion of fairness, but rather be comparatively better than the status quo.
  D. Are risk assessment tools an improvement over other possible reforms to the criminal justice system? Other reforms may address the same objectives (e.g., improving public safety, reducing the harm of detention, and reducing the costs and burdens of judicial process) at lower cost, with greater ease of implementation, or without trading off civil rights concerns.
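As a concrete illustration of what “deciding on a specific measure of fairness” under Baseline B can involve, the sketch below (in Python, on entirely hypothetical data) computes one common group fairness metric: the gap in false positive rates between two groups at a fixed decision threshold. It is not a recommendation of this particular metric; different fairness measures can be mutually incompatible, which is part of why Baselines A and B produce ambiguous results.

```python
# Illustrative sketch of one group fairness metric: the difference in false
# positive rates between two groups at a fixed "high risk" threshold.
# Groups, scores, outcomes, and the threshold are all hypothetical.
import numpy as np

rng = np.random.default_rng(2)
n = 2000
group = rng.integers(0, 2, n)                  # hypothetical group membership (0 or 1)
scores = rng.uniform(0, 1, n)                  # hypothetical risk scores
outcome = rng.random(n) < 0.3                  # hypothetical observed outcomes

threshold = 0.5
flagged = scores >= threshold                  # "high risk" decision at this cutoff

def false_positive_rate(flagged, outcome, mask):
    """Share of people in `mask` with no observed outcome who were flagged anyway."""
    negatives = mask & ~outcome
    return flagged[negatives].mean()

fpr_0 = false_positive_rate(flagged, outcome, group == 0)
fpr_1 = false_positive_rate(flagged, outcome, group == 1)
print(f"FPR group 0: {fpr_0:.3f}, FPR group 1: {fpr_1:.3f}, gap: {abs(fpr_0 - fpr_1):.3f}")
```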

Baselines A and B are useful for fundamental research on algorithmic fairness and for empirical analysis of the performance of existing systems, but they necessarily produce ambiguous results due to the existence of highly defensible but incompatible definitions of fairness. Nonetheless, they can provide a useful framework for understanding the philosophical, legal, and technical issues with proposed tools.

Baseline C is one of the perspectives widely held by experts working in this space. It is potentially appropriate for policymakers and jurisdictions purchasing tools under legislative mandates beyond their control, or in situations where political constraints make Baseline D inapplicable. We should, however, stress that in all of the conversations convened by the Partnership on AI, Baseline D has been widely viewed as the more fundamentally correct and appropriate standard, both as a policymaking goal and as an evaluation standard for risk assessment tools. Therefore, legislatures and judicial authorities should apply Baseline D whenever it is feasible for them to do so.