Explainable Machine Learning in Deployment

PAI Staff

Organizations and policymakers around the world are turning to Explainable AI (XAI) as a means of addressing a range of AI ethics concerns. PAI’s recent research paper, Explainable Machine Learning in Deployment, is the first to examine how ML explainability techniques are actually being used. We find that, in its current state, XAI best serves as an internal resource for engineers and developers rather than as a way of providing explanations to end users. Additional improvements to XAI techniques are necessary for them to work as intended and to help end users, policymakers, and other external stakeholders understand and evaluate automated decisions.

READ THE BLOG POST  READ THE PAPER


Human-AI Collaboration Trust Literature Review: Key Insights and Bibliography

PAI Staff

Key Insights from a Multidisciplinary Review of Trust Literature

Understanding trust between humans and AI systems is integral to promoting the development and deployment of socially beneficial and responsible AI. Successfully doing so warrants multidisciplinary collaboration.

In order to better understand trust between humans and artificially intelligent systems, the Partnership on AI (PAI), supported by members of its Collaborations Between People and AI Systems (CPAIS) Expert Group, conducted an initial survey and analysis of the multidisciplinary literature on AI, humans, and trust. This project includes a thematically-tagged Bibliography with 78 aggregated research articles, as well as an overview document presenting seven key insights.

These key insights, themes, and aggregated texts can serve as fruitful entry points for those investigating the nuances in the literature on humans, trust, and AI, and can help align understandings related to trust between people and AI systems. This work can also inform future research, which should investigate gaps in the research and our bibliography to improve our understanding of how human-AI trust facilitates, or sometimes hinders, the responsible implementation and application of AI technologies.

Key Insights

Several high-level insights emerged when reflecting on the bibliography of submitted articles:

  1. There is a presupposition that trust in AI is a good thing, with limited consideration of distrust’s value.
    The original project proposal emphasized a need to understand the literature on humans, AI, and trust in order to eventually determine appropriate levels of trust and distrust between AI and humans in different contexts. However, the articles included in the bibliography are largely framed around the need for and motivation toward trust – not distrust – between AI systems and humans. While certain instances may warrant facilitated trust between humans and AI, others may enable more socially beneficial outcomes if they prompt distrust or caution. Future literature should explore distrust as a concept related to, but not necessarily the direct opposite of, trust. For example, an AI system that helps doctors detect cancer cells is only useful if the human doctor and patient trust that information. In contrast, individuals should remain skeptical of AI systems designed to induce trust for malevolent purposes, such as AI-generated malware that uses data to more realistically mimic the conversational style of a target’s closest friends.
  2. Many of the articles were published before the Internet became ubiquitous and before the social implications of AI became a central research focus.
    It is important to contextualize recent literature on intelligent systems and humans with literature focused on the social and cognitive mechanisms undergirding human-to-human, or human-to-organization, trust. Future work can put many of the foundational, conceptual articles written before the 21st century in conversation with those specifically focused on the context of AI systems and their different use cases. It can also compare how these foundational, early articles explore trust with how trust is treated specifically in relation to humans interacting with AI.
  3. Trust between humans and AI is not monolithic: Context is vital.
    Trust is not all or nothing. There often exist varying degrees of trust, and the level of trust sufficient to deploy AI in different contexts is therefore an important question for future exploration. There might also be several layers of trust to secure before someone might trust and perhaps ultimately use an AI tool. For example, one might trust the data upon which an intelligent system was trained, but not the organization using that data, or one might trust a recommender system or algorithm’s ability to provide useful information, but not the specific platform upon which it is delivered. This multifaceted trust between humans and AI systems, and its implications for adoption and use, should be explored in future research.
  4. Promoting trust is often presented simplistically in the literature.
    The majority of the literature appears to assert not only that AI systems are inherently deserving of trust, but also that people need guidance in order to trust them. The basic formula is that explanation will demonstrate trustworthiness, and that once a system is understood to deserve trust, people will use it. Both of these conceptual leaps are contestable. While explaining the internal logic of AI systems does, in some instances, improve confidence for expert users, providing simplified models of the internal workings of AI has not, in general, been shown to be helpful or to increase trust.
  5. Articles make different assumptions about why trust matters.
    Within our corpus, we found a range of implicit assumptions about why fostering and maintaining trust is important and valuable. The dominant stance is that trust is necessary to ensure that people will use AI. The link between trust and adoption is tenuous at best, as people often use technologies without trusting them. What is largely consistent across the corpus – with the exception of some papers concerned about the dangers of overtrust in AI – is the goal of fostering more trust in AI, or stated differently, the premise that more trust is inherently better than less trust. This premise needs challenging. A more reasonable goal would be for people to be able to make individual assessments about which AI systems they ought to trust and which they ought not, for which purposes, and in which circumstances. This connects to insight 1: There is a presupposition that trust in AI is a good thing. It is important to think about context and person-level motivations and preferences, as well as instances in which trust might not be a precondition for use or adoption.
  6. AI definitions differ between publications.
    The lack of consistent definitions of AI within our corpus makes it difficult to compare findings. Most articles do not present a formal definition of AI, as they are concerned with a particular intelligent system applied in a specific domain. The systems in question differ in significant ways: the types of users who may need to trust the system, the types of outputs that a person may need to trust, and the contexts in which the AI is operating (e.g., high- vs. low-stakes environments). These differences likely call for different strategies as they relate to trust. There is a need to develop a framework for understanding how these different contributions relate to each other, potentially looking not at trust in AI generally, but at trust in different facets and applications of AI. For a more detailed analysis of what questions to ask to differentiate particular types of human-AI collaboration, see the PAI CPAIS Human-AI Collaboration Framework.
  7. Institutional trust is underrepresented.
    Institutional trust might be especially relevant in the context of AI, where there is often a competence or knowledge gap between everyday users and those developing the AI technologies. Everyday users, lacking high levels of technical capital and knowledge, may find it difficult to make informed judgments of particular AI technologies; in the absence of this knowledge, they may rely on generalized feelings of institutional trust.

About the Bibliography

The CPAI Trust Literature Bibliography includes 78 thematically tagged research articles (with references and abstracts). The article selection process sourced content from a multidisciplinary community all aligned around an interest and expertise in human-AI collaboration. Submitted articles were evaluated for inclusion and analyzed by members of a smaller project group from within the PAI Partner community. An analysis of the almost 80 initial articles resulted in the development of four thematic tags, highlighting the ways the article abstracts approached the issue of trust. Specifically:

Understanding – lays out a conceptual framework for trust or is primarily a survey of trust-related issues
Promoting – focuses on means of increasing trust
Receiving – focuses on the entity (e.g., a robot, a system, a website) that is trusted
Impacting – focuses on the nature of changes that occur when trust is present (e.g., the impact on a group or an organization when it experiences trust)
Two individuals from the smaller project group undertook a thematic tagging exercise to assess inter-rater reliability and the distribution of themes across articles. For each article, they assigned a primary and, where applicable, a secondary theme (first and second order) from the four thematic options above.

The CPAIS Trust Literature Bibliography identifies thematic tags for each article at levels 1 and 2. The “themes” column lists the first order themes and, where applicable, the second order themes, and the total number of tags for each article (at both levels) is also provided. “Understanding trust” was the most frequent theme, applied to 61 articles (78% of the total). 50 articles (64%) were tagged with “promoting trust,” 29 articles (37%) with “receiving trust,” and 13 articles (16%) with a focus on impacting trust.
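
To make the tallying concrete, here is a minimal sketch of how such a distribution could be recomputed from a tagged spreadsheet. The file name and column names (primary_theme, secondary_theme) are assumptions for illustration, not the actual layout of the published bibliography.

# Hypothetical sketch: tally primary/secondary theme tags across the corpus.
# The file name and column names are illustrative assumptions only.
import csv
from collections import Counter

THEMES = ["Understanding", "Promoting", "Receiving", "Impacting"]

def theme_distribution(path):
    counts, total = Counter(), 0
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            total += 1
            # An article counts toward a theme if it carries the tag at
            # either level (first or second order).
            tagged = {row.get("primary_theme", "").strip(),
                      row.get("secondary_theme", "").strip()}
            for theme in THEMES:
                if theme in tagged:
                    counts[theme] += 1
    return {t: (counts[t], round(100 * counts[t] / max(total, 1))) for t in THEMES}

if __name__ == "__main__":
    for theme, (n, pct) in theme_distribution("cpais_trust_bibliography.csv").items():
        print(f"{theme}: {n} articles ({pct}%)")

Counting an article toward a theme whenever the tag appears at either level appears consistent with the figures above, whose percentages sum to more than 100% because articles can carry up to two tags.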

This bibliography and thematic tags serve as fruitful entry points for those investigating the nuances in the literature on humans, trust, and AI, especially when contextualized with the insights drawn from the corpus presented above.

DOWNLOAD INSIGHTS            VIEW BIBLIOGRAPHY


Human-AI Collaboration Framework & Case Studies

PAI Staff

Overview

Best practices on collaborations between people and AI systems – including those for issues of transparency and trust, responsibility for specific decisions, and appropriate levels of autonomy – depend on a nuanced understanding of the nature of those collaborations.

With the support of the Collaborations Between People and AI Systems (CPAIS) Expert Group, PAI has developed a Human-AI Collaboration Framework, containing 36 questions that identify some characteristics that differentiate examples of human-AI collaborations. We have also prepared a collection of seven case studies that illustrate the Framework and its applications in the real world.

This project explores the relevant features one should consider when thinking about human-AI collaboration, and how these features present themselves in real-world examples. By drawing attention to the nuances – including the distinct implications and potential social impacts – of specific AI technologies, the Framework can serve as a helpful nudge toward responsible product/tool design, policy development, or even research processes on or around AI systems that interact with humans.

As a software engineer from a leading technology company suggested, this Framework would be useful to them because it would enable focused attention on the impact of their AI system design, beyond the typical parameters of how quickly it goes to market or how it performs technically.

“By thinking through this list, I will have a better sense of where I am responsible to make the tool more useful, safe, and beneficial for the people using it. The public can also be better assured that I took these parameters into consideration when working on the design of a system that they may trust and then embed in their everyday life.”

SOFTWARE ENGINEER, PAI RESEARCH PARTICIPANT

Case Studies

To illustrate the application of this Framework, PAI spoke with AI practitioners from a range of organizations, and collected seven case studies designed to highlight the variety of real world collaborations between people and AI systems. The case studies provide descriptions of the technologies and their use, followed by author answers to the questions in the Framework:

  1. Virtual Assistants and Users (Claire Leibowicz, Partnership on AI)
  2. Mental Health Chatbots and Users (Yoonsuck Choe, Samsung)
  3. Intelligent Tutoring Systems and Learners (Amber Story, American Psychological Association)
  4. Assistive Computing and Motor Neuron Disease Patients (Lama Nachman, Intel)
  5. AI Drawing Tools and Artists (Philipp Michel, University of Tokyo)
  6. Magnetic Resonance Imaging and Doctors (Bendert Zevenbergen, Princeton Center for Information Technology Policy)
  7. Autonomous Vehicles and Passengers (In Kwon Choi, Samsung)

 

VIEW THE FRAMEWORK AND CASE STUDIES        READ THE BLOG POST


Visa Laws, Policies, and Practices: Recommendations for Accelerating the Mobility of Global AI/ML Talent

PAI Staff


Executive Summary

Immigration laws, policies, and practices are challenging the ability of many communities, including the artificial intelligence and machine learning (AI/ML) community, to incorporate diverse voices in their work. As a global, multi-stakeholder nonprofit committed to the creation and dissemination of best practices in artificial intelligence, the Partnership on AI (PAI) is uniquely positioned to address the impacts of immigration laws, policies, and practices on the AI/ML community.

PAI believes that bringing together experts from countries around the world that represent different cultures, socio-economic experiences, backgrounds, and perspectives is essential for AI/ML to flourish and help create the future we desire. In order to fulfill their talent goals and host conferences of international caliber, countries will need to devise laws, policies, and practices that enable people from around the world to contribute to these conversations.

Based on input from PAI Partners and PAI’s own research, this paper offers recommendations to address these specific challenges. It highlights the importance of conferences and convenings for the variety of disciplines making important contributions to AI/ML, and makes recommendations for participants and organizers that may facilitate ease of travel to these events. It also presents recommendations for governments to improve the accessibility, evaluation, and processing of visas for all types of potential visitors, including students, interns, and accompanying families. Appendices to the paper respond to potential questions and provide an overview of the global demand for AI talent, as well as additional details on technical or expert visa, residence, and work permit laws, policies, and practices.

PAI’s recommendations are based on our area of expertise, and have been developed to help advance the mobility of innovative global AI/ML talent from a variety of disciplines. Many countries have already created visa classifications for other specialized occupations, including medical professionals, professional athletes, entertainers, religious workers, and entrepreneurs.

At the same time, we acknowledge the complex immigration debates taking place in countries around the world, and the challenges posed by global migration and the quest for basic human rights and dignity. These recommendations are in no way intended to minimize or replace opportunities for those affected by ongoing immigration discussions and policymakers’ actions. We hope policymakers can create a path toward permanent residency or citizenship for these groups. In fact, while our recommendations target our field of expertise, we hope our paper can serve as a useful resource for the broader community, in support of balancing governments’ public safety responsibilities with the benefits of immigration, freedom of movement, and collaboration.

Though this document incorporated suggestions from many of PAI’s partner organizations, it should not under any circumstances be read as representing the views of any specific member of the Partnership. Instead, it is an attempt to report the views of the artificial intelligence community as a whole.

Recommendations

Based on our investigations, PAI has developed the policy recommendations below for the global AI/ML community and policymakers around the world. Additional details on each of these recommendations are provided in the full text of the report.

I. Recommendations for the Global AI/ML Community:

  1. Use Plain Language Where Possible
    Consular and immigration officials may not be trained in or familiar with the language used in the AI/ML community. PAI recommends that visa applicants use as much plain language as possible to explain technical terms and to describe the purpose of their visit and their areas of expertise, which facilitates the review of application documents and forms.
  2. Share Relevant Information with Host Countries in Advance
    Many governments evaluate visa applications on the basis of the applicant’s nationality and other factors, rather than the skills they will bring to the convening. Conference organizers will have to take extraordinary steps to facilitate the entry of their invited participants until laws, policies, and practices change in countries around the world. Conference organizers should contact host country government officials far in advance of the conference to share relevant information and facilitate government review of visa applications. Useful information includes a description of the conference, number of invited participants, and copies of invitation letter templates and other necessary paperwork.

II. Recommendations for Policymakers:

  1. Accelerate Reviews of Visa Applications
    Pass and implement laws, policies, and practices that accelerate review and favorably consider applications for visas, permits, and permanent legal status from highly skilled individuals. Visas should not be numerically limited or “capped.”
  2. Create AI/ML Visa Classifications within Existing Groups
    Members of existing intergovernmental groups, such as the Organization for Economic Cooperation and Development (OECD), should create visa classifications that enable AI/ML multidisciplinary experts to meet, convene, study, and work across member countries. The terms of the visa should be reciprocal across all countries.
  3. Publish Accessible Visa Application Information
    Visa application rules, processes, and timelines should be clear, easily understood, and accessible – published in plain language, in applicants’ native languages, on websites and in other publicly available locations. These processes should be fair, transparent, and clearly demonstrate that determinations for sponsor visas are based on skills.
  4. Establish Just Standards for Evaluating Visa Applications
    Eliminate nationality-based barriers in evaluating visa and permanent residence applications from highly skilled individuals. Security-based denials of applications should not be nationality based, but rather should be founded on specific and credible security and public safety threats, evidence of visa fraud, or indications of human trafficking.
  5. Train Officials in the Language of Emerging Technologies
    Train consular and immigration officials in the language of emerging technologies so they can quickly recognize and adjudicate applications from highly skilled experts.
  6. Assist Visa Applicants
    Empower select officials to assist applicants in correctly filling out visa paperwork, as well as clarifying and resolving any questions or discrepancies that may otherwise lead to a denial or delay in approval. Beneficiaries would include startups, small- and medium-sized enterprises, smaller colleges and universities, less affluent applicants, and students and interns.
  7. Students and Interns are the Future
    Pass laws that establish special categories of visas or permits for AI/ML students and interns. These laws should clearly identify a path for graduates to obtain a work permit (as necessary), or to obtain permanent legal status or citizenship.
  8. Redefine “Families”
    Adopt visa permissions that reflect a comprehensive definition of “family,” modeled on the Finnish Aliens Act and similar definitions in other European nations. Family visas should not be numerically limited. Legal spouses, partners, and those with family ties should also be permitted to work or study in the host country. Long-term caregivers should be permitted to accompany and remain with the main visa applicant and their family while employed in that capacity.
  9. Rely on Effective Policies and Systems to Protect Information
    Immigration restrictions do not adequately protect information and intellectual property rights. For example, trade negotiations can strengthen intellectual property laws and establish courts to protect and enforce intellectual property rights owned by individual rights holders, whereas immigration policies and practices that apply broadly to all applicants from a particular country do not.

READ THE FULL PAPER

Frequently Asked Questions

Why would PAI tackle a subject such as visas and immigration? This topic is not really related to artificial intelligence research.

PAI believes that bringing together experts from countries around the world that represent different cultures, socio-economic experiences, backgrounds, and perspectives is essential for AI/ML to flourish and help create the future we desire. Artificial intelligence is projected to affect all facets of society, and in some ways it already does. PAI’s work addresses a number of topics related to AI, such as criminal justice and labor and the economy. Our work to address immigration challenges affecting the AI community is quite similar.

How does this document pertain to PAI’s mission and work?

This document makes visa policy recommendations that would improve the mobility of global AI/ML talent and enable companies, organizations and countries to benefit from their diverse perspectives. Fostering, cultivating, and preserving a culture of diversity and belonging in our work and in the people and organizations who contribute to our work is essential to our mission, and embedded in our Tenets. These include: committing to open research and dialogue on the ethical, social, economic, and legal implications of AI, ensuring that AI technologies benefit and empower as many people as possible, and striving to create a culture of cooperation, trust, and openness among AI scientists and engineers to help better achieve these goals.

Who benefits from this policy paper?

Unlike large, multinational companies and prominent, well-funded universities and colleges, startups, small- and medium-sized enterprises, individuals traveling to conferences, less affluent applicants, students, and interns often lack the resources to hire experts to ensure their preferred candidates have the greatest chance to obtain visas for internships, to study, or to work in their organizations. These groups and individuals often cannot successfully compete for visas, especially those that are numerically limited. They would be the greatest beneficiaries should governments implement these recommendations.

Why is PAI uniquely suited to address this issue?

As a multi-stakeholder nonprofit, PAI convenes over 100 global Partners, originating from 12 countries and four continents, and representing industry, civil society, and academic and research institutes. As such, we are uniquely qualified to describe the impacts of immigration laws, policies, and practices on the AI/ML community. The impetus for this document came from many of PAI’s Partners and colleagues, who have shared how certain visa laws, policies, and practices negatively affect their organizations’ abilities to benefit from global representatives and perspectives in their work.

Why is PAI focused on incorporating diverse voices in AI/ML?

Diverse perspectives are necessary to ensure that AI is developed in a responsible manner, thoughtfully benefiting all people in society. Voices and contributions from global talent are also essential to reducing the unintended consequences that can arise from AI/ML development and deployment, including those related to safety and security. Because of its emergent and rapidly evolving nature, AI in particular engenders high-impact safety and security risks, which can be mitigated by increasing the diversity of participating voices (Han et al., 2019). Diverse representation also serves to promote the safety of key members of the AI/ML community. Underrepresented voices, such as those of minorities and the LGBTQ community, are important as we design AI/ML systems to be inclusive of all populations.

Is PAI suggesting that AI/ML practitioners should be treated differently than other skilled workers? How is this different from other visa categories?

PAI’s recommendations would enable AI/ML practitioners, from a variety of disciplines, to travel and work more freely. In some cases, this could entail special visa classifications, similar to those that already exist for skilled workers in other specialized occupations, such as medical professionals, professional athletes, entertainers, religious workers, entrepreneurs, skilled laborers and trades workers.

This paper also highlights the many disciplines involved in the development and operation of AI/ML systems, above and beyond what is sometimes defined as “skilled technology work.” Responsible AI/ML systems involve input from researchers and practitioners in social sciences such as economics, sociology, philosophy, ethics, linguistics, and communications, and the “experiential expertise” offered by those working in labor and workers’ rights (see Young et al., 2019), in addition to technical fields such as mathematics, statistics, computer science, data science, neuroscience, and biology.

How does this work? Unlike medical professionals or engineers, AI/ML practitioners don’t have a certificate or license for governments to determine that they are experts.

Countries establish criteria for evaluating applications, whether for technical talent, a professional athlete, or someone skilled in trades or labor. Established eligibility criteria, and the processes for evaluating them, vary greatly from country to country. The PAI paper offers models for countries to consider and draw upon if they decide to create a classification for AI/ML practitioners.

For example, some countries require letters from a potential employer, attestations from someone in the field to the applicant’s particular skills, or other supporting documentation proving the applicant has the desired skills. Some examples:

  • An independent review board: The UK Tech Nation Visa, also known as the Tier 1 Exceptional Talent Visa, assigns an independent “designated competent body” to review and endorse applications. The Tech Nation Visa Guide outlines the skills and specialties typically exhibited in applications reviewed by this independent body, as well as the eligibility criteria.
  • Points-based system: Canada’s Express Entry Program, like other Canadian visa programs, evaluates applicants on the basis of the occupations and skill levels the country hopes to attract. Certain occupations and skills, among other criteria, garner greater numbers of points. The higher the overall point total, the greater the likelihood of being admitted entry.
  • Government review: Japan’s Skilled Labor Visa program requires documentation to support the visa application, and that documentation must prove, among other elements, that the applicant has a certain number of years of experience. The government reviews the documentation and issues a Certificate of Eligibility (COE) if it determines that the applicant possesses the necessary experience and skills. Including a COE in the application can accelerate visa processing time.
  • Additional examples can be found in Recommendations for Policymakers #1 and Appendix C of the paper.

Visa Laws, Policies, and Practices: Recommendations for Accelerating the Mobility of Global AI/ML Talent

Executive Summary

Recommendations

Frequently Asked Questions

Sources Cited

  1. Han, T. A., Pereira, L. M., Santos, F. C., & Lenaerts, T. (2019). Modelling the Safety and Surveillance of the AI Race. arXiv preprint.
  2. Young, M., Magassa, L., & Friedman, B. (2019). Toward inclusive tech policy design: a method for underrepresented voices to strengthen tech policy documents. Ethics and Information Technology, 21(2), 89-103.

AI, Labor, and the Economy Case Study Compendium

PAI Staff

Preface

The AI, Labor, and Economy Case Studies Compendium is a work product of the Partnership on AI’s “AI, Labor, and the Economy” (AILE) Working Group, formed through a collaborative process of research scoping and iteration. Though this work product reflects the inputs of many members of PAI, it should not be read as representing the views of any particular organization or individual within this Working Group, or an entity within PAI at-large.

The Partnership on AI (PAI) is a 501(c)3 nonprofit organization established to study and formulate best practices on AI technologies, to advance the public’s understanding of AI, and to serve as an open platform for discussion and engagement about AI and its influences on people and society.

One of PAI’s significant program lines is a series of Working Groups reflective of its Thematic Pillars, which are a driving force in research and best practice generation. The Partnership’s activities are deliberately determined by its coalition of over 80 members, including civil society groups, corporate users of AI, and numerous academic artificial intelligence research labs, but from the outset of the organization, the intention has been to create a place for open critique and reflection. Crucially, the Partnership is an independent organization; though supported and shaped by our Partner community, the Partnership is ultimately more than the sum of its parts and will make independent determinations to which its Partners will collectively contribute, but never individually dictate. PAI provides staff administrative and project management support to Working Groups, oversees project selection, and provides financial resources or direct research support to projects as needed.

AI, Labor, and the Economy Case Study Compendium

Preface

Objectives and Scope

Subject Diversity and Common Motifs 

Themes and Observations

Terms and AI techniques used

Methodology

Limitations and Further Work

Conclusion

Appendix

Sources Cited

  1. See Acknowledgements for more information
  2. Researchers have argued for the need for “more systematic collection of the use of these technologies at the firm level.” The case study project intends to provide quantitative and qualitative data at the firm level. For more, see “AI, Labor, Productivity and the Need for Firm-Level Data,” Manav Raj and Robert Seamans, April 2018.
  3. In business circles, many pre-established techniques such as pattern-matching heuristics, or linear regression and other forms of statistical data analysis, have recently been rebranded as “AI” (and in the case of statistical regression, also as “ML”). We accept these expansive definitions not because they are fashionable, but because they are more useful for understanding the economic consequences of present forms of automation. See section “Terms and AI techniques used” for more details on how these terms are defined.
  4. Nils J. Nilsson, The Quest for Artificial Intelligence: A History of Ideas and Achievements, (Cambridge, UK: Cambridge University Press, 2010).
  5. Quoted figures are reported by the subject organizations, not independent analyses.
  6. As the case illustrates, the social and labor impacts can often cascade beyond the location of the AI implementation. Kate Crawford and Vladan Joler explore this concept extensively as it relates to the “vast planetary network” of labor, energy, and data to support small interactions with an Amazon Echo. See more at www.anatomyof.ai.
  7. An ‘AI-native’ refers to a company that was founded with a stated mission of leveraging artificial intelligence or machine learning as a key enabling technology. ‘AI-natives’ can build infrastructure from the ground-up without the need to shift from legacy systems (e.g., on-premise to cloud-based storage).
  8. For more, see “Is the Solow Paradox Back?”, McKinsey Quarterly, June 2018.
  9. We do not have a measure of hours worked to estimate the increase in labor productivity precisely.
  10. Some have argued that inequality could increase with the proliferation of AI in the long term. While we do not address this question, please see Joseph Stiglitz and Anton Korinek’s paper for more: “Artificial Intelligence and Its Implications for Income Distribution and Unemployment,” December 2017.
  11. It is not clear what the net-impact of AI on jobs will be in the near future. The McKinsey Global Institute estimates that “total full-time-equivalent-employment demand might remain flat, or even that there could be a slightly negative net impact on jobs by 2030,” yet demand for new types of jobs may increase, as seen with the advent of the personal computer in the late 20th century.
  12. This only includes scientists and research associates and does not account for data scientists, automation engineers, and lab technicians that support teams with their services.
  13. Zymergen is an “AI-native” company that was founded in 2013. As such, the company started its data storage in the cloud. All data infrastructure could be built with a clean slate and modern toolchains, making data exportation and analysis on cloud systems easier than it might be for an incumbent (such as Tata Steel Europe). The latter might be dependent on proprietary or embedded on-premise systems that were installed without these objectives in mind.
  14. Natural Language Processing, a popular subfield of AI
  15. CNNs (convolutional neural networks) were tested as part of Zymergen’s broader recommendation engine and were also used in isolated cases within the lab (e.g., computer vision for plate readers).
  16. During the time of writing the case study in fall 2018, the company had raised $174M. On December 13, 2018, the company announced a $400M Series C round from multiple investors. See coverage of the announcement on Bloomberg and the Wall Street Journal.
  17. Our definition draws on the classic articulation of automation described by Parasuraman, Sheridan, and Wickens (2000): https://ieeexplore.ieee.org/document/844354

Report on Algorithmic Risk Assessment Tools in the U.S. Criminal Justice System

PAI Staff

Overview

This report was written by the staff of the Partnership on AI (PAI) and many of our Partner organizations, with particular input from the members of PAI’s Fairness, Transparency, and Accountability Working Group. Our work on this topic was initially prompted by California’s Senate Bill 10 (S.B. 10), which would mandate the purchase and use of statistical and machine learning risk assessment tools for pretrial detention decisions, but our work has subsequently expanded to assess the use of such software across the United States.

Though this document incorporated suggestions or direct authorship from around 30-40 of our partner organizations, it should not under any circumstances be read as representing the views of any specific member of the Partnership. Instead, it is an attempt to report the widely held views of the artificial intelligence research community as a whole.

The Partnership on AI is a 501(c)3 nonprofit organization established to study and formulate best practices on AI technologies, to advance the public’s understanding of AI, and to serve as an open platform for discussion and engagement about AI and its influences on people and society.

The Partnership’s activities are determined in collaboration with its coalition of over 80 members, including civil society groups, corporate developers and users of AI, and numerous academic artificial intelligence research labs. PAI aims to create a space for open conversation, the development of best practices, and coordination of technical research to ensure that AI is used for the benefit of humanity and society. Crucially, the Partnership is an independent organization; though supported and shaped by our Partner community, the Partnership is ultimately more than the sum of its parts and makes independent determinations to which its Partners collectively contribute, but never individually dictate. PAI provides administrative and project management support to Working Groups, oversees project selection, and provides financial resources or direct research support to projects as needs dictate.

The Partnership on AI is deeply grateful for the collaboration of so many colleagues in this endeavor and looks forward to further convening and undertaking the multi-stakeholder research needed to build best practices for the use of AI in this critical domain.

Report on Algorithmic Risk Assessment Tools in the U.S. Criminal Justice System

Overview

Executive Summary

Introduction

Minimum Requirements for the Responsible Deployment of Criminal Justice Risk Assessment Tools

Requirement 1: Training datasets must measure the intended variables

Requirement 2: Bias in statistical models must be measured and mitigated

Requirement 3: Tools must not conflate multiple distinct predictions

Requirement 4: Predictions and how they are made must be easily interpretable

Requirement 5: Tools should produce confidence estimates for their predictions

Requirement 6: Users of risk assessment tools must attend trainings on the nature and limitations of the tools

Requirement 7: Policymakers must ensure that public policy goals are appropriately reflected in these tools

Requirement 8: Tool designs, architectures, and training data must be open to research, review, and criticism

Requirement 9: Tools must support data retention and reproducibility to enable meaningful contestation and challenges

Requirement 10: Jurisdictions must take responsibility for the post-deployment evaluation, monitoring, and auditing of these tools

Conclusion

Sources Cited

  1. For example, many risk assessment tools assign individuals to decile ranks, converting their risk score into a rating from 1-10 that reflects whether they are in the lowest-risk 10% of individuals (1), the next 10% (2), and so on up to the highest-risk 10% (10). Alternatively, risk categorization could be based on thresholds labeled as “low,” “medium,” or “high” risk. (A short illustrative sketch of both schemes appears after this list.)
  2. Whether this is the case depends on how one defines AI; it would be true under many but not all of the definitions surveyed for instance in Stuart Russell & Peter Norvig, Artificial Intelligence: A Modern Approach, Prentice Hall, 2010, at 2. PAI considers more expansive definitions, that include any automation of analysis and decision making by humans, to be most helpful.
  3. In California, the recently enacted California Bail Reform Act (S.B. 10) mandates the implementation of risk assessment tools while eliminating money bail in the state, though implementation of the law has been put on hold as a result of a 2020 ballot measure funded by the bail bonds industry to repeal it; see https://ballotpedia.org/California_Replace_Cash_Bail_with_Risk_Assessments_Referendum_(2020); Robert Salonga, Law ending cash bail in California halted after referendum qualifies for 2020 ballot, San Jose Mercury News (Jan. 17, 2019), https://www.mercurynews.com/2019/01/17/law-ending-cash-bail-in-california-halted-after-referendum-qualifies-for-2020-ballot/. In addition, a new federal law, the First Step Act of 2018 (S. 3649), requires the Attorney General to review existing risk assessment tools and develop recommendations for “evidence-based recidivism reduction programs” and to “develop and release” a new risk- and needs- assessment system by July 2019 for use in managing the federal prison population. The bill allows the Attorney General to use currently-existing risk and needs assessment tools, as appropriate, in the development of this system.
  4. In addition, many of our civil society partners have taken a clear public stance to this effect, and some go further in suggesting that only individual-level decision-making will be adequate for this application regardless of the robustness and validity of risk assessment instruments. See The Use of Pretrial ‘Risk Assessment’ Instruments: A Shared Statement of Civil Rights Concerns, http://civilrightsdocs.info/pdf/criminal-justice/Pretrial-Risk-Assessment-Full.pdf (shared statement of 115 civil rights and technology policy organizations, arguing that all pretrial detention should follow from evidentiary hearings rather than machine learning determinations, on both procedural and accuracy grounds); see also Comments of Upturn; The Leadership Conference on Civil and Human Rights; The Leadership Conference Education Fund; NYU Law’s Center on Race, Inequality, and the Law; The AI Now Institute; Color Of Change; and Media Mobilizing Project on Proposed California Rules of Court 4.10 and 4.40, https://www.upturn.org/static/files/2018-12-14_Final-Coalition-Comment-on-SB10-Proposed-Rules.pdf (“Finding that the defendant shares characteristics with a collectively higher risk group is the most specific observation that risk assessment instruments can make about any person. Such a finding does not answer, or even address, the question of whether detention is the only way to reasonably assure that person’s reappearance or the preservation of public safety. That question must be asked specifically about the individual whose liberty is at stake — and it must be answered in the affirmative in order for detention to be constitutionally justifiable.”) PAI notes that the requirement for an individualized hearing before detention implicitly includes a need for timeliness. Many jurisdictions across the US have detention limits at 24 or 48 hours without hearings. Aspects of this stance are shared by some risk assessment tool makers; see, Arnold Ventures’ Statement of Principles on Pretrial Justice and Use of Pretrial Risk Assessment, https://craftmediabucket.s3.amazonaws.com/uploads/AV-Statement-of-Principles-on-Pretrial-Justice.pdf.
  5. See Ecological Fallacy section and Baseline D for further discussion of this topic.
  6. Quantitatively, accuracy is usually defined as the fraction of correct answers the model produces among all the answers it gives. So a model that answers correctly in 4 out of 5 cases would have an accuracy of 80%. Interestingly, models which predict rare phenomena (like violent criminality) can be incredibly accurate without being useful for their prediction tasks. For example, if only 1% of individuals will commit a violent crime, a model that predicts that no one will commit a violent crime will have 99% accuracy even though it does not correctly identify any of the cases where someone actually commits a violent crime. For this reason and others, evaluation of machine learning models is a complicated and subtle topic which is the subject of active research. In particular, note that inaccuracy can and should be subdivided into errors of “Type I” (false positive) and “Type II” (false negative) – one of which may be more acceptable than the other, depending on the context.
  7. Calibration is a property of models such that, among the group they predict a 50% risk for, 50% of cases recidivate. Note that this says nothing about the accuracy of the prediction, because a coin toss would be calibrated in that sense. All risk assessment tools should be calibrated, but there are more specific desirable properties, such as calibration within groups (discussed in Requirement 2 below), that not all tools will or should satisfy completely. (An illustrative sketch of accuracy, calibration, and group-level error rates appears after this list.)
  8. Sarah L. Desmarais, Evan M. Lowder, Pretrial Risk Assessment Tools: A Primer for Judges, Prosecutors, and Defense Attorneys, MacArthur Safety and Justice Challenge (Feb 2019). The issue of cross-comparison applies not only to geography but to time. It may be valuable to use comparisons over time to assist in measuring the validity of tools, though such evaluations must be corrected for the fact that crime in the United States is presently a rapidly changing (and still on the whole rapidly declining) phenomenon.
  9. As a technical matter, a model can be biased for subpopulations while being unbiased on average for the population as a whole.
  10. Note here that the phenomenon of societal bias—the existence of beliefs, expectations, institutions, or even self-propagating patterns of behavior that lead to unjust outcomes for some groups—is not always the same as, or reflected in statistical bias, and vice versa. One can instead think of these as an overlapping Venn diagram with a large intersection. Most of the concerns about risk assessment tools are about biases that are simultaneously statistical and societal, though there are some that are about purely societal bias. For instance, if non-uniform access to transportation (which is a societal bias) causes higher rates of failure to appear for court dates in some communities, the problem is a societal bias, but not a statistical one. The inclusion of demographic parity measurements as part of model bias measurement (see Requirement 2) may be a way to measure this, though really the best solutions involve distinct policy responses (for instance, providing transportation assistance for court dates or finding ways to improve transit to underserved communities).
  11. For instance, Eckhouse et al. propose a 3-level taxonomy of biases. Laurel Eckhouse, Kristian Lum, Cynthia Conti-Cook, and Julie Ciccolini, Layers of Bias: A Unified Approach for Understanding Problems with Risk Assessment, Criminal Justice and Behavior, (Nov 2018).
  12. Some of the experts within the Partnership oppose the use of risk assessment tools specifically because of their pessimism that sufficient data exists or could practically be collected to meet purposes (a) and (b).
  13. Moreover, defining recidivism is difficult in the pretrial context. Usually, recidivism variables are defined using a set time period, e.g., whether someone is arrested within 1 year of their initial arrest or whether someone is arrested within 3 years of their release from prison. In the pretrial context, recidivism is defined as whether the individual is arrested during the time after their arrest (or pretrial detention) and before the individual’s trial. That period of time, however, can vary significantly from case to case, so it is necessary to ensure that each risk assessment tool predicts an appropriately defined measure of recidivism or public safety risk.
  14. See, e.g., Report: The War on Marijuana in Black and White, ACLU (2013), https://www.aclu.org/report/report-war-marijuana-black-and-white; ACLU submission to Inter-American Commission on Human Rights, Hearing on Reports of Racism in the Justice System of the United States, https://www.aclu.org/sites/default/files/assets/141027_iachr_racial_disparities_aclu_submission_0.pdf, (Oct 2017); Samuel Gross, Maurice Possley, Klara Stephens, Race and Wrongful Convictions in the United States, National Registry of Exonerations, https://www.law.umich.edu/special/exoneration/Documents/Race_and_Wrongful_Convictions.pdf; but see Jennifer L. Skeem and Christopher Lowenkamp, Risk, Race & Recidivism: Predictive Bias and Disparate Impact, Criminology 54 (2016), 690, https://risk-resilience.berkeley.edu/sites/default/files/journal-articles/files/criminology_proofs_archive.pdf (For some categories of crime in some jurisdictions, victimization and self-reporting surveys imply crime rates are comparable to arrest rates across demographic groups; an explicit and transparent reweighting process is procedurally appropriate even in cases where the correction it results in is small).
  15. See David Robinson and John Logan Koepke, Stuck in a Pattern: Early evidence on ‘predictive policing’ and civil rights, (Aug. 2016). https://www.upturn.org/reports/2016/stuck-in-a-pattern/ (“Criminologists have long emphasized that crime reports, and other statistics gathered by the police, are not an accurate record of the crime that happens in a community. In short, the numbers are greatly influenced by what crimes citizens choose to report, the places police are sent on patrol, and how police decide to respond to the situations they encounter. The National Crime Victimization Survey (conducted by the Department of Justice) found that from 2006-2010, 52 percent of violent crime victimizations went unreported to police and 60 percent of household property crime victimizations went unreported. Historically, the National Crime Victimization Survey ‘has shown that police are not notified of about half of all rapes, robberies and aggravated assaults.’”) See also Kristian Lum and William Isaac, To predict and serve? (2016): 14-19.
  16. Carl B. Klockars, Some Really Cheap Ways of Measuring What Really Matters, in Measuring What Matters: Proceedings From the Policing Research Meetings, 195, 195-201 (1999), https://www.ncjrs.gov/pdffiles1/nij/170610.pdf. [https://perma.cc/BRP3-6Z79] (“If I had to select a single type of crime for which its true level—the level at which it is reported—and the police statistics that record it were virtually identical, it would be bank robbery. Those figures are likely to be identical because banks are geared in all sorts of ways…to aid in the reporting and recording of robberies and the identification of robbers. And, because mostly everyone takes bank robbery seriously, both Federal and local police are highly motivated to record such events.”)
  17. ACLU, The War on Marijuana in Black and White: Billions of Dollars Wasted on Racially Biased Arrests, (2013), available at https://www.aclu.org/files/assets/aclu-thewaronmarijuana-rel2.pdf.
  18. Lisa Stoltenberg & Stewart J. D’Alessio, Sex Differences in the Likelihood of Arrest, J. Crim. Justice 32 (5), 2004, 443-454; Lisa Stoltenberg, David Eitle & Stewart J. D’Alessio, Race and the Probability of Arrest, Social Forces 81(4) 2003 1381-1387; Tia Stevens & Merry Morash, Racial/Ethnic Disparities in Boys’ Probability of Arrest and Court Actions in 1980 and 2000: The Disproportionate Impact of ‘‘Getting Tough’’ on Crime, Youth and Juvenile Justice 13(1), (2014).
  19. Delbert S. Elliott, Lies, Damn Lies, and Arrest Statistics, (1995), http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.182.9427&rep=rep1&type=pdf, 11.
  20. Simply reminding people to appear improves appearance rates. Pretrial Justice Center for Courts, Use of Court Date Reminder Notices to Improve Court Appearance Rates, (Sept. 2017).
  21. There are a number of obstacles that risk assessment toolmakers have identified towards better predictions on this front. Firstly, there is a lack of consistent data and definitions to help disentangle willful flight from justice from failures to appear for reasons that are either unintentional or not indicative of public safety risk. Policymakers may need to take the lead in defining and collecting data on these reasons, as well as identifying interventions besides incarceration that may be most appropriate for responding to them.
  22. This is known in the algorithmic fairness literature as “fairness through unawareness”; see Moritz Hardt, Eric Price, & Nathan Srebro, Equality of Opportunity in Supervised Learning, Proc. NeurIPS 2016, https://arxiv.org/pdf/1610.02413.pdf, first publishing the term and citing earlier literature for proofs of its ineffectiveness, particularly Pedreshi, Ruggieri, & Turini, Discrimination-aware data mining, Knowledge Discovery & Data Mining, Proc. SIGKDD (2008), http://eprints.adm.unipi.it/2192/1/TR-07-19.pdf.gz. In other fields, blindness is the more common term for the idea of achieving fairness by ignoring protected class variables (e.g., “race-blind admissions” or “gender-blind hiring”).
  23. Another way of conceiving omitted variable bias is as follows: data-related biases as discussed in Requirement 1 are problems with the rows in a database or spreadsheet: the rows may contain asymmetrical errors, or not be a representative sample of events as they occur in the world. Omitted variable bias, in contrast, is a problem with not having enough or the right columns in a dataset.
  24. These specific examples are from the Equivant/Northpoint COMPAS risk assessment; see sample questionnaire at https://assets.documentcloud.org/documents/2702103/Sample-Risk-Assessment-COMPAS-CORE.pdf
  25. This list is by no means exhaustive. Another approach involves attempting to de-bias datasets by removing all information regarding the protected class variables. See, e.g., James E. Johndrow & Kristian Lum, An algorithm for removing sensitive information: application to race-independent recidivism prediction, (Mar. 15, 2017), https://arxiv.org/pdf/1703.04957.pdf. Not only would the protected class variable itself be removed but also variation in other variables that is correlated with the protected class variable. This would yield predictions that are independent of the protected class variables, but could have negative implications for accuracy. This method formalizes the notion of fairness known as “demographic parity,” and has the advantage of minimizing disparate impact, such that outcomes should be proportional across demographics. Similar to affirmative action, however, this approach would raise additional fairness questions given different baselines across demographics.
  26. See Moritz Hardt, Eric Price, & Nathan Srebro, Equality of Opportunity in Supervised Learning, Proc. NeurIPS 2016, https://arxiv.org/pdf/1610.02413.pdf.
  27. This is due to different baseline rates of recidivism for different demographic groups in U.S. criminal justice data. See J. Kleinberg, S. Mullainathan, M. Raghavan. Inherent Trade-Offs in the Fair Determination of Risk Scores. Proc. ITCS, (2017), https://arxiv.org/abs/1609.05807 and A. Chouldechova, Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. Proc. FAT/ML 2016, https://arxiv.org/abs/1610.07524. Another caveat is that such a correction can reduce overall utility, as measured as a function of the number of individuals improperly detained or released. See, e.g., Sam Corbett-Davies et al., Algorithmic Decision-Making and the Cost of Fairness, (2017), https://arxiv.org/pdf/1701.08230.pdf.
  28. As long as the training data show higher arrest rates among minorities, statistically accurate scores must of mathematical necessity have a higher false positive rate for minorities. For a paper that outlines how equalizing FPRs (a measure of unfair treatment) requires creating some disparity in predictive accuracy across protected categories, see J. Kleinberg, S. Mullainathan, M. Raghavan. Inherent Trade-Offs in the Fair Determination of Risk Scores. Proc. ITCS, (2017), https://arxiv.org/abs/1609.05807; for arguments about the limitations of FPRs as a sole and sufficient metric, see e.g. Sam Corbett-Davies and Sharad Goel, The Measure and Mismeasure of Fairness: A Critical Review of Fair Machine Learning, working paper, https://arxiv.org/abs/1808.00023.
  29. Geoff Pleiss et al., On Fairness and Calibration (describing the challenges of using this approach when baselines are different), https://arxiv.org/pdf/1709.02012.pdf.
  30. The stance that unequal false positive rates represent material unfairness was popularized in a study by Julia Angwin et al., Machine Bias, ProPublica (2016), https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing, and confirmed in further detail in, e.g., Julia Dressel and Hany Farid, The accuracy, fairness and limits of predicting recidivism, Science Advances, 4(1), (2018), http://advances.sciencemag.org/content/advances/4/1/eaao5580.full.pdf. Whether FPRs are the right measure of fairness is disputed within the statistics literature.
  31. See, e.g., Alexandra Chouldechova, Fair prediction with disparate impact: A study of bias in recidivism prediction instruments, Big Data 5(2), https://www.liebertpub.com/doi/full/10.1089/big.2016.0047, (2017).
  32. See, e.g., Niki Kilbertus et al., Avoiding Discrimination Through Causal Reasoning, (2018), https://arxiv.org/pdf/1706.02744.pdf.
  33. Formally, the toolmaker must distinguish “resolved” and “unresolved” discrimination. Unresolved discrimination results from a direct causal path between the protected class and predictor that is not blocked by a “resolving variable.” A resolving variable is one that is influenced by the protected class variable in a manner that we accept as nondiscriminatory. For example, if women are more likely to apply for graduate school in the humanities and men are more likely to apply for graduate school in STEM fields, and if humanities departments have lower acceptance rates, then women might exhibit lower acceptance rates overall even if conditional on department they have higher acceptance rates. In this case, the department variable can be considered a resolving variable if our main concern is discriminatory admissions practices. See, e.g., Niki Kilbertus et al., Avoiding Discrimination Through Causal Reasoning, (2018), https://arxiv.org/pdf/1706.02744.pdf.
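The admissions example in this note is an instance of Simpson's paradox, and a few lines of arithmetic with invented numbers make it concrete: women are accepted at a higher rate within each department, yet at a lower rate overall.

```python
# Invented numbers illustrating the "resolving variable" example: women have
# higher acceptance rates within each department, yet a lower rate overall,
# because they apply more often to the department with the lower acceptance rate.
applications = {
    #            (applicants, admitted)
    ("women", "humanities"): (800, 240),   # 30% within department
    ("women", "stem"):       (200,  90),   # 45%
    ("men",   "humanities"): (200,  50),   # 25%
    ("men",   "stem"):       (800, 320),   # 40%
}

for gender in ("women", "men"):
    total_apps = sum(a for (g, _), (a, _) in applications.items() if g == gender)
    total_adm  = sum(m for (g, _), (_, m) in applications.items() if g == gender)
    print(f"{gender}: overall acceptance rate = {total_adm / total_apps:.2%}")
```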
  34. In addition to the trade-offs highlighted in this section, it should be noted that these methods require a precise taxonomy of protected classes. Although it is common in the United States to use simple taxonomies defined by the Office of Management and Budget (OMB) and the US Census Bureau, such taxonomies cannot capture the complex reality of race and ethnicity. See Revisions to the Standards for the Classification of Federal Data on Race and Ethnicity, 62 Fed. Reg. 210 (Oct 1997), https://www.govinfo.gov/content/pkg/FR-1997-10-30/pdf/97-28653.pdf. Nonetheless, algorithms for bias correction have been proposed that detect groups of decision subjects with similar circumstances automatically. For an example of such an algorithm, see Tatsunori Hashimoto et al., Fairness Without Demographics in Repeated Loss Minimization, Proc. ICML 2018, http://proceedings.mlr.press/v80/hashimoto18a/hashimoto18a.pdf. Algorithms have also been developed to detect groups of people that are spatially or socially segregated. See, e.g., Sebastian Benthall & Bruce D. Haynes, Racial categories in machine learning, Proc. FAT* 2019, https://dl.acm.org/authorize.cfm?key=N675470. Further experimentation with these methods is warranted. For one evaluation, see Jon Kleinberg, An Impossibility Theorem for Clustering, Advances in Neural Information Processing Systems 15, NeurIPS 2002.
  35. How best to do this deserves further human-computer interaction research. For instance, if judges are shown multiple predictions labeled “zero disparate impact for those who will not reoffend,” “most accurate prediction,” “demographic parity,” etc., will they understand and respond appropriately? If not, decisions about which bias corrections to use might be better made by policymakers or technical government experts evaluating these tools.
  36. Cost-benefit models require explicit trade-off choices to be made between different objectives, including liberty, safety, and fair treatment of different categories of defendants. These choices must be made transparently and accountably by policymakers. For a macroscopic example of such a calculation, see David Roodman, The Impacts of Incarceration on Crime, Open Philanthropy Project report, September 2017, p. 131, at https://www.openphilanthropy.org/files/Focus_Areas/Criminal_Justice_Reform/The_impacts_of_incarceration_on_crime_10.pdf.
  37. Sandra G. Mayson, Dangerous Defendants, 127 Yale L.J. 490, 509-510 (2018).
  38. Id., at 510. (“The two risks are different in kind, are best predicted by different variables, and are most effectively managed in different ways.”)
  39. For instance, needing childcare increases the risk of failure to appear (see Brian H. Bornstein, Alan J. Tomkins & Elizabeth N. Neely, Reducing Courts’ Failure to Appear Rate: A Procedural Justice Approach, U.S. DOJ report 234370, available at https://www.ncjrs.gov/pdffiles1/nij/grants/234370.pdf) but is less likely to increase the risk of recidivism.
  40. For example, if the goal of a risk assessment tool is to advance the twin public policy goals of reducing incarceration and ensuring defendants appear for their court dates, then the tool should not conflate a defendant’s risk of knowingly fleeing justice with their risk of unintentionally failing to appear, since the latter can be mitigated by interventions besides incarceration (e.g. giving the defendant the opportunity to sign up for phone calls or SMS-based reminders about their court date, or ensuring the defendant has transportation to court on the day they are to appear).
  41. Notably, part of the holding in Loomis mandated a disclosure in any Presentence Investigation Report that COMPAS risk assessment information “was not developed for use at sentencing, but was intended for use by the Department of Corrections in making determinations regarding treatment, supervision, and parole,” Wisconsin v. Loomis (881 N.W.2d 749).
  42. M.L. Cummings, Automation Bias in Intelligent Time Critical Decision Support Systems, http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.91.2634&rep=rep1&type=pdf.
  43. It is important to note, however, that there is also evidence of the opposite phenomenon, whereby users might simply ignore the risk assessment tools’ predictions. In Christin’s ethnography of risk assessment users, she notes that professionals often “buffer” their professional judgment from the influence of automated tools. She quotes a former prosecutor as saying of risk assessment, “When I was a prosecutor I didn’t put much stock in it, I’d prefer to look at actual behaviors. I just didn’t know how these tests were administered, in which circumstances, with what kind of data.” From Christin, A., 2017, Algorithms in practice: Comparing web journalism and criminal justice, Big Data & Society, 4(2).
  44. See Wisconsin v. Loomis (881 N.W.2d 749).
  45. “Specifically, any PSI containing a COMPAS risk assessment must inform the sentencing court about the following cautions regarding a COMPAS risk assessment’s accuracy: (1) the proprietary nature of COMPAS has been invoked to prevent disclosure of information relating to how factors are weighed or how risk scores are to be determined; (2) risk assessment compares defendants to a national sample, but no cross-validation study for a Wisconsin population has yet been completed; (3) some studies of COMPAS risk assessment scores have raised questions about whether they disproportionately classify minority offenders as having a higher risk of recidivism; and (4) risk assessment tools must be constantly monitored and re-normed for accuracy due to changing populations and subpopulations.” Wisconsin v. Loomis (881 N.W.2d 749).
  46. Computer interfaces, even for simple tasks, can be highly confusing to users. For example, one study found that users failed to notice anomalies on a screen designed to show them choices they had previously selected for confirmation over 50% of the time, even after carefully redesigning the confirmation screen to maximize the visibility of anomalies. See Campbell, B. A., & Byrne, M. D. (2009). Now do voters notice review screen anomalies? A look at voting system usability, Proceedings of the 2009 Electronic Voting Technology Workshop/Workshop on Trustworthy Elections (EVT/WOTE ’09).
  47. This point depends on the number of input variables used for prediction. With a model that has a large number of features (such as COMPAS), it might be appropriate to use a method like gradient-boosted decision trees or random forests, and then provide the interpretation using an approximation. See Zach Lipton, The Mythos of Model Interpretability, Proc. ICML 2016, available at https://arxiv.org/pdf/1606.03490.pdf, §4.1. For examples of methods for providing explanations of complex models, see, e.g., Gilles Louppe et al., Understanding the variable importances in forests of randomized trees, Proc. NIPS 2013, available at https://papers.nips.cc/paper/4928-understanding-variable-importances-in-forests-of-randomized-trees.pdf; Marco Ribeiro et al., LIME – Local Interpretable Model-Agnostic Explanations.
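One way to provide the kind of approximate explanation this note describes is a global surrogate: fit a shallow, human-readable model to the predictions of the complex one. The sketch below is illustrative only (synthetic data, and a surrogate-tree technique rather than the LIME method cited); it trains a depth-3 decision tree to mimic a gradient-boosted classifier and reports how faithfully it does so.

```python
# Sketch of a global surrogate explanation: approximate a complex model with a
# shallow, human-readable decision tree (illustrative only).
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=5000, n_features=8, random_state=0)

complex_model = GradientBoostingClassifier(random_state=0).fit(X, y)

# Train the surrogate on the complex model's *predictions*, not the true labels.
surrogate = DecisionTreeClassifier(max_depth=3, random_state=0)
surrogate.fit(X, complex_model.predict(X))

fidelity = (surrogate.predict(X) == complex_model.predict(X)).mean()
print(f"surrogate agrees with the complex model on {fidelity:.1%} of cases")
print(export_text(surrogate))
```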
  48. Laurel Eckhouse et al., Layers of Bias: A Unified Approach for Understanding Problems With Risk Assessment, 46(2) Criminal Justice and Behavior 185–209 (2018), https://doi.org/10.1177/0093854818811379
  49. See id.
  50. See id.
  51. The lowest risk category for the Colorado Pretrial Assessment Tool (CPAT) included scores 0-17, while the highest risk category included a much broader range of scores: 51-82. In addition, the highest risk category corresponded to a Public Safety Rate of 58% and a Court Appearance Rate of 51%. Pretrial Justice Institute, (2013). Colorado Pretrial Assessment Tool (CPAT): Administration, scoring, and reporting manual, Version 1. Pretrial Justice Institute. Retrieved from http://capscolorado.org/yahoo_site_admin/assets/docs/CPAT_Manual_v1_-_PJI_2013.279135658.pdf
  52. User and usability studies such as those from the human-computer interaction field can be employed to study the question of how much deference judges give to pretrial or pre-sentencing investigations. For example, a study could examine how error bands affect judges’ inclination to follow predictions or (when they have other instincts) overrule them.
  53. As noted in Requirement 4, these mappings of probabilities to scores or risk categories are not necessarily intuitive, i.e. they are often not linear or might differ for different groups.
  54. In a simple machine learning prediction model, the tool might simply produce an output like “35% chance of recidivism.” A bootstrapped tool uses many resampled versions of the training datasets to make different predictions, allowing an output like, “It is 80% likely that this individual’s chance of recidivating is in the 20% – 50% range.” Of course these error bars are still relative to the training data, including any sampling or omitted variable biases it may reflect.
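A minimal version of such a bootstrapped tool, assuming synthetic data and a generic scikit-learn classifier in place of any real instrument, might look like the following: each resampled training set yields one estimate of the individual's probability, and the spread of those estimates forms the error band.

```python
# Sketch of bootstrapped error bands around an individual risk estimate.
# Synthetic data and a generic classifier stand in for a real tool.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
n = 5_000
X = rng.normal(size=(n, 5))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n) > 0.5).astype(int)
individual = rng.normal(size=(1, 5))          # the person being scored

estimates = []
for _ in range(200):                          # 200 bootstrap resamples
    idx = rng.integers(0, n, n)               # sample training rows with replacement
    model = LogisticRegression().fit(X[idx], y[idx])
    estimates.append(model.predict_proba(individual)[0, 1])

low, high = np.percentile(estimates, [10, 90])
print(f"point estimate ~{np.mean(estimates):.0%}; "
      f"80% of bootstrap estimates fall in [{low:.0%}, {high:.0%}]")
```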
  55. The specific definition of fairness would depend on the fairness correction used.
  56. Humans are not naturally good at understanding probabilities or confidence estimates, though some training materials and games exist that can teach these skills; see, e.g., https://acritch.com/credence-game/
  57. To inform this future research, DeMichele et al.’s study, which conducted interviews with judges using the PSA tool, can provide useful context for how judges understand and interpret these tools. DeMichele, Matthew, Megan Comfort, Shilpi Misra, Kelle Barrick, and Peter Baumgartner, The Intuitive-Override Model: Nudging Judges Toward Pretrial Risk Assessment Instruments (April 25, 2018), available at SSRN: https://ssrn.com/abstract=3168500 or http://dx.doi.org/10.2139/ssrn.3168500
  58. See the University of Washington’s Tech Policy Lab’s Diverse Voices methodology for a structured approach to inclusive requirements gathering. Magassa, Lassana, Meg Young, and Batya Friedman, Diverse Voices, (2017), http://techpolicylab.org/diversevoicesguide/.
  59. Such disclosures support public trust by revealing the existence and scope of a system, and by enabling challenges to the system’s role in government. See Pasquale, Frank. The black box society: The secret algorithms that control money and information. Harvard University Press, (2015). Certain legal requirements on government use of computers demand such disclosures. At the federal level, the Privacy Act of 1974 requires agencies to publish notices of the existence of any “system of records” and provides individuals access to their records. Similar data protection rules exist in many states and in Europe under the General Data Protection Regulation (GDPR).
  60. Reisman, Dillon, Jason Schultz, Kate Crawford, Meredith Whittaker, Algorithmic Impact Assessments: A Practical Framework for Public Agency Accountability, AI Now Institute, (2018).
  61. See Cal. Penal Code §§ 1320.24(e)(7), 1320.25(a), effective Oct. 2020.
  62. First Step Act, H.R.5682 — 115th Congress (2017-2018).
  63. For further discussion on the social justice concerns related to using trade secret law to prevent the disclosure of the data and algorithms behind risk assessment tools, see Taylor R. Moore, Trade Secrets and Algorithms as Barriers to Social Justice, Center for Democracy and Technology (August 2017), https://cdt.org/files/2017/08/2017-07-31-Trade-Secret-Algorithms-as-Barriers-to-Social-Justice.pdf.
  64. Several countries already publish the details of their risk assessment models. See, e.g., Tollenaar, Nikolaj, et al. StatRec-Performance, validation and preservability of a static risk prediction instrument, Bulletin of Sociological Methodology/Bulletin de Méthodologie Sociologique 129.1 (2016): 25-44 (in relation to the Netherlands); A Compendium of Research and Analysis on the Offender Assessment System (OASys) (Robin Moore ed., Ministry of Justice Analytical Series, 2015) (in relation to the United Kingdom). Recent legislation also attempts to mandate transparency safeguards, see Idaho Legislature, House Bill No.118 (2019).
  65. See, e.g., Jeff Larson et al. How We Analyzed the COMPAS Recidivism Algorithm, ProPublica (May 23, 2016), https://www.propublica.org/article/how-we-analyzed-the-compas-recidivism-algorithm. For a sample of the research that became possible as a result of ProPublica’s data, see https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=propublica+fairness+broward. Data provided by Kentucky’s Administrative Office of the Courts has also enabled scholars to examine the impact of the implementation of the PSA tool in that state. Stevenson, Megan, Assessing Risk Assessment in Action (June 14, 2018), Minn. L. Rev. 103 (forthcoming), available at https://ssrn.com/abstract=3016088
  66. For an example of how a data analysis competition dealt with privacy concerns when releasing a dataset with highly sensitive information about individuals, see Ian Lundberg et al., Privacy, ethics, and data access: A case study of the Fragile Families Challenge (Sept. 1, 2018), https://arxiv.org/pdf/1809.00103.pdf.
  67. See Arvind Narayanan et al., A Precautionary Approach to Big Data Privacy (Mar. 19, 2015), http://randomwalker.info/publications/precautionary.pdf.
  68. See id. at p. 20 and 21 (describing how some sensitive datasets are only shared after the recipient completes a data use course, provides information about the recipient, and physically signs a data use agreement).
  69. For a discussion of the due process concerns that arise when information is withheld in the context of automated decision-making, see Danielle Keats Citron, Technological Due Process, 85 Wash. U. L. Rev. 1249 (2007), https://ssrn.com/abstract=1012360. See also, Paul Schwartz, Data Processing and Government Administration: The Failure of the American Legal Response to the Computer, 43 Hastings L. J. 1321 (1992).
  70. Additionally, the ability to reconstitute decisions evidences procedural regularity in critical decision processes and allows individuals to trust the integrity of automated systems even when they remain partially non-disclosed. See Joshua A. Kroll et al., Accountable algorithms, 165 U. Pa. L. Rev. 633 (2016).
  71. The ability to contest scores is not only important for defendants’ rights to adversarially challenge adverse information, but also for the ability of judges and other professionals to engage with the validity of the risk assessment outputs and develop trust in the technology. See Daniel Kluttz et al., Contestability and Professionals: From Explanations to Engagement with Algorithmic Systems (January 2019), https://dx.doi.org/10.2139/ssrn.3311894
  72. “Criteria tinkering” occurs when court clerks manipulate input values to obtain the score they think is correct for a particular defendant. See Hannah-Moffat, Kelly, Paula Maurutto, and Sarah Turnbull, Negotiated risk: Actuarial illusions and discretion in probation, 24.3 Canada J. of L. & Society/La Revue Canadienne Droit et Société 391 (2009). See also Angele Christin, Comparing Web Journalism and Criminal Justice, 4.2 Big Data & Society 1.
  73. For further guidance on how such audits and evaluations might be structured, see, AI Now Institute, Algorithmic Impact Assessments: A Practical Framework for Public Agency Accountability, https://ainowinstitute.org/aiareport2018.pdf; Christian Sandvig et al., Auditing algorithms: Research methods for detecting discrimination on internet platform (2014).
  74. See John Logan Koepke and David G. Robinson, Danger Ahead: Risk Assessment and the Future of Bail Reform, 93 Wash. L. Rev. 1725 (2018).
  75. For a discussion, see Latanya Sweeney & Ji Su Yoo, De-anonymizing South Korean Resident Registration Numbers Shared in Prescription Data, Technology Science (Sept. 29, 2015), https://techscience.org/a/2015092901. Techniques also exist that provide formal guarantees limiting the risk of re-identification; see the literature on methods for provable privacy, notably differential privacy. A good introduction is Kobbi Nissim, Thomas Steinke, Alexandra Wood, Mark Bun, Marco Gaboardi, David R. O’Brien, and Salil Vadhan, Differential Privacy: A Primer for a Non-technical Audience, http://privacytools.seas.harvard.edu/files/privacytools/files/pedagogical-document-dp_0.pdf.
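As a flavor of the provable-privacy methods this note points to, the textbook Laplace mechanism releases an aggregate count with calibrated noise so that any one person's inclusion changes the output distribution only within a chosen privacy budget epsilon; the snippet below is a pedagogical sketch with made-up records and an arbitrary epsilon, not production privacy code.

```python
# Textbook sketch of the Laplace mechanism for differential privacy: release a
# noisy count whose distribution changes little whether or not any single
# individual is in the dataset (epsilon is the privacy budget).
import numpy as np

def dp_count(records, predicate, epsilon, rng):
    true_count = sum(1 for r in records if predicate(r))
    sensitivity = 1  # adding or removing one person changes a count by at most 1
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_count + noise

rng = np.random.default_rng(7)
records = [{"rearrested": bool(rng.integers(0, 2))} for _ in range(1000)]
noisy = dp_count(records, lambda r: r["rearrested"], epsilon=0.5, rng=rng)
print(f"differentially private count of re-arrests: {noisy:.1f}")
```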
  76. Brandon Buskey and Andrea Woods, Making Sense of Pretrial Risk Assessments, National Association of Criminal Defense Lawyers, (June 2018), https://www.nacdl.org/PretrialRiskAssessment. Human Rights Watch proposes a clear alternative: “The best way to reduce pretrial incarceration is to respect the presumption of innocence and stop jailing people who have not been convicted of a crime absent concrete evidence that they pose a serious and specific threat to others if they are released. Human Rights Watch recommends having strict rules requiring police to issue citations with orders to appear in court to people accused of misdemeanor and low-level, non-violent felonies, instead of arresting and jailing them. For people accused of more serious crimes, Human Rights Watch recommends that the release, detain, or bail decision be made following an adversarial hearing, with right to counsel, rules of evidence, an opportunity for both sides to present mitigating and aggravating evidence, a requirement that the prosecutor show sufficient evidence that the accused actually committed the crime, and high standards for showing specific, known danger if the accused is released, as opposed to relying on a statistical likelihood.” Human Rights Watch, Q & A: Profile Based Risk Assessment for US Pretrial Incarceration, Release Decisions, (June 1, 2018), https://www.hrw.org/news/2018/06/01/q-profile-based-risk-assessment-us-pretrial-incarceration-release-decisions.