Requirement 5: Tools should produce confidence estimates for their predictions

An important component of any statistical prediction is the uncertainty underlying it. For users of risk assessment tools to interpret their results appropriately, it is vital that reports of their predictions include error bars, confidence intervals, or other similar indications of reliability. For example, risk assessment tools often produce a score reflecting a probability of reoffending, or a mapping of those probabilities into levels (like “high,” “medium,” and “low” risk). As noted in Requirement 4, these mappings of probabilities to scores or risk categories are not necessarily intuitive: they are often not linear and may differ across groups. Even a calibrated score alone, however, does not tell the user how confident the model is in its prediction. For example, even if a model is calibrated such that an output like “high risk” corresponds to “a 60% probability of reoffending,” it is unclear whether the tool is confident that the defendant’s probability of reoffending lies between 55% and 65%, with a mean of 60%, or only confident that it lies between 30% and 90%, with a mean of 60%. In the former case, interpreting the defendant as having a 60% probability of reoffending is far more reasonable than in the latter, where there is overwhelming uncertainty around the prediction.
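To make the distinction concrete, here is a toy sketch in Python. It assumes hypothetical case counts and a uniform Beta(1, 1) prior over the underlying reoffense rate (an illustrative modeling choice, not anything a deployed tool necessarily does): two estimates both center near 60%, but the one built on far fewer comparable cases comes with a much wider 90% interval.

```python
from scipy.stats import beta

def rate_interval(k, n, level=0.90):
    """Interval for an underlying reoffense rate after observing k reoffenses
    among n comparable defendants, using a uniform Beta(1, 1) prior."""
    lo, hi = beta.ppf([(1 - level) / 2, (1 + level) / 2], k + 1, n - k + 1)
    return lo, hi

# Hypothetical counts: both estimates have a mean near 60%,
# but one rests on 20 cases and the other on 2,000.
for k, n in [(12, 20), (1200, 2000)]:
    lo, hi = rate_interval(k, n)
    print(f"{k}/{n} reoffended (~{k/n:.0%}): 90% interval {lo:.0%} - {hi:.0%}")
```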

For this reason, risk assessment tools should not be used unless they are able to provide good measures of the certainty of their own predictions, both in general and for the specific individuals on whom they are used. There are many sources of uncertainty in recidivism predictions, and ideally, disclosure of uncertainty in predictions should capture as many of these sources as possible. These include the following:

  • Uncertainty due to sample size and the presence of outliers in datasets. This type of uncertainty can be measured by the use of bootstrapped confidence intervals, which are commonly used by technology companies for assessing the predictive power of models before deployment. In a simple machine learning prediction model, the tool might simply produce an output like “35% chance of recidivism.” A bootstrapped tool uses many resampled versions of the training dataset to make different predictions, allowing an output like, “It is 80% likely that this individual’s chance of recidivating is in the 20% – 50% range.” Of course these error bars are still relative to the training data, including any sampling or omitted variable biases it may reflect. (See the first sketch after this list.)
  • Uncertainty about the most appropriate mitigation for model bias, as discussed in Requirement 2. One possibility would be to evaluate the outcomes of different fairness corrections as expressing upper and lower bounds on possible “fair” predictions; the specific definition of fairness would depend on the fairness correction used. (The second sketch after this list illustrates this bounding approach.)
  • Uncertainty as a result of sampling bias and other fundamental dataset problems, as discussed in Requirement 1. This is a complicated issue to address, but one approach would be to find or collect new, high-quality secondary sources of data with which to estimate the uncertainty due to sampling bias and other problems with training datasets.
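As a minimal sketch of the bootstrapping idea in the first bullet (assuming synthetic data, made-up features, and logistic regression as a stand-in for whatever model a real tool uses), the snippet below refits the model on resampled training sets and summarizes the spread of predictions for a single individual as an interval rather than a point estimate:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical training data: two made-up features (e.g., age and prior
# arrests, standardized) and a binary reoffense label.
X = rng.normal(size=(500, 2))
true_p = 1 / (1 + np.exp(-(0.8 * X[:, 0] - 0.5 * X[:, 1])))
y = rng.binomial(1, true_p)

def bootstrapped_risk_interval(X, y, x_new, n_boot=200, level=0.80):
    """Refit the model on resampled training sets and collect the predicted
    reoffense probability for one individual across the refits."""
    preds = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y), size=len(y))   # resample with replacement
        model = LogisticRegression().fit(X[idx], y[idx])
        preds.append(model.predict_proba(x_new.reshape(1, -1))[0, 1])
    preds = np.array(preds)
    lo, hi = np.quantile(preds, [(1 - level) / 2, (1 + level) / 2])
    return preds.mean(), lo, hi

x_new = np.array([0.3, -1.0])  # the (hypothetical) individual being scored
mean, lo, hi = bootstrapped_risk_interval(X, y, x_new)
print(f"Point estimate: {mean:.0%}")
print(f"It is 80% likely (relative to this training data) that this "
      f"individual's chance of recidivating is in the {lo:.0%} - {hi:.0%} range")
```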
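And as a sketch of the bounding idea in the second bullet, one could run an individual's score through several candidate fairness corrections and report the resulting range; the corrections below are arbitrary illustrative stand-ins with made-up offsets and scales, not established or recommended methods.

```python
def recalibrate_by_group(score, group, offsets):
    """Illustrative correction: shift scores by a per-group calibration offset."""
    return max(0.0, min(1.0, score + offsets[group]))

def rescale_by_group(score, group, scales):
    """Illustrative correction: rescale scores to equalize flagged rates."""
    return max(0.0, min(1.0, score * scales[group]))

def fair_prediction_bounds(score, group):
    """Evaluate the score under several candidate fairness corrections and
    treat the min and max as bounds on possible 'fair' predictions."""
    candidates = [
        score,                                                        # uncorrected
        recalibrate_by_group(score, group, {"A": -0.05, "B": 0.05}),  # hypothetical offsets
        rescale_by_group(score, group, {"A": 0.9, "B": 1.1}),         # hypothetical scales
    ]
    return min(candidates), max(candidates)

lo, hi = fair_prediction_bounds(score=0.60, group="A")
print(f"Depending on the correction applied, the prediction lies "
      f"between {lo:.0%} and {hi:.0%}")
```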

Designing user interfaces that satisfactorily display and convey uncertainty to users is in some respects also an open problem, so the training courses we suggest in Requirement 6 should specifically test and assist users in making judgments under simulations of this uncertainty.