Evaluation

After a model is trained, it needs to be tested. The goal of documentation in this section is to help users of the model to understand how the model was checked for accuracy, bias, and other important properties.

The specific metrics that the model will be tested on depend on the particular use case, so it is helpful for this documentation to include examples of which metrics apply to which use case. Additionally, there will be contexts and use cases that are inappropriate for each model, so these limitations should be noted to avoid invalidating any fairness assertions. [Selbst, Andrew D., Boyd, Danah, Friedler, Sorelle, Venkatasubramanian, Suresh, and Vertesi, Janet (2018). “Fairness and Abstraction in Sociotechnical Systems”, ACM Conference on Fairness, Accountability, and Transparency (FAT*). https://ssrn.com/abstract=3265913] Testing models should also take into account the broader system the model is embedded into, so it is valuable to test the downstream effects of the model. These anticipated downstream effects should be documented as a baseline to evaluate against on an ongoing basis.

Breaking down accuracy and other metrics by specific subpopulations can be useful for detecting possible biases and ethical issues. For example, if a model has much lower predictive accuracy or a much higher false positive rate for one subpopulation, it might be problematic to use the model to make decisions for that subpopulation. The documentation should thus reflect those limitations clearly. Such disclosures would help ensure that the model is not later used to make unfair decisions for that subpopulation, and could also give the developer some protection against liability for third-party misuse, since the limitations of the model have been clearly stated. Part of the goal of this section is to convey to the user that whether AI works is not a binary question but a nuanced one that depends on the use case and the metrics of success.
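
As a minimal sketch of the breakdown described above, the following Python computes accuracy and false positive rate per subpopulation for a binary classifier. The function name, the 0/1 label encoding, and the toy data are assumptions made for illustration; they do not come from any particular model or toolkit.

```python
from collections import defaultdict


def metrics_by_group(y_true, y_pred, groups):
    """Return accuracy and false positive rate for each subgroup.

    y_true, y_pred: sequences of 0/1 labels (an assumed encoding).
    groups: sequence of subgroup names, one per example.
    """
    counts = defaultdict(lambda: {"n": 0, "correct": 0, "fp": 0, "neg": 0})
    for truth, pred, group in zip(y_true, y_pred, groups):
        c = counts[group]
        c["n"] += 1
        c["correct"] += int(truth == pred)
        if truth == 0:
            c["neg"] += 1
            c["fp"] += int(pred == 1)

    report = {}
    for group, c in counts.items():
        report[group] = {
            "n": c["n"],
            "accuracy": c["correct"] / c["n"],
            # The false positive rate is undefined without true negatives.
            "false_positive_rate": c["fp"] / c["neg"] if c["neg"] else None,
        }
    return report


# Toy data, invented for illustration only.
y_true = [0, 0, 1, 1, 0, 0, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1, 1, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
report = metrics_by_group(y_true, y_pred, groups)
for name, row in sorted(report.items()):
    print(name, row)
```

Reporting the group size alongside each metric matters in practice: a large gap measured on a handful of examples is weaker evidence than the same gap measured on thousands.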

The documentation should also discuss how the model developers checked for overfitting. For example, how did they construct the test set? Did they draw new test sets each time they trained the model for cross-validation purposes? It is important for the documentation to be very specific and to explore potential shortcomings of the model and the test set. While the test set is meant to be a reflection of reality, data is never perfect. Failing to account for known imperfections creates a risk that this documentation could lead to overconfidence and misuse of the model.
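
One way to make the test-set questions above concrete is to document the splitting procedure itself. The sketch below assumes k-fold cross-validation, which is only one of several possible schemes and is not prescribed by this section; the function name, the seed, and the fold count are hypothetical choices.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs; each sample is held out exactly once."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # A fixed seed makes the split reproducible, so it can be documented.
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


splits = list(k_fold_indices(n_samples=10, k=5))
for train_idx, test_idx in splits:
    print(sorted(test_idx), "held out;", len(train_idx), "used for training")
```

Recording the seed, the number of folds, and any grouping or stratification rule lets a reader check that no example appeared in both the training and test portions of a split.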


Disclosing the details of evaluation puts the system at risk of being more easily manipulated by malicious actors, presenting a security risk. Indeed, the issue of Goodhart’s Law, as discussed previously, can be a concern if individuals deliberately change their behavior to try to change the model’s outcomes. That said, some activist groups engage in hacking to expose problematic aspects of an ML system with the goal of protecting vulnerable groups, so the increased potential for hacking can either be a positive or negative characteristic depending on one’s perspective. Explorations related to the following research questions could uncover insights into barriers to implementation along with mitigation strategies to overcome those barriers.

Sample Documentation Questions

  • Which datasets was the service tested on? (Arnold et al. 2018)
  • Describe the testing methodology. (Arnold et al. 2018)
  • Describe the test results. (Arnold et al. 2018)
  • Are you aware of possible examples of bias, ethical issues, or other safety risks as a result of using the service? (Arnold et al. 2018)
  • Are the service outputs explainable and/or interpretable? (Arnold et al. 2018)
  • Metrics. Metrics should be chosen to reflect potential real world impacts of the model. (Mitchell et al. 2018)
    • Model performance measures
    • Decision thresholds
    • Variation approaches
  • Evaluation data. Details on the dataset(s) used for the quantitative analyses in the card. (Mitchell et al. 2018)
    • Datasets
    • Motivation
    • Preprocessing

Promising Interventions

Those attempting documentation practices within any phase of the machine learning lifecycle can consider lessons learned from flawed ML systems currently in use by paying particular attention to:

  1. Instantiating more rigorous tests that include comparing system decisions with those a human would arrive at
  2. Putting in place a mitigation strategy for risks in the realm of possibility for homegrown systems

Model Integration

Integrating a model is the act of putting it into production (deployment) and instantiating the monitoring process.

In a 2019 Towards Data Science blog post, Opeyemi notes that the deployment of an ML model “simply means the integration of the model into an existing production environment which can take in an input and return an output that can be used in making practical business decisions.”

Even if each portion of the model is thoroughly tested, validated, and documented, there are additional documentation and evaluation needs that arise when connecting a model into a broader ML system. Validating how the models interoperate is important because different models might not work well together. It is important to test for how errors from all the models interact to find the corner cases that could pose problems in production. Latency changes when connecting models into a system, and that changes usability and reliability for users as well. The logistics of how the model is run (e.g., on the cloud vs. on local machines) is another important factor to document because it changes how to handle client data in the pipeline, and clients in sensitive industries especially will be interested in the details of how that is handled. The system-level documentation should also include information about system logs, pre- and post-processing steps, and a summary of all the surrounding product and software design choices in the system which are not ML/AI related, but do impact how the ML/AI models operate in the system. Finally, there should be careful commentary on the system’s vulnerability to adversarial inputs and whether there are system-level mitigations that have been put into place. This information is obviously sensitive and should be carefully considered before release to minimize the risk of malicious use of this information.


System-level information can be especially useful for consumers, helping them determine how the system can be used in their context. It can also help regulators examine otherwise black-box systems for compliance with privacy and data storage regulations. Of course, this can be seen as a liability for a company, but it can equally be a force for increasing compliance with regulations. An internal process that supports accurate documentation of the entire ML system will also mitigate the risk of inaccurate documentation creating legal liability. For example, if one team writes that the system does not log data, but a component owned by a different team upstream of them does, it would be inaccurate to publish that documentation. Fortunately, adhering to a rigorous internal auditing system before publishing documentation can catch this type of inaccuracy, and will likely have the positive externalities of creating more cohesion between teams and preventing other issues that arise when functions become too siloed inside an organization. Explorations related to the following research questions could uncover insights into barriers to implementation along with mitigation strategies to overcome those barriers.

Sample Documentation Questions

  • What is the expected performance on unseen data or data with different distributions? (Arnold et al. 2018)
  • Was the service checked for robustness against adversarial attacks? (Arnold et al. 2018)
  • Quantitative analyses (Mitchell et al. 2018)
    • Unitary results
    • Intersectional results
  • Ethical considerations (Mitchell et al. 2018)

Maintenance

As data, techniques, and real-world needs change, most models must be updated to ensure continued usefulness. There are many parameters of a model update which would be useful to document, including how and why an update is triggered, whether old models or parameters can still be accessed, who owns maintenance and updating, and guidance on reasonable shelf life of the model, including performance expectations for old and new versions. The first question to answer is why the model is being updated: for underlying data changes, process changes, or to improve model performance? Most of the answers to the questions above flow from this first one.

Updates can be triggered automatically based on time or specific metrics, or be active decisions evaluated from time to time. It is important to be clear about this so that people using the model can plan their workflows accordingly.

When models are updated, it can be useful to maintain access to previous versions or parameters.

Along with information on who owns the maintenance and update process, it is useful to know what the fallback plan is in case of personnel turnover or organizational changes. Having this documentation well-known makes it easier to make sure that model maintenance does not fall through the cracks during transitions.

The shelf life of a model, similar to the update schedule itself, can be dependent on time only or a combination of other metrics or factors. For example, it could be that when the underlying data distribution looks X% different, then the model should be evaluated on Y criteria for whether it is still valid.
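
The "X% different" rule above can be sketched as a small drift check. This example assumes a single categorical feature and uses total variation distance as the measure of difference; the choice of distance, the threshold values, and the toy samples are all illustrative assumptions rather than recommendations from this section.

```python
from collections import Counter


def total_variation_distance(reference, current):
    """Half the sum of absolute differences in category frequencies (0 to 1)."""
    if not reference or not current:
        raise ValueError("both samples must be non-empty")
    ref_counts, cur_counts = Counter(reference), Counter(current)
    categories = set(ref_counts) | set(cur_counts)
    return 0.5 * sum(
        abs(ref_counts[c] / len(reference) - cur_counts[c] / len(current))
        for c in categories
    )


def needs_revalidation(reference, current, threshold):
    """True when drift exceeds the documented threshold (the 'X%' in the text)."""
    return total_variation_distance(reference, current) > threshold


# Toy samples, invented for illustration only.
reference = ["low"] * 50 + ["high"] * 50   # distribution at training time
current = ["low"] * 80 + ["high"] * 20     # distribution seen in production
drift = total_variation_distance(reference, current)
print(round(drift, 3), needs_revalidation(reference, current, threshold=0.1))
```

Whatever measure is used, the documentation would need to state both the threshold and the "Y criteria" that the model is re-evaluated against once the threshold is crossed.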


Completing all these steps of documenting an update process can be incredibly valuable both for the people who will use the model and for the people building it. People using the model get more reliability and can plan around its update process. For the people building the model, planning ahead for updates encourages an intentional approach and schedule up front. Knowing that changes will have to be documented later may also encourage more caution when weighing speed of progress against a precautionary principle.

Thoroughly documenting the update process may carry liability or timeline risks if the team misses deadlines or is locked into them. One way to mitigate these risks is to build in criteria or metric gates for updates rather than a hard timeline. For example, the update process might begin when X metric reaches Y level or one year from the publication date, whichever is earlier. Building in sufficient time to do the update is also important because some updates are simple while larger ones can take a long time. Explorations related to the following research questions could uncover insights into barriers to implementation along with mitigation strategies to overcome those barriers.
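
The "whichever is earlier" gate described above can be written down as a short rule. This is a sketch under stated assumptions: the monitored metric is one where lower is worse (such as accuracy), one year is approximated as 365 days, and the function name, parameter names, and example values are all hypothetical.

```python
from datetime import date, timedelta


def update_due(metric_value, metric_floor, published_on, today, max_age_days=365):
    """True when the metric falls to the floor or the model reaches its maximum age.

    Assumes a metric where lower is worse; flip the comparison for error rates.
    """
    metric_triggered = metric_value <= metric_floor
    age_triggered = today - published_on >= timedelta(days=max_age_days)
    return metric_triggered or age_triggered


published = date(2023, 1, 1)
print(update_due(0.91, 0.85, published, date(2023, 6, 1)))  # healthy and recent
print(update_due(0.84, 0.85, published, date(2023, 6, 1)))  # metric gate reached
print(update_due(0.91, 0.85, published, date(2024, 1, 1)))  # age gate reached
```

Because either condition starts the process, the rule never waits longer than the age limit, while still allowing an earlier update if the metric degrades first.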

Sample Documentation Questions

  • Does the service implement and perform any bias detection and remediation? (Arnold et al. 2018)
  • When were the models last updated? (Arnold et al. 2018)

Promising Interventions

Those attempting documentation practices within any phase of the machine learning lifecycle can consider potential misuse by paying particular attention to:

  1. Determining the auditing method
  2. Archiving the old model (versioning for safety and recovery)