ABOUT ML Process Guide

Because documentation is as much a process as it is a set of artifacts, transparency and documentation need to be an explicit part of the discussion at each step of the workflow. This page provides suggested documentation questions and considerations for each phase of the ML system lifecycle — from design and setup to observation and maintenance — compiled from the ABOUT ML Reference Document and academic literature.

Phase 1

Model system design & setup

Answer these questions

For what purpose was the dataset created? [1] Gebru, T., Morgenstern, J., Vecchione, B., Vaughan, J. W., Wallach, H., Daumé III, H., & Crawford, K. (2018). Datasheets for datasets. https://arxiv.org/abs/1803.09010


Who created this dataset and on behalf of which entity? [1] Gebru et al. 2018


Who funded the creation of the dataset? [1] Gebru et al. 2018


What was included in the dataset and why? [2] Bender, E. M., & Friedman, B. (2018). Data statements for natural language processing: Toward mitigating system bias and enabling better science. Transactions of the Association for Computational Linguistics, 6, 587–604. https://aclweb.org/anthology/papers/Q/Q18/Q18-1041/


Was ethics approval sought/granted by an institutional review board?

Ensure that you are…

Developing real-world examples


Establishing individual context

Answer these questions

Was any preprocessing/cleaning/labeling of the data done? [1] Gebru et al. 2018


Was the “raw” data saved in addition to the preprocessed/cleaned/labeled data (e.g., to support unanticipated future uses)? [1] Gebru et al. 2018


Which data instances were filtered out of the raw dataset and why? What proportion of the “raw” dataset was filtered out during cleaning?
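
One lightweight way to keep these filtering questions answerable is to have the cleaning script itself emit a small report of what was removed and why. The sketch below is illustrative only; the record field (`text`), the two filter rules, and the output file name are assumptions, not ABOUT ML requirements.

```python
import json

def clean_and_document(raw_records, output_path="cleaning_report.json"):
    """Filter raw records and record how much was removed and for what reason."""
    removed = {"empty_text": 0, "duplicate": 0}
    seen = set()
    cleaned = []
    for record in raw_records:
        text = record.get("text", "").strip()
        if not text:
            removed["empty_text"] += 1
            continue
        if text in seen:
            removed["duplicate"] += 1
            continue
        seen.add(text)
        cleaned.append(record)

    report = {
        "raw_count": len(raw_records),
        "cleaned_count": len(cleaned),
        "removed_by_reason": removed,
        "fraction_removed": 1 - len(cleaned) / max(len(raw_records), 1),
    }
    with open(output_path, "w") as f:
        json.dump(report, f, indent=2)
    return cleaned, report
```

Keeping the report next to the cleaned data means the answer to "what proportion was filtered out" stays current whenever the cleaning step is re-run.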


What are the demographic characteristics of the annotators and the annotation guideline developers? [1] Gebru et al. 2018


What labeling guidelines were used? What instructions were given to the labelers?


What data does each instance consist of? [1] Gebru et al. 2018


Is there a label or target associated with each instance? [1] Gebru et al. 2018


Are there recommended data splits (e.g., training, development/validation, testing)? [1] Gebru et al. 2018
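
If recommended splits exist, shipping them as an artifact alongside the dataset makes them easier to document and reuse than describing them in prose alone. A minimal sketch, assuming a fixed random seed, an 80/10/10 ratio, and a `splits.json` output file (all illustrative choices):

```python
import json
import random

def make_splits(num_examples, seed=2024, ratios=(0.8, 0.1, 0.1), path="splits.json"):
    """Create and save reproducible train/dev/test index lists."""
    indices = list(range(num_examples))
    random.Random(seed).shuffle(indices)
    n_train = int(ratios[0] * num_examples)
    n_dev = int(ratios[1] * num_examples)
    splits = {
        "seed": seed,
        "ratios": ratios,
        "train": indices[:n_train],
        "dev": indices[n_train:n_train + n_dev],
        "test": indices[n_train + n_dev:],
    }
    with open(path, "w") as f:
        json.dump(splits, f)
    return splits
```

Recording the seed and ratios in the same file lets others reproduce the recommended splits exactly.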


Are there any errors, sources of noise, or redundancies in the dataset? [1] Gebru et al. 2018


Have you provided detail about the source, author contact information, and version history? [3] Holland, S., Hosny, A., Newman, S., Joseph, J., & Chmielinski, K. (2018). The dataset nutrition label: A framework to drive higher data quality standards. arXiv preprint arXiv:1805.03677. https://arxiv.org/abs/1805.03677


Have you documented the ground truth correlations, i.e., the linear correlations between a chosen variable in the dataset and variables from other datasets considered to be “ground truth”? [3] Holland et al. 2018


What mechanisms or procedures were used to collect the data? What mechanisms or procedures were used to correct the data for sampling error? How were these mechanisms or procedures validated? [1] Gebru et al. 2018


If the dataset is a sample from a larger set, what was the sampling strategy? [1] Gebru et al. 2018


Who was involved in the data collection process, in what kind of working environment, and how were they compensated, if at all? [1] Gebru et al. 2018


Over what timeframe was the data collected? [1] Gebru et al. 2018


Who is the data being collected from? Would you be comfortable if that data was being collected from you and used for the intended purpose?


Does this data collection undermine individual autonomy or self-determination in any way?


How is the data being retained? How long will it be kept?


Have we considered, explored, and specified how both genre and topic influence the vocabulary and structural characteristics of texts? [4] Biber, D. (1995). Dimensions of register variation: A cross-linguistic comparison. Cambridge, UK: Cambridge University Press.


Have we thought through the nature of the data sources and how that may affect whether the data is a suitable representation of the world? [2] Bender and Friedman 2018

Answer these questions

Is there a repository that links to any or all papers or systems that use the dataset? [1] Gebru et al. 2018


Are there tasks for which the dataset should not be used? [1] Gebru et al. 2018


Was consent obtained from data subjects, and does that consent place limits on the use of the data?

Answer these questions

What is the intended use of the service (model) output? [5] Arnold, M., Bellamy, R. K. E., Hind, M., Houde, S., Mehta, S., Mojsilovic, A., Nair, R., Ramamurthy, K. N., Reimer, D., Olteanu, A., Piorkowski, D., Tsay, J., & Varshney, K. R. (2018). FactSheets: Increasing Trust in AI Services through Supplier’s Declarations of Conformity. arXiv preprint arXiv:1808.07261. https://arxiv.org/abs/1808.07261

  • Primary intended uses
  • Primary intended users
  • Out-of-scope and under-represented use cases

What algorithms or techniques does this service implement? [5] Arnold et al. 2018


What are the basic model details? [6] Mitchell, M., Wu, S., Zaldivar, A., Barnes, P., Vasserman, L., Hutchinson, B., Spitzer, E., Raji, I. D., & Gebru, T. (2019, January). Model cards for model reporting. In Proceedings of the Conference on Fairness, Accountability, and Transparency (pp. 220-229). ACM. https://arxiv.org/abs/1810.03993

  • Person or organization developing model and contact information
  • Model date
  • Model version
  • Model type
  • Information about training algorithms, parameters, fairness constraints or other applied approaches, and features
  • Paper or other resource for more information
  • Citation details
  • License
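
These details are easiest to keep current when they live in a machine-readable file next to the model artifact rather than only in a report. A minimal sketch of recording the fields listed above; every value here is a placeholder, not a recommendation:

```python
import json

# Placeholder values covering the basic model details from Mitchell et al. (2019).
model_card = {
    "developers": "Example Org, ml-team@example.org",
    "model_date": "2024-01-15",
    "model_version": "1.2.0",
    "model_type": "gradient-boosted decision trees",
    "training_details": {
        "algorithm": "gradient boosting",
        "parameters": {"max_depth": 6, "n_estimators": 300},
        "fairness_constraints": "none applied",
        "features": ["age_bucket", "region", "account_tenure_days"],
    },
    "more_information": "https://example.org/models/churn",
    "citation": "Example Org (2024). Churn model, version 1.2.0.",
    "license": "internal use only",
}

with open("model_card.json", "w") as f:
    json.dump(model_card, f, indent=2)
```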

Phase 2

Model Development

Answer these questions

What training data is used?
This may not be possible to provide in practice. When possible, this section should mirror the evaluation data section. If such detail is not possible, minimal allowable information should be provided here, such as details of the distribution over various factors in the training datasets. [6] Mitchell et al. 2019
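
When the training data itself cannot be shared, the distribution over documented factors often can be. A minimal sketch, assuming each training record is a dict and using a hypothetical `language` field as the factor:

```python
from collections import Counter

def factor_distribution(records, factor="language"):
    """Return the share of training examples per value of a documented factor."""
    counts = Counter(record.get(factor, "unknown") for record in records)
    total = sum(counts.values()) or 1
    return {value: count / total for value, count in counts.items()}

# Usage (hypothetical records):
# factor_distribution(training_records, factor="language")
```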


What type of algorithm is used to train the model?


What are the details of the algorithm’s architecture (e.g., a ResNet neural network)? Include a diagram if possible.

Answer these questions

Which datasets was the service tested on? [5] Arnold et al. 2018


Have you clearly described the testing methodology? [5] Arnold et al. 2018


Have you clearly documented the test results? [5] Arnold et al. 2018


Are you aware of possible examples of bias, ethical issues, or other safety risks as a result of using the service? [5] Arnold et al. 2018


Are the service outputs explainable and/or interpretable? [5] Arnold et al. 2018


Have you chosen metrics that reflect potential real-world impacts of the model? Be sure to consider model performance measures, decision thresholds, and variation approaches (see the sketch after this list). [6] Mitchell et al. 2019

  • Model performance measures
  • Decision thresholds
  • Variation approaches
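
As referenced above, here is a minimal sketch of reporting a performance measure at an explicit decision threshold, with a bootstrap confidence interval as the variation approach; the metric (accuracy), threshold, and bootstrap settings are illustrative choices, not prescribed by the model cards framework.

```python
import random
from statistics import mean

def accuracy_at_threshold(scores, labels, threshold=0.5):
    """Accuracy of thresholded scores against binary labels."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    return mean(1.0 if p == y else 0.0 for p, y in zip(predictions, labels))

def bootstrap_ci(scores, labels, threshold=0.5, n_boot=1000, seed=0):
    """Approximate 95% bootstrap confidence interval for accuracy at a threshold."""
    rng = random.Random(seed)
    n = len(scores)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(accuracy_at_threshold(
            [scores[i] for i in idx], [labels[i] for i in idx], threshold))
    stats.sort()
    return stats[int(0.025 * n_boot)], stats[int(0.975 * n_boot)]
```

Reporting the threshold and the interval alongside the point estimate makes the documented numbers reproducible and comparable across model versions.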

Have you recorded details on the dataset(s) that were used for the quantitative analyses in the model card? Be sure to consider datasets, motivations, and preprocessing. [6] Mitchell et al. 2019

Ensure that you are…

Instantiating more rigorous tests that include comparing system decisions with those a human would arrive at


Putting a mitigation strategy in place for foreseeable risks in homegrown systems

Phase 3

Model Deployment

Answer these questions

What is the expected performance on unseen data or data with different distributions? [5] Arnold et al. 2018


Was the service checked for robustness against adversarial attacks? [5] Arnold et al. 2018


Have you conducted quantitative analysis and documented unitary and intersectional results? [6] Mitchell et al. 2019
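
A minimal sketch of producing both unitary and intersectional results by disaggregating a per-example metric over one factor or over combinations of factors; the field names (`correct`, `gender`, `age_group`) are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def disaggregate(records, metric_key="correct", by=("gender",)):
    """Average a per-example metric within each combination of the `by` fields."""
    groups = defaultdict(list)
    for record in records:
        key = tuple(record.get(field, "unknown") for field in by)
        groups[key].append(record[metric_key])
    return {key: mean(values) for key, values in groups.items()}

# Unitary results:        disaggregate(eval_records, by=("gender",))
# Intersectional results: disaggregate(eval_records, by=("gender", "age_group"))
```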


Have you brainstormed possible ethical considerations? [6] Mitchell et al. 2019

Phase 4

Observation & Maintenance

Answer these questions

Is there an erratum (a list of known errors and corrections)? [1] Gebru et al. 2018


Will the dataset be updated? If so, how often, by whom, and how will updates be communicated to users? [1] Gebru et al. 2018


When will the dataset expire? Is there a set time limit after which the data should be considered obsolete?

Ensure that you are…

Establishing ethics scores and approvals


Developing clear objectives during data collection, with benchmarks and constraints for review (e.g., in 5, 7, or 10 years)


Ensuring ease of contact with the participants to whom the data belongs, asking questions such as “If you are running a longitudinal study with a fairly transient population, how do you ensure you can find each person later to re-establish consent?” and “Is that information still relevant for our use?”


Committing to discontinue use and/or put mitigation practices in place after finding issues with a dataset


Considering your team’s ability to fix, update, or remove any model or data released from distribution


Replacing problematic benchmarks and encouraging use of better alternatives

Answer these questions

Does the service implement and perform any bias detection and remediation? [5] Arnold et al. 2018


When were the models last updated? [5] Arnold et al. 2018

Ensure that you are…

Determining the auditing method


Archiving the old model (versioning for safety and recovery)
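
A minimal sketch of one way to archive a superseded model before it is replaced: copy the artifact into a versioned directory and write a manifest with a checksum and timestamp so it can be audited or restored later. The paths and manifest fields are assumptions, not part of any cited framework.

```python
import hashlib
import json
import shutil
from datetime import datetime, timezone
from pathlib import Path

def archive_model(model_path, version, archive_dir="model_archive"):
    """Copy a model artifact into a versioned archive and record a manifest."""
    archive = Path(archive_dir) / version
    archive.mkdir(parents=True, exist_ok=True)
    dest = archive / Path(model_path).name
    shutil.copy2(model_path, dest)

    manifest = {
        "version": version,
        "sha256": hashlib.sha256(Path(model_path).read_bytes()).hexdigest(),
        "archived_at": datetime.now(timezone.utc).isoformat(),
        "source_path": str(model_path),
    }
    (archive / "manifest.json").write_text(json.dumps(manifest, indent=2))
    return dest
```

The manifest also provides a concrete answer to “when were the models last updated?” without relying on memory.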