Because documentation is as much a process as it is a set of artifacts, transparency and documentation need to be an explicit part of the discussion at each step of the workflow. This page provides suggested documentation questions and considerations for each phase of the ML system lifecycle — from design and setup to observation and maintenance — compiled from the ABOUT ML Reference Document and academic literature.
Phase 1
Model system design & setup
Answer these questions
For what purpose was the dataset created? 1Gebru, T., Morgenstern, J., Vecchione, B., Vaughan, J. W., Wallach, H., Daumé III, H., & Crawford, K. (2018). Datasheets for datasets. https://arxiv.org/abs/1803.09010
Who created this dataset and on behalf of which entity? 1Gebru et al. 2018
Who funded the creation of the dataset? 1Gebru et al. 2018
What was included in the dataset and why? 2Bender, E. M., & Friedman, B. (2018). Data statements for natural language processing: Toward mitigating system bias and enabling better science. Transactions of the Association for Computational Linguistics, 6, 587-604. https://aclweb.org/anthology/papers/Q/Q18/Q18-1041/
Was ethics approval sought/granted by an institutional review board?
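The provenance questions above lend themselves to being recorded alongside the data itself. Below is a minimal sketch, assuming a Python project that keeps a JSON sidecar file next to the dataset; the field names and values are illustrative, not a schema prescribed by the cited papers.

```python
# A minimal sketch of recording dataset provenance answers as a JSON sidecar file.
# All field names and values are hypothetical.
import json

dataset_provenance = {
    "purpose": "Benchmark sentiment classification for product reviews",  # hypothetical
    "created_by": "Example Research Group",                               # hypothetical
    "created_on_behalf_of": "Example Corp",                               # hypothetical
    "funding_source": "Internal R&D budget",                              # hypothetical
    "inclusion_criteria": "English-language reviews posted 2019-2021",    # hypothetical
    "irb_review": {"sought": True, "granted": True, "protocol_id": "IRB-0000"},
}

with open("dataset_provenance.json", "w") as f:
    json.dump(dataset_provenance, f, indent=2)
```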
Ensure that you are…
Developing real-world examples
Establishing individual context
Answer these questions
Was any preprocessing/cleaning/labeling of the data done? 1Gebru et al. 2018
Was the “raw” data saved in addition to the preprocessed/cleaned/labeled data (e.g., to support unanticipated future uses)? 1Gebru et al. 2018
Which data instances were filtered out of the raw dataset and why? What proportion of the “raw” dataset was filtered out during cleaning?
What are the demographic characteristics of the annotators and of those who developed the annotation guidelines? 1Gebru et al. 2018
What labeling guidelines were used? What instructions were given to the labelers?
What data does each instance consist of? 1Gebru et al. 2018
Is there a label or target associated with each instance? 1Gebru et al. 2018
Are there recommended data splits (e.g., training, development/validation, testing)? 1Gebru et al. 2018
Are there any errors, sources of noise, or redundancies in the dataset? 1Gebru et al. 2018
Have you provided detail about the source, author contact information, and version history? 3Holland, S., Hosny, A., Newman, S., Joseph, J., & Chmielinski, K. (2018). The dataset nutrition label: A framework to drive higher data quality standards. arXiv preprint arXiv:1805.03677. https://arxiv.org/abs/1805.03677
Have you documented the ground truth correlations, i.e., the linear correlations between a chosen variable in the dataset and variables from other datasets considered to be “ground truth”? 3Holland et al. 2018
What mechanisms or procedures were used to collect the data? What mechanisms or procedures were used to correct the data for sampling error? How were these mechanisms or procedures validated? 1Gebru et al. 2018
If the dataset is a sample from a larger set, what was the sampling strategy? 1Gebru et al. 2018
Who was involved in the data collection process, in what kind of working environment, and how were they compensated, if at all? 1Gebru et al. 2018
Over what timeframe was the data collected? 1Gebru et al. 2018
Who is the data being collected from? Would you be comfortable if that data was being collected from you and used for the intended purpose?
Does this data collection undermine individual autonomy or self-determination in any way?
How is the data being retained? How long will it be kept?
Have we considered, explored, and specified how both genre and topic influence the vocabulary and structural characteristics of texts? 4Biber, D. (1995). Dimensions of register variation: A cross-linguistic comparison. Cambridge, UK: Cambridge University Press
Have we thought through the nature of the data sources and how that may affect whether or not the data is a suitable representation of the world? 2Bender and Friedman 2018
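One way to make the preprocessing, filtering, and split questions above answerable later is to log the decisions at the time the pipeline runs. The sketch below assumes a simple Python pipeline with integer instance IDs; the filter rule, storage location, and file names are hypothetical.

```python
# A minimal sketch of logging preprocessing steps and split proportions
# so the "raw" data and the cleaning decisions stay recoverable.
import json
import random

raw_ids = list(range(10_000))                   # stand-in for raw instance IDs
filtered_ids = [i for i in raw_ids if i % 100]  # hypothetical filter: drop 1% as noise

random.seed(42)                                 # fixed seed so splits are reproducible
random.shuffle(filtered_ids)
n = len(filtered_ids)
splits = {
    "train": filtered_ids[: int(0.8 * n)],
    "validation": filtered_ids[int(0.8 * n): int(0.9 * n)],
    "test": filtered_ids[int(0.9 * n):],
}

preprocessing_log = {
    "raw_data_retained_at": "s3://example-bucket/raw/",  # hypothetical location
    "filter_rule": "dropped instances flagged as noise",
    "proportion_filtered_out": 1 - len(filtered_ids) / len(raw_ids),
    "labeling_guidelines": "guidelines_v2.pdf",          # hypothetical document
    "split_sizes": {k: len(v) for k, v in splits.items()},
    "split_seed": 42,
}
with open("preprocessing_log.json", "w") as f:
    json.dump(preprocessing_log, f, indent=2)
```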
Answer these questions
Is there a repository that links to any or all papers or systems that use the dataset? 1Gebru et al. 2018
Are there tasks for which the dataset should not be used? 1Gebru et al. 2018
Was consent obtained from data subjects, and does that consent place limits on the use of the data?
Answer these questions
What is the intended use of the service (model) output? 5Arnold, M., Bellamy, R.K.E., Hind, M., Houde, S., Mehta, S., Mojsilovic, A., Nair, R., Ramamurthy, K. N., Reimer, D., Olteanu, A., Piorkowski, D., Tsay, J., & Varshney, K. R. (2018). FactSheets: Increasing Trust in AI Services through Supplier’s Declarations of Conformity. arXiv preprint arXiv:1808.07261. https://arxiv.org/abs/1808.07261
- Primary intended uses
- Primary intended users
- Out-of-scope and under-represented use cases
What algorithms or techniques does this service implement? 5Arnold et al. 2018
What are the basic model details? 6Mitchell, M., Wu, S., Zaldivar, A., Barnes, P., Vasserman, L., Hutchinson, B., Spitzer, E., Raji, I. D., & Gebru, T. (2019, January). Model cards for model reporting. In Proceedings of the Conference on Fairness, Accountability, and Transparency (pp. 220-229). ACM. https://arxiv.org/abs/1810.03993
- Person or organization developing model and contact information
- Model date
- Model version
- Model type
- Information about training algorithms, parameters, fairness constraints or other applied approaches, and features
- Paper or other resource for more information
- Citation details
- License
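The basic model details listed above map naturally onto a small structured record. The following sketch assumes a Python codebase and uses a dataclass with illustrative values; it is not a prescribed schema from Mitchell et al. 2019.

```python
# A minimal sketch of capturing the "model details" section of a model card
# as structured data; all values are hypothetical.
from dataclasses import dataclass, asdict

@dataclass
class ModelDetails:
    developed_by: str
    contact: str
    model_date: str
    model_version: str
    model_type: str
    training_approach: str
    more_information: str
    citation: str
    license: str

details = ModelDetails(
    developed_by="Example ML Team",
    contact="ml-team@example.com",
    model_date="2024-01-15",
    model_version="1.2.0",
    model_type="gradient-boosted decision trees",
    training_approach="logistic loss, fairness constraint applied during tuning",
    more_information="https://example.com/model-report",
    citation="Example ML Team (2024), internal technical report",
    license="Apache-2.0",
)
print(asdict(details))
```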
Phase 2
Model Development
Answer these questions
What training data is used?
This may not be possible to provide in practice. When possible, this section should mirror the evaluation data section. If such detail is not possible, provide the minimal allowable information here, such as details of the distribution over various factors in the training datasets. 6Mitchell et al. 2019
What type of algorithm is used to train the model?
What are the details of the algorithm’s architecture (e.g., a ResNet neural network)? Include a diagram if possible.
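When the training data itself cannot be shared, the model card guidance above suggests reporting distributions over relevant factors instead. A minimal sketch of such a summary, assuming a pandas DataFrame with hypothetical column names:

```python
# A minimal sketch of summarizing the distribution of training data over
# selected factors when the raw data cannot be released; column names are assumptions.
import pandas as pd

train = pd.DataFrame({
    "age_band": ["18-29", "30-49", "30-49", "50+", "18-29"],
    "region":   ["north", "south", "north", "south", "north"],
    "label":    [1, 0, 1, 0, 1],
})

# Per-factor proportions that can be published in place of raw records
factor_summary = {
    col: train[col].value_counts(normalize=True).round(3).to_dict()
    for col in ["age_band", "region", "label"]
}
print(factor_summary)
```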
Answer these questions
Which datasets was the service tested on? 5Arnold et al. 2018
Have you clearly described the testing methodology? 5Arnold et al. 2018
Have you clearly documented the test results? 5Arnold et al. 2018
Are you aware of possible examples of bias, ethical issues, or other safety risks as a result of using the service? 5Arnold et al. 2018
Are the service outputs explainable and/or interpretable? 5Arnold et al. 2018
Have you chosen metrics that reflect potential real-world impacts of the model? Be sure to consider model performance measures, decision thresholds, and variation approaches. 6Mitchell et al. 2019
- Model performance measures
- Decision thresholds
- Variation approaches
Have you recorded details on the dataset(s) that were used for the quantitative analyses in the model card? Be sure to consider datasets, motivations, and preprocessing. 6Mitchell et al. 2019
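Documenting a decision threshold together with the metrics computed at that threshold keeps the two from drifting apart. The sketch below uses toy scores and labels and a hand-rolled precision/recall computation; in practice, substitute your held-out test outputs and preferred metrics library.

```python
# A minimal sketch of reporting performance measures at explicit decision
# thresholds, so each threshold is documented alongside its results.
import numpy as np

y_true = np.array([0, 1, 1, 0, 1, 0, 1, 1])        # toy labels
scores = np.array([0.2, 0.8, 0.6, 0.4, 0.9, 0.1, 0.3, 0.7])  # toy model scores

def metrics_at_threshold(y_true, scores, threshold):
    y_pred = (scores >= threshold).astype(int)
    tp = int(((y_pred == 1) & (y_true == 1)).sum())
    fp = int(((y_pred == 1) & (y_true == 0)).sum())
    fn = int(((y_pred == 0) & (y_true == 1)).sum())
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"threshold": threshold, "precision": precision, "recall": recall}

# Record the metrics at every threshold that will be used in deployment
for t in (0.3, 0.5, 0.7):
    print(metrics_at_threshold(y_true, scores, t))
```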
Ensure that you are…
Instantiating more rigorous tests that include comparing system decisions with those a human would arrive at
Putting in place a mitigation strategy for foreseeable risks in homegrown systems
Phase 3
Model Deployment
Answer these questions
What is the expected performance on unseen data or data with different distributions? 5Arnold et al. 2018
Was the service checked for robustness against adversarial attacks? 5Arnold et al. 2018
Have you conducted quantitative analysis and documented unitary and intersectional results? 6Mitchell et al. 2018
Have you brainstormed possible ethical considerations? 6Mitchell et al. 2018
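Unitary and intersectional results can be produced from the same disaggregation step. The sketch below assumes evaluation results in a pandas DataFrame with hypothetical group columns; the factors to disaggregate by should come from your own quantitative-analysis plan.

```python
# A minimal sketch of unitary and intersectional disaggregation of accuracy;
# column names and values are assumptions, not a prescribed schema.
import pandas as pd

results = pd.DataFrame({
    "gender":  ["f", "m", "f", "m", "f", "m", "f", "m"],
    "age":     ["<40", "<40", "40+", "40+", "<40", "<40", "40+", "40+"],
    "correct": [1, 1, 0, 1, 1, 0, 1, 1],
})

unitary = results.groupby("gender")["correct"].mean()                  # one factor at a time
intersectional = results.groupby(["gender", "age"])["correct"].mean()  # combined factors
print(unitary, intersectional, sep="\n")
```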
Phase 4
Observation & Maintenance
Answer these questions
Is there an erratum (a list of known errors and corrections)? 1Gebru et al. 2018
Will the dataset be updated? If so, how often, by whom, and how will updates be communicated to users? 1Gebru et al. 2018
When will the dataset expire? Is there a set time limit after which the data should be considered obsolete?
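An append-only errata and update log is one lightweight way to answer these maintenance questions for downstream users. The sketch below is illustrative; the file name, fields, and review date are assumptions, not a standard.

```python
# A minimal sketch of an append-only errata/update log for a dataset,
# so corrections and expiry decisions stay visible to downstream users.
import json
from datetime import date

entry = {
    "date": date.today().isoformat(),
    "dataset_version": "2.1",                                              # hypothetical
    "change": "Corrected 312 mislabeled instances in the validation split",  # hypothetical
    "contact": "data-steward@example.com",                                 # hypothetical
    "review_by": "2029-01-01",  # date after which the data should be re-assessed
}

with open("dataset_errata.jsonl", "a") as f:
    f.write(json.dumps(entry) + "\n")
```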
Ensure that you are…
Establishing ethics scores and approvals
Developing clear objectives during data collection with benchmarks and constraints for review (e.g., in 5, 7, or 10 years)
Ensuring ease of contact for participants to whom the data belongs, asking questions such as “If you are doing longitudinal processes with a fairly transient population, how do you ensure you can find that person later to re-establish consent?” and “Is that information still relevant for our use?”
Committing to, after finding issues with a dataset, discontinuing use and/or putting mitigation practices in place
Considering your team’s ability to fix, update, or remove any model or data released from distribution
Replacing problematic benchmarks and encouraging use of better alternatives
Answer these questions
Does the service implement and perform any bias detection and remediation? 5Arnold et al. 2018
When were the models last updated? 5Arnold et al. 2018
Ensure that you are…
Determining the auditing method
Archiving the old model (versioning for safety and recovery)
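Archiving the outgoing model with version metadata supports both auditing and recovery. The sketch below assumes a file-based model artifact and a local archive directory; the paths and metadata fields are placeholders.

```python
# A minimal sketch of archiving the outgoing model before deploying an update,
# keeping version metadata so the previous model can be audited or restored.
# Paths and field values are assumptions about the local setup.
import json
import shutil
from datetime import date
from pathlib import Path

def archive_model(current_path: str, archive_dir: str, version: str) -> Path:
    archive = Path(archive_dir) / version
    archive.mkdir(parents=True, exist_ok=True)
    # Copy the current model artifact into a version-named archive folder
    shutil.copy2(current_path, archive / Path(current_path).name)
    metadata = {
        "version": version,
        "archived_on": date.today().isoformat(),
        "last_trained": "2024-01-15",     # hypothetical
        "audit_notes": "pre-update snapshot",
    }
    (archive / "metadata.json").write_text(json.dumps(metadata, indent=2))
    return archive

# Usage (hypothetical paths):
# archive_model("models/current/model.pkl", "models/archive", "1.2.0")
```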