Section 3: Preliminary Synthesized Documentation Suggestions

The following is a more detailed discussion of documentation recommendations for specific ML lifecycle stages. In v0, these recommendations came from current research literature. Future versions will incorporate more feedback informed by current practices and pilots.

Documentation should contain different information to meet the needs of different stakeholders as well as the practical constraints of information privacy, security, and compliance requirements. There are many axes to consider. Below are a few:

  • Level of technical detail. The level of technical detail should be higher for stakeholders directly using or changing the ML system (like designers, product managers, and engineers) and lower for stakeholders who do not need the full technical details (like policymakers, end users, impacted users, and civil society organizations).
  • Amount of detail in general. The amount of detail can be higher for internal audiences because there are fewer risks with disclosure inside an organization compared to external disclosure.
  • Length. Documentation can be longer for audiences whose workflow requires reading it, but should be simplified to support the accomplishment of a stakeholder’s main objectives with minimum friction. Documentation should be short for audiences like end users because this will increase the likelihood that they read and understand the key portions most relevant to them. It is important for the documentation to serve the goals of each stakeholder, rather than becoming a cumbersome document that is difficult for people to understand and which then serves only to protect the ML system builder from liability.
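
The three axes above can be illustrated with a minimal sketch, assuming a documentation set is stored as a list of fields and each audience is described by a small profile. Every name here (`DocField`, `Audience`, `select_fields`, `max_fields`) is hypothetical and chosen only for illustration; none of it comes from the ABOUT ML text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DocField:
    """One piece of documentation (hypothetical structure)."""
    title: str
    text: str
    technical: bool      # requires technical background to read
    internal_only: bool  # not suitable for external disclosure


@dataclass(frozen=True)
class Audience:
    """Hypothetical audience profile covering the three axes."""
    name: str
    wants_technical_detail: bool  # axis: level of technical detail
    is_internal: bool             # axis: amount of detail (internal vs. external)
    max_fields: int               # axis: length (a crude stand-in)


def select_fields(fields, audience):
    """Return the fields appropriate for this audience, in original order."""
    visible = [
        f for f in fields
        if (audience.is_internal or not f.internal_only)
        and (audience.wants_technical_detail or not f.technical)
    ]
    return visible[: audience.max_fields]


fields = [
    DocField("Purpose", "Why the system exists.", False, False),
    DocField("Training hyperparameters", "Learning rate, batch size.", True, True),
    DocField("Known limitations", "Where the system performs poorly.", False, False),
]
engineer = Audience("engineer", True, True, 10)
end_user = Audience("end user", False, False, 2)

print([f.title for f in select_fields(fields, engineer)])
print([f.title for f in select_fields(fields, end_user)])
```

In practice the choice of what each audience sees is a judgement call rather than a filter; the sketch only shows that the axes can be recorded explicitly instead of being left implicit.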

Those attempting documentation practices within any phase of the machine learning lifecycle can consider procurement-specific insights and pay particular attention to:

  1. Ensuring accountability and safety of personally identifiable information.
  2. Practicing software quality assurance.
  3. Engaging in open source community development.
  4. Prioritizing change management processes.

3.4.1 Suggested Documentation Sections for Datasets

A dataset is any collection of individual units of information [Floridi, L. (2010, February). Information: A Very Short Introduction]. In computer science, a dataset is a collection of information held in electronic form, usually “numeric or encoded — along with the documentation files (such as a codebook, technical or methodology report, data dictionary) which explain their production or use” [Data Information Specialists Committee UK, 2007]. Documentation of different datasets varies greatly because datasets may be created to accomplish different goals and data may originate from different sources. In this section, we specifically consider data that interfaces with a machine learning model. There are several types of datasets involved in a machine learning lifecycle to train, test, and validate a model, all of which are important to document to offer the most complete understanding of an ML system (see sidebar for more info).

A dataset may include information about anything, from individual humans to the natural world to simulated environments. The need for documentation in the ML process is often correlated with the level of “human-involved” information in a dataset that is personally identifiable to a particular individual. Sometimes the human involvement in a dataset is obvious, such as with national census data, but other times it can be difficult to identify, like in the case of commercial flight arrival time data (people are on the flight and are affected by the flight’s timeliness).

Many legal regimes, such as Europe’s General Data Protection Regulation (GDPR), distinguish between “personal” data that could either directly or (in combination with another feasibly obtainable dataset) indirectly identify a specific individual and “non-personal” data that does not. The more personal and sensitive information a dataset contains, the more security and explanation are needed to address concerns about privacy and potential misuse of the information. There may still be “privacy”-esque issues worth thinking about for any use of a human-derived dataset, even where that data is considered to be “anonymous.” Those issues could be ethical (e.g. how was the dataset sourced), practical (e.g. how was the dataset anonymized), or reputational (e.g. even if anonymous, is the dataset or its source one that an organization would want to be associated with using). Thus, it is important for dataset authors to think carefully about and document how humans might be involved in data collection and processing, and how to protect and use that data responsibly.

This ABOUT ML Reference Document is a reference and foundational resource. Future contributions of the ABOUT ML work will include a PLAYBOOK of specifications, guides, recommendations, templates, and other meaningful artifacts to support the work to be done for ML documentation. The recommended documentation sections for datasets are as follows and will be described in detail below:

  • Data Specification
    • Motivation
  • Data Curation
    • Composition
    • Collection
    • Processing
  • Data Integration
    • Uses
    • Distribution
    • Maintenance
  • Types and Sources of Judgement Calls
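
As a minimal sketch, the outline above could be held in a small structure that tracks which sections still lack content. The names `SECTION_OUTLINE`, `DatasetDocumentation`, `expected_keys`, and `missing_sections` are hypothetical and are not part of any ABOUT ML template; the section titles are copied from the list above.

```python
from dataclasses import dataclass, field

# Outline copied from the list above: stage -> subsections.
SECTION_OUTLINE = {
    "Data Specification": ["Motivation"],
    "Data Curation": ["Composition", "Collection", "Processing"],
    "Data Integration": ["Uses", "Distribution", "Maintenance"],
    "Types and Sources of Judgement Calls": [],
}


@dataclass
class DatasetDocumentation:
    """Hypothetical container for one dataset's documentation text."""
    dataset_name: str
    # Maps "Stage / Subsection" (or just "Stage") to free-text content.
    entries: dict = field(default_factory=dict)

    def expected_keys(self):
        """List every section key implied by SECTION_OUTLINE."""
        keys = []
        for stage, subsections in SECTION_OUTLINE.items():
            if subsections:
                keys.extend(f"{stage} / {sub}" for sub in subsections)
            else:
                keys.append(stage)
        return keys

    def missing_sections(self):
        """Return the section keys that have no (non-blank) text yet."""
        return [k for k in self.expected_keys()
                if not self.entries.get(k, "").strip()]


doc = DatasetDocumentation("example-dataset")
doc.entries["Data Specification / Motivation"] = "Placeholder motivation text."
print(doc.missing_sections())
```

A completeness check of this kind only shows that a section has some text; it says nothing about whether the text is adequate for its audience.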


ML system developers/deployers are encouraged to do a deep dive exploration of this entire section and use it to highlight any gaps in their current understanding of both data- and model-related documentation and planning needs. This group will most benefit from further participation in the ABOUT ML effort by engaging with the community in the forthcoming online forum and by testing the efficacy and applicability of templates and specifications to be published in the PLAYBOOK and PILOTS, which will be developed based on use cases as an opportunity to implement ML documentation processes within an organization.

When a machine learning system is deployed, there is enormous social and cultural impact beyond just the technological advancement. The social and cultural elements make the documentation process even more crucial since these deployments can negatively impact human lives. For example, in the case of facial recognition software that performed poorly on people of color, there was significant societal impact [Harwell, Drew. “Federal Study Confirms Racial Bias of Many Facial-Recognition Systems, Casts Doubt on Their Expanding Use.” The Washington Post, WP Company, 21 Dec. 2019]. Detailed documentation of the algorithmic procedures and their implications often sparks the need for a fair discussion around disparity in inclusion when designing algorithms, and enables a dialogue among a general audience that is not always familiar with the impacts of the algorithms.

The documentation process is beneficial not just for describing algorithms and technical procedures but also for enhancing public awareness and promoting diversity and inclusion in machine learning. For example, WiMLDS (Women in Machine Learning and Data Science) published a documentation summary of all the organization’s events and their impact on a global scale. This documentation, in conjunction with visualizations, is an example of a report that helps readers understand at a glance the impact of diversity in ML across various geographic variables. It is a good example of how documentation could function to define progress in an ongoing field with huge cultural and societal consequences.

Those attempting documentation practices within any phase of the machine learning lifecycle can consider how documentation processes and artifacts would function when interacting with cultural, social, normative, and policy forces by paying particular attention to:

  1. Impact
  2. Human effect
  3. Ethical consideration

Data Specification

Motivation

D’Amour et al. (2020) note that “underspecification in ML pipelines is a key obstacle to reliably training models that behave as expected in deployment” since an ML pipeline is underspecified “when it can return many predictors with equivalently strong held-out performance in the training domain.”

Documenting the motivation for producing a dataset provides a lever for accountability as the project proceeds, enabling stakeholders to go back to the original intention for why the dataset was created and make sure the current trajectory tracks with the original goal. Sharing the original motivation openly can reduce the risk that the dataset will be repurposed for inappropriate uses down the line. If changing or expanding the motivation for a dataset, the team should carry out a new review to ensure the framework and data are suited for the new task.


Writing down the motivation captures the context for the data, which can help downstream users make better-informed decisions about how to use it. Data that was not captured for a particular purpose may be problematic when repurposed [Hildebrandt, M. (2019) ‘Privacy as Protection of the Incomputable Self: From Agnostic to Agonistic Machine Learning’, Theoretical Inquiries in Law, 20(1) 83–121].

One potential risk that businesses may see in documenting the motivation behind a dataset is that it could reveal too much information about their strategic goals for making the dataset [D’Amour, A., Heller, K., Moldovan, D., Adlam, B., Alipanahi, B., Beutel, A., … & Sculley, D. (2020). Underspecification presents challenges for credibility in modern machine learning. arXiv preprint arXiv:2011.03395]. Explorations related to the following research questions could uncover insights into barriers to implementation along with mitigation strategies to overcome those barriers.

Sample Documentation Questions
  • For what purpose was the dataset created? (Gebru et al. 2018)
  • Who created this dataset and on behalf of which entity? (Gebru et al. 2018)
  • Who funded the creation of the dataset? (Gebru et al. 2018)
  • What was included in the dataset and why? (Bender and Friedman 2018)
  • Was ethics approval sought/granted by an institutional review board?
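
One way to put these questions to work, sketched here under the assumption that answers are collected as plain text keyed by question, is to render them into a fill-in section that makes gaps visible. The names `MOTIVATION_QUESTIONS`, `UNANSWERED`, and `render_motivation_section` are hypothetical; the question wording and attributions are copied from the list above.

```python
# Questions and attributions copied from the list above.
MOTIVATION_QUESTIONS = [
    ("For what purpose was the dataset created?", "Gebru et al. 2018"),
    ("Who created this dataset and on behalf of which entity?", "Gebru et al. 2018"),
    ("Who funded the creation of the dataset?", "Gebru et al. 2018"),
    ("What was included in the dataset and why?", "Bender and Friedman 2018"),
    ("Was ethics approval sought/granted by an institutional review board?", None),
]

UNANSWERED = "[unanswered]"  # hypothetical placeholder marker


def render_motivation_section(answers):
    """Render the Motivation section as plain text.

    `answers` maps a question string to its answer; questions with no
    answer (or only whitespace) are shown with the UNANSWERED marker.
    """
    lines = ["Motivation"]
    for question, source in MOTIVATION_QUESTIONS:
        label = f"{question} ({source})" if source else question
        answer = answers.get(question, "").strip() or UNANSWERED
        lines.append(f"Q: {label}")
        lines.append(f"A: {answer}")
    return "\n".join(lines)


example_answers = {
    "Who funded the creation of the dataset?": "Placeholder funder name.",
}
print(render_motivation_section(example_answers))
```

Leaving unanswered questions in the rendered output, rather than dropping them, keeps the gaps visible to reviewers.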

Those attempting documentation practices within any phase of the machine learning lifecycle can target ethical and human-focused considerations by paying particular attention to:

  1. Real-world examples
  2. Individual context

Acknowledging that the power imbalance may always exist, the panelists in the Diverse Voices process advocated for approaches that are respectful to communities, including: 1) transparent, early, and frequent communication with communities, including about the purpose and uses of data collection and ML systems; 2) equitable ways of enabling feedback from communities; and 3) looking at how ML systems and data impact the community positively and negatively, and communicating that to the community.