ABOUT ML Quick Guides

Throughout the ABOUT ML Reference Document, coral “Quick Guide” callout boxes provide text that further explains a concept. These are meant to make the content more accessible to lay users of machine learning systems.

Below are the collected “Quick Guide” callout boxes, arranged according to the section of the Reference Document they appear in.

ABOUT ML

The ABOUT ML initiative was presented at the Human-Centric Machine Learning workshop at the 2019 Neural Information Processing Systems conference.

In this work, Deb Raji and Jingying Yang note that “transparency through documentation is a promising practical intervention that can integrate into existing workflows to provide clarity in decision making for users, external auditors, procurement departments, and other stakeholders alike.”

ABOUT ML Reference Document

The ABOUT ML Reference Document will continue to evolve. A few guides, specifications, and other useful artifacts contained within the Reference Document will also be accessible as standalone resources.

Further work to operationalize practices noted in the ABOUT ML Reference Document will be showcased on the ABOUT ML website in the form of a PLAYBOOK. This PLAYBOOK will serve as an evolving repository of resources for stakeholders within the ML documentation community to use.

Machine Learning

Deepai.org defines machine learning as an “approach to data analysis that involves building and adapting models, which allows programs to ‘learn’ through experience. Machine learning involves the construction of algorithms that adapt their models to improve their ability to make predictions.”

This approach is opaque by default, and the ABOUT ML effort aims to remedy that by using documentation to bring transparency to the development process, the datasets, and the other connections around the model.

Documentation

Documentation directly addresses the goal of transparency but can also provide an infrastructure for making progress towards other AI ethics goals.

Methods for Inclusion

The goal of the Methods for Inclusion work done in conjunction with ABOUT ML is to create and curate resources that help AI/ML researchers and developers work more effectively with communities, especially those who have historically been excluded from the process, when developing AI/ML-driven technology or solutions.

MLLC vs. SDLC

Dblue.ai distinguishes between the machine learning lifecycle (MLLC) and the software development lifecycle (SDLC) by noting that software is:

“built based on the requirements provided during the first phase of SDLC. But in machine learning, a model is built based on a specific dataset […] [T]he underlying characteristics of the data might change and your models may not be giving the right result.”

Additionally, important distinctions in drift, maintenance, and monitoring between the two highlight the need for differences in documentation cadence and audience.
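The change in data characteristics described in the quote above can be monitored in many ways. The following is only a minimal sketch of one simple check, not a method taken from Dblue.ai or the Reference Document: it measures how far the mean of recent values for a single feature has moved from a reference sample, in units of the reference standard deviation. The function name, the sample values, and the threshold are all hypothetical.

```python
from statistics import mean, stdev


def mean_shift_in_std_units(reference, recent):
    """Distance between the recent mean and the reference mean,
    expressed in reference standard deviations."""
    spread = stdev(reference)
    if spread == 0:
        raise ValueError("reference values have no spread")
    return abs(mean(recent) - mean(reference)) / spread


# Hypothetical values of one input feature at training time and in production.
reference_values = [10.0, 11.0, 9.0, 10.5, 9.5]
recent_values = [14.0, 15.0, 13.5, 14.5, 15.5]

# Hypothetical threshold; a real project would choose and document its own.
DRIFT_THRESHOLD = 2.0

shift = mean_shift_in_std_units(reference_values, recent_values)
drift_detected = shift > DRIFT_THRESHOLD
print(f"shift = {shift:.2f} standard deviations, drift detected: {drift_detected}")
```

Recording which check is used, how often it runs, and who reviews the result is the kind of detail that ML documentation needs and that SDLC documentation typically does not.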

Underspecification

D’Amour et al. (2020) note that “underspecification in ML pipelines is a key obstacle to reliably training models that behave as expected in deployment” since an ML pipeline is underspecified “when it can return many predictors with equivalently strong held-out performance in the training domain.”
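A toy illustration of this definition, assuming a deliberately artificial setup that does not come from D’Amour et al.: two hand-written predictors score identically on held-out data from the training domain, yet behave differently on hypothetical deployment data. All names and data below are invented for the sketch.

```python
# Each example is ((feature_1, feature_2), label).
# In the training domain the two features always agree, so a rule based on
# either feature fits the labels equally well.
held_out = [((0, 0), 0), ((1, 1), 1), ((0, 0), 0), ((1, 1), 1)]

# Hypothetical deployment data: the features no longer agree, and the label
# follows only the first feature.
deployment = [((0, 1), 0), ((1, 0), 1), ((0, 1), 0), ((1, 0), 1)]


def predictor_a(features):
    """Predict the label from the first feature only."""
    return features[0]


def predictor_b(features):
    """Predict the label from the second feature only."""
    return features[1]


def accuracy(predictor, data):
    """Fraction of examples for which the predictor returns the label."""
    correct = sum(1 for features, label in data if predictor(features) == label)
    return correct / len(data)


for name, predictor in [("A", predictor_a), ("B", predictor_b)]:
    print(name, accuracy(predictor, held_out), accuracy(predictor, deployment))
```

Held-out performance alone cannot distinguish the two predictors here, which is why documentation of the intended deployment conditions matters.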

Consent vs. Consentability

On page 7 of Kim (2019), the author notes that:

“Consent is distinct from consentability […] The first [meaning of consentability] involves possibility. An act which is consentable means it is possible for there to be consent given the nature of the proposed activity. The second meaning of consentability involves legality. An act which is consentable is (or under the right circumstances, can be) legal. The possibility of valid consent is essential to consentability but it is not sufficient.”

Datasets

In a 2017 blog post titled “What is the Difference Between Test and Validation Datasets?”, Brownlee describes the distinctions between datasets as follows:

“Training Dataset: The sample of data used to fit the model.

Validation Dataset: The sample of data used to provide an unbiased evaluation of a model fit on the training dataset while tuning model hyperparameters. The evaluation becomes more biased as skill on the validation dataset is incorporated into the model configuration.

Test Dataset: The sample of data used to provide an unbiased evaluation of a final model fit on the training dataset.”
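The three datasets described above are often produced by shuffling one collection of samples and cutting it into disjoint parts. The following is a minimal sketch under that assumption, using only the Python standard library; the function name, the split fractions, and the seed are hypothetical choices, not recommendations from Brownlee or the Reference Document.

```python
import random


def split_dataset(samples, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle samples and cut them into training, validation, and test subsets."""
    if val_fraction < 0 or test_fraction < 0 or val_fraction + test_fraction >= 1:
        raise ValueError("fractions must be non-negative and sum to less than 1")

    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible

    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)

    test = shuffled[:n_test]                    # held back for the final evaluation
    validation = shuffled[n_test:n_test + n_val]  # used while tuning hyperparameters
    train = shuffled[n_test + n_val:]           # used to fit the model
    return train, validation, test


# Hypothetical usage with 100 integer sample IDs.
train, validation, test = split_dataset(range(100))
print(len(train), len(validation), len(test))
```

Documenting the seed and fractions alongside the dataset lets others reproduce the same split.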

GDPR

The General Data Protection Regulation (GDPR) is a legal framework that sets guidelines for the collection and processing of personal information from individuals who live in the European Union (EU).

This and other international, federal, state, and local regulations could affect the generalizability of documentation recommendations.