ABOUT ML ‘How We Define’

Throughout the ABOUT ML Reference Document, blue “How We Define” callout boxes are included to showcase accepted definitions of terms and phrases. These are meant to give foundational background information to readers and also provide a baseline of understanding for any artifacts that may be derived from the Reference Document.

Below are the collected “How We Define” callout boxes, arranged according to the section of the Reference Document they appear in. Additional terms can be found in the Reference Document’s glossary section.


As noted in Jobin et al. (2019), the “interpretation, justification, domain of application, and mode of achievement” of AI transparency vary from one publication to another.

For this document, we adopt a meaning for transparency that includes all “efforts to increase explainability, interpretability or other acts of communication and disclosure.”


We follow the lexicon of algorithms research by Kohli et al. (2018) in defining accountability as “the answerability of actors for outcomes,” together with the tracing and verification of system actions and of those who take responsibility for those actions.


Our working definition for “AI incidents” is the following:

AI incidents are events or occurrences in real life that caused or had the potential to cause physical, financial, or emotional harm to people, animals, or the environment.

This definition is the basis of the AI Incident Database (AIID) project, which we are developing in conjunction with our ABOUT ML documentation efforts. The project is meant to add to the compendium of resources that support the reduction of harm from AI systems.

Data Documentation

Data can be documented using any artifact supporting the explanation and/or clarification of the collection, curation, cleaning, processing, composition, integration, distribution or maintenance of data used in the model/algorithms of the ML system.


According to Lipton (2017), interpretability has no agreed-upon meaning. However, we see the benefit of interpreting “opaque models after-the-fact” and are comfortable using the post-hoc interpretation approach, which includes “natural language explanations, visualization of learned representations or models, and explanations by example.”

Data Curation

We adopt Divya Singh’s definition from the article entitled “The Role of Data Curation in Big Data”:

“Curation is the end-to-end process of creating good data through the identification and formation of resources with long-term value. In information technology, it refers mainly to the management of data throughout its lifecycle, from creation and initial storage to the time when it is archived for future research and analysis, or becomes obsolete and is deleted.”

Demographic Data

As noted in Bogen et al. (2019), “[m]any machine learning fairness practitioners rely on awareness of sensitive attributes — that is, access to labeled data about people’s race, ethnicity, sex, or similar demographic characteristics — to test the efficacy of debiasing techniques or directly implement fairness interventions.”

Fairness, Transparency, and Accountability research within PAI is focused on examining the challenges organizations face around the collection and use of demographic data to help address algorithmic bias.

Data Integration

We use the tenets of the Doan et al. (2012) definition, which holds that proper data integration offers “uniform access to a set of autonomous and heterogeneous data sources.”

The task is often challenging for systems, logical, social, and administrative reasons.
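
As a rough illustration of what “uniform access” can mean in practice, the Python sketch below reads two small, made-up sources (one CSV, one JSON) that use different field names and exposes them through a single record shape. The source contents, field names, and function names are all hypothetical and are not taken from Doan et al.; real integration work would also need to handle conflicting identifiers, missing values, and access permissions.

```python
import csv
import io
import json

# Two hypothetical, independently maintained sources with different schemas.
CSV_SOURCE = "id,full_name\n1,Ada\n2,Grace\n"
JSON_SOURCE = '[{"user_id": 3, "name": "Alan"}]'


def read_csv_source(text):
    """Yield records from the CSV source in the shared {"id", "name"} shape."""
    for row in csv.DictReader(io.StringIO(text)):
        yield {"id": int(row["id"]), "name": row["full_name"]}


def read_json_source(text):
    """Yield records from the JSON source in the shared {"id", "name"} shape."""
    for item in json.loads(text):
        yield {"id": int(item["user_id"]), "name": item["name"]}


def integrated_records():
    """Give callers one uniform view over both heterogeneous sources."""
    yield from read_csv_source(CSV_SOURCE)
    yield from read_json_source(JSON_SOURCE)


records = list(integrated_records())
```

Callers of `integrated_records()` do not need to know which source a record came from or how that source names its fields, which is the property the definition emphasizes.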


We borrow from Vogelsang and Borg (2019) and note that model specifications can include information about:

  • Quantitative targets
  • Data requirements
  • Explainability
  • Freedom from discrimination
  • Legal and regulatory constraints
  • Quality requirements


Generalization usually refers to the ability of an algorithm to be effective across a range of inputs and applications. It is related to repeatability in that we expect a consistent outcome based on the inputs.

To create good predictive models in machine learning that are capable of generalizing, one needs to know when to stop training the model so that it doesn’t overfit.
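
One common way to decide when to stop training is early stopping on a held-out validation loss. The Python sketch below is a minimal version of that idea, assuming a list of per-epoch validation losses is already available; the function name, the `patience` and `min_delta` parameters, and the loss values are illustrative assumptions, not part of the Reference Document.

```python
def early_stopping_epoch(val_losses, patience=2, min_delta=0.0):
    """Return (best_epoch, stop_epoch) for a sequence of validation losses.

    Training is considered finished once the validation loss has failed to
    improve by more than `min_delta` for `patience` consecutive epochs.
    Returns (-1, -1) if `val_losses` is empty.
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0

    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            # Validation loss improved: remember this epoch and reset the counter.
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                # Loss has stalled or risen: a likely sign of overfitting.
                return best_epoch, epoch

    # Never triggered: training ran through every recorded epoch.
    return best_epoch, len(val_losses) - 1


# Hypothetical losses: improving for three epochs, then getting worse.
best, stopped = early_stopping_epoch([0.9, 0.7, 0.6, 0.65, 0.7, 0.8], patience=2)
```

In this example the loss is lowest at epoch 2 and training would halt at epoch 4, so the model saved at epoch 2 is the one expected to generalize best under this heuristic.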

Model Integration

Integrating a model is the act of putting it into production (deployment) and instantiating the monitoring process.

In a 2019 Towards Data Science blog post, Opeyemi notes that the deployment of an ML model “simply means the integration of the model into an existing production environment which can take in an input and return an output that can be used in making practical business decisions.”
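
To make the two halves of this definition concrete, the Python sketch below wraps a placeholder model in a class that takes an input, returns an output, and records each prediction so that it can be monitored. The class name, the averaging “model,” the threshold, and the monitoring statistic are all hypothetical stand-ins chosen for illustration; a production system would use a trained model and a proper logging and alerting stack.

```python
from dataclasses import dataclass, field


@dataclass
class MonitoredModel:
    """A toy deployed model that logs every prediction for later monitoring."""

    threshold: float = 0.5
    prediction_log: list = field(default_factory=list)

    def _score(self, features):
        # Placeholder for a trained model: the mean of the input features.
        return sum(features) / len(features)

    def predict(self, features):
        """Take an input, return a decision, and record it for monitoring."""
        if not features:
            raise ValueError("features must not be empty")
        score = self._score(features)
        decision = score >= self.threshold
        self.prediction_log.append(
            {"features": list(features), "score": score, "decision": decision}
        )
        return decision

    def positive_rate(self):
        """A simple monitoring statistic: the share of positive decisions."""
        if not self.prediction_log:
            return 0.0
        positives = sum(entry["decision"] for entry in self.prediction_log)
        return positives / len(self.prediction_log)


model = MonitoredModel()
first = model.predict([0.9, 0.7])
second = model.predict([0.1, 0.2])
```

Tracking a statistic such as `positive_rate()` over time is one simple way to notice when the deployed model's behavior drifts from what was observed during development.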