Data Curation

Collection

Data Curation

We adopt Divya Singh’s definition from the article entitled “The Role of Data Curation in Big Data”:

“Curation is the end-to-end process of creating good data through the identification and formation of resources with long-term value. In information technology, it refers mainly to the management of data throughout its lifecycle, from creation and initial storage to the time when it is archived for future research and analysis, or becomes obsolete and is deleted.”

The data collection process should be well documented for end users of the system, purchasers of the system, and any collaborators contributing to the development of the overall ML system. Potential intellectual property issues about data provenance should be flagged at this stage, such as whether any third party could claim that their data was improperly included in this dataset at any point in its history.

When data is collected from human subjects, the documentation should include information about the consent and notification process, or alternatively explain why consent was not necessary for this use of the personal data. For example, are the subjects aware of all the data being collected about them, are they able to opt out of or opt in to the data collection, and have they been notified of the exact uses of that collected data? Decisions around what constitutes meaningful informed consent should consider what Nancy Kim refers to as “consentability.” Questions that ML practitioners should consider in their documentation include how aware the data subject is of what information will be collected about them, whether the data subject has clear alternatives they can exercise to not have their data collected, whether those choices can be exercised without fear of penalty, and whether a data subject can reasonably understand the downstream uses and effects of their data [Selinger, E. (2019). ‘Why You Can’t Really Consent to Facebook’s Facial Recognition’, One Zero. https://onezero.medium.com/why-you-cant-really-consent-to-facebook-s-facial-recognition-6bb94ea1dc8f]. In the highest-risk use cases, teams should take pains to ensure data subjects are fully aware of the exact uses of their data.

Consent vs. Consentability

On page 7 of Kim (2019), the author notes that:

“Consent is distinct from consentability […] The first [meaning of consentability] involves possibility. An act which is consentable means it is possible for there to be consent given the nature of the proposed activity. The second meaning of consentability involves legality. An act which is consentable is (or under the right circumstances, can be) legal. The possibility of valid consent is essential to consentability but it is not sufficient.”

In addition, potential issues of sampling bias should be thoroughly evaluated and noted to the extent possible, recognizing that this is currently quite difficult in many domains. For example, studies have found that certain minority communities are disproportionately targeted by police for arrest [Lum, K., & Isaac, W. (2016). To predict and serve? Significance, 13(5), 14-19. https://rss.onlinelibrary.wiley.com/doi/full/10.1111/j.1740-9713.2016.00960.x], which means that they are in effect being over-sampled in the data. As a result, arrest data would over-represent these communities even if they have similar crime rates to other communities.
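The over-sampling effect can be made concrete with a small arithmetic sketch. The numbers below are hypothetical and chosen only for illustration; they are not taken from Lum and Isaac (2016). The sketch assumes two communities with identical population sizes and identical underlying crime rates, differing only in the probability that an offense ends up recorded as an arrest. The function name `arrest_share` and its parameters are likewise illustrative.

```python
def arrest_share(populations, crime_rate, detection_rates):
    """Return each community's share of recorded arrests.

    All inputs are hypothetical illustration values:
    populations     -- dict of community name -> number of residents
    crime_rate      -- offenses per resident, identical for every community
    detection_rates -- dict of community name -> probability that an
                       offense is recorded as an arrest
    """
    expected_arrests = {
        name: populations[name] * crime_rate * detection_rates[name]
        for name in populations
    }
    total = sum(expected_arrests.values())
    return {name: count / total for name, count in expected_arrests.items()}


# Same population and same crime rate; community "B" is policed twice as heavily.
shares = arrest_share(
    populations={"A": 10_000, "B": 10_000},
    crime_rate=0.05,
    detection_rates={"A": 0.2, "B": 0.4},
)
print(shares)
```

Under these assumed values, community B accounts for roughly two thirds of the recorded arrests even though both communities have the same crime rate, which is the kind of distortion that documentation of the collection process should help users notice.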


These disclosures help users of the dataset assess potential issues of biased sampling and representation in a dataset. In addition, greater transparency around the collection process and whether the proper consent was obtained can give data subjects more assurance that their privacy is respected. These disclosures can also allow companies to indicate that they have complied with relevant data privacy laws. Companies that make this information available might see a reputational or competitive advantage, as consumers with a strong preference for transparency might prefer products built with models whose underlying data collection process is known [LabelInsight (2016). “Drive Long-Term Trust & Loyalty Through Transparency”. https://www.labelinsight.com/Transparency-ROI-Study]. Documentation also gives internal teams a lever of accountability, as it creates a paper trail for detecting or identifying misused models or data. For a company, more detailed documentation could offer protection from liability stemming from third-party misuse by clarifying the intended context of use. Finally, documenting the data collection process enhances replicability by demonstrating how a scientific conclusion was reached, a core step in the process of scientific advancement and ML research.

Some potential negative effects of such disclosures, however, include possible legal, privacy, and intellectual property concerns depending on the level of granularity of the disclosures and whether any questionable practices were used for data collection. For example, documentation of the decision-making process for a dataset could be discoverable in potential litigation. This is an area that needs further research to understand the legal ramifications of information disclosure within ML documentation and whether existing or new policy can help to ensure that companies have incentive to share vital information that would be to the public benefit without incurring undue risk or harm. Explorations related to the following research questions could uncover insights into barriers to implementation along with mitigation strategies to overcome those barriers.

Sample Documentation Questions

  • What mechanisms or procedures were used to collect the data? What mechanisms or procedures were used to correct the data for sampling error? How were these mechanisms or procedures validated? (Gebru et al. 2018)
  • If the dataset is a sample from a larger set, what was the sampling strategy? (Gebru et al. 2018)
  • Who was involved in the data collection process, in what kind of working environment, and how were they compensated, if at all? (Gebru et al. 2018)
  • Over what timeframe was the data collected? (Gebru et al. 2018)
  • Who is the data being collected from? Would you be comfortable if that data was being collected from you and used for the intended purpose?
  • Does this data collection undermine individual autonomy or self-determination in any way?
  • How is the data being retained? How long will it be kept?
  • Both genre and topic influence the vocabulary and structural characteristics of texts (Biber 1995) and should be specified. Think of the nature of data sources and how that may affect whether or not the data is a suitable representation of the world. (Bender and Friedman 2018)

Readers are encouraged to explore Section of this document to incorporate ethical considerations of consent into their data collection processes, including important tenets of consent over time, derivative future use of consent, rescinding consent, and other topics.

Processing


In a 2017 blog post titled “What is the Difference Between Test and Validation Datasets?,” Brownlee describes the distinctions between datasets as follows:

“Training Dataset: The sample of data used to fit the model.

Validation Dataset: The sample of data used to provide an unbiased evaluation of a model fit on the training dataset while tuning model hyperparameters. The evaluation becomes more biased as skill on the validation dataset is incorporated into the model configuration.

Test Dataset: The sample of data used to provide an unbiased evaluation of a final model fit on the training dataset.”
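The three roles described above can be sketched as a simple partition of a dataset. This is a minimal illustration, not Brownlee's code or a recommended procedure: the function name `split_dataset`, the default fractions, and the seed are all assumptions made for the example.

```python
import random


def split_dataset(records, val_fraction=0.1, test_fraction=0.1, seed=0):
    """Shuffle records and partition them into training, validation, and test lists.

    The fractions and seed are illustrative defaults, not recommendations.
    """
    if not 0 <= val_fraction + test_fraction < 1:
        raise ValueError("val_fraction + test_fraction must be in [0, 1)")
    shuffled = list(records)
    # A fixed seed makes the split reproducible, so it can be documented.
    random.Random(seed).shuffle(shuffled)

    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)
    test_set = shuffled[:n_test]                      # held out for the final evaluation
    validation_set = shuffled[n_test:n_test + n_val]  # used while tuning hyperparameters
    training_set = shuffled[n_test + n_val:]          # used to fit the model
    return training_set, validation_set, test_set


train_set, val_set, test_set = split_dataset(
    range(100), val_fraction=0.2, test_fraction=0.1, seed=42
)
print(len(train_set), len(val_set), len(test_set))
```

Recording the seed and fractions alongside the dataset lets later users reconstruct exactly which instances were used for fitting, tuning, and final evaluation.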

Although data processing can seem like a very straightforward task, there is significant potential for bias to enter the dataset through it. Datasets are “political interventions” that are not neutral or natural; the act of collecting, categorizing, and labeling data “is itself a form of politics filled with questions about who gets to decide what [data] mean” [Crawford and Paglen, https://www.excavating.ai/]. Additionally, it is important to document what steps were taken to de-identify datasets pertaining to people and how that fits in with relevant policies.

There are many factors to document when human labeling is involved in the dataset creation process. Given that data processing often involves some degree of human labeling, it is important to be conscious of the assumptions or choices that must be made in labeling. Some of these choices will be influenced by cultural biases that ML practitioners and labelers may bring into their decision-making processes [Geva, Mor & Goldberg, Yoav & Berant, Jonathan. (2019). Are We Modeling the Task or the Annotator? An Investigation of Annotator Bias in Natural Language Understanding Datasets. https://arxiv.org/pdf/1908.07898.pdf]. Disclosures should include information about the number of labelers used and an analysis of the demographics of the labelers (e.g., languages spoken for NLP models) to help users gauge the potential for biased labeling or blind spots. There is both bias from the labelers themselves and bias from the choice of labels to include. For example, if sex is considered a binary variable, non-binary individuals are effectively unidentifiable in the data. Defining the taxonomy for the data is thus an important step in establishing the ground truth. In addition, ensuring inter-rater reliability is one important step toward addressing the potential for bias from human labelers. Lastly, making labels transparent and auditable is a necessary step to facilitate debugging.
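One way to quantify inter-rater reliability is Cohen's kappa, which compares the observed agreement between two labelers with the agreement expected by chance. The text above does not name a specific statistic, so the choice of kappa here is an assumption, and the labels in the example are invented for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two labelers who annotated the same items in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both labelers must annotate the same non-empty set of items")
    n = len(labels_a)
    # Fraction of items on which the two labelers agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each labeler's own label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1:
        return 1.0  # both labelers used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels from two annotators on six items.
labeler_1 = ["pos", "pos", "neg", "neg", "pos", "neg"]
labeler_2 = ["pos", "neg", "neg", "neg", "pos", "pos"]
kappa = cohens_kappa(labeler_1, labeler_2)
print(round(kappa, 3))
```

A value near 1 indicates strong agreement and a value near 0 indicates agreement no better than chance. Kappa covers only two labelers; datasets with more annotators would need a different statistic, and whichever measure is used should be reported in the documentation.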

Other datasets collect labels and/or metadata in a more automatic manner. One method is to link other data sources to collect labels (e.g., age for individuals in a face recognition dataset collected by scraping Wikipedia). In this case, the source of the label should be disclosed. Other datasets use models to predict labels (e.g., gender for individuals in a face recognition dataset, predicted by a face analysis model). In this case, a link to the documentation of the model should be provided. In addition, details of any audits assessing the model for bias in its predictions for intersectional groups should be noted.


Benefits of such disclosures include replicability and clarifying potential biases. Model developers using the dataset can better understand what they can or cannot do with the data, and any leaps in logic become more apparent. The transparency created by these disclosures can also encourage data collectors to ensure that their data labeling practices align with the original purposes they envisioned for the data.

Potential downsides include privacy concerns for the labelers, depending on how much information about them is disclosed. In addition, data cleaning and labeling can be a complex and multi-layered process, so accurately relaying the process can be difficult. Explorations related to the following research questions could uncover insights into barriers to implementation, examples of potential bias, and levels of stakeholder comfort regarding privacy.

Sample Documentation Questions

  • Was any preprocessing/cleaning/labeling of the data done? (Gebru et al. 2018)
  • Was the “raw” data saved in addition to the preprocessed/cleaned/labeled data (e.g., to support unanticipated future uses)? (Gebru et al. 2018)
  • Which data instances were filtered out of the raw dataset and why? What proportion of the “raw” dataset was filtered out during cleaning?
  • What are the demographic characteristics of the annotators and annotation guidelines given to developers? (Gebru et al. 2018)
  • What labeling guidelines were used? What instructions were given to the labelers?

Composition

It is vital to make it clear to users what is in the dataset. This reflects the operationalization of the motivation section above. In addition to a list of the label annotations and metadata in the dataset, it is important to include information about how representative the dataset is and the potential for sampling bias or other forms of statistical bias. For example, in the context of natural language processing (NLP), it would be relevant to include information about the original language of the text [Bender, E. M., & Friedman, B. (2018). Data statements for natural language processing: Toward mitigating system bias and enabling better science. Transactions of the Association for Computational Linguistics, 6, 587-604.], especially because label quality can vary greatly if annotators lack the context for making meaningful, accurate labels. In general, background on the demographics reflected in the data can be useful for users to assess potential bias issues.

Demographic Data

As noted in Bogen et al. (2019), “[m]any machine learning fairness practitioners rely on awareness of sensitive attributes — that is, access to labeled data about people’s race, ethnicity, sex, or similar demographic characteristics — to test the efficacy of debiasing techniques or directly implement fairness interventions.”

Fairness, Transparency, and Accountability research within PAI is focused on examining the challenges organizations face around the collection and use of demographic data to help address algorithmic bias.

A key area to document is how developers are using your test, training, and validation data sets. It is important to ensure there is no overlap between the test, training, and validation subsets.
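A check for overlap between subsets can be sketched as a set intersection over instance identifiers. The sketch below assumes every instance has a stable unique identifier; the function name `find_split_overlap` and the example identifiers are hypothetical.

```python
def find_split_overlap(splits):
    """Return the identifiers shared by each pair of splits.

    `splits` maps a split name to an iterable of instance identifiers.
    Only pairs with a non-empty intersection appear in the result.
    """
    names = list(splits)
    id_sets = {name: set(ids) for name, ids in splits.items()}
    overlaps = {}
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            shared = id_sets[first] & id_sets[second]
            if shared:
                overlaps[(first, second)] = shared
    return overlaps


# Hypothetical identifiers; "img_003" appears in both train and test.
example_splits = {
    "train": ["img_001", "img_002", "img_003"],
    "validation": ["img_004", "img_005"],
    "test": ["img_003", "img_006"],
}
overlap_report = find_split_overlap(example_splits)
print(overlap_report)
```

An exact-identifier check like this will not catch near-duplicates, such as the same image stored under two identifiers, so documentation should state which notion of overlap was tested.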


The benefits of making composition clear are that users can know what to expect from the dataset and how models trained on the data might perform in different domains. In disclosing composition, it is important for developers to refer back to their motivations for creating the dataset to ensure that the composition appropriately reflects those objectives.

Depending on the granularity of the description of composition, privacy could be an issue. As a general rule, developers should distinguish between what information is appropriate to share with whom and be very mindful of disclosing any metadata or labels that might make the dataset personally identifiable.

When including information on the demographic composition of the data, developers should keep in mind that demographic taxonomies are not well-defined. Developers can look to the fields of sociology and psychology for existing standards, but should be aware that some taxonomies might still be problematic in context. For example, a binary gender classification might not be appropriate [Katta Spiel, Oliver L. Haimson, and Danielle Lottridge. (2019). How to do better with gender on surveys: a guide for HCI researchers. Interactions. 26, 4 (June 2019), 62-65. DOI: https://doi.org/10.1145/3338283]. In addition, it might not make sense to apply the American racial construct in another region’s cultural context.

Finally, there are still open research questions around both the definition of “representativeness” in datasets and what sample sizes and quality qualify datasets to be used for models that make decisions about subgroups. Representativeness depends on the context of the specific systems where the data is being used. Documentation should assist users with determining what the appropriate contexts are for use of the particular dataset. Explorations related to the following research questions could uncover insights into barriers to implementation along with mitigation strategies to overcome those barriers.

Sample Documentation Questions

  • What data does each instance consist of? (Gebru et al. 2018)
  • Is there a label or target associated with each instance? (Gebru et al. 2018)
  • Are there recommended data splits (e.g., training, development/validation, testing)? (Gebru et al. 2018)
  • Are there any errors, sources of noise, or redundancies in the dataset? (Gebru et al. 2018)
  • Detail source, author contact information, and version history. (Holland et al. 2019)
  • Ground truth correlations: linear correlations between a chosen variable in the dataset and variables from other datasets considered to be “ground truth.” (Holland et al. 2019)

Types and Sources of Judgement Calls

In deciding what kinds of data to collect, how to collect it, and how to store it, teams should document the major judgement calls they made. Judgement calls are often made when creating, correcting, annotating, striking out, weighing, or enriching data. The extent and nature of these human judgement calls should be made explicit, as should the resources or faculties brought to bear in making them. Judgement calls can include why the team chose to collect and process data in a particular way and the method for doing so. While it would be overly burdensome to document all decisions reached, research and product teams should determine what constitutes a major judgement call by consulting the norms that exist in their field and assessing how much their decisions deviate from those norms.

Common points of judgement include:

  • Study Design
    • Question Choice
    • Language Used
  • Data Collection
    • Source of Data
    • Subject Selection
  • Data Processing
    • Filtering and Exclusion
    • Bucketing and Thresholds
  • Application and Usage
    • Expansion of Context
    • Usage as a Proxy for Another Feature