Overview
Data labelers, data cleaners, and others who contribute human judgment to artificial intelligence (AI) systems play a critical role in developing this technology. Drawing from a diverse range of perspectives, the Responsible Sourcing workstream aims to develop recommendations and actionable resources to improve the working conditions of these professionals.
Background
With many businesses pursuing automation and personalization with their technology investments, AI applications are becoming an increasingly common feature of industry. Alongside this boom has been the expansion of data enrichment work.
Despite being an essential component of AI development, data enrichment work has for too long been both out of sight and out of mind for AI developers. Without knowledge of (and appreciation for) how it is produced, enriched data can be too easily treated as a simple commodity. This disconnect leads to a devaluing of data enrichment work, poor working conditions for data enrichment workers, and, often, worse outcomes for AI development itself.
Increasingly, AI practitioners are recognizing the importance of data enrichment work and the people behind this critical enabling step in the AI development process. Unfortunately, too many AI developers still aren’t aware of the ways they are precipitating harmful and precarious working conditions, and those who are aware often don’t know what they can do to help. From AI developers we’ve heard sentiments like “We feel we must care about the transparency of our supply chain. But there is no transparency in data labeling. Guidelines on how to navigate this would be very useful.” Similarly, data enrichment providers say that they “would love the buyers [of data labeling] to be more educated and have realistic expectations when they set the price and terms of tasks.”
The Responsible Sourcing workstream responds to these needs by providing actionable guidance for data scientists, AI engineers, and product managers, empowering these critical ecosystem players to do their part in ensuring healthy and fair working conditions across the data supply line.
What Is Data Enrichment?
The concepts of machine learning have been around for more than half a century, but most of the major advances have taken place in the last five to ten years. This is thanks to improvements in hardware performance and the affordability of computing power, which have made it possible to collect and analyze data at an unprecedented scale. As Ian Goodfellow, Yoshua Bengio, and Aaron Courville wrote in their 2016 book Deep Learning, “The most important new development is that today we can provide these algorithms with the resources they need to succeed.” Those resources are data.
But today’s AI systems cannot be built with just any data. They require enriched data. Data enrichment is a broadly defined term that encapsulates various types of data preparation and cleaning as well as human-review processes. Enriched data is essential for the training and validation of supervised learning models, the dominant form of applied AI. Examples of data enrichment work include:
Data preparation and cleaning:
- Data annotation
- Intent recognition
- Sentiment analysis
- Image recognition
- Speech-to-text validation
Human review and human-in-the-loop work, which may include:
- Content moderation
- Creating a continuous feedback loop
- Validating algorithmic outputs and models
Events
At PAI, we have been working to highlight the precarious working conditions faced by a key group that makes AI possible: data enrichment professionals. Our recently published white paper Responsible Sourcing of Data Enrichment Services covers how data sourcing decisions impact workers and proposes avenues for AI practitioners to improve these workers’ conditions. We were thrilled to see this issue explored in The Gig Is Up.