Data labelers, data cleaners, and others who contribute human judgment to artificial intelligence (AI) systems play a critical role in developing this technology. Drawing from a diverse range of perspectives, the Responsible Sourcing workstream aims to develop recommendations and actionable resources to improve the working conditions of these professionals.
“Getting the most out of big data depends on recognizing how dependent we are on workers–often temporary, working offsite–who clean, structure, and manage datasets. The future of advancing AI hinges on investing in work conditions that enhance rather than undermine how data are handled. I’m glad to see PAI’s initiative calling on AI organizations to firmly commit to this vision of responsible tech in all their data pipeline decisions.”
“As AI becomes more mainstream, it is important to acknowledge the invisible work and workers that enable the technology. We hope this paper can help contribute to the dialogue around worker wellbeing in the AI supply chain.”
“Ethical AI is usually focused on the use of AI but the development of AI also involves significant decisions of responsibility. It takes a large human workforce to train, scale, and sustain AI and yet there are very few worker centric resources to help AI companies make these responsible decisions.”
“This whitepaper is a must-read for AI companies which want to practice responsible procurement. We hope its recommendations will be adopted widely across the AI industry, for the benefit of both clients and data enrichment workers.”
“Artificial intelligence is driven by human intelligence, and we have an opportunity for AI to be a force for good in the most overlooked talent pools.”
With many businesses pursuing automation and personalization with their technology investments, AI applications are becoming an increasingly common feature of industry. Alongside this boom has been the expansion of data enrichment work.
Despite being an essential component of AI development, data enrichment work has for too long been both out of sight and out of mind for AI developers. Without knowledge of (and appreciation for) how it is produced, enriched data can be too easily treated as a simple commodity. This disconnect leads to a devaluing of data enrichment work, poor working conditions for data enrichment workers, and, often, worse outcomes for AI development itself.
Increasingly, AI practitioners are recognizing the importance of data enrichment work and the people behind this critical enabling step in the AI development process. Unfortunately, too many AI developers still aren’t aware of the ways they are precipitating harmful and precarious working conditions, and those who are aware don’t know what they can do to help. From AI developers, we’ve heard sentiments like “We feel we must care about the transparency of our supply chain. But there is no transparency in data labeling. Guidelines on how to navigate this would be very useful.” Similarly, data enrichment providers express that they “would love the buyers [of data labeling] to be more educated and have realistic expectations when they set the price and terms of tasks.”
The Responsible Sourcing workstream responds to these needs by providing actionable guidance for data scientists, AI engineers, and product managers, empowering these critical ecosystem players to do their part in ensuring healthy and fair working conditions across the data supply chain.
What Is Data Enrichment?
The concepts of machine learning have been around for more than half a century, but most of the major advances have taken place in the last five to ten years. This is thanks to improvements in hardware performance and the affordability of computing power, which have made it possible to collect and analyze data at an unprecedented scale. As Ian Goodfellow, Yoshua Bengio, and Aaron Courville wrote in their 2016 book Deep Learning, “The most important new development is that today we can provide these algorithms with the resources they need to succeed.” Those resources are data.
But today’s AI systems cannot be built with just any data. They require enriched data. Data enrichment is a broadly defined term that encapsulates various types of data preparation and cleaning as well as human-review processes. Enriched data is essential for the training and validation of supervised learning models, the dominant form of applied AI. Examples of data enrichment work include:
Data preparation and cleaning:
- Data annotation
- Intent recognition
- Sentiment analysis
- Image recognition
- Speech-to-text validation
Human-review/human-in-the-loop work, which may include:
- Content moderation
- Creating a continuous feedback loop
- Validating algorithmic outputs and models