Working to Address Algorithmic Bias? Don’t Overlook the Role of Demographic Data


Collecting and using demographic data in service of detecting algorithmic bias is a challenge fraught with many legal and ethical implications. How might those concerned about bias within algorithmic models address this?

A lack of clarity around the acceptable uses for demographic data has frequently been cited by PAI partners as a barrier to addressing algorithmic bias in practice.

This has led us to ask the question, “When and how should demographic data be collected and used in service of algorithmic bias detection and mitigation?”

To field preliminary responses to this question, PAI hosted a convening under the Chatham House Rule, co-located with the 2020 FAT* (FAccT) Conference in Barcelona. By bringing together a diverse group of machine learning, legal, policy, and social science experts from industry, academia, and civil society, PAI hoped to identify some key gaps that future multistakeholder convenings and research projects could address.

Below, we share highlights from the convening conversation, as well as an invitation to participate in a new PAI research project exploring access to and usage of demographic data as a barrier to detecting bias. For a more in-depth recap of our conversation at FAT*, check out the full convening report here.

Four Key Tensions Around Using Demographic Data for Bias Detection 

During the course of the conversation at FAT*, a set of central tensions arose around demographic data use and collection:

How Should Demographic Data Be Defined?

Algorithmic bias refers to the ways in which algorithms might perform more poorly for certain demographic groups or produce disparate outcomes across such groups. Thus, knowledge of which demographic groups individuals belong to is vital for measuring and mitigating such biases.
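As a rough illustration of why group membership is needed for measurement, the sketch below computes accuracy and selection rate per group and the largest gap between groups. The function names (`group_metrics`, `max_gap`), the group labels, and the toy records are all hypothetical and made up for this example. They are not drawn from the convening or from any particular fairness toolkit, and real audits typically consider many more metrics.

```python
from collections import defaultdict

def group_metrics(records):
    """Compute accuracy and selection rate per demographic group.

    `records` is an iterable of (group, y_true, y_pred) tuples with
    binary labels. The record layout is an assumption for this sketch.
    """
    counts = defaultdict(lambda: {"n": 0, "correct": 0, "positive": 0})
    for group, y_true, y_pred in records:
        c = counts[group]
        c["n"] += 1
        c["correct"] += int(y_true == y_pred)   # model got this case right
        c["positive"] += int(y_pred == 1)       # model gave the positive outcome
    return {
        g: {
            "accuracy": c["correct"] / c["n"],
            "selection_rate": c["positive"] / c["n"],
        }
        for g, c in counts.items()
    }

def max_gap(metrics, name):
    """Largest difference in one metric between any two groups."""
    values = [m[name] for m in metrics.values()]
    return max(values) - min(values)

# Toy, invented data: (group, true label, predicted label)
records = [
    ("A", 1, 1), ("A", 0, 0), ("A", 1, 1), ("A", 0, 1),
    ("B", 1, 0), ("B", 0, 0), ("B", 1, 0), ("B", 0, 0),
]

metrics = group_metrics(records)
accuracy_gap = max_gap(metrics, "accuracy")          # poorer performance for one group
selection_gap = max_gap(metrics, "selection_rate")   # disparate outcomes across groups
print(metrics, accuracy_gap, selection_gap)
```

Neither gap can be computed without the first element of each record, which is exactly the demographic attribute whose collection raises the tensions discussed here.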

“Demographic data” is an umbrella term covering the categories that the US refers to as “protected class data” and some of the categories that the EU’s GDPR calls “sensitive personal data.”

Convening participants were quick to point out, however, that regulated categories of data in the US and EU are quite different. Participants also discussed how even within specific jurisdictions, different variables are protected in different contexts by a wide array of applicable legislation.

This tension is further complicated by the potential for proxy variables to be used as stand-ins for nearly any demographic category. As some participants mentioned, the EU’s GDPR specifies that data processed to predict sensitive personal data becomes sensitive itself. Many types of data, however, could contribute towards predicting sensitive data or constructing a close proxy for demographic categories, rendering the task of determining what data is (or should be) protected an increasingly difficult one.

How Can I Align My Organization Around Responsible Demographic Data Collection and Use?

A major concern that participants brought up was the issue of deciding when and how to collect and use demographic data within an organization.

Collecting and using demographic data is often fraught with legal and ethical dilemmas, given the highly personal and private nature of such data and its potential for misuse.

Due to the potential privacy encroachments inherent to collecting personal information about users, it is not always clear when measuring and mitigating bias should be the highest organizational priority. Furthermore, leadership, compliance, legal, and communications teams can be justifiably concerned about the potential for demographic data collection to increase both the corporation’s liability and the potential for public backlash. Thus, it can be difficult to build alignment on demographic data collection and usage within teams, let alone across an organization.

Is My Approach to Measuring Bias… Biased Itself?

In cases where demographic data is being collected for the purpose of better understanding system bias, practitioners must be extremely cautious about peering through distorted lenses.

Many participants discussed how measurement bias, observational challenges with various demographic attributes, and class misrepresentation can all obscure and mischaracterize algorithmic biases. Self-selection bias (i.e., only obtaining data from those who choose to provide it) is an especially crucial consideration in this area. Mandated reporting of demographic characteristics is neither probable nor desirable in most domains, but self-reporting is very likely to introduce systematic biases into the data that gets collected.
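To make the self-selection concern concrete, here is a minimal sketch with invented numbers. It assumes a hypothetical population in which members of one group who experienced a model error are less likely to disclose their demographics. The field names (`group`, `model_error`, `disclosed`) and the helpers `make_people` and `error_rate` are illustrative only.

```python
def make_people(group, err_disclosed, err_hidden, ok_disclosed, ok_hidden):
    """Build a toy population for one group from four invented counts."""
    spec = [
        (True, True, err_disclosed),   # model erred, person disclosed demographics
        (True, False, err_hidden),     # model erred, person did not disclose
        (False, True, ok_disclosed),   # model was right, person disclosed
        (False, False, ok_hidden),     # model was right, person did not disclose
    ]
    return [
        {"group": group, "model_error": err, "disclosed": disc}
        for err, disc, n in spec
        for _ in range(n)
    ]

def error_rate(people, group, disclosed_only=False):
    """Error rate for a group, optionally using only self-reporters."""
    rows = [
        p for p in people
        if p["group"] == group and (p["disclosed"] or not disclosed_only)
    ]
    return sum(p["model_error"] for p in rows) / len(rows)

# Group A: everyone discloses. Group B: people the model failed rarely disclose.
people = make_people("A", 2, 0, 8, 0) + make_people("B", 1, 3, 4, 2)

true_a = error_rate(people, "A")                           # all of group A
true_b = error_rate(people, "B")                           # all of group B
observed_a = error_rate(people, "A", disclosed_only=True)  # self-reporters only
observed_b = error_rate(people, "B", disclosed_only=True)  # self-reporters only

print("true gap:", true_b - true_a)
print("observed gap:", observed_b - observed_a)
```

In this toy setup the model's error rate is twice as high for group B as for group A, yet the rates computed from self-reported data alone are identical, so the disparity would go undetected.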

Can We Ensure Accountable Use of Demographic Data?

Once demographic data has been collected, it is important that it is actually used towards the goals that motivated its collection. This is key to both garnering the public support needed for demographic data collection in the first place and to ensuring that bias detection and mitigation efforts remain a possibility into the future.

One potential accountability mechanism that participants discussed involved setting up a third-party data storage arrangement where demographic data is metered out by a not-for-profit, publicly accountable organization.

We expand on these four tensions, and additional conversation from the FAT* convening, in this convening recap.

Paths Forward for Algorithmic Bias Detection: Two New PAI Research Projects

PAI has two active research projects seeking to bring greater clarity to the question of how demographic data should be used in service of algorithmic fairness goals.

In this law review article from PAI, we explore the legal tensions between, on the one hand, seeking to ensure that algorithmic decision-making is not driven by protected class attributes like race and sex and, on the other hand, seeking to actively mitigate algorithmic bias through techniques that might be conscious of such attributes.

In addition, PAI is launching an interview-based research project looking specifically at whether and how demographic data gets collected and used in practice by AI developers seeking to detect bias.

By using our unique position at the intersection of corporate AI developers and civil society groups representing different aspects of the public interest, we hope to clarify paths forward for bias detection and mitigation efforts that are squared with data regulations and best practices for user protection.

We are currently recruiting participants for our interview study, and we would love to speak with you! We are interested in talking to a range of folks who are involved in decisions around potentially sensitive data (or proxies) that may be relevant to bias detection (e.g., age, race, ethnicity, gender, socioeconomic status, or zip code).

We’d especially like to talk with you if:

  • You have been involved in efforts to detect biases in a model or product
  • You are familiar with internal company policies around the usage of demographic data
  • You are familiar with external regulations that may impact the collection or usage of demographic data

We are looking for perspectives from a variety of roles, including engineers, researchers, product managers, lawyers, compliance specialists, and others. We’re also looking for perspectives from a variety of sectors where bias detection may be relevant.

We know that collecting and using demographic data to detect algorithmic bias is a challenge for many individuals and organizations. We hope the results of this interview study will offer insight into how other organizations approach this challenge. Follow-on work will include a multi-stakeholder process for envisioning creative solutions together.

Your input is crucial to helping the field better and more responsibly detect algorithmic bias!

Please visit the project overview if you are interested in learning more, and sign up here if you are interested in being interviewed.