A Sociotechnical Framework for Assessing Demographic Data Collection

This paper provides a sociotechnical examination (see Appendix 5 for a more detailed definition) of differentially private federated statistics, weighing individual- and community-level social risks and harms alongside considerations of technical accuracy. To do so, it examines the use of differentially private federated statistics within the context of a broader algorithmic fairness assessment strategy undertaken by a team or organization, rather than assessing the technique in isolation. This analysis reveals that organizations and teams integrating differentially private federated statistics must make a number of design choices to ensure that their overall strategy for collecting and using sensitive demographic data is conducted responsibly. Organizations that neglect to think critically about these design choices risk further entrenching historical discrimination and introducing new forms of bias, rendering their bias mitigation efforts moot. Based on these findings, we provide recommendations for organizations interested in using this technique to achieve their fairness goals.

The challenges individual AI developers and organizations face when attempting to conduct an algorithmic fairness assessment have been enumerated by PAI and other AI ethics researchers. Privacy-preserving techniques, such as differentially private federated statistics, hold the promise of addressing some of these challenges so that AI developers can collect the demographic data necessary to identify potential algorithmic bias. For example, by obfuscating the link between a user and their demographic data, reducing the likelihood of re-identification, and minimizing opportunities for data breaches, differentially private federated statistics allow data collection and analysis to be aligned with legal and regulatory requirements protecting individual data privacy. Such protections also lessen the risks to organizations by reducing the likelihood of data mismanagement, which often results in reputational damage, erosion of consumer trust, and significant legal exposure.
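To make the mechanism concrete, the sketch below (a minimal illustration in Python, not the specific system any particular organization deploys) shows the core idea behind such techniques using randomized response, a classic local differential privacy mechanism: each device randomly perturbs a binary demographic attribute before it ever leaves the device, and the server debiases the aggregated noisy reports to recover a group-level estimate. The function names, epsilon value, and example data are hypothetical; production federated analytics systems typically layer secure aggregation and more sophisticated mechanisms on top of this idea.

    import math
    import random

    EPSILON = 1.0  # illustrative per-report privacy budget

    def randomized_response(true_value, epsilon=EPSILON):
        # Runs on each client device: with probability p = e^eps / (e^eps + 1)
        # report the true attribute, otherwise report its negation. No single
        # report reveals whether the individual actually holds the attribute.
        p = math.exp(epsilon) / (math.exp(epsilon) + 1)
        return true_value if random.random() < p else not true_value

    def estimate_count(reports, epsilon=EPSILON):
        # Runs on the server: debias the sum of noisy reports. Since
        # E[observed] = true*p + (n - true)*(1 - p), solving for "true"
        # yields an unbiased estimate of how many clients hold the attribute.
        n = len(reports)
        p = math.exp(epsilon) / (math.exp(epsilon) + 1)
        observed = sum(reports)
        return (observed - n * (1 - p)) / (2 * p - 1)

    if __name__ == "__main__":
        random.seed(0)
        clients = [True] * 300 + [False] * 700  # 30% truly hold the attribute
        noisy_reports = [randomized_response(v) for v in clients]
        print(f"noisy count: {sum(noisy_reports)}, "
              f"debiased estimate: {estimate_count(noisy_reports):.1f}")

Note that the same noise that protects each individual report also degrades accuracy, particularly for small subgroups. This is precisely the technical accuracy consideration this paper weighs against the social risks discussed below.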

However, other concerns and risks remain. For example, at an organizational level, data privacy preservation does not inherently resolve the challenge of identifying the most appropriate demographic data categories to model and capture potential algorithmic bias (selecting appropriate demographic measurements). The psycho-social risks of datafication (see Appendix 4 for a more detailed definition), at both the individual and the community level, are also not fully and automatically resolved through the application of privacy-preserving data techniques. Misrepresentation and miscategorization (see Appendix 4 for more detailed definitions) can result in psychological harm for individuals, causing them to feel as if they need to alter their behavior or appearance in order to “fit the mold” of the category with which they identify. The likelihood of re-identification may be significantly reduced, thereby minimizing the risk of being targeted for specific social identity markers. However, those contributing their data (data subjects), particularly those of marginalized social identities, may be asked to contribute even more data tracking ever greater minutiae of their lives in the name of algorithmic fairness assessment. This expanded data collection may unintentionally result in increased surveillance if the collected data is not protected from misuse or undisclosed uses.

Any proposed innovation in the collection and use of demographic data should be assessed through a sociotechnical framework, because risks and harms of this kind are not always evident when assessed through a single risk or harm factor. Risks related to consumer database breaches may limit what an organization chooses to collect, store, and analyze, even for algorithmic fairness purposes. However, avoiding demographic characteristics altogether can make it impossible to assess for bias (e.g., “race-blind” algorithms). For this reason, we define a robust algorithmic fairness strategy involving the collection and use of demographic data as one that overcomes organizational and legal barriers while also mitigating social risks. Table 1 summarizes the challenges and risks associated with demographic data collection and analysis against which we consider differentially private federated statistics.

TABLE 1: Challenges and Risks Associated with Demographic Data Collection and Analysis

Organizational Concerns

  • Organizational priorities
  • Public relations risk
  • Discomfort (or lack of expertise) with identifying appropriate demographic groups

Legal Barriers

  • Anti-discrimination law
  • Privacy policies

Social Risks to Individuals

  • Unique privacy risks associated with the sharing of sensitive attributes likely to be the target of fairness analysis
  • Possible harms stemming from miscategorizing and misrepresenting individuals in the data collection process
  • Use of sensitive data beyond data subjects’ expectations

Social Risks to Communities

  • Expansion of surveillance infrastructure in the name of fairness
  • Misrepresenting and miscategorizing what it means to be part of a demographic group or to hold a certain identity
  • Data subjects ceding the ability to define for themselves what constitutes biased or unfair treatment

Appendix 5 provides an expanded version of this table with more detailed definitions and illustrative examples. See “‘What We Can’t Measure, We Can’t Understand’: Challenges to Demographic Data Procurement in the Pursuit of Fairness” and “Fairer Algorithmic Decision-Making and Its Consequences: Interrogating the Risks and Benefits of Demographic Data Collection, Use, and Non-Use” for a complete description of the challenges and risks associated with demographic data collection and usage.