Differentially Private Federated Statistics

Differentially private federated statistics is a privacy-preserving technique that combines two approaches, differential privacy and federated statistics, in order to enable large-scale, interactive data analysis while helping to ensure that an individual’s sensitive information remains private and secure. Both differential privacy and federated statistics can be implemented independently of one another. In this section, we explore these two approaches first individually, and then as complementary techniques, before discussing the motivations for combining them in support of data analysis efforts.

Differential Privacy

As Cynthia Dwork and Aaron Roth write, “[d]ifferential privacy addresses the paradox of learning nothing about an individual while learning useful information about a population.” Differential privacy, first defined in 2006, is not an algorithm or fixed system, but rather an analytical approach that can be constructed in various ways with the common aim of preventing a set of (anonymized) personal data from being re-identified. This data processing framework aims to provide a strong privacy guarantee to individuals by enforcing privacy constraints locally (e.g., on an individual’s device) or centrally (e.g., on a server after data has been collected from individuals) through the addition of random statistical noise. This infusion of random statistical noise makes it difficult to identify any given individual who has contributed their data; the noise, however, is designed not to impede group-level analysis. Essentially, differential privacy works to provide data analysts with trends or patterns across groups as opposed to individual-level information.
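To make this concrete, the sketch below shows one common construction of differential privacy, the Laplace mechanism, applied to a simple count query. The query, the epsilon value, and the data are illustrative assumptions rather than details of any system discussed in this white paper; real deployments calibrate these parameters carefully.

```python
import numpy as np

def dp_count(values, threshold, epsilon):
    """Return a differentially private count of values above a threshold.

    A count query has sensitivity 1 (adding or removing one person's
    record changes the count by at most 1), so Laplace noise with
    scale 1/epsilon satisfies epsilon-differential privacy.
    """
    true_count = sum(v > threshold for v in values)
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# Illustrative data: any single individual's contribution is hidden by
# the noise, but the group-level trend remains visible to the analyst.
ages = np.random.randint(18, 90, size=10_000)
print(dp_count(ages, threshold=65, epsilon=0.5))  # close to the true count
```

A smaller epsilon means more noise and stronger privacy but less accurate results, which is the privacy-accuracy tradeoff revisited later in this section.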

This approach addresses the organizational and legal concerns surrounding privacy held by many organizations, particularly those related to the consequences of accidentally revealing private user data, while still allowing them to obtain insights about their users or consumers. It also eases concerns held by those contributing their data (data subjects), as the risk of re-identification or loss of anonymity may be minimized.

Differential privacy has been most commonly used in instances where organizations, such as government agencies or healthcare providers, wish to publish datasets while maintaining the privacy and confidentiality of those who contributed their data. Many companies, including Google, Meta, and Apple, also employ this technique when collecting data from their users. More recently, differential privacy has received attention due to a debate sparked by its use by the US Census Bureau. This debate focused mainly on the tradeoff between privacy and accuracy, and the impact of this tradeoff on marginalized groups. Critics argued that while differential privacy ensures strong privacy protections against de-anonymization, the infusion of statistical noise can reduce the accuracy of analyses performed on the dataset, particularly for groups that make up a statistical minority (see Appendix 4 for a more detailed definition).

Federated Statistics

Federated statistics (drawn from the more commonly known federated learning) is a machine learning (ML) technique that enables organizations to access and use data from multiple, discrete devices without the need to collect and store this data in a centralized database. In doing so, federated statistics provides some privacy protection and data security, since any personal data used to train an ML model does not have to leave an individual device and be put at risk of data breaches, whether during data transfer or while stored on a central server. This technique is also scalable, as it allows multiple organizations to collaborate on a given ML task without requiring them to share large volumes of raw data with each other. In the case of algorithmic fairness, this allows bias assessments to take place across a large volume and wide array of users to understand the full impact of any potential bias issues.

Federated statistics necessitates the use of discrete, individual systems or devices where analysis can be performed. Limiting factors for this technique, therefore, include the computational capabilities, storage, network connectivity, and power of these devices. Federated learning, on its own, is also widely known to be vulnerable to privacy and security issues, since the data provided to the central server can be used to identify individuals. This technique is also susceptible to model poisoning attacks (see Appendix 4 for a more detailed definition), since malicious users can directly influence the global model by infusing incorrect or messy data from their device.

Federated statistics has been most commonly used to train ML models. For example, AI-developing organizations can send a centralized ML model (trained on publicly available data or not yet trained at all) to each individual device; each device trains a copy of that model locally before sending the training results back to a central server, where the results are aggregated and the centralized model is updated. In the context of algorithmic fairness problems, instead of sending large volumes of user data to a central server, data scientists can send queries (specific questions that can be answered through data) to individual devices and receive an aggregate report back that allows them to identify trends across groups. (It should be noted that, as of the writing of this white paper, there is no formal definition for federated algorithms, and thus other scholars’ definitions may vary slightly.)
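The sketch below illustrates this query-based workflow under simplified, hypothetical assumptions: each device computes a small summary of its own data in response to a query, and only those summaries, not the raw records, are combined into the aggregate the analyst sees. The device datasets, the query, and the averaging step are illustrative; production systems handle orchestration, device dropouts, and weighting far more carefully, and the query would actually execute on each device rather than in a single process.

```python
from statistics import mean

def run_federated_query(device_datasets, query):
    """Send a query to each device and aggregate the per-device reports.

    Raw records never leave a device; only the small report produced by
    `query` is transmitted, and the analyst sees only the aggregate.
    (For brevity, everything runs in one process here.)
    """
    local_reports = [query(data) for data in device_datasets]
    return mean(local_reports)

# Hypothetical on-device datasets (each list stays on its own device).
device_datasets = [
    [4.2, 5.1, 3.9],   # device 1's local records
    [6.0, 5.5],        # device 2's local records
    [4.8, 4.9, 5.2],   # device 3's local records
]

# Query: the mean of each device's local records.
print(run_federated_query(device_datasets, query=mean))
```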

Differentially Private Federated Statistics

Combining differential privacy and federated statistics allows for large-scale, interactive data analysis while helping to ensure that an individual’s sensitive information remains private and secure. Data scientists are able to gain insight into aggregate trends by sending queries to each data source (federated statistics). By adding statistical noise to the data, analysts can protect individual privacy. Additionally, a secure aggregation protocol (see Appendix 4 for a more detailed definition) can also be used to ensure that only the aggregate of the individual reports generated from each query is visible to those conducting the data analysis (differential privacy). In other words, this technique ensures that no raw individual data is stored at or visible to the central level; data analysts see only the aggregate of the reports generated.
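A minimal sketch of this combination is shown below, assuming a local differential privacy setup in which each device clips its value to a known range and adds Laplace noise before reporting, so the server only ever handles noisy reports and their aggregate. The value range, the epsilon, and the data are hypothetical; deployed systems may instead add noise centrally after a secure aggregation step, and they calibrate the noise to the specific queries being asked.

```python
import numpy as np

def local_dp_report(local_value, epsilon, lower, upper):
    """Produce a device's noisy report: clip the local value to a known
    range, then add Laplace noise scaled to that range, so the raw value
    never leaves the device in the clear."""
    clipped = min(max(local_value, lower), upper)
    sensitivity = upper - lower  # max change one individual can cause
    return clipped + np.random.laplace(scale=sensitivity / epsilon)

def aggregate(reports):
    """The server (and analyst) only ever sees the mean of noisy reports."""
    return float(np.mean(reports))

# Hypothetical per-device values, e.g., daily hours of app usage in [0, 8].
device_values = np.random.uniform(0, 8, size=5_000)
noisy_reports = [local_dp_report(v, epsilon=1.0, lower=0, upper=8)
                 for v in device_values]

print(aggregate(noisy_reports))       # noisy, privacy-preserving estimate
print(float(np.mean(device_values)))  # true mean, for comparison only
```

Because the noise in individual reports averages out as the number of devices grows, the aggregate remains useful at scale even though each report, taken alone, reveals little about its contributor.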

One of the key strengths that differentially private federated statistics has over other privacy-enhancing techniques, such as Secure Multi-Party Computation (SMPC) (see Appendix 4 for a more detailed definition), is its ability to scale across numerous external actors, vendors, and contexts. This is particularly useful for an organization seeking to assess the fairness of a system that operates in multiple and highly varied social, political, and economic contexts, as highly localized and specific trends can be identified.

Like federated statistics, differentially private federated statistics requires specific infrastructure in order to operate: discrete devices where an individual’s data is collected, stored, and analyzed. Conducting such analyses, depending on their complexity, may require high local memory capacity and a large amount of computing power. An internet connection is needed for local devices to transfer their data reports to a central server and for queries to be sent to the devices. These hardware requirements may limit the types of AI/ML organizations that are able to take advantage of differentially private federated statistics as part of their algorithmic fairness assessment strategy.

While layering these two approaches effectively mitigates the privacy and security vulnerabilities of federated statistics, it does not address the tradeoff inherent to differential privacy, namely the loss of data utility as privacy protections increase. In the following discussion, we review some of the key considerations for the design of a robust algorithmic fairness assessment that relies on differentially private federated statistics as a post-deployment technique to preserve the privacy of individual users while collecting and analyzing demographic data.