-
Privately Answering Queries on Skewed Data via Per Record Differential Privacy
Authors:
Jeremy Seeman,
William Sexton,
David Pujol,
Ashwin Machanavajjhala
Abstract:
We consider the problem of the private release of statistics (like aggregate payrolls) where it is critical to preserve the contribution made by a small number of outlying large entities. We propose a privacy formalism, per-record zero concentrated differential privacy (PzCDP), where the privacy loss associated with each record is a public function of that record's value. Unlike other formalisms which provide different privacy losses to different records, PzCDP's privacy loss depends explicitly on the confidential data. We define our formalism, derive its properties, and propose mechanisms which satisfy PzCDP that are uniquely suited to publishing skewed or heavy-tailed statistics, where a small number of records contribute substantially to query answers. This targeted relaxation helps overcome the difficulties of applying standard DP to these data products.
Submitted 19 October, 2023;
originally announced October 2023.
-
Tumult Analytics: a robust, easy-to-use, scalable, and expressive framework for differential privacy
Authors:
Skye Berghel,
Philip Bohannon,
Damien Desfontaines,
Charles Estes,
Sam Haney,
Luke Hartman,
Michael Hay,
Ashwin Machanavajjhala,
Tom Magerlein,
Gerome Miklau,
Amritha Pai,
William Sexton,
Ruchit Shrestha
Abstract:
In this short paper, we outline the design of Tumult Analytics, a Python framework for differential privacy used at institutions such as the U.S. Census Bureau, the Wikimedia Foundation, or the Internal Revenue Service.
Submitted 8 December, 2022;
originally announced December 2022.
-
Bayesian and Frequentist Semantics for Common Variations of Differential Privacy: Applications to the 2020 Census
Authors:
Daniel Kifer,
John M. Abowd,
Robert Ashmead,
Ryan Cumings-Menon,
Philip Leclerc,
Ashwin Machanavajjhala,
William Sexton,
Pavel Zhuravlev
Abstract:
The purpose of this paper is to guide interpretation of the semantic privacy guarantees for some of the major variations of differential privacy, which include pure, approximate, Rényi, zero-concentrated, and $f$ differential privacy. We interpret privacy-loss accounting parameters, frequentist semantics, and Bayesian semantics (including new results). The driving application is the interpretation of the confidentiality protections for the 2020 Census Public Law 94-171 Redistricting Data Summary File released August 12, 2021, which, for the first time, were produced with formal privacy guarantees.
Submitted 7 September, 2022;
originally announced September 2022.
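As a small illustration of the kind of privacy-loss accounting parameters the abstract above refers to, the following sketch converts a zero-concentrated DP parameter into an approximate-DP guarantee using the standard bound that rho-zCDP implies (rho + 2*sqrt(rho*ln(1/delta)), delta)-DP, and adds rho values under sequential composition. The function names and the example values are assumptions made for illustration; they are not taken from the paper, which may use tighter conversions.

```python
import math


def zcdp_to_approx_dp(rho: float, delta: float) -> float:
    """Return an epsilon such that rho-zCDP implies (epsilon, delta)-DP.

    Uses the standard (not necessarily tightest) conversion
    epsilon = rho + 2 * sqrt(rho * ln(1 / delta)).
    """
    if rho < 0:
        raise ValueError("rho must be non-negative")
    if not 0 < delta < 1:
        raise ValueError("delta must lie strictly between 0 and 1")
    return rho + 2.0 * math.sqrt(rho * math.log(1.0 / delta))


def compose_zcdp(rhos) -> float:
    """Sequential composition under zCDP: the rho parameters add up."""
    return sum(rhos)


if __name__ == "__main__":
    # Hypothetical budget split across three releases (illustrative values).
    total_rho = compose_zcdp([0.1, 0.2, 0.2])
    print(total_rho, zcdp_to_approx_dp(total_rho, delta=1e-10))
```

The same rho therefore corresponds to a whole curve of (epsilon, delta) pairs, one for each choice of delta, so quoting a single epsilon summarizes the guarantee only partially.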
-
The 2020 Census Disclosure Avoidance System TopDown Algorithm
Authors:
John M. Abowd,
Robert Ashmead,
Ryan Cumings-Menon,
Simson Garfinkel,
Micah Heineck,
Christine Heiss,
Robert Johns,
Daniel Kifer,
Philip Leclerc,
Ashwin Machanavajjhala,
Brett Moran,
William Sexton,
Matthew Spence,
Pavel Zhuravlev
Abstract:
The Census TopDown Algorithm (TDA) is a disclosure avoidance system using differential privacy for privacy-loss accounting. The algorithm ingests the final, edited version of the 2020 Census data and the final tabulation geographic definitions. The algorithm then creates noisy versions of key queries on the data, referred to as measurements, using zero-Concentrated Differential Privacy. Another key aspect of the TDA is invariants, statistics that the Census Bureau has determined, as a matter of policy, to exclude from the privacy-loss accounting. The TDA post-processes the measurements together with the invariants to produce a Microdata Detail File (MDF) that contains one record for each person and one record for each housing unit enumerated in the 2020 Census. The MDF is passed to the 2020 Census tabulation system to produce the 2020 Census Redistricting Data (P.L. 94-171) Summary File. This paper describes the mathematics and testing of the TDA for this purpose.
Submitted 19 April, 2022;
originally announced April 2022.
-
An Uncertainty Principle is a Price of Privacy-Preserving Microdata
Authors:
John Abowd,
Robert Ashmead,
Ryan Cumings-Menon,
Simson Garfinkel,
Daniel Kifer,
Philip Leclerc,
William Sexton,
Ashley Simpson,
Christine Task,
Pavel Zhuravlev
Abstract:
Privacy-protected microdata are often the desired output of a differentially private algorithm since microdata are familiar and convenient for downstream users. However, there is a statistical price for this kind of convenience. We show that an uncertainty principle governs the trade-off between accuracy for a population of interest ("sum query") vs. accuracy for its component sub-populations ("point queries"). Compared to differentially private query answering systems that are not required to produce microdata, accuracy can degrade by a logarithmic factor. For example, in the case of pure differential privacy, without the microdata requirement, one can provide noisy answers to the sum query and all point queries while guaranteeing that each answer has squared error $O(1/\epsilon^2)$. With the microdata requirement, one must choose between allowing an additional $\log^2(d)$ factor ($d$ is the number of point queries) for some point queries or allowing an extra $O(d^2)$ factor for the sum query. We present lower bounds for pure, approximate, and concentrated differential privacy. We propose mitigation strategies and create a collection of benchmark datasets that can be used for public study of this problem.
Submitted 25 October, 2021;
originally announced October 2021.
-
Differentially Private Algorithms for 2020 Census Detailed DHC Race & Ethnicity
Authors:
Sam Haney,
William Sexton,
Ashwin Machanavajjhala,
Michael Hay,
Gerome Miklau
Abstract:
This article describes proposed differentially private (DP) algorithms that the US Census Bureau is considering to release the Detailed Demographic and Housing Characteristics (DHC) Race & Ethnicity tabulations as part of the 2020 Census. The tabulations contain statistics (counts) of demographic and housing characteristics of the entire population of the US crossed with detailed races and tribes at varying levels of geography. We describe two differentially private algorithmic strategies, one based on adding noise drawn from a two-sided Geometric distribution that satisfies "pure"-DP, and another based on adding noise from a Discrete Gaussian distribution that satisfies a well-studied variant of differential privacy, called Zero Concentrated Differential Privacy (zCDP). We analytically estimate the privacy loss parameters ensured by the two algorithms for comparable levels of error introduced in the statistics.
Submitted 22 July, 2021;
originally announced July 2021.
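The first strategy named in the abstract above, adding two-sided Geometric noise to a count, can be sketched as follows. This is a minimal illustration under stated assumptions, not the algorithm from the article: it assumes a single count query with sensitivity 1, samples the noise as the difference of two independent geometric variables with ratio exp(-epsilon), and relies on floating-point arithmetic, which is not safe for a production release. The names `two_sided_geometric`, `noisy_count`, and `two_sided_geometric_variance` are hypothetical.

```python
import math
import random


def two_sided_geometric(epsilon: float, rng: random.Random) -> int:
    """Sample Z with P(Z = k) proportional to exp(-epsilon * |k|).

    The difference of two i.i.d. geometric variables on {0, 1, ...}
    with ratio alpha = exp(-epsilon) has exactly this distribution.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")

    def geometric() -> int:
        u = 1.0 - rng.random()  # uniform on (0, 1], avoids log(0)
        # Inverse-CDF sampling: P(G >= k) = alpha ** k.
        return int(math.floor(-math.log(u) / epsilon))

    return geometric() - geometric()


def noisy_count(true_count: int, epsilon: float, rng: random.Random) -> int:
    """Release a count with two-sided geometric noise (sensitivity 1 assumed)."""
    return true_count + two_sided_geometric(epsilon, rng)


def two_sided_geometric_variance(epsilon: float) -> float:
    """Variance of the noise: 2 * alpha / (1 - alpha) ** 2."""
    alpha = math.exp(-epsilon)
    return 2.0 * alpha / (1.0 - alpha) ** 2


if __name__ == "__main__":
    rng = random.Random(2021)  # fixed seed for reproducibility only
    print(noisy_count(120, epsilon=0.5, rng=rng))
    print(two_sided_geometric_variance(0.5))
```

The closed-form variance gives one way to put the two strategies on a "comparable level of error": a Discrete Gaussian with parameter sigma^2 has variance at most sigma^2 and, for a sensitivity-1 query, satisfies zCDP with rho = 1/(2*sigma^2), so matching variances relates epsilon to rho approximately.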
-
Suboptimal Provision of Privacy and Statistical Accuracy When They are Public Goods
Authors:
John M. Abowd,
Ian M. Schmutte,
William Sexton,
Lars Vilhuber
Abstract:
With vast databases at their disposal, private tech companies can compete with public statistical agencies to provide population statistics. However, private companies face different incentives to provide high-quality statistics and to protect the privacy of the people whose data are used. When both privacy protection and statistical accuracy are public goods, private providers tend to produce at least one suboptimally, but it is not clear which. We model a firm that publishes statistics under a guarantee of differential privacy. We prove that provision by the private firm results in inefficiently low data quality in this framework.
Submitted 21 June, 2019;
originally announced June 2019.