Label differential privacy and private training data release
Article No. 132, Pages 3233-3251
Abstract
We study differentially private mechanisms for sharing training data in machine learning settings. Our goal is to enable learning of an accurate predictive model while protecting the privacy of each user's label. Previous work established privacy guarantees that assumed the features are public and given exogenously, a setting known as label differential privacy. In some scenarios, this can be a strong assumption that removes the interplay between features and labels from the privacy analysis. We relax this approach and instead assume the features are drawn from a distribution that depends on the private labels. We first show that simply adding noise to the label, as in previous work, can lead to an arbitrarily weak privacy guarantee, and also present methods for estimating this privacy loss from data. We then present a new mechanism that replaces some training examples with synthetically generated data, and show that our mechanism has a much better privacy-utility tradeoff if the synthetic data is realistic, in a certain quantifiable sense. Finally, we empirically validate our theoretical analysis.
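As a point of reference for the "adding noise to the label" baseline the abstract mentions, the sketch below shows the standard binary randomized-response mechanism, which satisfies ε-label-DP in the usual model where features are public. This is a generic illustration of label noising, not the paper's proposed mechanism; the function names are ours.

```python
import math
import random

def randomize_label(y: int, epsilon: float) -> int:
    """Binary randomized response: keep the true label with probability
    e^eps / (1 + e^eps), otherwise flip it. In the standard label-DP
    setting (features public and fixed), this is eps-label-DP."""
    keep_prob = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return y if random.random() < keep_prob else 1 - y

# Privatize a small label vector at eps = 1.0 (each label flips
# independently with probability 1 / (1 + e) ~ 0.27).
labels = [0, 1, 1, 0, 1]
noisy = [randomize_label(y, 1.0) for y in labels]
```

The paper's point is that when the features themselves are drawn from a distribution that depends on the private labels, the ε guarantee of this mechanism no longer tells the whole story: the features can leak label information that the label-noise analysis ignores.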
Publication Information
Publisher: JMLR.org
Published: 23 July 2023
Copyright © 2023