Label differential privacy and private training data release
Article No. 132, Pages 3233-3251
Abstract
We study differentially private mechanisms for sharing training data in machine learning settings. Our goal is to enable learning of an accurate predictive model while protecting the privacy of each user's label. Previous work established privacy guarantees that assumed the features are public and given exogenously, a setting known as label differential privacy. In some scenarios, this can be a strong assumption that removes the interplay between features and labels from the privacy analysis. We relax this approach and instead assume the features are drawn from a distribution that depends on the private labels. We first show that simply adding noise to the label, as in previous work, can lead to an arbitrarily weak privacy guarantee, and also present methods for estimating this privacy loss from data. We then present a new mechanism that replaces some training examples with synthetically generated data, and show that our mechanism has a much better privacy-utility tradeoff if the synthetic data is realistic, in a certain quantifiable sense. Finally, we empirically validate our theoretical analysis.
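As a point of reference for the "adding noise to the label" baseline the abstract mentions, the sketch below shows the standard binary randomized-response mechanism, which satisfies ε-label-DP in the usual model where features are public. This is a generic illustration of label noising, not the paper's proposed mechanism; the function names are ours.

```python
import math
import random

def randomize_label(y: int, epsilon: float) -> int:
    """Binary randomized response: keep the true label with probability
    e^eps / (1 + e^eps), otherwise flip it. In the standard label-DP
    setting (features public and fixed), this is eps-label-DP."""
    keep_prob = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return y if random.random() < keep_prob else 1 - y

# Privatize a small label vector at eps = 1.0 (each label flips
# independently with probability 1 / (1 + e) ~ 0.27).
labels = [0, 1, 1, 0, 1]
noisy = [randomize_label(y, 1.0) for y in labels]
```

The paper's point is that when the features themselves are drawn from a distribution that depends on the private labels, the ε guarantee of this mechanism no longer tells the whole story: the features can leak label information that the label-noise analysis ignores.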
Publication Information
Publisher: JMLR.org
Published: 23 July 2023
Copyright © 2023