AIM: an adaptive and iterative mechanism for differentially private synthetic data

Published: 01 July 2022

Abstract

We propose AIM, a new algorithm for differentially private synthetic data generation. AIM is a workload-adaptive algorithm within the paradigm of algorithms that first select a set of queries, then privately measure those queries, and finally generate synthetic data from the noisy measurements. It uses a set of innovative features to iteratively select the most useful measurements, reflecting both their relevance to the workload and their value in approximating the input data. We also provide analytic expressions that bound per-query error with high probability; these can be used to construct confidence intervals and inform users about the accuracy of the generated data. We show empirically that AIM consistently outperforms a wide variety of existing mechanisms across a range of experimental settings.
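
The select-measure-generate paradigm described above can be illustrated with a toy sketch. This is not AIM itself: it swaps in a much simpler multiplicative-weights update in the style of MWEM (Hardt, Ligett, and McSherry, 2012) over a small, fully enumerated domain, and it assumes add/remove-one-record neighbors, a publicly known record count, and an even split of a pure-epsilon budget across rounds. The function names and parameters (`toy_select_measure_generate`, `marginal`, `expand`) are hypothetical and chosen only for illustration.

```python
import itertools
import numpy as np


def marginal(table, attrs):
    """Sum a full contingency table down to `attrs` (a sorted tuple of axes)."""
    drop = tuple(i for i in range(table.ndim) if i not in attrs)
    return table.sum(axis=drop)


def expand(marg, attrs, ndim):
    """Reshape a marginal so that it broadcasts against the full table."""
    shape = [1] * ndim
    for size, axis in zip(marg.shape, attrs):
        shape[axis] = size
    return marg.reshape(shape)


def toy_select_measure_generate(data, shape, workload, rounds, epsilon, seed=0):
    """Toy MWEM-style loop; an illustrative assumption, not the AIM algorithm."""
    rng = np.random.default_rng(seed)
    n, ndim = len(data), len(shape)

    # Exact contingency table of the sensitive data (never released directly).
    true = np.zeros(shape)
    np.add.at(true, tuple(data.T), 1)

    # Start from a uniform model scaled to n records (n assumed public).
    model = np.full(shape, n / np.prod(shape))
    eps_round = epsilon / rounds  # basic sequential composition

    for _ in range(rounds):
        # Select: exponential mechanism over workload marginals, scored by the
        # model's current L1 error (sensitivity 1), using half the round budget.
        scores = np.array([
            np.abs(marginal(true, q) - marginal(model, q)).sum() for q in workload
        ])
        logits = (eps_round / 2) * scores / 2
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        q = workload[rng.choice(len(workload), p=probs)]

        # Measure: Laplace noise on the chosen marginal (L1 sensitivity 1),
        # using the other half of the round budget.
        exact = marginal(true, q)
        noisy = exact + rng.laplace(scale=2 / eps_round, size=exact.shape)

        # Update the model: multiplicative-weights step toward the measurement.
        diff = expand(noisy - marginal(model, q), q, ndim)
        model = model * np.exp(diff / (2 * n))
        model *= n / model.sum()

    # Generate: sample n synthetic records from the fitted distribution.
    cells = rng.choice(model.size, size=n, p=model.ravel() / model.sum())
    return np.column_stack(np.unravel_index(cells, shape))


# Usage on made-up data: three binary attributes, workload of all 2-way marginals.
rng = np.random.default_rng(1)
data = rng.integers(0, 2, size=(500, 3))
data[:, 1] = data[:, 0]  # make attributes 0 and 1 perfectly correlated
workload = list(itertools.combinations(range(3), 2))
synth = toy_select_measure_generate(data, (2, 2, 2), workload, rounds=3, epsilon=1.0)
print(synth.shape)
```

Enumerating the full domain only works for tiny examples like this one; it is used here solely to keep the generate step short.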

Published In

Proceedings of the VLDB Endowment  Volume 15, Issue 11
July 2022
980 pages
ISSN:2150-8097

Publisher

VLDB Endowment

Qualifiers

  • Research-article

Article Metrics

  • Downloads (Last 12 months): 28
  • Downloads (Last 6 weeks): 8
Reflects downloads up to 09 Nov 2024

Cited By

  • (2024) Differentially Private Data Generation with Missing Data. Proceedings of the VLDB Endowment 17:8 (2022-2035). DOI: 10.14778/3659437.3659455. Online publication date: 1-Apr-2024
  • (2024) Performance Truthfulness of Differential Privacy for DB Testing. Proceedings of the Tenth International Workshop on Testing Database Systems (30-35). DOI: 10.1145/3662165.3662762. Online publication date: 9-Jun-2024
  • (2024) Continual Release of Differentially Private Synthetic Data from Longitudinal Data Collections. Proceedings of the ACM on Management of Data 2:2 (1-26). DOI: 10.1145/3651595. Online publication date: 14-May-2024
  • (2024) NetDPSyn: Synthesizing Network Traces under Differential Privacy. Proceedings of the 2024 ACM on Internet Measurement Conference (545-554). DOI: 10.1145/3646547.3689011. Online publication date: 4-Nov-2024
  • (2024) ABSyn: An Accurate Differentially Private Data Synthesis Scheme With Adaptive Selection and Batch Processes. IEEE Transactions on Information Forensics and Security 19 (8338-8352). DOI: 10.1109/TIFS.2024.3453175. Online publication date: 1-Jan-2024
  • (2024) A Comparison of SynDiffix Multi-table Versus Single-table Synthetic Data. Privacy in Statistical Databases (161-177). DOI: 10.1007/978-3-031-69651-0_11. Online publication date: 25-Sep-2024
  • (2023) DP-PQD: Privately Detecting Per-Query Gaps in Synthetic Data Generated by Black-Box Mechanisms. Proceedings of the VLDB Endowment 17:1 (65-78). DOI: 10.14778/3617838.3617844. Online publication date: 1-Sep-2023
  • EdgeSyn: Privacy-preserving Data Publishing on Edge Network over Infinite Multimedia Data Stream. ACM Transactions on Multimedia Computing, Communications, and Applications. DOI: 10.1145/3696421
