PrivateClean: Data Cleaning and Differential Privacy

Authors: Sanjay Krishnan, Michael J. Franklin, Tim Kraska

SIGMOD '16: Proceedings of the 2016 International Conference on Management of Data

Pages 937 - 951

https://doi.org/10.1145/2882903.2915248

Published: 14 June 2016

Abstract

Recent advances in differential privacy make it possible to guarantee user privacy while preserving the main characteristics of the data. However, most differential privacy mechanisms assume that the underlying dataset is clean. This paper explores the link between data cleaning and differential privacy in a framework we call PrivateClean. PrivateClean includes a technique for creating private datasets of numerical and discrete-valued attributes, a formalism for privacy-preserving data cleaning, and techniques for answering sum, count, and avg queries after cleaning. We show: (1) how the degree of privacy affects subsequent aggregate query accuracy, (2) how privacy potentially amplifies certain types of errors in a dataset, and (3) how this analysis can be used to tune the degree of privacy. The key insight is to maintain a bipartite graph relating dirty values to clean values and use this graph to estimate biases due to the interaction between cleaning and privacy. We validate these results on four datasets with a variety of well-studied cleaning techniques including using functional dependencies, outlier filtering, and resolving inconsistent attributes.
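
As an illustration of the interaction the abstract describes, the sketch below privatizes a discrete attribute with a simple randomized-response scheme, cleans it with a dirty-to-clean value mapping, and corrects a count query for the resulting bias. It is a minimal sketch under stated assumptions, not the paper's implementation: the abstract does not name the randomization mechanism, and the function names (`randomize`, `estimate_count`), the parameter `p`, and the city values are all hypothetical.

```python
import random


def randomize(values, domain, p, rng):
    """With probability p, replace a value by a uniform draw from the domain."""
    return [rng.choice(domain) if rng.random() < p else v for v in values]


def estimate_count(private_values, cleaner, target, domain, p):
    """Bias-corrected count of rows whose cleaned value equals `target`.

    `cleaner` maps dirty values to clean values (the edges of a
    dirty-to-clean bipartite graph); unmapped values are kept as-is.
    Requires p < 1.
    """
    s = len(private_values)
    # Number of dirty domain values whose clean image is the target.
    l = sum(1 for d in domain if cleaner.get(d, d) == target)
    observed = sum(1 for v in private_values if cleaner.get(v, v) == target)
    # A randomized row lands on one of those l values with probability l / N,
    # so E[observed] = true_count * (1 - p) + s * p * l / N.
    replaced_hit = p * l / len(domain)
    return (observed - s * replaced_hit) / (1 - p)


# Hypothetical data: "New York" is a dirty spelling of "NYC".
domain = ["NYC", "New York", "SF", "LA"]
cleaner = {"New York": "NYC"}
values = ["NYC"] * 300 + ["New York"] * 100 + ["SF"] * 400 + ["LA"] * 200

rng = random.Random(0)
private = randomize(values, domain, p=0.2, rng=rng)
naive = sum(1 for v in private if cleaner.get(v, v) == "NYC")
corrected = estimate_count(private, cleaner, "NYC", domain, p=0.2)
print(f"true=400 naive={naive} corrected={corrected:.1f}")
```

Because two dirty values map to "NYC" here, a randomized row is twice as likely to land on the cleaned target as on an unmerged value. The correction term therefore depends on the cleaning mapping, not only on `p`, which is one simple way privacy can amplify an error after cleaning.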

Index Terms

PrivateClean: Data Cleaning and Differential Privacy
1. Information systems
  1. Information systems applications
    1. Data mining
      1. Data cleaning
2. Security and privacy
  1. Security services
    1. Privacy-preserving protocols

Published In

SIGMOD '16: Proceedings of the 2016 International Conference on Management of Data

June 2016

2300 pages

ISBN:9781450335317

DOI:10.1145/2882903

General Chairs: Fatma Özcan (IBM Research, USA), Georgia Koutrika (HP Labs, USA)

Program Chair: Sam Madden (Massachusetts Institute of Technology, USA)

Copyright © 2016 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

DARPA XData Award
NSF CISE Expeditions Award
DOE Award

Conference

SIGMOD/PODS'16: International Conference on Management of Data

Sponsor: SIGMOD

June 26 - July 1, 2016

San Francisco, California, USA

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%

