DOI: 10.1145/2882903.2915248
research-article

PrivateClean: Data Cleaning and Differential Privacy

Published: 14 June 2016

    Abstract

    Recent advances in differential privacy make it possible to guarantee user privacy while preserving the main characteristics of the data. However, most differential privacy mechanisms assume that the underlying dataset is clean. This paper explores the link between data cleaning and differential privacy in a framework we call PrivateClean. PrivateClean includes a technique for creating private datasets of numerical and discrete-valued attributes, a formalism for privacy-preserving data cleaning, and techniques for answering sum, count, and avg queries after cleaning. We show: (1) how the degree of privacy affects subsequent aggregate query accuracy, (2) how privacy potentially amplifies certain types of errors in a dataset, and (3) how this analysis can be used to tune the degree of privacy. The key insight is to maintain a bipartite graph relating dirty values to clean values and use this graph to estimate biases due to the interaction between cleaning and privacy. We validate these results on four datasets with a variety of well-studied cleaning techniques including using functional dependencies, outlier filtering, and resolving inconsistent attributes.
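
    As a rough illustration of point (1), how the degree of privacy affects the accuracy of a count query, the sketch below applies a generic randomized-response scheme to a discrete attribute and then corrects the observed count for the known randomization bias. It is a minimal sketch under stated assumptions, not the paper's implementation: the function names, the parameter `p`, and the uniform-replacement scheme are all illustrative choices, and every original value is assumed to lie in `domain`.

```python
import random


def randomize(values, domain, p, rng):
    """Hypothetical helper: with probability p, replace a value with a
    uniform draw from `domain`; otherwise keep the original value."""
    return [rng.choice(domain) if rng.random() < p else v for v in values]


def estimate_count(private_values, target, domain, p):
    """Bias-corrected estimate of how many original values equal `target`.

    Under the scheme above, E[observed] = (1 - p) * true + p * n / |domain|,
    so the estimate inverts that relation. Requires p < 1.
    """
    n = len(private_values)
    observed = sum(1 for v in private_values if v == target)
    return (observed - p * n / len(domain)) / (1 - p)


# Usage on synthetic data (illustrative values only).
rng = random.Random(0)
domain = ["A", "B", "C", "D"]
data = ["A"] * 3000 + ["B"] * 1000 + ["C"] * 500 + ["D"] * 500
private = randomize(data, domain, p=0.3, rng=rng)
est = estimate_count(private, "A", domain, p=0.3)
print(f"true count = 3000, estimated count = {est:.1f}")
```

    Because the correction divides by (1 - p), a larger `p` (more privacy) inflates the variance of the estimate, which is the kind of privacy/accuracy trade-off the abstract refers to.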


    Published In

    SIGMOD '16: Proceedings of the 2016 International Conference on Management of Data
    June 2016
    2300 pages
    ISBN: 9781450335317
    DOI: 10.1145/2882903
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. data cleaning
    2. differential privacy
    3. local differential privacy

    Qualifiers

    • Research-article

    Funding Sources

    • DARPA XData Award
    • NSF CISE Expeditions Award
    • DOE Award

    Conference

    SIGMOD/PODS'16
    SIGMOD/PODS'16: International Conference on Management of Data
    June 26 - July 1, 2016
    San Francisco, California, USA

    Acceptance Rates

    Overall Acceptance Rate 785 of 4,003 submissions, 20%

    Article Metrics

    • Downloads (last 12 months): 92
    • Downloads (last 6 weeks): 2
    Reflects downloads up to 27 Jul 2024

    Cited By

    • (2024) Privacy-Preserving Data Collection and Analysis for Smart Cities. Human-Centered Services Computing for Smart Cities (157-209). DOI: 10.1007/978-981-97-0779-9_5. Online publication date: 5-May-2024
    • (2023) Private Collaborative Data Cleaning via Non-Equi PSI. 2023 IEEE Symposium on Security and Privacy (SP) (1419-1434). DOI: 10.1109/SP46215.2023.10179396. Online publication date: May-2023
    • (2023) Private Collaborative Data Cleaning via Non-Equi PSI. 2023 IEEE Symposium on Security and Privacy (SP) (1419-1434). DOI: 10.1109/SP46215.2023.10179337. Online publication date: May-2023
    • (2022) Differentially Private k-Nearest Neighbor Missing Data Imputation. ACM Transactions on Privacy and Security 25:3 (1-23). DOI: 10.1145/3507952. Online publication date: 9-Apr-2022
    • (2022) Private True Data Mining: Differential Privacy Featuring Errors to Manage Internet-of-Things Data. IEEE Access 10 (8738-8757). DOI: 10.1109/ACCESS.2022.3143813. Online publication date: 2022
    • (2021) Federated Data Cleaning: Collaborative and Privacy-Preserving Data Cleaning for Edge Intelligence. IEEE Internet of Things Journal 8:8 (6757-6770). DOI: 10.1109/JIOT.2020.3027980. Online publication date: 15-Apr-2021
    • (2021) MISS: finding optimal sample sizes for approximate analytics. Distributed and Parallel Databases 40:1 (165-200). DOI: 10.1007/s10619-021-07376-5. Online publication date: 21-Oct-2021
    • (2020) Privacy-aware data cleaning-as-a-service. Information Systems 94 (101608). DOI: 10.1016/j.is.2020.101608. Online publication date: Dec-2020
    • (2020) Efficient Discrete Distribution Estimation Schemes Under Local Differential Privacy. Frontiers in Cyber Security (508-523). DOI: 10.1007/978-981-15-9739-8_38. Online publication date: 4-Nov-2020
    • (2019) Utility-optimized local differential privacy mechanisms for distribution estimation. Proceedings of the 28th USENIX Conference on Security Symposium (1877-1894). DOI: 10.5555/3361338.3361468. Online publication date: 14-Aug-2019
