DOI: 10.1145/2791347.2791380
research-article

Privacy-preserving big data publishing

Published: 29 June 2015

    Abstract

    The problem of privacy-preserving data mining has been studied extensively in recent years because of its importance as a key enabler in the sharing of massive data sets. Most work on privacy has focused on the quality of privacy preservation and on utility, with little attention paid to the scalability of privacy preservation. The reason is that anonymization has generally been seen as a one-time batch process in the context of data sharing. However, in recent years, data sets have grown to the point where the effective application of current algorithms is becoming increasingly difficult. Furthermore, the transient nature of recent data sets has increased the need to apply such methods repeatedly to newly collected data. Repeated application demands even greater computational efficiency in order to be practical. For example, an algorithm with quadratic complexity is unlikely to run in reasonable time over terabyte-scale data sets. A bigger issue is that larger data sets are likely to be processed by distributed frameworks such as MapReduce. In such frameworks, one must also minimize data transfer across nodes, which is the bottleneck. In this paper, we discuss the first approach towards privacy-preserving data mining of very massive data sets using MapReduce. We study two of the most widely used privacy models, k-anonymity and l-diversity, for anonymization, and present experimental results illustrating the effectiveness of the approach.
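
    To make the two privacy models named in the abstract concrete, the sketch below checks k-anonymity and distinct l-diversity on a toy table, using a map step and a reduce step in the spirit of MapReduce. It is an illustration only and not the paper's anonymization algorithm. The function names, the attribute names (age, zip, disease), and the sample records are all assumptions introduced here, and "distinct" l-diversity (at least l different sensitive values per group) is the simplest variant of the model.

```python
from collections import defaultdict


def map_record(record, quasi_identifiers, sensitive):
    """Map step: emit (quasi-identifier tuple, sensitive value) for one record."""
    key = tuple(record[q] for q in quasi_identifiers)
    return key, record[sensitive]


def reduce_groups(pairs):
    """Reduce step: per key, count records and collect distinct sensitive values."""
    groups = defaultdict(lambda: {"size": 0, "values": set()})
    for key, value in pairs:
        groups[key]["size"] += 1
        groups[key]["values"].add(value)
    return dict(groups)


def is_k_anonymous(groups, k):
    """Every quasi-identifier group must contain at least k records."""
    return all(g["size"] >= k for g in groups.values())


def is_distinct_l_diverse(groups, l):
    """Every group must contain at least l distinct sensitive values."""
    return all(len(g["values"]) >= l for g in groups.values())


# Hypothetical, already-generalized records (made up for this example).
records = [
    {"age": "30-39", "zip": "476**", "disease": "flu"},
    {"age": "30-39", "zip": "476**", "disease": "cancer"},
    {"age": "40-49", "zip": "479**", "disease": "flu"},
    {"age": "40-49", "zip": "479**", "disease": "flu"},
]

groups = reduce_groups(map_record(r, ["age", "zip"], "disease") for r in records)
print(is_k_anonymous(groups, 2))         # both groups hold 2 records
print(is_distinct_l_diverse(groups, 2))  # the second group holds only "flu"
```

    In this toy table both groups contain two records, so the table is 2-anonymous, but the second group has a single sensitive value, so it is not 2-diverse. In a real distributed setting the map output would be shuffled by key across nodes, which is where the data-transfer cost mentioned in the abstract arises.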

    Published In

    SSDBM '15: Proceedings of the 27th International Conference on Scientific and Statistical Database Management
    June 2015
    390 pages
    ISBN: 9781450337090
    DOI: 10.1145/2791347
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Conference

    SSDBM 2015

    Acceptance Rates

    Overall Acceptance Rate 56 of 146 submissions, 38%

    Cited By

    • (2024) Blockchain-Based Privacy-Preservation Platform for Data Storage and Query Processing. Ubiquitous Security, pp. 380-400. DOI: 10.1007/978-981-97-1274-8_25. Online publication date: 13-Mar-2024.
    • (2023) Fortified MapReduce Layer: Elevating Security and Privacy in Big Data. ICST Transactions on Scalable Information Systems. DOI: 10.4108/eetsis.3859. Online publication date: 2-Oct-2023.
    • (2023) Dimensions and Hadoop of Big Data. A Review. Futuristic Projects in Energy and Automation Sectors: A Brief Review of New Technologies Driving Sustainable Development, pp. 323-333. DOI: 10.2174/9789815080537123010020. Online publication date: 18-May-2023.
    • (2023) CryptoDataMR: Enhancing the Data Protection Using Cryptographic Hash and Encryption/Decryption Through MapReduce Programming Model. International Conference on Innovative Computing and Communications, pp. 95-115. DOI: 10.1007/978-981-99-3315-0_9. Online publication date: 23-Jul-2023.
    • (2023) Privacy Preservation in Big Data Analytics. Granular, Fuzzy, and Soft Computing, pp. 649-669. DOI: 10.1007/978-1-0716-2628-3_755. Online publication date: 30-Mar-2023.
    • (2022) Improved l-diversity. Journal of King Saud University - Computer and Information Sciences, 34:4, pp. 1423-1430. DOI: 10.1016/j.jksuci.2019.08.006. Online publication date: 1-Apr-2022.
    • (2022) QAPP. Information Fusion, 88:C, pp. 281-295. DOI: 10.1016/j.inffus.2022.07.011. Online publication date: 1-Dec-2022.
    • (2022) DHkmeans-ℓdiversity: distributed hierarchical K-means for satisfaction of the ℓ-diversity privacy model using Apache Spark. The Journal of Supercomputing, 78:2, pp. 2616-2650. DOI: 10.1007/s11227-021-03958-3. Online publication date: 1-Feb-2022.
    • (2022) A Survey on Algorithms in Game Theory in Big Data. International Conference on Computing, Communication, Electrical and Biomedical Systems, pp. 155-166. DOI: 10.1007/978-3-030-86165-0_15. Online publication date: 28-Feb-2022.
    • (2022) A Collaborative Data Publishing Model with Privacy Preservation Using Group‐Based Classification and Anonymity. Machine Learning Paradigm for Internet of Things Applications, pp. 53-66. DOI: 10.1002/9781119763499.ch3. Online publication date: 4-Apr-2022.
