research-article

Privacy-preserving big data publishing

Authors:

Hessam Zakerzadeh,

Charu C. Aggarwal,

Ken Barker

SSDBM '15: Proceedings of the 27th International Conference on Scientific and Statistical Database Management

Article No.: 26, Pages 1 - 11

https://doi.org/10.1145/2791347.2791380

Published: 29 June 2015

Abstract

The problem of privacy-preserving data mining has been studied extensively in recent years because of its importance as a key enabler in the sharing of massive data sets. Most work on privacy has focused on the quality of privacy preservation and utility, with little attention paid to the scalability of privacy preservation. The reason is that anonymization has generally been seen as a one-time batch process in the context of data sharing. In recent years, however, data sets have grown to the point where effectively applying current algorithms is becoming increasingly difficult. Furthermore, the transient nature of recent data sets has increased the need to apply such methods repeatedly to newly collected data. Repeated application demands even greater computational efficiency in order to be practical. For example, an algorithm with quadratic complexity is unlikely to run in reasonable time over terabyte-scale data sets. A bigger issue is that larger data sets are likely to be handled by distributed frameworks such as MapReduce. In such frameworks, one must also minimize data transfer across nodes, which is the bottleneck. In this paper, we discuss the first approach towards privacy-preserving data mining of massive data sets using MapReduce. We study two of the most widely used privacy models, k-anonymity and l-diversity, for anonymization, and present experimental results illustrating the effectiveness of the approach.
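To make the two privacy models named in the abstract concrete, the following is a minimal sketch in plain Python of how a map/combine/reduce pipeline could check whether an already generalized table satisfies k-anonymity and distinct l-diversity (the simplest l-diversity variant). It is not the paper's algorithm. The function names, the attribute names (`age`, `zip`, `disease`), and the two in-memory "partitions" standing in for cluster nodes are all assumptions made for illustration.

```python
from collections import defaultdict

def map_record(record, quasi_identifiers, sensitive):
    """Map step: emit (quasi-identifier tuple, sensitive value) for one record."""
    key = tuple(record[q] for q in quasi_identifiers)
    return key, record[sensitive]

def combine(pairs):
    """Combiner-style local aggregation on one node.

    Only a count and the set of distinct sensitive values per key would need
    to leave the node, rather than every record.
    """
    groups = defaultdict(lambda: [0, set()])
    for key, value in pairs:
        groups[key][0] += 1
        groups[key][1].add(value)
    return groups

def reduce_groups(partials):
    """Reduce step: merge the per-node summaries into global equivalence classes."""
    merged = defaultdict(lambda: [0, set()])
    for part in partials:
        for key, (count, values) in part.items():
            merged[key][0] += count
            merged[key][1] |= values
    return merged

def satisfies(groups, k, l):
    """True if every class has at least k records and l distinct sensitive values."""
    return all(count >= k and len(values) >= l
               for count, values in groups.values())

# Hypothetical, already generalized records split across two partitions.
partition_a = [
    {"age": "20-29", "zip": "476**", "disease": "flu"},
    {"age": "20-29", "zip": "476**", "disease": "cancer"},
    {"age": "30-39", "zip": "479**", "disease": "flu"},
]
partition_b = [
    {"age": "20-29", "zip": "476**", "disease": "flu"},
    {"age": "30-39", "zip": "479**", "disease": "asthma"},
]

qi = ["age", "zip"]
partials = [
    combine(map_record(r, qi, "disease") for r in partition)
    for partition in (partition_a, partition_b)
]
groups = reduce_groups(partials)

print(satisfies(groups, k=2, l=2))  # True
print(satisfies(groups, k=3, l=2))  # False: one class has only 2 records
```

The combiner step reflects the abstract's point that data transfer between nodes is the bottleneck: each partition forwards one small summary per equivalence class instead of its raw records.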


Published In

SSDBM '15: Proceedings of the 27th International Conference on Scientific and Statistical Database Management

June 2015

390 pages

ISBN:9781450337090

DOI:10.1145/2791347

Editors: Amarnath Gupta (University of California San Diego), Susan Rathbun (University of California San Diego)

Copyright © 2015 ACM.

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

SSDBM 2015

SSDBM 2015: International Conference on Scientific and Statistical Database Management

June 29 - July 1, 2015

La Jolla, California

Acceptance Rates

Overall Acceptance Rate 56 of 146 submissions, 38%
