
On Data Publishing with Clustering Preservation

Published: 01 April 2015

    Abstract

    The emergence of cloud-based storage services is opening up new avenues in data exchange and data dissemination. This has amplified the interest in right-protection mechanisms to establish ownership in the event of data leakage. Current right-protection technologies, however, rarely provide strong guarantees on dataset utility after the protection process. This work presents techniques that explicitly address this topic and provably preserve the outcome of certain mining operations. In particular, we take special care to guarantee that the outcome of hierarchical clustering operations remains the same before and after right protection. Our approach considers all prevalent hierarchical clustering variants: single-, complete-, and average-linkage. We imprint the ownership in a dataset using watermarking principles, and we derive tight bounds on the expansion/contraction of distances incurred by the process. We leverage our analysis to design fast algorithms for right protection without exhaustively searching the vast design space. Finally, because the right-protection process introduces a user-tunable distortion on the dataset, we explore the possibility of using this mechanism for data obfuscation. We quantify the tradeoff between obfuscation and utility for spatiotemporal datasets and discover very favorable characteristics of the process. An additional advantage is that when one is interested in both right-protecting and obfuscating the original data values, the proposed mechanism can accomplish both tasks simultaneously.
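    The invariance property described in the abstract can be illustrated with a small, self-contained check. The sketch below is not the paper's algorithm: it assumes a hypothetical key-driven multiplicative perturbation (`watermark`, with a `power` parameter standing in for the user-tunable distortion) applied directly to the coordinates, and a naive agglomerative routine (`merge_order`) for single-, complete-, and average-linkage. It searches a list of candidate powers by brute force, which is exactly the exhaustive search the paper's distance bounds are meant to avoid.

```python
import itertools
import math
import random


def watermark(points, key, power):
    """Hypothetical stand-in for right protection: scale each dimension
    by (1 + power * s), where s is a secret +/-1 sign derived from key."""
    rng = random.Random(key)
    signs = [rng.choice((-1, 1)) for _ in range(len(points[0]))]
    return [tuple(x * (1 + power * s) for x, s in zip(pt, signs))
            for pt in points]


def merge_order(points, linkage="single"):
    """Naive agglomerative clustering; returns the sequence of merges,
    each merge being the pair of clusters that were joined."""
    dist = {(i, j): math.dist(points[i], points[j])
            for i, j in itertools.combinations(range(len(points)), 2)}
    agg = {"single": min,
           "complete": max,
           "average": lambda ds: sum(ds) / len(ds)}[linkage]
    clusters = [frozenset([i]) for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        def gap(pair):
            a, b = pair
            return agg([dist[min(i, j), max(i, j)] for i in a for j in b])
        a, b = min(itertools.combinations(clusters, 2), key=gap)
        merges.append(frozenset([a, b]))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return merges


def largest_safe_power(points, key, candidates, linkage):
    """Brute-force search: largest candidate power that leaves the
    merge sequence (and hence the dendrogram topology) unchanged."""
    reference = merge_order(points, linkage)
    safe = [p for p in candidates
            if merge_order(watermark(points, key, p), linkage) == reference]
    return max(safe) if safe else None


if __name__ == "__main__":
    pts = [(1, 1), (2, 1), (10, 10), (10, 13), (30, 5)]
    for linkage in ("single", "complete", "average"):
        best = largest_safe_power(pts, key=42,
                                  candidates=[0.0, 0.01, 0.05, 0.2, 0.5],
                                  linkage=linkage)
        print(linkage, "largest preserving power among candidates:", best)
```

    Comparing merge sequences is a stricter test than comparing flat cluster labels, because it requires every level of the hierarchy to agree. A realistic implementation would embed the watermark in a transform domain and use analytical bounds on distance distortion rather than re-clustering for every candidate power.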



        Information & Contributors

        Information

        Published In

        ACM Transactions on Knowledge Discovery from Data, Volume 9, Issue 3
        TKDD Special Issue (SIGKDD'13)
        April 2015
        313 pages
        ISSN:1556-4681
        EISSN:1556-472X
        DOI:10.1145/2737800
        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        Published: 01 April 2015
        Accepted: 01 September 2014
        Revised: 01 April 2014
        Received: 01 November 2013
        Published in TKDD Volume 9, Issue 3


        Author Tags

        1. Distance-Based Mining
        2. Distortion Estimation
        3. Restricted Isometry Property
        4. Watermarking

        Qualifiers

        • Research-article
        • Research
        • Refereed

        Funding Sources

        • European Research Council under the European Union's Seventh Framework Programme (FP7/2007-2013)/ERC

        Cited By

        • (2023) A taxonomy of data governance decision domains in data marketplaces. Electronic Markets 33:1. DOI: 10.1007/s12525-023-00631-w. Online publication date: 22-May-2023
        • (2019) Context-Driven Granular Disclosure Control for Internet of Things Applications. IEEE Transactions on Big Data 5:3 (408-422). DOI: 10.1109/TBDATA.2017.2737463. Online publication date: 1-Sep-2019
        • (2017) A comparative study on innovative approaches for privacy-preservation in knowledge discovery. Proceedings of the 9th International Conference on Information Management and Engineering (120-127). DOI: 10.1145/3149572.3149586. Online publication date: 9-Oct-2017
        • (2017) Scalable Role-Based Data Disclosure Control for the Internet of Things. 2017 IEEE 37th International Conference on Distributed Computing Systems (ICDCS) (2226-2233). DOI: 10.1109/ICDCS.2017.307. Online publication date: Jun-2017
        • (2016) On the Properties of Non-Media Digital Watermarking: A Review of State of the Art Techniques. IEEE Access 4 (2670-2704). DOI: 10.1109/ACCESS.2016.2570812. Online publication date: 2016
