DOI: 10.1145/3579856.3590338
Research article

Privacy-Preserving Record Linkage for Cardinality Counting

Published: 10 July 2023

Abstract

Several applications require counting the number of distinct items in a dataset, a task known as the cardinality counting problem. Examples include health applications, such as counting patients with a rare disease to support awareness and funding or counting cases of a new disease for outbreak detection; marketing applications, such as measuring the reach of a new product; and cybersecurity applications, such as tracking the number of unique views of social media posts. The data needed for counting is, however, often personal and sensitive, and needs to be processed using privacy-preserving techniques. Data quality issues across databases, such as typos, errors, and variations, pose additional challenges for accurate cardinality estimation. While privacy-preserving cardinality counting has gained much attention recently and a few privacy-preserving algorithms have been developed for cardinality estimation, no work has so far addressed privacy-preserving cardinality counting using record linkage techniques with fuzzy matching and provable privacy guarantees. We propose a novel privacy-preserving record linkage algorithm that uses unsupervised clustering to link and count individuals across multiple datasets without compromising their privacy or identity. In addition, existing Elbow methods for choosing the optimal number of clusters, which here serves as the cardinality estimate, are far from accurate because they do not take into account the purity and completeness of the generated clusters. We therefore propose a novel method to find the optimal number of clusters in unsupervised learning. Our experimental results on real and synthetic datasets are highly promising: the method achieves a significantly smaller error rate, less than 0.1 with a privacy budget ϵ = 1.0, than the state-of-the-art fuzzy matching and clustering method.
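
To make the general idea concrete, the following is a minimal Python sketch of cardinality counting via fuzzy matching on Bloom filter encodings. It is not the algorithm proposed in the paper: the function names (`qgrams`, `bloom_encode`, `dice`, `estimate_cardinality`), the bigram tokenisation, the filter parameters, the similarity threshold, and the greedy clustering step are all assumptions chosen for illustration, and no differential-privacy noise is applied.

```python
import hashlib


def qgrams(value, q=2):
    """Split a string into its set of overlapping character q-grams."""
    s = value.lower().replace(" ", "")
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def bloom_encode(value, num_bits=1024, num_hashes=4):
    """Encode a string as the set of bit positions set in a Bloom filter.

    Each q-gram is hashed num_hashes times (illustrative parameters only).
    """
    bits = set()
    for gram in qgrams(value):
        for k in range(num_hashes):
            digest = hashlib.sha256(f"{k}:{gram}".encode("utf-8")).hexdigest()
            bits.add(int(digest, 16) % num_bits)
    return frozenset(bits)


def dice(a, b):
    """Dice coefficient between two sets of bit positions."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


def estimate_cardinality(records, threshold=0.7):
    """Greedy clustering: a record starts a new cluster only if it is not
    similar enough to the representative of any existing cluster.
    The number of clusters is returned as the distinct-count estimate."""
    representatives = []
    for record in records:
        encoded = bloom_encode(record)
        if not any(dice(encoded, rep) >= threshold for rep in representatives):
            representatives.append(encoded)
    return len(representatives)


if __name__ == "__main__":
    names = ["john smith", "jon smith", "mary jones", "mary jones", "peter christen"]
    print(estimate_cardinality(names))
```

In this sketch the typo variant "jon smith" and the exact duplicate "mary jones" fall into existing clusters, so the estimate counts individuals rather than raw records. A privacy-preserving variant would additionally perturb the encodings before comparison, at a cost in accuracy governed by the privacy budget ϵ.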


Cited By

  • (2024) Cardinality Counting in "Alcatraz": A Privacy-aware Federated Learning Approach. Proceedings of the ACM Web Conference 2024, 3076-3084. DOI: 10.1145/3589334.3645655. Online publication date: 13 May 2024.
  • (2024) Differential Cryptanalysis of Bloom Filters for Privacy-Preserving Record Linkage. IEEE Transactions on Information Forensics and Security 19, 6665-6678. DOI: 10.1109/TIFS.2024.3421292. Online publication date: 2024.


Published In

ASIA CCS '23: Proceedings of the 2023 ACM Asia Conference on Computer and Communications Security
July 2023
1066 pages
ISBN:9798400700989
DOI:10.1145/3579856

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Bloom filters
  2. Probabilistic counting
  3. differential privacy
  4. distinct-counting
  5. fuzzy matching
  6. unsupervised learning

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ASIA CCS '23

Acceptance Rates

Overall Acceptance Rate 418 of 2,322 submissions, 18%
