High-Value Token-Blocking: Efficient Blocking Method for Record Linkage

Published: 21 July 2021

Abstract

Data integration is an important component of Big Data analytics. One of the key challenges in data integration is record linkage, that is, matching records that represent the same real-world entity. Because of computational costs, methods referred to as blocking are employed as a part of the record linkage pipeline in order to reduce the number of comparisons among records. In the past decade, a range of blocking techniques have been proposed. Real-world applications require approaches that can handle heterogeneous data sources and do not rely on labelled data. We propose high-value token-blocking (HVTB), a simple and efficient approach for blocking that is unsupervised and schema-agnostic, based on a crafted use of Term Frequency-Inverse Document Frequency. We compare HVTB with multiple methods and over a range of datasets, including a novel unstructured dataset composed of titles and abstracts of scientific papers. We thoroughly discuss results in terms of accuracy, use of computational resources, and different characteristics of datasets and records. The simplicity of HVTB yields fast computations and does not harm its accuracy when compared with existing approaches. It is shown to be significantly superior to other methods, suggesting that simpler methods for blocking should be considered before resorting to more sophisticated methods.
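
The abstract does not spell out the algorithm, so the following is only a minimal sketch of how a TF-IDF-driven, schema-agnostic token blocker might look; it should not be read as the authors' HVTB implementation. The function names, the `top_k` parameter, and the rule of keeping each record's highest-scoring tokens as blocking keys are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations


def tokenize(record):
    # Schema-agnostic: join all attribute values and ignore field names.
    text = " ".join(str(v) for v in record.values())
    cleaned = "".join(c.lower() if c.isalnum() else " " for c in text)
    return cleaned.split()


def high_value_token_blocks(records, top_k=3):
    """Map each retained token to the set of record ids that share it.

    `top_k` (hypothetical parameter) is the number of highest TF-IDF
    tokens kept per record as blocking keys.
    """
    tokens = {rid: tokenize(rec) for rid, rec in records.items()}
    n = len(records)
    # Document frequency: number of records containing each token.
    df = Counter()
    for toks in tokens.values():
        df.update(set(toks))
    blocks = defaultdict(set)
    for rid, toks in tokens.items():
        tf = Counter(toks)
        scores = {t: (tf[t] / len(toks)) * math.log(n / df[t]) for t in tf}
        # Highest score first; ties are broken alphabetically.
        best = sorted(scores, key=lambda t: (-scores[t], t))[:top_k]
        for t in best:
            blocks[t].add(rid)
    # A block with a single record produces no comparisons, so drop it.
    return {t: ids for t, ids in blocks.items() if len(ids) > 1}


def candidate_pairs(blocks):
    # Only records that share a block are compared.
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


if __name__ == "__main__":
    # Toy records with different schemas (illustrative data only).
    records = {
        1: {"title": "Efficient blocking for record linkage", "year": 2021},
        2: {"name": "efficient blocking record linkage", "venue": "TKDD"},
        3: {"title": "Deep learning for image segmentation"},
        4: {"title": "Image segmentation with deep networks"},
    }
    print(sorted(candidate_pairs(high_value_token_blocks(records))))
```

On these four toy records the sketch yields two candidate pairs, (1, 2) and (3, 4), instead of the six comparisons an exhaustive pairwise scheme would need, which is the kind of reduction blocking is meant to provide.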

    Published In

    ACM Transactions on Knowledge Discovery from Data  Volume 16, Issue 2
    April 2022
    514 pages
    ISSN:1556-4681
    EISSN:1556-472X
    DOI:10.1145/3476120

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 21 July 2021
    Accepted: 01 February 2021
    Revised: 01 November 2020
    Received: 01 March 2020
    Published in TKDD Volume 16, Issue 2

    Author Tags

    1. Blocking
    2. record linkage
    3. entity resolution

    Qualifiers

    • Research-article
    • Refereed
