research-article

High-Value Token-Blocking: Efficient Blocking Method for Record Linkage

Authors:

Kevin O’Hare,

Anna Jurek-Loughrey,

Cassio De Campos

ACM Transactions on Knowledge Discovery from Data (TKDD), Volume 16, Issue 2

Article No.: 24, Pages 1 - 17

https://doi.org/10.1145/3450527

Published: 21 July 2021

Abstract

Data integration is an important component of Big Data analytics. One of the key challenges in data integration is record linkage, that is, matching records that represent the same real-world entity. Because of computational costs, methods referred to as blocking are employed as a part of the record linkage pipeline in order to reduce the number of comparisons among records. In the past decade, a range of blocking techniques have been proposed. Real-world applications require approaches that can handle heterogeneous data sources and do not rely on labelled data. We propose high-value token-blocking (HVTB), a simple and efficient approach for blocking that is unsupervised and schema-agnostic, based on a crafted use of Term Frequency-Inverse Document Frequency. We compare HVTB with multiple methods and over a range of datasets, including a novel unstructured dataset composed of titles and abstracts of scientific papers. We thoroughly discuss results in terms of accuracy, use of computational resources, and different characteristics of datasets and records. The simplicity of HVTB yields fast computations and does not harm its accuracy when compared with existing approaches. It is shown to be significantly superior to other methods, suggesting that simpler methods for blocking should be considered before resorting to more sophisticated methods.
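
To make the idea of unsupervised, schema-agnostic blocking with TF-IDF more concrete, the following is a minimal sketch and not the authors' HVTB algorithm, whose details the abstract does not give. It assumes that a "high-value" token is one of the `top_k` highest-scoring TF-IDF tokens of a record, and that tokens occurring in only one record are ignored because they cannot form a block. The function names, the `top_k` parameter, and the toy records are all hypothetical.

```python
import math
import re
from collections import Counter, defaultdict
from itertools import combinations


def tokenize(text):
    """Lowercase a record and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def high_value_tokens(records, top_k=2):
    """Return, for each record, its top_k tokens ranked by TF-IDF.

    Assumption: "high-value" means highest TF-IDF within the record,
    restricted to tokens seen in at least two records. The selection
    rule used in the paper may differ.
    """
    tokenized = [tokenize(r) for r in records]
    n = len(records)
    # Document frequency: number of records containing each token.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    selected = []
    for toks in tokenized:
        tf = Counter(toks)
        scores = {
            t: (c / len(toks)) * math.log(n / df[t])
            for t, c in tf.items()
            if df[t] >= 2  # a token unique to one record cannot link records
        }
        # Sort by score (descending), then alphabetically for determinism.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        selected.append(ranked[:top_k])
    return selected


def build_blocks(records, top_k=2):
    """Group record indices by shared high-value token."""
    blocks = defaultdict(set)
    for idx, toks in enumerate(high_value_tokens(records, top_k)):
        for t in toks:
            blocks[t].add(idx)
    # A block with a single record produces no comparisons, so drop it.
    return {t: ids for t, ids in blocks.items() if len(ids) > 1}


def candidate_pairs(blocks):
    """Collect the distinct record pairs that share at least one block."""
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


if __name__ == "__main__":
    # Toy records with no schema: each record is treated as one string.
    records = [
        "Kevin OHare Belfast University",
        "K OHare Belfast University",
        "Anna Jurek London University",
        "A Jurek London University",
    ]
    print(sorted(candidate_pairs(build_blocks(records))))
```

In this toy input the token "university" occurs in every record, so its inverse document frequency is zero and it is never selected. Plain token blocking would place all four records in one block through that token and compare all six pairs, whereas the sketch keeps only the pairs that share a rarer token.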

Cited By

R. Abd El Aziz, D. Elzanfaly, and M. Farhan. 2024. Towards Semantic Layer for Enhancing Blocking Entity Resolution Accuracy in Big Data. In 2024 International Conference on Artificial Intelligence, Computer, Data Sciences and Applications (ACDSA). 1–4. Online publication date: 1 February 2024. https://doi.org/10.1109/ACDSA59508.2024.10467666

D. Paulsen, Y. Govind, and A. Doan. 2023. Sparkly: A Simple yet Surprisingly Strong TF/IDF Blocker for Entity Matching. Proceedings of the VLDB Endowment 16, 6 (2023), 1507–1519. Online publication date: 1 February 2023. https://dl.acm.org/doi/10.14778/3583140.3583163

Index Terms

High-Value Token-Blocking: Efficient Blocking Method for Record Linkage
1. Information systems
  1. Information retrieval
    1. Retrieval models and ranking
      1. Novelty in information retrieval


Published In

ACM Transactions on Knowledge Discovery from Data Volume 16, Issue 2

April 2022

514 pages

ISSN:1556-4681

EISSN:1556-472X

DOI:10.1145/3476120

Editor:
Charu Aggarwal
IBM T. J. Watson Research, USA

Copyright © 2021 Association for Computing Machinery.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 21 July 2021

Accepted: 01 February 2021

Revised: 01 November 2020

Received: 01 March 2020

Published in TKDD Volume 16, Issue 2

Qualifiers

Research-article
Refereed
