DOI: 10.1145/3477314.3507689

A graph-based blocking approach for entity matching using pre-trained contextual embedding models

Published: 06 May 2022

Abstract

    Data integration is a crucial task in the entity matching process, in which redundant and duplicate entries must be identified and eliminated to improve data quality. Achieving this requires a comparison between all entities, which has quadratic computational complexity. To avoid this, 'blocking' is introduced to limit comparisons to probable matches. This paper presents a k-nearest neighbor graph-based blocking approach that utilizes state-of-the-art context-aware sentence embeddings from pre-trained transformers. Our approach maps each database tuple to a node and generates a graph in which related nodes are connected by edges. We then apply unsupervised community detection techniques to this graph, treating blocking as a graph clustering problem. Our work is motivated by the scarcity of training data for entity matching in real-world scenarios and the limited scalability of blocking schemes in the presence of proliferating data. We test our blocking system on four data sets requiring more than 6 million comparisons. Block processing times range from 59 s to 461 s, owing to the efficient data structure of the k-nearest neighbor graph. Our results also show that our method achieves a better F1 score than current deep learning-based blocking solutions.
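    The pipeline described above can be sketched in miniature. This is not the authors' implementation: hand-made 2-D vectors stand in for transformer sentence embeddings, and connected components of the k-nearest-neighbor graph stand in for the community detection step, purely to illustrate how blocking reduces to graph clustering.

    ```python
    import math
    from collections import defaultdict

    def knn_graph(vectors, k=2):
        """Link each node to its k nearest neighbors by cosine similarity (undirected)."""
        def cos(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            na = math.sqrt(sum(x * x for x in a))
            nb = math.sqrt(sum(x * x for x in b))
            return dot / (na * nb)
        edges = defaultdict(set)
        for i, v in enumerate(vectors):
            sims = sorted(((cos(v, w), j) for j, w in enumerate(vectors) if j != i),
                          reverse=True)
            for _, j in sims[:k]:
                edges[i].add(j)
                edges[j].add(i)
        return edges

    def blocks(edges, n):
        """Treat blocking as graph clustering; connected components serve as the blocks here."""
        seen, out = set(), []
        for start in range(n):
            if start in seen:
                continue
            comp, stack = [], [start]
            seen.add(start)
            while stack:
                u = stack.pop()
                comp.append(u)
                for v in edges[u]:
                    if v not in seen:
                        seen.add(v)
                        stack.append(v)
            out.append(sorted(comp))
        return out

    # Toy "embeddings": two tight clusters -> two blocks, so only
    # intra-block pairs would be compared during matching.
    vecs = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
    print(blocks(knn_graph(vecs, k=1), len(vecs)))  # [[0, 1], [2, 3]]
    ```

    Only pairs inside a block are compared afterwards, which is what avoids the quadratic all-pairs comparison; the paper's actual system builds the graph from pre-trained sentence embeddings and clusters it with community detection rather than components.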



        Published In

        SAC '22: Proceedings of the 37th ACM/SIGAPP Symposium on Applied Computing
        April 2022
        2099 pages
        ISBN:9781450387132
        DOI:10.1145/3477314

        Publisher

        Association for Computing Machinery

        New York, NY, United States


        Author Tags

        1. entity matching
        2. graph-based blocking

        Qualifiers

        • Research-article

        Conference

        SAC '22

        Acceptance Rates

        Overall Acceptance Rate 1,650 of 6,669 submissions, 25%


        Article Metrics

        • Total Citations: 0
        • Total Downloads: 186
        • Downloads (last 12 months): 23
        • Downloads (last 6 weeks): 0
        Reflects downloads up to 09 Aug 2024
