Deep Entity Matching: Challenges and Opportunities

Published: 06 January 2021

Abstract

Entity matching refers to the task of determining whether two different representations refer to the same real-world entity. It continues to be a prevalent problem for many organizations where data resides in different sources and duplicates need to be identified and managed. The term “entity matching” also loosely refers to the broader problem of determining whether two heterogeneous representations of different entities should be associated together. This broader problem has an even wider scope of applications, from determining the subsidiaries of companies to matching jobs to job seekers, with impactful consequences.
In this article, we first report on our recent system DITTO, an example of a modern entity matching system based on pre-trained language models. We then summarize recent solutions that apply deep learning and pre-trained language models to the entity matching task. Finally, we discuss research directions beyond entity matching, including the promise of synergistically integrating the blocking and matching steps, the need for methods that alleviate the steep training data requirements typical of deep learning and pre-trained language models, and the importance of generalizing entity matching solutions to handle the broader entity matching problem, which makes the need to explain matching outcomes even more pressing.
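To make the task concrete, the sketch below illustrates how a DITTO-style matcher built on a pre-trained language model can be set up: each record is serialized into a short text sequence, the two sequences are paired into a single input, and a fine-tuned Transformer classifies the pair as match or no-match. This is a minimal sketch, not DITTO's actual code; the serialization scheme, the distilbert-base-uncased checkpoint, and the product records are assumptions for illustration, and the model would need fine-tuning on labeled pairs before its scores mean anything.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

def serialize(record: dict) -> str:
    # Flatten a record into "COL <attribute> VAL <value> ..." text, a common
    # serialization for language-model-based entity matchers.
    return " ".join(f"COL {attr} VAL {val}" for attr, val in record.items())

# Illustrative checkpoint; in practice the classifier head is fine-tuned on labeled pairs.
name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)

# Two hypothetical product records that may refer to the same entity.
left = {"title": "iPhone 12 64GB Black", "brand": "Apple", "price": "699"}
right = {"title": "Apple iPhone 12 (64 GB) - Black", "brand": "Apple", "price": "699.00"}

# Encode the serialized pair as one sequence and score it.
inputs = tokenizer(serialize(left), serialize(right),
                   return_tensors="pt", truncation=True, max_length=256)
with torch.no_grad():
    logits = model(**inputs).logits
match_probability = torch.softmax(logits, dim=-1)[0, 1].item()
print(f"P(match) = {match_probability:.3f}")  # meaningful only after fine-tuning
```

In a full pipeline, a blocking step would normally run before this classifier so that only candidate pairs, rather than all quadratically many record pairs, are scored.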


Published In

Journal of Data and Information Quality, Volume 13, Issue 1
On the Horizon and Experience Papers
March 2021
104 pages
ISSN:1936-1955
EISSN:1936-1963
DOI:10.1145/3446835

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 06 January 2021
Accepted: 01 October 2020
Revised: 01 October 2020
Received: 01 October 2020
Published in JDIQ Volume 13, Issue 1

Author Tags

  1. Entity matching
  2. data integration
  3. deep learning
  4. entity resolution
  5. pre-trained language models
