
Effective entity matching with transformers

Published: 17 January 2023
Abstract

We present Ditto, a novel entity matching (EM) system based on pre-trained Transformer language models. We cast EM as a sequence-pair classification problem and fine-tune such models, which lets us leverage them with a simple architecture. Our experiments show that a straightforward application of language models such as BERT, DistilBERT, or RoBERTa, pre-trained on large text corpora, already significantly improves matching quality and outperforms the previous state of the art (SOTA) by up to 29% in F1 score on benchmark datasets. We also developed three optimization techniques to further improve Ditto's matching capability. Ditto allows domain knowledge to be injected by highlighting important pieces of input information that may be relevant to matching decisions. Ditto also summarizes strings that are too long, so that only the essential information is retained and used for EM. Finally, Ditto adapts a SOTA data augmentation technique for text to EM, augmenting the training data with (difficult) examples; this forces Ditto to learn "harder" and improves the model's matching capability. These optimizations further boost Ditto's performance by up to 9.8%. Perhaps more surprisingly, we establish that Ditto can achieve the previous SOTA results with at most half the labeled data. Finally, we demonstrate Ditto's effectiveness on a real-world large-scale EM task: in matching two company datasets of 789K and 412K records, Ditto achieves a high F1 score of 96.5%.
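To make the sequence-pair formulation concrete, here is a minimal sketch, assuming the HuggingFace transformers library. The serialize helper and the COL/VAL record serialization below are simplified illustrations of the approach the abstract describes, not Ditto's actual code:

```python
# Sketch: entity matching (EM) as sequence-pair classification with a
# pre-trained Transformer. Model choice and serialization are illustrative.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

def serialize(record):
    # Flatten a record's attribute/value pairs into a single string.
    return " ".join(f"COL {attr} VAL {val}" for attr, val in record.items())

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)  # labels: 0 = no-match, 1 = match
model.eval()

left = {"title": "Apple iPhone 12 64GB Black", "price": "699.00"}
right = {"title": "iPhone 12, 64 GB, black (Apple)", "price": "699"}

# Encode the two serialized records as one sequence pair:
# [CLS] serialize(left) [SEP] serialize(right) [SEP]
inputs = tokenizer(serialize(left), serialize(right),
                   truncation=True, max_length=256, return_tensors="pt")

with torch.no_grad():
    probs = torch.softmax(model(**inputs).logits, dim=-1)
print(f"P(match) = {probs[0, 1].item():.3f}")  # head is untrained here
```

In the full system, the classification head would be fine-tuned on labeled record pairs with a standard cross-entropy loss; the sketch only illustrates the input encoding and prediction. The data augmentation optimization can be sketched in the same spirit. The two operators below (random span deletion and span shuffling) are hypothetical stand-ins for the kind of text-augmentation operators the abstract refers to, not Ditto's actual operator set:

```python
# Sketch: simple augmentation operators that perturb a serialized record
# to produce "harder" training examples. Operator names are illustrative.
import random

def delete_span(tokens, max_len=3):
    # Drop a random contiguous span of up to max_len tokens.
    if len(tokens) <= max_len:
        return list(tokens)
    start = random.randrange(len(tokens) - max_len)
    return tokens[:start] + tokens[start + random.randint(1, max_len):]

def shuffle_span(tokens, max_len=4):
    # Shuffle a random contiguous span of up to max_len tokens in place.
    if len(tokens) <= max_len:
        return list(tokens)
    start = random.randrange(len(tokens) - max_len)
    end = start + random.randint(2, max_len)
    span = tokens[start:end]
    random.shuffle(span)
    return tokens[:start] + span + tokens[end:]

entry = "COL title VAL Apple iPhone 12 64GB Black COL price VAL 699.00".split()
print(" ".join(shuffle_span(delete_span(entry))))  # a perturbed example
```

Applied to serialized training pairs, such perturbations yield additional, deliberately harder examples for fine-tuning.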


    Cited By

• Disambiguate Entity Matching using Large Language Models through Relation Discovery. In: Proceedings of the Conference on Governance, Understanding and Integration of Data for Effective and Responsible AI, pp. 36–39. https://doi.org/10.1145/3665601.3669844 (online: 9 June 2024)
• Better Entity Matching with Transformers through Ensembles. Knowledge-Based Systems 293:C. https://doi.org/10.1016/j.knosys.2024.111678 (online: 7 June 2024)



    Published In

The VLDB Journal — The International Journal on Very Large Data Bases, Volume 32, Issue 6
    Nov 2023
    233 pages

    Publisher

    Springer-Verlag

    Berlin, Heidelberg

    Publication History

    Published: 17 January 2023
    Accepted: 29 December 2022
    Revision received: 04 December 2022
    Received: 01 April 2022

    Author Tags

    1. Entity matching
    2. Transformers
    3. Deep learning
    4. Data integration

    Qualifiers

    • Research-article


