
Effective entity matching with transformers

Published: 17 January 2023
Abstract

We present Ditto, a novel entity matching (EM) system based on pre-trained Transformer language models. We cast EM as a sequence-pair classification problem and fine-tune such models, which lets us leverage them with a simple architecture. Our experiments show that a straightforward application of language models such as BERT, DistilBERT, or RoBERTa, pre-trained on large text corpora, already significantly improves matching quality and outperforms the previous state of the art (SOTA) by up to 29% in F1 score on benchmark datasets. We also developed three optimization techniques to further improve Ditto's matching capability. Ditto allows domain knowledge to be injected by highlighting important pieces of input information that may be relevant to matching decisions. Ditto also summarizes strings that are too long, so that only the essential information is retained and used for EM. Finally, Ditto adapts a SOTA data augmentation technique for text to EM, augmenting the training data with (difficult) examples; this forces Ditto to learn "harder" and improves the model's matching capability. These optimizations further boost Ditto's performance by up to 9.8%. Perhaps more surprisingly, we establish that Ditto can achieve the previous SOTA results with at most half the labeled data. Finally, we demonstrate Ditto's effectiveness on a real-world large-scale EM task: in matching two company datasets of 789K and 412K records, Ditto achieves a high F1 score of 96.5%.
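To make the sequence-pair formulation concrete, here is a minimal sketch, assuming the HuggingFace transformers library. The serialize helper and the COL/VAL record serialization below are simplified illustrations of the approach the abstract describes, not Ditto's actual code:

```python
# Sketch: entity matching (EM) as sequence-pair classification with a
# pre-trained Transformer. Model choice and serialization are illustrative.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

def serialize(record):
    # Flatten a record's attribute/value pairs into a single string.
    return " ".join(f"COL {attr} VAL {val}" for attr, val in record.items())

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)  # labels: 0 = no-match, 1 = match
model.eval()

left = {"title": "Apple iPhone 12 64GB Black", "price": "699.00"}
right = {"title": "iPhone 12, 64 GB, black (Apple)", "price": "699"}

# Encode the two serialized records as one sequence pair:
# [CLS] serialize(left) [SEP] serialize(right) [SEP]
inputs = tokenizer(serialize(left), serialize(right),
                   truncation=True, max_length=256, return_tensors="pt")

with torch.no_grad():
    probs = torch.softmax(model(**inputs).logits, dim=-1)
print(f"P(match) = {probs[0, 1].item():.3f}")  # head is untrained here
```

In the full system, the classification head would be fine-tuned on labeled record pairs with a standard cross-entropy loss; the sketch only illustrates the input encoding and prediction. The data augmentation optimization can be sketched in the same spirit. The two operators below (random span deletion and span shuffling) are hypothetical stand-ins for the kind of text-augmentation operators the abstract refers to, not Ditto's actual operator set:

```python
# Sketch: simple augmentation operators that perturb a serialized record
# to produce "harder" training examples. Operator names are illustrative.
import random

def delete_span(tokens, max_len=3):
    # Drop a random contiguous span of up to max_len tokens.
    if len(tokens) <= max_len:
        return list(tokens)
    start = random.randrange(len(tokens) - max_len)
    return tokens[:start] + tokens[start + random.randint(1, max_len):]

def shuffle_span(tokens, max_len=4):
    # Shuffle a random contiguous span of up to max_len tokens in place.
    if len(tokens) <= max_len:
        return list(tokens)
    start = random.randrange(len(tokens) - max_len)
    end = start + random.randint(2, max_len)
    span = tokens[start:end]
    random.shuffle(span)
    return tokens[:start] + span + tokens[end:]

entry = "COL title VAL Apple iPhone 12 64GB Black COL price VAL 699.00".split()
print(" ".join(shuffle_span(delete_span(entry))))  # a perturbed example
```

Applied to serialized training pairs, such perturbations yield additional, deliberately harder examples for fine-tuning.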


    Cited By

• Disambiguate Entity Matching using Large Language Models through Relation Discovery. In: Proceedings of the Conference on Governance, Understanding and Integration of Data for Effective and Responsible AI, pp. 36–39. https://doi.org/10.1145/3665601.3669844 (online: 9 June 2024)
• Better Entity Matching with Transformers through Ensembles. Knowledge-Based Systems 293:C. https://doi.org/10.1016/j.knosys.2024.111678 (online: 7 June 2024)



    Published In

The VLDB Journal — The International Journal on Very Large Data Bases, Volume 32, Issue 6
    Nov 2023
    233 pages

    Publisher

    Springer-Verlag

    Berlin, Heidelberg

    Publication History

    Published: 17 January 2023
    Accepted: 29 December 2022
    Revision received: 04 December 2022
    Received: 01 April 2022

    Author Tags

    1. Entity matching
    2. Transformers
    3. Deep learning
    4. Data integration

    Qualifiers

    • Research-article


