research-article

Deep indexed active learning for matching heterogeneous entity representations

Authors:

Sunita Sarawagi,

Prithviraj SenAuthors Info & Claims

Proceedings of the VLDB Endowment, Volume 15, Issue 1

Pages 31 - 45

https://doi.org/10.14778/3485450.3485455

Published: 01 September 2021 Publication History

Abstract

Given two large lists of records, the task in entity resolution (ER) is to find the pairs from the Cartesian product of the lists that correspond to the same real world entity. Typically, passive learning methods on such tasks require large amounts of labeled data to yield useful models. Active Learning is a promising approach for ER in low resource settings. However, the search space, to find informative samples for the user to label, grows quadratically for instance-pair tasks making active learning hard to scale. Previous works, in this setting, rely on hand-crafted predicates, pre-trained language model embeddings, or rule learning to prune away unlikely pairs from the Cartesian product. This blocking step can miss out on important regions in the product space leading to low recall. We propose DIAL, a scalable active learning approach that jointly learns embeddings to maximize recall for blocking and accuracy for matching blocked pairs. DIAL uses an Index-By-Committee framework, where each committee member learns representations based on powerful pre-trained transformer language models. We highlight surprising differences between the matcher and the blocker in the creation of the training data and the objective used to train their parameters. Experiments on five benchmark datasets and a multilingual record matching dataset show the effectiveness of our approach in terms of precision, recall and running time.

References

[1]

Firas Abuzaid, Geet Sethi, Peter Bailis, and Matei Zaharia. 2019. To Index or Not to Index: Optimizing Exact Maximum Inner Product Search. In 35th IEEE International Conference on Data Engineering, ICDE 2019, Macao, China, April 8-11, 2019. IEEE, 1250--1261.

[2]

Akiko N. Aizawa and Keizo Oyama. 2005. A Fast Linkage Detection Scheme for Multi-Source Information Integration. In 2005 International Workshop on Challenges in Web Information Retrieval and Integration (WIRI 2005), 8-9 April 2005, Tokyo, Japan. IEEE Computer Society, 30--39.

[3]

Arvind Arasu, Michaela Götz, and Raghav Kaushik. 2010. On Active Learning of Record Matching Packages. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data (Indianapolis, Indiana, USA) (SIGMOD '10). Association for Computing Machinery, New York, NY, USA, 783--794.

Digital Library

[4]

David Arthur and Sergei Vassilvitskii. 2007. K-Means++: The Advantages of Careful Seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms (New Orleans, Louisiana) (SODA '07). Society for Industrial and Applied Mathematics, USA, 1027--1035.

Digital Library

[5]

Jordan T. Ash, Chicheng Zhang, Akshay Krishnamurthy, John Langford, and Alekh Agarwal. 2020. Deep Batch Active Learning by Diverse, Uncertain Gradient Lower Bounds. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net. https://openreview.net/forum?id=ryghZJBKPS

[6]

Nils Barlaug and Jon Atle Gulla. 2021. Neural Networks for Entity Matching: A Survey. ACM Trans. Knowl. Discov. Data 15, 3 (2021), 52:1--52:37.

Digital Library

[7]

William H. Beluch, Tim Genewein, Andreas Nürnberger, and Jan M. Köhler. 2018. The Power of Ensembles for Active Learning in Image Classification. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018. Computer Vision Foundation / IEEE Computer Society, 9368--9377.

[8]

Mikhail Bilenko, Beena Kamath, and Raymond J. Mooney. 2006. Adaptive Blocking: Learning to Scale Up Record Linkage. In Proceedings of the 6th IEEE International Conference on Data Mining (ICDM 2006), 18-22 December 2006, Hong Kong, China. IEEE Computer Society, 87--96.

Digital Library

[9]

Leo Breiman. 2001. Random Forests. Mach. Learn. 45, 1 (Oct. 2001), 5--32.

Digital Library

[10]

Ursin Brunner and Kurt Stockinger. 2020. Entity Matching with Transformer Architectures - A Step Forward in Data Integration. In Proceedings of the 23rd International Conference on Extending Database Technology, EDBT 2020, Copenhagen, Denmark, March 30 - April 02, 2020, Angela Bonifati, Yongluan Zhou, Marcos Antonio Vaz Salles, Alexander Böhm, Dan Olteanu, George H. L. Fletcher, Arijit Khan, and Bin Yang (Eds.). OpenProceedings.org, 463--473.

[11]

Xiao Chen, Yinlong Xu, David Broneske, Gabriel Campero Durand, Roman Zoun, and Gunter Saake. 2019. Heterogeneous Committee-Based Active Learning for Entity Resolution (HeALER). In Advances in Databases and Information Systems, Tatjana Welzer, Johann Eder, Vili Podgorelec, and Aida Kamišalić Latifić (Eds.). Springer International Publishing, Cham, 69--85.

[12]

P. Christen. 2012. A Survey of Indexing Techniques for Scalable Record Linkage and Deduplication. IEEE Transactions on Knowledge and Data Engineering 24, 9 (2012), 1537--1555.

Digital Library

[13]

Vassilis Christophides, Vasilis Efthymiou, Themis Palpanas, George Papadakis, and Kostas Stefanidis. 2020. An Overview of End-to-End Entity Resolution for Big Data. ACM Comput. Surv. 53, 6, Article 127 (Dec. 2020), 42 pages.

Digital Library

[14]

Guilherme dal Bianco, Marcos André Gonçalves, and Denio Duarte. 2018. BLOSS: Effective meta-blocking with almost no effort. Information Systems 75 (2018), 75--89.

[15]

Sanjib Das, AnHai Doan, Paul Suganthan G. C., Chaitanya Gokhale, Pradap Konda, Yash Govind, and Derek Paulsen. 2021. The Magellan Data Repository. https://sites.google.com/site/anhaidgroup/useful-stuff/the-magellan-data-repository.

[16]

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics, Minneapolis, Minnesota, 4171--4186.

[17]

Muhammad Ebraheem, Saravanan Thirumuruganathan, Shafiq Joty, Mourad Ouzzani, and Nan Tang. 2018. Distributed Representations of Tuples for Entity Resolution. Proc. VLDB Endow. 11, 11 (July 2018), 1454--1467.

Digital Library

[18]

Vasilis Efthymiou, George Papadakis, Kostas Stefanidis, and Vassilis Christophides. 2019. MinoanER: Schema-Agnostic, Non-Iterative, Massively Parallel Resolution of Web Entities. In Advances in Database Technology - 22nd International Conference on Extending Database Technology, EDBT 2019, Lisbon, Portugal, March 26-29, 2019, Melanie Herschel, Helena Galhardas, Berthold Reinwald, Irini Fundulaki, Carsten Binnig, and Zoi Kaoudi (Eds.). OpenProceedings.org, 373--384.

[19]

Varun Embar, Bunyamin Sisman, Hao Wei, Xin Luna Dong, Christos Faloutsos, and Lise Getoor. 2020. Contrastive Entity Linkage: Mining Variational Attributes from Large Catalogs for Entity Linkage. In Conference on Automated Knowledge Base Construction, AKBC 2020, Virtual, June 22-24, 2020, Dipanjan Das, Hannaneh Hajishirzi, Andrew McCallum, and Sameer Singh (Eds.).

[20]

Ivan P. Fellegi and Alan B. Sunter. 1969. A Theory for Record Linkage. J. Amer. Statist. Assoc. 64, 328 (1969), 1183--1210. arXiv:https://www.tandfonline.com/doi/pdf/10.1080/01621459.1969.10501049

[21]

Yoav Freund, H. Sebastian Seung, Eli Shamir, and Naftali Tishby. 1997. Selective Sampling Using the Query by Committee Algorithm. Mach. Learn. 28, 2-3 (1997), 133--168.

Digital Library

[22]

Yarin Gal, Riashat Islam, and Zoubin Ghahramani. 2017. Deep Bayesian Active Learning with Image Data. In Proceedings of the 34th International Conference on Machine Learning (Proceedings of Machine Learning Research), Doina Precup and Yee Whye Teh (Eds.), Vol. 70. PMLR, 1183--1192. http://proceedings.mlr.press/v70/gal17a.html

[23]

Sainyam Galhotra, Donatella Firmani, Barna Saha, and Divesh Srivastava. 2018. Robust Entity Resolution Using Random Graphs. In Proceedings of the 2018 International Conference on Management of Data (Houston, TX, USA) (SIGMOD '18). Association for Computing Machinery, New York, NY, USA, 3--18.

Digital Library

[24]

Sainyam Galhotra, Donatella Firmani, Barna Saha, and Divesh Srivastava. 2021. Efficient and effective ER with progressive blocking. The VLDB Journal 30, 4 (Mar 2021), 537--557.

[25]

Daniel Gillick, Sayali Kulkarni, Larry Lansing, Alessandro Presta, Jason Baldridge, Eugene Ie, and Diego Garcia-Olano. 2019. Learning Dense Representations for Entity Retrieval. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL). Association for Computational Linguistics, Hong Kong, China, 528--537.

[26]

Kazuma Hashimoto, Raffaella Buschiazzo, James Bradbury, Teresa Marshall, Richard Socher, and Caiming Xiong. 2019. A High-Quality Multilingual Dataset for Structured Documentation Translation. In Proceedings of the Fourth Conference on Machine Translation (Volume 1: Research Papers). Association for Computational Linguistics, Florence, Italy, 116--127.

[27]

Mauricio A. Hernández and Salvatore J. Stolfo. 1995. The Merge/Purge Problem for Large Databases. In Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data (San Jose, California, USA) (SIGMOD '95). Association for Computing Machinery, New York, NY, USA, 127--138.

Digital Library

[28]

Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2021. Billion-Scale Similarity Search with GPUs. IEEE Trans. Big Data 7, 3 (2021), 535--547.

[29]

Jungo Kasai, Kun Qian, Sairam Gurajada, Yunyao Li, and Lucian Popa. 2019. Low-resource Deep Entity Resolution with Transfer and Active Learning. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28- August 2, 2019, Volume 1: Long Papers, Anna Korhonen, David R. Traum, and Lluís Màrquez (Eds.). Association for Computational Linguistics, 5851--5861.

[30]

Andreas Kirsch, Joost van Amersfoort, and Yarin Gal. 2019. BatchBALD: Efficient and Diverse Batch Acquisition for Deep Bayesian Active Learning. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d'Alché-Buc, Emily B. Fox, and Roman Garnett (Eds.). 7024--7035. https://proceedings.neurips.cc/paper/2019/hash/95323660ed2124450caaac2c46b5ed90-Abstract.html

[31]

Pradap Konda, Sanjib Das, Paul Suganthan G. C., AnHai Doan, Adel Ardalan, Jeffrey R. Ballard, Han Li, Fatemah Panahi, Haojun Zhang, Jeff Naughton, Shishir Prasad, Ganesh Krishnan, Rohit Deep, and Vijay Raghavendra. 2016. Magellan: Toward Building Entity Matching Management Systems. Proc. VLDB Endow. 9, 12 (Aug. 2016), 1197--1208.

Digital Library

[32]

Hanna Köpcke, Andreas Thor, and Erhard Rahm. 2010. Evaluation of Entity Resolution Approaches on Real-World Match Problems. Proc. VLDB Endow. 3, 1-2 (Sept. 2010), 484--493.

Digital Library

[33]

Hanna Köpcke and Erhard Rahm. 2010. Frameworks for entity matching: A comparison. Data & Knowledge Engineering 69, 2 (2010), 197--210.

Digital Library

[34]

Yuliang Li, Jinfeng Li, Yoshihiko Suhara, AnHai Doan, and Wang-Chiew Tan. 2020. Deep Entity Matching with Pre-Trained Language Models. Proc. VLDB Endow. 14, 1 (Sept. 2020), 50--60.

Digital Library

[35]

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. CoRR abs/1907.11692 (2019). arXiv:1907.11692 http://arxiv.org/abs/1907.11692

[36]

Ilya Loshchilov and Frank Hutter. 2019. Decoupled Weight Decay Regularization. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net. https://openreview.net/forum?id=Bkg6RiCqY7

[37]

Andrew McCallum, Kamal Nigam, and Lyle H. Ungar. 2000. Efficient Clustering of High-Dimensional Data Sets with Application to Reference Matching. In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Boston, Massachusetts, USA) (KDD '00). Association for Computing Machinery, New York, NY, USA, 169--178.

Digital Library

[38]

Paul McNamee, James Mayfield, Dawn Lawrie, Douglas Oard, and David Doermann. 2011. Cross-Language Entity Linking. In Proceedings of 5th International Joint Conference on Natural Language Processing. Asian Federation of Natural Language Processing, Chiang Mai, Thailand, 255--263. https://www.aclweb.org/anthology/I11--1029

[39]

Venkata Vamsikrishna Meduri, Lucian Popa, Prithviraj Sen, and Mohamed Sarwat. 2020. A Comprehensive Benchmark Framework for Active Learning Methods in Entity Matching. In Proceedings of the 2020 International Conference on Management of Data, SIGMOD Conference 2020, online conference [Portland, OR, USA], June 14-19, 2020, David Maier, Rachel Pottinger, AnHai Doan, Wang-Chiew Tan, Abdussalam Alawini, and Hung Q. Ngo (Eds.). ACM, 1133--1147.

Digital Library

[40]

Barzan Mozafari, Purna Sarkar, Michael Franklin, Michael Jordan, and Samuel Madden. 2014. Scaling up Crowd-Sourcing to Very Large Datasets: A Case for Active Learning. Proc. VLDB Endow. 8, 2 (Oct. 2014), 125--136.

Digital Library

[41]

Sidharth Mudgal, Han Li, Theodoros Rekatsinas, AnHai Doan, Youngchoon Park, Ganesh Krishnan, Rohit Deep, Esteban Arcaute, and Vijay Raghavendra. 2018. Deep Learning for Entity Matching: A Design Space Exploration. In Proceedings of the 2018 International Conference on Management of Data (Houston, TX, USA) (SIGMOD '18). Association for Computing Machinery, New York, NY, USA, 19--34.

Digital Library

[42]

Stephen Mussmann, Robin Jia, and Percy Liang. 2020. On the Importance of Adaptive Data Collection for Extremely Imbalanced Pairwise Tasks. In Findings of the Association for Computational Linguistics: EMNLP 2020. Association for Computational Linguistics, Online, 3400--3413.

[43]

Hao Nie, Xianpei Han, Ben He, Le Sun, Bo Chen, Wei Zhang, Suhui Wu, and Hao Kong. 2019. Deep Sequence-to-Sequence Entity Matching for Heterogeneous Entity Resolution. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management (Beijing, China) (CIKM '19). Association for Computing Machinery, New York, NY, USA, 629--638.

Digital Library

[44]

George Papadakis, George Alexiou, George Papastefanatos, and Georgia Koutrika. 2015. Schema-Agnostic vs Schema-Based Configurations for Blocking Methods on Homogeneous Data. Proc. VLDB Endow. 9, 4 (Dec. 2015), 312--323.

Digital Library

[45]

George Papadakis, Ekaterini Ioannou, Themis Palpanas, Claudia Niederée, and Wolfgang Nejdl. 2013. A Blocking Framework for Entity Resolution in Highly Heterogeneous Information Spaces. IEEE Transactions on Knowledge and Data Engineering 25, 12 (2013), 2665--2682.

Digital Library

[46]

George Papadakis, Georgia Koutrika, Themis Palpanas, and Wolfgang Nejdl. 2014. Meta-Blocking: Taking Entity Resolution to the Next Level. IEEE Transactions on Knowledge and Data Engineering 26, 8 (2014), 1946--1960.

[47]

George Papadakis, George Mandilaras, Luca Gagliardelli, Giovanni Simonini, Emmanouil Thanos, George Giannakopoulos, Sonia Bergamaschi, Themis Palpanas, and Manolis Koubarakis. 2020. Three-dimensional Entity Resolution with JedAI. Information Systems 93 (2020), 101565.

[48]

George Papadakis, George Papastefanatos, and Georgia Koutrika. 2014. Supervised Meta-Blocking. Proc. VLDB Endow. 7, 14 (Oct. 2014), 1929--1940.

Digital Library

[49]

George Papadakis, George Papastefanatos, Themis Palpanas, and Manolis Koubarakis. 2016. Scaling Entity Resolution to Large, Heterogeneous Data with Enhanced Meta-blocking. In Proceedings of the 19th International Conference on Extending Database Technology, EDBT 2016, Bordeaux, France, March 15-16, 2016, Bordeaux, France, March 15-16, 2016, Evaggelia Pitoura, Sofian Maabout, Georgia Koutrika, Amélie Marian, Letizia Tanca, Ioana Manolescu, and Kostas Stefanidis (Eds.). OpenProceedings.org, 221--232.

[50]

George Papadakis, Dimitrios Skoutas, Emmanouil Thanos, and Themis Palpanas. 2020. Blocking and Filtering Techniques for Entity Resolution: A Survey. ACM Comput. Surv. 53, 2, Article 31 (March 2020), 42 pages.

Digital Library

[51]

George Papadakis, Leonidas Tsekouras, Emmanouil Thanos, George Giannakopoulos, Themis Palpanas, and Manolis Koubarakis. 2018. The Return of jedAI: End-to-end Entity Resolution for Structured and Semi-structured Data. PVLDB 11, 12 (2018), 1950--1953.

Digital Library

[52]

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.). Curran Associates, Inc., 8024--8035. http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf

Digital Library

[53]

Anna Primpeli, Christian Bizer, and Margret Keuper. 2020. Unsupervised Boot-strapping of Active Learning for Entity Resolution. In The Semantic Web, Andreas Harth, Sabrina Kirrane, Axel-Cyrille Ngonga Ngomo, Heiko Paulheim, Anisa Rula, Anna Lisa Gentile, Peter Haase, and Michael Cochez (Eds.). Springer International Publishing, Cham, 215--231.

[54]

Kun Qian, Lucian Popa, and Prithviraj Sen. 2017. Active Learning for Large-Scale Entity Resolution. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, CIKM 2017, Singapore, November 06 - 10, 2017, EePeng Lim, Marianne Winslett, Mark Sanderson, Ada Wai-Chee Fu, Jimeng Sun, J. Shane Culpepper, Eric Lo, Joyce C. Ho, Debora Donato, Rakesh Agrawal, Yu Zheng, Carlos Castillo, Aixin Sun, Vincent S. Tseng, and Chenliang Li (Eds.). ACM, 1379--1388.

Digital Library

[55]

Kun Qian, Lucian Popa, and Prithviraj Sen. 2019. SystemER: A Human-in-the-loop System for Explainable Entity Resolution. Proc. VLDB Endow. 12, 12 (2019), 1794--1797.

Digital Library

[56]

Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence Embed-dings using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, Kentaro Inui, Jing Jiang, Vincent Ng, and Xiaojun Wan (Eds.). Association for Computational Linguistics, 3980--3990.

[57]

Pengzhen Ren, Yun Xiao, Xiaojun Chang, Po-Yao Huang, Zhihui Li, Xiaojiang Chen, and Xin Wang. 2020. A Survey of Deep Active Learning. CoRR abs/2009.00236 (2020). arXiv:2009.00236 https://arxiv.org/abs/2009.00236

[58]

Sunita Sarawagi and Anuradha Bhamidipaty. 2002. Interactive Deduplication Using Active Learning. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Edmonton, Alberta, Canada) (KDD '02). Association for Computing Machinery, New York, NY, USA, 269--278.

Digital Library

[59]

Ozan Sener and Silvio Savarese. 2018. Active Learning for Convolutional Neural Networks: A Core-Set Approach. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net. https://openreview.net/forum?id=H1aIuk-RW

[60]

Burr Settles. 2009. Active Learning Literature Survey. Technical Report.

[61]

H. Sebastian Seung, Manfred Opper, and Haim Sompolinsky. 1992. Query by Committee. In Proceedings of the Fifth Annual ACM Conference on Computational Learning Theory, COLT 1992, Pittsburgh, PA, USA, July 27-29, 1992, David Haussler (Ed.). ACM, 287--294.

Digital Library

[62]

Giovanni Simonini, Sonia Bergamaschi, and H. V. Jagadish. 2016. BLAST: A Loosely Schema-Aware Meta-Blocking Approach for Entity Resolution. Proc. VLDB Endow. 9, 12 (Aug. 2016), 1173--1184.

Digital Library

[63]

Sheila Tejada, Craig A Knoblock, and Steven Minton. 2001. Learning object identification rules for information integration. Information Systems 26, 8 (2001), 607--633.

Digital Library

[64]

Janusz Tracz, Piotr Iwo Wójcik, Kalina Jasinska-Kobus, Riccardo Belluzzo, Robert Mroczkowski, and Ireneusz Gawlik. 2020. BERT-based similarity learning for product matching. In Proceedings of Workshop on Natural Language Processing in E-Commerce. Association for Computational Linguistics, Barcelona, Spain, 66--75. https://www.aclweb.org/anthology/2020.ecomnlp-1.7

[65]

Vasilis Verroios, Hector Garcia-Molina, and Yannis Papakonstantinou. 2017. Waldo: An Adaptive Human Interface for Crowd Entity Resolution. In Proceedings of the 2017 ACM International Conference on Management of Data (Chicago, Illinois, USA) (SIGMOD '17). Association for Computing Machinery, New York, NY, USA, 1133--1148.

Digital Library

[66]

Jiannan Wang, Tim Kraska, Michael J. Franklin, and Jianhua Feng. 2012. CrowdER: Crowdsourcing Entity Resolution. Proc. VLDB Endow. 5, 11 (July 2012), 1483--1494.

Digital Library

[67]

Zhengyang Wang, Bunyamin Sisman, Hao Wei, Xin Luna Dong, and Shuiwang Ji. 2020. CorDEL: A Contrastive Deep Learning Approach for Entity Linkage. In 20th IEEE International Conference on Data Mining, ICDM 2020, Sorrento, Italy, November 17-20, 2020, Claudia Plant, Haixun Wang, Alfredo Cuzzocrea, Carlo Zaniolo, and Xindong Wu (Eds.). IEEE, 1322--1327.

[68]

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. 2020. Transformers: State-of-the-Art Natural Language Processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. Association for Computational Linguistics, Online, 38--45.

[69]

Donggeun Yoo and In So Kweon. 2019. Learning Loss for Active Learning. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019. Computer Vision Foundation / IEEE, 93--102.

[70]

Fulin Zhang, Zhipeng Gao, and Kun Niu. 2017. A pruning algorithm for Meta-blocking based on cumulative weight. Journal of Physics: Conference Series 887 (aug 2017), 012058.

[71]

Wei Zhang, Hao Wei, Bunyamin Sisman, Xin Luna Dong, Christos Faloutsos, and Davd Page. 2020. AutoBlock: A Hands-off Blocking Framework for Entity Matching. In Proceedings of the 13th International Conference on Web Search and Data Mining (Houston, TX, USA) (WSDM '20). Association for Computing Machinery, New York, NY, USA, 744--752.

Digital Library

[72]

Chen Zhao and Yeye He. 2019. Auto-EM: End-to-End Fuzzy Entity-Matching Using Pre-Trained Deep Models and Transfer Learning. In The World Wide Web Conference (San Francisco, CA, USA) (WWW '19). Association for Computing Machinery, New York, NY, USA, 2413--2424.

Digital Library

[73]

Fedor Zhdanov. 2019. Diverse mini-batch Active Learning. CoRR abs/1901.05954 (2019). arXiv:1901.05954 http://arxiv.org/abs/1901.05954

Cited By

Neuhof FFisichella MPapadakis GNikoletos KAugsten NNejdl WKoubarakis M(2024)Open benchmark for filtering techniques in entity resolutionThe VLDB Journal — The International Journal on Very Large Data Bases10.1007/s00778-024-00868-733:5(1671-1696)Online publication date: 1-Sep-2024
https://dl.acm.org/doi/10.1007/s00778-024-00868-7
Meduri VQuamar ALei CQin XReinwald B(2024)Alfa: active learning for graph neural network-based semantic schema alignmentThe VLDB Journal — The International Journal on Very Large Data Bases10.1007/s00778-023-00822-z33:4(981-1011)Online publication date: 1-Jul-2024
https://dl.acm.org/doi/10.1007/s00778-023-00822-z
Huang JSun ZChen QXu XRen WHu W(2023)Deep Active Alignment of Knowledge Graph Entities and SchemataProceedings of the ACM on Management of Data10.1145/35893041:2(1-26)Online publication date: 20-Jun-2023
https://dl.acm.org/doi/10.1145/3589304
Show More Cited By

Recommendations

Deep Entity Matching: Challenges and Opportunities
On the Horizon, On the Horizon and Experience Papers

Entity matching refers to the task of determining whether two different representations refer to the same real-world entity. It continues to be a prevalent problem for many organizations where data resides in different sources and duplicates the need to ...
Heterogeneous Committee-Based Active Learning for Entity Resolution (HeALER)
Advances in Databases and Information Systems
Abstract
Entity resolution identifies records that refer to the same real-world entity. For its classification step, supervised learning can be adopted, but this faces limitations in the availability of labeled training data. Under this situation, active ...
Deep entity matching with adversarial active learning
Abstract
Entity matching (EM), as a fundamental task in data cleansing and integration, aims to identify the data records in databases that refer to the same real-world entity. While recent deep learning technologies significantly improve the performance ...

Comments

Information & Contributors

Information

Published In

cover image Proceedings of the VLDB Endowment

Proceedings of the VLDB Endowment Volume 15, Issue 1

September 2021

140 pages

ISSN:2150-8097

Editors:
Juliana Freire
New York University
,
Xuemin Lin
University of New South Wales

Issue’s Table of Contents

Publisher

VLDB Endowment

Publication History

Published: 01 September 2021

Published in PVLDB Volume 15, Issue 1

Badges

Artifacts Available / v1.1

Qualifiers

Research-article

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

7
Total Citations
View Citations
69
Total Downloads

Downloads (Last 12 months)15
Downloads (Last 6 weeks)1

Reflects downloads up to 22 Sep 2024

Other Metrics

View Author Metrics

Citations

Cited By

Neuhof FFisichella MPapadakis GNikoletos KAugsten NNejdl WKoubarakis M(2024)Open benchmark for filtering techniques in entity resolutionThe VLDB Journal — The International Journal on Very Large Data Bases10.1007/s00778-024-00868-733:5(1671-1696)Online publication date: 1-Sep-2024
https://dl.acm.org/doi/10.1007/s00778-024-00868-7
Meduri VQuamar ALei CQin XReinwald B(2024)Alfa: active learning for graph neural network-based semantic schema alignmentThe VLDB Journal — The International Journal on Very Large Data Bases10.1007/s00778-023-00822-z33:4(981-1011)Online publication date: 1-Jul-2024
https://dl.acm.org/doi/10.1007/s00778-023-00822-z
Huang JSun ZChen QXu XRen WHu W(2023)Deep Active Alignment of Knowledge Graph Entities and SchemataProceedings of the ACM on Management of Data10.1145/35893041:2(1-26)Online publication date: 20-Jun-2023
https://dl.acm.org/doi/10.1145/3589304
Long WLi XWang LZhang FLin ZLin X(2023)Efficient m-closest entity matching over heterogeneous information networksKnowledge-Based Systems10.1016/j.knosys.2023.110299263:COnline publication date: 5-Mar-2023
https://dl.acm.org/doi/10.1016/j.knosys.2023.110299
Li YLi JSuhara YDoan ATan W(2023)Effective entity matching with transformersThe VLDB Journal — The International Journal on Very Large Data Bases10.1007/s00778-023-00779-z32:6(1215-1235)Online publication date: 1-Nov-2023
https://dl.acm.org/doi/10.1007/s00778-023-00779-z
Shraga R(2022)HumanALProceedings of the Workshop on Human-In-the-Loop Data Analytics10.1145/3546930.3547496(1-8)Online publication date: 12-Jun-2022
https://dl.acm.org/doi/10.1145/3546930.3547496
Huang JHu WBao ZChen QQu Y(2022)Deep entity matching with adversarial active learningThe VLDB Journal — The International Journal on Very Large Data Bases10.1007/s00778-022-00745-132:1(229-255)Online publication date: 28-Apr-2022
https://dl.acm.org/doi/10.1007/s00778-022-00745-1
Wrembel R(2022)Data Integration, Cleaning, and Deduplication: Research Versus Industrial ProjectsInformation Integration and Web Intelligence10.1007/978-3-031-21047-1_1(3-17)Online publication date: 28-Nov-2022
https://dl.acm.org/doi/10.1007/978-3-031-21047-1_1

View Options

Get Access

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Media

Figures

Other

Tables

View Issue’s Table of Contents