research-article
Open access

Knowledge Transfer for Entity Resolution with Siamese Neural Networks

Published: 13 January 2021

Abstract

The integration of multiple data sources is a common problem in a large variety of applications. Traditionally, handcrafted similarity measures are used to discover, merge, and integrate multiple representations of the same entity (duplicates) into a large homogeneous collection of data. Often, these similarity measures do not cope well with the heterogeneity of the underlying dataset. In addition, domain experts are needed to manually design and configure such measures, which is time-consuming and requires extensive domain expertise.
We propose a deep Siamese neural network capable of learning a similarity measure that is tailored to the characteristics of a particular dataset. By exploiting the properties of deep learning methods, we eliminate the manual feature engineering process and thus considerably reduce the effort required for model construction. In addition, we show that it is possible to transfer knowledge acquired during the deduplication of one dataset to another, and thus significantly reduce the amount of data required to train a similarity measure. We evaluate our method on multiple datasets and compare our approach to state-of-the-art deduplication methods. Our approach outperforms competitors by up to +26 percent F-measure, depending on task and dataset. In addition, we show that knowledge transfer is not only feasible, but in our experiments led to an improvement in F-measure of up to +4.7 percent.
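To make the idea of a learned, shareable similarity measure concrete, the following is a minimal sketch and not the architecture described in the article. It assumes a toy encoder (hashed character trigrams scaled by one learnable weight per dimension), a Euclidean distance between the two branch outputs, and a contrastive loss in the style of Hadsell et al. (2006). The names `featurize`, `distance`, `contrastive_loss`, and `train`, and all hyperparameters, are hypothetical. The `weights` argument of `train` illustrates one plausible way to transfer knowledge: training on a second dataset starts from weights learned on the first.

```python
import math
import zlib

DIM = 64  # size of the hashed feature space (arbitrary choice for this sketch)


def featurize(text, dim=DIM):
    """Hash character trigrams of a record string into an L2-normalised count vector."""
    padded = f"  {text.lower()} "
    vec = [0.0] * dim
    for i in range(len(padded) - 2):
        # crc32 is used because it is deterministic across Python processes
        bucket = zlib.crc32(padded[i:i + 3].encode("utf-8")) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def distance(weights, a, b):
    """Euclidean distance between the two branch outputs; both branches share `weights`."""
    xa, xb = featurize(a, len(weights)), featurize(b, len(weights))
    return math.sqrt(sum((w * (p - q)) ** 2 for w, p, q in zip(weights, xa, xb)))


def contrastive_loss(dist, is_duplicate, margin=1.0):
    """Pull duplicates together; push non-duplicates until they are `margin` apart."""
    if is_duplicate:
        return dist ** 2
    return max(0.0, margin - dist) ** 2


def train(pairs, weights=None, dim=DIM, epochs=50, lr=0.1, margin=1.0):
    """Fit the shared weights by gradient descent on the contrastive loss.

    Passing `weights` learned on another dataset is the (assumed) transfer step:
    training continues from them instead of starting from all-ones.
    """
    w = list(weights) if weights is not None else [1.0] * dim
    for _ in range(epochs):
        for a, b, is_duplicate in pairs:
            xa, xb = featurize(a, len(w)), featurize(b, len(w))
            sq = [(p - q) ** 2 for p, q in zip(xa, xb)]
            dist = math.sqrt(sum(wi * wi * s for wi, s in zip(w, sq)))
            if is_duplicate:
                scale = 2.0  # d(dist^2)/d(w_i) = 2 * w_i * sq_i
            elif 0.0 < dist < margin:
                scale = -2.0 * (margin - dist) / dist  # gradient of (margin - dist)^2
            else:
                continue  # non-duplicate already outside the margin: zero gradient
            w = [wi - lr * scale * wi * s for wi, s in zip(w, sq)]
    return w


if __name__ == "__main__":
    # Made-up labelled pairs: (record A, record B, is_duplicate)
    source_pairs = [
        ("john smith", "jon smith", True),
        ("john smith", "mary jones", False),
    ]
    target_pairs = [("acme corp.", "acme corporation", True)]
    source_weights = train(source_pairs)
    target_weights = train(target_pairs, weights=source_weights, epochs=10)
    print(distance(target_weights, "acme corp.", "acme corporation"))
```

A real Siamese network would replace the per-dimension weights with a deep encoder, but the structure stays the same: one set of parameters serves both inputs, and only the labelled pairs decide what counts as similar.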


Published In
Journal of Data and Information Quality  Volume 13, Issue 1
On the Horizon, On the Horizon and Experience Papers
March 2021
104 pages
ISSN:1936-1955
EISSN:1936-1963
DOI:10.1145/3446835
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 13 January 2021
Accepted: 01 July 2020
Revised: 01 May 2020
Received: 01 December 2019
Published in JDIQ Volume 13, Issue 1

Author Tags

  1. Entity resolution
  2. data quality
  3. duplicate detection
  4. metric learning
  5. neural networks
  6. similarity learning
  7. transfer learning

Qualifiers

  • Research-article
  • Research
  • Refereed

Cited By

  • (2025) Siamese Neural Networks in Unmanned Aerial Vehicle Target Tracking Process. IEEE Access 13 (24309-24322). DOI: 10.1109/ACCESS.2025.3536461. Online publication date: 2025
  • (2024) Advancing Optical Character Recognition for Low-Resource Scripts: A Siamese Meta-Learning Approach With PSN Framework. IEEE Access 12 (189651-189666). DOI: 10.1109/ACCESS.2024.3509605. Online publication date: 2024
  • (2023) A Framework to Evaluate the Quality of Integrated Datasets. ACM SIGAPP Applied Computing Review 22:4 (5-23). DOI: 10.1145/3584014.3584015. Online publication date: 10-Feb-2023
  • (2022) Exploring the use of topological data analysis to automatically detect data quality faults. Frontiers in Big Data 5. DOI: 10.3389/fdata.2022.931398. Online publication date: 5-Dec-2022
  • (2022) PromptEM. Proceedings of the VLDB Endowment 16:2 (369-378). DOI: 10.14778/3565816.3565836. Online publication date: 1-Oct-2022
  • (2022) Inteplato: Generating Mappings of Heterogeneous Relational Schemas Using Unsupervised Learning. 2022 International Conference on Computational Science and Computational Intelligence (CSCI) (426-431). DOI: 10.1109/CSCI58124.2022.00083. Online publication date: Dec-2022
  • (2021) Deep learning for blocking in entity matching. Proceedings of the VLDB Endowment 14:11 (2459-2472). DOI: 10.14778/3476249.3476294. Online publication date: 1-Jul-2021
  • (2021) Entity Resolution of Japanese Apartment Property Information Using Neural Networks. 2021 IEEE 4th International Conference on Multimedia Information Processing and Retrieval (MIPR) (277-282). DOI: 10.1109/MIPR51284.2021.00052. Online publication date: Sep-2021
