Knowledge Transfer for Entity Resolution with Siamese Neural Networks

Authors:

Michael Loster,

Ioannis Koumarelas,

Felix Naumann

Journal of Data and Information Quality (JDIQ), Volume 13, Issue 1

Article No.: 2, Pages 1 - 25

https://doi.org/10.1145/3410157

Published: 13 January 2021

Abstract

The integration of multiple data sources is a common problem in a large variety of applications. Traditionally, handcrafted similarity measures are used to discover, merge, and integrate multiple representations of the same entity (duplicates) into a large homogeneous collection of data. Often, these similarity measures do not cope well with the heterogeneity of the underlying dataset. In addition, domain experts are needed to manually design and configure such measures, which is time-consuming and requires extensive domain expertise.

We propose a deep Siamese neural network capable of learning a similarity measure that is tailored to the characteristics of a particular dataset. By exploiting the properties of deep learning methods, we eliminate the manual feature engineering process and thus considerably reduce the effort required for model construction. In addition, we show that it is possible to transfer knowledge acquired during the deduplication of one dataset to another, and thus significantly reduce the amount of data required to train a similarity measure. We evaluate our method on multiple datasets and compare it to state-of-the-art deduplication methods. Our approach outperforms competitors by up to +26 percent F-measure, depending on task and dataset. Moreover, knowledge transfer is not only feasible but, in our experiments, improved F-measure by up to +4.7 percent.
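To make the core idea of a Siamese similarity measure concrete, the following is a minimal sketch and not the architecture described in the article. It only illustrates the defining property: both records pass through the same encoder with the same weights, and the similarity is computed from the distance between the two embeddings. The hashed character-trigram input, the single `tanh` layer, the random untrained weights, and the `exp(-L1)` scoring function are all assumptions chosen for brevity; in a real system the shared weights would be trained on labeled duplicate and non-duplicate pairs.

```python
import math
import random

DIM = 16     # size of the hashed trigram input vector (assumed)
HIDDEN = 8   # size of the embedding produced by the encoder (assumed)


def trigram_features(text, dim=DIM):
    """Hash the character trigrams of a record into a fixed-size count vector."""
    padded = f"  {text.lower()} "
    vec = [0.0] * dim
    for i in range(len(padded) - 2):
        gram = padded[i:i + 3]
        # Deterministic hash (Python's built-in hash() is salted per process).
        index = sum(ord(c) * (31 ** k) for k, c in enumerate(gram)) % dim
        vec[index] += 1.0
    return vec


class SharedEncoder:
    """A single set of weights applied to BOTH inputs (the Siamese property)."""

    def __init__(self, seed=0):
        rng = random.Random(seed)
        # Untrained random weights; training would adjust these on labeled pairs.
        self.weights = [[rng.uniform(-0.5, 0.5) for _ in range(DIM)]
                        for _ in range(HIDDEN)]

    def encode(self, text):
        x = trigram_features(text)
        return [math.tanh(sum(w * v for w, v in zip(row, x)))
                for row in self.weights]


def similarity(encoder, record_a, record_b):
    """Map the L1 distance between the two embeddings into the range (0, 1]."""
    a = encoder.encode(record_a)
    b = encoder.encode(record_b)
    distance = sum(abs(p - q) for p, q in zip(a, b))
    return math.exp(-distance)


encoder = SharedEncoder()
print(similarity(encoder, "Acme Corp.", "Acme Corp."))        # identical records print 1.0
print(similarity(encoder, "Acme Corp.", "ACME Corporation"))  # some value in (0, 1]
```

Because the two branches share one `SharedEncoder`, the score is symmetric in its arguments, and identical records always receive the maximum score. Transferring knowledge between datasets would, under this sketch, amount to reusing the trained `weights` as the starting point for a second dataset.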

Index Terms

Knowledge Transfer for Entity Resolution with Siamese Neural Networks
1. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Multi-task learning
        1. Transfer learning
    2. Machine learning approaches
      1. Neural networks
2. Information systems
  1. Data management systems
    1. Information integration
      1. Deduplication
      2. Entity resolution

Published In

Journal of Data and Information Quality Volume 13, Issue 1

On the Horizon, On the Horizon and Experience Papers

March 2021

104 pages

ISSN:1936-1955

EISSN:1936-1963

DOI:10.1145/3446835

Editor:
Tiziana Catarci
Sapienza University of Rome, Rome, Italy

Copyright © 2021 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 13 January 2021

Accepted: 01 July 2020

Revised: 01 May 2020

Received: 01 December 2019

Published in JDIQ Volume 13, Issue 1

Qualifiers

Research-article
Research
Refereed

Article Metrics

Total Citations: 8
Total Downloads: 1,294
Downloads (last 12 months): 323
Downloads (last 6 weeks): 36

Reflects downloads up to 11 Feb 2025

Cited By

Sabeeh Hasan Allak A, Yi J, Al-Sabbagh H, and Chen L. 2025. Siamese Neural Networks in Unmanned Aerial Vehicle Target Tracking Process. IEEE Access 13 (2025), 24309--24322. Online publication date: 2025. https://doi.org/10.1109/ACCESS.2025.3536461

Ghosh A, Barman D, Sufian A, and Hameed I. 2024. Advancing Optical Character Recognition for Low-Resource Scripts: A Siamese Meta-Learning Approach With PSN Framework. IEEE Access 12 (2024), 189651--189666. Online publication date: 2024. https://doi.org/10.1109/ACCESS.2024.3509605

Buono F, Faggioli G, Paganelli M, Baraldi A, Guerra F, and Ferro N. 2023. A Framework to Evaluate the Quality of Integrated Datasets. ACM SIGAPP Applied Computing Review 22, 4 (2023), 5--23. Online publication date: 10-Feb-2023. https://dl.acm.org/doi/10.1145/3584014.3584015

Tudoreanu M. 2022. Exploring the use of topological data analysis to automatically detect data quality faults. Frontiers in Big Data 5 (2022). Online publication date: 5-Dec-2022. https://doi.org/10.3389/fdata.2022.931398

Wang P, Zeng X, Chen L, Ye F, Mao Y, Zhu J, and Gao Y. 2022. PromptEM. Proceedings of the VLDB Endowment 16, 2 (2022), 369--378. Online publication date: 1-Oct-2022. https://dl.acm.org/doi/10.14778/3565816.3565836

Traeger L, Behrend A, and Karabatis G. 2022. Inteplato: Generating Mappings of Heterogeneous Relational Schemas Using Unsupervised Learning. In 2022 International Conference on Computational Science and Computational Intelligence (CSCI). 426--431. Online publication date: Dec-2022. https://doi.org/10.1109/CSCI58124.2022.00083

Thirumuruganathan S, Li H, Tang N, Ouzzani M, Govind Y, Paulsen D, Fung G, and Doan A. 2021. Deep learning for blocking in entity matching. Proceedings of the VLDB Endowment 14, 11 (2021), 2459--2472. Online publication date: 1-Jul-2021. https://dl.acm.org/doi/10.14778/3476249.3476294

Kado Y, Hirokata T, Matsumura K, Wang X, and Yamasaki T. 2021. Entity Resolution of Japanese Apartment Property Information Using Neural Networks. In 2021 IEEE 4th International Conference on Multimedia Information Processing and Retrieval (MIPR). 277--282. Online publication date: Sep-2021. https://doi.org/10.1109/MIPR51284.2021.00052
