Gradual Machine Learning for Entity Resolution

Authors: Xin Liu, Zhanhuai Li

WWW '19: The World Wide Web Conference

Pages 3526 - 3530

https://doi.org/10.1145/3308558.3314121

Published: 13 May 2019

Abstract

Entity resolution (ER) is usually treated as a classification problem, and it can be very challenging on real data due to the prevalence of dirty values. State-of-the-art ER solutions are built on a variety of learning models (most notably deep neural networks), which require large amounts of accurately labeled training data. Unfortunately, high-quality labeled data usually requires expensive manual work and is therefore not readily available in many real scenarios. In this demo, we propose a novel learning paradigm for ER, called gradual machine learning, which aims to enable effective machine labeling without requiring manual labeling effort. It begins with the easy instances in a task, which the machine can label automatically with high accuracy, and then gradually labels more challenging instances based on iterative factor graph inference. The hard instances are labeled in small stages, based on the estimated evidential certainty provided by the already labeled easier instances. Our extensive experiments on real data show that the proposed approach performs considerably better than its unsupervised alternatives and is highly competitive with state-of-the-art supervised techniques. Using ER as a test case, we demonstrate that gradual machine learning is a promising paradigm, potentially applicable to other challenging classification tasks that require extensive labeling effort.

Video: https://youtu.be/99bA9aamsgk
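The labeling order described in the abstract (easy instances first, then harder ones in small stages) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes each candidate record pair is summarized by a single similarity score, and it replaces iterative factor graph inference with a much simpler kernel-weighted vote over already labeled pairs. Binary Shannon entropy stands in for evidential certainty. The function names, the thresholds `easy_low` and `easy_high`, and the `bandwidth` parameter are all hypothetical.

```python
import math


def entropy(p):
    """Binary Shannon entropy in bits; 0 means fully certain."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def gradual_label(similarities, easy_low=0.2, easy_high=0.8, bandwidth=0.1):
    """Label pairs as match (1) or non-match (0), most certain first.

    similarities: one score in [0, 1] per candidate record pair.
    The thresholds and bandwidth are hypothetical illustration values.
    """
    labels = {}
    # Stage 1: easy instances are labeled directly by the machine.
    for i, s in enumerate(similarities):
        if s >= easy_high:
            labels[i] = 1
        elif s <= easy_low:
            labels[i] = 0
    if not labels:
        raise ValueError("no easy instances to start from")

    order = []  # order in which the hard instances get labeled
    unlabeled = [i for i in range(len(similarities)) if i not in labels]
    # Stage 2: repeatedly label the hard instance with the lowest entropy.
    while unlabeled:
        best = None
        for i in unlabeled:
            # Assumed stand-in for factor graph inference: a vote of the
            # labeled pairs, weighted by closeness in similarity score.
            num = den = 0.0
            for j, y in labels.items():
                diff = (similarities[i] - similarities[j]) / bandwidth
                w = math.exp(-diff * diff)
                num += w * y
                den += w
            p = num / den if den > 0 else 0.5
            h = entropy(p)
            if best is None or h < best[0]:
                best = (h, i, p)
        _, i, p = best
        labels[i] = 1 if p >= 0.5 else 0
        order.append(i)
        unlabeled.remove(i)
    return labels, order
```

Calling `gradual_label([0.95, 0.05, 0.7, 0.3, 0.9, 0.1, 0.6])` labels the four extreme scores in the first stage and then works through the remaining three pairs. Each newly labeled pair becomes evidence for the pairs still waiting, which is the property the paradigm relies on.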

Information

Published In

WWW '19: The World Wide Web Conference

May 2019

3620 pages

ISBN: 9781450366748

DOI: 10.1145/3308558

Editors: Ling Liu (Georgia Tech, USA), Ryen White (Microsoft Research, USA)

In-Cooperation

IW3C2: International World Wide Web Conference Committee

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

WWW '19: The Web Conference

May 13-17, 2019

San Francisco, CA, USA

Acceptance Rates

Overall Acceptance Rate 1,899 of 8,196 submissions, 23%
