DOI: 10.1145/3308558.3314121
research-article

Gradual Machine Learning for Entity Resolution

Published: 13 May 2019

Abstract

Usually treated as a classification problem, entity resolution (ER) can be very challenging on real data due to the prevalence of dirty values. State-of-the-art ER solutions are built on a variety of learning models (most notably deep neural networks), which require large amounts of accurately labeled training data. Unfortunately, high-quality labeled data usually requires expensive manual work and is therefore not readily available in many real scenarios. In this demo, we propose a novel learning paradigm for ER, called gradual machine learning, which aims to enable effective machine labeling without any manual labeling effort. It begins with the easy instances in a task, which the machine can label automatically with high accuracy, and then gradually labels more challenging instances through iterative factor graph inference. The hard instances are labeled in small stages, based on the estimated evidential certainty provided by the easier instances already labeled. Our extensive experiments on real data show that the proposed approach performs considerably better than its unsupervised alternatives and is highly competitive with state-of-the-art supervised techniques. Using ER as a test case, we demonstrate that gradual machine learning is a promising paradigm potentially applicable to other challenging classification tasks that require extensive labeling effort.

Video: https://youtu.be/99bA9aamsgk
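The easy-first labeling loop described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the factor graph inference is replaced here by a simple stand-in that averages a pair's own similarity with the labels of its already-labeled neighbors. The function names, the `easy_low`/`easy_high` thresholds, and the `neighbors` structure are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List


def estimate_match_probability(pair: str,
                               similarity: Dict[str, float],
                               neighbors: Dict[str, List[str]],
                               labels: Dict[str, int]) -> float:
    """Stand-in for factor graph inference (an assumption, not the paper's model):
    blend the pair's own similarity with the mean label of its labeled neighbors."""
    evidence = [float(labels[n]) for n in neighbors.get(pair, []) if n in labels]
    if not evidence:
        return similarity[pair]
    return 0.5 * similarity[pair] + 0.5 * sum(evidence) / len(evidence)


def gradual_label(similarity: Dict[str, float],
                  neighbors: Dict[str, List[str]],
                  easy_low: float = 0.2,
                  easy_high: float = 0.8) -> Dict[str, int]:
    """Label record pairs as match (1) or non-match (0), easiest first."""
    labels: Dict[str, int] = {}

    # Stage 1: label the easy instances, i.e. pairs with extreme similarity.
    for pair, sim in similarity.items():
        if sim >= easy_high:
            labels[pair] = 1
        elif sim <= easy_low:
            labels[pair] = 0

    # Stage 2: repeatedly label the single unlabeled pair whose estimated
    # probability is farthest from 0.5 (the most certain one), then re-estimate.
    unlabeled = set(similarity) - set(labels)
    while unlabeled:
        best_pair, best_prob = None, 0.5
        for pair in sorted(unlabeled):  # sorted for deterministic tie-breaking
            prob = estimate_match_probability(pair, similarity, neighbors, labels)
            if best_pair is None or abs(prob - 0.5) > abs(best_prob - 0.5):
                best_pair, best_prob = pair, prob
        labels[best_pair] = 1 if best_prob >= 0.5 else 0
        unlabeled.remove(best_pair)

    return labels


# Hypothetical toy input: four candidate record pairs and their neighbor links.
example_similarity = {"a": 0.95, "b": 0.05, "c": 0.6, "d": 0.45}
example_neighbors = {"c": ["a"], "d": ["b", "c"]}
example_labels = gradual_label(example_similarity, example_neighbors)
```

In this toy input, pairs `a` and `b` are labeled in the first stage because their similarities are extreme. Pair `c` is labeled next because its labeled neighbor `a` makes it the more certain of the two remaining pairs, and `d` is labeled last using evidence from both `b` and `c`.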


Published In

WWW '19: The World Wide Web Conference
May 2019
3620 pages
ISBN: 9781450366748
DOI: 10.1145/3308558
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

In-Cooperation

  • IW3C2: International World Wide Web Conference Committee

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. entity resolution
  2. gradual machine learning
  3. unsupervised learning

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

WWW '19: The Web Conference
May 13 - 17, 2019
San Francisco, CA, USA

Acceptance Rates

Overall Acceptance Rate 1,899 of 8,196 submissions, 23%


Cited By

  • (2024) Few-shot image classification based on gradual machine learning. Expert Systems with Applications 255 (124676). DOI: 10.1016/j.eswa.2024.124676. Online publication date: Dec-2024
  • (2024) Renaissance of Fuzzy and Fast Matching Entity with DSHS Algorithm. SN Computer Science 5:6. DOI: 10.1007/s42979-024-03093-9. Online publication date: 29-Jul-2024
  • (2024) ERABQS: entity resolution based on active machine learning and balancing query strategy. Journal of Intelligent Information Systems. DOI: 10.1007/s10844-024-00853-0. Online publication date: 26-Mar-2024
  • (2023) MixER: linear interpolation of latent space for entity resolution. Complex & Intelligent Systems 10:1 (3-22). DOI: 10.1007/s40747-023-01018-2. Online publication date: 14-Mar-2023
  • (2023) Transformer-based Denoising Adversarial Variational Entity Resolution. Journal of Intelligent Information Systems 61:2 (631-650). DOI: 10.1007/s10844-022-00773-x. Online publication date: 17-Apr-2023
  • (2022) Exploring the use of topological data analysis to automatically detect data quality faults. Frontiers in Big Data 5. DOI: 10.3389/fdata.2022.931398. Online publication date: 5-Dec-2022
  • (2022) Gradual Machine Learning for Entity Resolution. IEEE Transactions on Knowledge and Data Engineering 34:4 (1803-1814). DOI: 10.1109/TKDE.2020.3006142. Online publication date: 1-Apr-2022
  • (2021) Attention-Enhanced Gradual Machine Learning for Entity Resolution. IEEE Intelligent Systems 36:6 (71-79). DOI: 10.1109/MIS.2021.3077265. Online publication date: 1-Nov-2021
  • (2021) Aspect-level sentiment analysis based on gradual machine learning. Knowledge-Based Systems 212 (106509). DOI: 10.1016/j.knosys.2020.106509. Online publication date: Jan-2021
  • (2020) Record linkage of banks and municipalities through multiple criteria and neural networks. PeerJ Computer Science 6 (e258). DOI: 10.7717/peerj-cs.258. Online publication date: 24-Feb-2020
