
Active Learning for Large-Scale Entity Resolution

Authors:

Lucian Popa, and

Prithviraj Sen

CIKM '17: Proceedings of the 2017 ACM on Conference on Information and Knowledge Management

November 2017

Pages 1379 - 1388

https://doi.org/10.1145/3132847.3132949

Published: 06 November 2017

Abstract

Entity resolution (ER) is the task of identifying different representations of the same real-world object across datasets. Designing and tuning ER algorithms is an error-prone, labor-intensive process, which can significantly benefit from data-driven, automated learning methods. Our focus is on "big data" scenarios where the primary challenges include 1) identifying, out of a potentially massive set, a small subset of informative examples to be labeled by the user, 2) using the labeled examples to efficiently learn ER algorithms that achieve both high precision and high recall, and 3) executing the learned algorithm to determine duplicates at scale. Recent work on learning ER algorithms has employed active learning to partially address the above challenges by aiming to learn ER rules in the form of conjunctions of matching predicates, under precision guarantees. While successful in learning a single rule, prior work has been less successful in learning multiple rules that are sufficiently different from each other, thus missing opportunities for improving recall. In this paper, we introduce an active learning system that learns, at scale, multiple rules, each having significant coverage of the space of duplicates, thus leading to high recall in addition to high precision. We show the superiority of our system on real-world ER scenarios of sizes up to tens of millions of records, over state-of-the-art active learning methods that learn either rules or committees of statistical classifiers for ER, and even over sophisticated methods based on first-order probabilistic models.
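To make the rule form in the abstract concrete, the following is a minimal Python sketch of ER rules as conjunctions of matching predicates, with a rule set acting as a disjunction of rules. It is an illustration only and not the paper's system: the predicates (`same_field`, `jaccard_at_least`), the field names, the 0.6 threshold, and the toy records are all assumptions made for this example. It shows why one high-precision rule can miss duplicates that a second, sufficiently different rule covers.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Record = Dict[str, str]
Predicate = Callable[[Record, Record], bool]


def same_field(field: str) -> Predicate:
    """Hypothetical predicate: equal, non-empty values after normalization."""
    def pred(a: Record, b: Record) -> bool:
        va = a.get(field, "").strip().lower()
        vb = b.get(field, "").strip().lower()
        return va != "" and va == vb
    return pred


def jaccard_at_least(field: str, threshold: float) -> Predicate:
    """Hypothetical predicate: token-set Jaccard similarity >= threshold."""
    def pred(a: Record, b: Record) -> bool:
        ta = set(a.get(field, "").lower().split())
        tb = set(b.get(field, "").lower().split())
        if not ta or not tb:
            return False
        return len(ta & tb) / len(ta | tb) >= threshold
    return pred


@dataclass
class Rule:
    """One ER rule: a conjunction of matching predicates."""
    predicates: List[Predicate]

    def matches(self, a: Record, b: Record) -> bool:
        return all(p(a, b) for p in self.predicates)


def rule_set_matches(rules: List[Rule], a: Record, b: Record) -> bool:
    """A pair is declared a duplicate if any rule fires (disjunction)."""
    return any(r.matches(a, b) for r in rules)


def precision_recall(
    rules: List[Rule], labeled: List[Tuple[Record, Record, bool]]
) -> Tuple[float, float]:
    """Precision and recall of a rule set on labeled record pairs."""
    tp = fp = fn = 0
    for a, b, is_dup in labeled:
        predicted = rule_set_matches(rules, a, b)
        if predicted and is_dup:
            tp += 1
        elif predicted:
            fp += 1
        elif is_dup:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall


# Toy records and labels, invented for illustration.
r1 = {"name": "Jane A. Doe", "email": "jdoe@example.com", "city": "Boston"}
r2 = {"name": "Jane Doe", "email": "JDoe@example.com", "city": "Boston"}
r3 = {"name": "Jane A. Doe", "email": "", "city": "boston"}
r4 = {"name": "John Roe", "email": "jroe@example.com", "city": "Boston"}
labeled = [(r1, r2, True), (r1, r3, True), (r1, r4, False), (r2, r3, True)]

rule_email = Rule([same_field("email")])
rule_name_city = Rule([jaccard_at_least("name", 0.6), same_field("city")])

single = precision_recall([rule_email], labeled)
both = precision_recall([rule_email, rule_name_city], labeled)
print("email rule only:", single)
print("both rules:", both)
```

On this toy data the email rule alone makes no false matches but finds only one of the three duplicate pairs, while adding the name-and-city rule recovers the pairs with a missing email. That is the recall gain from multiple, sufficiently different rules that the abstract describes.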

Index Terms

Active Learning for Large-Scale Entity Resolution
1. Information systems
  1. Data management systems
    1. Information integration
      1. Entity resolution
  2. Information systems applications
    1. Data mining
      1. Data cleaning

Published In

CIKM '17: Proceedings of the 2017 ACM on Conference on Information and Knowledge Management

November 2017

2604 pages

ISBN:9781450349185

DOI:10.1145/3132847

General Chairs:
Ee-Peng Lim, Singapore Management University, Singapore
Marianne Winslett, University of Illinois at Urbana-Champaign, USA, and Advanced Digital Sciences Center, Singapore

Program Chairs:
Mark Sanderson, RMIT, Australia
Ada Fu, Chinese University of Hong Kong, Hong Kong
Jimeng Sun, Georgia Tech, USA
Shane Culpepper, RMIT, Australia
Eric Lo, Chinese University of Hong Kong, Hong Kong
Joyce Ho, Emory University, USA
Debora Donato, Mix Tech, Inc., USA
Rakesh Agrawal, Data Insights Laboratories, USA
Yu Zheng, Microsoft Research Asia, China
Carlos Castillo, Qatar Computing Research Institute, Qatar
Aixin Sun, Nanyang Technological University, Singapore
Vincent S. Tseng, National Cheng Kung University, Taiwan
Chenliang Li, Wuhan University, China

Copyright © 2017 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

CIKM '17

CIKM '17: ACM Conference on Information and Knowledge Management

November 6 - 10, 2017

Singapore, Singapore

Acceptance Rates

CIKM '17 Paper Acceptance Rate 171 of 855 submissions, 20%

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%
