DOI:10.1145/3132847.3132949
research-article

Active Learning for Large-Scale Entity Resolution

Published: 06 November 2017

    Abstract

    Entity resolution (ER) is the task of identifying different representations of the same real-world object across datasets. Designing and tuning ER algorithms is an error-prone, labor-intensive process, which can significantly benefit from data-driven, automated learning methods. Our focus is on "big data" scenarios where the primary challenges include 1) identifying, out of a potentially massive set, a small subset of informative examples to be labeled by the user, 2) using the labeled examples to efficiently learn ER algorithms that achieve both high precision and high recall, and 3) executing the learned algorithm to determine duplicates at scale. Recent work on learning ER algorithms has employed active learning to partially address the above challenges by aiming to learn ER rules in the form of conjunctions of matching predicates, under precision guarantees. While successful in learning a single rule, prior work has been less successful in learning multiple rules that are sufficiently different from each other, thus missing opportunities for improving recall. In this paper, we introduce an active learning system that learns, at scale, multiple rules each having significant coverage of the space of duplicates, thus leading to high recall in addition to high precision. We show the superiority of our system on real-world ER scenarios of sizes up to tens of millions of records, over state-of-the-art active learning methods that learn either rules or committees of statistical classifiers for ER, and even over sophisticated methods based on first-order probabilistic models.
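
The abstract describes ER rules as conjunctions of matching predicates and argues that several sufficiently different rules raise recall. The following is a toy sketch of that idea only, not the system introduced in the paper: the record fields, the predicates, the two rules, and the four labeled pairs are all hypothetical and chosen purely for illustration. A pair is declared a duplicate if any rule fires, and a rule fires only if all of its predicates hold.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, str]
Predicate = Callable[[Record, Record], bool]
Rule = List[Predicate]  # a rule is a conjunction of matching predicates


# Hypothetical matching predicates over hypothetical fields.
def same_last_name(a: Record, b: Record) -> bool:
    return a["last"].lower() == b["last"].lower()


def same_first_initial(a: Record, b: Record) -> bool:
    return a["first"][:1].lower() == b["first"][:1].lower()


def same_zip(a: Record, b: Record) -> bool:
    return a["zip"] == b["zip"]


def same_email(a: Record, b: Record) -> bool:
    # An empty email never counts as a match.
    return a["email"] != "" and a["email"].lower() == b["email"].lower()


def rule_matches(rule: Rule, a: Record, b: Record) -> bool:
    """A rule fires only if every predicate in it holds (conjunction)."""
    return all(pred(a, b) for pred in rule)


def any_rule_matches(rules: List[Rule], a: Record, b: Record) -> bool:
    """A pair is a predicted duplicate if at least one rule fires (disjunction)."""
    return any(rule_matches(rule, a, b) for rule in rules)


def precision_recall(
    rules: List[Rule],
    pairs: List[Tuple[Record, Record]],
    labels: List[bool],
) -> Tuple[float, float]:
    """Precision and recall of a rule set on labeled pairs."""
    predicted = [any_rule_matches(rules, a, b) for a, b in pairs]
    tp = sum(p and y for p, y in zip(predicted, labels))
    fp = sum(p and not y for p, y in zip(predicted, labels))
    fn = sum(not p and y for p, y in zip(predicted, labels))
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall


# Hypothetical records and labels (True means the pair is a real duplicate).
R1 = {"first": "Ann", "last": "Lee", "zip": "10001", "email": "ann@x.org"}
R2 = {"first": "A.", "last": "LEE", "zip": "10001", "email": ""}
R3 = {"first": "Ann", "last": "Li", "zip": "94040", "email": "ANN@x.org"}
R4 = {"first": "Bob", "last": "Lee", "zip": "10001", "email": "bob@x.org"}

PAIRS = [(R1, R2), (R1, R3), (R1, R4), (R2, R4)]
LABELS = [True, True, False, False]

RULE_NAME: Rule = [same_last_name, same_first_initial, same_zip]
RULE_EMAIL: Rule = [same_email]

if __name__ == "__main__":
    print(precision_recall([RULE_NAME], PAIRS, LABELS))
    print(precision_recall([RULE_NAME, RULE_EMAIL], PAIRS, LABELS))
```

On this toy data the name-based rule alone is precise but finds only one of the two duplicates; adding the email rule, which covers a different part of the duplicate space, recovers the second one without introducing false positives.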

    Published In

    CIKM '17: Proceedings of the 2017 ACM on Conference on Information and Knowledge Management
    November 2017
    2604 pages
    ISBN:9781450349185
    DOI:10.1145/3132847
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. entity resolution
    2. large-scale data cleansing

    Qualifiers

    • Research-article

    Conference

    CIKM '17
    Acceptance Rates

    CIKM '17 Paper Acceptance Rate 171 of 855 submissions, 20%;
    Overall Acceptance Rate 1,861 of 8,427 submissions, 22%
