research-article

Crowdsourcing algorithms for entity resolution

Published: 01 August 2014

    Abstract

    In this paper, we study a hybrid human-machine approach for solving the problem of Entity Resolution (ER). The goal of ER is to identify all records in a database that refer to the same underlying entity and are therefore duplicates of each other. Our input is a graph over all the records in a database, where each edge has a probability denoting our prior belief (based on Machine Learning models) that the pair of records represented by that edge are duplicates. Our objective is to resolve all the duplicates by asking humans to verify the equality of a subset of edges, leveraging the transitivity of the equality relation to infer the remaining edges (e.g., a = c can be inferred given a = b and b = c). We consider the problem of designing optimal strategies for asking questions to humans that minimize the expected number of questions asked. Using our theoretical framework, we analyze several strategies and show that a strategy claimed as "optimal" for this problem in a recent work can perform arbitrarily badly in theory. We propose alternate strategies with theoretical guarantees. Using both public datasets and the production system at Facebook, we show that our techniques are effective in practice.
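
    The role of transitivity in saving questions can be illustrated with a minimal Python sketch. It is not one of the strategies analyzed in the paper: the function name `resolve`, the `prob` dictionary, the `oracle` callback, and the choice to ask about pairs in decreasing order of prior probability are all assumptions made for illustration. The sketch also assumes the human answers are always correct. It uses a union-find structure to skip any pair whose answer already follows from earlier answers, either positively (a = b and b = c give a = c) or negatively (a = b and b ≠ c give a ≠ c).

```python
from itertools import combinations


def resolve(records, prob, oracle):
    """Cluster `records`, asking `oracle(a, b)` only when the answer for the
    pair cannot be inferred from earlier answers.

    `prob` maps frozenset({a, b}) to a prior duplicate probability
    (hypothetical input format). Returns (clusters, questions_asked).
    """
    parent = {r: r for r in records}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    distinct = set()  # frozensets of two cluster roots known to differ
    questions = 0

    # Illustrative ordering only: most likely duplicates first.
    pairs = sorted(
        combinations(records, 2),
        key=lambda p: prob.get(frozenset(p), 0.0),
        reverse=True,
    )
    for a, b in pairs:
        ra, rb = find(a), find(b)
        if ra == rb:
            continue  # equality inferred by transitivity
        if frozenset((ra, rb)) in distinct:
            continue  # inequality inferred from earlier answers
        questions += 1
        if oracle(a, b):
            parent[rb] = ra  # merge the two clusters under root ra
            # Re-point known inequalities from the old root rb to ra.
            distinct = {
                frozenset(ra if r == rb else r for r in pair)
                for pair in distinct
            }
        else:
            distinct.add(frozenset((ra, rb)))

    clusters = {}
    for r in records:
        clusters.setdefault(find(r), set()).add(r)
    return list(clusters.values()), questions


# Toy example: a, b, c are the same entity; d is different.
truth = {"a": 1, "b": 1, "c": 1, "d": 2}
prob = {
    frozenset("ab"): 0.9, frozenset("bc"): 0.8, frozenset("ac"): 0.7,
    frozenset("cd"): 0.3, frozenset("ad"): 0.2, frozenset("bd"): 0.1,
}
clusters, asked = resolve(list("abcd"), prob, lambda x, y: truth[x] == truth[y])
print(sorted(sorted(c) for c in clusters), asked)
```

    In this toy run only three of the six pairs are put to the oracle (a-b, b-c, and c-d); the remaining three are inferred. The order in which pairs are asked determines how many questions are needed, which is the design question the abstract raises.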




    Published In

    Proceedings of the VLDB Endowment  Volume 7, Issue 12
    August 2014
    296 pages
    ISSN: 2150-8097

    Publisher

    VLDB Endowment



    Cited By

    • (2024) Alfa: active learning for graph neural network-based semantic schema alignment. The VLDB Journal — The International Journal on Very Large Data Bases, 33:4 (981-1011). DOI: 10.1007/s00778-023-00822-z. Online publication date: 1-Jul-2024
    • (2023) A Framework to Evaluate the Quality of Integrated Datasets. ACM SIGAPP Applied Computing Review, 22:4 (5-23). DOI: 10.1145/3584014.3584015. Online publication date: 10-Feb-2023
    • (2022) Noisy Interactive Graph Search. Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (231-240). DOI: 10.1145/3534678.3539267. Online publication date: 14-Aug-2022
    • (2022) Hierarchical Entity Resolution using an Oracle. Proceedings of the 2022 International Conference on Management of Data (414-428). DOI: 10.1145/3514221.3526147. Online publication date: 10-Jun-2022
    • (2022) Parallel tensor factorization for relational learning. Neural Computing and Applications, 34:11 (8455-8464). DOI: 10.1007/s00521-021-05692-6. Online publication date: 1-Jun-2022
    • (2021) Active clustering for labeling training data. Proceedings of the 35th International Conference on Neural Information Processing Systems (8469-8480). DOI: 10.5555/3540261.3540909. Online publication date: 6-Dec-2021
    • (2021) Fuzzy clustering with similarity queries. Proceedings of the 35th International Conference on Neural Information Processing Systems (789-801). DOI: 10.5555/3540261.3540322. Online publication date: 6-Dec-2021
    • (2021) How to design robust algorithms using noisy comparison Oracle. Proceedings of the VLDB Endowment, 14:10 (1703-1716). DOI: 10.14778/3467861.3467862. Online publication date: 26-Oct-2021
    • (2021) Budget constrained interactive search for multiple targets. Proceedings of the VLDB Endowment, 14:6 (890-902). DOI: 10.14778/3447689.3447694. Online publication date: 12-Apr-2021
    • (2021) Learning to Generate Fair Clusters from Demonstrations. Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society (491-501). DOI: 10.1145/3461702.3462558. Online publication date: 21-Jul-2021
