
Question selection for crowd entity resolution

Published: 01 April 2013

    Abstract

    We study the problem of enhancing Entity Resolution (ER) with the help of crowdsourcing. ER is the problem of clustering records that refer to the same real-world entity, and it can be extremely difficult for computer algorithms alone. For example, figuring out which images refer to the same person can be hard for computers but easy for humans. We therefore resolve records with crowdsourcing: we ask humans questions in order to guide ER into producing accurate results. Since human work is costly, our goal is to ask as few questions as possible. We propose a probabilistic framework for ER that can be used to estimate how much ER accuracy we obtain by asking each question, and to select the question with the highest expected accuracy. Computing the expected accuracy is #P-hard, so we propose approximation techniques for efficient computation. We evaluate our best-question algorithms on real and synthetic datasets and demonstrate how we can obtain high ER accuracy while significantly reducing the number of questions asked to humans.
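The question-selection idea in the abstract can be illustrated with a toy sketch. This is not the paper's algorithm: it assumes that pairwise match probabilities are independent, that ER is a simple transitive closure over pairs with probability of at least 0.5, that accuracy is pairwise F1, and that the crowd always answers correctly. Expected accuracy is approximated by Monte Carlo sampling of possible worlds, one plausible stand-in for the approximation techniques the abstract mentions. All function names below are hypothetical.

```python
import itertools
import random


def clusters(records, matched_pairs):
    """Transitive closure of matched pairs, as a set of same-cluster pairs."""
    parent = {r: r for r in records}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in matched_pairs:
        parent[find(a)] = find(b)
    return {(a, b) for a, b in itertools.combinations(records, 2)
            if find(a) == find(b)}


def f1(predicted, truth):
    """Pairwise F1 between two sets of same-cluster pairs."""
    if not predicted and not truth:
        return 1.0
    overlap = len(predicted & truth)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(truth)
    return 2 * precision * recall / (precision + recall)


def expected_accuracy(records, probs, rng, samples=200):
    """Monte Carlo estimate of the ER output's F1 against sampled worlds."""
    # Assumed ER algorithm: threshold at 0.5, then transitive closure.
    output = clusters(records, [pair for pair, p in probs.items() if p >= 0.5])
    total = 0.0
    for _ in range(samples):
        # Assumed world model: each pair matches independently with prob. p.
        world = clusters(records,
                         [pair for pair, p in probs.items() if rng.random() < p])
        total += f1(output, world)
    return total / samples


def best_question(records, probs, seed=0, samples=200):
    """Pick the pair whose (assumed correct) answer maximizes expected F1."""
    best_pair, best_score = None, -1.0
    for pair, p in probs.items():
        # A "yes" answer fixes the pair's probability to 1, a "no" to 0.
        yes = expected_accuracy(records, {**probs, pair: 1.0},
                                random.Random(seed), samples)
        no = expected_accuracy(records, {**probs, pair: 0.0},
                               random.Random(seed), samples)
        score = p * yes + (1 - p) * no
        if score > best_score:
            best_pair, best_score = pair, score
    return best_pair, best_score


if __name__ == "__main__":
    records = ["a", "b", "c", "d"]
    probs = {("a", "b"): 0.95, ("b", "c"): 0.5, ("c", "d"): 0.05}
    print(best_question(records, probs))
```

In this toy setting the sketch tends to prefer the most uncertain pair, since resolving it changes the ER output the most. A real system would also need to account for noisy human answers and a more realistic ER algorithm.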



      Published In

      Proceedings of the VLDB Endowment, Volume 6, Issue 6
      April 2013
      144 pages

      Publisher

      VLDB Endowment

      Publication History

      Published: 01 April 2013
      Published in PVLDB Volume 6, Issue 6

      Cited By

      • (2022) Noisy Interactive Graph Search. Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 231-240. DOI: 10.1145/3534678.3539267. Online publication date: 14-Aug-2022
      • (2022) Hierarchical Entity Resolution using an Oracle. Proceedings of the 2022 International Conference on Management of Data, pages 414-428. DOI: 10.1145/3514221.3526147. Online publication date: 10-Jun-2022
      • (2021) Budget constrained interactive search for multiple targets. Proceedings of the VLDB Endowment, 14(6):890-902. DOI: 10.14778/3447689.3447694. Online publication date: 12-Apr-2021
      • (2021) GNEM: A Generic One-to-Set Neural Entity Matching Framework. Proceedings of the Web Conference 2021, pages 1686-1694. DOI: 10.1145/3442381.3450119. Online publication date: 19-Apr-2021
      • (2021) Neural Networks for Entity Matching: A Survey. ACM Transactions on Knowledge Discovery from Data, 15(3):1-37. DOI: 10.1145/3442200. Online publication date: 21-Apr-2021
      • (2021) Adaptive algorithms for crowd-aided categorization. The VLDB Journal — The International Journal on Very Large Data Bases, 31(6):1311-1337. DOI: 10.1007/s00778-021-00685-2. Online publication date: 13-Aug-2021
      • (2020) Efficient algorithms for crowd-aided categorization. Proceedings of the VLDB Endowment, 13(8):1221-1233. DOI: 10.14778/3389133.3389139. Online publication date: 3-May-2020
      • (2020) An Overview of End-to-End Entity Resolution for Big Data. ACM Computing Surveys, 53(6):1-42. DOI: 10.1145/3418896. Online publication date: 6-Dec-2020
      • (2020) Towards Interpretable and Learnable Risk Analysis for Entity Resolution. Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, pages 1165-1180. DOI: 10.1145/3318464.3380572. Online publication date: 11-Jun-2020
      • (2020) Effective and efficient crowd-assisted similarity retrieval of medical images in resource-constraint Mobile telemedicine systems. Multimedia Tools and Applications, 79(27-28):19893-19923. DOI: 10.1007/s11042-020-08755-3. Online publication date: 1-Jul-2020
