Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
skip to main content
research-article

CrowdER: crowdsourcing entity resolution

Published: 01 July 2012 Publication History
  • Get Citation Alerts
  • Abstract

    Entity resolution is central to data integration and data cleaning. Algorithmic approaches have been improving in quality, but remain far from perfect. Crowdsourcing platforms offer a more accurate but expensive (and slow) way to bring human insight into the process. Previous work has proposed batching verification tasks for presentation to human workers but even with batching, a human-only approach is infeasible for data sets of even moderate size, due to the large numbers of matches to be tested. Instead, we propose a hybrid human-machine approach in which machines are used to do an initial, coarse pass over all the data, and people are used to verify only the most likely matching pairs. We show that for such a hybrid system, generating the minimum number of verification tasks of a given size is NP-Hard, but we develop a novel two-tiered heuristic approach for creating batched tasks. We describe this method, and present the results of extensive experiments on real data sets using a popular crowdsourcing platform. The experiments show that our hybrid approach achieves both good efficiency and high accuracy compared to machine-only or human-only alternatives.

    References

    [1]
    A. Arasu, M. Götz, and R. Kaushik. On active learning of record matching packages. In SIGMOD Conference, pages 783--794, 2010.
    [2]
    R. J. Bayardo, Y. Ma, and R. Srikant. Scaling up all pairs similarity search. In WWW, pages 131--140, 2007.
    [3]
    M. S. Bernstein, G. Little, R. C. Miller, B. Hartmann, M. S. Ackerman, D. R. Karger, D. Crowell, and K. Panovich. Soylent: a word processor with a crowd inside. In UIST, pages 313--322, 2010.
    [4]
    M. Bilenko and R. J. Mooney. Adaptive duplicate detection using learnable string similarity measures. In KDD, pages 39--48, 2003.
    [5]
    S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive operator for similarity joins in data cleaning. In ICDE, page 5, 2006.
    [6]
    P. Christen. Febrl: a freely available record linkage system with a graphical user interface. In HDKM, pages 17--25, 2008.
    [7]
    P. Christen. A survey of indexing techniques for scalable record linkage and deduplication. IEEE Trans. Knowl. Data Eng., 99(PrePrints), 2011.
    [8]
    V. Chvatal. A greedy heuristic for the set-covering problem. Mathematics of Operations Research, 4(3):pp. 233--235, 1979.
    [9]
    A. P. Dawid and A. M. Skene. Maximum likelihood estimation of observer error-rates using the em algorithm. Applied Statistics, 28(1):20--28, 1979.
    [10]
    A. Doan, R. Ramakrishnan, and A. Y. Halevy. Crowdsourcing systems on the world-wide web. Commun. ACM, 54(4):86--96, 2011.
    [11]
    A. K. Elmagarmid, P. G. Ipeirotis, and V. S. Verykios. Duplicate record detection: A survey. IEEE Trans. Knowl. Data Eng., 19(1):1--16, 2007.
    [12]
    A. Feng, M. J. Franklin, D. Kossmann, T. Kraska, S. Madden, S. Ramesh, A. Wang, and R. Xin. Crowddb: Query processing with the vldb crowd. PVLDB, 4(12):1387--1390, 2011.
    [13]
    M. J. Franklin, D. Kossmann, T. Kraska, S. Ramesh, and R. Xin. Crowddb: answering queries with crowdsourcing. In SIGMOD, pages 61--72, 2011.
    [14]
    P. C. Gilmore and R. E. Gomory. A linear programming approach to the cutting-stock problem. Operations Research, 9(6):849--859, 1961.
    [15]
    O. Goldschmidt, D. S. Hochbaum, C. A. J. Hurkens, and G. Yu. Approximation algorithms for the k-clique covering problem. SIAM J. Discrete Math., 9(3):492--509, 1996.
    [16]
    P. G. Ipeirotis, F. Provost, and J. Wang. Quality management on amazon mechanical turk. In Proceedings of the ACM SIGKDD Workshop on Human Computation, pages 64--67, 2010.
    [17]
    S. R. Jeffery, M. J. Franklin, and A. Y. Halevy. Pay-as-you-go user feedback for dataspace systems. In SIGMOD Conference, pages 847--860, 2008.
    [18]
    H. Köpcke, A. Thor, and E. Rahm. Evaluation of entity resolution approaches on real-world match problems. PVLDB, 3(1):484--493, 2010.
    [19]
    A. Marcus, E. Wu, D. R. Karger, S. Madden, and R. C. Miller. Human-powered sorts and joins. PVLDB, 5(1):13--24, 2011.
    [20]
    A. Marcus, E. Wu, S. Madden, and R. C. Miller. Crowdsourced databases: Query processing with people. In CIDR, pages 211--214, 2011.
    [21]
    R. McCann, W. Shen, and A. Doan. Matching schemas in online communities: A web 2.0 approach. In ICDE, pages 110--119, 2008.
    [22]
    A. Parameswaran, H. Park, H. Garcia-Molina, N. Polyzotis, and J. Widom. Deco: Declarative crowdsourcing. Technical report, Stanford University. http://ilpubs.stanford.edu:8090/1015/.
    [23]
    A. J. Quinn and B. B. Bederson. Human-machine hybrid computation. In Position paper for CHI 2011 Workshop On Crowdsourcing And Human Computation, 2011.
    [24]
    S. Sarawagi and A. Bhamidipaty. Interactive deduplication using active learning. In KDD, pages 269--278, 2002.
    [25]
    J. M. Valério and D. Carvalho. Exact solution of cutting stock problems using column generation and branch-and-bound. International Transactions in Operational Research, 5(1):35--44, 1998.
    [26]
    C. Xiao, W. Wang, X. Lin, J. X. Yu, and G. Wang. Efficient similarity joins for near-duplicate detection. ACM Trans. Database Syst., 36(3):15, 2011.
    [27]
    T. Yan, V. Kumar, and D. Ganesan. Crowdsearch: exploiting crowds for accurate real-time image search on mobile phones. In MobiSys, pages 77--90, 2010.

    Cited By

    View all
    • (2023)Crowdsourcing Truth Inference Based on Label Confidence ClusteringACM Transactions on Knowledge Discovery from Data10.1145/355654517:4(1-20)Online publication date: 24-Feb-2023
    • (2023)Unsupervised Graph-Based Entity Resolution for Complex EntitiesACM Transactions on Knowledge Discovery from Data10.1145/353301617:1(1-30)Online publication date: 20-Feb-2023
    • (2023)Effective entity matching with transformersThe VLDB Journal — The International Journal on Very Large Data Bases10.1007/s00778-023-00779-z32:6(1215-1235)Online publication date: 17-Jan-2023
    • Show More Cited By

    Recommendations

    Comments

    Information & Contributors

    Information

    Published In

    cover image Proceedings of the VLDB Endowment
    Proceedings of the VLDB Endowment  Volume 5, Issue 11
    July 2012
    608 pages

    Publisher

    VLDB Endowment

    Publication History

    Published: 01 July 2012
    Published in PVLDB Volume 5, Issue 11

    Qualifiers

    • Research-article

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)19
    • Downloads (Last 6 weeks)1

    Other Metrics

    Citations

    Cited By

    View all
    • (2023)Crowdsourcing Truth Inference Based on Label Confidence ClusteringACM Transactions on Knowledge Discovery from Data10.1145/355654517:4(1-20)Online publication date: 24-Feb-2023
    • (2023)Unsupervised Graph-Based Entity Resolution for Complex EntitiesACM Transactions on Knowledge Discovery from Data10.1145/353301617:1(1-30)Online publication date: 20-Feb-2023
    • (2023)Effective entity matching with transformersThe VLDB Journal — The International Journal on Very Large Data Bases10.1007/s00778-023-00779-z32:6(1215-1235)Online publication date: 17-Jan-2023
    • (2023)Entity Resolution Based on Pre-trained Language Models with Two AttentionsWeb and Big Data10.1007/978-981-97-2387-4_29(433-448)Online publication date: 6-Oct-2023
    • (2022)Can Foundation Models Wrangle Your Data?Proceedings of the VLDB Endowment10.14778/3574245.357425816:4(738-746)Online publication date: 1-Dec-2022
    • (2022)PromptEMProceedings of the VLDB Endowment10.14778/3565816.356583616:2(369-378)Online publication date: 1-Oct-2022
    • (2022)Deep indexed active learning for matching heterogeneous entity representationsProceedings of the VLDB Endowment10.14778/3485450.348545515:1(31-45)Online publication date: 14-Jan-2022
    • (2022)HumanALProceedings of the Workshop on Human-In-the-Loop Data Analytics10.1145/3546930.3547496(1-8)Online publication date: 12-Jun-2022
    • (2022)Noisy Interactive Graph SearchProceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining10.1145/3534678.3539267(231-240)Online publication date: 14-Aug-2022
    • (2022)Entity Resolution with Hierarchical Graph Attention NetworksProceedings of the 2022 International Conference on Management of Data10.1145/3514221.3517872(429-442)Online publication date: 10-Jun-2022
    • Show More Cited By

    View Options

    Get Access

    Login options

    Full Access

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media