CrowdER: crowdsourcing entity resolution

Authors:

Michael J. Franklin, and

Jianhua Feng

Proceedings of the VLDB Endowment, Volume 5, Issue 11

Pages 1483 - 1494

https://doi.org/10.14778/2350229.2350263

Published: 01 July 2012

Abstract

Entity resolution is central to data integration and data cleaning. Algorithmic approaches have been improving in quality, but remain far from perfect. Crowdsourcing platforms offer a more accurate but expensive (and slow) way to bring human insight into the process. Previous work has proposed batching verification tasks for presentation to human workers but even with batching, a human-only approach is infeasible for data sets of even moderate size, due to the large numbers of matches to be tested. Instead, we propose a hybrid human-machine approach in which machines are used to do an initial, coarse pass over all the data, and people are used to verify only the most likely matching pairs. We show that for such a hybrid system, generating the minimum number of verification tasks of a given size is NP-Hard, but we develop a novel two-tiered heuristic approach for creating batched tasks. We describe this method, and present the results of extensive experiments on real data sets using a popular crowdsourcing platform. The experiments show that our hybrid approach achieves both good efficiency and high accuracy compared to machine-only or human-only alternatives.
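
The hybrid workflow described in the abstract can be illustrated with a minimal sketch. It is not the paper's method: it assumes token-set Jaccard similarity as the machine's likelihood score, and it uses naive fixed-size chunking as a stand-in for the two-tiered batching heuristic. The names `jaccard`, `candidate_pairs`, `batch_pairs`, and `threshold`, as well as the sample records, are hypothetical.

```python
from itertools import combinations


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the lowercase token sets of two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def candidate_pairs(records, threshold):
    """Machine pass (assumed scoring): score every pair, keep the likely ones."""
    scored = [
        (i, j, jaccard(records[i], records[j]))
        for i, j in combinations(range(len(records)), 2)
    ]
    kept = [p for p in scored if p[2] >= threshold]
    # Most likely matches first.
    return sorted(kept, key=lambda p: p[2], reverse=True)


def batch_pairs(pairs, batch_size):
    """Human pass (naive stand-in): chunk surviving pairs into fixed-size tasks."""
    return [pairs[k:k + batch_size] for k in range(0, len(pairs), batch_size)]


# Hypothetical product records, for illustration only.
records = [
    "iPad 2 16GB WiFi White",
    "iPad 2nd generation 16GB WiFi White",
    "iPhone 4 16GB White",
    "Apple iPhone 4 16GB White",
    "Kindle Fire 8GB",
]

pairs = candidate_pairs(records, threshold=0.5)
tasks = batch_pairs(pairs, batch_size=2)

for task_id, task in enumerate(tasks):
    for i, j, score in task:
        print(f"task {task_id}: verify ({records[i]!r}, {records[j]!r}) score={score:.2f}")
```

With five records there are ten possible pairs; under the assumed threshold of 0.5 only two survive the machine pass, so only those two would be sent to human workers. Lowering the threshold trades more verification work for a lower risk of missing true matches.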

Published In

Proceedings of the VLDB Endowment, Volume 5, Issue 11, July 2012, 608 pages

ISSN: 2150-8097

Publisher

VLDB Endowment
