Question selection for crowd entity resolution

Authors: Steven Euijong Whang, Peter Lofgren, and Hector Garcia-Molina

Proceedings of the VLDB Endowment, Volume 6, Issue 6

Pages 349 - 360

https://doi.org/10.14778/2536336.2536337

Published: 01 April 2013

Abstract

We study the problem of enhancing Entity Resolution (ER) with the help of crowdsourcing. ER is the problem of clustering records that refer to the same real-world entity, and it can be extremely difficult for computer algorithms alone. For example, figuring out which images refer to the same person can be hard for computers but easy for humans. We study the problem of resolving records with crowdsourcing, where we ask humans questions in order to guide ER into producing accurate results. Since human work is costly, our goal is to ask as few questions as possible. We propose a probabilistic framework for ER that can be used to estimate how much ER accuracy we obtain by asking each question and to select the question with the highest expected accuracy. Computing the expected accuracy is #P-hard, so we propose approximation techniques for efficient computation. We evaluate our best-question algorithms on real and synthetic datasets and demonstrate how we can obtain high ER accuracy while significantly reducing the number of questions asked to humans.
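
To make the selection idea in the abstract concrete, here is a minimal brute-force sketch in Python. It is not the authors' algorithm, and every modeling choice in it is an assumption made for illustration: possible worlds are all partitions of the records, each world is weighted by the product of pairwise match probabilities, the toy ER step is a transitive closure over pairs with probability at least 0.5, accuracy is pairwise F1, and the crowd is assumed to answer correctly. All function names are hypothetical. Because it enumerates every partition, the sketch only works for a handful of records, which is in line with the abstract's point that exact computation is intractable in general.

```python
from itertools import combinations

def partitions(items):
    """Yield every partition of `items` (a list) into non-empty clusters."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in partitions(rest):
        for i in range(len(part)):          # add `first` to an existing cluster
            yield part[:i] + [[first] + part[i]] + part[i + 1:]
        yield [[first]] + part              # or give it a cluster of its own

def matching_pairs(clusters):
    """Set of record pairs (as frozensets) that share a cluster."""
    return {frozenset(p) for c in clusters for p in combinations(sorted(c), 2)}

def world_distribution(records, prob):
    """Assumed model: weight each partition by pair probabilities, then normalise."""
    worlds = []
    for part in partitions(list(records)):
        same = matching_pairs(part)
        weight = 1.0
        for pair, p in prob.items():
            weight *= p if pair in same else 1.0 - p
        if weight > 0:
            worlds.append((same, weight))
    total = sum(w for _, w in worlds)
    return [(same, w / total) for same, w in worlds]

def er_result(records, prob, threshold=0.5):
    """Toy ER step: transitive closure over pairs with prob >= threshold."""
    cluster_of = {r: {r} for r in records}
    for pair, p in prob.items():
        if p >= threshold:
            a, b = tuple(pair)
            merged = cluster_of[a] | cluster_of[b]
            for r in merged:
                cluster_of[r] = merged
    return matching_pairs({frozenset(c) for c in cluster_of.values()})

def pairwise_f1(predicted, truth):
    """F1 over matching pairs; two empty pair sets count as a perfect match."""
    if not predicted and not truth:
        return 1.0
    tp = len(predicted & truth)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(predicted), tp / len(truth)
    return 2 * precision * recall / (precision + recall)

def expected_accuracy(records, prob, question):
    """Expected F1 of the toy ER result after a (correct) answer to `question`."""
    worlds = world_distribution(records, prob)
    total = 0.0
    for answer in (True, False):
        consistent = [(s, w) for s, w in worlds if (question in s) == answer]
        if not consistent:
            continue
        updated = dict(prob)
        updated[question] = 1.0 if answer else 0.0
        result = er_result(records, updated)
        # Weights are unconditional, so this adds P(answer) * E[F1 | answer].
        total += sum(w * pairwise_f1(result, s) for s, w in consistent)
    return total

def best_question(records, prob):
    """Pick the pair whose answer yields the highest expected accuracy."""
    return max(prob, key=lambda q: expected_accuracy(records, prob, q))

if __name__ == "__main__":
    records = ["a", "b", "c"]
    prob = {frozenset("ab"): 0.9, frozenset("bc"): 0.5, frozenset("ac"): 0.2}
    q = best_question(records, prob)
    print(sorted(q), expected_accuracy(records, prob, q))
```

The exhaustive enumeration is the part a practical system would replace with an approximation; the sketch keeps it only so that the expected-accuracy definition stays easy to read.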

Index Terms

Question selection for crowd entity resolution
1. Information systems
  1. Data management systems
    1. Database management system engines

Index terms have been assigned to the content through auto-classification.

Published In

Proceedings of the VLDB Endowment Volume 6, Issue 6

April 2013

144 pages

ISSN: 2150-8097

Publisher

VLDB Endowment

Qualifiers

Article

Bibliometrics & Citations

Article Metrics

Total Citations: 62
Total Downloads: 389
Downloads (Last 12 months): 17
Downloads (Last 6 weeks): 1

Cited By

Cong Q, Tang J, Han K, Huang Y, Chen L, Chee Y, Zhang A, Rangwala H (2022). Noisy Interactive Graph Search. Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 231-240. DOI: 10.1145/3534678.3539267. Online publication date: 14-Aug-2022.
https://dl.acm.org/doi/10.1145/3534678.3539267
Galhotra S, Firmani D, Saha B, Srivastava D, Ives Z, Bonifati A, El Abbadi A (2022). Hierarchical Entity Resolution using an Oracle. Proceedings of the 2022 International Conference on Management of Data, pages 414-428. DOI: 10.1145/3514221.3526147. Online publication date: 10-Jun-2022.
https://dl.acm.org/doi/10.1145/3514221.3526147
Zhu X, Huang X, Choi B, Jiang J, Zou Z, Xu J (2021). Budget constrained interactive search for multiple targets. Proceedings of the VLDB Endowment 14:6, pages 890-902. DOI: 10.14778/3447689.3447694. Online publication date: 12-Apr-2021.
https://dl.acm.org/doi/10.14778/3447689.3447694
Chen R, Shen Y, Zhang D (2021). GNEM: A Generic One-to-Set Neural Entity Matching Framework. Proceedings of the Web Conference 2021, pages 1686-1694. DOI: 10.1145/3442381.3450119. Online publication date: 19-Apr-2021.
https://dl.acm.org/doi/10.1145/3442381.3450119
Barlaug N, Gulla J (2021). Neural Networks for Entity Matching: A Survey. ACM Transactions on Knowledge Discovery from Data 15:3, pages 1-37. DOI: 10.1145/3442200. Online publication date: 21-Apr-2021.
https://dl.acm.org/doi/10.1145/3442200
Li Y, Wu X, Jin Y, Li J, Li G, Feng J (2021). Adaptive algorithms for crowd-aided categorization. The VLDB Journal (The International Journal on Very Large Data Bases) 31:6, pages 1311-1337. DOI: 10.1007/s00778-021-00685-2. Online publication date: 13-Aug-2021.
https://dl.acm.org/doi/10.1007/s00778-021-00685-2
Li Y, Wu X, Jin Y, Li J, Li G (2020). Efficient algorithms for crowd-aided categorization. Proceedings of the VLDB Endowment 13:8, pages 1221-1233. DOI: 10.14778/3389133.3389139. Online publication date: 3-May-2020.
https://dl.acm.org/doi/10.14778/3389133.3389139
Christophides V, Efthymiou V, Palpanas T, Papadakis G, Stefanidis K (2020). An Overview of End-to-End Entity Resolution for Big Data. ACM Computing Surveys 53:6, pages 1-42. DOI: 10.1145/3418896. Online publication date: 6-Dec-2020.
https://dl.acm.org/doi/10.1145/3418896
Chen Z, Chen Q, Hou B, Li Z, Li G, Maier D, Pottinger R, Doan A, Tan W, Alawini A, Ngo H (2020). Towards Interpretable and Learnable Risk Analysis for Entity Resolution. Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, pages 1165-1180. DOI: 10.1145/3318464.3380572. Online publication date: 11-Jun-2020.
https://dl.acm.org/doi/10.1145/3318464.3380572
Jiang N, Zhuang Y, Chiu D (2020). Effective and efficient crowd-assisted similarity retrieval of medical images in resource-constraint Mobile telemedicine systems. Multimedia Tools and Applications 79:27-28, pages 19893-19923. DOI: 10.1007/s11042-020-08755-3. Online publication date: 1-Jul-2020.
https://dl.acm.org/doi/10.1007/s11042-020-08755-3
