DOI: 10.1145/2213836.2213878

CrowdScreen: algorithms for filtering data with humans

Published: 20 May 2012
Abstract

    Given a large set of data items, we consider the problem of filtering them based on a set of properties that can be verified by humans. This problem is commonplace in crowdsourcing applications, and yet, to our knowledge, no one has considered the formal optimization of this problem. (Typical solutions use heuristics to solve the problem.) We formally state a few different variants of this problem. We develop deterministic and probabilistic algorithms to optimize the expected cost (i.e., number of questions) and expected error. We experimentally show that our algorithms provide definite gains with respect to other strategies. Our algorithms can be applied in a variety of crowdsourcing scenarios and can form an integral part of any query processor that uses human computation.
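    The trade-off the abstract describes can be made concrete with a small simulation. The Python sketch below is written for this summary, not taken from the paper: it evaluates one simple family of filtering strategies (keep asking humans about an item until one answer type reaches a threshold, or a question budget runs out and a majority vote decides). It assumes each human answer is independently correct with probability P_CORRECT and that an item satisfies the predicate with prior probability SELECTIVITY; both constants, the function name evaluate_strategy, and the tie-breaking rule are illustrative assumptions rather than the paper's actual algorithms.

    # Minimal sketch (assumed model, not from the paper): compute the
    # expected cost (questions asked) and expected error of a simple
    # "threshold-or-budget" crowd-filtering strategy by dynamic
    # programming over (yes, no) answer counts.

    P_CORRECT = 0.9     # assumed probability that one human answer is correct
    SELECTIVITY = 0.3   # assumed prior that an item satisfies the predicate

    def evaluate_strategy(threshold, budget, p=P_CORRECT, s=SELECTIVITY):
        """Ask one question at a time; output Pass once `threshold` Yes
        answers arrive, Fail once `threshold` No answers arrive, and fall
        back to a majority vote (ties -> Fail) after `budget` questions."""
        exp_cost = 0.0
        exp_error = 0.0
        for truth, prior in ((True, s), (False, 1.0 - s)):
            p_yes = p if truth else 1.0 - p   # chance one answer says Yes
            frontier = {(0, 0): 1.0}          # P[at (yes, no) and still asking]
            for _ in range(budget):
                nxt = {}
                for (y, n), pr in frontier.items():
                    exp_cost += prior * pr    # this state asks one more question
                    for dy, q in ((1, p_yes), (0, 1.0 - p_yes)):
                        y2, n2, pr2 = y + dy, n + (1 - dy), pr * q
                        if y2 >= threshold or n2 >= threshold:
                            if (y2 > n2) != truth:   # wrong terminal verdict
                                exp_error += prior * pr2
                        else:
                            nxt[(y2, n2)] = nxt.get((y2, n2), 0.0) + pr2
                frontier = nxt
            for (y, n), pr in frontier.items():      # budget exhausted: majority
                if (y > n) != truth:
                    exp_error += prior * pr
        return exp_cost, exp_error

    if __name__ == "__main__":
        for t, m in ((2, 3), (3, 5), (4, 7)):
            cost, err = evaluate_strategy(t, m)
            print(f"threshold={t}, budget={m}: cost~{cost:.2f}, error~{err:.4f}")

    Under these assumptions, raising the threshold should lower the expected error at the price of a higher expected number of questions, which is exactly the cost/error trade-off the paper optimizes, over a much richer space of strategies than this one.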

    Published In

    SIGMOD '12: Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data
    May 2012
    886 pages
    ISBN:9781450312479
    DOI:10.1145/2213836

    Publisher

Association for Computing Machinery, New York, NY, United States


    Author Tags

    1. crowdsourcing
    2. filtering
    3. human computation
    4. predicates

    Qualifiers

    • Research-article

    Conference

    SIGMOD/PODS '12

    Acceptance Rates

SIGMOD '12 Paper Acceptance Rate: 48 of 289 submissions, 17%
Overall Acceptance Rate: 785 of 4,003 submissions, 20%

    Cited By

    • (2024) Similarity-driven and task-driven models for diversity of opinion in crowdsourcing markets. The VLDB Journal. DOI: 10.1007/s00778-024-00853-0. Online publication date: 17-May-2024.
    • (2023) Sentinels and Twins: Effective Integrity Assessment for Distributed Computation. IEEE Transactions on Parallel and Distributed Systems, 34(1):108-122. DOI: 10.1109/TPDS.2022.3215863. Online publication date: 1-Jan-2023.
    • (2023) Hierarchical Crowdsourcing for Data Labeling with Heterogeneous Crowd. 2023 IEEE 39th International Conference on Data Engineering (ICDE), 1234-1246. DOI: 10.1109/ICDE55515.2023.00099. Online publication date: Apr-2023.
    • (2023) Crowdsourcing of labeling image objects: an online gamification application for data collection. Multimedia Tools and Applications, 83(7):20827-20860. DOI: 10.1007/s11042-023-16325-6. Online publication date: 4-Aug-2023.
    • (2023) Crowdsourcing as a Future Collaborative Computing Paradigm. Mobile Crowdsourcing, 3-32. DOI: 10.1007/978-3-031-32397-3_1. Online publication date: 21-Apr-2023.
    • (2022) Noisy Interactive Graph Search. Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 231-240. DOI: 10.1145/3534678.3539267. Online publication date: 14-Aug-2022.
    • (2022) Cleaning Uncertain Data With Crowdsourcing - A General Model With Diverse Accuracy Rates. IEEE Transactions on Knowledge and Data Engineering, 34(8):3629-3642. DOI: 10.1109/TKDE.2020.3027545. Online publication date: 1-Aug-2022.
    • (2022) Efficient Crowdsourced Pareto-Optimal Queries Over Partial Orders With Quality Guarantee. IEEE Transactions on Emerging Topics in Computing, 10(1):297-311. DOI: 10.1109/TETC.2020.3017198. Online publication date: 1-Jan-2022.
    • (2022) Cost-Effective Algorithms for Average-Case Interactive Graph Search. 2022 IEEE 38th International Conference on Data Engineering (ICDE), 1152-1165. DOI: 10.1109/ICDE53745.2022.00091. Online publication date: May-2022.
    • (2022) Efficient Data Analytics on Augmented Similarity Triplets. 2022 IEEE International Conference on Big Data (Big Data), 5871-5880. DOI: 10.1109/BigData55660.2022.10021104. Online publication date: 17-Dec-2022.
