DOI: 10.1145/2213836.2213878

CrowdScreen: algorithms for filtering data with humans

Published: 20 May 2012
Abstract

    Given a large set of data items, we consider the problem of filtering them based on a set of properties that can be verified by humans. This problem is commonplace in crowdsourcing applications, and yet, to our knowledge, no one has considered the formal optimization of this problem. (Typical solutions use heuristics to solve the problem.) We formally state a few different variants of this problem. We develop deterministic and probabilistic algorithms to optimize the expected cost (i.e., number of questions) and expected error. We experimentally show that our algorithms provide definite gains with respect to other strategies. Our algorithms can be applied in a variety of crowdsourcing scenarios and can form an integral part of any query processor that uses human computation.
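    The trade-off the abstract describes can be made concrete with a small simulation. The Python sketch below is written for this summary, not taken from the paper: it evaluates one simple family of filtering strategies (keep asking humans about an item until one answer type reaches a threshold, or a question budget runs out and a majority vote decides). It assumes each human answer is independently correct with probability P_CORRECT and that an item satisfies the predicate with prior probability SELECTIVITY; both constants, the function name evaluate_strategy, and the tie-breaking rule are illustrative assumptions rather than the paper's actual algorithms.

    # Minimal sketch (assumed model, not from the paper): compute the
    # expected cost (questions asked) and expected error of a simple
    # "threshold-or-budget" crowd-filtering strategy by dynamic
    # programming over (yes, no) answer counts.

    P_CORRECT = 0.9     # assumed probability that one human answer is correct
    SELECTIVITY = 0.3   # assumed prior that an item satisfies the predicate

    def evaluate_strategy(threshold, budget, p=P_CORRECT, s=SELECTIVITY):
        """Ask one question at a time; output Pass once `threshold` Yes
        answers arrive, Fail once `threshold` No answers arrive, and fall
        back to a majority vote (ties -> Fail) after `budget` questions."""
        exp_cost = 0.0
        exp_error = 0.0
        for truth, prior in ((True, s), (False, 1.0 - s)):
            p_yes = p if truth else 1.0 - p   # chance one answer says Yes
            frontier = {(0, 0): 1.0}          # P[at (yes, no) and still asking]
            for _ in range(budget):
                nxt = {}
                for (y, n), pr in frontier.items():
                    exp_cost += prior * pr    # this state asks one more question
                    for dy, q in ((1, p_yes), (0, 1.0 - p_yes)):
                        y2, n2, pr2 = y + dy, n + (1 - dy), pr * q
                        if y2 >= threshold or n2 >= threshold:
                            if (y2 > n2) != truth:   # wrong terminal verdict
                                exp_error += prior * pr2
                        else:
                            nxt[(y2, n2)] = nxt.get((y2, n2), 0.0) + pr2
                frontier = nxt
            for (y, n), pr in frontier.items():      # budget exhausted: majority
                if (y > n) != truth:
                    exp_error += prior * pr
        return exp_cost, exp_error

    if __name__ == "__main__":
        for t, m in ((2, 3), (3, 5), (4, 7)):
            cost, err = evaluate_strategy(t, m)
            print(f"threshold={t}, budget={m}: cost~{cost:.2f}, error~{err:.4f}")

    Under these assumptions, raising the threshold should lower the expected error at the price of a higher expected number of questions, which is exactly the cost/error trade-off the paper optimizes, over a much richer space of strategies than this one.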

    Published In

    SIGMOD '12: Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data
    May 2012
    886 pages
    ISBN:9781450312479
    DOI:10.1145/2213836

    Publisher

Association for Computing Machinery, New York, NY, United States


    Author Tags

    1. crowdsourcing
    2. filtering
    3. human computation
    4. predicates

    Qualifiers

    • Research-article

    Conference

    SIGMOD/PODS '12

    Acceptance Rates

SIGMOD '12 Paper Acceptance Rate: 48 of 289 submissions, 17%
Overall Acceptance Rate: 785 of 4,003 submissions, 20%

    Cited By

    • (2024) Similarity-driven and task-driven models for diversity of opinion in crowdsourcing markets. The VLDB Journal. DOI: 10.1007/s00778-024-00853-0. Online publication date: 17-May-2024.
    • (2023) Sentinels and Twins: Effective Integrity Assessment for Distributed Computation. IEEE Transactions on Parallel and Distributed Systems, 34(1):108-122. DOI: 10.1109/TPDS.2022.3215863. Online publication date: 1-Jan-2023.
    • (2023) Hierarchical Crowdsourcing for Data Labeling with Heterogeneous Crowd. 2023 IEEE 39th International Conference on Data Engineering (ICDE), 1234-1246. DOI: 10.1109/ICDE55515.2023.00099. Online publication date: Apr-2023.
    • (2023) Crowdsourcing of labeling image objects: an online gamification application for data collection. Multimedia Tools and Applications, 83(7):20827-20860. DOI: 10.1007/s11042-023-16325-6. Online publication date: 4-Aug-2023.
    • (2023) Crowdsourcing as a Future Collaborative Computing Paradigm. Mobile Crowdsourcing, 3-32. DOI: 10.1007/978-3-031-32397-3_1. Online publication date: 21-Apr-2023.
    • (2022) Noisy Interactive Graph Search. Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 231-240. DOI: 10.1145/3534678.3539267. Online publication date: 14-Aug-2022.
    • (2022) Cleaning Uncertain Data With Crowdsourcing - A General Model With Diverse Accuracy Rates. IEEE Transactions on Knowledge and Data Engineering, 34(8):3629-3642. DOI: 10.1109/TKDE.2020.3027545. Online publication date: 1-Aug-2022.
    • (2022) Efficient Crowdsourced Pareto-Optimal Queries Over Partial Orders With Quality Guarantee. IEEE Transactions on Emerging Topics in Computing, 10(1):297-311. DOI: 10.1109/TETC.2020.3017198. Online publication date: 1-Jan-2022.
    • (2022) Cost-Effective Algorithms for Average-Case Interactive Graph Search. 2022 IEEE 38th International Conference on Data Engineering (ICDE), 1152-1165. DOI: 10.1109/ICDE53745.2022.00091. Online publication date: May-2022.
    • (2022) Efficient Data Analytics on Augmented Similarity Triplets. 2022 IEEE International Conference on Big Data (Big Data), 5871-5880. DOI: 10.1109/BigData55660.2022.10021104. Online publication date: 17-Dec-2022.
