DOI: 10.1145/2009916.2009947

Crowdsourcing for book search evaluation: impact of HIT design on comparative system ranking

Published: 24 July 2011
Abstract

    The evaluation of information retrieval (IR) systems over special collections, such as large book repositories, is out of reach of traditional methods that rely upon editorial relevance judgments. Increasingly, the use of crowdsourcing to collect relevance labels has been regarded as a viable alternative that scales at modest cost. However, crowdsourcing suffers from undesirable worker practices and low-quality contributions. In this paper we investigate the design and implementation of effective crowdsourcing tasks in the context of book search evaluation. We observe the impact of aspects of the Human Intelligence Task (HIT) design on the quality of the relevance labels provided by the crowd. We assess the output in terms of label agreement with a gold standard data set and observe the effect of the crowdsourced relevance judgments on the resulting system rankings. This enables us to observe the effect of crowdsourcing on the entire IR evaluation process. Using the test set and experimental runs from the INEX 2010 Book Track, we find that varying the HIT design, as well as the pooling and document ordering strategies, leads to considerable differences in agreement with the gold set labels. We then observe the impact of the crowdsourced relevance label sets on the relative system rankings using four IR performance metrics. System rankings based on MAP and Bpref are less affected by the different label sets, while Precision@10 and nDCG@10 lead to dramatically different system rankings, especially for labels acquired from HITs with weaker quality controls. Overall, we find that crowdsourcing can be an effective tool for the evaluation of IR systems, provided that care is taken when designing the HITs.
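
    To make the ranking comparison concrete, the sketch below (illustrative only, not the authors' code; all system names, document IDs, and label sets are invented) scores two hypothetical runs by Precision@10 under a gold label set and under a crowdsourced label set, then compares the two induced system rankings with Kendall's tau, the kind of rank-correlation check used to judge whether crowd labels preserve the relative ordering of systems.

    from itertools import combinations

    def precision_at_k(ranked_docs, relevant, k=10):
        """Fraction of the top-k retrieved documents that are labelled relevant."""
        return sum(1 for d in ranked_docs[:k] if d in relevant) / k

    def rank_systems(runs, relevant, k=10):
        """Order system names by descending Precision@k under one label set."""
        scores = {s: precision_at_k(docs, relevant, k) for s, docs in runs.items()}
        return sorted(scores, key=scores.get, reverse=True)

    def kendall_tau(rank_a, rank_b):
        """Kendall's tau between two rankings of the same items (assumes no ties)."""
        pos_a = {item: i for i, item in enumerate(rank_a)}
        pos_b = {item: i for i, item in enumerate(rank_b)}
        concordant = discordant = 0
        for x, y in combinations(rank_a, 2):
            if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) > 0:
                concordant += 1
            else:
                discordant += 1
        pairs = concordant + discordant
        return (concordant - discordant) / pairs if pairs else 1.0

    # Toy data: two hypothetical runs and two label sets (all names are invented).
    runs = {
        "sysA": ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10"],
        "sysB": ["d5", "d9", "d11", "d20", "d2", "d21", "d12", "d3", "d13", "d14"],
    }
    gold_relevant = {"d1", "d2", "d5", "d7"}            # editorial gold labels
    crowd_relevant = {"d1", "d5", "d9", "d11", "d13"}   # labels aggregated from a HIT

    ranking_gold = rank_systems(runs, gold_relevant)
    ranking_crowd = rank_systems(runs, crowd_relevant)
    print("gold ranking: ", ranking_gold)     # ['sysA', 'sysB']
    print("crowd ranking:", ranking_crowd)    # ['sysB', 'sysA']
    print("Kendall's tau:", kendall_tau(ranking_gold, ranking_crowd))  # -1.0

    In this toy data the two label sets reverse the order of the two systems (tau = -1), mirroring the paper's observation that shallow metrics such as Precision@10 are sensitive to the label set used.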





      Published In

      SIGIR '11: Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval
      July 2011
      1374 pages
      ISBN:9781450307574
      DOI:10.1145/2009916
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]


      Publisher

      Association for Computing Machinery

      New York, NY, United States



      Author Tags

      1. book search
      2. crowdsourcing quality
      3. prove it

      Qualifiers

      • Research-article

      Conference

      SIGIR '11

      Acceptance Rates

      Overall Acceptance Rate 792 of 3,983 submissions, 20%


      Bibliometrics & Citations

      Article Metrics

      • Downloads (Last 12 months)20
      • Downloads (Last 6 weeks)1
      Reflects downloads up to 12 Aug 2024


      Cited By

      • (2024) Human–Machine Collaboration for a Multilingual Service Platform. In: Human-Centered Services Computing for Smart Cities, pp. 57-101. DOI: 10.1007/978-981-97-0779-9_3. Online publication date: 5-May-2024.
      • (2023) Incentive Mechanism Design with Gold Standard Questions Based on Approval Voting in Crowdsourcing. In: 2023 7th International Conference on Management Engineering, Software Engineering and Service Sciences (ICMSS), pp. 29-35. DOI: 10.1109/ICMSS56787.2023.10117884. Online publication date: 6-Jan-2023.
      • (2023) Effects of user factors on user experience in virtual reality: age, gender, and VR experience as influencing factors for VR exergames. Quality and User Experience, 8(1). DOI: 10.1007/s41233-023-00056-5. Online publication date: 4-May-2023.
      • (2022) In Search of Ambiguity: A Three-Stage Workflow Design to Clarify Annotation Guidelines for Crowd Workers. Frontiers in Artificial Intelligence, 5. DOI: 10.3389/frai.2022.828187. Online publication date: 18-May-2022.
      • (2022) Measuring the Impact of Crowdsourcing Features on Mobile App User Engagement and Retention. Management Science, 68(2), pp. 1297-1329. DOI: 10.1287/mnsc.2020.3943. Online publication date: 1-Feb-2022.
      • (2022) Terratech: A Crowdsourcing Web Application that Aids Environmental Needs and Concerns. In: Proceedings of the 4th International Conference on Management Science and Industrial Engineering, pp. 383-390. DOI: 10.1145/3535782.3535832. Online publication date: 28-Apr-2022.
      • (2022) Does Evidence from Peers Help Crowd Workers in Assessing Truthfulness? In: Companion Proceedings of the Web Conference 2022, pp. 302-306. DOI: 10.1145/3487553.3524236. Online publication date: 25-Apr-2022.
      • (2022) A Collaborative Training Using Crowdsourcing and Neural Networks on Small and Difficult Image Classification Datasets. SN Computer Science, 3(2). DOI: 10.1007/s42979-022-01076-2. Online publication date: 2-Mar-2022.
      • (2022) Privacy-Preserving Content-Based Task Allocation. In: Privacy-Preserving in Mobile Crowdsensing, pp. 33-61. DOI: 10.1007/978-981-19-8315-3_3. Online publication date: 21-Dec-2022.
      • (2021) A Game Theory Approach for Estimating Reliability of Crowdsourced Relevance Assessments. ACM Transactions on Information Systems, 40(3), pp. 1-29. DOI: 10.1145/3480965. Online publication date: 17-Nov-2021.
