Repeatable and reliable search system evaluation using crowdsourcing

Published: 24 July 2011

Abstract

The primary problem confronting any new kind of search task is how to bootstrap a reliable and repeatable evaluation campaign, and a crowdsourcing approach offers many advantages. However, can these crowdsourced evaluations be repeated over long periods of time in a reliable manner? To investigate, we create an evaluation campaign for the semantic search task of keyword-based ad-hoc object retrieval. In contrast to traditional search over web pages with textual descriptions, object search aims to retrieve information from factual assertions about real-world objects. Using the first large-scale evaluation campaign that specifically targets ad-hoc Web object retrieval over a number of deployed systems, we demonstrate that crowdsourced evaluation campaigns can be repeated over time while maintaining reliable results. Furthermore, we show that these results are comparable to those of expert judges when ranking systems, and that they hold across different evaluation and relevance metrics. This work provides empirical support for scalable, reliable, and repeatable search system evaluation using crowdsourcing.
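
The abstract's central claim is that system rankings derived from crowd judgments track those produced by expert judges across different relevance metrics. A minimal sketch of that kind of comparison is given below, assuming hypothetical per-system effectiveness scores and using Kendall's tau as the rank-correlation measure; the system names, scores, and the use of SciPy are illustrative assumptions, not the authors' actual data or tooling.

    # Illustrative sketch (not the paper's code): compare the system ranking induced
    # by crowd judgments with the ranking induced by expert judgments via Kendall's tau.
    from scipy.stats import kendalltau

    # Hypothetical mean effectiveness (e.g. NDCG@10) per system under each judge pool.
    crowd_scores  = {"sysA": 0.41, "sysB": 0.35, "sysC": 0.52, "sysD": 0.29}
    expert_scores = {"sysA": 0.44, "sysB": 0.33, "sysC": 0.50, "sysD": 0.31}

    systems = sorted(crowd_scores)                 # fixed system order for both vectors
    crowd  = [crowd_scores[s] for s in systems]
    expert = [expert_scores[s] for s in systems]

    # Kendall's tau measures agreement between the two induced rankings;
    # a value near 1.0 means crowd and expert judges rank the systems alike.
    tau, p_value = kendalltau(crowd, expert)
    print(f"Kendall's tau = {tau:.3f} (p = {p_value:.3f})")

A high tau in this style of comparison is the kind of evidence behind the statement that crowdsourced results are "comparable to expert judges when ranking systems."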




    Published In

    SIGIR '11: Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval
    July 2011, 1374 pages
    ISBN: 9781450307574
    DOI: 10.1145/2009916

    Publisher

    Association for Computing Machinery, New York, NY, United States



    Author Tags

    1. crowdsourcing
    2. evaluation
    3. retrieval
    4. search engines

    Qualifiers

    • Research-article

    Conference

    SIGIR '11

    Acceptance Rates

    Overall Acceptance Rate: 792 of 3,983 submissions, 20%

