Repeatable and reliable search system evaluation using crowdsourcing

Published: 24 July 2011

Abstract

The primary problem confronting any new kind of search task is how to bootstrap a reliable and repeatable evaluation campaign, and a crowdsourcing approach offers many advantages. However, can these crowdsourced evaluations be repeated over long periods of time in a reliable manner? To investigate, we create an evaluation campaign for the semantic search task of keyword-based ad-hoc object retrieval. In contrast to traditional search over web pages with textual descriptions, object search aims to retrieve information from factual assertions about real-world objects. Using the first large-scale evaluation campaign that specifically targets ad-hoc Web object retrieval over a number of deployed systems, we demonstrate that crowdsourced evaluation campaigns can be repeated over time while maintaining reliable results. Furthermore, we show that these results are comparable to those of expert judges when ranking systems, and that they hold across different evaluation and relevance metrics. This work provides empirical support for scalable, reliable, and repeatable search system evaluation using crowdsourcing.
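
The abstract's central claim is that system rankings derived from crowd judgments track those produced by expert judges across different relevance metrics. A minimal sketch of that kind of comparison is given below, assuming hypothetical per-system effectiveness scores and using Kendall's tau as the rank-correlation measure; the system names, scores, and the use of SciPy are illustrative assumptions, not the authors' actual data or tooling.

    # Illustrative sketch (not the paper's code): compare the system ranking induced
    # by crowd judgments with the ranking induced by expert judgments via Kendall's tau.
    from scipy.stats import kendalltau

    # Hypothetical mean effectiveness (e.g. NDCG@10) per system under each judge pool.
    crowd_scores  = {"sysA": 0.41, "sysB": 0.35, "sysC": 0.52, "sysD": 0.29}
    expert_scores = {"sysA": 0.44, "sysB": 0.33, "sysC": 0.50, "sysD": 0.31}

    systems = sorted(crowd_scores)                 # fixed system order for both vectors
    crowd  = [crowd_scores[s] for s in systems]
    expert = [expert_scores[s] for s in systems]

    # Kendall's tau measures agreement between the two induced rankings;
    # a value near 1.0 means crowd and expert judges rank the systems alike.
    tau, p_value = kendalltau(crowd, expert)
    print(f"Kendall's tau = {tau:.3f} (p = {p_value:.3f})")

A high tau in this style of comparison is the kind of evidence behind the statement that crowdsourced results are "comparable to expert judges when ranking systems."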




    Published In

    SIGIR '11: Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval
    July 2011, 1374 pages
    ISBN: 9781450307574
    DOI: 10.1145/2009916

    Publisher

    Association for Computing Machinery, New York, NY, United States



    Author Tags

    1. crowdsourcing
    2. evaluation
    3. retrieval
    4. search engines

    Qualifiers

    • Research-article

    Conference

    SIGIR '11

    Acceptance Rates

    Overall Acceptance Rate: 792 of 3,983 submissions, 20%

