On Aggregating Labels from Multiple Crowd Workers to Infer Relevance of Documents

  • Conference paper
Advances in Information Retrieval (ECIR 2012)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 7224)

Included in the following conference series: European Conference on Information Retrieval (ECIR)

Abstract

We consider the problem of acquiring relevance judgements for information retrieval (IR) test collections through crowdsourcing when no true relevance labels are available. We collect multiple, possibly noisy relevance labels per document from workers of unknown labelling accuracy and use these labels to infer document relevance with two methods. The first is the commonly used majority voting (MV), which assigns to each document the label that received the most votes, treating all workers equally. The second is a probabilistic model that concurrently estimates document relevance and worker accuracy using expectation maximization (EM). We run simulations and conduct experiments with crowdsourced relevance labels from the INEX 2010 Book Search track to investigate the accuracy of the relevance assessments and their robustness to noisy labels, and we observe the effect of the derived relevance judgements on the ranking of the search systems. Our experimental results show that the EM method outperforms the MV method in the accuracy of both the relevance assessments and the IR system rankings. The performance improvements are especially noticeable when the number of labels per document is small and the labels are of varied quality.
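
The two label-aggregation strategies compared above lend themselves to a short illustration. The sketch below is not the authors' implementation: the binary relevance setting, the function names, the tie-breaking rule, the Laplace smoothing and the convergence threshold are assumptions made for clarity; the EM routine follows the general Dawid and Skene approach of jointly estimating a relevance posterior per document and a confusion matrix per worker.

```python
# Minimal sketch: majority voting (MV) vs. an EM aggregation of crowd labels.
# Illustrative only; names, smoothing and the stopping rule are assumptions.

def majority_vote(labels):
    """labels: {doc_id: {worker_id: 0 or 1}} -> {doc_id: 0 or 1} by plurality."""
    decisions = {}
    for doc, votes in labels.items():
        ones = sum(votes.values())
        decisions[doc] = 1 if 2 * ones >= len(votes) else 0  # ties -> relevant
    return decisions


def em_aggregate(labels, n_iter=50, tol=1e-6):
    """Dawid & Skene-style EM over binary labels: jointly estimates
    P(document is relevant) and each worker's confusion matrix, so that
    labels from more accurate workers carry more weight."""
    docs = list(labels)
    workers = {w for votes in labels.values() for w in votes}

    # Initialise document posteriors from the majority-vote proportions.
    post = {d: sum(v.values()) / len(v) for d, v in labels.items()}

    for _ in range(n_iter):
        # M-step: class prior and per-worker confusion matrices,
        # conf[w][k][l] = P(worker w answers l | true class is k).
        prior1 = sum(post.values()) / len(docs)
        conf = {w: [[1.0, 1.0], [1.0, 1.0]] for w in workers}  # Laplace counts
        for d, votes in labels.items():
            for w, lab in votes.items():
                conf[w][1][lab] += post[d]
                conf[w][0][lab] += 1.0 - post[d]
        for w in workers:
            for k in (0, 1):
                total = conf[w][k][0] + conf[w][k][1]
                conf[w][k][0] /= total
                conf[w][k][1] /= total

        # E-step: recompute P(relevant | observed labels) for every document.
        new_post, delta = {}, 0.0
        for d, votes in labels.items():
            p1, p0 = prior1, 1.0 - prior1
            for w, lab in votes.items():
                p1 *= conf[w][1][lab]
                p0 *= conf[w][0][lab]
            new_post[d] = p1 / (p1 + p0)
            delta = max(delta, abs(new_post[d] - post[d]))
        post = new_post
        if delta < tol:
            break

    return {d: int(post[d] >= 0.5) for d in docs}, post


if __name__ == "__main__":
    # Tiny synthetic example: three workers, two documents.
    labels = {"d1": {"w1": 1, "w2": 1, "w3": 0},
              "d2": {"w1": 0, "w2": 1, "w3": 0}}
    print(majority_vote(labels))    # {'d1': 1, 'd2': 0}
    print(em_aggregate(labels)[0])  # hard decisions after EM
```

Under this kind of model MV weighs every vote equally, whereas the EM posteriors discount workers whose estimated confusion matrices are close to random; that difference is what the comparison in the abstract hinges on when labels per document are few and of uneven quality.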

References

  1. Kazai, G., Kamps, J., Koolen, M., Milic-Frayling, N.: Crowdsourcing for Book Search Evaluation: Impact of HIT Design on Comparative System Ranking. In: Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information, pp. 205–214 (2011)

    Google Scholar 

  2. Alonso, O., Mizzaro, S.: Can we get rid of TREC assessors? Using Mechanical Turk for relevance assessment. In: SIGIR 2009: Workshop on the Future of IR Evaluation, Boston (2009)

    Google Scholar 

  3. Smucker, M.D., Jethani, C.P.: Measuring assessor accuracy: a comparison of nist assessors and user study participants. In: Proceeding of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2011, Beijing, China, July 25-29, pp. 1231–1232 (2011)

    Google Scholar 

  4. Cuadra, C.A., Katter, R.V.: Opening the Black Box of ‘Relevance’. Journal of Documentation 23(4), 291–303 (1967)

    Article  Google Scholar 

  5. Voorhees, E.: Variations in relevance judgments and the measurement of retrieval effectiveness. Inf. Process. Manage. 36(5), 697–716 (2000)

    Article  Google Scholar 

  6. Buckley, C., Voorhees, E.: Evaluating evaluation measure stability. In: Proceedings of SIGIR, pp. 33–40 (2000)

    Google Scholar 

  7. Carterette, B., Pavlu, V., Kanoulas, E., Aslam, J.A., Allen, J.: Evaluation Over Thousands of Queries. In: Proceedings of SIGIR, pp. 651–658 (2008)

    Google Scholar 

  8. Aslam, J.A., Pavlu, V., Yilmaz, E.: A Statistical Method for System Evaluation Using Incomplete Judgments. In: Proceedings of SIGIR, pp. 541–548 (2006)

    Google Scholar 

  9. Carterette, B., Soboroff, I.: The effect of assessor error on IR system evaluation. In: Proceeding of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 539–546 (2010)

    Google Scholar 

  10. Bailey, P., et al.: Relevance assessment: are judges exchangeable and does it matter. In: Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 667–674 (2008)

    Google Scholar 

  11. Scholer, F., Turpin, A., Sanderson, M.: Quantifying Test Collection Quality Based on the Consistency of Relevance Judgements. In: Proceeding of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1063–1072 (2011)

    Google Scholar 

  12. Winter, M., Duncan, W.: Financial incentives and the ”performance of crowd”. In: Proceedings of the ACM SIGKDD Workshop on Human Computation, pp. 77–85 (2009)

    Google Scholar 

  13. Snow, R., O’Connor, B., Urafsky, D., Ng, A.Y.: Cheap and fast—but is it good?: evaluating non-expert annotations for natural language tasks. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing, Stroudsburg, PA, USA, pp. 254–263 (2008)

    Google Scholar 

  14. Smucker, M.D., Jethani, C.P.: The Crowd vs. the Lab: A Comparison of Crowd-Sourced and University Laboratory Participant Behavior. In: Proceedings of the SIGIR 2011 Workshop on Crowdsourcing for Information Retrieval, Beijing (2011)

    Google Scholar 

  15. Kumar, A., Lease, M.: Modeling Annotator Accuracies for Supervised Learning. In: WSDM 2011 Workshop on Crowdsourcing for Search and Data Mining, Hong Kong (2011)

    Google Scholar 

  16. Kasneci, G., Gael, J.V., Stern, D.H., Graepel, T.: CoBayes: bayesian knowledge corroboration with assessors of unknown areas of expertise. In: Proceedings of the Forth International Conference on Web Search and Web Data Mining, pp. 465–474 (2011)

    Google Scholar 

  17. Welinder, P., Perona, P.: Online crowdsourcing: rating annotators and obtaining cost-effective labels. In: CVPR 2010: IEEE Conference on Computer Vision and Pattern, pp. 1526–1534 (2010)

    Google Scholar 

  18. Ipeirotis, P.G., Provost, F., Wang, J.: Quality Management on Amazon Mechanical Turk. In: Proceedings of the ACM SIGKDD Workshop on Human Computation, pp. 64–67 (2010)

    Google Scholar 

  19. Dawid, P., Skene, A.M.: Maximum Likelihood Estimation of Observer Error-Rates Using the EM Algorithm. Applied Statistics 28(1), 20–28 (1979)

    Article  Google Scholar 

  20. Bernstein, Y., Zobel, J.: Redundant documents and search effectiveness. In: Proceedings of the 14th ACM International Conference on Information and Knowledge Management, pp. 736–743 (2005)

    Google Scholar 

Download references

Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Hosseini, M., Cox, I.J., Milić-Frayling, N., Kazai, G., Vinay, V. (2012). On Aggregating Labels from Multiple Crowd Workers to Infer Relevance of Documents. In: Baeza-Yates, R., et al. Advances in Information Retrieval. ECIR 2012. Lecture Notes in Computer Science, vol 7224. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-28997-2_16

  • DOI: https://doi.org/10.1007/978-3-642-28997-2_16

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-28996-5

  • Online ISBN: 978-3-642-28997-2

  • eBook Packages: Computer Science, Computer Science (R0)
