DOI: 10.1145/2600428.2609542

What makes data robust: a data analysis in learning to rank

Published: 03 July 2014

Abstract

When learning to rank algorithms are applied in real search applications, noise in human-labeled training data is an inevitable problem that affects their performance. Previous work mainly focused on how noise affects ranking algorithms and how to design robust ranking algorithms. In this work, we investigate what inherent characteristics make training data robust to label noise. Our motivation comes from an interesting observation: the same ranking algorithm may show very different sensitivities to label noise on different data sets. We therefore investigate the underlying reason for this observation using two typical kinds of learning to rank algorithms (pairwise and listwise methods) and three public data sets (OHSUMED, TD2003 and MSLR-WEB10K). We find that as label noise increases in the training data, it is the document pair noise ratio (pNoise) rather than the document noise ratio (dNoise) that explains the performance degradation of a ranking algorithm.
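The exact definitions of dNoise and pNoise are given in the full paper rather than on this page; the short Python sketch below is only an illustration, under the assumption that dNoise is the fraction of documents whose labels were corrupted and pNoise is the fraction of originally ordered document pairs whose preference is tied or reversed by the noisy labels. It also suggests why the two quantities can diverge: the same number of flipped labels can break very different numbers of pairwise preferences, depending on how the labels are distributed across the documents of a query.

    # Minimal sketch (not the authors' code): assumed definitions of dNoise and
    # pNoise for one query, given both the clean and the noisy graded labels.
    from itertools import combinations

    def d_noise(clean, noisy):
        """Fraction of documents whose label was corrupted (assumed dNoise)."""
        return sum(c != n for c, n in zip(clean, noisy)) / len(clean)

    def p_noise(clean, noisy):
        """Fraction of originally ordered document pairs whose preference is
        tied or reversed under the noisy labels (assumed pNoise)."""
        ordered, broken = 0, 0
        for i, j in combinations(range(len(clean)), 2):
            if clean[i] == clean[j]:
                continue  # clean labels express no preference for this pair
            ordered += 1
            if (noisy[i] - noisy[j]) * (clean[i] - clean[j]) <= 0:
                broken += 1  # the pair is now tied or reversed
        return broken / ordered if ordered else 0.0

    # Toy example: five documents of one query with graded labels 0-2.
    clean = [2, 1, 0, 0, 1]
    noisy = [2, 0, 0, 1, 1]       # two mislabeled documents
    print(d_noise(clean, noisy))  # 0.4   (2 of 5 documents corrupted)
    print(p_noise(clean, noisy))  # 0.375 (3 of 8 ordered pairs broken)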




    Published In

    SIGIR '14: Proceedings of the 37th international ACM SIGIR conference on Research & development in information retrieval
    July 2014
    1330 pages
    ISBN:9781450322577
    DOI:10.1145/2600428
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 03 July 2014


    Author Tags

    1. label noise
    2. learning to rank
    3. robust data

    Qualifiers

    • Poster

    Conference

    SIGIR '14

    Acceptance Rates

    SIGIR '14 Paper Acceptance Rate: 82 of 387 submissions, 21%
    Overall Acceptance Rate: 792 of 3,983 submissions, 20%


    Article Metrics

    • Downloads (Last 12 months): 4
    • Downloads (Last 6 weeks): 1
    Reflects downloads up to 09 Nov 2024

    Cited By
    • Context-aware ranking refinement with attentive semi-supervised autoencoders. Soft Computing 26(24):13941-13952, 25 Aug 2022. DOI: 10.1007/s00500-022-07433-w
    • Learning to Rank with Deep Autoencoder Features. 2018 International Joint Conference on Neural Networks (IJCNN), pages 1-8, July 2018. DOI: 10.1109/IJCNN.2018.8489646
    • Learning to rank using multiple loss functions. International Journal of Machine Learning and Cybernetics, 12 Oct 2017. DOI: 10.1007/s13042-017-0730-4
    • From Tf-Idf to Learning-to-Rank. Business Intelligence, pages 1245-1292, 2016. DOI: 10.4018/978-1-4666-9562-7.ch063
    • From Tf-Idf to Learning-to-Rank. Handbook of Research on Innovations in Information Retrieval, Analysis, and Management, pages 62-109, 2016. DOI: 10.4018/978-1-4666-8833-9.ch003
    • A Unified Posterior Regularized Topic Model with Maximum Margin for Learning-to-Rank. Proceedings of the 24th ACM International Conference on Information and Knowledge Management, pages 103-112, 17 Oct 2015. DOI: 10.1145/2806416.2806482
    • Supervised topic models with word order structure for document classification and retrieval learning. Information Retrieval Journal 18(4):283-330, 4 Jun 2015. DOI: 10.1007/s10791-015-9254-2
