DOI: 10.1145/3234944.3234968
Research Article

On the Theory of Weak Supervision for Information Retrieval

Published: 10 September 2018

Abstract

Neural network approaches have recently been shown to be effective in several information retrieval (IR) tasks. However, neural approaches often require large volumes of training data to perform well, and such data are not always available. To mitigate the shortage of labeled data, training neural IR models with weak supervision has recently been proposed and has received considerable attention in the literature. In weak supervision, an existing model automatically generates labels for a large set of unlabeled data, and a machine learning model is then trained on the generated "weak" data. Surprisingly, prior work has shown that the trained neural model can outperform the weak labeler by a significant margin. Although these improvements have been intuitively justified in previous work, the literature still lacks a theoretical justification for the observed empirical findings. In this paper, we provide theoretical insight into weak supervision for information retrieval, focusing on learning to rank. We model the weak supervision signal as a noisy channel that introduces noise into the correct ranking. Based on the risk minimization framework, we prove that, given some sufficient constraints on the loss function, weak supervision is equivalent to supervised learning under uniform noise. We also derive an upper bound on the empirical risk of weak supervision in the case of non-uniform noise. Following recent work on using multiple weak supervision signals to learn more accurate models, we establish an information-theoretic lower bound on the number of weak supervision signals required to guarantee an upper bound on the pairwise error probability. We empirically verify a set of the presented theoretical findings using synthetic and real weak supervision data.
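To make the uniform-noise result concrete, the following is a minimal sketch in standard pairwise risk-minimization notation. The symbols (a scoring function f, a flip probability \eta, a constant C) and the symmetry condition are assumptions based on the abstract and on the general literature on symmetric losses under label noise, not necessarily the paper's exact formulation.

Suppose the weak labeler flips the true preference of a document pair (d_1, d_2) for query q with probability \eta < 1/2, independently of the pair (a uniform noisy channel). Write the clean and weak pairwise risks of a scoring function f as

    R_{\mathcal{L}}(f) = \mathbb{E}_{(q, d_1, d_2, y)}\big[\mathcal{L}(f; q, d_1, d_2, y)\big], \qquad R^{\eta}_{\mathcal{L}}(f) = \mathbb{E}\big[\mathcal{L}(f; q, d_1, d_2, \tilde{y})\big],

where \tilde{y} is the weak preference label. If the pairwise loss is symmetric, i.e.

    \mathcal{L}(f; q, d_1, d_2, +1) + \mathcal{L}(f; q, d_1, d_2, -1) = C \quad \text{for all } f, q, d_1, d_2,

then conditioning on whether the label was flipped gives

    R^{\eta}_{\mathcal{L}}(f) = (1 - \eta)\, R_{\mathcal{L}}(f) + \eta\,\big(C - R_{\mathcal{L}}(f)\big) = (1 - 2\eta)\, R_{\mathcal{L}}(f) + \eta C.

Since \eta and C do not depend on f, the weak and clean risks are minimized by the same ranker, which is one way to read the claimed equivalence between weak supervision and supervised learning under uniform noise.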



Published In

ICTIR '18: Proceedings of the 2018 ACM SIGIR International Conference on Theory of Information Retrieval
September 2018
238 pages
ISBN: 9781450356565
DOI: 10.1145/3234944

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. information theory
  2. learning to rank
  3. noisy data
  4. risk minimization
  5. theoretical analysis
  6. weak supervision


Conference

ICTIR '18
Acceptance Rates

ICTIR '18 Paper Acceptance Rate: 19 of 47 submissions, 40%
Overall Acceptance Rate: 235 of 527 submissions, 45%


