DOI: 10.1145/3437963.3441777
WSDM Conference Proceedings · Research article · Open access

PROP: Pre-training with Representative Words Prediction for Ad-hoc Retrieval

Published: 08 March 2021

Abstract

Recently, pre-trained language representation models such as BERT have shown great success when fine-tuned on downstream tasks, including information retrieval (IR). However, pre-training objectives tailored for ad-hoc retrieval have not been well explored. In this paper, we propose Pre-training with Representative wOrds Prediction (PROP) for ad-hoc retrieval. PROP is inspired by the classical statistical language model for IR, specifically the query likelihood model, which assumes that the query is generated as the piece of text representative of the "ideal" document. Based on this idea, we construct the representative words prediction (ROP) task for pre-training: given an input document, we sample a pair of word sets according to the document language model, and the set with the higher likelihood is deemed the more representative of the document. We then pre-train the Transformer model to predict the pairwise preference between the two word sets, jointly with the Masked Language Model (MLM) objective. After further fine-tuning on a variety of representative downstream ad-hoc retrieval tasks, PROP achieves significant improvements over baselines without pre-training or with other pre-training methods. We also show that PROP achieves strong performance in both zero-resource and low-resource IR settings.
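The ROP data construction described in the abstract can be made concrete with a short sketch. The Python snippet below is a minimal illustration, not the authors' implementation: it assumes a Dirichlet-smoothed unigram document language model and a Poisson prior on the size of each sampled word set (standard IR choices that are consistent with, but not guaranteed identical to, the paper's exact configuration), and all function names are hypothetical.

# Minimal sketch (not the authors' code) of ROP pair construction:
# build a smoothed unigram LM for a document, sample two query-like
# word sets from it, and label the higher-likelihood set as the more
# representative one. Smoothing and set-size choices are assumptions.
import math
import random
from collections import Counter

def doc_language_model(doc_tokens, corpus_freqs, corpus_total, mu=2000):
    """Unigram document LM with Dirichlet prior smoothing (a standard IR choice)."""
    counts = Counter(doc_tokens)
    n = len(doc_tokens)
    return {w: (counts.get(w, 0) + mu * corpus_freqs[w] / corpus_total) / (n + mu)
            for w in corpus_freqs}

def sample_poisson(lam):
    """Knuth's method for drawing a Poisson-distributed word-set size."""
    threshold, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= threshold:
            return k
        k += 1

def sample_word_set(lm, lam=3.0):
    """Draw a query-like set: size ~ Poisson(lam), words ~ document LM."""
    size = max(1, sample_poisson(lam))
    words = list(lm)
    weights = [lm[w] for w in words]
    return random.choices(words, weights=weights, k=size)

def rop_pair(doc_tokens, corpus_freqs, corpus_total):
    """Return (more_representative_set, less_representative_set)."""
    lm = doc_language_model(doc_tokens, corpus_freqs, corpus_total)
    s1, s2 = sample_word_set(lm), sample_word_set(lm)
    ll = lambda s: sum(math.log(lm[w]) for w in s)  # log-likelihood under the doc LM
    return (s1, s2) if ll(s1) >= ll(s2) else (s2, s1)

# Toy usage: corpus statistics here stand in for real collection counts.
corpus = "the cat sat on the mat while the dog ran in the park".split()
freqs, total = Counter(corpus), len(corpus)
pos, neg = rop_pair("the cat sat on the mat".split(), freqs, total)
print("representative:", pos, " less representative:", neg)

Given such a pair, the Transformer scores each (document, word set) input, and a pairwise objective, for instance a hinge loss max(0, 1 - s(doc, pos) + s(doc, neg)), trains it to prefer the representative set, optimized jointly with the MLM loss.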

Supplementary Material

MP4 File: March 10_Session 7_5-Xinyu Ma_419.mp4 (presentation video for PROP)



    Published In

    WSDM '21: Proceedings of the 14th ACM International Conference on Web Search and Data Mining
    March 2021, 1192 pages
    ISBN: 9781450382977
    DOI: 10.1145/3437963
    Publisher: Association for Computing Machinery, New York, NY, United States


    Author Tags

    1. ad-hoc retrieval
    2. pre-training
    3. statistical language model


    Funding Sources

    • Lenovo-CAS Joint Lab Youth Scientist Project
    • Frontier Research Key Program of Chongqing Science and Technology Commission
    • K.C. Wong Education Foundation
    • Beijing Academy of Artificial Intelligence
    • National Natural Science Foundation of China
    • Youth Innovation Promotion Association CAS
    • National Key R&D Program of China

    Conference

    WSDM '21

    Acceptance Rates

    Overall acceptance rate: 498 of 2,863 submissions (17%)



    Cited By

    • (2024) On Elastic Language Models. ACM Transactions on Information Systems 42(6), 1-29. DOI: 10.1145/3677375. Online publication date: 18-Oct-2024.
    • (2024) Mitigating Entity-Level Hallucination in Large Language Models. In Proceedings of the 2024 Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region, 23-31. DOI: 10.1145/3673791.3698403. Online publication date: 8-Dec-2024.
    • (2024) Listwise Generative Retrieval Models via a Sequential Learning Process. ACM Transactions on Information Systems 42(5), 1-31. DOI: 10.1145/3653712. Online publication date: 29-Apr-2024.
    • (2024) LLMGR: Large Language Model-based Generative Retrieval in Alipay Search. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2847-2851. DOI: 10.1145/3626772.3661364. Online publication date: 10-Jul-2024.
    • (2024) CL4DIV: A Contrastive Learning Framework for Search Result Diversification. In Proceedings of the 17th ACM International Conference on Web Search and Data Mining, 171-180. DOI: 10.1145/3616855.3635851. Online publication date: 4-Mar-2024.
    • (2024) A Multi-Granularity-Aware Aspect Learning Model for Multi-Aspect Dense Retrieval. In Proceedings of the 17th ACM International Conference on Web Search and Data Mining, 674-682. DOI: 10.1145/3616855.3635770. Online publication date: 4-Mar-2024.
    • (2024) Cross-Lingual Information Retrieval from Multilingual Construction Documents Using Pretrained Language Models. Journal of Construction Engineering and Management 150(6). DOI: 10.1061/JCEMD4.COENG-14273. Online publication date: Jun-2024.
    • (2024) A Social-aware Gaussian Pre-trained model for effective cold-start recommendation. Information Processing & Management 61(2), 103601. DOI: 10.1016/j.ipm.2023.103601. Online publication date: Mar-2024.
    • (2024) MCFC: A Momentum-Driven Clicked Feature Compressed Pre-trained Language Model for Information Retrieval. In Natural Language Processing and Chinese Computing, 69-82. DOI: 10.1007/978-981-97-9431-7_6. Online publication date: 1-Nov-2024.
    • (2024) A Survey of Next Words Prediction Models. In Forthcoming Networks and Sustainability in the AIoT Era, 165-185. DOI: 10.1007/978-3-031-62871-9_14. Online publication date: 26-Jun-2024.
