DOI: 10.1145/3437963.3441777
WSDM Conference Proceedings · Research article · Open access

PROP: Pre-training with Representative Words Prediction for Ad-hoc Retrieval

Published: 08 March 2021

Abstract

Recently, pre-trained language representation models such as BERT have shown great success when fine-tuned on downstream tasks, including information retrieval (IR). However, pre-training objectives tailored for ad-hoc retrieval have not been well explored. In this paper, we propose Pre-training with Representative wOrds Prediction (PROP) for ad-hoc retrieval. PROP is inspired by the classical statistical language model for IR, specifically the query likelihood model, which assumes that the query is generated as the piece of text representative of the "ideal" document. Based on this idea, we construct the representative words prediction (ROP) task for pre-training: given an input document, we sample a pair of word sets according to the document language model, and the set with the higher likelihood is deemed the more representative of the document. We then pre-train the Transformer model to predict the pairwise preference between the two word sets, jointly with the Masked Language Model (MLM) objective. After further fine-tuning on a variety of representative downstream ad-hoc retrieval tasks, PROP achieves significant improvements over baselines without pre-training or with other pre-training methods. We also show that PROP achieves strong performance in both zero-resource and low-resource IR settings.
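The ROP data construction described in the abstract can be made concrete with a short sketch. The Python snippet below is a minimal illustration, not the authors' implementation: it assumes a Dirichlet-smoothed unigram document language model and a Poisson prior on the size of each sampled word set (standard IR choices that are consistent with, but not guaranteed identical to, the paper's exact configuration), and all function names are hypothetical.

# Minimal sketch (not the authors' code) of ROP pair construction:
# build a smoothed unigram LM for a document, sample two query-like
# word sets from it, and label the higher-likelihood set as the more
# representative one. Smoothing and set-size choices are assumptions.
import math
import random
from collections import Counter

def doc_language_model(doc_tokens, corpus_freqs, corpus_total, mu=2000):
    """Unigram document LM with Dirichlet prior smoothing (a standard IR choice)."""
    counts = Counter(doc_tokens)
    n = len(doc_tokens)
    return {w: (counts.get(w, 0) + mu * corpus_freqs[w] / corpus_total) / (n + mu)
            for w in corpus_freqs}

def sample_poisson(lam):
    """Knuth's method for drawing a Poisson-distributed word-set size."""
    threshold, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= threshold:
            return k
        k += 1

def sample_word_set(lm, lam=3.0):
    """Draw a query-like set: size ~ Poisson(lam), words ~ document LM."""
    size = max(1, sample_poisson(lam))
    words = list(lm)
    weights = [lm[w] for w in words]
    return random.choices(words, weights=weights, k=size)

def rop_pair(doc_tokens, corpus_freqs, corpus_total):
    """Return (more_representative_set, less_representative_set)."""
    lm = doc_language_model(doc_tokens, corpus_freqs, corpus_total)
    s1, s2 = sample_word_set(lm), sample_word_set(lm)
    ll = lambda s: sum(math.log(lm[w]) for w in s)  # log-likelihood under the doc LM
    return (s1, s2) if ll(s1) >= ll(s2) else (s2, s1)

# Toy usage: corpus statistics here stand in for real collection counts.
corpus = "the cat sat on the mat while the dog ran in the park".split()
freqs, total = Counter(corpus), len(corpus)
pos, neg = rop_pair("the cat sat on the mat".split(), freqs, total)
print("representative:", pos, " less representative:", neg)

Given such a pair, the Transformer scores each (document, word set) input, and a pairwise objective, for instance a hinge loss max(0, 1 - s(doc, pos) + s(doc, neg)), trains it to prefer the representative set, optimized jointly with the MLM loss.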

Supplementary Material

MP4 File: March 10_Session 7_5-Xinyu Ma_419.mp4 (presentation video for PROP)



    Published In

    WSDM '21: Proceedings of the 14th ACM International Conference on Web Search and Data Mining
    March 2021, 1192 pages
    ISBN: 9781450382977
    DOI: 10.1145/3437963
    Publisher: Association for Computing Machinery, New York, NY, United States


    Author Tags

    1. ad-hoc retrieval
    2. pre-training
    3. statistical language model


    Funding Sources

    • Lenovo-CAS Joint Lab Youth Scientist Project
    • Frontier Research Key Program of Chongqing Science and Technology Commission
    • K.C. Wong Education Foundation
    • Beijing Academy of Artificial Intelligence
    • National Natural Science Foundation of China
    • Youth Innovation Promotion Association CAS
    • National Key R&D Program of China

    Conference

    WSDM '21

    Acceptance Rates

    Overall acceptance rate: 498 of 2,863 submissions (17%)



    Cited By

    • (2024) On Elastic Language Models. ACM Transactions on Information Systems 42(6), 1-29. DOI: 10.1145/3677375. Online publication date: 18-Oct-2024.
    • (2024) Mitigating Entity-Level Hallucination in Large Language Models. In Proceedings of the 2024 Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region, 23-31. DOI: 10.1145/3673791.3698403. Online publication date: 8-Dec-2024.
    • (2024) Listwise Generative Retrieval Models via a Sequential Learning Process. ACM Transactions on Information Systems 42(5), 1-31. DOI: 10.1145/3653712. Online publication date: 29-Apr-2024.
    • (2024) LLMGR: Large Language Model-based Generative Retrieval in Alipay Search. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2847-2851. DOI: 10.1145/3626772.3661364. Online publication date: 10-Jul-2024.
    • (2024) CL4DIV: A Contrastive Learning Framework for Search Result Diversification. In Proceedings of the 17th ACM International Conference on Web Search and Data Mining, 171-180. DOI: 10.1145/3616855.3635851. Online publication date: 4-Mar-2024.
    • (2024) A Multi-Granularity-Aware Aspect Learning Model for Multi-Aspect Dense Retrieval. In Proceedings of the 17th ACM International Conference on Web Search and Data Mining, 674-682. DOI: 10.1145/3616855.3635770. Online publication date: 4-Mar-2024.
    • (2024) Cross-Lingual Information Retrieval from Multilingual Construction Documents Using Pretrained Language Models. Journal of Construction Engineering and Management 150(6). DOI: 10.1061/JCEMD4.COENG-14273. Online publication date: Jun-2024.
    • (2024) A Social-aware Gaussian Pre-trained model for effective cold-start recommendation. Information Processing & Management 61(2), 103601. DOI: 10.1016/j.ipm.2023.103601. Online publication date: Mar-2024.
    • (2024) MCFC: A Momentum-Driven Clicked Feature Compressed Pre-trained Language Model for Information Retrieval. In Natural Language Processing and Chinese Computing, 69-82. DOI: 10.1007/978-981-97-9431-7_6. Online publication date: 1-Nov-2024.
    • (2024) A Survey of Next Words Prediction Models. In Forthcoming Networks and Sustainability in the AIoT Era, 165-185. DOI: 10.1007/978-3-031-62871-9_14. Online publication date: 26-Jun-2024.
