DOI: 10.1145/3477495.3531772
research-article
Open access

Pre-train a Discriminative Text Encoder for Dense Retrieval via Contrastive Span Prediction

Published: 07 July 2022

Abstract

Dense retrieval has shown promising results in many information retrieval (IR) related tasks; its foundation is learning high-quality text representations for effective search. Some recent studies have shown that autoencoder-based language models can boost dense retrieval performance by pairing the encoder with a weak decoder. However, we argue that 1) decoding all of the input text is not a discriminative objective, and 2) even a weak decoder exerts a bypass effect on the encoder. Therefore, in this work we introduce a novel contrastive span prediction task that pre-trains the encoder alone while still retaining the bottleneck ability of the autoencoder. In this way, we can 1) learn discriminative text representations efficiently through group-wise contrastive learning over spans, and 2) thoroughly avoid the bypass effect of the decoder. Comprehensive experiments on publicly available retrieval benchmark datasets show that our approach significantly outperforms existing pre-training methods for dense retrieval.
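
As a rough illustration of the central idea, the sketch below (in PyTorch; not the authors' released code) shows one way a group-wise contrastive loss over spans could be computed: the representation of each text is pulled toward representations of spans sampled from that same text and pushed away from spans of the other texts in the batch. The function name, span sampling, pooling, and temperature here are illustrative assumptions, not the paper's exact configuration.

import torch
import torch.nn.functional as F

def group_wise_contrastive_span_loss(text_emb, span_emb, temperature=0.05):
    # text_emb: (B, d)    one vector per text (e.g. a [CLS]-style pooled embedding)
    # span_emb: (B, K, d) K span vectors sampled from each text
    B, K, d = span_emb.shape
    text_emb = F.normalize(text_emb, dim=-1)
    span_emb = F.normalize(span_emb, dim=-1).reshape(B * K, d)

    # Similarity of every text to every span in the batch.
    logits = text_emb @ span_emb.t() / temperature          # (B, B*K)

    # Positives for text i are the K spans drawn from text i (its "group");
    # spans from all other texts in the batch act as negatives.
    pos_mask = torch.zeros(B, B * K, dtype=torch.bool)
    for i in range(B):
        pos_mask[i, i * K:(i + 1) * K] = True

    # InfoNCE-style loss averaged over each text's group of positive spans.
    log_prob = logits - torch.logsumexp(logits, dim=-1, keepdim=True)
    return -(log_prob[pos_mask].reshape(B, K)).mean()

# Toy usage with random tensors standing in for encoder outputs.
loss = group_wise_contrastive_span_loss(torch.randn(8, 768), torch.randn(8, 4, 768))

In the paper's setting the text and span embeddings would presumably come from the same encoder being pre-trained, so the text representation must compress enough information to identify its own spans; per the abstract, this is how the bottleneck ability is retained without any decoder.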

Supplementary Material

MP4 File (SIGIR22-fp1304.mp4)
This is the presentation video of the paper "Pre-train a Discriminative Text Encoder for Dense Retrieval via Contrastive Span Prediction".





    Information & Contributors

    Published In

    SIGIR '22: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval
    July 2022
    3569 pages
    ISBN:9781450387323
    DOI:10.1145/3477495
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 07 July 2022


    Author Tags

    1. dense retrieval
    2. discriminative representation
    3. pre-training for ir

    Qualifiers

    • Research-article

    Funding Sources

    • National Natural Science Foundation of China
    • Lenovo-CAS Joint Lab Youth Scientist Project
    • Youth Innovation Promotion Association CAS
    • Foundation and Frontier Research Key Program of Chongqing Science and Technology Commission

    Conference

    SIGIR '22

    Acceptance Rates

    Overall Acceptance Rate 792 of 3,983 submissions, 20%

    Bibliometrics & Citations

    Article Metrics

    • Downloads (Last 12 months)294
    • Downloads (Last 6 weeks)50
    Reflects downloads up to 05 Mar 2025

    Cited By

    • (2024) Fine-grained distillation for long document retrieval. Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence and Thirty-Sixth Conference on Innovative Applications of Artificial Intelligence and Fourteenth Symposium on Educational Advances in Artificial Intelligence, pp. 19732-19740. DOI: 10.1609/aaai.v38i17.29947. Online publication date: 20-Feb-2024
    • (2024) On Elastic Language Models. ACM Transactions on Information Systems, 42(6), pp. 1-29. DOI: 10.1145/3677375. Online publication date: 18-Oct-2024
    • (2024) Listwise Generative Retrieval Models via a Sequential Learning Process. ACM Transactions on Information Systems, 42(5), pp. 1-31. DOI: 10.1145/3653712. Online publication date: 29-Apr-2024
    • (2024) Dense Text Retrieval Based on Pretrained Language Models: A Survey. ACM Transactions on Information Systems, 42(4), pp. 1-60. DOI: 10.1145/3637870. Online publication date: 9-Feb-2024
    • (2024) Scaling Laws For Dense Retrieval. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1339-1349. DOI: 10.1145/3626772.3657743. Online publication date: 10-Jul-2024
    • (2024) A Multi-Granularity-Aware Aspect Learning Model for Multi-Aspect Dense Retrieval. Proceedings of the 17th ACM International Conference on Web Search and Data Mining, pp. 674-682. DOI: 10.1145/3616855.3635770. Online publication date: 4-Mar-2024
    • (2024) NoteLLM: A Retrievable Large Language Model for Note Recommendation. Companion Proceedings of the ACM on Web Conference 2024, pp. 170-179. DOI: 10.1145/3589335.3648314. Online publication date: 13-May-2024
    • (2024) Graph-Based Cross-Granularity Message Passing on Knowledge-Intensive Text. IEEE/ACM Transactions on Audio, Speech and Language Processing, 32, pp. 4409-4419. DOI: 10.1109/TASLP.2024.3473308. Online publication date: 1-Jan-2024
    • (2024) DCTM. Information Processing and Management: an International Journal, 61(5). DOI: 10.1016/j.ipm.2024.103785. Online publication date: 1-Sep-2024
    • (2024) MCFC: A Momentum-Driven Clicked Feature Compressed Pre-trained Language Model for Information Retrieval. Natural Language Processing and Chinese Computing, pp. 69-82. DOI: 10.1007/978-981-97-9431-7_6. Online publication date: 1-Nov-2024
    • Show More Cited By

