SimLM: Pre-training with Representation Bottleneck for Dense Passage Retrieval

Wang, Liang; Yang, Nan; Huang, Xiaolong; Jiao, Binxing; Yang, Linjun; Jiang, Daxin; Majumder, Rangan; Wei, Furu

arXiv:2207.02578 (cs)

[Submitted on 6 Jul 2022 (v1), last revised 12 May 2023 (this version, v2)]

Abstract: In this paper, we propose SimLM (Similarity matching with Language Model pre-training), a simple yet effective pre-training method for dense passage retrieval. It employs a simple bottleneck architecture that learns to compress the passage information into a dense vector through self-supervised pre-training. We use a replaced language modeling objective, inspired by ELECTRA, to improve sample efficiency and reduce the mismatch of the input distribution between pre-training and fine-tuning. SimLM only requires access to an unlabeled corpus, and is more broadly applicable when there are no labeled data or queries. We conduct experiments on several large-scale passage retrieval datasets and show substantial improvements over strong baselines under various settings. Remarkably, SimLM even outperforms multi-vector approaches such as ColBERTv2, which incur significantly more storage cost. Our code and model checkpoints are available at this https URL.
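
The storage contrast drawn in the abstract between single-vector and multi-vector retrieval can be made concrete with a small sketch. This is not SimLM's encoder or code: the toy vectors, the helper names `rank_passages` and `floats_per_passage`, and the figures of 768 dimensions and 128 tokens are all assumptions chosen for illustration. The count also ignores the compression and smaller per-token dimensions that real multi-vector systems may use.

```python
# Illustrative sketch only: toy vectors and hypothetical sizes,
# not SimLM's actual encoder, data, or index format.

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def rank_passages(query_vec, passage_vecs):
    """Return passage indices sorted by descending inner-product score."""
    scores = [dot(query_vec, p) for p in passage_vecs]
    return sorted(range(len(passage_vecs)), key=lambda i: scores[i], reverse=True)

def floats_per_passage(dim, num_tokens=None):
    """Floats stored for one passage: `dim` for a single dense vector,
    `num_tokens * dim` when one vector is kept per token (uncompressed)."""
    return dim if num_tokens is None else num_tokens * dim

# One dense vector per passage, as in single-vector dense retrieval.
query = [1.0, 0.0, 1.0]
passages = [[0.9, 0.1, 0.8], [0.0, 1.0, 0.0], [0.5, 0.5, 0.5]]
ranking = rank_passages(query, passages)

# Hypothetical sizes: 768-dim vectors, 128 tokens per passage.
single = floats_per_passage(dim=768)
multi = floats_per_passage(dim=768, num_tokens=128)
```

Under these assumed sizes, the uncompressed per-token index stores `num_tokens` times as many floats per passage as the single-vector index, which is the kind of gap the abstract refers to.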

Comments:	Accepted to ACL 2023
Subjects:	Information Retrieval (cs.IR)
Cite as:	arXiv:2207.02578 [cs.IR]
	(or arXiv:2207.02578v2 [cs.IR] for this version)
	https://doi.org/10.48550/arXiv.2207.02578

Submission history

From: Liang Wang
[v1] Wed, 6 Jul 2022 10:51:33 UTC (199 KB)
[v2] Fri, 12 May 2023 10:26:03 UTC (211 KB)
