Text Embeddings by Weakly-Supervised Contrastive Pre-training

Wang, Liang; Yang, Nan; Huang, Xiaolong; Jiao, Binxing; Yang, Linjun; Jiang, Daxin; Majumder, Rangan; Wei, Furu

arXiv:2212.03533 (cs)

[Submitted on 7 Dec 2022 (v1), last revised 22 Feb 2024 (this version, v2)]

Abstract: This paper presents E5, a family of state-of-the-art text embeddings that transfer well to a wide range of tasks. The model is trained in a contrastive manner with weak supervision signals from our curated large-scale text pair dataset (called CCPairs). E5 can be readily used as a general-purpose embedding model for any task requiring a single-vector representation of texts, such as retrieval, clustering, and classification, achieving strong performance in both zero-shot and fine-tuned settings. We conduct extensive evaluations on 56 datasets from the BEIR and MTEB benchmarks. In zero-shot settings, E5 is the first model to outperform the strong BM25 baseline on the BEIR retrieval benchmark without using any labeled data. When fine-tuned, E5 obtains the best results on the MTEB benchmark, beating existing embedding models with 40x more parameters.
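To make "trained in a contrastive manner" on text pairs concrete, the following is a minimal sketch of a contrastive (InfoNCE-style) loss with in-batch negatives. The abstract does not specify the loss, so this formulation is an assumption, and the function names, the toy two-dimensional vectors, and the temperature of 0.05 are purely illustrative rather than taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(queries, passages, temperature=0.05):
    """Mean contrastive loss over a batch of (query, passage) pairs.

    passages[i] is treated as the positive for queries[i]; every other
    passage in the batch serves as a negative. The temperature value is
    an illustrative assumption, not a setting taken from the paper.
    """
    total = 0.0
    for i, q in enumerate(queries):
        logits = [cosine(q, p) / temperature for p in passages]
        # Numerically stable log-sum-exp over all candidates in the batch.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability of picking the true positive.
        total += log_denom - logits[i]
    return total / len(queries)


# Toy embeddings standing in for encoder outputs (hypothetical values).
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]
swapped = [aligned[1], aligned[0]]

loss_aligned = info_nce_loss(queries, aligned)
loss_swapped = info_nce_loss(queries, swapped)
print(loss_aligned, loss_swapped)
```

In this toy batch the loss is small when each query sits closest to its own paired passage and large when the pairs are swapped, which is the signal that pulls paired texts together and pushes unpaired texts apart during training.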

Comments:	17 pages, v2 fixes the SummEval numbers
Subjects:	Computation and Language (cs.CL); Information Retrieval (cs.IR)
Cite as:	arXiv:2212.03533 [cs.CL]
	(or arXiv:2212.03533v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2212.03533

Submission history

From: Liang Wang
[v1] Wed, 7 Dec 2022 09:25:54 UTC (105 KB)
[v2] Thu, 22 Feb 2024 06:21:51 UTC (105 KB)
