Experiments were conducted on the evaluation tasks discussed in Section 4. Without loss of generality, we selected several representative models for the experiments. We ran these models on the STS, transfer, and short text clustering tasks and comprehensively analyzed the quantitative results. Finally, we compared the alignment and uniformity values of these models on the STS-B development set and the t-SNE visualizations of the Stack Overflow representations obtained with these models.
5.2 Quantitative Results and Analysis
The experiments were conducted on STS tasks, transfer tasks, and short text clustering tasks, the details of which are given in Section 4. The primary purpose of sentence representations, according to Gao et al. [35] and Reimers and Gurevych [88], is to group semantically similar sentences. Hence, the results on the STS tasks were used for the main comparison, with the results on the transfer tasks and short text clustering tasks serving as supplements. Tables 8, 9, and 10 present the reproduced results of various sentence embedding models. The reproduced results may differ from those reported in the original papers owing to differences in GPU and CUDA versions during training. However, because all reproduced results were obtained in the same experimental environment, the comparison of experimental results was relatively fair.
The manner in which BERT sentence representations are pooled is also important, and there are primarily five pooling methods. The first is the [CLS] representation, in which the representation of the [CLS] token is used as the final sentence representation [49, 138]. The second is the [CLS] representation without MLP, in which an MLP layer is kept over [CLS] during training but removed during testing. This representation method was first adopted in the work of Gao et al. [35] and produced better results than the [CLS] representation in unsupervised scenarios. Consequently, this representation method has also been adopted in several models [22, 59, 100, 107, 138, 139, 142]. The third frequently used method is the “avg.” representation [48, 76], which takes the average of the token embeddings from the last layer of BERT. The fourth is the “first-last-avg.” representation, which averages the token embeddings from the first and last layers of BERT [88]. The fifth is the “last-2-avg.” representation, which averages the token embeddings from the last two layers of BERT [58, 98, 125].
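For concreteness, the sketch below illustrates how these pooling methods can be computed with a Hugging Face BERT encoder; the helper name encode and the choice of the embedding-layer output as the "first" layer are our own illustrative conventions rather than details taken from the surveyed papers, and the [CLS]-without-MLP variant coincides with the [CLS] representation at test time because the MLP head is discarded after training.

    import torch
    from transformers import AutoTokenizer, AutoModel

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased")

    def encode(sentences, pooling="cls"):
        batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
        with torch.no_grad():
            out = model(**batch, output_hidden_states=True)
        hidden = out.hidden_states                      # embedding layer + 12 transformer layers
        mask = batch["attention_mask"].unsqueeze(-1).float()
        if pooling == "cls":
            # [CLS] token of the last layer; the "[CLS] without MLP" variant is identical
            # at test time because the MLP head is removed after training
            return out.last_hidden_state[:, 0]
        if pooling == "avg":                            # mean over the last layer
            return (out.last_hidden_state * mask).sum(1) / mask.sum(1)
        if pooling == "first_last_avg":                 # mean over the first and last layers
            layer_avg = (hidden[0] + hidden[-1]) / 2.0
            return (layer_avg * mask).sum(1) / mask.sum(1)
        if pooling == "last2avg":                       # mean over the last two layers
            layer_avg = (hidden[-1] + hidden[-2]) / 2.0
            return (layer_avg * mask).sum(1) / mask.sum(1)
        raise ValueError(pooling)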
5.2.1 Performance Evaluation on the STS Tasks.
Table 8 shows the evaluation results on seven STS datasets, with Spearman’s correlation as the evaluation metric. The evaluation results provide several findings.
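As a point of reference for how such numbers are obtained, the following sketch scores a model on one STS dataset: the cosine similarity of each embedding pair is correlated with the human-annotated similarity score using Spearman’s correlation. The function name and array shapes are illustrative.

    import numpy as np
    from scipy.stats import spearmanr

    def sts_spearman(emb_a, emb_b, gold_scores):
        # emb_a, emb_b: (n_pairs, dim) embeddings of the left/right sentence of each pair
        cos = (emb_a * emb_b).sum(1) / (
            np.linalg.norm(emb_a, axis=1) * np.linalg.norm(emb_b, axis=1))
        corr, _ = spearmanr(cos, gold_scores)   # rank correlation with the gold scores
        return corr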
Contrastive learning boosted the quality of learned sentence representations. Compared with previous methods, such as average GloVe embeddings and the post-processing methods BERT-flow and BERT-whitening, almost all contrastive sentence embedding models exhibited substantial performance improvements, confirming the effectiveness of contrastive learning.
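For reference, the in-batch contrastive (InfoNCE-style) objective shared by most of the compared models can be sketched as follows, where z1 and z2 denote two encoded views of the same batch of sentences (e.g., two dropout-perturbed encodings) and the temperature value is illustrative.

    import torch
    import torch.nn.functional as F

    def info_nce(z1, z2, tau=0.05):
        z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
        sim = z1 @ z2.t() / tau                      # (batch, batch) similarity matrix
        labels = torch.arange(z1.size(0), device=z1.device)
        return F.cross_entropy(sim, labels)          # matching pairs lie on the diagonal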
Prompt-based augmentation outperformed other text augmentation strategies. In the framework of contrastive learning, sentence representation models with PLM-based augmentation outperformed models with token-level and sentence-level augmentation. The dropout augmentation adopted by SimCSE [35] features the simplest mechanism; however, it yielded a moderately good result. In contrast, the prompt-based augmentation proposed by PromptBERT achieved the best result among the augmentation strategies, with an average STS performance of 77.34\(\%\). One hypothesis is that, when fine-tuning large PLMs for contrastive learning, well-designed DA techniques usually provide more benefits in capturing semantic information.
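A hedged sketch of what prompt-based sentence embedding looks like in practice is given below: the sentence is wrapped in a fill-in-the-blank template and the hidden state at the [MASK] position is taken as its representation. The template string and helper names are illustrative approximations of the PromptBERT recipe, not an exact reproduction of it.

    import torch
    from transformers import AutoTokenizer, AutoModel

    tok = AutoTokenizer.from_pretrained("bert-base-uncased")
    bert = AutoModel.from_pretrained("bert-base-uncased")
    TEMPLATE = 'This sentence : "{}" means {}.'      # illustrative template

    def prompt_embed(sentences):
        texts = [TEMPLATE.format(s, tok.mask_token) for s in sentences]
        batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
        with torch.no_grad():
            hidden = bert(**batch).last_hidden_state
        # assume one [MASK] per sentence; use its hidden state as the embedding
        mask_pos = (batch["input_ids"] == tok.mask_token_id).nonzero()
        return hidden[mask_pos[:, 0], mask_pos[:, 1]]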
Enhancing negative samples improved contrastive learning. The introduction of hard negatives boosted the contrastive learning performance, compared with the SimCSE results. In particular, MixCSE yielded an average STS of 77.20\(\%\), which was 2.1\(\%\) higher than that of SimCSE, indicating the effectiveness of hard negatives generated by combining positive and random negative features. Although the performance improvement of DCLR over SimCSE was marginal, it still suggests the utility of noise-based negative samples.
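The mixing operation behind such artificial hard negatives can be sketched as follows; the mixing weight lam and the function name are hypothetical choices, and the stop-gradient reflects the general idea of keeping the mixed negatives fixed during back-propagation.

    import torch
    import torch.nn.functional as F

    def mix_hard_negatives(pos, negs, lam=0.2):
        # pos:  (batch, dim) positive features
        # negs: (batch, dim) randomly chosen in-batch negative features
        mixed = lam * pos + (1.0 - lam) * negs
        mixed = F.normalize(mixed, dim=-1)
        return mixed.detach()   # no gradient flows through the mixed negatives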
MCSE outperformed the other three models that incorporate external data. DiffCSE, SNCSE-Dropout, EASE, and MCSE are contrastive learning-based sentence representation models that respectively add edited sentences, soft negative samples, Wikipedia entity supervision, and multimodal sentence-image pairs to SimCSE for model training. The experimental results showed that MCSE, which incorporates multimodal sentence-image pairs for multimodal contrastive learning, outperformed the other three models.
SNCSE outperformed the other models. With an average STS of 78.19\(\%\), SNCSE achieved the best results among all of the models, as well as the best results on the STS13, STS14, and STS16 datasets. The use of prompt-based sentence embeddings and soft negative samples possibly contributed to the outstanding performance of SNCSE. The effectiveness of prompt-based augmentation is demonstrated by the excellent performance of PromptBERT. The effect of soft negative samples can be illustrated by the performance of SNCSE-Dropout, which outperformed SimCSE by about 1.4\(\%\).
Vanilla BERT underperformed average GloVe embeddings in sentence representations. The three vanilla BERT models based on different pooling methods underperformed the average GloVe embeddings (non-contextualized embeddings trained with a simple model). The experimental results confirm the finding in the work of Li et al. [58] and Reimers and Gurevych [88]. In addition, both BERT-flow and BERT-whitening significantly improved the performance of vanilla BERT.
Combining three salient factors that influenced the quality of contrastive sentence representations could lead to better results. As MixCSE combines dropout augmentation and hard negatives, and DiffCSE combines dropout augmentation and external training data, the new model MixCSE-DiffCSE is a mixture of positive pairs, hard negatives, and additional training data. According to the experimental results at the bottom of Table 8, the average STS performance of MixCSE-DiffCSE was 77.59\(\%\), which was better than those of MixCSE (77.20\(\%\)) and DiffCSE (77.03\(\%\)), indicating that combining these three salient factors is likely to produce even better results.
5.2.2 Performance Evaluation on the Complementary Tasks.
Tables 9 and 10 present the evaluation results on the seven sentence classification datasets from different domains and on the six short text clustering datasets. The evaluation results were analyzed, and several trends were observed.
Overall performance on the transfer tasks exceeded that on the short text clustering tasks. The average classification accuracy of these sentence embedding models remained at about 85\(\%\) across the seven transfer tasks, whereas their average clustering accuracy remained at around 60\(\%\) across the six short text clustering tasks. Extracting semantic features from short text is rather difficult, and the short text clustering tasks contained more clusters (e.g., the Google News dataset had 152 clusters), whereas most of the sentence classification tasks were binary classification problems, which explains the preceding results.
Contrastive learning improved performance on most transfer tasks. The performance of BERT and contrastive learning-based models on the seven transfer tasks was compared. The vanilla BERT models based on the “first-last-avg.” and “avg.” representations achieved higher performance than the contrastive learning baselines on the CR and TREC datasets. The relatively small size of the CR and TREC datasets affected the training of linear classifiers based on frozen sentence representations and may have influenced the evaluation results of the contrastive learning baselines. However, the contrastive learning baselines achieved higher performance on the MR, SUBJ, MPQA, SST-2, and MRPC datasets. Thus, contrastive learning is useful for most transfer tasks.
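As a rough sketch of this transfer protocol, a logistic-regression probe is fit on frozen sentence embeddings and scored by classification accuracy; the scikit-learn setup below is an assumption about a typical SentEval-style evaluation rather than the exact configuration used here.

    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    def linear_probe(train_X, train_y, test_X, test_y):
        # embeddings stay frozen; only the linear classifier is trained
        clf = LogisticRegression(max_iter=1000)
        clf.fit(train_X, train_y)
        return accuracy_score(test_y, clf.predict(test_X))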
Contrastive learning improved performance on most short text clustering tasks. The performance of BERT and contrastive learning-based models on the six short text clustering tasks was compared. We observed that the contrastive learning models performed better than vanilla BERT on the Search Snippets, Stack Overflow, Biomedical, Tweet, and Google News datasets, whereas the “first-last-avg.” representation of BERT performed better on the Ag News dataset. Moreover, the average performance of BERT’s [CLS] representation on the six short text clustering tasks was 15.35\(\%\), significantly lower than that of the average GloVe embeddings (55.66\(\%\)).
Model performance on the short text clustering datasets was unevenly distributed. The experimental models exhibited moderately good results on the Ag News and Google News datasets, with overall clustering accuracies of around 80\(\%\) and 66\(\%\), respectively. However, they performed poorly on the Biomedical and Tweet datasets, with overall clustering accuracies of about 35\(\%\) and 50\(\%\), respectively. Additionally, the performances of these models varied considerably on the Search Snippets and Stack Overflow datasets. In particular, we noticed that almost all of the models exhibited the worst performance on the Biomedical dataset, possibly because the Biomedical dataset differed significantly from the Wikipedia training corpus.
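For clarity, the following sketch shows one common way such clustering accuracies are computed: k-means is run on the frozen embeddings, and predicted clusters are mapped to gold labels with the Hungarian algorithm. The k-means settings and the assumption that labels are encoded as integers in [0, n_clusters) are illustrative.

    import numpy as np
    from scipy.optimize import linear_sum_assignment
    from sklearn.cluster import KMeans

    def clustering_accuracy(embeddings, labels, n_clusters):
        pred = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(embeddings)
        # contingency table between predicted clusters and gold labels
        counts = np.zeros((n_clusters, n_clusters), dtype=np.int64)
        for p, y in zip(pred, labels):
            counts[p, y] += 1
        # Hungarian matching maximizes the number of correctly matched samples
        row, col = linear_sum_assignment(counts.max() - counts)
        return counts[row, col].sum() / len(labels)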
5.4 Visualization of Sentence Representations
Visualizing sentence embeddings is useful for understanding how well these models learn from sentences. For a representation to be considered high quality, embeddings of sentences with the same semantics should be close to each other, while embeddings of sentences with different semantics should be pulled apart. To visualize high-dimensional vector embeddings, a widely used dimension reduction technique, t-SNE [102], was employed to transform the high-dimensional vectors into a two-dimensional vector space.
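A minimal sketch of this visualization step is given below, assuming frozen sentence embeddings and integer cluster labels for the Stack Overflow titles; the perplexity and other t-SNE hyperparameters are illustrative.

    import matplotlib.pyplot as plt
    from sklearn.manifold import TSNE

    def plot_tsne(embeddings, labels):
        # project the embeddings to 2D and color points by their cluster label
        points = TSNE(n_components=2, perplexity=30, init="pca",
                      random_state=42).fit_transform(embeddings)
        plt.figure(figsize=(6, 6))
        plt.scatter(points[:, 0], points[:, 1], c=labels, s=2, cmap="tab20")
        plt.axis("off")
        plt.show()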
Following the experimental setting in PairSupCon [133], the experimental dataset for our t-SNE visualization was the short text clustering dataset Stack Overflow, a subset of the challenge data published on Kaggle covering 20,000 question titles from 20 clusters. We chose Stack Overflow because model performance on this dataset better reflects the models’ overall average performance on the six short text clustering tasks.
The visualization of the Stack Overflow representations using t-SNE is shown in Figure 4. The [CLS] representation from BERT exhibited the worst performance and failed to achieve category clustering, with the “avg.” representation of BERT performing slightly better than the [CLS] representation. SR-BERT and BT-BERT, which use simple DA techniques to generate positive pairs, outperformed vanilla BERT but underperformed models using well-designed DA techniques. Additionally, all of the contrastive learning-based sentence representation models were superior to vanilla BERT. In particular, TSFC-BERT demonstrated a strong clustering quality. Improved models based on SimCSE, such as ESimCSE-SimCLR, MixCSE, and EASE, showed a more powerful clustering capacity than SimCSE. Overall, the experimental results agree with our expectations that contrastive learning is beneficial to BERT-derived sentence representation learning and that constructing high-quality positive pairs, negative samples, and additional training data allows the model to learn high-level category structure information.