Experiments were conducted on the evaluation tasks discussed in Section 4. Without loss of generality, we selected several representative models for the experiments. We ran these models on the STS, transfer, and short text clustering tasks and comprehensively analyzed the quantitative results. Finally, we compared the alignment and uniformity values of these models on the STS-B development set and the t-SNE visualizations of the Stack Overflow representations obtained with these models.
5.2 Quantitative Results and Analysis
The experiments were conducted on STS tasks, transfer tasks, and short text clustering tasks, the details of which are given in Section 4. The primary purpose of sentence representations, according to Gao et al. [35] and Reimers and Gurevych [88], is to group semantically similar sentences. Hence, the results on the STS tasks were used for the main comparison, with the results on the transfer tasks and short text clustering tasks serving as supplements. Tables 8, 9, and 10 present the reproduced results of various sentence embedding models. The reproduced results may differ from those reported in the original papers owing to differences in GPU and CUDA versions during training. However, because all reproduced results were obtained in the same experimental environment, the comparison of experimental results was relatively fair.
The manner in which BERT sentence representations are pooled is also important, and there are primarily five pooling methods. The first is the [CLS] representation, in which the representation of the [CLS] token is used as the final sentence representation [49, 138]. The second is the [CLS] representation without MLP, in which an MLP layer is kept over [CLS] during training but removed during testing. This representation method was first adopted in the work of Gao et al. [35] and produced better results than the [CLS] representation in unsupervised scenarios. Consequently, this representation method has also been adopted in several models [22, 59, 100, 107, 138, 139, 142]. The third frequently used method is the “avg.” representation [48, 76], which takes the average of the token embeddings from the last layer of BERT. The fourth is the “first-last-avg.” representation, which averages the token embeddings from the first and last layers of BERT [88]. The fifth is the “last-2-avg.” representation, which averages the token embeddings from the last two layers of BERT [58, 98, 125].
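For concreteness, the sketch below illustrates how these pooling methods can be computed with a Hugging Face BERT encoder; the helper name encode and the choice of the embedding-layer output as the "first" layer are our own illustrative conventions rather than details taken from the surveyed papers, and the [CLS]-without-MLP variant coincides with the [CLS] representation at test time because the MLP head is discarded after training.

    import torch
    from transformers import AutoTokenizer, AutoModel

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased")

    def encode(sentences, pooling="cls"):
        batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
        with torch.no_grad():
            out = model(**batch, output_hidden_states=True)
        hidden = out.hidden_states                      # embedding layer + 12 transformer layers
        mask = batch["attention_mask"].unsqueeze(-1).float()
        if pooling == "cls":
            # [CLS] token of the last layer; the "[CLS] without MLP" variant is identical
            # at test time because the MLP head is removed after training
            return out.last_hidden_state[:, 0]
        if pooling == "avg":                            # mean over the last layer
            return (out.last_hidden_state * mask).sum(1) / mask.sum(1)
        if pooling == "first_last_avg":                 # mean over the first and last layers
            layer_avg = (hidden[0] + hidden[-1]) / 2.0
            return (layer_avg * mask).sum(1) / mask.sum(1)
        if pooling == "last2avg":                       # mean over the last two layers
            layer_avg = (hidden[-1] + hidden[-2]) / 2.0
            return (layer_avg * mask).sum(1) / mask.sum(1)
        raise ValueError(pooling)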
5.2.1 Performance Evaluation on the STS Tasks.
Table 8 shows the evaluation results on seven STS datasets, with Spearman’s correlation as the evaluation metric. The evaluation results provide several findings.
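As a point of reference for how such numbers are obtained, the following sketch scores a model on one STS dataset: the cosine similarity of each embedding pair is correlated with the human-annotated similarity score using Spearman’s correlation. The function name and array shapes are illustrative.

    import numpy as np
    from scipy.stats import spearmanr

    def sts_spearman(emb_a, emb_b, gold_scores):
        # emb_a, emb_b: (n_pairs, dim) embeddings of the left/right sentence of each pair
        cos = (emb_a * emb_b).sum(1) / (
            np.linalg.norm(emb_a, axis=1) * np.linalg.norm(emb_b, axis=1))
        corr, _ = spearmanr(cos, gold_scores)   # rank correlation with the gold scores
        return corr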
Contrastive learning boosted the quality of learned sentence representations. Compared with previous methods, such as average GloVe embeddings and the post-processing methods BERT-flow and BERT-whitening, almost all contrastive sentence embedding models exhibited substantial performance improvements, confirming the effectiveness of contrastive learning.
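For reference, the in-batch contrastive (InfoNCE-style) objective shared by most of the compared models can be sketched as follows, where z1 and z2 denote two encoded views of the same batch of sentences (e.g., two dropout-perturbed encodings) and the temperature value is illustrative.

    import torch
    import torch.nn.functional as F

    def info_nce(z1, z2, tau=0.05):
        z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
        sim = z1 @ z2.t() / tau                      # (batch, batch) similarity matrix
        labels = torch.arange(z1.size(0), device=z1.device)
        return F.cross_entropy(sim, labels)          # matching pairs lie on the diagonal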
Prompt-based augmentation outperformed other text augmentation strategies. In the framework of contrastive learning, sentence representation models with PLM-based augmentation outperformed models with token-level and sentence-level augmentation. The dropout augmentation adopted by SimCSE [35] features the simplest mechanism; however, it yielded a moderately good result. In contrast, the prompt-based augmentation proposed by PromptBERT achieved the best result among the augmentation strategies, with an average STS performance of 77.34\(\%\). One hypothesis is that, when fine-tuning large PLMs for contrastive learning, well-designed DA techniques usually provide more benefits in capturing semantic information.
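A hedged sketch of what prompt-based sentence embedding looks like in practice is given below: the sentence is wrapped in a fill-in-the-blank template and the hidden state at the [MASK] position is taken as its representation. The template string and helper names are illustrative approximations of the PromptBERT recipe, not an exact reproduction of it.

    import torch
    from transformers import AutoTokenizer, AutoModel

    tok = AutoTokenizer.from_pretrained("bert-base-uncased")
    bert = AutoModel.from_pretrained("bert-base-uncased")
    TEMPLATE = 'This sentence : "{}" means {}.'      # illustrative template

    def prompt_embed(sentences):
        texts = [TEMPLATE.format(s, tok.mask_token) for s in sentences]
        batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
        with torch.no_grad():
            hidden = bert(**batch).last_hidden_state
        # assume one [MASK] per sentence; use its hidden state as the embedding
        mask_pos = (batch["input_ids"] == tok.mask_token_id).nonzero()
        return hidden[mask_pos[:, 0], mask_pos[:, 1]]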
Enhancing negative samples improved contrastive learning. The introduction of hard negatives boosted the contrastive learning performance, compared with the SimCSE results. In particular, MixCSE yielded an average STS of 77.20\(\%\), which was 2.1\(\%\) higher than that of SimCSE, indicating the effectiveness of hard negatives generated by combining positive and random negative features. Although the performance improvement of DCLR over SimCSE was marginal, it still suggests the utility of noise-based negative samples.
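The mixing operation behind such artificial hard negatives can be sketched as follows; the mixing weight lam and the function name are hypothetical choices, and the stop-gradient reflects the general idea of keeping the mixed negatives fixed during back-propagation.

    import torch
    import torch.nn.functional as F

    def mix_hard_negatives(pos, negs, lam=0.2):
        # pos:  (batch, dim) positive features
        # negs: (batch, dim) randomly chosen in-batch negative features
        mixed = lam * pos + (1.0 - lam) * negs
        mixed = F.normalize(mixed, dim=-1)
        return mixed.detach()   # no gradient flows through the mixed negatives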
MCSE outperformed the other three models that incorporate external data. DiffCSE, SNCSE-Dropout, EASE, and MCSE are contrastive learning-based sentence representation models that respectively add edited sentences, soft negative samples, Wikipedia entity supervision, and multimodal sentence-image pairs to SimCSE for model training. The experimental results showed that MCSE, which incorporates multimodal sentence-image pairs for multimodal contrastive learning, outperformed the other three models.
SNCSE outperformed the other models. With an average STS of 78.19\(\%\), SNCSE achieved the best results among all of the models, as well as the best results on the STS13, STS14, and STS16 datasets. The use of prompt-based sentence embeddings and soft negative samples possibly contributed to the outstanding performance of SNCSE. The effectiveness of prompt-based augmentation is demonstrated by the excellent performance of PromptBERT. The effect of soft negative samples can be illustrated by the performance of SNCSE-Dropout, which outperformed SimCSE by about 1.4\(\%\).
Vanilla BERT underperformed average GloVe embeddings in sentence representations. The three vanilla BERT models based on different pooling methods underperformed the average GloVe embeddings (non-contextualized embeddings trained with a simple model). The experimental results confirm the finding in the work of Li et al. [58] and Reimers and Gurevych [88]. In addition, both BERT-flow and BERT-whitening significantly improved the performance of vanilla BERT.
Combining three salient factors that influenced the quality of contrastive sentence representations could lead to better results. As MixCSE combines dropout augmentation and hard negatives, and DiffCSE combines dropout augmentation and external training data, the new model MixCSE-DiffCSE is a mixture of positive pairs, hard negatives, and additional training data. According to the experimental results at the bottom of Table 8, the average STS performance of MixCSE-DiffCSE was 77.59\(\%\), which was better than those of MixCSE (77.20\(\%\)) and DiffCSE (77.03\(\%\)), indicating that combining these three salient factors is likely to produce even better results.
5.2.2 Performance Evaluation on the Complementary Tasks.
Tables 9 and 10 present the evaluation results on the seven sentence classification datasets from different domains and on the six short text clustering datasets. The evaluation results were analyzed, and several trends were observed.
Overall performance on the transfer tasks exceeded that on the short text clustering tasks. The average classification accuracy of these sentence embedding models remained at about 85\(\%\) across the seven transfer tasks, whereas their average clustering accuracy remained at around 60\(\%\) across the six short text clustering tasks. Extracting semantic features from short text is rather difficult, and the short text clustering tasks contained more clusters (e.g., the Google News dataset had 152 clusters), whereas most of the sentence classification tasks were binary classification problems, which explains the preceding results.
Contrastive learning improved performance on most transfer tasks. The performance of BERT and contrastive learning-based models on the seven transfer tasks was compared. The vanilla BERT models based on the “first-last-avg.” and “avg.” representations achieved higher performance than the contrastive learning baselines on the CR and TREC datasets. The relatively small size of the CR and TREC datasets affected the training of linear classifiers based on frozen sentence representations and may have influenced the evaluation results of the contrastive learning baselines. However, the contrastive learning baselines achieved higher performance on the MR, SUBJ, MPQA, SST-2, and MRPC datasets. Thus, contrastive learning is useful for most transfer tasks.
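As a rough sketch of this transfer protocol, a logistic-regression probe is fit on frozen sentence embeddings and scored by classification accuracy; the scikit-learn setup below is an assumption about a typical SentEval-style evaluation rather than the exact configuration used here.

    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    def linear_probe(train_X, train_y, test_X, test_y):
        # embeddings stay frozen; only the linear classifier is trained
        clf = LogisticRegression(max_iter=1000)
        clf.fit(train_X, train_y)
        return accuracy_score(test_y, clf.predict(test_X))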
Contrastive learning improved performance on most short text clustering tasks. The performance of BERT and contrastive learning-based models on the six short text clustering tasks was compared. We observed that the contrastive learning models performed better than vanilla BERT on the Search Snippets, Stack Overflow, Biomedical, Tweet, and Google News datasets, whereas the “first-last-avg.” representation of BERT performed better on the Ag News dataset. Moreover, the average performance of BERT’s [CLS] representation on the six short text clustering tasks was 15.35\(\%\), significantly lower than that of the average GloVe embeddings (55.66\(\%\)).
Model performance on the short text clustering datasets was unevenly distributed. The experimental models exhibited moderately good results on the Ag News and Google News datasets, with overall clustering accuracies of around 80\(\%\) and 66\(\%\), respectively. However, they performed poorly on the Biomedical and Tweet datasets, with overall clustering accuracies of about 35\(\%\) and 50\(\%\), respectively. Additionally, the performances of these models varied considerably on the Search Snippets and Stack Overflow datasets. In particular, we noticed that almost all of the models exhibited the worst performance on the Biomedical dataset, possibly because the Biomedical dataset differed significantly from the Wikipedia training corpus.
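For clarity, the following sketch shows one common way such clustering accuracies are computed: k-means is run on the frozen embeddings, and predicted clusters are mapped to gold labels with the Hungarian algorithm. The k-means settings and the assumption that labels are encoded as integers in [0, n_clusters) are illustrative.

    import numpy as np
    from scipy.optimize import linear_sum_assignment
    from sklearn.cluster import KMeans

    def clustering_accuracy(embeddings, labels, n_clusters):
        pred = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(embeddings)
        # contingency table between predicted clusters and gold labels
        counts = np.zeros((n_clusters, n_clusters), dtype=np.int64)
        for p, y in zip(pred, labels):
            counts[p, y] += 1
        # Hungarian matching maximizes the number of correctly matched samples
        row, col = linear_sum_assignment(counts.max() - counts)
        return counts[row, col].sum() / len(labels)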
5.4 Visualization of Sentence Representations
Visualizing sentence embeddings is useful for understanding how well these models learn from sentences. For a representation to be considered high quality, embeddings of sentences with the same semantics should be close to each other, while embeddings of sentences with different semantics should be pulled apart. To visualize high-dimensional vector embeddings, a widely used dimension reduction technique, t-SNE [102], was employed to transform the high-dimensional vectors into a two-dimensional vector space.
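A minimal sketch of this visualization step is given below, assuming frozen sentence embeddings and integer cluster labels for the Stack Overflow titles; the perplexity and other t-SNE hyperparameters are illustrative.

    import matplotlib.pyplot as plt
    from sklearn.manifold import TSNE

    def plot_tsne(embeddings, labels):
        # project the embeddings to 2D and color points by their cluster label
        points = TSNE(n_components=2, perplexity=30, init="pca",
                      random_state=42).fit_transform(embeddings)
        plt.figure(figsize=(6, 6))
        plt.scatter(points[:, 0], points[:, 1], c=labels, s=2, cmap="tab20")
        plt.axis("off")
        plt.show()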
Following the experimental setting in PairSupCon [133], the experimental dataset for our t-SNE visualization was the short text clustering dataset Stack Overflow, a subset of the challenge data published on Kaggle covering 20,000 question titles from 20 clusters. We chose Stack Overflow because model performance on this dataset better reflects the models’ overall average performance on the six short text clustering tasks.
The visualization of the Stack Overflow representations using t-SNE is shown in Figure 4. The [CLS] representation from BERT exhibited the worst performance and failed to achieve category clustering, with the “avg.” representation of BERT performing slightly better than the [CLS] representation. SR-BERT and BT-BERT, which use simple DA techniques to generate positive pairs, outperformed vanilla BERT but underperformed models using well-designed DA techniques. Additionally, all of the contrastive learning-based sentence representation models were superior to vanilla BERT. In particular, TSFC-BERT demonstrated a strong clustering quality. Improved models based on SimCSE, such as ESimCSE-SimCLR, MixCSE, and EASE, showed a more powerful clustering capacity than SimCSE. Overall, the experimental results agree with our expectations that contrastive learning is beneficial to BERT-derived sentence representation learning and that constructing high-quality positive pairs, negative samples, and additional training data allows the model to learn high-level category structure information.