or two independent encoders. As shown in Table 2,
we find that SimCSE performs much better than
the next-sentence objectives (82.5 vs. 67.4 on STS-
B), and that using one encoder rather than two
makes a significant difference in our approach.
Why does it work? To further understand the
role of dropout noise in unsupervised SimCSE, we
try out different dropout rates in Table 3 and ob-
serve that all the variants underperform the default
dropout probability p = 0.1 from Transformers.
We find two extreme cases particularly interesting:
“no dropout” (p = 0) and “fixed 0.1” (using default
dropout p = 0.1 but the same dropout masks for
the pair). In both cases, the resulting embeddings
for the pair are exactly the same, which leads to
a dramatic performance degradation. We take the
checkpoints of these models every 10 steps during
training and visualize the alignment and uniformity
metrics in Figure 2, along with a simple data
augmentation model, “delete one word”. As clearly
shown, starting from pre-trained checkpoints, all
models greatly improve uniformity. However, the
alignment of the two special variants also degrades
drastically, while our unsupervised SimCSE keeps
a steady alignment, thanks to the use of dropout
noise. It also demonstrates that starting from a pre-
trained checkpoint is crucial, for it provides good
initial alignment. Finally, “delete one word” im-
proves alignment but yields a smaller gain in
uniformity, and ultimately underperforms unsuper-
vised SimCSE.
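The dropout variants and the two metrics discussed above can be sketched with a toy example. This is a minimal illustration, not the paper's implementation: the `dropout` helper and random vectors stand in for a real Transformer encoder, and the alignment and uniformity formulas follow the standard definitions of Wang and Isola (2020) on unit-normalized embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(h, p, mask=None):
    """Inverted dropout: zero units with probability p, rescale by 1/(1-p)."""
    if p == 0.0:
        return h, None
    if mask is None:
        mask = rng.random(h.shape) >= p
    return h * mask / (1.0 - p), mask

def normalize(z):
    return z / np.linalg.norm(z, axis=1, keepdims=True)

def alignment(x, y):
    """E ||f(x) - f(x+)||^2 over positive pairs (lower is better)."""
    return ((x - y) ** 2).sum(axis=1).mean()

def uniformity(x):
    """log E exp(-2 ||f(x) - f(y)||^2) over distinct pairs (lower is better)."""
    sq = ((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=-1)
    off_diag = ~np.eye(x.shape[0], dtype=bool)
    return np.log(np.exp(-2.0 * sq[off_diag]).mean())

# Stand-in for encoder hidden states of a batch of 4 sentences (dim 16).
h = rng.normal(size=(4, 16))

# Default setting: the same input encoded twice with independent dropout
# masks, so the two views of each sentence differ (a noisy positive pair).
z1, _ = dropout(h, p=0.1)
z2, _ = dropout(h, p=0.1)

# "no dropout" (p = 0): the two views are exactly identical.
a1, _ = dropout(h, p=0.0)
a2, _ = dropout(h, p=0.0)

# "fixed 0.1": same rate, but the mask is reused, so views are again identical.
b1, m = dropout(h, p=0.1)
b2, _ = dropout(h, p=0.1, mask=m)

print(alignment(normalize(z1), normalize(z2)))  # > 0: dropout noise separates the pair
print(alignment(normalize(a1), normalize(a2)))  # exactly 0: identical views
print(alignment(normalize(b1), normalize(b2)))  # exactly 0: identical views
print(uniformity(normalize(z1)))                # negative; closer to 0 = less uniform
```

The two degenerate variants produce a positive pair with zero alignment loss by construction, so the contrastive objective can only push embeddings apart, which matches the degraded alignment observed in Figure 2.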