
Towards Democratizing Joint-Embedding Self-Supervised Learning

Florian Bordes 1,2, Randall Balestriero 2, Pascal Vincent 1,2

1 Mila, Université de Montréal   2 Meta AI

arXiv:2303.01986v1 [cs.LG] 3 Mar 2023

Abstract

Joint Embedding Self-Supervised Learning (JE-SSL) has seen rapid developments in recent years, due to its promise to effectively leverage large unlabeled data. The development of JE-SSL methods was driven primarily by the search for ever increasing downstream classification accuracies, using huge computational resources, and typically built upon insights and intuitions inherited from a close parent JE-SSL method. This has led unwittingly to numerous pre-conceived ideas that carried over across methods, e.g. that SimCLR requires very large mini-batches to yield competitive accuracies, or that strong and computationally slow data augmentations are required. In this work, we debunk several such ill-formed a priori ideas in the hope of unleashing the full potential of JE-SSL free of unnecessary limitations. In fact, when carefully evaluating performances across different downstream tasks and properly optimizing hyper-parameters of the methods, we most often – if not always – see that these widespread misconceptions do not hold. For example, we show that it is possible to train SimCLR to learn useful representations while using a single image patch as negative example, and simple Gaussian noise as the only data augmentation for the positive pair. Along these lines, in the hope of democratizing JE-SSL and allowing researchers to easily make more extensive evaluations of their methods, we introduce an optimized PyTorch library for SSL: https://github.com/facebookresearch/FFCV-SSL.

Figure 1. ImageNet validation accuracy (y-axis) during training of SimCLR with respect to the training time (x-axis), comparing FFCV-SSL (8 GPUs, resolution 160 -> 224), FFCV-SSL (8 GPUs, 224), FFCV-SSL (1 GPU, resolution 160 -> 224), and Torchvision (8 GPUs, 224). FFCV-SSL is our proposed library that is specifically optimized for Self-Supervised Learning and extends the original FFCV library [17]. We compare FFCV-SSL with torchvision using various image resolutions (224 means that a fixed resolution of 224x224 is used when cropping the images, while 160 -> 224 means that the resolution increases during training from 160x160 to 224x224); no other changes have been applied in the implementation and the same hardware (A100 GPUs) is used. Enabled by FFCV-SSL, we are able to perform thorough empirical investigations of preconceived failure modes of SSL models, for which we obtain novel conclusions, e.g. 1) SimCLR can perform equally well with small or large mini-batch training, 2) strong data augmentations are not always necessary and cropping+grayscale is enough to reach competitive performances across SSL methods, and 3) it is now possible to train an SSL method, e.g. SimCLR in this figure, using only 1 GPU in a reasonable amount of time.
1. Introduction

Interest in Self-Supervised Learning (SSL) has increased steadily since the work of [5]. By using very specific sets of data augmentations to design positive pairs of examples, as well as using large mini-batches of images to define the negative examples, [5] demonstrated the competitiveness of SSL with respect to supervised baselines. Since then, other works have tried to build upon the contrastive method of [5] by increasing the scale [6], improving the negative example sampling scheme using a buffer [6], or using new data augmentations [9]. Despite their successes, contrastive methods are subject to several frequent criticisms. The most common one is the assumed need for a large number of negative examples, i.e. large batches, which limits the accessibility of contrastive method training to researchers who have access to massive and costly hardware resources. Another common criticism is the requirement for a very specific set of hand-crafted data augmentations to make such methods work. Moreover, in many instances, the use of these data augmentations can considerably increase the training time.

The computational burden of SSL directly impairs its widespread adoption, since access to multi-GPU and large-scale training is not guaranteed, and since most available resources are turned towards producing yet better performing SSL variants. Two direct consequences are (i) that there exists no practical guideline in the literature prescribing how to perform more computationally friendly SSL, even at the cost of slightly reduced top-1 performance,
and (ii) that successively refined SSL methods rarely spend resources to question or contest the empirical guidelines that were developed in previous studies. Those two points also interplay with each other, since (ii) for example commonly prescribes the use of the color-jitter data augmentation to produce state-of-the-art performances, yet this augmentation is also among the most computationally expensive to apply in practice.

In this paper, we first show that several such widely held ideas concerning Joint Embedding Self-Supervised Learning (JE-SSL) methods are misleading, and are an obstacle to the democratization of JE-SSL methods. Here are further illustrations of popular misconceptions that will be debunked in this paper:

• "We find that, when the number of training epochs is small (e.g. 100 epochs), larger batch sizes have a significant advantage over the smaller ones." [5] "SimCLR and SwAV both require a large batch (e.g., 4096) to work well." [7] "Contrastive methods suffer from the need of a lot of negative examples which can translate into the need for very large batch sizes" [2]

• "We find that [BYOL] is not robust to removing some types of data augmentations, like SimCLR" [11]

By reproducing many experiments of the original work of [5], we will be able to debunk most of the aforementioned a prioris. In fact, it appears that some of the most influential papers in the field may have paid insufficient attention to a fundamental aspect of empirical machine learning research: hyper-parameter tuning and diversity in the evaluation protocol. Our main takeaways will be that JE-SSL methods need special care, as their performance varies greatly with respect to the employed loss' hyper-parameters, and that solely looking at Imagenet-1k [8] downstream performance is often misleading when it comes to measuring the quality of a learned representation. Being free of such misconceptions, we then explore a more extreme scenario: we train SimCLR with a single negative example which is taken from a small patch of the positive pair. Doing this with only Gaussian noise as data augmentation for the positive pair leads, on several downstream tasks, to results that are very close to the SimCLR baseline.

All of our empirical analysis is enabled by FFCV-SSL, which we developed specifically to reduce data loading overhead when training JE-SSL methods, and which is based on the fast data loading library FFCV [17]. By using FFCV-SSL (https://github.com/facebookresearch/FFCV-SSL), we were able to run SSL experiments 3 times faster than before.

2. Joint-Embedding Self-Supervised Methods and Notations

JE-SSL relies on processing multiple – semantically related – views of a same input through a nonlinear mapping, commonly a deep network (DN), and enforcing that the produced representations of those views are close to each other. This matching is enforced through the positive term of the loss function, all while preventing the DN's mapping from collapsing, e.g. to a constant function, through a collapse prevention term. Different flavors of positive and collapse prevention terms lead to different JE-SSL methods.

Dataset, Data-Augmentation and Relation Matrix Notations. Regardless of the loss and method employed, SSL relies on having access to a set of observations, i.e. input samples $X \triangleq [x_1, \dots, x_N]^T \in \mathbb{R}^{N \times D}$, and a known positive relationship between those samples, e.g. in the form of a symmetric matrix $G \in (\mathbb{R}^+)^{N \times N}$ where $(G)_{i,j} > 0$ iff samples $x_i$ and $x_j$ are semantically related, with 0 on the diagonal. Commonly, one is only given a dataset $X'$ of size $N'$, and artificially constructs the JE-SSL dataset $X$ and $G$ from augmentations of $X'$, e.g. rotated (for images) or noised versions of the original samples, obtained through transformations $t$ drawn from some distribution $\mathcal{T}$. Of course, such transformations are designed to preserve the semantic information of the original inputs, as those are employed to determine the positive views of the inputs as in

$$\begin{bmatrix} x_1^T \\ \vdots \\ x_{N'}^T \end{bmatrix} \xrightarrow{\,t_{i,j} \sim \mathcal{T},\,\forall i,j\,} \begin{bmatrix} t_{1,1}(x_1)^T \\ t_{1,2}(x_1)^T \\ \vdots \\ t_{N',1}(x_{N'})^T \\ t_{N',2}(x_{N'})^T \end{bmatrix},$$

with $(G)_{i,j} = 1_{\{j-1=i\}} + 1_{\{j+1=i\}}$, and where in this case each original input is used to generate two positive views. Lastly, $Z \in \mathbb{R}^{N \times K}$ denotes the matrix of feature maps obtained from a model $f_\theta : \mathbb{R}^D \mapsto \mathbb{R}^K$ – commonly a Deep Network – as $Z \triangleq [f_\theta(x_1), \dots, f_\theta(x_N)]^T$.
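To make these notations concrete, the following minimal PyTorch sketch builds the augmented dataset $X$ and the relation matrix $G$ from a batch of original samples. The specific transformation (Gaussian noise) and the helper name are illustrative assumptions, not part of the paper's codebase.

```python
import torch

def make_views_and_relations(x_prime: torch.Tensor, noise_std: float = 0.1):
    """Builds the JE-SSL dataset X and relation matrix G from N' samples.

    x_prime: (N', D) original samples. Each sample yields two positive
    views (here simply two noised copies, standing in for t ~ T).
    """
    n_prime = x_prime.shape[0]
    views = []
    for i in range(n_prime):
        # Two stochastic views per sample: t_{i,1}(x_i) and t_{i,2}(x_i).
        views.append(x_prime[i] + noise_std * torch.randn_like(x_prime[i]))
        views.append(x_prime[i] + noise_std * torch.randn_like(x_prime[i]))
    X = torch.stack(views)  # (N, D) with N = 2 N'
    # G_{i,j} > 0 iff views i and j come from the same original sample;
    # with this ordering the positive pairs are (0,1), (2,3), ...
    G = torch.zeros(2 * n_prime, 2 * n_prime)
    for i in range(n_prime):
        G[2 * i, 2 * i + 1] = 1.0
        G[2 * i + 1, 2 * i] = 1.0
    return X, G
```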
VICReg's loss [2] is defined as a function of $X$ and $G$ through the following triplet loss:

$$\mathcal{L} = \alpha \sum_{k=1}^{K} \mathrm{relu}\!\left(1 - \sqrt{\mathrm{Cov}(Z)_{k,k}}\right) + \beta \sum_{k=1}^{K} \sum_{j \neq k} \mathrm{Cov}(Z)_{k,j}^2 + \frac{\gamma}{N} \sum_{i=1}^{N} \sum_{j=1}^{N} (G)_{i,j} \| Z_{i,.} - Z_{j,.} \|_2^2. \quad (1)$$

We will refer to the three terms in Eq. (1) as $\mathcal{L}_{var}$, $\mathcal{L}_{cov}$, and $\mathcal{L}_{inv}$ respectively.
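As a concrete reference, here is a minimal PyTorch sketch of Eq. (1), written for the common two-view case where z_a and z_b are the embeddings of the two positive views. It is a simplified transcription of the equation, not the official VICReg implementation.

```python
import torch
import torch.nn.functional as F

def vicreg_loss(z_a, z_b, alpha=25.0, beta=1.0, gamma=25.0, eps=1e-4):
    """Eq. (1) specialized to two views: variance, covariance, invariance."""
    n, k = z_a.shape
    loss = 0.0
    for z in (z_a, z_b):
        z_c = z - z.mean(dim=0)  # center each dimension across the batch
        # L_var: hinge on the per-dimension standard deviation.
        std = torch.sqrt(z_c.var(dim=0) + eps)
        loss = loss + alpha * F.relu(1.0 - std).sum()
        # L_cov: squared off-diagonal covariance entries.
        cov = (z_c.T @ z_c) / (n - 1)
        off_diag = cov - torch.diag(torch.diag(cov))
        loss = loss + beta * (off_diag ** 2).sum()
    # L_inv: pull matched views together (G_{i,j} = 1 for matched pairs).
    loss = loss + gamma * ((z_a - z_b) ** 2).sum(dim=1).mean()
    return loss
```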

2
SimCLR's loss [5] is slightly different: it first produces an estimated relation matrix $\hat{G}(Z)$ [1] that is compared to the ground-truth relation matrix $G$ via

$$(\hat{G}(Z))_{i,j} = \frac{e^{\mathrm{CoSim}(z_i, z_j)/\tau}}{\sum_{k=1, k \neq i}^{N} e^{\mathrm{CoSim}(z_i, z_k)/\tau}}, \qquad \mathcal{L} = -\sum_{i=1}^{N} \sum_{j=1}^{N} (G)_{i,j} \log (\hat{G}(Z))_{i,j}, \quad (2)$$

where $\mathrm{CoSim}$ denotes the cosine similarity, and $\tau > 0$ is a temperature parameter. The only difference between SimCLR and variants such as NNCLR [9] lies in how one defines $G$. Hence, although we will particularly focus on SimCLR, our findings should easily extend to such variants.
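The following PyTorch sketch implements Eq. (2) for a batch where rows 2i and 2i+1 of Z are the two views of sample i. It is a minimal transcription of the equation under that pairing assumption, not the reference SimCLR code.

```python
import torch
import torch.nn.functional as F

def simclr_loss(z: torch.Tensor, tau: float = 0.1):
    """Eq. (2): cross-entropy between estimated and true relation matrices.

    z: (N, K) embeddings where rows 2i and 2i+1 are positive views.
    """
    n = z.shape[0]
    z = F.normalize(z, dim=1)            # so z @ z.T gives CoSim
    sim = (z @ z.T) / tau
    sim.fill_diagonal_(float('-inf'))    # exclude j = i from the denominator
    log_ghat = F.log_softmax(sim, dim=1) # row-normalized log \hat{G}(Z)
    # Index of each row's positive partner: (0,1), (2,3), ...
    pos = torch.arange(n) ^ 1
    return -log_ghat[torch.arange(n), pos].mean()
```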
BarlowTwins's loss [21] proposes yet a slightly different approach, where $z_i$ must be close to $z_j$ if $G_{i,j} > 0$. It does so with different flavors of losses and constraints to facilitate training. Hence, and for these models only, it is common to explicitly group $X$ into two subsets $X_{\mathrm{left}}$ and $X_{\mathrm{right}}$ based on $G$, so that $((X_{\mathrm{left}})_n, (X_{\mathrm{right}})_n), \forall n$ are all the positive pairs from $(X, G)$. This does not lose any generality. In fact, suppose that we have 5 samples $a, b, c, d, e$, and that $G$ says that $a, b, c$ are related to each other, and that $d, e$ are related to each other. Then, we can create the two data matrices as

$$X_{\mathrm{left}} = [a, a, b, b, c, c, d, e], \quad X_{\mathrm{right}} = [b, c, a, c, a, b, e, d].$$

Once the two (left/right) views are obtained, the corresponding embeddings $Z_{\mathrm{left}}, Z_{\mathrm{right}}$ can be computed and the BarlowTwins loss is then defined as

$$\mathcal{L} = \sum_{k=1}^{K} \left( \mathrm{CoSim}((Z_{\mathrm{left}})_{.,k}, (Z_{\mathrm{right}})_{.,k}) - 1 \right)^2 + \alpha \sum_{k=1}^{K} \sum_{k'=1, k' \neq k}^{K} \mathrm{CoSim}((Z_{\mathrm{left}})_{.,k}, (Z_{\mathrm{right}})_{.,k'})^2, \quad (3)$$

where $\mathrm{CoSim}$ computes the cosine similarity between the two input vectors. One should notice that those terms correspond to the entries of the cross-correlation matrix between the two embeddings.
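A compact PyTorch transcription of Eq. (3) is given below; on batch-standardized embeddings, the column-wise cosine similarities amount to the cross-correlation matrix. This is a sketch of the equation, not the official Barlow Twins code.

```python
import torch

def barlow_twins_loss(z_left, z_right, alpha=0.005, eps=1e-6):
    """Eq. (3): push the cross-correlation matrix towards the identity."""
    n = z_left.shape[0]
    # Standardize each embedding dimension across the batch.
    z_l = (z_left - z_left.mean(0)) / (z_left.std(0) + eps)
    z_r = (z_right - z_right.mean(0)) / (z_right.std(0) + eps)
    c = (z_l.T @ z_r) / n  # (K, K) cross-correlation matrix
    on_diag = ((torch.diagonal(c) - 1.0) ** 2).sum()
    off_diag = (c ** 2).sum() - (torch.diagonal(c) ** 2).sum()
    return on_diag + alpha * off_diag
```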
Projector Networks are multilayer perceptrons (MLPs) added on top of the DN "backbone" model that one aims to train with JE-SSL. Post-training, the projector network is removed, giving back the original DN's architecture but with much improved performance compared to not employing a projector network. This technique, popularized by [5], introduces additional hyper-parameters to tune, e.g. the depth and width of that MLP. It also makes it less clear what in truth is learned by the DN backbone (as this is not the representation level on which the training objective is applied). [3, 4] demonstrated that one benefit of using a projector is that it serves as a buffer to absorb the bias of possibly misspecified data augmentations and/or suboptimal JE-SSL loss hyper-parameters. A minimal sketch of this backbone/projector split is shown below.
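For concreteness, here is a hedged PyTorch sketch of a backbone with a projector head; the ResNet-50 backbone and the widths (2048/256) are illustrative choices, not prescriptions from the paper.

```python
import torch.nn as nn
import torchvision

class SSLModel(nn.Module):
    """Backbone + projector: the SSL loss is applied on the projector output,
    while downstream evaluation uses the backbone representation."""
    def __init__(self, proj_dims=(2048, 2048, 256)):
        super().__init__()
        backbone = torchvision.models.resnet50()
        feat_dim = backbone.fc.in_features  # 2048 for ResNet-50
        backbone.fc = nn.Identity()         # keep features only
        self.backbone = backbone
        layers, in_dim = [], feat_dim
        for out_dim in proj_dims[:-1]:
            layers += [nn.Linear(in_dim, out_dim), nn.BatchNorm1d(out_dim),
                       nn.ReLU(inplace=True)]
            in_dim = out_dim
        layers.append(nn.Linear(in_dim, proj_dims[-1]))
        self.projector = nn.Sequential(*layers)  # removed post-training

    def forward(self, x):
        h = self.backbone(x)    # representation used for probing
        z = self.projector(h)   # embedding fed to the SSL loss
        return h, z
```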

3. Debunking Popular Myths about Joint-Embedding Failure Cases

Due to the large number of hyper-parameters that JE-SSL methods rely on, and the high computational cost required to train these models with existing software, most novel methods only explore a small part of the whole hyper-parameter space, for example only comparing with previously reported results using the same set-ups. For example, [5, 21] study the impact of the batch size but keep all other hyper-parameters fixed. Although such sensitivity experiments are useful in many ways, they also risk leading to misconceptions about the failure cases of JE-SSL. This is what we propose to investigate in this section. Surprisingly, we will be able to debunk empirically several observations that were put forward in multiple previous studies, ultimately showcasing that most JE-SSL methods are actually much more similar to one another than previously thought, and suffer much less dramatic failures than previously reported.

3.1. The Impact of Mini-Batch Size for SimCLR and BarlowTwins

Recall from Eq. (2) that SimCLR uses negative examples in the denominator term of its loss. It has been reasoned that the role of this term is to perform negative sampling [15], and thus that it will only be effective when the number of samples, i.e. the mini-batch size, is large. Equivalently, BarlowTwins (recall Eq. (3)) estimates correlations along the sample dimension and thus is also expected to benefit from a large mini-batch size for more accurate estimation. In this section, we propose to refute the claim that large mini-batch sizes are required to successfully train JE-SSL with these two methods.

The belief that many methods rely on large mini-batch sizes has been mentioned in many recent studies, e.g. [7]. This belief was supported by a figure in [5] that displays a 7% accuracy gap on ImageNet between a model trained with a small (256) versus a large (8192) batch size. However, an important point that readers often miss in [5] is the critical influence of the optimizer and the choice of the learning rate when training with different batch sizes. In their appendix, [5] show that when using a better optimization technique, the gap in performance between bigger and smaller batch sizes is significantly reduced.

The impact of the downstream task. One caveat of the batch size analysis in [5, 21] is that it concerns only the performances on ImageNet. Since one of the main motivations behind SSL is to learn a model whose representations can generalize to different tasks, we analyse the performances with different batch sizes across several downstream tasks: ImageNet-1K [19], CIFAR10 [16], CLEVR [14], Eurosat [12], Inaturalist [13] and Places [22]. In Figure 2, we plot the performances of SimCLR trained with several batch sizes.
We observe that there is a small gain when using a larger batch size with SimCLR on ImageNet (IN1K); however, the benefit of using a larger batch size is not consistent across all of the downstream tasks. The performances on the Eurosat, Places and CLEVR datasets are not better with a large batch size.

Figure 2. Accuracy across different downstream tasks given by probing SimCLR representations trained with different batch sizes (256, 512, 1024). CLEVR-C corresponds to the task of counting the number of objects in the images whereas CLEVR-D corresponds to the task of estimating the distance between objects. Even if the performances on ImageNet (IN1K) are better with a larger batch size, this is not necessarily the case for every downstream task.

This suggests that researchers should exert caution before broadly declaring that a given method requires a large batch size. While the claim may be empirically verified in one setting (ImageNet), it might not be true in other scenarios.

The impact of the hyper-parameters of the loss. Another important aspect that interacts with the batch size is the optimization of the hyper-parameters of the SSL loss. Because of the supposed large computational requirements of SSL, most researchers have performed their experiments by varying only a given factor at a time, without cross-validation. For example, [5, 21] study the influence of the batch size only for a given temperature for SimCLR and a given hyper-parameter lambda for Barlow Twins. However, the influence of the temperature in Eq. (2) also depends on the batch size, which has a direct impact on the scale of the loss. Fig. 3 shows the impact of the batch size with respect to the SimCLR temperature.

Figure 3. Validation accuracy on ImageNet with respect to the temperature parameter in SimCLR and the batch size (256 to 2048). At a temperature of 0.1, the gap between a large and a small batch size can be as high as 4%. However, when carefully running a grid search, we observe that the optimal temperature might not be the same depending on the batch size.

The impact of the learning rate. Another important parameter that interacts with the batch size is the learning rate. In Fig. 4, we show the validation accuracy on ImageNet of a SimCLR trained with different batch sizes and learning rates using the LARS optimizer [20]. We observe a significant impact on the gap in accuracy between larger and smaller batch sizes. When taking a single learning rate, for example 0.1, we observe a 4% accuracy gap between a batch size of 2048 and 256. However, if we carefully tune the learning rate, the gap becomes much smaller. We present a similar experiment with Barlow Twins in Fig. 5, where we also observe a high sensitivity to the learning rate depending on the batch size. For Barlow Twins, we even see the training become very unstable when using a high learning rate with a large batch size.

Figure 4. Validation accuracy on ImageNet with respect to the learning rate (with LARS [20] as optimizer) for SimCLR, for batch sizes 256 to 2048. At a learning rate of 0.3, the gap between a large and a small batch size can be as high as 4%. However, when carefully running a grid search, we can see that the optimal learning rate might not be the same depending on the batch size.
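The protocol behind Figs. 3 to 5 is a plain grid search that re-tunes the loss and optimizer hyper-parameters for every batch size. A hedged sketch of that loop is given below; `train_and_eval` is a hypothetical helper standing in for a full SSL training run followed by linear probing, and the grid values are illustrative, patterned after the ranges reported in the figures.

```python
import itertools

# Illustrative search grids (assumptions, not the paper's exact grids).
batch_sizes = [256, 512, 1024, 2048]
learning_rates = [0.1, 0.3, 0.5, 0.7, 0.9, 1.2, 1.5, 1.7, 2.0]
temperatures = [0.1, 0.15, 0.2, 0.25, 0.7]

best = {}
for bs in batch_sizes:
    # Re-tune lr and temperature for *each* batch size instead of fixing them.
    for lr, tau in itertools.product(learning_rates, temperatures):
        acc = train_and_eval(batch_size=bs, lr=lr, temperature=tau)  # hypothetical helper
        if acc > best.get(bs, (0.0, None))[0]:
            best[bs] = (acc, (lr, tau))
```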

Figure 5. Validation accuracy on ImageNet with respect to the learning rate (with LARS [20] as optimizer) for Barlow Twins. As with SimCLR, the optimal learning rate can be radically different depending on the batch size one decided to use to train the model, so it is really important to perform a grid search over the learning rate when changing the batch size. The dashed line corresponds to situations in which training Barlow Twins resulted in NaN.

The impact of Guillotine regularization. We further explore the batch size dependency with respect to the number of layers in the projector. Since [3] demonstrated that the main role of the projector in SSL methods is to absorb the bias of an ill-defined training pre-text task, we hypothesize that using a deeper projector might help in reducing the gap in performance between different mini-batch sizes. In Table 1, we show the results of training SimCLR with different numbers of layers in the projector. Adding layers in the projector indeed helps bridge the gap in performance between a large and a smaller mini-batch size.

BS/Layers    1     2     3     4     5
128         57.8  66.8  66.8  66.8  66.8
256         59.3  66.4  68.1  68.4  68.4
512         60.2  67.9  69.6  69.5  69.5
1024        61.3  69.3  70.3  70.3  70.5
2048        62.0  69.7  70.7  70.5  70.5

Table 1. Effect of the number of layers in the projector on the ImageNet validation accuracy for different batch sizes (BS) with SimCLR. For this setting, we use the same learning rate for all models. We use a non-linear multilayer perceptron (two layers) as representation probe and present the best validation accuracy for each model. In this table, we observe that one can gain several accuracy percentage points by adding layers to the projector, even for small batch sizes. The main takeaway is that when comparing SSL methods, one should use the same number of layers in the projectors.

3.2. The Impact of Data-Augmentation

Another point that is often raised about JE-SSL methods is their need for very strong data augmentations like ColorJitter [5, 21]. However, ColorJitter comes with an important computational cost (see Tab. 2). To determine how important these augmentations are with respect to classification performance, we follow a similar experimental protocol as in the previous section, except that we study the impact of specific data augmentations on different downstream tasks. In Fig. 6, we perform this experiment on SimCLR. For some of the classification downstream tasks, there is an important benefit in using all the data augmentations, but for others it is not as obvious. Surprisingly, we found that simple cropping with grayscaling on one branch gives very good performance, with the benefit of being computationally much cheaper than the ColorJitter operation (Tab. 2). A sketch of this cheap pipeline is given after the figure caption below.

Figure 6. Accuracy for various downstream tasks for a model trained with SimCLR with different data augmentations. Sol. corresponds to a Solarization transformation that is applied with a 20% probability, Gray corresponds to a Grayscale operation that is also applied with a 20% probability, B. corresponds to a Gaussian blur that is applied 100% of the time, and Jitter is the ColorJitter operation, often used in SSL with a probability of 80%. For most of the downstream tasks, there is an important gain in accuracy when using all the available augmentations. Surprisingly, the performance using only cropping and a simple grayscale transformation is very competitive.

This observation also holds for other JE-SSL methods. Figure 7 shows the ImageNet validation accuracy obtained by SimCLR and Barlow Twins using different data augmentations, similar to the ones presented in Fig. 6. Again, we observe that the grayscaling operation appears key for both SimCLR and Barlow Twins. Consequently, it seems unfair to discard JE-SSL methods due to their supposed need for strong and rich data augmentation, as merely cropping and grayscaling is able to yield good performance.

Figure 7. Detailed impact of the data augmentations used during SimCLR and Barlow Twins training on the ImageNet validation accuracy. As in Fig. 6, So corresponds to a Solarization transformation applied with a 20% probability, Gr corresponds to a Grayscale operation that is also applied with a 20% probability, B. corresponds to a Gaussian blur applied 100% of the time, and Jit is the ColorJitter operation with 80% probability. In this figure, we can clearly see that the addition of grayscaling has the most significant impact on the ImageNet accuracy.

3.3. The impact of the evaluation protocol

Another important dimension in Self-Supervised Learning is the protocol for evaluating the learned representation, which is typically limited to linear probing and/or fine-tuning, occasionally complemented by qualitative visualizations such as RCDM [4]. A more thorough evaluation of the quality of the learned representation and its suitability for downstream tasks should also consider non-linear probing. In Fig. 8, we compare the performances of linear probing in an online scenario (training the linear probe at the same time as the SSL model, with the gradient cut at the classifier input) and in an offline setting (training the linear probe only after SSL training), as well as offline non-linear probing (a two-layer MLP of size 2048-2048-1000). We first observe that the performances of online and offline linear probing are well correlated. When using a non-linear probe, we observe a significant boost in accuracy; however, the optimal validation score is not necessarily the one at the last epoch. This is not surprising, since it is much easier for a non-linear probe than for a linear probe to overfit its training set.

Figure 8. Depiction of the classifier probe trained to predict the Imagenet-1k labels from the output of the backbone during training (online) and post-training (offline), using linear or MLP classifiers. The red cross corresponds to the best accuracy. In the offline setting, no data augmentation is employed. We clearly observe that (i) when employing an MLP, only a few epochs are needed and regularization or early stopping should be employed; however, in the popular linear case, there are limited differences between the online and offline performances, and over-fitting never occurs in either training case.
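The online probe mentioned above amounts to training a classifier on detached backbone features during the SSL run, which makes evaluation essentially free. A minimal sketch, assuming an (h, z) output pair as in the earlier projector sketch:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

probe = nn.Linear(2048, 1000)  # backbone feature dim -> Imagenet-1k classes
probe_opt = torch.optim.SGD(probe.parameters(), lr=0.1, momentum=0.9)

def online_probe_step(h: torch.Tensor, labels: torch.Tensor):
    """One probe update; h.detach() cuts the gradient at the classifier
    input so the probe never influences the SSL training itself."""
    logits = probe(h.detach())
    loss = F.cross_entropy(logits, labels)
    probe_opt.zero_grad()
    loss.backward()
    probe_opt.step()
    return loss.item()
```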
3.4. Per-Instance Positive and Negative Sample Generation

Most JE-SSL methods, and in particular contrastive ones such as SimCLR, share a need for positive (semantically similar) and negative (dissimilar) inputs. The latter prevent the representation from collapsing, and their quality is thus a predominant concern in JE-SSL; this e.g. motivated the idea of hard-negative sampling. In this section, we propose to debunk the idea that it is necessary for the model to work hard (i.e. consider many other examples) to get good negatives. We showcase the ability to generate useful negative samples for each image, based on simple transformations of only that same image.

Specifically, we attempt to train SimCLR using only a single random patch as negative example, while using only Gaussian noise as data augmentation to define the positive pair. This set-up deliberately defies most current guidelines, which favor aggressive data augmentations and hard-negative sampling. However, despite the fact that we leverage the same instance to define the positive and negative pair, it should not be too surprising that different crops can produce relatively different images. In Fig. 9 we show the accuracy across diverse downstream tasks obtained with such an instance-based SimCLR model. Surprisingly, this somewhat extreme approach can nevertheless learn good representations useful for several downstream tasks. We get high accuracy on CIFAR10 or Eurosat, but only 40% on ImageNet. The performance drop on ImageNet is not surprising, given that we greatly reduced the amount of inductive bias in the augmentations. However, the high transfer performance on CIFAR10 is still surprising. We hope that our work will motivate further exploration of such extreme scenarios. The sketch below illustrates this per-instance view generation.
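To illustrate the set-up, here is a hedged sketch of the per-instance view generation: two noised copies of the image form the positive pair, and a small random patch of the same image serves as the single negative. The helper name, crop size and noise level are illustrative assumptions.

```python
import torch
import torchvision.transforms as T

resize = T.Resize((224, 224))

def per_instance_views(img: torch.Tensor, noise_std: float = 0.1,
                       patch_size: int = 32):
    """img: (3, H, W) tensor in [0, 1]. Returns (pos_a, pos_b, negative)."""
    x = resize(img)
    # Positive pair: two Gaussian-noised copies of the same image.
    pos_a = (x + noise_std * torch.randn_like(x)).clamp(0, 1)
    pos_b = (x + noise_std * torch.randn_like(x)).clamp(0, 1)
    # Single negative: a small random patch of the same image,
    # resized back to the input resolution.
    top = torch.randint(0, x.shape[1] - patch_size, (1,)).item()
    left = torch.randint(0, x.shape[2] - patch_size, (1,)).item()
    negative = resize(x[:, top:top + patch_size, left:left + patch_size])
    return pos_a, pos_b, negative
```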

Figure 9. Accuracy on various downstream tasks using an instance-based SimCLR (SimCLR denoising in the figure) and a SimCLR using the usual data augmentations. Instead of using all the examples in the mini-batch as negative examples, we use only a small crop of the positive pair as negative example. We also replaced all the SSL data augmentations by a simple Gaussian noise; thus, instead of learning to be invariant to cropping or color jitter, this instance SimCLR learns only to denoise a specific patch of the image. Doing so allows the network to avoid collapse while having competitive accuracy on downstream tasks like CIFAR10 or the Eurosat dataset. The drop in accuracy on ImageNet is expected, since we greatly reduce the amount of inductive bias by only using Gaussian noise as data augmentation, without cropping.

4. FFCV-SSL: A Fast Data Loading Library Tuned to Improve JE-SSL Training Time

Most implementations of Self-Supervised Learning use PyTorch [18] along with its vision library Torchvision. Since SSL methods rely on an important set of rich data augmentations, we hypothesized that the data loading process could become an important bottleneck when training SSL models. To verify our hypothesis, we replaced Torchvision by the data loading library FFCV [17]. More precisely, we created a fork of this library, called FFCV-SSL, to include most of the data augmentations that are currently used in Self-Supervised Learning. We also added the ability to manage multiple pipelines in parallel (which is needed in JE-SSL since we have at least two different views of a given image).
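For orientation, the sketch below shows what an FFCV-style loader for one view looks like, using the public FFCV API [17]; the multi-view pipeline support added in FFCV-SSL follows the same pattern, but its exact API may differ from this assumption.

```python
from ffcv.loader import Loader, OrderOption
from ffcv.fields.decoders import RandomResizedCropRGBImageDecoder, IntDecoder
from ffcv.transforms import ToTensor, ToTorchImage

# One decoding/augmentation pipeline per field of the .beton dataset file.
loader = Loader(
    "imagenet_train.beton",  # dataset previously converted with FFCV
    batch_size=256,
    num_workers=8,
    order=OrderOption.RANDOM,
    pipelines={
        "image": [
            RandomResizedCropRGBImageDecoder((224, 224)),
            ToTensor(),
            ToTorchImage(),
        ],
        "label": [IntDecoder(), ToTensor()],
    },
)

for images, labels in loader:
    ...  # training step
```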
We started by investigating the time consumed just to fetch the data. In Tab. 2, we show the time needed to pass through the entire ImageNet dataset (without any neural network) with the torchvision and FFCV-SSL data loaders for various data augmentations. We can see that adding augmentations significantly increases the time for a full epoch. The addition of the traditional SSL augmentations leads to almost a two-times slowdown, for data processing alone.

             Crops  +Blur  +Gray.  +Sol.  +Jitter
Torchvision   7:00   9:25   9:26    9:30   13:20
FFCV-SSL      1:30   1:36   1:58    2:07    7:00

Table 2. Time (minutes:seconds) for one complete epoch of data loading with torchvision and FFCV-SSL, for various data augmentations. We observe that the Blur and ColorJitter operations add considerable time to the training.

However, measuring only the data loading time can be misleading, since the forward and backward passes of the model in a training loop take additional time that can interact with the caching process. When measuring the training time for a single epoch with torchvision, using the full set of data augmentations, we found that it takes on average 1141s for one epoch, while FFCV takes around 428s. This shows that merely switching the data loading library can give us almost a three-times speed-up for a single epoch.

In Figure 1, we plot the accuracy on the ImageNet dataset with respect to the training time for models trained with FFCV-SSL and torchvision. Like the original FFCV, FFCV-SSL has the ability to easily switch the resolution of the data during training, which can also significantly improve the training time. Using FFCV-SSL, we can train a SimCLR in less than 8 hours using 8 A100 GPUs. When using a single GPU (one A100), the training takes 35 hours, which is close to the training time of SimCLR using torchvision when trained on 8 GPUs. Having a way to train JE-SSL models faster on 8 or even only 1 GPU will help democratize this research area. It will also enable more thorough hyper-parameter searches, which should avoid the pitfall of drawing flimsy scientific conclusions based on too few data points. In addition, the codebase we developed allows one to easily perform training and evaluation of different SSL models. To straightforwardly guarantee the use of the same experimental setup – which is important when comparing different methods – we wrote a simple file that supports many existing SSL losses. In contrast with complex libraries such as VISSL [10], our main code is self-contained in a single file. This makes it easier to hack and to tune every hyper-parameter that researchers might need.

One limitation of our work is that, as shown in Tab. 2, the data augmentations still take a lot of time, especially Grayscale and ColorJitter. One should probably be able to get additional speed-ups by using further optimized data transformations.

4.1. Enabling single GPU training with FFCV-SSL

As demonstrated in the previous section, the necessity of strong augmentations and a large batch size in SimCLR is not as obvious as presented in the current literature. Knowing this, we attempted to train several SimCLR models using a single GPU.

Figure 10. Imagenet validation accuracy during a wide cross-validation performed on a single GPU with SimCLR. For this experiment, we performed a grid search on the temperature with the values 0.10, 0.15, 0.25, 0.5 and on the learning rate with the values 0.3, 0.5, 0.7, 1.0, 1.2, 1.5, 2.0, 2.5, 3.0. We found that the best hyper-parameters for single-GPU training, using a batch size of 256, are a temperature of 0.15 and a learning rate of 1.0, which leads to 65.58% accuracy in online linear probing and 68.5% in non-linear probing.

Equipped with the novel insights from Sec. 3, we are now able to provide a minimal example of JE-SSL that enables single-GPU training on Imagenet-1k. In Fig. 10, we performed an extensive grid search on the temperature and learning rate parameters. We show that the best hyper-parameters for single-GPU training, using a batch size of 256, are a temperature of 0.15 and a learning rate of 1.0, which leads to 65.58% accuracy in online linear probing and 68.5% in non-linear probing. Even if the accuracies we achieve are much higher than what has been shown in the literature for small batch sizes, there still remains a small unexplained gap in performance on ImageNet.

5. Recalibrated Observations for Self-supervised learning research

Many misleading ideas about the importance of batch size and data augmentations were widely shared among JE-SSL studies without being challenged, due to computational costs. Since this cost can be significantly alleviated by the use of a better data loading library, we are now able to push back on some of those ideas and to provide a clearer, recalibrated view of our findings:

• When training JE-SSL models, the adverse impact of the batch size on downstream performance can be largely countered by adapting the learning rate accordingly.

• The need for specific strong data augmentations is not clear. It is possible to reach 63.5% top-1 on Imagenet-1k using only grayscale and cropping, which is also a computationally cheaper set of augmentations to employ.

• The Imagenet-1k top-1 metric does not give the full picture when measuring the quality of a learned representation; we believe that the considered set of OOD tasks in Fig. 2 is more representative.

• The projector architecture also plays a crucial role in final performance, and one should thus strive to compare JE-SSL methods with the same projector network instead of the official one that was found independently by each of the methods.

6. Conclusion

We provided a thorough series of experiments aiming at debunking some popular but misconceived ideas around joint-embedding SSL. Because most studies capitalize on prior existing ones to limit the computational burden of training JE-SSL models, many pre-conceived ideas have remained unchallenged for years, thus impairing the development of novel methods. For example, we deliver key findings regarding the requirement of rich data augmentations and the impact of mini-batch size. In addition to these findings, we provide key strategies to speed up training and evaluation, e.g. using an online classifier probe, or replacing the costly color jittering augmentation with the much cheaper grayscale one. We developed a dedicated PyTorch library, FFCV-SSL, based on FFCV [17], that further enables rapid training of JE-SSL models, e.g. producing faster training on a single GPU than an 8-GPU set-up with the usual torchvision pipeline. We hope that the collection of rectified findings and software that this study produced will enable a much broader deployment and seamless research development of JE-SSL.
References

[1] Randall Balestriero and Yann LeCun. Contrastive and non-contrastive self-supervised learning recover global and local spectral embedding methods. arXiv preprint arXiv:2205.11508, 2022.
[2] Adrien Bardes, Jean Ponce, and Yann LeCun. VICReg: Variance-invariance-covariance regularization for self-supervised learning. In ICLR, 2022.
[3] Florian Bordes, Randall Balestriero, Quentin Garrido, Adrien Bardes, and Pascal Vincent. Guillotine regularization: Improving deep networks generalization by removing their head, 2022.
[4] Florian Bordes, Randall Balestriero, and Pascal Vincent. High fidelity visualization of what your self-supervised representation knows about. Transactions on Machine Learning Research, 2022.
[5] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey E. Hinton. A simple framework for contrastive learning of visual representations. In ICML, 2020.
[6] Ting Chen, Simon Kornblith, Kevin Swersky, Mohammad Norouzi, and Geoffrey Hinton. Big self-supervised models are strong semi-supervised learners. In NeurIPS, 2020.
[7] Xinlei Chen and Kaiming He. Exploring simple siamese representation learning. In CVPR, 2020.
[8] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
[9] Debidatta Dwibedi, Yusuf Aytar, Jonathan Tompson, Pierre Sermanet, and Andrew Zisserman. With a little help from my friends: Nearest-neighbor contrastive learning of visual representations. In ICCV, pages 9568–9577, 2021.
[10] Priya Goyal, Quentin Duval, Jeremy Reizenstein, Matthew Leavitt, Min Xu, Benjamin Lefaudeux, Mannat Singh, Vinicius Reis, Mathilde Caron, Piotr Bojanowski, Armand Joulin, and Ishan Misra. VISSL. https://github.com/facebookresearch/vissl, 2021.
[11] Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre H. Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Daniel Guo, Mohammad Gheshlaghi Azar, Bilal Piot, Koray Kavukcuoglu, Rémi Munos, and Michal Valko. Bootstrap your own latent: A new approach to self-supervised learning. In NeurIPS, 2020.
[12] Patrick Helber, Benjamin Bischke, Andreas Dengel, and Damian Borth. EuroSAT: A novel dataset and deep learning benchmark for land use and land cover classification. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 12(7):2217–2226, 2019.
[13] Grant Van Horn, Oisin Mac Aodha, Yang Song, Yin Cui, Chen Sun, Alexander Shepard, Hartwig Adam, Pietro Perona, and Serge J. Belongie. The iNaturalist species classification and detection dataset. In CVPR, pages 8769–8778, 2018.
[14] Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C. Lawrence Zitnick, and Ross Girshick. CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. In CVPR, 2017.
[15] Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and Dilip Krishnan. Supervised contrastive learning. Advances in Neural Information Processing Systems, 33:18661–18673, 2020.
[16] Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009.
[17] Guillaume Leclerc, Andrew Ilyas, Logan Engstrom, Sung Min Park, Hadi Salman, and Aleksander Madry. FFCV. https://github.com/libffcv/ffcv/, 2022. commit xxxxxxx.
[18] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pages 8024–8035, 2019.
[19] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. IJCV, 115(3):211–252, 2015.
[20] Yang You, Igor Gitman, and Boris Ginsburg. Large batch training of convolutional networks, 2017.
[21] Jure Zbontar, Li Jing, Ishan Misra, Yann LeCun, and Stéphane Deny. Barlow Twins: Self-supervised learning via redundancy reduction. arXiv preprint arXiv:2103.03230, 2021.
[22] Bolei Zhou, Agata Lapedriza, Jianxiong Xiao, Antonio Torralba, and Aude Oliva. Learning deep features for scene recognition using places database. In Advances in Neural Information Processing Systems, volume 27, 2014.
Figure 11. Imagenet validation accuracy with respect to the learning rate on a single GPU (batch size 256) with VICReg.

Figure 12. Imagenet validation accuracy with respect to the learning rate on a single GPU (batch size 256) with Barlow Twins.

A. FFCV-SSL: A library for fast SSL training

In this paper, we introduced FFCV-SSL, a fork of the FFCV library [17] that we improved to make the training of SSL models much faster. FFCV significantly increases the speed of data loading by converting any image dataset into a single file with fixed or variable resolution for each image. In addition, all data augmentations are compiled in advance with Numba. In our implementation, we added the following data augmentations: ColorJitter, Grayscale and Solarization, as well as support for multiple branches of augmentation given a specific input. The library is available at https://github.com/facebookresearch/FFCV-SSL. We added a code file in the supplementary material that shows how to use FFCV-SSL to train several types of SSL methods. Using this single file, one can train SimCLR, VICReg, BarlowTwins and BYOL using a single GPU or many GPUs. The script supports the use of SLURM through submitit, as sketched below.
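As an illustration of the SLURM integration mentioned above, a training entry point can be submitted with submitit as sketched here; the `train` function and the SLURM parameters are illustrative assumptions, not the paper's actual launcher.

```python
import submitit

def train(method: str = "simclr", batch_size: int = 256, lr: float = 1.0):
    ...  # hypothetical single-file training entry point

executor = submitit.AutoExecutor(folder="logs")
executor.update_parameters(
    timeout_min=60 * 24,
    slurm_partition="learnfair",  # illustrative partition name
    gpus_per_node=1,
    cpus_per_task=10,
)
job = executor.submit(train, method="simclr", batch_size=256, lr=1.0)
print(job.job_id)
```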
B. On the importance of increasing the learning rate when using small batch size

In this section, we present more experiments concerning the impact of the learning rate when using small batch sizes. In Fig. 11, we study the impact of the learning rate on VICReg when using a batch size of 256 on a single GPU. The optimizer used is LARS, and we observe that one can easily gain some percentage points in accuracy by just tuning the learning rate. We observe the same behavior with BYOL, using the same optimizer, in Fig. 13, and with Barlow Twins, using AdamW as optimizer, in Fig. 12.

Figure 13. Imagenet validation accuracy with respect to the learning rate on a single GPU (batch size 256) with BYOL.

C. Additional Single GPU experiments

In this section, we present several grid searches over hyper-parameters for different SSL methods when using a single GPU. In Fig. 14, we found that the optimal hyper-parameters for VICReg on a single GPU, based on this grid search, are a similarity and std coefficient of 25 and a learning rate of 1.0 when using LARS as optimizer. With Barlow Twins, in Fig. 15, we found that the optimal hyper-parameters on a single GPU are a lambd value of 0.0025 and a learning rate of 0.005 using AdamW. For BYOL, in Fig. 16, we found the optimal hyper-parameters to be a momentum encoder value of 0.996 and a learning rate of 1.0.
Figure 14. Imagenet validation accuracy during a wide cross-validation performed on a single GPU with VICReg. For this experiment, we performed a grid search on the similarity and std coefficients with the values 1, 5, 10, 15, 25 (we fixed the covariance coefficient to 1) and on the learning rate (LARS with a weight decay of 1e-4) with the values 0.6, 0.7, 1.0, 1.2, 1.5, 1.7. We found that the best hyper-parameters for single-GPU training, using a batch size of 256, are a similarity and std coefficient of 25 and a learning rate of 1.0, which leads to 67.4% accuracy in online linear probing.

Figure 15. Imagenet validation accuracy during a wide cross-validation performed on a single GPU with Barlow Twins. For this experiment, we performed a grid search on the Barlow Twins lambd hyper-parameter with the values 0.0025, 0.0045, 0.0051, 0.0075, 0.01 and on the learning rate (AdamW with a weight decay of 4e-5) with the values 0.0001, 0.00025, 0.0005, 0.001, 0.0025, 0.005. We found that the best hyper-parameters for single-GPU training, using a batch size of 256, are a lambd value of 0.0025 and a learning rate of 0.005, which leads to 66.8% accuracy in online linear probing.

Figure 16. Imagenet validation accuracy during a wide cross-validation performed on a single GPU with BYOL. For this experiment, we performed a grid search on the momentum encoder hyper-parameter with the values 0.8, 0.9, 0.996 and on the learning rate (LARS with a weight decay of 1e-4) with the values 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0. We found that the best hyper-parameters for single-GPU training, using a batch size of 256, are a momentum encoder value of 0.996 and a learning rate of 1.0, which leads to 62.2% accuracy in online linear probing.
