Towards Democratizing Joint-Embedding Self-Supervised Learning
Abstract

Joint Embedding Self-Supervised Learning (JE-SSL) has seen rapid developments in recent years, due to its promise to effectively leverage large unlabeled data. The development of JE-SSL methods was driven primarily by the search for ever increasing downstream classification accuracies, using huge computational resources, and typically built upon insights and intuitions inherited from a close parent JE-SSL method. This has unwittingly led to numerous pre-conceived ideas that carried over across methods, e.g. that SimCLR requires very large mini-batches to yield competitive accuracies, or that strong and computationally slow data augmentations are required. In this work, we debunk several such ill-formed a priori ideas in the hope of unleashing the full potential of JE-SSL free of unnecessary limitations. In fact, when carefully evaluating performances across different downstream tasks and properly optimizing the hyper-parameters of the methods, we most often –if not always– see that these widespread misconceptions do not hold. For example, we show that it is possible to train SimCLR to learn useful representations while using a single image patch as negative example, and simple Gaussian noise as the only data augmentation for the positive pair. Along these lines, in the hope of democratizing JE-SSL and allowing researchers to easily make more extensive evaluations of their methods, we introduce an optimized PyTorch library for SSL: https://github.com/facebookresearch/FFCV-SSL.

Figure 1. ImageNet validation accuracy (y-axis) of SimCLR during training, with respect to the training time in hours (x-axis). FFCV-SSL is our proposed library, specifically optimized for Self-Supervised Learning, which extends the original FFCV library [17]. We compare FFCV-SSL against torchvision using various image resolutions: "224" means that a fixed resolution of 224x224 is used when cropping the images, while "160 -> 224" means that the resolution increases during training from 160x160 to 224x224. The curves shown are FFCV-SSL (8 GPUs, Res 160 -> 224), FFCV-SSL (8 GPUs, 224), FFCV-SSL (1 GPU, Res 160 -> 224), and Torchvision (8 GPUs, 224); no other changes were applied to the implementation and the same hardware (A100 GPUs) is employed. Enabled by FFCV-SSL, we are able to perform thorough empirical investigations of preconceived failure modes of SSL models, from which we obtain novel conclusions, e.g. 1) SimCLR can perform equally well with small or large mini-batch training, 2) strong data augmentations are not always necessary, and cropping+grayscale is enough to reach competitive performances across SSL methods, and 3) it is now possible to train an SSL method, e.g. SimCLR in this figure, using only 1 GPU in a reasonable amount of time.
1. Introduction

Interest in Self-Supervised Learning (SSL) has increased steadily since the work of [5]. By using very specific sets of data augmentations to design positive pairs of examples, as well as using large mini-batches of images to define the negative examples, [5] demonstrated the competitiveness of SSL with respect to supervised baselines. Since then, other works have tried to build upon the contrastive method of [5] by either increasing the scale [6], improving the negative example sampling scheme using a buffer [6], or by using new data augmentations [9]. Despite their successes, contrastive methods are subject to several frequent criticisms. The most common one is the assumed need for a large number of negative examples, i.e. large batches, which limits the accessibility of contrastive method training to researchers who have access to massive and costly hardware resources. Another common criticism is the requirement for a very specific set of hand-crafted data augmentations to make such methods work. Moreover, in many instances, the use of these data augmentations can considerably increase the training time.

The computational burden of SSL directly impairs its widespread adoption since access to multi-GPU and large-scale training is not guaranteed, and because most available resources are turned towards producing yet better performing SSL variants. Two direct consequences are that (i) there does not exist a practical guideline in the literature prescribing how to perform more computationally friendly SSL, even if it is at the cost of slightly reduced top-1 performances,
and (ii) that successively refined SSL methods rarely spend resources to question or contest the empirical guidelines that were developed in previous studies. Those two points also interplay with each other since (ii), for example, commonly prescribes the use of the color-jitter data augmentation with the goal of producing state-of-the-art performances, but this augmentation is also among the most computationally expensive to apply in practice.

In this paper, we first show that several such widely held ideas concerning Joint Embedding Self-Supervised Learning (JE-SSL) methods are misleading, and are an obstacle to the democratization of JE-SSL methods. Here are further illustrations of popular misconceptions that will be debunked in this paper:

• "We find that, when the number of training epochs is small (e.g. 100 epochs), larger batch sizes have a significant advantage over the smaller ones." [5] "SimCLR and SwAV both require a large batch (e.g., 4096) to work well." [7] "Contrastive methods suffer from the need of a lot of negative examples which can translate into the need for very large batch sizes" [2]

• "We find that [BYOL] is not robust to removing some types of data augmentations, like SimCLR" [11]

By reproducing many experiments of the original work of [6], we will be able to debunk most of the aforementioned a priori ideas. In fact, it appears that some of the most influential papers in the field may have paid insufficient attention to a fundamental aspect of empirical machine learning research: hyper-parameter tuning and diversity in the evaluation protocol. Our main takeaways will be that JE-SSL methods need special care, as their performance varies greatly with respect to the employed loss' hyper-parameters, and that solely looking at ImageNet-1k [8] downstream performance is often misleading when it comes to measuring the quality of a learned representation. Being free of such misconceptions, we then explore a more extreme scenario: we train SimCLR with a single negative example which is taken from a small patch of the positive pair. Doing this with only Gaussian noise as data augmentation for the positive pair leads, on several downstream tasks, to results that are very close to the SimCLR baseline.

All of our empirical analysis is enabled by FFCV-SSL, which we developed specifically to reduce the data loading overhead when training JE-SSL methods, and which is based on the fast data loading library FFCV [17]. By using FFCV-SSL (https://github.com/facebookresearch/FFCV-SSL), we were able to run SSL experiments 3 times faster than before.

2. Joint-Embedding Self-Supervised Methods and Notations

JE-SSL relies on processing multiple –semantically related– views of a same input through a nonlinear mapping, commonly a deep network (DN), and enforcing that the produced representations of those views are close to each other. This matching is enforced through the positive term of the loss function, all while preventing the DN's mapping from collapsing, e.g. to a constant function, through a collapse prevention term. Different flavors of positive and collapse prevention terms lead to different JE-SSL methods.

Dataset, Data-Augmentation and Relation Matrix Notations. Regardless of the loss and method employed, SSL relies on having access to a set of observations, i.e. input samples $X \triangleq [x_1, \dots, x_N]^T \in \mathbb{R}^{N \times D}$, and a known positive relationship between those samples, e.g. in the form of a symmetric matrix $G \in (\mathbb{R}^+)^{N \times N}$ where $(G)_{i,j} > 0$ iff samples $x_i$ and $x_j$ are semantically related, with 0 on the diagonal. Commonly, one is only given a dataset $X'$ of size $N'$, and artificially constructs the JE-SSL dataset $X$ and $G$ from augmentations of $X'$, e.g. rotated (for images) or noised versions of the original samples, obtained through transformations $t$ drawn from some distribution $\mathcal{T}$. Of course, such transformations are designed to preserve the semantic information of the original inputs, as those are employed to determine the positive views of the inputs as in

$$\begin{bmatrix} x_1^T \\ \vdots \\ x_{N'}^T \end{bmatrix} \xrightarrow{\ t_{i,j} \sim \mathcal{T},\ \forall i,j\ } \begin{bmatrix} t_{1,1}(x_1)^T \\ t_{1,2}(x_1)^T \\ \vdots \\ t_{N',1}(x_{N'})^T \\ t_{N',2}(x_{N'})^T \end{bmatrix},$$

with $(G)_{i,j} = 1_{\{j-1=i\}} + 1_{\{j+1=i\}}$, where in this case each original input is used to generate two positive views. Lastly, $Z \in \mathbb{R}^{N \times K}$ denotes the matrix of feature maps obtained from a model $f_\theta : \mathbb{R}^D \mapsto \mathbb{R}^K$ –commonly a Deep Network– as $Z \triangleq [f_\theta(x_1), \dots, f_\theta(x_N)]^T$.
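To make these notations concrete, below is a minimal PyTorch sketch of how the augmented dataset and its relation matrix can be built from a batch of original samples. The helper name and the transform `t` (a stand-in for a draw from $\mathcal{T}$, e.g. a torchvision pipeline) are ours for illustration, not part of any released code.

```python
import torch

def build_views_and_relations(x, t, n_views=2):
    """Build the JE-SSL dataset X and relation matrix G from a batch x.

    x: tensor of N' original samples.
    t: callable applying a freshly drawn random transformation t ~ T
       (hypothetical placeholder, e.g. a torchvision transform pipeline).
    """
    n_prime = x.shape[0]
    # Each original x_i yields n_views augmented views t_{i,1}(x_i), t_{i,2}(x_i), ...
    views = torch.stack([t(xi) for xi in x for _ in range(n_views)])
    # (G)_{i,j} > 0 iff rows i and j are views of the same original sample,
    # with a zero diagonal, matching the definition above.
    sample_id = torch.arange(n_prime).repeat_interleave(n_views)
    G = (sample_id[:, None] == sample_id[None, :]).float()
    G.fill_diagonal_(0.0)
    return views, G
```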
VICReg's loss [2] is defined as a function of $X$ and $G$ via the following triplet loss:

$$\mathcal{L} = \alpha \sum_{k=1}^{K} \mathrm{relu}\left(1 - \sqrt{\mathrm{Cov}(Z)_{k,k}}\right) + \beta \sum_{k=1}^{K} \sum_{j \neq k} \mathrm{Cov}(Z)_{k,j}^{2} + \frac{\gamma}{N} \sum_{i=1}^{N} \sum_{j=1}^{N} (G)_{i,j} \, \| Z_{i,.} - Z_{j,.} \|_2^2 . \quad (1)$$

We will refer to the three terms in Eq. (1) as $\mathcal{L}_{\mathrm{var}}$, $\mathcal{L}_{\mathrm{cov}}$, and $\mathcal{L}_{\mathrm{inv}}$ respectively.
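As an illustration, the following sketch implements Eq. (1) directly under the notation above. It is a didactic transcription on a mini-batch: the coefficient defaults are placeholders (Appendix B grid-searches them), and the small epsilon inside the square root is a common numerical-stability convention rather than part of Eq. (1).

```python
import torch
import torch.nn.functional as F

def vicreg_loss(Z, G, alpha=25.0, beta=1.0, gamma=25.0):
    """Sketch of Eq. (1) on a mini-batch; Z is (N, K), G is (N, N)."""
    N, K = Z.shape
    Zc = Z - Z.mean(dim=0)                        # center each feature dimension
    cov = (Zc.T @ Zc) / (N - 1)                   # (K, K) covariance matrix Cov(Z)
    # L_var: hinge on the per-dimension standard deviation.
    l_var = F.relu(1.0 - torch.sqrt(cov.diagonal() + 1e-4)).sum()
    # L_cov: squared off-diagonal covariances.
    l_cov = (cov - torch.diag(cov.diagonal())).pow(2).sum()
    # L_inv: pull together the pairs marked as positives in G.
    l_inv = (G * torch.cdist(Z, Z).pow(2)).sum() / N
    return alpha * l_var + beta * l_cov + gamma * l_inv
```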
SimCLR's loss [5] is slightly different and first produces an estimated relation matrix $\widehat{G}(Z)$ [1] that is compared to the ground-truth relation matrix $G$ via

$$(\widehat{G}(Z))_{i,j} = \frac{e^{\mathrm{CoSim}(z_i, z_j)/\tau}}{\sum_{k=1, k \neq i}^{N} e^{\mathrm{CoSim}(z_i, z_k)/\tau}}, \qquad \mathcal{L} = -\sum_{i=1}^{N} \sum_{j=1}^{N} (G)_{i,j} \log (\widehat{G}(Z))_{i,j} , \quad (2)$$

where CoSim denotes the cosine similarity and $\tau > 0$ is a temperature parameter. The only difference between SimCLR and variants such as NNCLR [9] lies in how one defines $G$. Hence, although we will particularly focus on SimCLR, our findings should easily extend to such variants.
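A direct PyTorch transcription of Eq. (2) reads as follows; this is a didactic sketch under our notation (rows of `Z` are the embeddings $z_i$), not the reference implementation.

```python
import torch
import torch.nn.functional as F

def simclr_loss(Z, G, tau=0.1):
    """Sketch of Eq. (2); Z is (N, K), G is the ground-truth relation matrix."""
    N = Z.shape[0]
    # Pairwise cosine similarities scaled by the temperature tau.
    sim = F.cosine_similarity(Z[:, None], Z[None, :], dim=-1) / tau
    eye = torch.eye(N, dtype=torch.bool, device=Z.device)
    # Exclude the j = i term from the softmax denominator, as in Eq. (2).
    log_G_hat = F.log_softmax(sim.masked_fill(eye, float("-inf")), dim=1)
    # G has a zero diagonal, so only off-diagonal entries contribute.
    return -(G * log_G_hat.masked_fill(eye, 0.0)).sum()
```

With the pair construction shown in Section 2 (two views per input), `G` relates consecutive rows and this reduces to the usual NT-Xent objective up to a normalization constant.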
BarlowTwins's loss [21] proposes yet a slightly different approach where $z_i$ must be close to $z_j$ if $G_{i,j} > 0$. They do so with different flavors of losses and constraints to facilitate training. Hence, and for these models only, it is common to explicitly group $X$ into two subsets $X_{\mathrm{left}}$ and $X_{\mathrm{right}}$ based on $G$ so that $((X_{\mathrm{left}})_n, (X_{\mathrm{right}})_n), \forall n$ are all the positive pairs from $(X, G)$. This does not lose any generality. In fact, suppose that we have 5 samples $a, b, c, d, e$, and that $G$ says that $a, b, c$ are related to each other, and that $d, e$ are related to each other. Then, we can create the two data matrices as

$$X_{\mathrm{left}} = [a, a, b, b, c, c, d, e], \quad X_{\mathrm{right}} = [b, c, a, c, a, b, e, d].$$

Once the two (left/right) views are obtained, the corresponding embeddings $Z_{\mathrm{left}}, Z_{\mathrm{right}}$ can be computed and the BarlowTwins loss is then defined as

$$\mathcal{L} = \sum_{k=1}^{K} \left( \mathrm{CoSim}((Z_{\mathrm{left}})_{.,k}, (Z_{\mathrm{right}})_{.,k}) - 1 \right)^2 + \alpha \sum_{k=1}^{K} \sum_{k'=1, k' \neq k}^{K} \mathrm{CoSim}((Z_{\mathrm{left}})_{.,k}, (Z_{\mathrm{right}})_{.,k'})^2 , \quad (3)$$

where CoSim computes the cosine similarity between the two input vectors. One should notice that those terms correspond to the cross-correlation matrix between the two embeddings.
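Under the same notation, Eq. (3) can be sketched as below. Unit-normalizing every feature column before the matrix product yields exactly the column-wise cosine similarities, i.e. the cross-correlation matrix mentioned above; the weight `alpha` is a placeholder hyper-parameter.

```python
import torch
import torch.nn.functional as F

def barlow_twins_loss(Z_left, Z_right, alpha=5e-3):
    """Sketch of Eq. (3); Z_left and Z_right are the (N, K) embeddings of
    the two views. alpha is illustrative, not a prescribed value."""
    # (K, K) matrix of cosine similarities between feature columns,
    # i.e. the cross-correlation matrix between the two embeddings.
    C = F.normalize(Z_left, dim=0).T @ F.normalize(Z_right, dim=0)
    on_diag = (C.diagonal() - 1.0).pow(2)             # push matching dimensions to 1
    off_diag = (C - torch.diag(C.diagonal())).pow(2)  # decorrelate distinct dimensions
    return on_diag.sum() + alpha * off_diag.sum()
```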
Projector Networks are multilayer perceptrons (MLPs) that are added on top of the DN "backbone" model that one aims to train with JE-SSL. Post-training, the projector network is removed, giving back the original DN's architecture but with much improved performance compared to not employing a projector network. This technique, popularized by [5], introduces additional hyper-parameters to tune, e.g. the depth and width of that MLP. It also makes it less clear what in truth is learned by the DN backbone (as this is not the representation level on which the training objective is applied). [3, 4] demonstrated that one benefit of using a projector is that it serves as a buffer to absorb the bias of possibly misspecified data-augmentations and/or suboptimal JE-SSL loss hyper-parameters.
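A typical projector is just a small MLP appended to the backbone; the sketch below is a hypothetical configuration (its width, depth, and use of batch normalization are tunable choices, not values prescribed here).

```python
import torch.nn as nn

def make_projector(in_dim=2048, hidden_dim=8192, out_dim=8192, depth=3):
    """Hypothetical projector head, discarded after JE-SSL training."""
    layers, d = [], in_dim
    for _ in range(depth - 1):
        layers += [nn.Linear(d, hidden_dim),
                   nn.BatchNorm1d(hidden_dim),
                   nn.ReLU(inplace=True)]
        d = hidden_dim
    layers.append(nn.Linear(d, out_dim))  # final linear layer, no activation
    return nn.Sequential(*layers)
```

During training, the loss is applied to `projector(backbone(x))`; at evaluation time the projector is dropped and probes are trained directly on the backbone features.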
3. Debunking Popular Myths about Joint-Embedding Failure Cases

Due to the large number of hyper-parameters that JE-SSL methods rely on, and the high computational cost required to train these models with existing software, most novel methods only explore a small part of the whole hyper-parameter space, for example only comparing with previously reported results using the same set-ups. For instance, [5, 21] study the impact of the batch size but keep all other hyper-parameters fixed. Although such sensitivity experiments are useful in many ways, they also risk leading to misconceptions about the failure cases of JE-SSL. This is what we propose to investigate in this section. Surprisingly, we will be able to debunk empirically several observations that were put forward in multiple previous studies, ultimately showcasing that most JE-SSL methods are actually much more similar to one another than previously thought, and suffer much less dramatic failures than previously reported.

3.1. The Impact of Mini-Batch Size for SimCLR and BarlowTwins

Recall from Eq. (2) that SimCLR uses negative examples in the denominator term of its loss. It has been reasoned that the role of this term is to perform negative sampling [15], and thus that it will only be effective when the number of samples, i.e. the mini-batch size, is large. Equivalently, BarlowTwins (recall Eq. (3)) estimates correlations along the sample dimension, and thus is also expected to benefit from a large mini-batch size for more accurate estimation. In this section, we propose to refute the claim that large mini-batch sizes are required to successfully train JE-SSL with these two methods.

The belief that many methods rely on large mini-batch sizes has been mentioned in many recent studies, e.g. [7]. This belief was confirmed by one of the figures in [5], which displays a 7% accuracy gap on ImageNet between a model trained with a small (256) versus a large (8192) batch size. However, an important point that people often miss when reading [5] is the critical influence of the optimizer and the choice of the learning rate when training with different batch sizes. In their appendix, [5] show that when using a better optimization technique, the gap in performance between bigger and smaller batch sizes is significantly reduced.
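A common starting heuristic when changing the batch size is to rescale the learning rate linearly before grid-searching around it; we show it below only as a rule of thumb, since the point of this section is precisely that the value must still be re-tuned per batch size.

```python
def linearly_scaled_lr(base_lr: float, batch_size: int, base_batch_size: int = 256) -> float:
    """Linear scaling rule: keep lr / batch_size constant. A starting point
    for the grid search, not a substitute for it."""
    return base_lr * batch_size / base_batch_size

# Example: a learning rate of 0.3 tuned at batch size 256, rescaled for 1024.
print(linearly_scaled_lr(0.3, 1024))  # 1.2
```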
The impact of the downstream task. One caveat of the batch size analysis in [5, 21] is that it concerns only the performances on ImageNet. Since one of the main motivations behind SSL is to learn a model whose representations can generalize to different tasks, we analyse the performances with different batch sizes across several downstream tasks: ImageNet-1K [19], CIFAR10 [16], CLEVR [14], Eurosat [12], Inaturalist [13] and Places [22]. In Figure 2, we plot the performances of SimCLR trained with several batch sizes.

Figure 2. Accuracy across different downstream tasks given by probing SimCLR representations trained with different batch sizes. CLEVR-C corresponds to the task of counting the number of objects in the images, whereas CLEVR-D corresponds to the task of estimating the distance between objects. Even if the performances on ImageNet (IN1K) are better with a larger batch size, this is not necessarily the case for every downstream task.

Figure 3. Validation accuracy on ImageNet with respect to the temperature parameter in SimCLR and the batch size. With a temperature of 0.1, the gap between large and small batch sizes can be as high as 4%. However, when carefully running a grid search, we observe that the optimal temperature might not be the same depending on the batch size.
Figure 4. Validation accuracy on ImageNet with respect to the learning rate for SimCLR trained with different batch sizes (256, 512, 1024, 2048).

Figure 5. Validation accuracy on ImageNet with respect to the learning rate (with LARS [20] as optimizer) for Barlow Twins. Like SimCLR, the optimal learning rate can be radically different depending on the batch size one decides to use to train the model, so it is really important to perform a grid search over the learning rate when changing the batch size. The dashed line corresponds to situations in which training Barlow Twins resulted in NaN values.

Figure 6. Accuracy for various downstream tasks for a model trained with SimCLR with different data augmentations. So corresponds to a Solarization transformation which is applied with a 20% probability, Gray corresponds to a Grayscale operation that is also applied with a 20% probability, B. corresponds to a Gaussian blur that is applied 100% of the time, and Jitter is the ColorJitter operation, often used in SSL with a probability of 80%. For most of the downstream tasks, there is an important gain in accuracy when using all the sets of augmentations available. Surprisingly, the performance when using only cropping and a simple grayscale transformation is very competitive.

Table 1. ImageNet validation accuracy as a function of the training batch size (BS, rows) and the number of layers in the projector (columns).

BS \ Layers     1     2     3     4     5
128          57.8  66.8  66.8  66.8  66.8
256          59.3  66.4  68.1  68.4  68.4
512          60.2  67.9  69.6  69.5  69.5
1024         61.3  69.3  70.3  70.3  70.5
2048         62.0  69.7  70.7  70.5  70.5
Figure 7. Detailed impact of the data augmentations used during SimCLR and Barlow Twins training on the ImageNet validation accuracy. As in Fig. 6, So corresponds to a Solarization transformation applied with a 20% probability, Gray corresponds to a Grayscale operation that is also applied with a 20% probability, B. corresponds to a Gaussian blur applied 100% of the time, and Jit is the ColorJitter operation with 80% probability. The configurations shown are Crops (Cr), Cr+So, Cr+Gr, Cr+So+Gr, Cr+So+Gr+B., and Cr+So+Gr+B.+Jit. In this figure, we can clearly see that the addition of grayscaling has the most significant impact on the ImageNet accuracy.

Figure 8. Depiction of the classifier probe trained to predict the ImageNet-1k labels from the output of the backbone during training (online) and post-training (offline), using a linear or an MLP classifier. The red cross corresponds to the best accuracy. In the offline setting, no data augmentation is employed. We clearly observe that (i) when employing an MLP, only a few epochs are needed and regularization or early stopping should be employed; however, in the popular linear case, we clearly see that there are limited differences between the online and offline performances, and that over-fitting never occurs in either of the training cases.
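For reference, a minimal sketch of the online linear probe from Fig. 8 is given below. The feature dimension and optimizer settings are illustrative assumptions (e.g. a ResNet-50 backbone with 2048-dimensional features); the key point is that the probe's input is detached, so its gradients never reach the backbone.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Online linear probe: trained on frozen backbone features during SSL training.
probe = nn.Linear(2048, 1000)  # assumed ResNet-50 features -> ImageNet classes
probe_opt = torch.optim.SGD(probe.parameters(), lr=1e-2, momentum=0.9)

def probe_step(features, labels):
    logits = probe(features.detach())  # detach: no gradient into the backbone
    loss = F.cross_entropy(logits, labels)
    probe_opt.zero_grad()
    loss.backward()
    probe_opt.step()
    return loss.item()
```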
             Crops  +Blur  +Gray.  +Sol.  +Jitter
Torchvision   7:00   9:25   9:26    9:30   13:20
FFCV-SSL      1:30   1:36   1:58    2:07    7:00

Table 2. Time (minutes:seconds) for one complete epoch over the data loader for various data augmentations. We observe that the Blur and ColorJitter operations add considerable time to the training.
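The numbers in Table 2 correspond to iterating over the data loader alone. A crude way to reproduce this kind of measurement is sketched below (a hypothetical helper timing only data loading and augmentation, with no model involved).

```python
import time

def time_one_epoch(loader):
    """Wall-clock time for one full pass over a data loader."""
    start = time.perf_counter()
    for _ in loader:
        pass                  # consume batches; no forward/backward pass
    return time.perf_counter() - start
```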
[Figure: accuracy across downstream tasks for SimCLR with the usual data augmentations versus SimCLR trained with denoising only.]

…misleading, since the forward and backward passes of the model in a training loop take additional time that can interact with the caching process. When measuring the training time for a single epoch with torchvision, using the full set of…
…computationally cheaper set of augmentations to employ.
References

[6] Ting Chen, Simon Kornblith, Kevin Swersky, Mohammad Norouzi, and Geoffrey Hinton. Big self-supervised models are strong semi-supervised learners. In NeurIPS, 2020.
[7] Xinlei Chen and Kaiming He. Exploring simple siamese representation learning. In CVPR, 2020.
[8] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
[9] Debidatta Dwibedi, Yusuf Aytar, Jonathan Tompson, Pierre Sermanet, and Andrew Zisserman. With a little help from my friends: Nearest-neighbor contrastive learning of visual representations. In ICCV, pages 9568–9577, 2021.
[10] Priya Goyal, Quentin Duval, Jeremy Reizenstein, Matthew Leavitt, Min Xu, Benjamin Lefaudeux, Mannat Singh, Vinicius Reis, Mathilde Caron, Piotr Bojanowski, Armand Joulin, and Ishan Misra. VISSL. https://github.com/facebookresearch/vissl, 2021.
[11] Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre H. Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Daniel Guo, Mohammad Gheshlaghi Azar, Bilal Piot, Koray Kavukcuoglu, Rémi Munos, and Michal Valko. Bootstrap your own latent: A new approach to self-supervised learning. In NeurIPS, 2020.
[12] Patrick Helber, Benjamin Bischke, Andreas Dengel, and Damian Borth. EuroSAT: A novel dataset and deep learning benchmark for land use and land cover classification. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 12(7):2217–2226, 2019.
[13] Grant Van Horn, Oisin Mac Aodha, Yang Song, Yin Cui, Chen Sun, Alexander Shepard, Hartwig Adam, Pietro Perona, and Serge J. Belongie. The iNaturalist species classification and detection dataset. In CVPR, pages 8769–8778, 2018.
[14] Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C. Lawrence Zitnick, and Ross Girshick. CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. In CVPR, 2017.
[15] Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and Dilip Krishnan. Supervised contrastive learning. In NeurIPS, pages 18661–18673, 2020.
[16] Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009.
[17] Guillaume Leclerc, Andrew Ilyas, Logan Engstrom, Sung Min Park, Hadi Salman, and Aleksander Madry. FFCV. https://github.com/libffcv/ffcv/, 2022. Commit xxxxxxx.
[18] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library. In NeurIPS, pages 8024–8035, 2019.
[19] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet large scale visual recognition challenge. IJCV, 115(3):211–252, 2015.
[20] Yang You, Igor Gitman, and Boris Ginsburg. Large batch training of convolutional networks, 2017.
[21] Jure Zbontar, Li Jing, Ishan Misra, Yann LeCun, and Stéphane Deny. Barlow twins: Self-supervised learning via redundancy reduction. arXiv preprint arXiv:2103.03230, 2021.
[22] Bolei Zhou, Agata Lapedriza, Jianxiong Xiao, Antonio Torralba, and Aude Oliva. Learning deep features for scene recognition using places database. In NeurIPS, 2014.
Figure 11. ImageNet validation accuracy with respect to the learning rate on a single GPU (batch size 256) with VICReg.

Figure 12. ImageNet validation accuracy with respect to the learning rate on a single GPU (batch size 256) with Barlow Twins.

In this paper, we introduced FFCV-SSL, a fork of the FFCV library [17] that we improved to make the training of SSL models much faster. FFCV significantly increases the speed of data loading by converting any image dataset into a single file with a fixed or variable resolution for each image. In addition, all data augmentations are compiled in advance with Numba. In our implementation, we added the following data augmentations: ColorJitter, Grayscale, and Solarization, as well as support for multiple branches of augmentation given a specific input. The library is available at https://github.com/facebookresearch/FFCV-SSL. We added a code file in the supplementary material that shows how to use FFCV-SSL to train several types of SSL methods. Using this single file, one can train SimCLR, VICReg, BarlowTwins, and BYOL using a single or many GPUs. The script supports the use of SLURM with submitit.
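To give a flavor of the training file, below is a minimal data-loading sketch using the upstream FFCV API. The `.beton` path is a placeholder, and the FFCV-SSL-specific additions (ColorJitter, Grayscale, Solarization, and multi-branch pipelines) are omitted since their exact names may differ; see the repository for the actual interface.

```python
import torch
from ffcv.loader import Loader, OrderOption
from ffcv.fields.decoders import RandomResizedCropRGBImageDecoder, IntDecoder
from ffcv.transforms import RandomHorizontalFlip, ToTensor, ToDevice, ToTorchImage

device = torch.device("cuda:0")

# One augmentation branch; an SSL run would build one such pipeline per
# positive view, which FFCV-SSL supports for a single input field.
image_pipeline = [
    RandomResizedCropRGBImageDecoder((224, 224)),
    RandomHorizontalFlip(),
    ToTensor(),
    ToDevice(device, non_blocking=True),
    ToTorchImage(),
]
label_pipeline = [IntDecoder(), ToTensor(), ToDevice(device)]

loader = Loader(
    "imagenet_train.beton",   # placeholder: dataset pre-converted by FFCV
    batch_size=256,
    num_workers=8,
    order=OrderOption.RANDOM,
    pipelines={"image": image_pipeline, "label": label_pipeline},
)
```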
B. On the importance of increasing the learning rate when using small batch sizes

In this section, we present more experiments concerning the impact of the learning rate when using small batch sizes. In Fig. 11, we study the impact of the learning rate on VICReg when using a batch size of 256 on a single GPU. The optimizer used is LARS, and we observe that one can easily gain some percentage points of accuracy by just tuning the learning rate. We observe the same behavior with BYOL, using the same optimizer, in Fig. 13, and with Barlow Twins, using AdamW as optimizer, in Fig. 12.

Figure 13. ImageNet validation accuracy with respect to the learning rate on a single GPU (batch size 256) with BYOL.

For VICReg in Fig. 14, we found the optimal hyper-parameters on a single GPU to be a similarity and std coefficient of 25 and a learning rate of 1.0 when using LARS as optimizer. With Barlow Twins in Fig. 15, we found the optimal hyper-parameters on a single GPU to be a lambd value of 0.0025 and a learning rate of 0.005 using AdamW. For BYOL, in Fig. 13, we found the optimal hyper-parameters to be a momentum encoder value of 0.996 and a learning rate of 1.0.
Figure 14. ImageNet validation accuracy for a wide cross-validation performed on a single GPU with VICReg. For this experiment, we performed a grid search on the similarity and std coefficients with the values 1, 5, 10, 15, 25 (we fixed the covariance coefficient to 1) and on the learning rate (LARS with a weight decay of 1e-4) with the following values: 0.6, 0.7, 1.0, 1.2, 1.5, 1.7. We found that the best hyper-parameters for single-GPU training, using a batch size of 256, are a similarity and std coefficient of 25 and a learning rate of 1.0, which leads to 67.4 accuracy in online linear probing.

Figure 15. ImageNet validation accuracy for a wide cross-validation performed on a single GPU with Barlow Twins. For this experiment, we performed a grid search on the Barlow Twins lambd hyper-parameter with the values 0.0025, 0.0045, 0.0051, 0.0075, 0.01 and on the learning rate (AdamW with a weight decay of 4e-5) with the following values: 0.0001, 0.00025, 0.0005, 0.001, 0.0025, 0.005. We found that the best hyper-parameters for single-GPU training, using a batch size of 256, are a lambd value of 0.0025 and a learning rate of 0.005, which leads to 66.8 accuracy in online linear probing.

Figure 16. ImageNet validation accuracy for a wide cross-validation performed on a single GPU with BYOL. For this experiment, …