To Compress or Not To Compress - Self-Supervised Learning and Information Theory: A Review
Abstract
Deep neural networks excel in supervised learning tasks but are constrained by the need
for extensive labeled data. Self-supervised learning emerges as a promising alternative,
allowing models to learn without explicit labels. Information theory, and notably the
information bottleneck principle, has been pivotal in shaping deep neural networks. This
principle focuses on optimizing the trade-off between compression and preserving relevant
information, providing a foundation for efficient network design in supervised contexts.
However, its precise role and adaptation in self-supervised learning remain unclear. In
this work, we scrutinize various self-supervised learning approaches from an information-
theoretic perspective, introducing a unified framework that encapsulates the self-supervised
information-theoretic learning problem. We weave together existing research into a cohesive
narrative, delve into contemporary self-supervised methodologies, and spotlight potential
research avenues and inherent challenges. Additionally, we discuss the empirical evaluation of
information-theoretic quantities and their estimation methods. Overall, this paper furnishes
an exhaustive review of the intersection of information theory, self-supervised learning, and
deep neural networks.
1. Introduction
Deep neural networks (DNNs) have revolutionized fields such as computer vision, natural
language processing, and speech recognition due to their remarkable performance in super-
vised learning tasks (Alam et al., 2020; He et al., 2015; LeCun et al., 2015). However, the
success of DNNs is often limited by the need for vast amounts of labeled data, which can
be both time-consuming and expensive to acquire. Self-supervised learning (SSL) emerges
as a promising alternative, enabling models to learn from data without explicit labels by
leveraging the underlying structure and relationships within the data itself.
Recent advances in SSL have been driven by joint embedding architectures, such as Siamese
Nets (Bromley et al., 1993), DrLIM (Chopra et al., 2005; Hadsell et al., 2006), and SimCLR
(Chen et al., 2020a). These approaches define a loss function that encourages representations
of different versions of the same image to be similar while pushing representations of distinct
images apart. After optimizing the surrogate objective, the pre-trained model can be
employed as a feature extractor, with the learned features serving as inputs for downstream
supervised tasks like image classification, object detection, instance segmentation, or pose
estimation (Caron et al., 2021; Chen et al., 2020a; Misra and van der Maaten, 2020; Shwartz-
Ziv et al., 2022b). Although SSL methods have shown promising results in practice, the
theoretical underpinnings behind their effectiveness remain an open question (Arora et al.,
2019; Lee et al., 2021a).
Information theory has played a crucial role in understanding and optimizing deep neural
networks, from practical applications like the variational information bottleneck (Alemi
et al., 2016) to theoretical investigations of generalization bounds induced by mutual
information (Steinke and Zakynthinou, 2020; Xu and Raginsky, 2017). Building upon
these foundations, several researchers have attempted to enhance self-supervised and semi-
supervised learning algorithms using information-theoretic principles, such as the Mutual
Information Neural Estimator (MINE) (Belghazi et al., 2018b) combined with the information
maximization (InfoMax) principle (Linsker, 1988). However, the plethora of objective
functions, contradictory assumptions, and varied estimation techniques in the literature
can make it challenging to grasp the underlying principles and their implications.
In this paper, we aim to achieve two objectives. First, we propose a unified framework
that synthesizes existing research on self-supervised and semi-supervised learning from an
information-theoretic standpoint. This framework allows us to present and compare current
methods, analyze their assumptions and difficulties, and discuss the optimal representation
for neural networks in general and self-supervised networks in particular. Second, we explore
different methods and estimators for optimizing information-theoretic quantities in deep
neural networks and investigate how recent models optimize various information-theoretic
terms.
In addition to the main structure of the paper, we dedicate a section to the challenges and
opportunities in extending the information-theoretic perspective to other learning paradigms,
such as energy-based models. We highlight the potential advantages of incorporating these
extensions into self-supervised learning algorithms and discuss the technical and conceptual
challenges that must be addressed.
The structure of the paper is as follows. Section 2 introduces the key concepts in supervised,
semi-supervised, self-supervised learning, information theory, and representation learning.
Section 3 presents a unified framework for multiview learning based on information theory.
We first discuss what an optimal representation is and why compression is beneficial for
learning. Next, we explore optimal representation in single-view supervised learning models
and how they can be extended to unsupervised, semi-supervised, and multiview contexts.
The focus then shifts to self-supervised learning, where the optimal representation remains
an open question. Using the unified framework, we compare recent self-supervised algorithms
and discuss their differences. We analyze the assumptions behind these models, their effects
Section 5 addresses several technical challenges, discussing both theoretical and practical
issues in estimating theoretical information terms. We present recent methods for estimating
these quantities, including variational bounds and estimators. Section 6 concludes the paper
by offering insights into potential future research directions at the intersection of information
theory, self-supervised learning, and deep neural networks. Our aim is to inspire further
research that leverages information theory to advance our understanding of self-supervised
learning and to develop more efficient and effective models for a broad range of applications.
Although these views often provide different and complementary information about the
same data, directly integrating them does not produce satisfactory results due to biases
between multiple views (Yan et al., 2021). Thus, multiview representation learning involves
identifying the underlying data structure and integrating the different views into a common
feature space, yielding representations that perform well on downstream tasks. In recent decades, multiview learning has been
used for many machine learning tasks and influenced many algorithms, such as co-training
mechanisms (Kumar and Daumé, 2011), subspace learning methods (Xue et al., 2019), and
multiple kernel learning (MKL) (Bach and Jordan, 2002). Li et al. (2018) proposed two
categories for multiview representation learning: (i) multiview representation fusion, which
combines different features from multiple views into a single compact representation, and (ii)
alignment of multiview representation, which attempts to capture the relationships among
multiple different views through feature alignment. In this case, a learned mapping function
embeds the data of each view, and the representations are regularized to form a multiview-
aligned space. In this research direction, an early study is the Canonical Correlation Analysis
(CCA) (Hotelling, 1936) and its kernel extensions (Bach and Jordan, 2003; Hardoon et al.,
2004; Sun, 2013). In addition to CCA, multiview representation learning has penetrated a
variety of learning methods, such as dimensionality reduction (Sun et al., 2010), clustering
analysis (Yan et al., 2015), multiview sparse coding (Cao et al., 2013; Jia et al., 2010;
Liu et al., 2014), and multimodal topic learning (Pu et al., 2020). However, despite their
promising results, these methods use handcrafted features and linear embedding functions,
which cannot capture the nonlinear properties of multiview data.
The emergence of deep learning has provided a powerful way to learn complex, nonlinear,
and hierarchical representations of data. By incorporating multiple hierarchical layers, deep
learning algorithms can learn complex, subtle, and abstract representations of target data.
The success of deep learning in various application domains has led to a growing interest in
deep multiview methods, which have shown promising results. Examples of these methods
include deep multiview canonical correlation analysis (Andrew et al., 2013) as an extension
of CCA, multiview clustering via deep matrix factorization (Zhao et al., 2017a), and the deep
multiview spectral network (Huang et al., 2019). Moreover, deep architectures have been
employed to generate effective representations in methods such as multiview convolutional
neural networks (Liu et al., 2021a), multimodal deep Boltzmann machines (Srivastava and
Salakhutdinov, 2014), multimodal deep autoencoders (Ngiam et al., 2011; Wang et al., 2015),
and multimodal recurrent neural networks (Donahue et al., 2015; Karpathy and Fei-Fei,
2015; Mao et al., 2014).
Two main categories of SSL architectures exist: (1) generative architectures based on
reconstruction or prediction and (2) joint embedding architectures (Liu et al., 2021b). Both
architecture classes can be trained using either contrastive or non-contrastive methods.
and maximizing the entropy of the distribution relative to a prior. Vector quantized
variational auto-encoders (VQ-VAE) employ discrete (vector-quantized) latent variables to achieve
similar results (Van Den Oord et al., 2017).
• Contrastive Methods: Contrastive methods utilize data points from the training
set as positive samples and generate points outside the region of high data density as
contrastive samples. The energy (e.g., reconstruction error for generative architectures
or representation predictive error for JEA) should be low for positive samples and
higher for contrastive samples. Various loss functions involving the energies of pairs or
sets of samples can be minimized to achieve this objective.
We now present a few concrete examples of popular models that employ various combinations
of generative architectures, joint embedding architectures, contrastive training, and non-
contrastive training:
objective function in many contrastive learning methods:

$$\mathcal{L} = \mathbb{E}_{x, x^{+}, x^{-}}\left[-\log \frac{e^{f(x)^{\top} f(x^{+})}}{\sum_{k=1}^{K} e^{f(x)^{\top} f(x_k)}}\right]$$
where x+ is a sample similar to x, xk are all the samples in the batch, and f is an encoder.
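To make the objective concrete, here is a minimal PyTorch sketch of this loss for a single anchor (the tensor names and the optional temperature `tau` are illustrative additions, not part of the formulation above):

```python
import torch

def info_nce_loss(z_anchor, z_positive, z_batch, tau=1.0):
    """Contrastive (InfoNCE-style) loss: pull f(x) toward f(x+),
    push it away from all other samples in the batch.

    z_anchor:   (D,)   embedding f(x)
    z_positive: (D,)   embedding f(x+)
    z_batch:    (K, D) embeddings f(x_k) of all K batch samples
    tau: optional temperature (set to 1.0 to match the plain objective above)
    """
    pos = z_anchor @ z_positive / tau            # scalar f(x)^T f(x+)
    all_scores = z_batch @ z_anchor / tau        # (K,) scores f(x)^T f(x_k)
    # -log softmax = log-sum-exp over the batch minus the positive score
    return torch.logsumexp(all_scores, dim=0) - pos

# usage: average the loss over anchors in a toy batch of embeddings
z = torch.randn(8, 128)
loss = info_nce_loss(z[0], z[1], z, tau=0.5)
```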
However, contrastive methods heavily depend on all other samples in the batch and require a
large batch size. Additionally, recent studies (Jing et al., 2021) have shown that contrastive
learning can lead to dimensional collapse, where the embedding vectors span a lower-
dimensional subspace instead of the entire embedding space. Although negative
pairs should repel each other and thereby prevent dimensional collapse, augmentation along feature
dimensions and implicit regularization cause the embedding vectors to fall into a lower-
dimensional subspace, resulting in low-rank solutions.
To address these problems, recent works have introduced JEA models with non-contrastive
methods. Unlike contrastive methods, these methods employ regularization to prevent the
collapse of the representation and do not explicitly rely on negative samples. For example,
several papers use stop-gradients and extra predictors to avoid collapse (Chen and He, 2021;
Grill et al., 2020), while Caron et al. (2020) employed an additional clustering step. VICReg
(Bardes et al., 2021) is another non-contrastive method that regularizes the covariance
matrix of the representations. Consider two embedding batches Z = [f(x_1), ..., f(x_N)] and
Z' = [f(x'_1), ..., f(x'_N)], each of size (N × K). Denote by C the (K × K) covariance
matrix obtained from [Z, Z']. The VICReg triplet loss is defined by:

$$\mathcal{L} = \frac{1}{K}\sum_{k=1}^{K}\left[\alpha \max\left(0,\, \gamma - \sqrt{C_{k,k} + \epsilon}\right) + \beta \sum_{k' \neq k} C_{k,k'}^{2}\right] + \gamma\, \|Z - Z'\|_F^2 / N.$$
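A sketch of this objective in PyTorch is given below. It follows the published VICReg convention of applying the variance and covariance terms to each branch separately; the coefficient values and the split of γ into a hinge target (`gamma_std`) and an invariance weight (`gamma_inv`) are illustrative assumptions:

```python
import torch

def vicreg_loss(z1, z2, alpha=25.0, beta=1.0, gamma_inv=25.0, gamma_std=1.0, eps=1e-4):
    """Sketch of a VICReg-style objective; z1, z2 are (N, K) embedding batches."""
    N, K = z1.shape
    # invariance: mean-squared distance between the two embedding batches
    inv = ((z1 - z2) ** 2).sum() / N

    def var_cov(z):
        z = z - z.mean(dim=0)                      # center each dimension
        cov = (z.T @ z) / (N - 1)                  # (K, K) covariance matrix
        std = torch.sqrt(torch.diag(cov) + eps)    # per-dimension standard deviation
        var = torch.clamp(gamma_std - std, min=0).mean()   # hinge on the diagonal
        off_diag = cov - torch.diag(torch.diag(cov))
        cov_pen = (off_diag ** 2).sum() / K        # penalize off-diagonal covariances
        return var, cov_pen

    v1, c1 = var_cov(z1)
    v2, c2 = var_cov(z2)
    return alpha * (v1 + v2) + beta * (c1 + c2) + gamma_inv * inv
```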
A sufficient statistic captures all the information about Y in X. Cover (1999) proved this
property: a statistic T(X) is sufficient for Y if and only if I(Y; T(X)) = I(Y; X).
However, the sufficiency definition also encompasses trivial identity statistics that only "copy"
rather than "extract" essential information. To prevent statistics from inefficiently utilizing
observations, the concept of minimal sufficient statistics was introduced:
Definition 3 (Minimal sufficient statistic (MSS)) A sufficient statistic T is minimal if, for
any other sufficient statistic S, there exists a function f such that T = f (S) almost surely
(a.s.).
In essence, MSS are the simplest sufficient statistics, inducing the coarsest sufficient partition
on X. In MSS, the values of X are grouped into as few partitions as possible without
sacrificing information. MSS are statistics with the maximum information about Y while
retaining as little information about X as possible (Koopman, 1936).
(Turner and Sahani, 2007), speech recognition (Hecht et al., 2009), and deep learning (Alemi
et al., 2016; Shwartz-Ziv and Tishby, 2017).
Let X be an input random variable, Y a target variable, and P (X, Y ) their joint distribution.
A representation T is a stochastic function of X defined by a mapping P(T | X). This
mapping transforms X ∼ P(X) into a representation T ∼ P(T) := ∫ P_{T|X}(· | x) dP_X(x).
The triple Y − X − T forms a Markov chain in that order with respect to the joint probability
measure P_{X,Y,T} = P_{X,Y} P_{T|X}, and the quantities of interest are the mutual information terms I(X; T) and I(Y; T).
Within the IB framework, our goal is to find a representation P (T | X) that extracts as much
information as possible about Y (high performance) while compressing X maximally (keeping
I(X; T ) small). This can also be interpreted as extracting only the relevant information
that X contains about Y .
The data processing inequality (DPI) implies that I(Y ; T ) ≤ I(X; Y ), so the compressed
representation T cannot convey more information than the original signal. Consequently,
there is a trade-off between compressed representation and the preservation of relevant
information about Y . The construction of an efficient representation variable is characterized
by its encoder and decoder distributions, P (T | X) and P (Y | T ), respectively. The efficient
representation of X involves minimizing the complexity of the representation I (T ; X) while
maximizing I (T ; Y ). Formally, the IB optimization involves minimizing the following
objective function:

$$\mathcal{L}_{IB} = I(X; T) - \beta\, I(Y; T), \qquad (1)$$
where β is the trade-off parameter controlling the complexity of T and the amount of relevant
information it preserves. Intuitively, we pass the information that X contains about Y
through a “bottleneck” via the representation T . It has been shown that:
Several issues arise with the population risk. Firstly, it remains unclear which loss function
is optimal. A popular choice is the logarithmic loss (or error’s entropy), which has been
numerically demonstrated to yield better results (Erdogmus, 2002). This loss has been
employed in various algorithms, including the InfoMax principle (Linsker, 1988), tree-based
algorithms (Quinlan, 2014), deep neural networks (Zhang and Sabuncu, 2018), and Bayesian
modeling (Wenzel et al., 2020). Painsky and Wornell (2018) provided a rigorous justification
for using the logarithmic loss and showed that it is an upper bound to any choice of the loss
function that is smooth, proper, and convex for binary classification problems.
In most cases, the joint distribution P(X, Y) is unknown, and we have access to only n
samples from it, denoted by $D_n := \{(x_i, y_i)\}_{i=1}^{n}$. Consequently, the population risk
cannot be computed directly. Instead, we typically choose the predictor that minimizes the
empirical risk on the training dataset:

$$\hat{L}_{P(X,Y)}(f, \ell, D_n) = \frac{1}{n}\sum_{i=1}^{n} \ell(y_i, f(x_i)).$$
The generalization gap, defined as the difference between the empirical and population risks, is
given by $L_{P(X,Y)}(f, \ell) - \hat{L}_{P(X,Y)}(f, \ell, D_n)$.
Interestingly, the relationship between the true loss and the empirical loss can be bounded
using the information bottleneck term. Shamir et al. (2010) developed several finite sample
bounds for the generalization gap. According to their study, the IB framework exhibited
good generalizability even with small sample sizes. In particular, they developed non-uniform
bounds adaptive to the model's complexity. They demonstrated that for the discrete case,
the error in estimating mutual information from finite samples is bounded by $O\!\left(\frac{|X| \log n}{\sqrt{n}}\right)$,
where |X| is the cardinality of X (the number of possible values that the random variable X
can take). The results support the intuition that simpler models generalize better, and we
would like to compress our model. Therefore, optimizing eq. (1) presents a trade-off between
two opposing forces. On the one hand, we want to increase our prediction accuracy in our
training data (high β).
On the other hand, we would like to decrease β to narrow the generalization gap. Vera
et al. (2018) extended their work and showed that the generalization gap is bounded by the
square root of the mutual information between the training input and the model representation times
$\frac{\log n}{n}$. Furthermore, Russo and Zou (2019) and Xu and Raginsky (2017) demonstrated that
the square root of the mutual information between the training input and the parameters
inferred from the training algorithm provides a concise bound on the generalization gap.
However, these bounds critically depend on the Markov operator that maps the training set
to the network parameters, whose characterization is not trivial.
Achille and Soatto (2018) explored how applying the IB objective to the network’s parameters
may reduce overfitting while maintaining invariant representations. Their work showed
that flat minima, which have better generalization properties, bound the information in
the weights, and the information in the weights bounds the information in the activations.
Chelombiev et al. (2019) found that the generalization precision is positively correlated with
the degree of compression of the last layer in the network. Shwartz-Ziv et al. (2018) showed
that the generalization error depends exponentially on the mutual information between the
model and the input once it is smaller than log 2n - the query sample complexity. Moreover,
they demonstrated that M bits of compression of X are equivalent to an exponential factor
of 2M training examples. Piran et al. (2020) extended the original IB to the dual form,
which offers several advantages in terms of compression.
These studies illustrate that the IB leads to a trade-off between prediction and complexity,
even for the empirical distribution. With the IB objective, we can design estimators to
find optimal solutions for different regimes with varying performance, complexity, and
generalization.
3. Information-Theoretic Objectives
Before delving into the details, this section aims to provide an overview of the information-
theoretic objectives in various learning scenarios, including supervised, unsupervised, and
self-supervised settings. We will also introduce a general framework to better understand
the process of learning optimal representations and explore recent methods working towards
this goal.
In our model, we use a learned encoder with a prior P (Z) to generate a conditional
representation (which may be deterministic or stochastic) Zi |Xi = Pθi (Zi |Xi ), where i = 1, 2
represents the two views. Subsequently, we utilize various decoders to 'decode' distinct
aspects of the representation:
For the supervised scenario, we have a joint embedding of the label classifiers from both
views, Ŷ1,2 = Qρ (Y |Z1 , Z2 ), and two decoders predicting the labels of the downstream task
based on each individual view, Ŷi = Qρi (Y |Zi ) for i = 1, 2.
For the unsupervised case, we have direct decoders for input reconstruction from the
representation, X̄i = Qψi (Xi |Zi ) for i = 1, 2.
For self-supervised learning, we utilize two cross-decoders attempting to predict one representation
based on the other, Z̃1|Z2 = qη1(Z1|Z2) and Z̃2|Z1 = qη2(Z2|Z1). Figure 1 illustrates
this structure.
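For concreteness, a schematic PyTorch skeleton of this two-view structure is sketched below; all layer sizes and architectures are placeholders rather than a prescribed implementation:

```python
import torch
import torch.nn as nn

class TwoViewFramework(nn.Module):
    """Illustrative skeleton of the two-view framework described above."""

    def __init__(self, x_dim, z_dim, num_classes):
        super().__init__()
        # encoders P_theta_i(Z_i | X_i), one per view
        self.enc1 = nn.Sequential(nn.Linear(x_dim, z_dim), nn.ReLU(), nn.Linear(z_dim, z_dim))
        self.enc2 = nn.Sequential(nn.Linear(x_dim, z_dim), nn.ReLU(), nn.Linear(z_dim, z_dim))
        # supervised decoders: joint Q_rho(Y | Z1, Z2) and per-view Q_rho_i(Y | Z_i)
        self.cls_joint = nn.Linear(2 * z_dim, num_classes)
        self.cls1 = nn.Linear(z_dim, num_classes)
        self.cls2 = nn.Linear(z_dim, num_classes)
        # unsupervised decoders Q_psi_i(X_i | Z_i) for input reconstruction
        self.dec1 = nn.Linear(z_dim, x_dim)
        self.dec2 = nn.Linear(z_dim, x_dim)
        # self-supervised cross-decoders q_eta_i(Z_i | Z_j)
        self.cross12 = nn.Linear(z_dim, z_dim)   # predicts Z2 from Z1
        self.cross21 = nn.Linear(z_dim, z_dim)   # predicts Z1 from Z2

    def forward(self, x1, x2):
        z1, z2 = self.enc1(x1), self.enc2(x2)
        return {
            "y_joint": self.cls_joint(torch.cat([z1, z2], dim=-1)),
            "y1": self.cls1(z1), "y2": self.cls2(z2),
            "x1_rec": self.dec1(z1), "x2_rec": self.dec2(z2),
            "z2_from_z1": self.cross12(z1), "z1_from_z2": self.cross21(z2),
        }
```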
The information-theoretic perspective on self-supervised networks has led to confusion
in recent work regarding which information quantities are being optimized. In supervised and unsupervised
learning, only one 'information path' exists when optimizing information-theoretic terms:
the input is encoded through the network, and then the representation is decoded and
compared to the targets. As a result, the representation and corresponding information
always stem from a single encoder and decoder.
However, in the self-supervised multiview scenario, we can construct our representation
using various encoders and decoders. For instance, to define the information involved in
I(X1; Z1), we need to specify the distribution of the associated random variable. It could be
induced either by the encoder of X1, Pθ1(Z1|X1), or by the encoder of X2, Pθ2(Z2|X2), whose
output is subsequently passed through the cross-decoder Qη1(Z1|Z2) and then the direct decoder
Qψ1(X1|Z1).
To fully understand the information terms we aim to optimize, and to distinguish between the
various "information paths," we mark each information path explicitly. For example,
I_{P(X1),P(Z1|X1),P(Z2|Z1)}(X1; Z2) is based on the path P(X1) → P(Z1|X1) → P(Z2|Z1). In
the following section, we "translate" previous work into the present framework and
examine the loss functions.
[Figure 1: The unified two-view framework, with supervised prediction, unsupervised reconstruction, and self-supervised prediction paths, each accompanied by its complexity term.]
The sufficiency of Z for Y can be defined as the amount of label information retained after passing data through
the encoder:
Definition 4 Sufficiency: A representation Z of X is sufficient for Y if and only if
I(X; Y |Z) = 0.
Federici et al. (2020) showed that Z is sufficient for Y if and only if the amount of information
regarding the task remains unchanged by the encoding procedure. A sufficient representation
can predict Y as accurately as the original data X. In Section 2.4, we saw a trade-off
between prediction and generalization when there is a finite amount of data. To reduce the
generalization gap, we aim to compress X while retaining as much predictive information about
the labels as possible. Thus, we relax the sufficiency definition and minimize the following
objective:
The mutual information I(Y; Z) determines how much label information is accessible and thus
reflects the model's potential performance on the target task. I(X; Z) represents the
information that Z carries about the input, which we aim to compress. However, I(X; Z)
contains both relevant and irrelevant information about Y . Therefore, using the chain rule
of information, Federici et al. (2020) proposed splitting I(X; Z) into two terms:

$$I(X; Z) = \underbrace{I(X; Z \mid Y)}_{\text{superfluous information}} + \underbrace{I(Z; Y)}_{\text{predictive information}} \qquad (4)$$

The conditional information I(X; Z | Y) represents information in Z that is not predictive
of Y, i.e., superfluous information. The decomposition of input information enables us to
compress only irrelevant information while preserving the relevant information for predicting
Y . Several methods are available for evaluating and estimating these information-theoretic
terms in the supervised case (see Section 5 for details).
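The decomposition in eq. (4) can be checked numerically on a toy discrete Markov chain Y → X → Z (for which I(Y; Z|X) = 0, so the identity holds exactly); the distributions below are made up for illustration:

```python
import numpy as np

def mi(pxy):
    """Mutual information (in nats) of a 2-D joint distribution array."""
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    mask = pxy > 0
    return float((pxy[mask] * np.log(pxy[mask] / (px @ py)[mask])).sum())

def cond_mi_xz_given_y(pyxz):
    """I(X;Z|Y) for a joint array indexed as p[y, x, z]."""
    total = 0.0
    for y in range(pyxz.shape[0]):
        py = pyxz[y].sum()
        if py > 0:
            total += py * mi(pyxz[y] / py)
    return total

# Toy Markov chain Y -> X -> Z
p_y = np.array([0.4, 0.6])
p_x_given_y = np.array([[0.8, 0.2],     # p(x|y=0)
                        [0.3, 0.7]])    # p(x|y=1)
p_z_given_x = np.array([[0.9, 0.1],     # p(z|x=0)
                        [0.2, 0.8]])    # p(z|x=1)
p_yxz = p_y[:, None, None] * p_x_given_y[:, :, None] * p_z_given_x[None, :, :]

i_xz  = mi(p_yxz.sum(axis=0))        # I(X;Z)
i_zy  = mi(p_yxz.sum(axis=1))        # I(Y;Z)
i_xzy = cond_mi_xz_given_y(p_yxz)    # I(X;Z|Y)
print(i_xz, i_xzy + i_zy)            # the two numbers agree: I(X;Z) = I(X;Z|Y) + I(Z;Y)
```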
The study of representation compression in Deep Neural Networks (DNNs) for supervised
learning has shown inconsistent results. For instance, Chelombiev et al. (2019) discovered
a positive correlation between generalization accuracy and the compression level of the
network’s final layer. Shwartz-Ziv et al. (2018) also examined the relationship between
generalization and compression, demonstrating that generalization error exponentially de-
pends on mutual information, I(X; Z). Furthermore, Achille et al. (2017) established that
flat minima, known for their improved generalization properties, constrain the mutual
information. However, Saxe et al. (2019) showed that compression was not necessary for
generalization in deep linear networks. Basirat et al. (2021) revealed that the decrease in
mutual information is essentially equivalent to geometrical compression. Other studies have
found that the mutual information between training inputs and inferred parameters provides
a concise bound on the generalization gap (Pensia et al., 2018; Xu and Raginsky, 2017).
Lastly, Achille and Soatto (2018) explored using an information bottleneck objective on
network parameters to prevent overfitting and promote invariant representations.
However, the LMIB method has a significant limitation: it utilizes linear projections for
each view, which can restrict the combined representation when the relationship between
different views is complex. To overcome this limitation, Wang et al. (2019) proposed using
deep neural networks to replace linear projectors. Their model first extracts concise latent
representations from each view using deep networks and then learns the joint representation
of all views using neural networks. They minimize the objective:
$$\mathcal{L} = \alpha I_{P(X_1),P(Z_1|X_1)}(X_1; Z_1) + \beta I_{P(X_2),P(Z_2|X_2)}(X_2; Z_2) - I_{P(Z_1|X_1),P(Z_2|X_2)}(Z_{1,2}; Y)$$
Here, α and β are trade-off parameters, Z1 and Z2 are the two neural networks’ represen-
tations, and Z1,2 is the joint embedding of Z1 and Z2 . The first two terms decrease the
mutual information between a view’s latent representation and its original data representa-
tion, resulting in a simpler and more generalizable model. The final term forces the joint
representation to maximize the discrimination ability for the downstream task.
Here, β and βy are hyperparameters that balance the trade-off between the relevance of M
to the labels and the compression of Z into M .
$$\mathcal{L}_{\text{unsup}} = I(X; Z) - \beta\, I(Z; \bar{X}),$$

where I(X; Z) is the information determined by the encoder q(z|x) and I(Z; X̄) is the
information determined by the decoder q(x|z), i.e., the reconstruction error. In other
words, unsupervised IB is a special case of supervised IB, where labels are replaced with the
reconstruction performance of the training input. Alemi et al. (2016) showed that Variational
Autoencoder (VAE) (Kingma and Welling, 2019) and β-VAE (Higgins et al., 2017) are
special cases of unsupervised variational IB. Voloshynovskiy et al. (2020) extended their
results and showed that many models, including adversarial autoencoders (Makhzani et al.,
2015), InfoVAEs (Zhao et al., 2017c), and VAE/GANs (Larsen et al., 2016), could be viewed
as special cases of unsupervised IB. The main difference between them is the bounds on the
different mutual information of the IB. Furthermore, unsupervised IB was used by Uğur
et al. (2020) to derive lower bounds for their unsupervised generative clustering framework,
while Roy et al. (2018) used it to study vector-quantized autoencoders.
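As a concrete instance of this correspondence, a minimal β-VAE can be read as an unsupervised variational IB: the KL term bounds the compression term I(X; Z) and the reconstruction term stands in for the preserved information I(Z; X̄). The sketch below uses illustrative sizes and a standard Gaussian prior:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BetaVAE(nn.Module):
    """Minimal beta-VAE, read here as an unsupervised variational IB."""

    def __init__(self, x_dim=784, z_dim=32):
        super().__init__()
        self.enc = nn.Linear(x_dim, 2 * z_dim)   # outputs mean and log-variance
        self.dec = nn.Linear(z_dim, x_dim)

    def forward(self, x, beta=4.0):
        mu, logvar = self.enc(x).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization
        x_rec = self.dec(z)
        rec = F.mse_loss(x_rec, x, reduction="sum") / x.shape[0]  # reconstruction term
        # KL(q(z|x) || N(0, I)), averaged over the batch: the compression term
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=-1).mean()
        return rec + beta * kl
```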
Voloshynovskiy et al. (2020) pointed out that for the classification task in supervised IB, the
latent space Z should be a sufficient statistic for Y, whose entropy is much lower than that of X.
This results in a highly compressed representation where sequences close in the input space
might be close in the latent space, and the less significant features will be compressed. In
contrast, in the unsupervised setup, the IB suggests compressing the input to the encoded
representation so that each input sequence can be decoded uniquely. In this case, the latent
space’s entropy should correspond to the input space’s entropy, and compression is much
more difficult.
4. Self-Supervised Multiview Information Bottleneck Learning
How can we learn without labels and still achieve good predictive power? Is compression
necessary to obtain an optimal representation? This section analyzes and discusses how
to achieve optimal representation for self-supervised learning when labels are not available
during training. We review recent methods for self-supervised learning and show how they
can be integrated into a single framework. We compare their objective functions, implicit
assumptions, and theoretical challenges. Finally, we consider the information-theoretic
properties of these representations, their optimality, and different ways of learning them.
One approach to enhance deep learning methods is to apply the InfoMax principle in a
multiview setting (Linsker, 1988; Wiskott and Sejnowski, 2002). As one of the earliest
approaches, Linsker (1988) proposed maximizing information transfer from input data to
its latent representation, showing its equivalence to maximizing the determinant of the
output covariance under the Gaussian distribution assumption. Becker and Hinton (1992)
introduced a representation learning approach based on maximizing an approximation of
the mutual information between alternative latent vectors obtained from the same image.
The most well-known application is the Independent Component Analysis (ICA) Infomax
algorithm (Bell and Sejnowski, 1995), designed to separate independent sources from their
linear combinations. The ICA-Infomax algorithm aims to maximize the mutual information
between mixtures and source estimates while imposing statistical independence among
outputs. The Deep Infomax approach (Hjelm et al., 2018) extends this idea to unsupervised
feature learning by maximizing the mutual information between input and output while
matching a prior distribution for the representations. Recent work has applied this principle
to a self-supervised multiview setting (Bachman et al., 2019; Henaff, 2020; Hjelm et al.,
2018; Tian et al., 2020a), wherein these works maximize the mutual information between the
views Z1 and Z2 using the classifier q(z1 |z2 ), which attempts to predict one representation
from the other.
However, Tschannen et al. (2019) demonstrated that the effectiveness of InfoMax models is
more attributable to the inductive biases introduced by the architecture and estimators than
to the training objectives themselves, as the InfoMax objectives can be trivially maximized
using invertible encoders. Moreover, a fundamental issue with the InfoMax principle is that
it retains irrelevant information about the labels, contradicting the core concept of the IB
principle, which advocates compressing the representation to enhance generalizability.
To resolve this problem, Sridharan and Kakade (2008) proposed the multiview IB framework.
According to this framework, in the multiview setting without labels, the IB principle of
preserving relevant information while compressing irrelevant information requires assumptions
about the relationship between the views and the labels. They presented the MultiView assumption, which
asserts that either view alone would (approximately) suffice for downstream tasks. By this
assumption, they define the relevant information as the shared information between the
views. Therefore, augmentations (such as changing the image style) should not affect the
labels.
Additionally, the views will provide most of the information in the input regarding down-
stream tasks. We improve generalization without affecting performance by compressing the
information not shared between the two views. Their formulation is as follows:
Assumption 1 (The MultiView Assumption) There exists an ϵ_info (which is assumed to
be small) such that

$$I(Y; X_1 \mid X_2) \leq \epsilon_{\text{info}} \quad \text{and} \quad I(Y; X_2 \mid X_1) \leq \epsilon_{\text{info}}.$$
As a result, when the information sharing parameter, ϵinfo , is small, the information shared
between views includes task-relevant details. For instance, in self-supervised contrastive
learning for visual data (Hjelm et al., 2018), views represent various augmentations of the
same image. In this scenario, the MultiView assumption is considered mild if the downstream
task remains unaffected by the augmentation (Geiping et al., 2022). Image augmentations
can be perceived as altering an image’s style without changing its content. Thus, Tsai et al.
(2020) contend that the information required for downstream tasks should be preserved in
the content rather than the style. This assumption allows us to separate the information into
relevant (shared information) and irrelevant (not shared) components and to compress only
the unimportant details that do not contain information about downstream tasks. Based
on this assumption, we aim to maximize the relevant information I(X2 ; Z1 ) and minimize
I(X1 ; Z1 | X2 ) - the exclusive information that Z1 contains about X1 , which cannot be
predicted by observing X2 . This irrelevant information is unnecessary for the prediction task
and can be discarded. In the extreme case, where X1 and X2 share only label information,
this approach recovers the supervised IB method without labels. Conversely, if X1 and X2
are identical, this method collapses into the InfoMax principle, as no information can be
accurately discarded.
Federici et al. (2020) used the relaxed Lagrangian objective to obtain the minimal sufficient
representation Z1 for X2,

$$\mathcal{L}_1 = I(Z_1; X_1 \mid X_2) - \beta_1 I(X_2; Z_1),$$

and the symmetric loss to obtain the minimal sufficient representation Z2 for X1,

$$\mathcal{L}_2 = I(Z_2; X_2 \mid X_1) - \beta_2 I(X_1; Z_2),$$
where β1 and β2 are the Lagrangian multipliers introduced by the constraint optimization.
By defining Z1 and Z2 on the same domain and re-parameterizing the Lagrangian multipliers,
the average of the two loss functions can be upper bounded as:
$$\mathcal{L} = -I_{P(Z_1|X_1),Q(Z_2|Z_1)}(Z_1; Z_2) + \beta\, D_{SKL}\left[p(z_1 \mid x_1)\,\|\,p(z_2 \mid x_2)\right],$$

where D_SKL represents the symmetrized KL divergence obtained by averaging the expected
values of D_KL(p(z_1 | x_1) || p(z_2 | x_2)) and D_KL(p(z_2 | x_2) || p(z_1 | x_1)). Note that when the
mapping from X1 to Z1 is deterministic, I(Z1 ; X1 | X2 ) minimization and H(Z1 | X2 )
minimization are interchangeable and the algorithms of Federici et al. (2020) and Tsai et al.
(2020) minimize the same objective. Another implementation of the same idea is based on
the Conditional Entropy Bottleneck (CEB) algorithm (Fischer, 2020) and proposed by Lee
et al. (2021b). This algorithm adds the residual information as a compression term to the
InfoMax objective using the reverse decoders q(z1 | x2 ) and q(z2 | x1 ).
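A sketch of how such a symmetrized-KL multiview objective can be implemented is shown below, assuming Gaussian encoders with diagonal covariance; the I(Z1; Z2) term is lower-bounded here with an in-batch InfoNCE-style score, which is only one of several estimators used in practice, and all hyperparameters are illustrative:

```python
import torch
import torch.nn.functional as F

def gaussian_kl(mu_p, logvar_p, mu_q, logvar_q):
    """KL( N(mu_p, var_p) || N(mu_q, var_q) ) for diagonal Gaussians, per sample."""
    var_p, var_q = logvar_p.exp(), logvar_q.exp()
    return 0.5 * (logvar_q - logvar_p + (var_p + (mu_p - mu_q) ** 2) / var_q - 1).sum(dim=-1)

def mib_loss(mu1, logvar1, mu2, logvar2, beta=1e-3):
    """Sketch of a symmetrized-KL multiview IB objective for two view posteriors."""
    # symmetrized KL between the two view posteriors, averaged over the batch
    skl = 0.5 * (gaussian_kl(mu1, logvar1, mu2, logvar2)
                 + gaussian_kl(mu2, logvar2, mu1, logvar1)).mean()
    # InfoNCE-style lower bound on I(Z1;Z2): matching pairs are positives
    z1 = F.normalize(mu1, dim=-1)
    z2 = F.normalize(mu2, dim=-1)
    logits = z1 @ z2.T / 0.1                         # (N, N) similarity matrix
    labels = torch.arange(z1.shape[0])
    i_z1z2_bound = -F.cross_entropy(logits, labels)  # higher = more shared information
    return -i_z1z2_bound + beta * skl
```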
In conclusion, all the algorithms mentioned above are based on the Multiview assump-
tion. Utilizing this assumption, they can distinguish relevant information from irrelevant
information. As a result, all these algorithms aim to maximize the information (or the
predictive ability) of one representation with respect to the other view while compressing
the information between each representation and its corresponding view. The key differences
between these algorithms lie in the decomposition and implementation of these information
terms.
Dubois et al. (2021) offer another theoretical analysis of the IB for self-supervised learning.
Their work addresses the minimum bit rate required to store the input while still
achieving high performance on a family of downstream tasks Y ∈ 𝒴. It is a rate-distortion
problem, where the goal is to find a compressed representation that will give us a good
prediction for every task. We require that the distortion measure is bounded:
Accessing the downstream task is necessary to find the solution during the learning process.
As a result, Dubois et al. (2021) considered only tasks invariant to some equivalence relation,
which divides the input into disjoint equivalence classes. An example would be an image
with labels that remain unchanged after augmentation. This is similar to the Multiview
assumption where ϵ_info → 0. By applying Shannon's rate-distortion theory, they concluded
that the minimum achievable bit rate is the rate-distortion function with the above invariance
distortion. Thus, the optimal rate can be determined by minimizing the following Lagrangian:
Using this objective, the maximization of information with labels is replaced by maximizing
the prediction ability of one view from the original input, regularized by direct information
from the input. Similarly to the above results, we would like to find a representation Z1
that compresses the input X1 so that Z1 has the maximum information about X2 .
to retain all the information from both X1 and X2 by making the representations invertible.
In this section, we attempt to explain this phenomenon.
We begin with the InfoMax principle (Linsker, 1988), which maximizes the mutual information
between the representations of random variables Z1 and Z2 of the two views. We can lower-bound it using:
The bound is tight when q(z1 |z2 ) = p(z1 |z2 ), in which case the first term equals the
conditional entropy H(Z1 |Z2 ). The second term of eq. (7) can be considered a negative
reconstruction error or distortion between Z1 and Z2 .
In the supervised case, where Z is a learned stochastic representation of the input and Y is
the label, we aim to optimize
aggressive data augmentation or multiple downstream tasks or modalities, sharing all the
necessary information can be challenging. For example, if one view is a video stream while
the other is an audio stream, the shared information may be sufficient for object recognition
but not for tracking. Furthermore, relevant information for downstream tasks may not be
contained within the shared information between views, meaning that removing non-shared
information can negatively impact performance.
Kahana and Hoshen (2022) identified a series of tasks that violate the Multiview assumption.
In these tasks, the learned representation must also be invariant to unwanted
attributes, as in bias removal and cross-domain retrieval. In such cases, only some
attributes have labels, and the objective is to learn an invariant representation for the
domain for which labels are provided while also being informative for all other attributes
without labels. For example, for face images, only the identity labels may be provided, and
the goal is to learn a representation that captures the unlabeled pose attribute but contains
no information about the identity attribute. The task can also be applied to fair decisions,
cross-domain matching, model anonymization, and image translation.
Wang et al. (2022) formalized another case in which the Multiview assumption does not
hold: when non-shared task-relevant information cannot be ignored. In such cases, the
minimal sufficient representation contains less task-relevant information than other sufficient
representations, resulting in inferior performance. Furthermore, their analysis shows that in
such cases, the representation learned by contrastive learning is insufficient for downstream
tasks and may overfit the shared information.
As a result of their analysis, Wang et al. (2022) and Kahana and Hoshen (2022) proposed
explicitly increasing mutual information between the representation and input to preserve
task-relevant information and prevent the compression of unshared information between
views. In this case, the two regularization terms of the two views are incorporated into the
original InfoMax objective, and the following objective is optimized:
$$\mathcal{L} = \min_{P(Z_1|X_1),\, P(Z_2|X_2)} \; -I_{P(Z_1|X_1)}(X_1; Z_1) - I_{P(Z_2|X_2)}(X_2; Z_2) - \beta I_{P(Z_1|X_1),P(Z_2|Z_1)}(Z_1; Z_2). \qquad (9)$$
Wang et al. (2022) demonstrated the effectiveness of their method for SimCLR (Chen
et al., 2020a), BYOL (Grill et al., 2020), and Barlow Twins (Zbontar et al., 2021) across
classification, detection, and segmentation tasks.
is relevant for the downstream task, we cannot separate relevant and irrelevant information.
Furthermore, the learning algorithm’s nature requires that this information be protected by
explicitly maximizing it.
As datasets continue to expand in size and models are anticipated to serve as base models for
various downstream tasks, the Multiview assumption becomes less pertinent. Consequently,
compressing irrelevant information when the Multiview assumption does not hold presents
one of the most significant challenges in self-supervised learning. Identifying new methods
to separate relevant from irrelevant information based on alternative assumptions is a
promising avenue for research. It is also essential to recognize that empirical measurement
of information-theoretic quantities and their estimators plays a crucial role in developing
and evaluating such methods.
For deterministic networks with continuous inputs, the mutual information between the input and
the representation is infinite, leading to ill-posed optimization problems or piecewise constant outcomes (Amjad and Geiger, 2019; Goldfeld
et al., 2018). To tackle this issue, researchers have proposed various solutions. One common
approach is to discretize the input distribution and real-valued hidden representations by
binning, which facilitates non-trivial measurements and prevents the mutual information
from always taking the maximum value of the log of the dataset size, thus avoiding ill-posed
optimization problems (Shwartz-Ziv and Tishby, 2017).
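A minimal version of this binning estimator is sketched below (the bin count and the toy data are arbitrary choices for illustration):

```python
import numpy as np

def binned_mi(x, t, n_bins=30):
    """Binning estimate of I(X;T) in nats: discretize each variable
    into bins and compute discrete MI on the resulting labels."""
    def discretize(a):
        # map each row of a continuous array to a single discrete bin label
        edges = np.linspace(a.min(), a.max(), n_bins + 1)
        digitized = np.digitize(a, edges[1:-1])            # per-coordinate bin index
        _, labels = np.unique(digitized, axis=0, return_inverse=True)
        return labels.reshape(-1)

    xl, tl = discretize(x), discretize(t)
    joint = np.zeros((xl.max() + 1, tl.max() + 1))
    for i, j in zip(xl, tl):
        joint[i, j] += 1
    joint /= joint.sum()
    px, pt = joint.sum(1, keepdims=True), joint.sum(0, keepdims=True)
    mask = joint > 0
    return float((joint[mask] * np.log(joint[mask] / (px @ pt)[mask])).sum())

# toy usage: T is a noisy nonlinear function of X
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 1))
T = np.tanh(2.0 * X) + 0.1 * rng.normal(size=(5000, 1))
print(binned_mi(X, T))
```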
Measuring Information in High-Dimensional Spaces
Estimating mutual information in high-dimensional spaces presents a significant challenge
when applying information-theoretic measures to real-world data. This problem has been
extensively studied (Gao et al., 2015; Paninski, 2003), revealing the inefficiency of solutions
for large dimensions and the limited scalability of known approximations with respect to
sample size and dimension. Despite these difficulties, various entropy and mutual information
estimation approaches have been developed, including classic methods like k-nearest neighbors
(KNN) (Kozachenko and Leonenko, 1987) and kernel density estimation techniques (Hang
et al., 2018), as well as more recent efficient methods.
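For reference, the classic k-NN (Kozachenko–Leonenko) entropy estimator can be sketched in a few lines with SciPy; mutual information is then obtained from the entropy identity I(X;T) = H(X) + H(T) − H(X,T). The constants follow one common convention, and the sketch omits the bias corrections of more refined variants:

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma, gammaln

def kl_entropy(x, k=3):
    """Kozachenko-Leonenko k-NN differential entropy estimate (nats) for (N, d) samples."""
    n, d = x.shape
    tree = cKDTree(x)
    dist, _ = tree.query(x, k=k + 1)        # k+1 because the nearest point is itself
    rho = dist[:, -1]                       # distance to the k-th neighbor
    log_vd = (d / 2) * np.log(np.pi) - gammaln(d / 2 + 1)   # log volume of the unit d-ball
    return digamma(n) - digamma(k) + log_vd + d * np.mean(np.log(rho))

# MI via the entropy identity on toy correlated Gaussians
rng = np.random.default_rng(0)
x = rng.normal(size=(5000, 2))
t = x + 0.5 * rng.normal(size=(5000, 2))
mi_est = kl_entropy(x) + kl_entropy(t) - kl_entropy(np.hstack([x, t]))
print(mi_est)
```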
Mutual information estimation can be improved by using larger batch sizes, although this
increases memory requirements and may hurt generalization. Alternatively,
researchers have suggested employing surrogate measures for mutual information, such
as log-determinant mutual information (LDMI), based on second-order statistics (Erdogan,
2022; Ozsoy et al., 2022), which reflects linear dependence. Goldfeld and Greenewald (2021)
proposed the Sliced Mutual Information (SMI), defined as an average of MI terms between
one-dimensional projections of high-dimensional variables. SMI inherits many properties
of its classic counterpart. It can be estimated with optimal parametric error rates in all
dimensions by combining an MI estimator between scalar variables with an MC integrator
(Goldfeld and Greenewald, 2021). The k-SMI, introduced by Goldfeld et al. (2022), extends
the SMI by projecting to k-dimensional subspace, which relaxes the smoothness assumptions,
improves scalability, and enhances performance.
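A Monte Carlo sketch of SMI is shown below: random one-dimensional projections of both variables are drawn, a scalar k-NN MI estimator (scikit-learn's here) is applied to each pair of projections, and the results are averaged. The projection count and toy data are illustrative:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

def sliced_mi(x, y, n_projections=128, seed=0):
    """Monte Carlo sketch of Sliced Mutual Information between (N, dx) and (N, dy) samples."""
    rng = np.random.default_rng(seed)
    dx, dy = x.shape[1], y.shape[1]
    vals = []
    for _ in range(n_projections):
        theta = rng.normal(size=dx); theta /= np.linalg.norm(theta)   # random direction for X
        phi = rng.normal(size=dy); phi /= np.linalg.norm(phi)         # random direction for Y
        xp = (x @ theta).reshape(-1, 1)       # 1-D projection of X
        yp = y @ phi                          # 1-D projection of Y
        vals.append(mutual_info_regression(xp, yp, n_neighbors=3)[0])
    return float(np.mean(vals))

# toy usage: two correlated 10-dimensional variables
rng = np.random.default_rng(1)
x = rng.normal(size=(2000, 10))
y = x + rng.normal(size=(2000, 10))
print(sliced_mi(x, y))
```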
learning, predictive information H(Y |Z) measures the amount of information that can be
extracted from Z about Y given access to all decoders p(y|z) in the world. Recently, Xu
et al. (2020) introduced predictive V-information as an alternative formulation based on
realistic computational constraints.
Investigating energy-based models for self-supervised learning from both theoretical and
practical perspectives can open up numerous promising research directions. For instance, we
could directly apply tools developed for energy-based models and statistical machines to
optimize the model, such as Maximum Likelihood Training with MCMC (Younes, 1999),
score matching (Hyvärinen, 2006), denoising score matching (Song et al., 2020; Vincent,
2011), and score-based generation models (Song and Ermon, 2019).
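As an example of the kind of tool that could be borrowed, a denoising score matching objective can be written in a few lines; the network, noise level, and data below are placeholders rather than a recommended configuration:

```python
import torch
import torch.nn as nn

def dsm_loss(score_net, x, sigma=0.1):
    """Denoising score matching sketch: train the network to predict the score of the
    Gaussian-smoothed data, grad log q(x_tilde | x) = (x - x_tilde) / sigma^2."""
    noise = torch.randn_like(x) * sigma
    x_tilde = x + noise
    target = -noise / sigma ** 2                 # = (x - x_tilde) / sigma^2
    pred = score_net(x_tilde)
    return 0.5 * ((pred - target) ** 2).sum(dim=-1).mean()

# toy usage with a small MLP as the score network
score_net = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 2))
x = torch.randn(256, 2)
loss = dsm_loss(score_net, x)
loss.backward()
```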
7. Conclusion
In this study, we delved deeply into the concept of optimal representation in self-supervised
learning through the lens of information theory. We synthesized various approaches, high-
lighting their foundational assumptions and constraints, and integrated them into a unified
framework. Additionally, we explored the key information-theoretic terms that influence
these optimal representations and the methods for estimating them.
While supervised and unsupervised learning offer more direct access to relevant information,
self-supervised learning depends heavily on assumptions about the relationship between data
and downstream tasks. This reliance makes distinguishing between relevant and irrelevant
information considerably more challenging, necessitating further assumptions.
Despite these challenges, information theory stands out as a robust and versatile framework
for analysis and algorithmic development. This adaptable framework caters to a range of
learning paradigms and elucidates the inherent assumptions underpinning data and model
optimization.
With the rapid growth of datasets and the increasing expectations placed on models to
handle multiple downstream tasks, the traditional Multi-view assumption might become less
reliable. One significant challenge in self-supervised learning is the precise compression of
irrelevant information, especially when these assumptions are compromised.
Future research avenues might involve expanding the Multi-view framework to include more
views and tasks and deepening our understanding of information theory’s impact on facets
of deep learning, such as reinforcement learning and generative models.
In summary, information theory is a crucial tool in our quest to understand better and
optimize self-supervised learning models. By harnessing its principles, we can more adeptly
navigate the intricacies of deep neural network development, paving the way for creating
more effective models.
References
Alessandro Achille and Stefano Soatto. Emergence of invariance and disentanglement in
deep representations. The Journal of Machine Learning Research, 19(1):1947–1980, 2018.
Alessandro Achille, Matteo Rovere, and Stefano Soatto. Critical learning periods in deep
neural networks. arXiv preprint arXiv:1711.08856, 2017.
Mahbubul Alam, Manar D Samad, Lasitha Vidyaratne, Alexander Glandon, and Khan M
Iftekharuddin. Survey on deep neural networks in speech and vision systems. Neurocom-
puting, 417:302–321, 2020.
Alexander A Alemi, Ian Fischer, Joshua V Dillon, and Kevin Murphy. Deep variational
information bottleneck. arXiv preprint arXiv:1612.00410, 2016. URL http://arxiv.org/abs/1612.00410.
Rana Ali Amjad and Bernhard C Geiger. How (not) to train your neural network using the
information bottleneck principle. arXiv preprint arXiv:1802.09766, 2018.
Rana Ali Amjad and Bernhard Claus Geiger. Learning representations for neural network-
based classification using the information bottleneck principle. IEEE transactions on
pattern analysis and machine intelligence, 2019.
Galen Andrew, Raman Arora, Jeff Bilmes, and Karen Livescu. Deep canonical correlation
analysis. In International conference on machine learning, pages 1247–1255. PMLR, 2013.
Sercan Ö Arik and Tomas Pfister. Tabnet: Attentive interpretable tabular learning. In
Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 6679–6687,
2021.
Sanjeev Arora, Hrishikesh Khandeparkar, Mikhail Khodak, Orestis Plevrakis, and Nikunj
Saunshi. A theoretical analysis of contrastive unsupervised representation learning. arXiv
preprint arXiv:1902.09229, 2019.
Francis R Bach and Michael I Jordan. Kernel independent component analysis. Journal of
machine learning research, 3(Jul):1–48, 2002.
Francis R. Bach and Michael I. Jordan. Kernel independent component analysis. J. Mach.
Learn. Res., 3:1–48, 2003. ISSN 1532-4435. doi: 10.1162/153244303768966085.
URL https://doi.org/10.1162/153244303768966085.
Amir Bar, Xin Wang, Vadim Kantorov, Colorado J Reed, Roei Herzig, Gal Chechik, Anna
Rohrbach, Trevor Darrell, and Amir Globerson. Detreg: Unsupervised pretraining with
region priors for object detection. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, pages 14605–14615, 2022.
Suzanna Becker and Geoffrey E Hinton. Self-organizing neural network that discovers
surfaces in random-dot stereograms. Nature, 355(6356):161–163, 1992.
Mohamed Ishmael Belghazi, Aristide Baratin, Sai Rajeshwar, Sherjil Ozair, Yoshua Bengio,
R. Devon Hjelm, and Aaron C. Courville. Mutual information neural estimation. In
ICML, 2018a.
Mohamed Ishmael Belghazi, Aristide Baratin, Sai Rajeswar, Sherjil Ozair, Yoshua Bengio,
Aaron Courville, and R Devon Hjelm. Mine: mutual information neural estimation. arXiv
preprint arXiv:1801.04062, 2018b.
Ido Ben-Shaul, Ravid Shwartz-Ziv, Tomer Galanti, Shai Dekel, and Yann LeCun. Reverse
engineering self-supervised learning. arXiv preprint arXiv:2305.15614, 2023.
Yoshua Bengio and Yann LeCun. Scaling learning algorithms towards AI. In L. Bottou,
O. Chapelle, D. DeCoste, and J. Weston, editors, Large-Scale Kernel Machines. MIT Press,
2007.
Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review
and new perspectives. IEEE transactions on pattern analysis and machine intelligence, 35
(8):1798–1828, 2013.
David Berthelot, Nicholas Carlini, Ian Goodfellow, Nicolas Papernot, Avital Oliver, and
Colin A Raffel. Mixmatch: A holistic approach to semi-supervised learning. Advances in
Neural Information Processing Systems, 32, 2019.
Jane Bromley, Isabelle Guyon, Yann LeCun, Eduard Säckinger, and Roopak Shah. Signature
verification using a "siamese" time delay neural network. Advances in neural information
processing systems, 6, 1993.
Lars Buesing and Wolfgang Maass. A spiking neuron as information bottleneck. Neural
computation, 22(8):1961–1992, 2010.
Tian Cao, Vladimir Jojic, Shannon Modla, Debbie Powell, Kirk Czymmek, and Marc
Niethammer. Robust multimodal dictionary learning. In Kensaku Mori, Ichiro Sakuma,
Yoshinobu Sato, Christian Barillot, and Nassir Navab, editors, Medical Image Computing
and Computer-Assisted Intervention – MICCAI 2013, pages 259–266, Berlin, Heidelberg,
2013. Springer Berlin Heidelberg. ISBN 978-3-642-40811-3.
Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand
Joulin. Unsupervised learning of visual features by contrasting cluster assignments.
Advances in Neural Information Processing Systems, 33:9912–9924, 2020.
Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski,
and Armand Joulin. Emerging properties in self-supervised vision transformers. In
Proceedings of the IEEE/CVF International Conference on Computer Vision, pages
9650–9660, 2021.
Ivan Chelombiev, Conor Houghton, and Cian O’Donnell. Adaptive estimators show informa-
tion compression in deep neural networks. arXiv preprint arXiv:1902.09037, 2019.
Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework
for contrastive learning of visual representations. In International conference on machine
learning, pages 1597–1607. PMLR, 2020a.
Xinlei Chen and Kaiming He. Exploring simple siamese representation learning. In Proceed-
ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages
15750–15758, 2021.
Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. Improved baselines with momentum
contrastive learning. arXiv preprint arXiv:2003.04297, 2020b.
Sumit Chopra, Raia Hadsell, and Yann LeCun. Learning a similarity metric discriminatively,
with application to face verification. In 2005 IEEE Computer Society Conference on
Computer Vision and Pattern Recognition (CVPR’05), volume 1, pages 539–546. IEEE,
2005.
Thomas M Cover. Elements of information theory. John Wiley & Sons, 1999.
Luke Nicholas Darlow and Amos Storkey. What information does a resnet compress? arXiv
preprint arXiv:2003.06254, 2020.
Arthur P Dempster, Nan M Laird, and Donald B Rubin. Maximum likelihood from
incomplete data via the em algorithm. Journal of the Royal Statistical Society: Series B
(Methodological), 39(1):1–22, 1977.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-
training of deep bidirectional transformers for language understanding. arXiv preprint
arXiv:1810.04805, 2018.
Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using real nvp.
arXiv preprint arXiv:1605.08803, 2016.
Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini
Venugopalan, Kate Saenko, and Trevor Darrell. Long-term recurrent convolutional
networks for visual recognition and description. In Proceedings of the IEEE conference on
computer vision and pattern recognition, pages 2625–2634, 2015.
Monroe D Donsker and SR Srinivasa Varadhan. Asymptotic evaluation of certain markov
process expectations for large time, i. Communications on Pure and Applied Mathematics,
28(1):1–47, 1975.
Yann Dubois, Benjamin Bloem-Reddy, Karen Ullrich, and Chris J Maddison. Lossy com-
pression for lossless prediction. Advances in Neural Information Processing Systems, 34,
2021.
Adar Elad, Doron Haviv, Yochai Blau, and Tomer Michaeli. Direct validation of the
information bottleneck principle for deep nets. In Proceedings of the IEEE International
Conference on Computer Vision Workshops, 2019a.
Adar Elad, Doron Haviv, Yochai Blau, and Tomer Michaeli. The effectiveness of layer-by-layer
training using the information bottleneck principle, 2019b. URL https://openreview.net/forum?id=r1Nb5i05tX.
Gal Elidan and Nir Friedman. The information bottleneck em algorithm. arXiv preprint
arXiv:1212.2460, 2012.
Alper T Erdogan. An information maximization based blind source separation approach for
dependent and independent sources. In ICASSP 2022-2022 IEEE International Conference
on Acoustics, Speech and Signal Processing (ICASSP), pages 4378–4382. IEEE, 2022.
Deniz Erdogmus. Information theoretic learning: Renyi’s entropy and its applications to
adaptive system training. University of Florida, 2002.
Marco Federici, Anjan Dutta, Patrick Forré, Nate Kushman, and Zeynep Akata. Learn-
ing robust representations via multi-view information bottleneck. arXiv preprint
arXiv:2002.07017, 2020.
Ian Fischer. The conditional entropy bottleneck. Entropy, 22(9):999, 2020.
Nir Friedman, Ori Mosenzon, Noam Slonim, and Naftali Tishby. Multivariate information
bottleneck. arXiv preprint arXiv:1301.2270, 2013.
Shuyang Gao, Greg Ver Steeg, and Aram Galstyan. Efficient estimation of mutual information
for strongly dependent variables. In Artificial Intelligence and Statistics, pages 277–286,
2015.
Bernhard C Geiger. On information plane analyses of neural network classifiers–a review.
arXiv preprint arXiv:2003.09671, 2020.
Jonas Geiping, Micah Goldblum, Gowthami Somepalli, Ravid Shwartz-Ziv, Tom Goldstein,
and Andrew Gordon Wilson. How much data are augmentations worth? an investigation
into scaling laws, invariance, and implicit regularization. arXiv preprint arXiv:2210.06441,
2022.
Mathieu Germain, Karol Gregor, Iain Murray, and Hugo Larochelle. Made: Masked
autoencoder for distribution estimation. In International Conference on Machine Learning,
pages 881–889. PMLR, 2015.
Ziv Goldfeld and Kristjan Greenewald. Sliced mutual information: A scalable measure of
statistical dependence. Advances in Neural Information Processing Systems, 34:17567–
17578, 2021.
Ziv Goldfeld, Kristjan Greenewald, Theshani Nuradha, and Galen Reeves. k-sliced mu-
tual information: A quantitative study of scalability with dimension. arXiv preprint
arXiv:2206.08526, 2022.
Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. Book in preparation
for MIT Press, 2016. URL http://www.deeplearningbook.org.
Alex Graves. Generating sequences with recurrent neural networks. arXiv preprint
arXiv:1308.0850, 2013.
Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond,
Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad
Gheshlaghi Azar, et al. Bootstrap your own latent-a new approach to self-supervised
learning. Advances in Neural Information Processing Systems, 33:21271–21284, 2020.
Raia Hadsell, Sumit Chopra, and Yann LeCun. Dimensionality reduction by learning an
invariant mapping. In 2006 IEEE Computer Society Conference on Computer Vision and
Pattern Recognition (CVPR’06), volume 2, pages 1735–1742. IEEE, 2006.
Hanyuan Hang, Ingo Steinwart, Yunlong Feng, and Johan AK Suykens. Kernel density
estimation for dynamical systems. The Journal of Machine Learning Research, 19(1):
1260–1308, 2018.
David R. Hardoon, Sandor Szedmak, and John Shawe-Taylor. Canonical correlation analysis:
An overview with application to learning methods. Neural Computation, 16(12):2639–2664,
2004. doi: 10.1162/0899766042321814.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image
recognition. CoRR, abs/1512.03385, 2015.
Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for
unsupervised visual representation learning. In Proceedings of the IEEE/CVF conference
on computer vision and pattern recognition, pages 9729–9738, 2020.
Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked
autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition, pages 16000–16009, 2022.
Ron M Hecht, Elad Noor, and Naftali Tishby. Speaker recognition by gaussian information
bottleneck. In Tenth Annual Conference of the International Speech Communication
Association, 2009.
Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew
Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-vae: Learning basic visual
concepts with a constrained variational framework. In ICLR, 2017.
R Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Phil Bachman,
Adam Trischler, and Yoshua Bengio. Learning deep representations by mutual information
estimation and maximization. arXiv preprint arXiv:1808.06670, 2018.
Harold Hotelling. Relations between two sets of variates. Biometrika, 28(3/4):321–377, 1936.
ISSN 00063444. URL http://www.jstor.org/stable/2333955.
Zhenyu Huang, Joey Tianyi Zhou, Xi Peng, Changqing Zhang, Hongyuan Zhu, and Jiancheng
Lv. Multi-view spectral clustering network. In IJCAI, pages 2563–2569, 2019.
Patrick Huembeli, Juan Miguel Arrazola, Nathan Killoran, Masoud Mohseni, and Peter
Wittek. The physics of energy-based models. Quantum Machine Intelligence, 4(1):1–13,
2022.
Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence
and generalization in neural networks. Advances in neural information processing systems,
31, 2018.
Yangqing Jia, Mathieu Salzmann, and Trevor Darrell. Factorized latent spaces with struc-
tured sparsity. In J. Lafferty, C. Williams, J. Shawe-Taylor, R. Zemel, and A. Cu-
lotta, editors, Advances in Neural Information Processing Systems, volume 23. Cur-
ran Associates, Inc., 2010. URL https://proceedings.neurips.cc/paper/2010/file/
a49e9411d64ff53eccfdd09ad10a15b3-Paper.pdf.
Li Jing, Pascal Vincent, Yann LeCun, and Yuandong Tian. Understanding dimensional
collapse in contrastive self-supervised learning. arXiv preprint arXiv:2110.09348, 2021.
Jonathan Kahana and Yedid Hoshen. A contrastive objective for learning disentangled
representations. arXiv preprint arXiv:2203.11284, 2022.
Andrej Karpathy and Li Fei-Fei. Deep visual-semantic alignments for generating image
descriptions. In Proceedings of the IEEE conference on computer vision and pattern
recognition, pages 3128–3137, 2015.
Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint
arXiv:1312.6114, 2013.
Diederik P Kingma and Max Welling. An introduction to variational autoencoders. arXiv
preprint arXiv:1906.02691, 2019.
Durk P Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-
supervised learning with deep generative models. Advances in neural information processing
systems, 27, 2014.
Bernard Osgood Koopman. On distributions admitting a sufficient statistic. Transactions
of the American Mathematical society, 39(3):399–409, 1936.
Lyudmyla F Kozachenko and Nikolai N Leonenko. Sample estimate of the entropy of a
random vector. Problemy Peredachi Informatsii, 23(2):9–16, 1987.
Abhishek Kumar and Hal Daumé. A co-training approach for multi-view spectral clustering.
In Proceedings of the 28th international conference on machine learning (ICML-11), pages
393–400. Citeseer, 2011.
Samuli Laine and Timo Aila. Temporal ensembling for semi-supervised learning. arXiv
preprint arXiv:1610.02242, 2016.
Anders Boesen Lindbo Larsen, Søren Kaae Sønderby, Hugo Larochelle, and Ole Winther.
Autoencoding beyond pixels using a learned similarity metric. In International conference
on machine learning, pages 1558–1566. PMLR, 2016.
Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
Dong-Hyun Lee et al. Pseudo-label: The simple and efficient semi-supervised learning
method for deep neural networks. In Workshop on challenges in representation learning,
ICML, volume 3, page 896, 2013.
Honglak Lee, Alexis Battle, Rajat Raina, and Andrew Ng. Efficient sparse coding algorithms.
In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Pro-
cessing Systems, volume 19. MIT Press, 2006. URL https://proceedings.neurips.cc/
paper_files/paper/2006/file/2d71b2ae158c7c5912cc0bbde2bb9d95-Paper.pdf.
Jason D Lee, Qi Lei, Nikunj Saunshi, and Jiacheng Zhuo. Predicting what you already
know helps: Provable self-supervised learning. Advances in Neural Information Processing
Systems, 34, 2021a.
Kuang-Huei Lee, Anurag Arnab, Sergio Guadarrama, John Canny, and Ian Fischer. Com-
pressive visual representations. Advances in Neural Information Processing Systems, 34,
2021b.
Yingming Li, Ming Yang, and Zhongfei Zhang. A survey of multi-view representation
learning. IEEE transactions on knowledge and data engineering, 31(10):1863–1883, 2018.
Ralph Linsker. Self-organization in a perceptual network. Computer, 21(3):105–117, 1988.
Shiming Liu, Yifan Xia, Zhusheng Shi, Hui Yu, Zhiqiang Li, and Jianguo Lin. Deep learning
in sheet metal bending with a novel theory-guided deep neural network. IEEE/CAA
Journal of Automatica Sinica, 8(3):565–581, 2021a.
Weifeng Liu, Dacheng Tao, Jun Cheng, and Yuanyan Tang. Multiview hessian discriminative
sparse coding for image annotation. Computer Vision and Image Understanding, 118:
50–60, 2014. ISSN 1077-3142. doi: https://doi.org/10.1016/j.cviu.2013.03.007. URL
https://www.sciencedirect.com/science/article/pii/S1077314213001550.
Xiao Liu, Fanjin Zhang, Zhenyu Hou, Li Mian, Zhaoyu Wang, Jing Zhang, and Jie Tang.
Self-supervised learning: Generative or contrastive. IEEE Transactions on Knowledge
and Data Engineering, 2021b.
Zhengzheng Lou, Yangdong Ye, and Xiaoqiang Yan. The multi-feature information bottleneck
with application to unsupervised image categorization. In Twenty-Third International
Joint Conference on Artificial Intelligence, 2013.
Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, Ian Goodfellow, and Brendan Frey.
Adversarial autoencoders. arXiv preprint arXiv:1511.05644, 2015.
Junhua Mao, Wei Xu, Yi Yang, Jiang Wang, Zhiheng Huang, and Alan Yuille. Deep caption-
ing with multimodal recurrent neural networks (m-rnn). arXiv preprint arXiv:1412.6632,
2014.
Ishan Misra and Laurens van der Maaten. Self-supervised learning of pretext-invariant
representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition, pages 6707–6717, 2020.
Takeru Miyato, Shin-ichi Maeda, Masanori Koyama, and Shin Ishii. Virtual adversarial
training: a regularization method for supervised and semi-supervised learning. IEEE
transactions on pattern analysis and machine intelligence, 41(8):1979–1993, 2018.
Charlie Nash, Nate Kushman, and Christopher KI Williams. Inverting supervised repre-
sentations with autoregressive neural density models. arXiv preprint arXiv:1806.00400,
2018.
Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee, and Andrew Y.
Ng. Multimodal deep learning. In Proceedings of the 28th International Conference on
International Conference on Machine Learning, ICML’11, page 689–696, Madison, WI,
USA, 2011. Omnipress. ISBN 9781450306195.
Morteza Noshad and Alfred O Hero III. Scalable mutual information estimation using
dependence graphs. arXiv preprint arXiv:1801.09125, 2018.
Serdar Ozsoy, Shadi Hamdan, Sercan Arik, Deniz Yuret, and Alper Erdogan. Self-supervised
learning with an information maximization criterion. Advances in Neural Information
Processing Systems, 35:35240–35253, 2022.
Amichai Painsky and Gregory W Wornell. On the universality of the logistic loss function.
arXiv preprint arXiv:1805.03804, 2018.
Stephanie E Palmer, Olivier Marre, Michael J Berry, and William Bialek. Predictive
information in a sensory population. Proceedings of the National Academy of Sciences,
112(22):6908–6913, 2015.
Liam Paninski. Estimation of entropy and mutual information. Neural Computation, 15(6):
1191–1253, 2003. ISSN 0899-7667. doi: 10.1162/089976603321780272.
Ankit Pensia, Varun Jog, and Po-Ling Loh. Generalization error bounds for noisy, iterative
algorithms. In 2018 IEEE International Symposium on Information Theory (ISIT), pages
546–550. IEEE, 2018.
Zoe Piran, Ravid Shwartz-Ziv, and Naftali Tishby. The dual information bottleneck. arXiv
preprint arXiv:2006.04641, 2020.
Shi Pu, Yijiang He, Zheng Li, and Mao Zheng. Multimodal topic learning for video
recommendation. arXiv preprint arXiv:2010.13373, 2020.
Danilo Rezende and Shakir Mohamed. Variational inference with normalizing flows. In
International conference on machine learning, pages 1530–1538. PMLR, 2015.
Brian C Ross. Mutual information between discrete and continuous data sets. PLoS ONE, 9
(2):e87357, 2014. doi: 10.1371/journal.pone.0087357. URL https://doi.org/10.1371/
journal.pone.0087357.
Aurko Roy, Ashish Vaswani, Arvind Neelakantan, and Niki Parmar. Theory and experiments
on vector quantized autoencoders. arXiv preprint arXiv:1805.11063, 2018.
Daniel Russo and James Zou. How much does your data exploration overfit? controlling bias
via information usage. IEEE Transactions on Information Theory, 66(1):302–323, 2019.
Andrew M Saxe, Yamini Bansal, Joel Dapello, Madhu Advani, Artemy Kolchinsky, Bren-
dan D Tracey, and David D Cox. On the information bottleneck theory of deep learning.
Journal of Statistical Mechanics: Theory and Experiment, 2019(12):124020, 2019.
Ohad Shamir, Sivan Sabato, and Naftali Tishby. Learning and generalization with the informa-
tion bottleneck. Theoretical Computer Science, 411(29):2696 – 2711, 2010. ISSN 0304-3975.
doi: https://doi.org/10.1016/j.tcs.2010.04.006. URL http://www.sciencedirect.com/
science/article/pii/S030439751000201X. Algorithmic Learning Theory (ALT 2008).
Ravid Shwartz-Ziv and Amitai Armon. Tabular data: Deep learning is not all you need.
Information Fusion, 81:84–90, 2022.
Ravid Shwartz-Ziv and Naftali Tishby. Opening the black box of deep neural networks via
information. arXiv preprint arXiv:1703.00810, 2017.
Ravid Shwartz-Ziv, Amichai Painsky, and Naftali Tishby. Representation compression and
generalization in deep neural networks, 2018.
Ravid Shwartz-Ziv, Micah Goldblum, Hossein Souri, Sanyam Kapoor, Chen Zhu, Yann
LeCun, and Andrew G Wilson. Pre-train your loss: Easy bayesian transfer learning with
informative priors. Advances in Neural Information Processing Systems, 35:27706–27715,
2022b.
Ravid Shwartz-Ziv, Randall Balestriero, Kenji Kawaguchi, Tim GJ Rudner, and Yann LeCun.
An information-theoretic perspective on variance-invariance-covariance regularization.
arXiv preprint arXiv:2303.00633, 2023.
Kihyuk Sohn, David Berthelot, Nicholas Carlini, Zizhao Zhang, Han Zhang, Colin A Raffel,
Ekin Dogus Cubuk, Alexey Kurakin, and Chun-Liang Li. Fixmatch: Simplifying semi-
supervised learning with consistency and confidence. Advances in Neural Information
Processing Systems, 33:596–608, 2020.
Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data
distribution. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and
R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32.
Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper/2019/
file/3001ef257407d5a371a96dcd947c7d93-Paper.pdf.
Yang Song and Diederik P Kingma. How to train your energy-based models. arXiv preprint
arXiv:2101.03288, 2021.
Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon,
and Ben Poole. Score-based generative modeling through stochastic differential equations.
arXiv preprint arXiv:2011.13456, 2020.
Karthik Sridharan and Sham Kakade. An information theoretic framework for multi-view
learning. In Proceedings of the 21st Annual Conference on Learning Theory (COLT), 2008.
Nitish Srivastava and Ruslan Salakhutdinov. Multimodal learning with deep boltzmann
machines. Journal of Machine Learning Research, 15(84):2949–2980, 2014. URL http:
//jmlr.org/papers/v15/srivastava14b.html.
Thomas Steinke and Lydia Zakynthinou. Reasoning about generalization via conditional
mutual information. In Conference on Learning Theory, pages 3437–3452. PMLR, 2020.
Liang Sun, Betul Ceran, and Jieping Ye. A scalable two-stage approach for a class of dimen-
sionality reduction techniques. In Proceedings of the 16th ACM SIGKDD international
conference on Knowledge discovery and data mining, pages 313–322, 2010.
Shiliang Sun. A survey of multi-view machine learning. Neural Computing and Applications,
23:2031–2038, 2013.
Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive multiview coding. In European
conference on computer vision, pages 776–794. Springer, 2020a.
Yonglong Tian, Chen Sun, Ben Poole, Dilip Krishnan, Cordelia Schmid, and Phillip Isola.
What makes for good views for contrastive learning? Advances in Neural Information
Processing Systems, 33:6827–6839, 2020b.
Naftali Tishby, Fernando C. Pereira, and William Bialek. The information bottleneck method. In The 37th
annual Allerton Conference on Communication, Control, and Computing, pages 368–377, 1999a.
URL https://arxiv.org/abs/physics/0004057.
Naftali Tishby, Fernando C. Pereira, and William Bialek. The information bottleneck method.
In Proceedings of the 37-th Annual Allerton Conference on Communication, Control and
Computing, 1999b.
Yao-Hung Hubert Tsai, Yue Wu, Ruslan Salakhutdinov, and Louis-Philippe Morency. Self-
supervised learning from a multi-view perspective. arXiv preprint arXiv:2006.05576,
2020.
Michael Tschannen, Josip Djolonga, Paul K Rubenstein, Sylvain Gelly, and Mario Lu-
cic. On mutual information maximization for representation learning. arXiv preprint
arXiv:1907.13625, 2019.
Richard Turner and Maneesh Sahani. A maximum-likelihood interpretation for slow feature
analysis. Neural computation, 19(4):1022–1038, 2007.
Talip Ucar, Ehsan Hajiramezanali, and Lindsay Edwards. Subtab: Subsetting features of
tabular data for self-supervised representation learning. Advances in Neural Information
Processing Systems, 34:18853–18865, 2021.
Yiğit Uğur, George Arvanitakis, and Abdellatif Zaidi. Variational information bottleneck
for unsupervised clustering: Deep gaussian mixture embedding. Entropy, 22(2):213, 2020.
Aaron Van den Oord, Nal Kalchbrenner, Lasse Espeholt, Oriol Vinyals, Alex Graves, et al.
Conditional image generation with pixelcnn decoders. Advances in neural information
processing systems, 29, 2016.
Aaron Van den Oord, Oriol Vinyals, et al. Neural discrete representation learning. Advances
in neural information processing systems, 30, 2017.
Matías Vera, Pablo Piantanida, and Leonardo Rey Vega. The role of information complexity
and randomization in representation learning. arXiv preprint arXiv:1802.05355, 2018.
Pascal Vincent. A connection between score matching and denoising autoencoders. Neural
Computation, 23(7):1661–1674, 2011. doi: 10.1162/NECO_a_00142.
Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting
and composing robust features with denoising autoencoders. In Proceedings of the 25th
international conference on Machine learning, pages 1096–1103, 2008.
Slava Voloshynovskiy, Olga Taran, Mouad Kondah, Taras Holotyak, and Danilo Rezende.
Variational information bottleneck for semi-supervised classification. Entropy, 22(9), 2020.
ISSN 1099-4300. doi: 10.3390/e22090943. URL https://www.mdpi.com/1099-4300/22/
9/943.
Haoqing Wang, Xun Guo, Zhi-Hong Deng, and Yan Lu. Rethinking minimal sufficient
representation in contrastive learning. arXiv preprint arXiv:2203.07004, 2022.
Qi Wang, Claire Boudreau, Qixing Luo, Pang-Ning Tan, and Jiayu Zhou. Deep multi-view
information bottleneck. In Proceedings of the 2019 SIAM International Conference on Data
Mining (SDM), pages 37–45, 2019. doi: 10.1137/1.9781611975673.5. URL
https://epubs.siam.org/doi/abs/10.1137/1.9781611975673.5.
Tongzhou Wang and Phillip Isola. Understanding contrastive representation learning through
alignment and uniformity on the hypersphere. In International Conference on Machine
Learning, pages 9929–9939. PMLR, 2020.
Weiran Wang, Raman Arora, Karen Livescu, and Jeff Bilmes. On deep multi-view repre-
sentation learning. In Proceedings of the 32nd International Conference on International
Conference on Machine Learning - Volume 37, ICML’15, page 1083–1092. JMLR.org,
2015.
Florian Wenzel, Kevin Roth, Bastiaan S Veeling, Jakub Świątkowski, Linh Tran, Stephan
Mandt, Jasper Snoek, Tim Salimans, Rodolphe Jenatton, and Sebastian Nowozin.
How good is the bayes posterior in deep neural networks really? arXiv preprint
arXiv:2002.02405, 2020.
Laurenz Wiskott and Terrence J. Sejnowski. Slow feature analysis: Unsupervised learning of
invariances. Neural Computation, 14(4):715–770, 2002. doi: 10.1162/089976602317318938.
Qizhe Xie, Zihang Dai, Eduard Hovy, Thang Luong, and Quoc Le. Unsupervised data
augmentation for consistency training. Advances in Neural Information Processing Systems,
33:6256–6268, 2020.
Chang Xu, Dacheng Tao, and Chao Xu. Large-margin multi-view information bottleneck.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 36:1559–1572, 2014.
Yilun Xu, Shengjia Zhao, Jiaming Song, Russell Stewart, and Stefano Ermon. A theory of
usable information under computational constraints. arXiv preprint arXiv:2002.10689,
2020.
Zhe Xue, Junping Du, Dawei Du, and Siwei Lyu. Deep low-rank subspace ensemble
for multi-view clustering. Information Sciences, 482:210–227, 2019. ISSN 0020-0255.
doi: https://doi.org/10.1016/j.ins.2019.01.018. URL https://www.sciencedirect.com/
science/article/pii/S0020025519300271.
Xiaoqiang Yan, Yangdong Ye, and Zhengzheng Lou. Unsupervised video categorization
based on multivariate information bottleneck method. Knowledge-Based Systems, 84:
34–45, 2015.
Xiaoqiang Yan, Shizhe Hu, Yiqiao Mao, Yangdong Ye, and Hui Yu. Deep multi-view
learning methods: A review. Neurocomputing, 448:106–129, 2021. ISSN 0925-2312. doi:
https://doi.org/10.1016/j.neucom.2021.03.090. URL https://www.sciencedirect.com/
science/article/pii/S0925231221004768.
Jure Zbontar, Li Jing, Ishan Misra, Yann LeCun, and Stéphane Deny. Barlow twins: Self-
supervised learning via redundancy reduction. In International Conference on Machine
Learning, pages 12310–12320. PMLR, 2021.
Xiaohua Zhai, Avital Oliver, Alexander Kolesnikov, and Lucas Beyer. S4l: Self-supervised
semi-supervised learning. In Proceedings of the IEEE/CVF International Conference on
Computer Vision, pages 1476–1485, 2019.
Zhilu Zhang and Mert Sabuncu. Generalized cross entropy loss for training deep neural
networks with noisy labels. Advances in neural information processing systems, 31, 2018.
Handong Zhao, Zhengming Ding, and Yun Fu. Multi-view clustering via deep matrix
factorization. In Thirty-first AAAI conference on artificial intelligence, 2017a.
Jing Zhao, Xijiong Xie, Xin Xu, and Shiliang Sun. Multi-view learning overview: Recent
progress and new challenges. Information Fusion, 38:43–54, 2017b. ISSN 1566-2535. doi:
https://doi.org/10.1016/j.inffus.2017.02.007. URL https://www.sciencedirect.com/
science/article/pii/S1566253516302032.
Shengjia Zhao, Jiaming Song, and Stefano Ermon. Infovae: Information maximizing
variational autoencoders. arXiv preprint arXiv:1706.02262, 2017c.
Roland S Zimmermann, Yash Sharma, Steffen Schneider, Matthias Bethge, and Wieland
Brendel. Contrastive learning inverts the data generating process. In International
Conference on Machine Learning, pages 12979–12990. PMLR, 2021.