Representation Learning: A Review and New Perspectives
$h = f(x) = \operatorname{argmin}_h \|x - Wh\|_2^2 + \lambda \|h\|_1,$   (2)
Learning the dictionary W can be accomplished by optimizing
the following training criterion with respect to W:
$\mathcal{J}_{\mathrm{SC}} = \sum_t \|x^{(t)} - Wh^{(t)}\|_2^2,$   (3)
where $x^{(t)}$ is the $t$-th example and $h^{(t)}$ is the corresponding sparse code determined by Eq. 2. $W$ is usually constrained to have unit-norm columns (because one can arbitrarily exchange scaling of column $i$ with scaling of $h^{(t)}_i$, such a constraint is necessary for the L1 penalty to have any effect).
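As an illustration of how Eqs. 2-3 are typically used together, the following is a minimal numpy sketch of the alternating scheme: an ISTA-style inner loop approximates the L1-regularized code of Eq. 2, and a gradient step on Eq. 3 updates the dictionary, followed by the unit-norm column constraint. The function names, step sizes and the choice of ISTA are illustrative assumptions, not the specific algorithm of the works cited here.

```python
import numpy as np

def sparse_code(x, W, lam, n_steps=100):
    """Approximate h = argmin_h ||x - W h||_2^2 + lam * ||h||_1 (Eq. 2) by ISTA."""
    L = 2.0 * np.linalg.norm(W, 2) ** 2          # Lipschitz constant of the quadratic term
    h = np.zeros(W.shape[1])
    for _ in range(n_steps):
        grad = 2.0 * W.T @ (W @ h - x)           # gradient of ||x - W h||^2 w.r.t. h
        z = h - grad / L
        h = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)   # soft-thresholding
    return h

def dictionary_step(X, W, lam, lr=0.1):
    """One update of W on the criterion J_SC of Eq. 3, holding the codes fixed."""
    H = np.stack([sparse_code(x, W, lam) for x in X])            # codes, one row per example
    grad_W = -2.0 * (X - H @ W.T).T @ H / len(X)                 # d J_SC / d W (averaged)
    W = W - lr * grad_W
    return W / np.maximum(np.linalg.norm(W, axis=0, keepdims=True), 1e-8)  # unit-norm columns
```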
The probabilistic interpretation of sparse coding differs from
that of PCA, in that instead of a Gaussian prior on the latent
random variable h, we use a sparsity inducing Laplace prior
(corresponding to an L1 penalty):
$p(h) = \prod_{i}^{d_h} \frac{\lambda}{2} \exp(-\lambda |h_i|)$
$p(x \mid h) = \mathcal{N}(x;\, Wh + \mu_x,\, \sigma_x^2 I).$   (4)
In the case of sparse coding, because we will seek a sparse representation (i.e., one with many features set to exactly zero), we will be interested in recovering the MAP (maximum a posteriori) value of h, i.e. $h^* = \operatorname{argmax}_h p(h \mid x)$, rather than its expected value $E[h \mid x]$. Under this interpretation, dictionary learning proceeds as maximizing the likelihood of the data given these MAP values of $h^*$: $\operatorname{argmax}_W \prod_t p(x^{(t)} \mid h^{*(t)})$ subject to the norm constraint on W. Note that this parameter learning scheme, subject to the MAP values of the latent h, is not standard practice in the probabilistic graphical model literature. Typically the likelihood of the data $p(x) = \sum_h p(x \mid h)\, p(h)$ is maximized directly. In the presence of latent variables, expectation maximization is employed, where the parameters are optimized with respect to the marginal likelihood, i.e., summing or integrating the joint log-likelihood over all values of the latent variables under their posterior $P(h \mid x)$, rather than considering only the single MAP value of h. The theoretical properties of this form of parameter learning are not yet well understood, but it seems to work well in practice (e.g. k-Means vs Gaussian mixture models and Viterbi training for HMMs). Note also that the interpretation of sparse coding as MAP estimation can be questioned (Gribonval, 2011): even though the L1 penalty can be interpreted as a log-prior, other Bayesian interpretations are compatible with the training criterion.
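For concreteness, the link between the probabilistic model of Eq. 4 and the optimization view of Eq. 2 can be made explicit; the short derivation below assumes the input has been centered (so that $\mu_x$ can be dropped) and treats $\lambda$ and $\sigma_x$ as fixed:

$h^{*} = \operatorname{argmax}_h p(h \mid x) = \operatorname{argmax}_h \left[ \log p(x \mid h) + \log p(h) \right] = \operatorname{argmin}_h \frac{1}{2\sigma_x^2} \|x - Wh\|_2^2 + \lambda \|h\|_1,$

which, after multiplying through by $2\sigma_x^2$, is exactly the criterion of Eq. 2 with the penalty coefficient rescaled to $2\sigma_x^2 \lambda$.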
Sparse coding is an excellent example of the power of explaining away. Even with a very overcomplete dictionary, the MAP inference process used in sparse coding to find h can pick out the most appropriate bases and zero the others, despite them having a high degree of correlation with the input. This property arises naturally in directed graphical models such as sparse coding and is entirely owing to the explaining away effect. It is not seen in commonly used undirected probabilistic models such as the RBM, nor is it seen in parametric feature encoding methods such as auto-encoders. The trade-off is that, compared to methods such as RBMs and auto-encoders, inference in sparse coding involves an extra inner-loop of optimization to find h.
Undirected graphical models such as the Boltzmann machine instead parametrize the joint distribution over the visible and hidden variables through a product of clique potentials:
$p(x, h) = \frac{1}{Z_\theta} \prod_i \psi_i(x) \prod_j \eta_j(h) \prod_k \nu_k(x, h)$   (5)
where $\psi_i(x)$, $\eta_j(h)$ and $\nu_k(x, h)$ are the clique potentials describing the interactions between the visible elements, between the hidden variables, and those interactions between the visible and hidden variables, respectively. The partition function $Z_\theta$ ensures that the distribution is normalized. Constraining the clique potentials to be positive gives the Boltzmann distribution,
$p(x, h) = \frac{1}{Z_\theta} \exp\left(-E_\theta(x, h)\right),$   (6)
where $E_\theta(x, h)$ is the energy function. For the Boltzmann machine with binary units, the energy function is
$E^{\mathrm{BM}}_\theta(x, h) = -\frac{1}{2} x^T U x - \frac{1}{2} h^T V h - x^T W h - b^T x - d^T h,$   (7)
where $\theta = \{U, V, W, b, d\}$ are the model parameters which
respectively encode the visible-to-visible interactions, the
hidden-to-hidden interactions, the visible-to-hidden interac-
tions, the visible self-connections, and the hidden self-
connections (called biases). To avoid over-parametrization, the
diagonals of U and V are set to zero.
The Boltzmann machine energy function specifies the probability distribution over $[x, h]$, via the Boltzmann distribution, Eq. 6, with the partition function $Z_\theta$ given by:
$Z_\theta = \sum_{x_1=0}^{x_1=1} \cdots \sum_{x_{d_x}=0}^{x_{d_x}=1} \; \sum_{h_1=0}^{h_1=1} \cdots \sum_{h_{d_h}=0}^{h_{d_h}=1} \exp\left(-E^{\mathrm{BM}}_\theta(x, h; \theta)\right).$   (8)
This joint probability distribution gives rise to the set of
conditional distributions of the form:
$P(h_i \mid x, h_{\setminus i}) = \operatorname{sigmoid}\left( \sum_j W_{ji} x_j + \sum_{i' \neq i} V_{ii'} h_{i'} + d_i \right)$   (9)
$P(x_j \mid h, x_{\setminus j}) = \operatorname{sigmoid}\left( \sum_i W_{ji} h_i + \sum_{j' \neq j} U_{jj'} x_{j'} + b_j \right).$   (10)
In general, inference in the Boltzmann machine is intractable.
For example, computing the conditional probability of $h_i$ given the visibles, $P(h_i \mid x)$, requires marginalizing over the rest of the hiddens, which implies evaluating a sum with $2^{d_h - 1}$ terms:
$P(h_i \mid x) = \sum_{h_1=0}^{h_1=1} \cdots \sum_{h_{i-1}=0}^{h_{i-1}=1} \; \sum_{h_{i+1}=0}^{h_{i+1}=1} \cdots \sum_{h_{d_h}=0}^{h_{d_h}=1} P(h \mid x)$   (11)
However with some judicious choices in the pattern of inter-
actions between the visible and hidden units, more tractable
subsets of the model family are possible, as we discuss next.
Restricted Boltzmann Machines (RBMs). The RBM is likely the most popular subclass of Boltzmann machine (Smolensky, 1986). It is defined by restricting the interactions in the Boltzmann energy function, in Eq. 7, to only those between h and x, i.e. $E^{\mathrm{RBM}}_\theta$ is $E^{\mathrm{BM}}_\theta$ with $U = 0$ and $V = 0$. As such, the RBM can be said to form a bipartite graph with the visibles and the hiddens forming two layers of vertices in the graph (and no connection between units of the same layer). With this restriction, the RBM possesses the useful property that the conditional distribution over the hidden units factorizes given the visibles:
$P(h \mid x) = \prod_i P(h_i \mid x)$
$P(h_i = 1 \mid x) = \operatorname{sigmoid}\left( \sum_j W_{ji} x_j + d_i \right).$   (12)
Likewise, the conditional distribution over the visible units
given the hiddens also factorizes:
$P(x \mid h) = \prod_j P(x_j \mid h)$
$P(x_j = 1 \mid h) = \operatorname{sigmoid}\left( \sum_i W_{ji} h_i + b_j \right).$   (13)
This makes inference readily tractable in RBMs. For example, the RBM feature representation is taken to be the set of posterior marginals $P(h_i \mid x)$, which, given the conditional independence described in Eq. 12, are immediately available.
Note that this is in stark contrast to the situation with popular
directed graphical models for unsupervised feature extraction,
where computing the posterior probability is intractable.
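A minimal numpy sketch of the factorized conditionals of Eqs. 12-13 is given below; the weight shapes, the small random initialization and the helper names are illustrative assumptions.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Illustrative RBM with d_x visible and d_h hidden binary units.
rng = np.random.default_rng(0)
d_x, d_h = 6, 4
W = 0.1 * rng.standard_normal((d_x, d_h))   # visible-to-hidden weights
b = np.zeros(d_x)                            # visible biases
d = np.zeros(d_h)                            # hidden biases

def hidden_given_visible(x):
    """Factorized posterior P(h_i = 1 | x) of Eq. 12 -- the RBM feature representation."""
    return sigmoid(x @ W + d)

def visible_given_hidden(h):
    """Factorized conditional P(x_j = 1 | h) of Eq. 13."""
    return sigmoid(h @ W.T + b)

x = rng.integers(0, 2, size=d_x)
features = hidden_given_visible(x)                        # posterior marginals, closed form
h_sample = (rng.random(d_h) < features).astype(float)     # one sample from P(h | x)
```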
Importantly, the tractability of the RBM does not extend to its partition function, which still involves summing an exponential number of terms. It does imply however that we can limit the number of terms to $\min\{2^{d_x}, 2^{d_h}\}$. Usually this is
still an unmanageable number of terms and therefore we must
resort to approximate methods to deal with its estimation.
It is difficult to overstate the impact the RBM has had on the fields of unsupervised feature learning and deep learning. It has been used in a truly impressive variety of applications, including fMRI image classification (Schmah et al., 2009), motion and spatial transformations (Taylor and Hinton, 2009; Memisevic and Hinton, 2010), collaborative filtering (Salakhutdinov et al., 2007) and natural image modeling (Ranzato and Hinton, 2010; Courville et al., 2011b).
6.3 Generalizations of the RBM to Real-valued data
Important progress has been made in the last few years in defining generalizations of the RBM that better capture real-valued data, in particular real-valued image data, by better modeling the conditional covariance of the input pixels. The standard RBM, as discussed above, is defined with both binary visible variables $v \in \{0, 1\}$ and binary latent variables $h \in \{0, 1\}$. The tractability of inference and learning in the RBM has inspired many authors to extend it, via modifications of its energy function, to model other kinds of data distributions. In particular, there have been multiple attempts to develop RBM-type models of real-valued data, where $x \in \mathbb{R}^{d_x}$. The most
straightforward approach to modeling real-valued observations
within the RBM framework is the so-called Gaussian RBM
(GRBM) where the only change in the RBM energy function is to the visible units' bias, by adding a bias term that is quadratic in the visible units x. While it probably remains
the most popular way to model real-valued data within the
RBM framework, Ranzato and Hinton (2010) suggest that the
GRBM has proved to be a somewhat unsatisfactory model of
natural images. The trained features typically do not represent
sharp edges that occur at object boundaries and lead to latent
representations that are not particularly useful features for
classification tasks. Ranzato and Hinton (2010) argue that
the failure of the GRBM to adequately capture the statistical
structure of natural images stems from the exclusive use of the
model capacity to capture the conditional mean at the expense
of the conditional covariance. Natural images, they argue, are
chiefly characterized by the covariance of the pixel values,
not by their absolute values. This point is supported by the
common use of preprocessing methods that standardize the
global scaling of the pixel values across images in a dataset
or across the pixel values within each image.
These kinds of concerns about the ability of the GRBM to model natural image data have led to the development of alternative RBM-based models that each attempt to take on this objective of better modeling non-diagonal conditional covariances. Ranzato and Hinton (2010) introduced the mean and covariance RBM (mcRBM). Like the GRBM, the mcRBM is a
2-layer Boltzmann machine that explicitly models the visible
units as Gaussian distributed quantities. However unlike the
GRBM, the mcRBM uses its hidden layer to independently
parametrize both the mean and covariance of the data through
two sets of hidden units. The mcRBM is a combination of the
covariance RBM (cRBM) (Ranzato et al., 2010a), that models
the conditional covariance, with the GRBM that captures the
conditional mean. While the GRBM has shown considerable
potential as the basis of a highly successful phoneme recogni-
tion system (Dahl et al., 2010), it seems that due to difficulties in training the mcRBM, the model has been largely superseded by the mPoT model. The mPoT model (mean-product of Student's t-distributions model) (Ranzato et al., 2010b) is a combination of the GRBM and the product of Student's t-distributions model (Welling et al., 2003). It is an energy-based
model where the conditional distribution over the visible units
conditioned on the hidden variables is a multivariate Gaussian
(non-diagonal covariance) and the complementary conditional distribution over the hidden variables given the visibles is a set of independent Gamma distributions.
The PoT model has recently been generalized to the mPoT
model (Ranzato et al., 2010b) to include nonzero Gaussian
means by the addition of GRBM-like hidden units, similarly to
how the mcRBM generalizes the cRBM. The mPoT model has
been used to synthesize large-scale natural images (Ranzato
et al., 2010b) that show large-scale features and shadowing
structure. It has been used to model natural textures (Kivinen
and Williams, 2012) in a tiled-convolution configuration (see
section 11.2).
Another recently introduced RBM-based model with the
objective of having the hidden units encode both the mean
and covariance information is the spike-and-slab Restricted
Boltzmann Machine (ssRBM) (Courville et al., 2011a,b).
The ssRBM is defined as having both a real-valued slab
variable and a binary spike variable associated with each
unit in the hidden layer. The ssRBM has been demonstrated
as a feature learning and extraction scheme in the context
of CIFAR-10 object classification (Krizhevsky and Hinton,
2009) from natural images and has performed well in the
role (Courville et al., 2011a,b). When trained convolutionally
(see Section 11.2) on full CIFAR-10 natural images, the model
demonstrated the ability to generate natural image samples
that seem to capture the broad statistical structure of natural
images better than previous parametric generative models, as
illustrated with the samples of Figure 2.
The mcRBM, mPoT and ssRBM each set out to model
real-valued data such that the hidden units encode not only
the conditional mean of the data but also its conditional
covariance. Other than differences in the training schemes, the
most significant difference between these models is how they
encode their conditional covariance. While the mcRBM and
the mPoT use the activation of the hidden units to enforce con-
straints on the covariance of x, the ssRBM uses the hidden unit
to pinch the precision matrix along the direction specified by
the corresponding weight vector. These two ways of modeling
Fig. 2. (Top) Samples from convolutionally trained µ-ssRBM from Courville et al. (2011b). (Bottom) Images in CIFAR-10 training set closest (L2 distance with contrast normalized training images) to corresponding model samples on top. The model does not appear to be overfitting particular training examples.
conditional covariance diverge when the dimensionality of the
hidden layer is significantly different from that of the input.
In the overcomplete setting, sparse activation with the ssRBM
parametrization permits variance only in the select directions
of the sparsely activated hidden units. This is a property the
ssRBM shares with sparse coding models (Olshausen and
Field, 1996; Grosse et al., 2007). On the other hand, in
the case of the mPoT or mcRBM, an overcomplete set of
constraints on the covariance implies that capturing arbitrary
covariance along a particular direction of the input requires
decreasing potentially all constraints with positive projection
in that direction. This perspective would suggest that the mPoT
and mcRBM do not appear to be well suited to provide a sparse
representation in the overcomplete setting.
6.4 RBM parameter estimation
Many of the RBM training methods we discuss here are ap-
plicable to more general undirected graphical models, but are
particularly practical in the RBM setting. Freund and Haussler
(1994) proposed a learning algorithm for harmoniums (RBMs)
based on projection pursuit. Contrastive Divergence (Hinton,
1999; Hinton et al., 2006) has been used most often to train
RBMs, and many recent papers use Stochastic Maximum
Likelihood (Younes, 1999; Tieleman, 2008).
As discussed in Sec. 6.1, in training probabilistic models
parameters are typically adapted in order to maximize the like-
lihood of the training data (or equivalently the log-likelihood,
or its penalized version, which adds a regularization term).
With T training examples, the log likelihood is given by:
$\sum_{t=1}^{T} \log P(x^{(t)}; \theta) = \sum_{t=1}^{T} \log \sum_{h \in \{0,1\}^{d_h}} P(x^{(t)}, h; \theta).$   (14)
Gradient-based optimization requires its gradient, which, for Boltzmann machines, is given by:
$\frac{\partial}{\partial \theta_i} \sum_{t=1}^{T} \log p(x^{(t)}) = - \sum_{t=1}^{T} E_{p(h \mid x^{(t)})}\left[ \frac{\partial}{\partial \theta_i} E^{\mathrm{BM}}_\theta(x^{(t)}, h) \right] + \sum_{t=1}^{T} E_{p(x, h)}\left[ \frac{\partial}{\partial \theta_i} E^{\mathrm{BM}}_\theta(x, h) \right],$   (15)
where we have the expectations with respect to $p(h^{(t)} \mid x^{(t)})$ in the clamped condition (also called the positive phase), and over the full joint $p(x, h)$ in the unclamped condition (also called the negative phase). Intuitively, the gradient acts to locally move the model distribution (the negative phase distribution) toward the data distribution (positive phase distribution), by pushing down the energy of $(h, x^{(t)})$ pairs (for $h \sim P(h \mid x^{(t)})$) while pushing up the energy of $(h, x)$ pairs (for $(h, x) \sim P(h, x)$) until the two forces are in equilibrium, at which point the sufficient statistics (gradient of the energy function) have equal expectations with x sampled from the training distribution or with x sampled from the model.
The RBM conditional independence properties imply that
the expectation in the positive phase of Eq. 15 is tractable.
The negative phase term arising from the partition function's contribution to the log-likelihood gradient is more problematic because the computation of the expectation over the joint is not tractable. The various ways of dealing with the partition function's contribution to the gradient have brought
about a number of different training algorithms, many trying
to approximate the log-likelihood gradient.
To approximate the expectation of the joint distribution in
the negative phase contribution to the gradient, it is natural to
again consider exploiting the conditional independence of the
RBM in order to specify a Monte Carlo approximation of the
expectation over the joint:
$E_{p(x,h)}\left[ \frac{\partial}{\partial \theta_i} E^{\mathrm{RBM}}_\theta(x, h) \right] \approx \frac{1}{L} \sum_{l=1}^{L} \frac{\partial}{\partial \theta_i} E^{\mathrm{RBM}}_\theta(\tilde{x}^{(l)}, \tilde{h}^{(l)}),$   (16)
with the samples $(\tilde{x}^{(l)}, \tilde{h}^{(l)})$ drawn by a block Gibbs MCMC (Markov chain Monte Carlo) sampling procedure:
$\tilde{x}^{(l)} \sim P(x \mid \tilde{h}^{(l-1)})$
$\tilde{h}^{(l)} \sim P(h \mid \tilde{x}^{(l)}).$
Naively, for each gradient update step, one would start a
Gibbs sampling chain, wait until the chain converges to the
equilibrium distribution and then draw a sufcient number of
samples to approximate the expected gradient with respect
to the model (joint) distribution in Eq. 16. Then restart the
process for the next step of approximate gradient ascent on
the log-likelihood. This procedure has the obvious aw that
waiting for the Gibbs chain to burn-in and reach equilibrium
anew for each gradient update cannot form the basis of a prac-
tical training algorithm. Contrastive Divergence (Hinton, 1999;
Hinton et al., 2006), Stochastic Maximum Likelihood (Younes,
1999; Tieleman, 2008) and fast-weights persistent contrastive
divergence or FPCD (Tieleman and Hinton, 2009) are all ways
to avoid or reduce the need for burn-in.
6.4.1 Contrastive Divergence
Contrastive divergence (CD) estimation (Hinton, 1999; Hinton
et al., 2006) estimates the negative phase expectation (Eq. 15)
with a very short Gibbs chain (often just one step) initialized
at the training data used in the positive phase. This reduces
the variance of the gradient estimator and still moves in a
direction that pulls the negative chain samples towards the as-
sociated positive chain samples. Much has been written about
the properties and alternative interpretations of CD and its
similarity to auto-encoder training, e.g. Carreira-Perpiñán and
Hinton (2005); Yuille (2005); Bengio and Delalleau (2009);
Sutskever and Tieleman (2010).
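The following is a minimal sketch of a CD-1 update for the binary RBM of Eqs. 12-13, combining the tractable positive phase with a single block Gibbs step for the negative phase (Eqs. 15-16 with L = 1 and the chain initialized at the training example). The helper names and the use of mean activations in the final statistics are illustrative choices, not a prescription from the papers cited above.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cd1_update(x, W, b, d, lr, rng):
    """One CD-1 parameter update for a binary RBM: a sketch of Eqs. 15-16 with L = 1,
    chain initialized at the training example x."""
    # Positive phase: expectations under P(h | x) are tractable (Eq. 12).
    ph_pos = sigmoid(x @ W + d)
    h_pos = (rng.random(ph_pos.shape) < ph_pos).astype(float)
    # One step of block Gibbs sampling for the negative phase.
    px_neg = sigmoid(h_pos @ W.T + b)
    x_neg = (rng.random(px_neg.shape) < px_neg).astype(float)
    ph_neg = sigmoid(x_neg @ W + d)
    # Gradient estimate: positive-phase statistics minus negative-phase statistics.
    W += lr * (np.outer(x, ph_pos) - np.outer(x_neg, ph_neg))
    b += lr * (x - x_neg)
    d += lr * (ph_pos - ph_neg)
    return W, b, d
```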
6.4.2 Stochastic Maximum Likelihood
The Stochastic Maximum Likelihood (SML) algorithm (also
known as persistent contrastive divergence or PCD) (Younes,
1999; Tieleman, 2008) is an alternative way to sidestep an
extended burn-in of the negative phase Gibbs sampler. At each
gradient update, rather than initializing the Gibbs chain at the
positive phase sample as in CD, SML initializes the chain at
the last state of the chain used for the previous update. In
other words, SML uses a continually running Gibbs chain (or
often a number of Gibbs chains run in parallel) from which
samples are drawn to estimate the negative phase expectation.
Despite the model parameters changing between updates, these
changes should be small enough that only a few steps of Gibbs
(in practice, often one step is used) are required to maintain
samples from the equilibrium distribution of the Gibbs chain,
i.e. the model distribution.
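By contrast, an SML/PCD update differs only in where the negative-phase Gibbs chain starts: it persists across parameter updates. A minimal sketch, reusing the numpy import and sigmoid helper of the previous snippet and assuming a single persistent chain:

```python
def sml_update(x, chain_x, W, b, d, lr, rng):
    """One SML/PCD update: the negative-phase Gibbs chain is initialized at its own
    previous state (chain_x), not at the training example."""
    ph_pos = sigmoid(x @ W + d)                       # positive phase (clamped)
    # Advance the persistent chain by one step of block Gibbs sampling.
    ph_chain = sigmoid(chain_x @ W + d)
    h_chain = (rng.random(ph_chain.shape) < ph_chain).astype(float)
    px_chain = sigmoid(h_chain @ W.T + b)
    chain_x = (rng.random(px_chain.shape) < px_chain).astype(float)
    ph_neg = sigmoid(chain_x @ W + d)
    W += lr * (np.outer(x, ph_pos) - np.outer(chain_x, ph_neg))
    b += lr * (x - chain_x)
    d += lr * (ph_pos - ph_neg)
    return chain_x, W, b, d
```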
A troublesome aspect of SML is that it relies on the Gibbs
chain to mix well (especially between modes) for learning to
succeed. Typically, as learning progresses and the weights of the RBM grow, the ergodicity of the Gibbs sampler begins to break down¹². If the learning rate $\epsilon$ associated with gradient ascent $\theta \leftarrow \theta + \epsilon \hat{g}$ (with $E[\hat{g}] \approx \frac{\partial \log p_\theta(x)}{\partial \theta}$) is not reduced to compensate, then the Gibbs sampler will diverge from the
model distribution and learning will fail. Desjardins et al.
(2010); Cho et al. (2010); Salakhutdinov (2010b,a) have all
considered various forms of tempered transitions to address
the failure of Gibbs chain mixing, and convincing solutions
have not yet been clearly demonstrated. A recently introduced
promising avenue relies on depth itself, showing that mixing
between modes is much easier on deeper layers (Bengio et al.,
2013) (Sec. 9.4).
Tieleman and Hinton (2009) have proposed quite a dif-
ferent approach to addressing potential mixing problems of
SML with their fast-weights persistent contrastive divergence
(FPCD), and it has also been exploited to train Deep Boltz-
mann Machines (Salakhutdinov, 2010a) and construct a pure
sampling algorithm for RBMs (Breuleux et al., 2011). FPCD
builds on the surprising but robust tendency of Gibbs chains
to mix better during SML learning than when the model
parameters are fixed. The phenomenon is rooted in the form of
the likelihood gradient itself (Eq. 15). The samples drawn from
the SML Gibbs chain are used in the negative phase of the
gradient, which implies that the learning update will slightly
increase the energy (decrease the probability) of those samples,
making the region in the neighborhood of those samples
12. When weights become large, the estimated distribution is more peaky, and the chain takes a very long time to mix, i.e., to move from mode to mode, so that practically the gradient estimator can be very poor. This is a serious chicken-and-egg problem: if sampling is not effective, neither is the training procedure, which may seem to stall and yields even larger weights.
less likely to be resampled and therefore making it more
likely that the samples will move somewhere else (typically
going near another mode). Rather than drawing samples from the distribution of the current model (with parameters $\theta$), FPCD exaggerates this effect by drawing samples from a local perturbation of the model with parameters $\theta^*$, updated as
$\theta^*_{t+1} = (1 - \eta)\theta_{t+1} + \eta \theta^*_t + \epsilon^* \frac{\partial}{\partial \theta_i} \left( \sum_{t=1}^{T} \log p(x^{(t)}) \right),$   (17)
The discussion now turns from probabilistic models to auto-encoders, which are built around an explicitly parametrized, closed-form encoder $f_\theta$ that computes a representation from an input x. For each example $x^{(t)}$ from a data set $\{x^{(1)}, \ldots, x^{(T)}\}$, we define
$h^{(t)} = f_\theta(x^{(t)})$   (18)
where $h^{(t)}$ is the feature-vector or representation or code computed from $x^{(t)}$. Another closed-form parametrized function $g_\theta(h)$, called the decoder, maps from feature space back into input space, producing a reconstruction $r = g_\theta(h)$. Whereas
probabilistic models are dened from an explicit probability
function and are trained to maximize (often approximately) the
data likelihood (or a proxy), auto-encoders are parametrized
through their encoder and decoder and are trained using a
different training principle. The set of parameters of the
encoder and decoder are learned simultaneously on the task
of reconstructing as well as possible the original input, i.e. attempting to incur the lowest possible reconstruction error $L(x, r)$, a measure of the discrepancy between x and its reconstruction r, over training examples. Good generalization means low reconstruction error at test examples, while having high reconstruction error for most other x configurations. To
capture the structure of the data-generating distribution, it
is therefore important that something in the training crite-
rion or the parametrization prevents the auto-encoder from
learning the identity function, which has zero reconstruction
error everywhere. This is achieved through various means in
the different forms of auto-encoders, as described below in
more detail, and we call these regularized auto-encoders. A
particular form of regularization consists in constraining the
code to have a low dimension, and this is what the classical
auto-encoder or PCA do.
In summary, basic auto-encoder training consists in finding a value of parameter vector $\theta$ minimizing reconstruction error
$\mathcal{J}_{\mathrm{AE}}(\theta) = \sum_t L(x^{(t)}, g_\theta(f_\theta(x^{(t)})))$   (19)
where $x^{(t)}$ is a training example. This minimization is usually
carried out by stochastic gradient descent as in the training
of Multi-Layer-Perceptrons (MLPs). Since auto-encoders were
primarily developed as MLPs predicting their input, the most commonly used forms for the encoder and decoder are affine mappings, optionally followed by a non-linearity:
$f_\theta(x) = s_f(b + Wx)$   (20)
$g_\theta(h) = s_g(d + W'h)$   (21)
where $s_f$ and $s_g$ are the encoder and decoder activation functions (typically the element-wise sigmoid or hyperbolic tangent non-linearity, or the identity function if staying linear).
The set of parameters of such a model is $\theta = \{W, b, W', d\}$, where b and d are called encoder and decoder bias vectors, and W and W' the encoder and decoder weight matrices. With sigmoid non-linearities, the encoder $f_\theta(x)$ and decoder $g_\theta(h)$ have the same form as the RBM conditionals of Eqs. 12-13, and the weight tying $W' = W^T$ may be (and is most often) used, rendering the two parametrizations identical. The usual training procedures however differ greatly between the two approaches. A practical advantage of training auto-encoder variants is that they define a simple tractable optimization objective that can be used to monitor progress.
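A minimal numpy sketch of one stochastic gradient step on Eq. 19, with the affine+sigmoid encoder and decoder of Eqs. 20-21 and untied weights; the squared-error loss and the learning-rate handling are illustrative assumptions.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def autoencoder_sgd_step(x, W, b, W_prime, d, lr):
    """One stochastic gradient step on the squared reconstruction error of Eq. 19,
    with the affine+sigmoid encoder/decoder of Eqs. 20-21 (untied weights)."""
    h = sigmoid(b + W @ x)                          # encoder f_theta(x), Eq. 20
    r = sigmoid(d + W_prime @ h)                    # decoder g_theta(h), Eq. 21
    err = r - x                                     # gradient of 0.5 * ||x - r||^2 w.r.t. r
    delta_r = err * r * (1 - r)                     # back-prop through the decoder sigmoid
    delta_h = (W_prime.T @ delta_r) * h * (1 - h)   # back-prop through the encoder sigmoid
    W_prime -= lr * np.outer(delta_r, h)
    d       -= lr * delta_r
    W       -= lr * np.outer(delta_h, x)
    b       -= lr * delta_h
    return 0.5 * float(err @ err)                   # reconstruction error, to monitor progress
```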
In the case of a linear auto-encoder (linear encoder and decoder) with squared reconstruction error, minimizing Eq. 19 learns the same subspace¹⁵ as PCA. This is also true when using a sigmoid nonlinearity in the encoder (Bourlard and Kamp, 1988), but not if the weights W and W' are tied ($W' = W^T$), because W cannot be forced into being small and W' into being large. It has also been shown that adding a penalty of the form $\sum_t \sum_j s_3(W_j x^{(t)})$ to a linear auto-encoder with tied weights, where $s_3$ is a nonlinear convex function, yields an efficient algorithm for learning linear ICA.
7.2 Regularized Auto-Encoders
Like PCA, auto-encoders were originally seen as a dimensionality reduction technique and thus used a bottleneck, i.e. $d_h < d_x$. On the other hand, successful uses of sparse coding and RBM approaches tend to favour overcomplete representations, i.e. $d_h > d_x$. This can allow the auto-encoder to simply duplicate the input in the features, with perfect reconstruction without having extracted more meaningful features. Recent research has demonstrated very successful alternative ways, called regularized auto-encoders, to constrain the representation, even when it is overcomplete. The effect of a bottleneck or of this regularization is that the auto-encoder cannot reconstruct everything well: it is trained to reconstruct the training examples well, and generalization means that reconstruction error is also small on test examples.
An interesting justification (Ranzato et al., 2008) for the sparsity penalty (or any penalty that restricts in a soft way the volume of hidden configurations easily accessible by the learner) is that it acts in spirit like the partition function of RBMs, by making sure that only few input configurations can have a low reconstruction error.
14. $L(x, r) = -\sum_{i=1}^{d_x} x_i \log(r_i) + (1 - x_i) \log(1 - r_i)$
15. Contrary to traditional PCA loading factors, but similarly to the parameters learned by probabilistic PCA, the weight vectors learned by a linear auto-encoder are not constrained to form an orthonormal basis, nor to have a meaningful ordering. They will however span the same subspace.
Alternatively, one can view the objective of the regulariza-
tion applied to an auto-encoder as making the representation
as constant (insensitive) as possible with respect to changes
in input. This view immediately justies two variants of
regularized auto-encoders described below: contractive auto-
encoders reduce the number of effective degrees of freedom of
the representation (around each point) by making the encoder
contractive, i.e., making the derivative of the encoder small
(thus making the hidden units saturate), while the denoising
auto-encoder makes the whole mapping robust, i.e., insen-
sitive to small random perturbations, or contractive, making
sure that the reconstruction cannot stay good when moving in
most directions around a training example.
7.2.1 Sparse Auto-Encoders
The earliest uses of single-layer auto-encoders for building
deep architectures by stacking them (Bengio et al., 2007)
considered the idea of tying the encoder weights and decoder
weights to restrict capacity as well as the idea of introducing a
form of sparsity regularization (Ranzato et al., 2007). Sparsity
in the representation can be achieved by penalizing the hidden
unit biases (making these additive offset parameters more
negative) (Ranzato et al., 2007; Lee et al., 2008; Goodfellow
et al., 2009; Larochelle and Bengio, 2008) or by directly
penalizing the output of the hidden unit activations (making
them closer to their saturating value at 0) (Ranzato et al.,
2008; Le et al., 2011a; Zou et al., 2011). Penalizing the bias
runs the danger that the weights could compensate for the
bias, which could hurt numerical optimization. When directly
penalizing the hidden unit outputs, several variants can be
found in the literature, but a clear comparative analysis is
still lacking. Although the L1 penalty (i.e., simply the sum of output elements $h_j$ in the case of sigmoid non-linearity) would seem the most natural (because of its use in sparse coding), it is used in few papers involving sparse auto-encoders. A close cousin of the L1 penalty is the Student-t penalty ($\log(1 + h_j^2)$), originally proposed for sparse coding (Olshausen and Field, 1996). Several papers penalize the average output $\bar{h}_j$ (e.g. over a minibatch), and instead of pushing it to 0, encourage it to approach a fixed target $\rho$, either through a mean-square error penalty, or maybe more sensibly (because $\bar{h}_j$ behaves like a probability), a Kullback-Leibler divergence with respect to the binomial distribution with probability $\rho$: $-\rho \log \bar{h}_j - (1 - \rho) \log(1 - \bar{h}_j) + \text{constant}$, e.g., with $\rho = 0.05$.
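As a concrete illustration of the last variant, the following sketch computes the Kullback-Leibler sparsity penalty on the minibatch-averaged activations; it includes the constant term $\rho \log \rho + (1-\rho)\log(1-\rho)$ omitted in the expression above, and the clipping constant is an implementation detail added here.

```python
import numpy as np

def kl_sparsity_penalty(H, rho=0.05, eps=1e-8):
    """Sparsity penalty on the average hidden activations of a minibatch H (rows = examples):
    KL between a target Bernoulli(rho) and the observed mean activation h_bar_j."""
    h_bar = H.mean(axis=0)
    h_bar = np.clip(h_bar, eps, 1 - eps)   # keep the logarithms finite
    return float(np.sum(rho * np.log(rho / h_bar)
                        + (1 - rho) * np.log((1 - rho) / (1 - h_bar))))
```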
7.2.2 Denoising Auto-Encoders
Vincent et al. (2008, 2010) proposed altering the training ob-
jective in Eq. 19 from mere reconstruction to that of denoising
an articially corrupted input, i.e. learning to reconstruct the
clean input from a corrupted version. Learning the identity
is no longer enough: the learner must capture the structure
of the input distribution in order to optimally undo the effect
of the corruption process, with the reconstruction essentially
being a nearby but higher density point than the corrupted
input. Figure 3 illustrates that the Denoising Auto-Encoder
15
(DAE) is learning a reconstruction function that corresponds
to a vector eld pointing towards high-density regions (the
manifold where examples concentrate).
Fig. 3. When data concentrate near a lower-dimensional manifold, the corruption vector is typically almost orthogonal to the manifold, and the reconstruction function learns to denoise, i.e., map from low-probability configurations (corrupted inputs) to high-probability ones (original inputs), creating a vector field aligned with the score (derivative of the estimated density).
Formally, the objective optimized by a DAE is:
$\mathcal{J}_{\mathrm{DAE}} = \sum_t E_{q(\tilde{x} \mid x^{(t)})}\left[ L(x^{(t)}, g_\theta(f_\theta(\tilde{x}))) \right]$   (22)
where $E_{q(\tilde{x} \mid x^{(t)})}[\cdot]$ averages over corrupted examples $\tilde{x}$ drawn from corruption process $q(\tilde{x} \mid x^{(t)})$. In practice this is optimized by stochastic gradient descent, where the stochastic gradient is estimated by drawing one or a few corrupted versions of $x^{(t)}$ each time $x^{(t)}$ is considered. Corruptions considered in Vin-
cent et al. (2010) include additive isotropic Gaussian noise,
salt and pepper noise for gray-scale images, and masking
noise (salt or pepper only), e.g., setting some randomly chosen
inputs to 0 (independently per example). Masking noise has
been used in most of the simulations. Qualitatively better
features are reported with denoising, resulting in improved
classification, and DAE features performed similarly or better
than RBM features. Chen et al. (2012) show that a simpler
alternative with a closed form solution can be obtained when
restricting to a linear auto-encoder and have successfully
applied it to domain adaptation.
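A minimal sketch of a stochastic estimate of the DAE objective of Eq. 22 with masking noise and squared error; the `encode` and `decode` callables and the corruption fraction `nu` are hypothetical placeholders for whichever parametrization (e.g. Eqs. 20-21) is being trained.

```python
import numpy as np

def dae_masking_loss(x, encode, decode, rng, nu=0.25):
    """Monte Carlo estimate of the DAE objective of Eq. 22 for one example, using
    masking noise: a random fraction nu of the inputs is set to 0 before encoding."""
    mask = rng.random(x.shape) >= nu        # keep each input with probability 1 - nu
    x_tilde = x * mask                      # corrupted input, drawn from q(x_tilde | x)
    r = decode(encode(x_tilde))             # reconstruction of the *clean* x from x_tilde
    return float(np.sum((x - r) ** 2))      # squared-error loss L(x, r)
```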
Vincent (2011) relates DAEs to energy-based probabilistic models: DAEs basically learn in $r(\tilde{x}) - \tilde{x}$ a vector pointing in the direction of the estimated score $\frac{\partial \log p(\tilde{x})}{\partial \tilde{x}}$ (Figure 3). In the special case of linear reconstruction and squared error, Vincent (2011) shows that training an affine-sigmoid-affine DAE amounts to learning an energy-based model, whose energy function is very close to that of a GRBM. Training uses a regularized variant of the score matching parameter estimation technique (Hyvärinen, 2005; Hyvärinen, 2008; Kingma and LeCun, 2010) termed denoising score matching (Vincent, 2011). Swersky (2010) had shown that training GRBMs with score matching is equivalent to training a regular auto-encoder with an additional regularization term, while, following up on the theoretical results in Vincent (2011), Swersky et al. (2011) showed the practical advantage of denoising to implement score matching efficiently. Finally Alain and Bengio (2012) generalize Vincent (2011) and prove that DAEs of arbitrary parametrization with small Gaussian corruption noise are general estimators of the score.
7.2.3 Contractive Auto-Encoders
Contractive Auto-Encoders (CAE), proposed by Rifai et al.
(2011a), follow up on Denoising Auto-Encoders (DAE) and
share a similar motivation of learning robust representations.
CAEs achieve this by adding an analytic contractive penalty to Eq. 19: the Frobenius norm of the encoder's Jacobian, which results in penalizing the sensitivity of learned features to infinitesimal input variations. Let $J(x) = \frac{\partial f_\theta}{\partial x}(x)$ denote the Jacobian matrix of the encoder at x. The CAE's training objective is
$\mathcal{J}_{\mathrm{CAE}} = \sum_t L(x^{(t)}, g_\theta(f_\theta(x^{(t)}))) + \lambda \left\| J(x^{(t)}) \right\|_F^2$   (23)
where $\lambda$ is a hyper-parameter controlling the strength of the regularization. For an affine sigmoid encoder, the contractive penalty term is easy to compute:
$J_j(x) = f_\theta(x)_j (1 - f_\theta(x)_j) W_j$
$\|J(x)\|_F^2 = \sum_j \left( f_\theta(x)_j (1 - f_\theta(x)_j) \right)^2 \|W_j\|^2$   (24)
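A direct transcription of Eq. 24 into numpy for an affine+sigmoid encoder might look as follows (the weight layout and names are assumptions):

```python
import numpy as np

def contractive_penalty(x, W, b):
    """Squared Frobenius norm of the Jacobian of an affine+sigmoid encoder (Eq. 24):
    ||J(x)||_F^2 = sum_j (h_j * (1 - h_j))^2 * ||W_j||^2, where W_j is row j of W."""
    h = 1.0 / (1.0 + np.exp(-(b + W @ x)))          # encoder output f_theta(x)
    return float(np.sum((h * (1 - h)) ** 2 * np.sum(W ** 2, axis=1)))
```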
There are at least three notable differences with DAEs, which may be partly responsible for the better performance that CAE features seem to empirically demonstrate: (a) the sensitivity of the features is penalized¹⁶ rather than that of the reconstruction; (b) the penalty is analytic rather than stochastic: an efficiently computable expression replaces what might otherwise require $d_x$ corrupted samples to size up (i.e. the sensitivity in $d_x$ directions); (c) a hyper-parameter $\lambda$ allows a fine control of the trade-off between reconstruction and robustness (while the two are mingled in a DAE). Note however that there is a tight connection between the DAE and the CAE: as shown in (Alain and Bengio, 2012) a DAE with small corruption noise can be seen (through a Taylor expansion) as a type of contractive auto-encoder where the contractive penalty is on the whole reconstruction function rather than just on the encoder¹⁷.
A potential disadvantage of the CAE's analytic penalty is that it amounts to only encouraging robustness to infinitesimal input variations. This is remedied in Rifai et al. (2011b) with the CAE+H, which penalizes all higher order derivatives, in an efficient stochastic manner, by adding a term that encourages $J(x)$ and $J(x + \epsilon)$ to be close:
$\mathcal{J}_{\mathrm{CAE+H}} = \sum_t L(x^{(t)}, g_\theta(f_\theta(x^{(t)}))) + \lambda \left\| J(x^{(t)}) \right\|_F^2 + \gamma E_\epsilon\left[ \| J(x) - J(x + \epsilon) \|_F^2 \right]$   (25)
where $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$, and $\gamma$ is the associated regularization strength hyper-parameter.
The DAE and CAE have been successfully used to win the final phase of the Unsupervised and Transfer Learning Challenge (Mesnil et al., 2011). The representation learned by the CAE tends to be saturated rather than sparse, i.e., most hidden units are near the extremes of their range (e.g. 0 or 1), and their derivative $\frac{\partial h_i(x)}{\partial x}$ is near 0. The non-saturated units are few and sensitive to the inputs, with their associated filters (weight vectors) together forming a basis explaining the local changes around x, as discussed in Section 8.2. Another way to get saturated (nearly binary) units is semantic hashing (Salakhutdinov and Hinton, 2007).
16. i.e., the robustness of the representation is encouraged.
17. but note that in the CAE, the decoder weights are tied to the encoder
weights, to avoid degenerate solutions, and this should also make the decoder
contractive.
7.2.4 Predictive Sparse Decomposition
Sparse coding (Olshausen and Field, 1996) may be viewed as a
kind of auto-encoder that uses a linear decoder with a squared
reconstruction error, but whose non-parametric encoder $f_\theta$ performs a comparatively costly iterative minimization (Eq. 2). In Predictive Sparse Decomposition (PSD), this costly encoding step is replaced by a fast parametric approximation, with a training criterion that is optimized simultaneously with respect to the codes $h^{(t)}$ and the parameters:
$\mathcal{J}_{\mathrm{PSD}} = \sum_t \lambda \|h^{(t)}\|_1 + \|x^{(t)} - Wh^{(t)}\|_2^2 + \|h^{(t)} - f(x^{(t)})\|_2^2$   (26)
where $x^{(t)}$ is the input vector for example t, $h^{(t)}$ is the optimized hidden code for that example, and $f(\cdot)$ is the encoding function, the simplest variant being
$f(x^{(t)}) = \tanh(b + W^T x^{(t)})$   (27)
where encoding weights are the transpose of decoding
weights. Many variants have been proposed, including the
use of a shrinkage operation instead of the hyperbolic tan-
gent (Kavukcuoglu et al., 2010). Note how the L1 penalty on
h tends to make them sparse, and how this is the same criterion as sparse coding with dictionary learning (Eq. 3) except for the additional constraint that one should be able to approximate the sparse codes h with a parametrized encoder $f(x)$.
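A minimal sketch of the PSD criterion of Eqs. 26-27, written for a whole batch of examples; keeping a separate `W_enc` rather than fixing it to `W.T` reflects the variants mentioned above, and all names are illustrative.

```python
import numpy as np

def psd_objective(X, H, W, b, W_enc, lam=1.0):
    """PSD training criterion of Eq. 26, summed over examples: the codes H (one row per
    example) are free variables optimized jointly with the decoder W and the
    parametric encoder f(x) = tanh(b + W_enc @ x) of Eq. 27."""
    recon = np.sum((X - H @ W.T) ** 2)                    # ||x - W h||^2_2 terms
    sparsity = lam * np.sum(np.abs(H))                    # lambda * ||h||_1 terms
    pred = np.sum((H - np.tanh(X @ W_enc.T + b)) ** 2)    # ||h - f(x)||^2_2 terms
    return float(recon + sparsity + pred)
```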
From the manifold-learning perspective, the low-dimensional representation that PCA yields for an input point x uniquely locates its projection on the linear manifold spanned by the principal directions: it corresponds to intrinsic coordinates on the manifold. Data manifolds for complex real-world domains are, however, expected to be strongly non-linear. Their modeling is sometimes approached as patchworks
of locally linear tangent spaces (Vincent and Bengio, 2003;
Brand, 2003). The large majority of algorithms built on
this geometric perspective adopt a non-parametric approach,
based on a training set nearest neighbor graph (Schölkopf
et al., 1998; Roweis and Saul, 2000; Tenenbaum et al., 2000;
Brand, 2003; Belkin and Niyogi, 2003; Donoho and Grimes,
2003; Weinberger and Saul, 2004; Hinton and Roweis, 2003;
van der Maaten and Hinton, 2008). In these non-parametric
approaches, each high-dimensional training point has its own
set of free low-dimensional embedding coordinates, which are
optimized so that certain properties of the neighborhood graph
computed in original high dimensional input space are best
preserved. These methods however do not directly learn a parametrized feature extraction function $f_\theta(x)$ applicable to new test points, which seriously limits their use as feature extractors, except in a transductive setting. Comparatively few
non-linear manifold learning methods have been proposed,
that learn a parametric map that can directly compute a
representation for new points; we will focus on these.
8.1 Learning a parametric mapping based on a
neighborhood graph
Some of the above non-parametric manifold learning algorithms can be modified to learn a parametric mapping $f_\theta$ that can be applied to new points. A related idea, Local Coordinate Coding (LCC) (Yu et al., 2009), replaces the sparse coding L1 penalty $\|h^{(t)}\|_1 = \sum_j |h^{(t)}_j|$ with a locality-weighted version, yielding the objective
$\mathcal{J}_{\mathrm{LCC}} = \sum_t \left[ \|x^{(t)} - Wh^{(t)}\|_2^2 + \lambda \sum_j |h^{(t)}_j| \, \|W_{:,j} - x^{(t)}\|^{1+p} \right]$   (28)
This is identical to sparse coding when $p = -1$, but with larger p it encourages the active anchor points for $x^{(t)}$ (i.e. the codebook vectors $W_{:,j}$ with non-negligible $|h^{(t)}_j|$ that are combined to reconstruct $x^{(t)}$) to be not too far from $x^{(t)}$, hence the local aspect of the algorithm. An important
theoretical contribution of Yu et al. (2009) is to show that any Lipschitz-smooth function $\phi: \mathcal{M} \rightarrow \mathbb{R}$ defined on a smooth nonlinear manifold $\mathcal{M}$ embedded in $\mathbb{R}^{d_x}$ can be well approximated by a globally linear function with respect to the resulting coding scheme (i.e. linear in h), where the accuracy of the approximation and required number $d_h$ of anchor points depend on $d_{\mathcal{M}}$ rather than $d_x$. This result has been further
extended with the use of local tangent directions (Yu and
Zhang, 2010), as well as to multiple layers (Lin et al., 2010).
The Deep Boltzmann Machine (DBM) with two hidden layers is defined by the energy function
$E^{\mathrm{DBM}}_\theta(v, h^{(1)}, h^{(2)}; \theta) = -v^T W h^{(1)} - h^{(1)T} V h^{(2)} - d^{(1)T} h^{(1)} - d^{(2)T} h^{(2)} - b^T v,$   (29)
with $\theta = \{W, V, d^{(1)}, d^{(2)}, b\}$. The DBM can also be characterized as a bipartite graph between two sets of vertices, formed by odd and even-numbered layers (with $v := h^{(0)}$).
10.2.1 Mean-field approximate inference
A key point of departure from the RBM is that the pos-
terior distribution over the hidden units (given the visibles)
is no longer tractable, due to the interactions between the
hidden units. Salakhutdinov and Hinton (2009) resort to a
mean-field approximation to the posterior. Specifically, in the case of a model with two hidden layers, we wish to approximate $P(h^{(1)}, h^{(2)} \mid v)$ with the factored distribution $Q_v(h^{(1)}, h^{(2)}) = \prod_{j=1}^{N_1} Q_v(h^{(1)}_j) \prod_{i=1}^{N_2} Q_v(h^{(2)}_i)$, such that the KL divergence $\mathrm{KL}\left( P(h^{(1)}, h^{(2)} \mid v) \,\|\, Q_v(h^{(1)}, h^{(2)}) \right)$ is minimized or, equivalently, that a lower bound on the log likelihood is maximized:
$\log P(v) \geq \mathcal{L}(Q_v) \equiv \sum_{h^{(1)}} \sum_{h^{(2)}} Q_v(h^{(1)}, h^{(2)}) \log\left( \frac{P(v, h^{(1)}, h^{(2)})}{Q_v(h^{(1)}, h^{(2)})} \right)$   (30)
Maximizing this lower-bound with respect to the mean-field distribution $Q_v(h^{(1)}, h^{(2)})$ (by setting derivatives to zero) yields the following mean field update equations:
$\hat{h}^{(1)}_i \leftarrow \operatorname{sigmoid}\left( \sum_j W_{ji} v_j + \sum_k V_{ik} \hat{h}^{(2)}_k + d^{(1)}_i \right)$   (31)
$\hat{h}^{(2)}_k \leftarrow \operatorname{sigmoid}\left( \sum_i V_{ik} \hat{h}^{(1)}_i + d^{(2)}_k \right)$   (32)
25. Joint training of all the layers of a Deep Belief Net is much more
challenging because of the much harder inference problem involved.
Note how the above equations ostensibly look like a fixed point recurrent neural network, i.e., with constant input. In
the same way that an RBM can be associated with a simple
auto-encoder, the above mean-eld update equations for the
DBM can be associated with a recurrent auto-encoder. In that
case the training criterion involves the reconstruction error at
the last or at consecutive time steps. This type of model has
been explored by Savard (2011) and Seung (1998) and shown
to do a better job at denoising than ordinary auto-encoders.
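A minimal sketch of this fixed-point iteration (Eqs. 31-32) for a two-hidden-layer DBM; the weight shapes, the initialization of the mean-field parameters from a single bottom-up pass, and the fixed number of iterations are illustrative assumptions.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def dbm_mean_field(v, W, V, d1, d2, n_iters=10):
    """Mean-field inference for a two-hidden-layer DBM: iterate the fixed-point
    updates of Eqs. 31-32, starting from a bottom-up pass."""
    h1 = sigmoid(v @ W + d1)                    # initial guess for E_Q[h^(1)]
    h2 = sigmoid(h1 @ V + d2)                   # initial guess for E_Q[h^(2)]
    for _ in range(n_iters):
        h1 = sigmoid(v @ W + h2 @ V.T + d1)     # Eq. 31
        h2 = sigmoid(h1 @ V + d2)               # Eq. 32
    return h1, h2
```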
Iterating Eq. (31-32) until convergence yields the Q param-
eters of the variational positive phase of Eq. 33:
$\mathcal{L}(Q_v) = E_{Q_v}\left[ \log P(v, h^{(1)}, h^{(2)}) - \log Q_v(h^{(1)}, h^{(2)}) \right]$
$\qquad\;\; = E_{Q_v}\left[ -E^{\mathrm{DBM}}_\theta(v, h^{(1)}, h^{(2)}) - \log Q_v(h^{(1)}, h^{(2)}) \right] - \log Z_\theta$
$\frac{\partial \mathcal{L}(Q_v)}{\partial \theta} = -E_{Q_v}\left[ \frac{\partial E^{\mathrm{DBM}}_\theta(v, h^{(1)}, h^{(2)})}{\partial \theta} \right] + E_P\left[ \frac{\partial E^{\mathrm{DBM}}_\theta(v, h^{(1)}, h^{(2)})}{\partial \theta} \right]$   (33)
This variational learning procedure leaves the negative
phase untouched, which can thus be estimated through SML
or Contrastive Divergence (Hinton, 2000) as in the RBM case.
10.2.2 Training Deep Boltzmann Machines
The major difference between training a DBM and an RBM is that instead of maximizing the likelihood directly, we choose parameters to maximize the lower-bound on the likelihood given in Eq. 30. The SML-based algorithm for
maximizing this lower-bound is as follows:
1) Clamp the visible units to a training example.
2) Iterate over Eq. (31-32) until convergence.
3) Generate negative phase samples $\tilde{v}$, $\tilde{h}^{(1)}$ and $\tilde{h}^{(2)}$ through SML.
4) Compute $\partial \mathcal{L}(Q_v) / \partial \theta$ using the values obtained in steps 2-3.
5) Finally, update the model parameters with a step of
approximate stochastic gradient ascent.
While the above procedure appears to be a simple extension
of the highly effective SML scheme for training RBMs, as we
demonstrate in Desjardins et al. (2012), this procedure seems
vulnerable to falling in poor local minima which leave many
hidden units effectively dead (not signicantly different from
its random initialization with small norm).
The failure of the SML joint training strategy was noted
by Salakhutdinov and Hinton (2009). As an alternative, they
proposed a greedy layer-wise training strategy. This procedure
consists in pre-training the layers of the DBM, in much the
same way as the Deep Belief Network: i.e. by stacking RBMs
and training each layer to independently model the output of
the previous layer. A final joint fine-tuning is done following
the above SML-based procedure.
11 BUILDING-IN INVARIANCE
It is well understood that incorporating prior domain knowl-
edge helps machine learning. Exploring good strategies for
doing so is a very important research avenue. However, if we
are to advance our understanding of core machine learning
principles, it is important that we keep comparisons between
predictors fair and maintain a clear awareness of the prior
domain knowledge used by different learning algorithms,
especially when comparing their performance on benchmark
problems. We have so far presented only algorithms that exploit generic inductive biases for high dimensional problems, thus making them potentially applicable to any high dimensional problem. The most prevalent approach to incorporating prior knowledge is to hand-design better features to feed a generic classifier; this approach has been used extensively in computer vision (e.g. (Lowe, 1999)). Here, we rather focus
on how basic domain knowledge of the input, in particular
its topological structure (e.g. bitmap images having a 2D
structure), may be used to learn better features.
11.1 Generating transformed examples
Generalization performance is usually improved by providing
a larger quantity of representative data. This can be achieved
by generating new examples by applying small random defor-
mations to the original training examples, using deformations
that are known not to change the target variables of interest,
e.g., an object class is invariant to small transformations of
images such as translations, rotations, scaling, or shearing.
This old approach (Baird, 1990) has been recently applied with
great success in the work of Ciresan et al. (2010), who used an efficient GPU implementation (40× speedup) to train a standard but large deep multilayer Perceptron on deformed MNIST digits. Using both affine and elastic deformations (Simard et al., 2003), with plain old stochastic gradient descent, they reach a record 0.32% classification error rate.
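A minimal sketch of this kind of label-preserving augmentation, here restricted to small random rotations and translations of an image (the elastic deformations mentioned above are not included); the parameter ranges are illustrative.

```python
import numpy as np
from scipy.ndimage import rotate, shift

def random_deformation(img, rng, max_angle=10.0, max_shift=2.0):
    """Generate a new training example by a small random rotation and translation,
    transformations assumed not to change the class label of the image."""
    angle = rng.uniform(-max_angle, max_angle)            # degrees
    dx, dy = rng.uniform(-max_shift, max_shift, size=2)   # pixels
    out = rotate(img, angle, reshape=False, order=1, mode="nearest")
    out = shift(out, (dy, dx), order=1, mode="nearest")
    return out
```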
11.2 Convolution and pooling
Another powerful approach is based on even more basic
knowledge of merely the topological structure of the input
dimensions. By this we mean e.g., the 2D layout of pixels
in images or audio spectrograms, the 3D structure of videos,
the 1D sequential structure of text or of temporal sequences
in general. Based on such structure, one can dene local
receptive elds (Hubel and Wiesel, 1959), so that each low-
level feature will be computed from only a subset of the input:
a neighborhood in the topology (e.g. a sub-image at a given
position). This topological locality constraint corresponds to a
layer having a very sparse weight matrix with non-zeros only
allowed for topologically local connections. Computing the
associated matrix products can of course be made much more
efficient than having to handle a dense matrix, in addition
to the statistical gain from a much smaller number of free
parameters. In domains with such topological structure, similar
input patterns are likely to appear at different positions, and
nearby values (e.g. consecutive frames or nearby pixels) are likely to have stronger dependencies, which are also important for modeling the data. In fact these dependencies can be exploited to discover the topology (Le Roux et al., 2008a), i.e. recover a regular grid of pixels out of a set of vectors without any order information, e.g. after the elements have been arbitrarily shuffled in the same way for all examples. Thus the same local feature computation is likely to be relevant at all translated positions of the receptive field. Hence the idea of sweeping such
a local feature extractor over the topology: this corresponds to
a convolution, and transforms an input into a similarly shaped
feature map. Equivalently to sweeping, this may be seen as
static but differently positioned replicated feature extractors
that all share the same parameters. This is at the heart of
convolutional networks (LeCun et al., 1989, 1998b) which
have been applied both to object recognition and to image
segmentation (Turaga et al., 2010). Another hallmark of the
convolutional architecture is that values computed by the same
feature detector applied at several neighboring input locations
are then summarized through a pooling operation, typically
taking their max or their sum. This confers the resulting pooled
feature layer some degree of invariance to input translations,
and this style of architecture (alternating selective feature
extraction and invariance-creating pooling) has been the ba-
sis of convolutional networks, the Neocognitron (Fukushima,
1980) and HMAX (Riesenhuber and Poggio, 1999) models,
and argued to be the architecture used by mammalian brains
for object recognition (Riesenhuber and Poggio, 1999; Serre
et al., 2007; DiCarlo et al., 2012). The output of a pooling
unit will be the same irrespective of where a specific feature is located inside its pooling region. Empirically the use of pooling seems to contribute significantly to improved classification accuracy in object classification tasks (LeCun et al., 1998b; Boureau et al., 2010, 2011). A successful variant of pooling connected to sparse coding is L2 pooling (Hyvärinen
et al., 2009; Kavukcuoglu et al., 2009; Le et al., 2010),
for which the pool output is the square root of the possibly
weighted sum of squares of lter outputs. Ideally, we would
like to generalize feature-pooling so as to learn what features
should be pooled together, e.g. as successfully done in several
papers (Hyvärinen and Hoyer, 2000; Kavukcuoglu et al., 2009;
Le et al., 2010; Ranzato and Hinton, 2010; Courville et al.,
2011b; Coates and Ng, 2011b; Gregor et al., 2011). In this
way, the pool output learns to be invariant to the variations
captured by the span of the features pooled.
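A minimal numpy sketch of pooling a single 2D feature map over non-overlapping regions, with either max pooling or the L2 pooling described above; the block size and edge handling are illustrative choices.

```python
import numpy as np

def pool_feature_map(fmap, size=2, mode="max"):
    """Pool non-overlapping size x size regions of a 2D feature map, taking either the
    max or the L2 norm (square root of the sum of squares) of each region."""
    H, W = fmap.shape
    H2, W2 = H - H % size, W - W % size                    # drop edge rows/cols if needed
    blocks = fmap[:H2, :W2].reshape(H2 // size, size, W2 // size, size)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    return np.sqrt((blocks ** 2).sum(axis=(1, 3)))         # L2 pooling
```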
Patch-based training
The simplest approach for learning a convolutional layer in an
unsupervised fashion is patch-based training: simply feeding
a generic unsupervised feature learning algorithm with local
patches extracted at random positions of the inputs. The
resulting feature extractor can then be swept over the input to
produce the convolutional feature maps. That map may be used
as a new input for the next layer, and the operation repeated
to thus learn and stack several layers. Such an approach
was recently used with Independent Subspace Analysis (Le
et al., 2011c) on 3D video blocks, reaching the state-of-the-art
on Hollywood2, UCF, KTH and YouTube action recognition
datasets. Similarly, Coates and Ng (2011a) compared several
feature learners with patch-based training and reached state-
of-the-art results on several classication benchmarks. Inter-
estingly, in this work performance was almost as good with
very simple k-means clustering as with more sophisticated
feature learners. We however conjecture that this is the case
only because patches are rather low dimensional (compared
to the dimension of a whole image). A large dataset might
provide sufficient coverage of the space of, e.g., edges prevalent in 6 × 6 patches, so that a distributed representation is not
absolutely necessary. Another plausible explanation for this
success is that the clusters identied in each image patch are
then pooled into a histogram of cluster counts associated with
a larger sub-image. Whereas the output of a regular clustering
is a one-hot non-distributed code, this histogram is itself a
distributed representation, and the soft k-means (Coates and
Ng, 2011a) representation allows not only the nearest filter but
also its neighbors to be active.
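A minimal sketch of the patch-based recipe described above: sample random patches, run plain k-means on them, and use the centroids as a dictionary of local filters. Batching, whitening and the soft-assignment encoding used by Coates and Ng (2011a) are omitted here, and all names are illustrative.

```python
import numpy as np

def extract_random_patches(images, n_patches, size, rng):
    """Sample square patches at random positions from a set of 2D images."""
    patches = []
    for _ in range(n_patches):
        img = images[rng.integers(len(images))]
        i = rng.integers(img.shape[0] - size + 1)
        j = rng.integers(img.shape[1] - size + 1)
        patches.append(img[i:i + size, j:j + size].ravel())
    return np.stack(patches)

def kmeans_features(patches, k, n_iters, rng):
    """Plain k-means on patches; the centroids then act as a dictionary of local filters."""
    centers = patches[rng.choice(len(patches), size=k, replace=False)].astype(float)
    for _ in range(n_iters):
        d2 = ((patches[:, None, :] - centers[None, :, :]) ** 2).sum(-1)   # squared distances
        assign = d2.argmin(axis=1)
        for c in range(k):
            members = patches[assign == c]
            if len(members):
                centers[c] = members.mean(axis=0)
    return centers
```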
Convolutional and tiled-convolutional training
It is possible to directly train large convolutional layers using
an unsupervised criterion. An early approach (Jain and Seung,
2008) trained a standard but deep convolutional MLP on
the task of denoising images, i.e. as a deep, convolutional,
denoising auto-encoder. Convolutional versions of the RBM
or its extensions have also been developed (Desjardins and
Bengio, 2008; Lee et al., 2009a; Taylor et al., 2010) as well as
a probabilistic max-pooling operation built into Convolutional
Deep Networks (Lee et al., 2009a,b; Krizhevsky, 2010). Other
unsupervised feature learning approaches that were adapted to
the convolutional setting include PSD (Kavukcuoglu et al.,
2009, 2010; Jarrett et al., 2009; Henaff et al., 2011), a
convolutional version of sparse coding called deconvolutional
networks (Zeiler et al., 2010), Topographic ICA (Le et al.,
2010), and mPoT that Kivinen and Williams (2012) applied
to modeling natural textures. Gregor and LeCun (2010a);
Le et al. (2010) also demonstrated the technique of tiled-
convolution, where parameters are shared only between feature
extractors whose receptive fields are k steps away (so the
ones looking at immediate neighbor locations are not shared).
This allows pooling units to be invariant to more than just
translations, and is a hybrid between convolutional networks
and earlier neural networks with local connections but no
weight sharing (LeCun, 1986, 1989).
Alternatives to pooling
Alternatively, one can also use explicit knowledge of the
expected invariants expressed mathematically to dene trans-
formations that are robust to a known family of input defor-
mations, using so-called scattering operators (Mallat, 2012;
Bruna and Mallat, 2011), which can be computed in a way
interestingly analogous to deep convolutional networks and
wavelets. Like convolutional networks, the scattering operators
alternate two types of operations: convolution and pooling
(as a norm). Unlike convolutional networks, the proposed
approach keeps at each level all of the information about the
input (in a way that can be inverted), and automatically yields
a very sparse (but very high-dimensional) representation. An-
other difference is that the filters are not learned but instead set so as to guarantee that a priori specified invariances are robustly achieved. Just a few levels were sufficient to achieve
impressive results on several benchmark datasets.
11.3 Temporal coherence and slow features
The principle of identifying slowly moving/changing factors in
temporal/spatial data has been investigated by many (Becker
and Hinton, 1992; Wiskott and Sejnowski, 2002; Hurri and Hyvärinen, 2003; Körding et al., 2004; Cadieu and Olshausen, 2009) as a principle for finding useful representations. In
particular this idea has been applied to image sequences and as
an explanation for why V1 simple and complex cells behave
the way they do. A good overview can be found in Hurri and Hyvärinen (2003); Berkes and Wiskott (2005).
More recently, temporal coherence has been successfully
exploited in deep architectures to model video (Mobahi et al.,
2009). It was also found that temporal coherence discov-
ered visual features similar to those obtained by ordinary
unsupervised feature learning (Bergstra and Bengio, 2009),
and a temporal coherence penalty has been combined with a training criterion for unsupervised feature learning (Zou et al., 2011), in this case sparse auto-encoders with L1 regularization, yielding improved classification performance.
The temporal coherence prior can be expressed in several
ways, the simplest being the squared difference between
feature values at times t and t + 1. Other plausible tempo-
ral coherence priors include the following. First, instead of
penalizing the squared change, penalizing the absolute value
(or a similar sparsity penalty) would state that most of the
time the change should be exactly 0, which would intuitively
make sense for the real-life factors that surround us. Second,
one would expect that instead of just being slowly changing,
different factors could be associated with their own different
time scale. The specificity of their time scale could thus
become a hint to disentangle explanatory factors. Third, one
would expect that some factors should really be represented by
a group of numbers (such as x, y, and z position of some object
in space and the pose parameters of Hinton et al. (2011))
rather than by a single scalar, and that these groups tend
to move together. Structured sparsity penalties (Kavukcuoglu
et al., 2009; Jenatton et al., 2009; Bach et al., 2011; Gregor
et al., 2011) could be used for this purpose.
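A minimal sketch of the two simplest priors above, computed on a sequence of feature vectors (one row per time step); the function name and interface are illustrative.

```python
import numpy as np

def temporal_coherence_penalty(H, mode="squared"):
    """Slowness penalty on a sequence of feature vectors H (rows = time steps):
    penalize the change between consecutive representations h_t and h_{t+1}."""
    diffs = H[1:] - H[:-1]
    if mode == "squared":
        return float(np.sum(diffs ** 2))          # squared-difference prior
    return float(np.sum(np.abs(diffs)))           # L1 variant: changes often exactly 0
```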
11.4 Algorithms to Disentangle Factors of Variation
The goal of building invariant features is to remove sensitivity
of the representation to directions of variance in the data that
are uninformative to the task at hand. However it is often the
case that the goal of feature extraction is the disentangling or
separation of many distinct but informative factors in the data,
e.g., in a video of people: subject identity, action performed,
subject pose relative to the camera, etc. In this situation,
the methods of generating invariant features, such as feature-
pooling, may be inadequate.
The process of building invariant features can be seen as
consisting of two steps. First, low-level features are recovered
that account for the data. Second, subsets of these low level
features are pooled together to form higher-level invariant
features, exemplied by the pooling and subsampling layers
of convolutional neural networks. The invariant representation
formed by the pooling features offers an incomplete window
on the data as the detailed representation of the lower-level
features is abstracted away in the pooling procedure. While
we would like higher-level features to be more abstract and
exhibit greater invariance, we have little control over what
information is lost through pooling. What we really would like
is for a particular feature set to be invariant to the irrelevant
features and disentangle the relevant features. Unfortunately,
it is often difficult to determine a priori which set of features
will ultimately be relevant to the task at hand.
An interesting approach to taking advantage of some of
the factors of variation known to exist in the data is the
transforming auto-encoder (Hinton et al., 2011): instead of a
scalar pattern detector (e.g., corresponding to the probability
of presence of a particular form in the input) one can think
of the features as organized in groups that include both a
pattern detector and pose parameters that specify attributes
of the detected pattern. In (Hinton et al., 2011), what is
assumed a priori is that pairs of examples (or consecutive ones)
are observed with an associated value for the corresponding
change in the pose parameters. For example, an animal that
controls its eyes knows what changes to its ocular motor
system were applied when going from one image on its retina
to the next. In that work, it is also assumed that the pose
changes are the same for all the pattern detectors, and this
makes sense for global changes such as image translation and
camera geometry changes. Instead, we would like to discover
the pose parameters and attributes that should be associated
with each feature detector, without having to specify ahead of
time what they should be, without forcing them to be the same
for all features, and without necessarily having to observe the
changes in all of the pose parameters or attributes.
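The following is a deliberately simplified NumPy sketch of a single "capsule" in the spirit of the transforming auto-encoder: a recognition stage outputs a presence probability and a 2-D pose, the known pose change is added to the inferred pose, and a generation stage predicts the transformed input. The layer sizes, the use of a single capsule and the linear generation stage are our own simplifications and do not reproduce the architecture of Hinton et al. (2011).

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class Capsule:
    """One capsule: a presence probability plus explicit 2-D pose parameters."""

    def __init__(self, n_in, n_hidden=16):
        self.W_rec = rng.normal(scale=0.1, size=(n_in, n_hidden))   # recognition weights
        self.w_pose = rng.normal(scale=0.1, size=(n_hidden, 2))     # hidden -> (x, y) pose
        self.w_prob = rng.normal(scale=0.1, size=n_hidden)          # hidden -> presence prob.
        self.W_gen = rng.normal(scale=0.1, size=(2, n_in))          # pose -> reconstruction

    def forward(self, x, delta_pose):
        h = np.tanh(x @ self.W_rec)          # recognition units
        pose = h @ self.w_pose               # inferred pose of the detected pattern
        p = sigmoid(h @ self.w_prob)         # probability that the pattern is present
        shifted_pose = pose + delta_pose     # apply the *known* pose change
        # Predict the transformed input, gated by the presence probability.
        return p * (shifted_pose @ self.W_gen)

# Example: predict the translated patch from the original plus a known shift.
x = rng.normal(size=64)                      # e.g. a flattened 8x8 image patch
capsule = Capsule(n_in=64)
x_shifted_pred = capsule.forward(x, delta_pose=np.array([1.0, 0.0]))
# Training (not shown) would minimize the squared error between this prediction
# and the actually observed transformed input, summed over many capsules.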
The approach taken recently in the Manifold Tangent Classi-
fier, discussed in section 8.3, is interesting in this respect.
Without any supervision or prior knowledge, it finds prominent
local factors of variation (tangent vectors to the manifold,
extracted from a CAE, interpreted as locally valid input defor-
mations). Higher-level features are subsequently encouraged
to be invariant to these factors of variation, so that they must
depend on other characteristics. In a sense this approach is
disentangling valid local deformations along the data manifold
from other, more drastic changes associated with other factors
of variation such as those that affect class identity.26
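A penalty in this spirit can be written generically as the squared directional derivative of the representation along the extracted tangent directions. The finite-difference sketch below illustrates that idea only; it differs in detail from the training criterion actually used by the Manifold Tangent Classifier (Rifai et al., 2011c), and the toy feature extractor and tangent vectors are illustrative.

import numpy as np

def tangent_invariance_penalty(f, x, tangents, eps=1e-4):
    """Penalize sensitivity of a representation f to local tangent directions.

    f : callable mapping an input vector to a feature vector.
    x : input point, shape (d,).
    tangents : array of shape (k, d); tangent vectors extracted e.g. from a CAE.
    Uses a finite-difference estimate of the directional derivative J_f(x) u.
    """
    penalty = 0.0
    fx = f(x)
    for u in tangents:
        directional = (f(x + eps * u) - fx) / eps    # ~ J_f(x) u
        penalty += np.sum(directional ** 2)
    return penalty

# Example with a toy nonlinear feature extractor and two random tangent directions.
rng = np.random.default_rng(1)
W = rng.normal(size=(10, 5))
f = lambda v: np.tanh(W @ v)
x = rng.normal(size=5)
U = rng.normal(size=(2, 5))
print(tangent_invariance_penalty(f, x, U))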
One solution to the problem of information loss that would
fit within the feature-pooling paradigm is to consider many
overlapping pools of features based on the same low-level
feature set. Such a structure would have the potential to
learn a redundant set of invariant features that may not cause
significant loss of information. However, it is not obvious
what learning principle could be applied that can ensure
that the features are invariant while maintaining as much
information as possible. While a Deep Belief Network or a
Deep Boltzmann Machine (as discussed in sections 4 and 10.2
respectively) with two hidden layers would, in principle, be
able to preserve information into the pooling second hidden
layer, there is no guarantee that the second layer features
are more invariant than the low-level first layer features.
However, there is some empirical evidence that the second
layer of the DBN tends to display more invariance than the
first layer (Erhan et al., 2010a).
A more principled approach, from the perspective of en-
suring a more robust compact feature representation, can
be conceived by reconsidering the disentangling of features
through the lens of its generative equivalent, feature composition.
Since many unsupervised learning algorithms have a
generative interpretation (or a way to reconstruct inputs from
their high-level representation), the generative perspective can
provide insight into how to think about disentangling factors.
26. The changes that affect class identity might, in input space, actually be
of similar magnitude to local deformations, but not follow along the manifold,
i.e., cross zones of low density.
The majority of the models currently used to construct
invariant features have the interpretation that their low-level
features linearly combine to construct the data.27 This is a
fairly rudimentary form of feature composition with significant
limitations. For example, it is not possible to linearly combine
a feature with a generic transformation (such as translation) to
generate a transformed version of the feature. Nor can we even
consider a generic color feature being linearly combined with
a gray-scale stimulus pattern to generate a colored pattern. It
would seem that if we are to take the notion of disentangling
seriously we require a richer interaction of features than that
offered by simple linear combinations.
12 CONCLUSION
This review of representation learning and deep learning has
covered three major and apparently disconnected approaches:
the probabilistic models (both the directed kind such as
sparse coding and the undirected kind such as Boltzmann
machines), the reconstruction-based algorithms related to auto-
encoders, and the geometrically motivated manifold-learning
approaches. Drawing connections between these approaches
is currently a very active area of research and is likely to
continue to produce models and methods that take advantage
of the relative strengths of each paradigm.
Practical Concerns and Guidelines. One of the criticisms
directed at artificial neural networks and deep learning algorithms
is that they have many hyper-parameters and variants
and that exploring their configurations and architectures is an
art. This has motivated an earlier book on the Tricks of the
Trade (Orr and Müller, 1998), of which the chapter by LeCun
et al. (1998a) remains relevant for training deep architectures,
in particular regarding initialization, ill-conditioning and
stochastic gradient descent. A good and more modern compendium
of good training practice, particularly adapted to training RBMs,
is provided in Hinton (2010), while a similar guide oriented
more towards deep neural networks can be found in Bengio
(2013); both are part of a new edition of the above book.
Recent work on automating hyper-parameter
search (Bergstra and Bengio, 2012; Bergstra et al., 2011;
Snoek et al., 2012) is also making it more convenient, efficient
and reproducible.
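For illustration, random search in the spirit of Bergstra and Bengio (2012) amounts to drawing independent configurations from prior distributions over the hyper-parameters and keeping the best one on validation data. The particular hyper-parameters, ranges and stand-in evaluation function below are illustrative assumptions only.

import numpy as np

rng = np.random.default_rng(0)

def sample_config():
    """Draw one hyper-parameter configuration at random (illustrative ranges)."""
    return {
        "learning_rate": 10 ** rng.uniform(-5, -1),   # log-uniform prior
        "n_hidden": int(rng.choice([256, 512, 1024, 2048])),
        "l1_penalty": 10 ** rng.uniform(-7, -3),
        "n_layers": int(rng.integers(1, 5)),
    }

def random_search(evaluate, n_trials=50):
    """Evaluate n_trials independent random configurations and keep the best."""
    best_score, best_config = -np.inf, None
    for _ in range(n_trials):
        config = sample_config()
        score = evaluate(config)              # e.g. validation accuracy
        if score > best_score:
            best_score, best_config = score, config
    return best_config, best_score

# Example with a stand-in evaluation function (a real one would train a model).
fake_eval = lambda c: -abs(np.log10(c["learning_rate"]) + 3) + 0.001 * c["n_hidden"]
print(random_search(fake_eval, n_trials=20))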
Incorporating Generic AI-level Priors. We have covered
many high-level generic priors that we believe could bring
machine learning closer to AI by improving representation
learning. Many of these priors relate to the assumed existence
of multiple underlying factors of variation, whose variations
are in some sense orthogonal to each other. They are expected
to be organized at multiple levels of abstraction, hence the
need for deep architectures, which also have statistical advan-
tages because they allow parameters to be re-used in a
combinatorially efficient way. Only a few of these factors would
typically be relevant for any particular example, justifying
sparsity of representation.
27. As an aside, if we are given only the values of the higher-level pooling
features, we cannot accurately recover the data because we do not know how to
apportion credit for the pooling feature values to the lower-level features. This
is simply the generative version of the consequences of the loss of information
caused by pooling.
These factors are expected to be
related through simple (e.g., linear) dependencies, with subsets of
these explaining different random variables of interest (inputs,
tasks) and varying in structured ways in time and space
(temporal and spatial coherence). We expect future successful
applications of representation learning to refine and increase
that list of priors, and to incorporate most of them instead of
focusing on only one. Research on training criteria that better
take these priors into account is likely to move us closer to
the long-term objective of discovering learning algorithms that
can disentangle the underlying explanatory factors.
Inference. We anticipate that methods based on directly
parametrizing a representation function will incorporate more
and more of the iterative type of computation one finds in the
inference procedures of probabilistic latent-variable models.
There is already movement in the other direction, with prob-
abilistic latent-variable models exploiting approximate infer-
ence mechanisms that are themselves learned (i.e., producing a
parametric description of the representation function). A major
appeal of probabilistic models is that the semantics of the
latent variables are clear and this allows a clean separation
of the problems of modeling (choose the energy function),
inference (estimating P(h|x)), and learning (optimizing the
parameters), using generic tools in each case. On the other
hand, doing approximate inference and not taking that approxi-
mation into account explicitly in the approximate optimization
for learning could have detrimental effects, hence the appeal
of learning approximate inference. More fundamentally, there
is the question of the multimodality of the posterior P(h|x).
If there are exponentially many probable configurations of
values of the factors h_i that can explain x, then we seem
to be stuck with very poor inference, either focusing on a
single mode (MAP inference), assuming some kind of strong
factorization (as in variational inference) or using an MCMC
that cannot visit enough modes of P(h|x). What we propose
as food for thought is the idea of dropping the requirement
of an explicit representation of the posterior and settling for
an implicit representation that exploits potential structure in
P(h|x) in order to represent it compactly: even though P(h|x)
may have an exponential number of modes, it may be possible
to represent it with a small set of numbers. For example,
consider computing a deterministic feature representation f(x)
that implicitly captures the information about a highly multi-
modal P(h|x), in the sense that all the questions (e.g. making
some prediction about some target concept) that can be asked
from P(h|x) can also be answered from f(x).
Optimization. Much remains to be done to better under-
stand the successes and failures of training deep architectures,
both in the supervised case (with many recent successes) and
the unsupervised case (where much more work needs to be
done). Although regularization effects can be important on
small datasets, the effects that persist on very large datasets
suggest some optimization issues are involved. Are they more
due to local minima (we now know there are huge numbers
of them) and the dynamics of the training procedure? Or
are they due mostly to ill-conditioning, which might be handled
by approximate second-order methods? These basic questions
remain unanswered and deserve much more study.
Acknowledgments
The author would like to thank David Warde-Farley, Razvan
Pascanu and Ian Goodfellow for useful feedback, as well as
NSERC, CIFAR and the Canada Research Chairs for funding.
REFERENCES
Alain, G. and Bengio, Y. (2012). What regularized auto-encoders
learn from the data generating distribution. Technical Report Arxiv
report 1211.4246, Université de Montréal.
Amari, S. (1998). Natural gradient works efficiently in learning.
Neural Computation, 10(2), 251–276.
Bach, F., Jenatton, R., Mairal, J., and Obozinski, G. (2011). Struc-
tured sparsity through convex optimization. CoRR, abs/1109.2397.
Bagnell, J. A. and Bradley, D. M. (2009). Differentiable sparse
coding. In NIPS2009, pages 113120.
Baird, H. (1990). Document image defect models. In IAPR Workshop,
Syntactic & Structural Patt. Rec., pages 3846.
Becker, S. and Hinton, G. (1992). A self-organizing neural network
that discovers surfaces in random-dot stereograms. Nature, 355,
161163.
Belkin, M. and Niyogi, P. (2003). Laplacian eigenmaps for dimen-
sionality reduction and data representation. Neural Computation,
15(6), 13731396.
Bell, A. and Sejnowski, T. J. (1997). The independent components
of natural scenes are edge filters. Vision Research, 37, 3327–3338.
Bengio, Y. (1993). A connectionist approach to speech recognition.
International Journal on Pattern Recognition and Artificial Intel-
ligence, 7(4), 647668.
Bengio, Y. (2008). Neural net language models. Scholarpedia, 3(1).
Bengio, Y. (2009). Learning deep architectures for AI. Foundations
and Trends in Machine Learning, 2(1), 1127. Also published as
a book. Now Publishers, 2009.
Bengio, Y. (2011). Deep learning of representations for unsupervised
and transfer learning. In JMLR W&CP: Proc. Unsupervised and
Transfer Learning.
Bengio, Y. (2013). Practical recommendations for gradient-based
training of deep architectures. In K.-R. Müller, G. Montavon, and
G. B. Orr, editors, Neural Networks: Tricks of the Trade. Springer.
Bengio, Y. and Delalleau, O. (2009). Justifying and generalizing
contrastive divergence. Neural Computation, 21(6), 16011621.
Bengio, Y. and Delalleau, O. (2011). On the expressive power of
deep architectures. In ALT2011.
Bengio, Y. and LeCun, Y. (2007). Scaling learning algorithms towards
AI. In L. Bottou, O. Chapelle, D. DeCoste, and J. Weston, editors,
Large Scale Kernel Machines. MIT Press.
Bengio, Y. and Monperrus, M. (2005). Non-local manifold tangent
learning. In NIPS2004, pages 129136. MIT Press.
Bengio, Y., Simard, P., and Frasconi, P. (1994). Learning long-term
dependencies with gradient descent is difficult. IEEE Transactions
on Neural Networks, 5(2), 157166.
Bengio, Y., Ducharme, R., Vincent, P., and Jauvin, C. (2003). A
neural probabilistic language model. JMLR, 3, 11371155.
Bengio, Y., Paiement, J.-F., Vincent, P., Delalleau, O., Le Roux,
N., and Ouimet, M. (2004). Out-of-sample extensions for LLE,
Isomap, MDS, Eigenmaps, and Spectral Clustering. In NIPS2003.
Bengio, Y., Delalleau, O., and Le Roux, N. (2006a). The curse of
highly variable functions for local kernel machines. In NIPS2005.
Bengio, Y., Larochelle, H., and Vincent, P. (2006b). Non-local
manifold Parzen windows. In NIPS2005. MIT Press.
Bengio, Y., Lamblin, P., Popovici, D., and Larochelle, H. (2007).
Greedy layer-wise training of deep networks. In NIPS2006.
Bengio, Y., Louradour, J., Collobert, R., and Weston, J. (2009).
Curriculum learning. In ICML09.
Bengio, Y., Delalleau, O., and Simard, C. (2010). Decision trees
do not generalize to new variations. Computational Intelligence,
26(4), 449467.
Bengio, Y., Alain, G., and Rifai, S. (2012). Implicit density esti-
mation by local moment matching to sample from auto-encoders.
Technical report, arXiv:1207.0057.
Bengio, Y., Mesnil, G., Dauphin, Y., and Rifai, S. (2013). Better
mixing via deep representations. In ICML2013.
Bergstra, J. and Bengio, Y. (2009). Slow, decorrelated features for
pretraining complex cell-like networks. In NIPS2009.
Bergstra, J. and Bengio, Y. (2012). Random search for hyper-
parameter optimization. J. Machine Learning Res., 13, 281305.
Bergstra, J., Bardenet, R., Bengio, Y., and Kégl, B. (2011). Algo-
rithms for hyper-parameter optimization. In NIPS2011.
Berkes, P. and Wiskott, L. (2005). Slow feature analysis yields a
rich repertoire of complex cell properties. Journal of Vision, 5(6),
579602.
Besag, J. (1975). Statistical analysis of non-lattice data. The
Statistician, 24(3), 179195.
Bordes, A., Glorot, X., Weston, J., and Bengio, Y. (2012). Joint
learning of words and meaning representations for open-text
semantic parsing. AISTATS2012.
Boulanger-Lewandowski, N., Bengio, Y., and Vincent, P. (2012).
Modeling temporal dependencies in high-dimensional sequences:
Application to polyphonic music generation and transcription. In
ICML2012.
Boureau, Y., Ponce, J., and LeCun, Y. (2010). A theoretical analysis
of feature pooling in vision algorithms. In ICML10.
Boureau, Y., Le Roux, N., Bach, F., Ponce, J., and LeCun, Y. (2011).
Ask the locals: multi-way local pooling for image recognition. In
ICCV11.
Bourlard, H. and Kamp, Y. (1988). Auto-association by multilayer
perceptrons and singular value decomposition. Biological Cyber-
netics, 59, 291294.
Brand, M. (2003). Charting a manifold. In NIPS2002, pages 961
968. MIT Press.
Breuleux, O., Bengio, Y., and Vincent, P. (2011). Quickly generating
representative samples from an RBM-derived process. Neural
Computation, 23(8), 20532073.
Bruna, J. and Mallat, S. (2011). Classification with scattering
operators. In ICPR2011.
Cadieu, C. and Olshausen, B. (2009). Learning transformational
invariants from natural movies. In NIPS2009, pages 209216.
MIT Press.
Carreira-Perpiñán, M. A. and Hinton, G. E. (2005). On contrastive
divergence learning. In AISTATS2005, pages 3340.
Chen, M., Xu, Z., Weinberger, K. Q., and Sha, F. (2012). Marginalized
denoising autoencoders for domain adaptation. In ICML2012.
Cho, K., Raiko, T., and Ilin, A. (2010). Parallel tempering is efficient
for learning restricted Boltzmann machines. In IJCNN2010.
Cho, K., Raiko, T., and Ilin, A. (2011). Enhanced gradient and
adaptive learning rate for training restricted Boltzmann machines.
In ICML2011, pages 105112.
Ciresan, D., Meier, U., and Schmidhuber, J. (2012). Multi-column
deep neural networks for image classification. Technical report,
arXiv:1202.2745.
Ciresan, D. C., Meier, U., Gambardella, L. M., and Schmidhuber,
J. (2010). Deep big simple neural nets for handwritten digit
recognition. Neural Computation, 22, 114.
Coates, A. and Ng, A. Y. (2011a). The importance of encoding versus
training with sparse coding and vector quantization. In ICML2011.
Coates, A. and Ng, A. Y. (2011b). Selecting receptive fields in deep
networks. In NIPS2011.
Collobert, R. and Weston, J. (2008). A unified architecture for
natural language processing: Deep neural networks with multitask
learning. In ICML2008.
Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K.,
and Kuksa, P. (2011). Natural language processing (almost) from
scratch. Journal of Machine Learning Research, 12, 24932537.
Courville, A., Bergstra, J., and Bengio, Y. (2011a). A spike and slab
restricted Boltzmann machine. In AISTATS2011.
Courville, A., Bergstra, J., and Bengio, Y. (2011b). Unsupervised
models of images by spike-and-slab RBMs. In ICML2011.
Dahl, G. E., Ranzato, M., Mohamed, A., and Hinton, G. E. (2010).
Phone recognition with the mean-covariance restricted Boltzmann
machine. In NIPS2010.
Dahl, G. E., Yu, D., Deng, L., and Acero, A. (2012). Context-
dependent pre-trained deep neural networks for large vocabulary
speech recognition. IEEE Transactions on Audio, Speech, and
Language Processing, 20(1), 3342.
Deng, L., Seltzer, M., Yu, D., Acero, A., Mohamed, A., and Hinton,
G. (2010). Binary coding of speech spectrograms using a deep
auto-encoder. In Interspeech 2010, Makuhari, Chiba, Japan.
Desjardins, G. and Bengio, Y. (2008). Empirical evaluation of
convolutional RBMs for vision. Technical Report 1327, Dept.
IRO, U. Montréal.
Desjardins, G., Courville, A., Bengio, Y., Vincent, P., and Delalleau,
O. (2010). Tempered Markov chain Monte Carlo for training of
restricted Boltzmann machine. In AISTATS2010, volume 9, pages
145152.
Desjardins, G., Courville, A., and Bengio, Y. (2011). On tracking the
partition function. In NIPS2011.
Desjardins, G., Courville, A., and Bengio, Y. (2012). On training
deep Boltzmann machines. Technical Report arXiv:1203.4416v1,
Université de Montréal.
DiCarlo, J., Zoccolan, D., and Rust, N. (2012). How does the brain
solve visual object recognition? Neuron.
Donoho, D. L. and Grimes, C. (2003). Hessian eigenmaps: new
locally linear embedding techniques for high-dimensional data.
Technical Report 2003-08, Dept. Statistics, Stanford University.
Eisner, J. (2012). Learning approximate inference policies for
fast prediction. Keynote talk at ICML Workshop on Inferning:
Interactions Between Search and Learning.
Erhan, D., Courville, A., and Bengio, Y. (2010a). Understanding
representations learned in deep architectures. Technical Report
1355, Université de Montréal/DIRO.
Erhan, D., Bengio, Y., Courville, A., Manzagol, P.-A., Vincent, P.,
and Bengio, S. (2010b). Why does unsupervised pre-training help
deep learning? Journal of Machine Learning Research, 11, 625–660.
Freund, Y. and Haussler, D. (1994). Unsupervised learning of
distributions on binary vectors using two layer networks. Technical
Report UCSC-CRL-94-25, University of California, Santa Cruz.
Fukushima, K. (1980). Neocognitron: A self-organizing neural
network model for a mechanism of pattern recognition unaffected
by shift in position. Biological Cybernetics, 36, 193202.
Glorot, X. and Bengio, Y. (2010). Understanding the difficulty of
training deep feedforward neural networks. In AISTATS2010.
Glorot, X., Bordes, A., and Bengio, Y. (2011a). Deep sparse rectifier
neural networks. In AISTATS2011.
Glorot, X., Bordes, A., and Bengio, Y. (2011b). Domain adaptation
for large-scale sentiment classification: A deep learning approach.
In ICML2011.
Goodfellow, I., Le, Q., Saxe, A., and Ng, A. (2009). Measuring
invariances in deep networks. In NIPS2009, pages 646654.
Goodfellow, I., Courville, A., and Bengio, Y. (2011). Spike-and-
slab sparse coding for unsupervised feature discovery. In NIPS
Workshop on Challenges in Learning Hierarchical Models.
Goodfellow, I. J., Courville, A., and Bengio, Y. (2012). Spike-
and-slab sparse coding for unsupervised feature discovery.
arXiv:1201.3382.
Gregor, K. and LeCun, Y. (2010a). Emergence of complex-like cells
in a temporal product network with local receptive fields. Technical
report, arXiv:1006.0448.
Gregor, K. and LeCun, Y. (2010b). Learning fast approximations of
sparse coding. In ICML2010.
Gregor, K., Szlam, A., and LeCun, Y. (2011). Structured sparse
coding via lateral inhibition. In NIPS2011.
Gribonval, R. (2011). Should penalized least squares regression be
interpreted as Maximum A Posteriori estimation? IEEE Transac-
tions on Signal Processing, 59(5), 24052410.
Grosse, R., Raina, R., Kwong, H., and Ng, A. Y. (2007). Shift-
invariant sparse coding for audio classification. In UAI2007.
Grubb, A. and Bagnell, J. A. D. (2010). Boosted backpropagation
learning for training deep modular networks. In ICML2010.
Gutmann, M. and Hyvärinen, A. (2010). Noise-contrastive estimation:
A new estimation principle for unnormalized statistical models. In
AISTATS2010.
Hamel, P., Lemieux, S., Bengio, Y., and Eck, D. (2011). Temporal
pooling and multiscale learning for automatic annotation and
ranking of music audio. In ISMIR.
Håstad, J. (1986). Almost optimal lower bounds for small depth
circuits. In STOC86, pages 6–20.
Håstad, J. and Goldmann, M. (1991). On the power of small-depth
threshold circuits. Computational Complexity, 1, 113–129.
Henaff, M., Jarrett, K., Kavukcuoglu, K., and LeCun, Y. (2011).
Unsupervised learning of sparse features for scalable audio classifi-
cation. In ISMIR11.
Hinton, G., Krizhevsky, A., and Wang, S. (2011). Transforming auto-
encoders. In ICANN2011.
Hinton, G., Deng, L., Dahl, G. E., Mohamed, A., Jaitly, N., Senior, A.,
Vanhoucke, V., Nguyen, P., Sainath, T., and Kingsbury, B. (2012).
Deep neural networks for acoustic modeling in speech recognition.
IEEE Signal Processing Magazine, 29(6), 8297.
Hinton, G. E. (1986). Learning distributed representations of con-
cepts. In Proc. 8th Conf. Cog. Sc. Society, pages 112.
Hinton, G. E. (1999). Products of experts. In ICANN1999.
Hinton, G. E. (2000). Training products of experts by minimizing
contrastive divergence. Technical Report GCNU TR 2000-004,
Gatsby Unit, University College London.
Hinton, G. E. (2010). A practical guide to training restricted
Boltzmann machines. Technical Report UTML TR 2010-003,
Department of Computer Science, University of Toronto.
Hinton, G. E. and Roweis, S. (2003). Stochastic neighbor embedding.
In NIPS2002.
Hinton, G. E. and Salakhutdinov, R. (2006). Reducing the dimension-
ality of data with neural networks. Science, 313(5786), 504507.
Hinton, G. E. and Zemel, R. S. (1994). Autoencoders, minimum
description length, and helmholtz free energy. In NIPS1993.
Hinton, G. E., Osindero, S., and Teh, Y. (2006). A fast learning
algorithm for deep belief nets. Neural Computation, 18, 1527
1554.
Hubel, D. H. and Wiesel, T. N. (1959). Receptive fields of single
neurons in the cat's striate cortex. Journal of Physiology, 148,
574–591.
Hurri, J. and Hyvärinen, A. (2003). Temporal coherence, natural
image sequences, and the visual cortex. In NIPS2002.
Hyvärinen, A. (2005). Estimation of non-normalized statistical
models using score matching. J. Machine Learning Res., 6.
Hyvärinen, A. (2007). Some extensions of score matching. Compu-
tational Statistics and Data Analysis, 51, 2499–2512.
Hyvärinen, A. (2008). Optimal approximation of signal priors. Neural
Computation, 20(12), 3087–3110.
Hyvärinen, A. and Hoyer, P. (2000). Emergence of phase and
shift invariant features by decomposition of natural images into
independent feature subspaces. Neural Computation, 12(7).
Hyvärinen, A., Karhunen, J., and Oja, E. (2001a). Independent
Component Analysis. Wiley-Interscience.
Hyvärinen, A., Hoyer, P. O., and Inki, M. (2001b). Topographic
independent component analysis. Neural Computation, 13(7),
1527–1558.
Hyvärinen, A., Hurri, J., and Hoyer, P. O. (2009). Natural Image
Statistics: A probabilistic approach to early computational vision.
Springer-Verlag.
Jaeger, H. (2007). Echo state network. Scholarpedia, 2(9), 2330.
Jain, V. and Seung, S. H. (2008). Natural image denoising with
convolutional networks. In NIPS2008.
Jarrett, K., Kavukcuoglu, K., Ranzato, M., and LeCun, Y. (2009).
What is the best multi-stage architecture for object recognition?
In ICCV09.
Jenatton, R., Audibert, J.-Y., and Bach, F. (2009). Structured
variable selection with sparsity-inducing norms. Technical report,
arXiv:0904.3523.
Jutten, C. and Herault, J. (1991). Blind separation of sources, part I:
an adaptive algorithm based on neuromimetic architecture. Signal
Processing, 24, 110.
Kavukcuoglu, K., Ranzato, M., and LeCun, Y. (2008). Fast inference
in sparse coding algorithms with applications to object recognition.
CBLL-TR-2008-12-01, NYU.
Kavukcuoglu, K., Ranzato, M.-A., Fergus, R., and LeCun, Y. (2009).
Learning invariant features through topographic filter maps. In
CVPR2009.
Kavukcuoglu, K., Sermanet, P., Boureau, Y.-L., Gregor, K., Math-
ieu, M., and LeCun, Y. (2010). Learning convolutional feature
hierarchies for visual recognition. In NIPS2010.
Kingma, D. and LeCun, Y. (2010). Regularized estimation of image
statistics by score matching. In NIPS2010.
Kivinen, J. J. and Williams, C. K. I. (2012). Multiple texture
Boltzmann machines. In AISTATS2012.
Körding, K. P., Kayser, C., Einhäuser, W., and König, P. (2004).
How are complex cell properties adapted to the statistics of natural
stimuli? J. Neurophysiology, 91.
Krizhevsky, A. (2010). Convolutional deep belief networks on
CIFAR-10. Technical report, U. Toronto.
Krizhevsky, A. and Hinton, G. (2009). Learning multiple layers of
features from tiny images. Technical report, U. Toronto.
Krizhevsky, A., Sutskever, I., and Hinton, G. (2012). ImageNet classi-
fication with deep convolutional neural networks. In NIPS2012.
Larochelle, H. and Bengio, Y. (2008). Classification using discrimi-
native restricted Boltzmann machines. In ICML2008.
Larochelle, H., Bengio, Y., Louradour, J., and Lamblin, P. (2009).
Exploring strategies for training deep neural networks. Journal of
Machine Learning Research, 10, 140.
Lazebnik, S., Schmid, C., and Ponce, J. (2006). Beyond bags of
features: Spatial pyramid matching for recognizing natural scene
categories. In CVPR2006.
Le, H.-S., Oparin, I., Allauzen, A., Gauvin, J.-L., and Yvon, F. (2013).
Structured output layer neural network language models for speech
recognition. IEEE Trans. Audio, Speech & Language Processing.
Le, Q., Ngiam, J., Chen, Z., hao Chia, D. J., Koh, P. W., and Ng, A.
(2010). Tiled convolutional neural networks. In NIPS2010.
Le, Q., Ngiam, J., Coates, A., Lahiri, A., Prochnow, B., and Ng,
A. (2011a). On optimization methods for deep learning. In
ICML2011.
Le, Q. V., Karpenko, A., Ngiam, J., and Ng, A. Y. (2011b). ICA with
reconstruction cost for efficient overcomplete feature learning. In
NIPS2011.
Le, Q. V., Zou, W. Y., Yeung, S. Y., and Ng, A. Y. (2011c). Learning
hierarchical spatio-temporal features for action recognition with
independent subspace analysis. In CVPR2011.
Le Roux, N., Bengio, Y., Lamblin, P., Joliveau, M., and Kegl, B.
(2008a). Learning the 2-D topology of images. In NIPS07.
Le Roux, N., Manzagol, P.-A., and Bengio, Y. (2008b). Topmoumoute
online natural gradient algorithm. In NIPS07.
LeCun, Y. (1986). Learning processes in an asymmetric threshold
network. In Disordered Systems and Biological Organization,
pages 233240. Springer-Verlag.
LeCun, Y. (1987). Modèles connexionistes de l'apprentissage. Ph.D.
thesis, Université de Paris VI.
LeCun, Y. (1989). Generalization and network design strategies. In
Connectionism in Perspective. Elsevier Publishers.
LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E.,
Hubbard, W., and Jackel, L. D. (1989). Backpropagation applied
to handwritten zip code recognition. Neural Computation.
LeCun, Y., Bottou, L., Orr, G. B., and Müller, K. (1998a). Efficient
backprop. In Neural Networks, Tricks of the Trade.
LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998b). Gradient
based learning applied to document recognition. Proc. IEEE.
Lee, H., Ekanadham, C., and Ng, A. (2008). Sparse deep belief net
model for visual area V2. In NIPS07.
Lee, H., Grosse, R., Ranganath, R., and Ng, A. Y. (2009a). Convolu-
tional deep belief networks for scalable unsupervised learning of
hierarchical representations. In ICML2009.
Lee, H., Pham, P., Largman, Y., and Ng, A. (2009b). Unsupervised
feature learning for audio classication using convolutional deep
belief networks. In NIPS2009.
Lin, Y., Tong, Z., Zhu, S., and Yu, K. (2010). Deep coding network.
In NIPS2010.
Lowe, D. (1999). Object recognition from local scale invariant
features. In ICCV99.
Mallat, S. (2012). Group invariant scattering. Communications on
Pure and Applied Mathematics.
Marlin, B. and de Freitas, N. (2011). Asymptotic efficiency of
deterministic estimators for discrete energy-based models: Ratio
matching and pseudolikelihood. In UAI2011.
Marlin, B., Swersky, K., Chen, B., and de Freitas, N. (2010).
Inductive principles for restricted Boltzmann machine learning. In
AISTATS2010, pages 509516.
Martens, J. (2010). Deep learning via Hessian-free optimization. In
ICML2010, pages 735742.
Martens, J. and Sutskever, I. (2011). Learning recurrent neural
networks with Hessian-free optimization. In ICML2011.
Memisevic, R. and Hinton, G. E. (2010). Learning to represent spatial
transformations with factored higher-order Boltzmann machines.
Neural Comp., 22(6).
Mesnil, G., Dauphin, Y., Glorot, X., Rifai, S., Bengio, Y., Goodfellow,
I., Lavoie, E., Muller, X., Desjardins, G., Warde-Farley, D., Vin-
cent, P., Courville, A., and Bergstra, J. (2011). Unsupervised and
transfer learning challenge: a deep learning approach. In JMLR
W&CP: Proc. Unsupervised and Transfer Learning, volume 7.
Mikolov, T., Deoras, A., Kombrink, S., Burget, L., and Cernocky,
J. (2011). Empirical evaluation and combination of advanced
language modeling techniques. In INTERSPEECH2011.
Mobahi, H., Collobert, R., and Weston, J. (2009). Deep learning from
temporal coherence in video. In ICML2009.
Mohamed, A., Dahl, G., and Hinton, G. (2012). Acoustic modeling
using deep belief networks. IEEE Trans. on Audio, Speech and
Language Processing, 20(1), 1422.
Montufar, G. F. and Morton, J. (2012). When does a mixture
of products contain a product of mixtures? Technical report,
arXiv:1206.0387.
Murray, I. and Salakhutdinov, R. (2009). Evaluating probabilities
under high-dimensional latent variable models. In NIPS2008,
pages 11371144.
Nair, V. and Hinton, G. E. (2010). Rectified linear units improve
restricted Boltzmann machines. In ICML10.
Neal, R. M. (1992). Connectionist learning of belief networks.
Artificial Intelligence, 56, 71–113.
Neal, R. M. (1993). Probabilistic inference using Markov chain
Monte-Carlo methods. Technical Report CRG-TR-93-1, Dept. of
Computer Science, University of Toronto.
Ngiam, J., Chen, Z., Koh, P., and Ng, A. (2011). Learning deep
energy models. In Proc. ICML2011. ACM.
Olshausen, B. A. and Field, D. J. (1996). Emergence of simple-
cell receptive field properties by learning a sparse code for natural
images. Nature, 381, 607–609.
Orr, G. and Müller, K.-R., editors (1998). Neural networks: tricks of
the trade. Lect. Notes Comp. Sc. Springer-Verlag.
Pascanu, R. and Bengio, Y. (2013). Natural gradient revisited.
Technical report, arXiv:1301.3584.
Raiko, T., Valpola, H., and LeCun, Y. (2012). Deep learning made
easier by linear transformations in perceptrons. In AISTATS2012.
Raina, R., Battle, A., Lee, H., Packer, B., and Ng, A. Y. (2007).
Self-taught learning: transfer learning from unlabeled data. In
ICML2007.
Ranzato, M. and Hinton, G. H. (2010). Modeling pixel means and
covariances using factorized third-order Boltzmann machines. In
CVPR2010, pages 25512558.
Ranzato, M., Poultney, C., Chopra, S., and LeCun, Y. (2007). Efficient
learning of sparse representations with an energy-based model. In
NIPS2006.
Ranzato, M., Boureau, Y., and LeCun, Y. (2008). Sparse feature
learning for deep belief networks. In NIPS2007.
Ranzato, M., Krizhevsky, A., and Hinton, G. (2010a). Factored 3-
way restricted Boltzmann machines for modeling natural images.
In AISTATS2010, pages 621628.
Ranzato, M., Mnih, V., and Hinton, G. (2010b). Generating more
realistic images using gated MRFs. In NIPS2010.
Ranzato, M., Susskind, J., Mnih, V., and Hinton, G. (2011). On deep
generative models with applications to recognition. In CVPR2011.
Riesenhuber, M. and Poggio, T. (1999). Hierarchical models of object
recognition in cortex. Nature Neuroscience.
Rifai, S., Vincent, P., Muller, X., Glorot, X., and Bengio, Y. (2011a).
Contractive auto-encoders: Explicit invariance during feature ex-
traction. In ICML2011.
Rifai, S., Mesnil, G., Vincent, P., Muller, X., Bengio, Y., Dauphin,
Y., and Glorot, X. (2011b). Higher order contractive auto-encoder.
In ECML PKDD.
Rifai, S., Dauphin, Y., Vincent, P., Bengio, Y., and Muller, X. (2011c).
The manifold tangent classifier. In NIPS2011.
Rifai, S., Bengio, Y., Dauphin, Y., and Vincent, P. (2012). A
generative process for sampling contractive auto-encoders. In
ICML2012.
Roweis, S. (1997). EM algorithms for PCA and sensible PCA. CNS
Technical Report CNS-TR-97-02, Caltech.
Roweis, S. and Saul, L. K. (2000). Nonlinear dimensionality
reduction by locally linear embedding. Science, 290(5500).
Salakhutdinov, R. (2010a). Learning deep Boltzmann machines using
adaptive MCMC. In ICML2010.
Salakhutdinov, R. (2010b). Learning in Markov random elds using
tempered transitions. In NIPS2010.
Salakhutdinov, R. and Hinton, G. E. (2007). Semantic hashing. In
SIGIR2007.
Salakhutdinov, R. and Hinton, G. E. (2009). Deep Boltzmann
machines. In AISTATS2009, pages 448455.
Salakhutdinov, R. and Larochelle, H. (2010). Efficient learning of
deep Boltzmann machines. In AISTATS2010.
Salakhutdinov, R., Mnih, A., and Hinton, G. E. (2007). Restricted
Boltzmann machines for collaborative filtering. In ICML 2007.
Savard, F. (2011). Réseaux de neurones à relaxation entraînés par
critère d'autoencodeur débruitant. Master's thesis, U. Montréal.
Schmah, T., Hinton, G. E., Zemel, R., Small, S. L., and Strother,
S. (2009). Generative versus discriminative training of RBMs for
classication of fMRI images. In NIPS2008, pages 14091416.
Schölkopf, B., Smola, A., and Müller, K.-R. (1998). Nonlinear
component analysis as a kernel eigenvalue problem. Neural
Computation, 10, 1299–1319.
Schwenk, H., Rousseau, A., and Attik, M. (2012). Large, pruned or
continuous space language models on a gpu for statistical machine
translation. In Workshop on the future of language modeling for
HLT.
Seide, F., Li, G., and Yu, D. (2011a). Conversational speech
transcription using context-dependent deep neural networks. In
Interspeech 2011, pages 437440.
Seide, F., Li, G., and Yu, D. (2011b). Feature engineering in
context-dependent deep neural networks for conversational speech
transcription. In ASRU2011.
Serre, T., Wolf, L., Bileschi, S., and Riesenhuber, M. (2007). Robust
object recognition with cortex-like mechanisms. IEEE Trans.
Pattern Anal. Mach. Intell., 29(3), 411426.
Seung, S. H. (1998). Learning continuous attractors in recurrent
networks. In NIPS1997.
Simard, D., Steinkraus, P. Y., and Platt, J. C. (2003). Best practices
for convolutional neural networks. In ICDAR2003.
Simard, P., Victorri, B., LeCun, Y., and Denker, J. (1992). Tangent
prop - A formalism for specifying selected invariances in an
adaptive network. In NIPS1991.
Simard, P. Y., LeCun, Y., and Denker, J. (1993). Efficient pattern
recognition using a new transformation distance. In NIPS92.
Smolensky, P. (1986). Information processing in dynamical systems:
Foundations of harmony theory. In D. E. Rumelhart and J. L.
McClelland, editors, Parallel Distributed Processing, volume 1,
chapter 6, pages 194281. MIT Press, Cambridge.
Snoek, J., Larochelle, H., and Adams, R. P. (2012). Practical Bayesian
optimization of machine learning algorithms. In NIPS2012.
Socher, R., Huang, E. H., Pennington, J., Ng, A. Y., and Manning,
C. D. (2011a). Dynamic pooling and unfolding recursive autoen-
coders for paraphrase detection. In NIPS2011.
Socher, R., Pennington, J., Huang, E. H., Ng, A. Y., and Manning,
C. D. (2011b). Semi-supervised recursive autoencoders for pre-
dicting sentiment distributions. In EMNLP2011.
Srivastava, N. and Salakhutdinov, R. (2012). Multimodal learning
with deep Boltzmann machines. In NIPS2012.
Stoyanov, V., Ropson, A., and Eisner, J. (2011). Empirical risk
minimization of graphical model parameters given approximate
inference, decoding, and model structure. In AISTATS2011.
Sutskever, I. (2012). Training Recurrent Neural Networks. Ph.D.
thesis, Departement of computer science, University of Toronto.
Sutskever, I. and Tieleman, T. (2010). On the Convergence Properties
of Contrastive Divergence. In AISTATS2010.
Sutskever, I., Hinton, G., and Taylor, G. (2009). The recurrent
temporal restricted Boltzmann machine. In NIPS2008.
Swersky, K. (2010). Inductive Principles for Learning Restricted
Boltzmann Machines. Master's thesis, University of British
Columbia.
Swersky, K., Ranzato, M., Buchman, D., Marlin, B., and de Freitas,
N. (2011). On score matching for energy based models: Gen-
eralizing autoencoders and simplifying deep learning. In Proc.
ICML2011. ACM.
Taylor, G. and Hinton, G. (2009). Factored conditional restricted
Boltzmann machines for modeling motion style. In ICML2009.
Taylor, G., Fergus, R., LeCun, Y., and Bregler, C. (2010). Convolu-
tional learning of spatio-temporal features. In ECCV10.
Tenenbaum, J., de Silva, V., and Langford, J. C. (2000). A global
geometric framework for nonlinear dimensionality reduction. Sci-
ence, 290(5500), 23192323.
Tieleman, T. (2008). Training restricted Boltzmann machines using
approximations to the likelihood gradient. In ICML2008, pages
10641071.
Tieleman, T. and Hinton, G. (2009). Using fast weights to improve
persistent contrastive divergence. In ICML2009.
Tipping, M. E. and Bishop, C. M. (1999). Probabilistic principal
components analysis. J. Roy. Stat. Soc. B, (3).
Turaga, S. C., Murray, J. F., Jain, V., Roth, F., Helmstaedter, M.,
Briggman, K., Denk, W., and Seung, H. S. (2010). Convolutional
networks can learn to generate affinity graphs for image segmen-
tation. Neural Computation, 22, 511–538.
van der Maaten, L. (2009). Learning a parametric embedding by
preserving local structure. In AISTATS2009.
van der Maaten, L. and Hinton, G. E. (2008). Visualizing data using
t-SNE. J. Machine Learning Res., 9.
Vincent, P. (2011). A connection between score matching and
denoising autoencoders. Neural Computation, 23(7).
Vincent, P. and Bengio, Y. (2003). Manifold Parzen windows. In
NIPS2002. MIT Press.
Vincent, P., Larochelle, H., Bengio, Y., and Manzagol, P.-A. (2008).
Extracting and composing robust features with denoising autoen-
coders. In ICML 2008.
Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol,
P.-A. (2010). Stacked denoising autoencoders: Learning useful
representations in a deep network with a local denoising criterion.
J. Machine Learning Res., 11.
Weinberger, K. Q. and Saul, L. K. (2004). Unsupervised learning of
image manifolds by semidefinite programming. In CVPR2004,
pages 988–995.
Welling, M. (2009). Herding dynamic weights for partially observed
random field models. In UAI2009.
Welling, M., Hinton, G. E., and Osindero, S. (2003). Learning sparse
topographic representations with products of Student-t distribu-
tions. In NIPS2002.
Weston, J., Ratle, F., and Collobert, R. (2008). Deep learning via
semi-supervised embedding. In ICML 2008.
Weston, J., Bengio, S., and Usunier, N. (2010). Large scale image
annotation: learning to rank with joint word-image embeddings.
Machine Learning, 81(1), 2135.
Wiskott, L. and Sejnowski, T. (2002). Slow feature analysis: Un-
supervised learning of invariances. Neural Computation, 14(4),
715770.
Younes, L. (1999). On the convergence of Markovian stochastic
algorithms with rapidly decreasing ergodicity rates. Stochastics
and Stochastic Reports, 65(3), 177228.
Yu, D., Wang, S., and Deng, L. (2010). Sequential labeling using
deep-structured conditional random fields. IEEE Journal of Se-
lected Topics in Signal Processing.
Yu, K. and Zhang, T. (2010). Improved local coordinate coding using
local tangents. In ICML2010.
Yu, K., Zhang, T., and Gong, Y. (2009). Nonlinear learning using
local coordinate coding. In NIPS2009.
Yu, K., Lin, Y., and Lafferty, J. (2011). Learning image representa-
tions from the pixel level via hierarchical sparse coding. In CVPR.
Yuille, A. L. (2005). The convergence of contrastive divergences. In
NIPS2004, pages 15931600.
Zeiler, M., Krishnan, D., Taylor, G., and Fergus, R. (2010). Decon-
volutional networks. In CVPR2010.
Zou, W. Y., Ng, A. Y., and Yu, K. (2011). Unsupervised learning
of visual invariance with temporal coherence. In NIPS 2011
Workshop on Deep Learning and Unsupervised Feature Learning.