Chapter 15

Linear Factor Models and Auto-Encoders

Linear factor models are generative unsupervised learning models in which we imagine that some unobserved factors h explain the observed variables x through a linear transformation. Auto-encoders are unsupervised learning methods that learn a representation of the data, typically obtained by a non-linear parametric transformation of the data, i.e., from x to h, typically with a feedforward neural network, but not necessarily. They typically also learn a transformation going backwards, from the representation to the data, from h to x, like the linear factor models. Linear factor models therefore only specify a parametric decoder, whereas auto-encoders also specify a parametric encoder. Some linear factor models, like PCA, actually correspond to an auto-encoder (a linear one), but for others the encoder is implicitly defined via an inference mechanism that searches for an h that could have generated the observed x.

The idea of auto-encoders has been part of the historical landscape of neural networks for decades (LeCun, 1987; Bourlard and Kamp, 1988; Hinton and Zemel, 1994) but has really picked up speed in recent years. They remained somewhat marginal for many years, in part due to what was an incomplete understanding of the mathematical interpretation and geometrical underpinnings of auto-encoders, which are developed further in Chapter 17 and Section 20.12.

An auto-encoder is simply a neural network that tries to copy its input to its output. The architecture of an auto-encoder is typically decomposed into the following parts, illustrated in Figure 15.1:

• an input, x
• an encoder function f
• a code or internal representation h = f(x)
• a decoder function g
• an output, also called the reconstruction, r = g(h) = g(f(x))
• a loss function L computing a scalar L(r, x) measuring how good a reconstruction r is of the given input x. The objective is to minimize the expected value of L over the training set of examples {x}. (A minimal code sketch assembling these parts follows Figure 15.1.)

[Figure 15.1 diagram: input x → Encoder f → code h → Decoder g → reconstruction r]

Figure 15.1: General schema of an auto-encoder, mapping an input x to an output (called the reconstruction) r through an internal representation or code h. The auto-encoder has two components: the encoder f (mapping x to h) and the decoder g (mapping h to r).
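To make the pieces above concrete, here is a minimal NumPy sketch (an illustration, not code from the text) that assembles an encoder f, a decoder g, and a squared-error loss L for a one-hidden-layer auto-encoder. The sigmoid non-linearity, layer sizes, and random data are illustrative assumptions.

```python
import numpy as np

rng = np.random.RandomState(0)
n_in, n_code = 6, 3                      # input and code dimensions (arbitrary)

# Encoder parameters (for f) and decoder parameters (for g).
W_enc = rng.randn(n_code, n_in) * 0.1
b_enc = np.zeros(n_code)
W_dec = rng.randn(n_in, n_code) * 0.1
b_dec = np.zeros(n_in)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def f(x):
    """Encoder: maps input x to code h."""
    return sigmoid(W_enc @ x + b_enc)

def g(h):
    """Decoder: maps code h to reconstruction r."""
    return W_dec @ h + b_dec

def L(r, x):
    """Reconstruction loss: squared error between r and x."""
    return np.sum((r - x) ** 2)

x = rng.randn(n_in)                      # a training example
h = f(x)                                 # code / internal representation
r = g(h)                                 # reconstruction
print("loss L(r, x) =", L(r, x))
```

Training would then adjust the encoder and decoder parameters to minimize the expected value of this loss over the training set.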
15.1 Regularized Auto-Encoders

Predicting the input may sound useless: what could prevent the auto-encoder from simply copying its input into its output? In the 20th century, this was achieved by constraining the architecture of the auto-encoder to avoid this, by forcing the dimension of the code h to be smaller than the dimension of the input x.
[Figure 15.2 diagram: an auto-encoder with an undercomplete code bottleneck h contrasted with one whose code h is overcomplete; both map an input x through an encoder to the code h and through a decoder to a reconstruction r.]

Figure 15.2 illustrates the two typical cases of auto-encoders: undercomplete (with the dimension of the representation h smaller than the dimension of the input x), and overcomplete (with the dimension of h larger than that of x). Whereas early work with auto-encoders, just like PCA, uses an undercomplete bottleneck in the sequence of layers to avoid learning the identity function, more recent work allows overcomplete representations. What we have learned in recent years is that it is possible to make the auto-encoder meaningfully capture the structure of the input distribution even if the representation is overcomplete, with other forms of constraint or regularization. In fact, once you realize that auto-encoders can capture the input distribution (indirectly, not as an explicit probability function), you also realize that the auto-encoder should need more capacity as one increases the complexity of the distribution to be captured (and the amount of data available): its capacity should not be limited by the input dimension. This is a problem in particular with shallow auto-encoders, which have a single hidden layer (for the code). Indeed, that hidden layer size controls both the dimensionality reduction constraint (the code size at the bottleneck) and the capacity (which allows a more complex distribution to be learned).

Besides the bottleneck constraint, alternative constraints or regularization methods have been explored that can guarantee that the auto-encoder does something useful and does not just learn some trivial identity-like function:
• Sparsity of the representation or of its derivative: even if the intermediate representation has a very high dimensionality, the effective local dimensionality (the number of degrees of freedom that capture a coordinate system among the probable x's) could be much smaller if most of the elements of h are zero (or any other constant, such that ||∂h_i/∂x|| is close to zero). When ||∂h_i/∂x|| is close to zero, h_i does not participate in encoding local changes in x. There is a geometrical interpretation of this situation in terms of manifold learning that is discussed in more depth in Chapter 17. The discussion in Chapter 16 also explains how an auto-encoder naturally tends towards learning a coordinate system for the actual factors of variation in the data. At least four types of auto-encoders clearly fall in this category of sparse representation:
  – Sparse coding (Olshausen and Field, 1996) has been heavily studied as an unsupervised feature learning and feature inference mechanism. It is a linear factor model rather than an auto-encoder, because it has no explicit parametric encoder, and instead uses an iterative optimization procedure to compute the maximally likely code. Sparse coding looks for representations that are both sparse and explain the input through the decoder. Instead of the code being a parametric function of the input, it is considered a free variable that is obtained through an optimization, i.e., a particular form of inference:

        h = f(x) = arg min_h L(g(h), x) + Ω(h)      (15.1)

    where Ω(h) is a sparsity-inducing penalty, such as the L1 penalty λ Σ_i |h_i| (i.e., where h has a Laplace prior), the penalty λ Σ_i log(1 + h_i²) (i.e., where h has a Student-t prior density), or the KL-divergence penalty (Lee et al., 2008; Goodfellow et al., 2009; Larochelle and Bengio, 2008a) −Σ_i (t log h_i + (1 − t) log(1 − h_i)), with a target sparsity level t, for h_i ∈ (0, 1), e.g., obtained through a sigmoid non-linearity. (A small numerical sketch of this inference step appears right after this list.)
  – Contractive autoencoders (Rifai et al., 2011b), covered in Section 15.10, explicitly penalize ||∂h/∂x||²_F, i.e., the sum of the squared norms of the vectors ∂h_i(x)/∂x (each indicating how much each hidden unit h_i responds to changes in x, and what direction of change in x that unit is most sensitive to, around a particular x). With such a regularization penalty, the auto-encoder is called contractive because the mapping from input x to representation h is encouraged to be contractive, i.e., to have small derivatives in all directions. Note that a sparsity regularization indirectly leads to a contractive mapping as well, when the non-linearity used happens to have a zero derivative at h_i = 0 (which is the case for the sigmoid non-linearity).
• Robustness to injected noise or missing information: if noise is injected in the inputs or hidden units, or if some inputs are missing, while the neural network is asked to reconstruct the clean and complete input, then it cannot simply learn the identity function. It has to capture the structure of the data distribution in order to optimally perform this reconstruction. Such auto-encoders are called denoising auto-encoders and are discussed in more detail in Section 15.9.
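As promised after the sparse coding item above, here is a small NumPy sketch of the inference in Eq. 15.1 for the common special case of a linear decoder g(h) = W h, a squared reconstruction error (scaled by ½ for convenience) and an L1 penalty λ Σ_i |h_i|. The ISTA-style proximal gradient update, the step size, and the toy dimensions are assumptions made for illustration, not an algorithm prescribed by the text.

```python
import numpy as np

rng = np.random.RandomState(0)
n_in, n_code = 8, 16                          # overcomplete code (illustrative)
W = rng.randn(n_in, n_code) / np.sqrt(n_code) # decoder weights ("dictionary")
x = rng.randn(n_in)
lam = 0.1                                     # sparsity strength

def soft_threshold(v, t):
    """Proximal operator of t * |v|_1: shrinks entries towards exact zeros."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

# Inference: h = argmin_h 0.5 * ||x - W h||^2 + lam * |h|_1, by iterative
# shrinkage-thresholding: gradient step on the quadratic term, then shrink.
step = 1.0 / np.linalg.norm(W, 2) ** 2        # step size from the spectral norm
h = np.zeros(n_code)
for _ in range(200):
    grad = W.T @ (W @ h - x)                  # gradient of the reconstruction term
    h = soft_threshold(h - step * grad, step * lam)

print("non-zeros in code:", np.count_nonzero(h), "of", n_code)
print("reconstruction error:", np.sum((x - W @ h) ** 2))
```

Note how the soft-thresholding step produces exact zeros in h, which is the sparsity that the penalty is meant to encourage.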
15.2 Denoising Auto-encoders

There is a tight connection between the denoising auto-encoders and the contractive auto-encoders: it can be shown (Alain and Bengio, 2013) that in the limit of small Gaussian injected input noise, the denoising reconstruction error is equivalent to a contractive penalty on the reconstruction function that maps x to r = g(f(x)). In other words, since both x and its slightly corrupted version must be mapped to the same clean x, the reconstruction function is encouraged to be insensitive to small changes of its input, i.e., to be contractive.¹

¹ A function φ(x) is contractive if ||φ(x) − φ(y)|| < ||x − y|| for nearby x and y, or equivalently if its derivative satisfies ||φ'(x)|| < 1.
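A minimal sketch (an illustration under stated assumptions, not code from the text) of the denoising setup discussed above: small Gaussian noise is added to the input, while the squared reconstruction error is still measured against the clean input. The tiny tanh encoder, linear decoder, noise level and data are assumed for the example.

```python
import numpy as np

rng = np.random.RandomState(0)
n_in, n_code, sigma = 5, 3, 0.05          # sigma: std. dev. of injected noise

W_enc = rng.randn(n_code, n_in) * 0.1
W_dec = rng.randn(n_in, n_code) * 0.1

def f(x):                                  # encoder
    return np.tanh(W_enc @ x)

def g(h):                                  # decoder
    return W_dec @ h

x = rng.randn(n_in)                        # clean example
x_noisy = x + sigma * rng.randn(n_in)      # corrupted input fed to the network

r = g(f(x_noisy))                          # reconstruction of the corrupted input
denoising_loss = np.sum((r - x) ** 2)      # ... compared against the CLEAN x
print("denoising reconstruction error:", denoising_loss)
```

Training on this loss forces the reconstruction to map noisy points back towards the clean data, which is exactly the contractive behaviour described above in the small-noise limit.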
15.3 Representational Power, Layer Size and Depth

Autoencoders are often trained with only a single layer encoder and a single layer decoder. However, this is not a requirement, and using deep encoders and decoders offers many advantages.

Recall from Sec. 6.6 that there are many advantages to depth in a feed-forward network. Because auto-encoders are feed-forward networks, these advantages also apply to auto-encoders. Moreover, the encoder is itself a feed-forward network, as is the decoder, so each of these components of the auto-encoder can individually benefit from depth.

One major advantage of non-trivial depth is that the universal approximator theorem guarantees that a feedforward neural network with at least one hidden layer can represent an approximation of any function (within a broad class) to an arbitrary degree of accuracy, provided that it has enough hidden units. This means that an autoencoder with a single hidden layer is able to represent the identity function along the domain of the data arbitrarily well. However, the mapping from input to code is shallow. This means that we are not able to enforce arbitrary constraints, such as that the code should be sparse. A deep autoencoder, with at least one additional hidden layer inside the encoder itself, can approximate any mapping from input to code arbitrarily well, given enough hidden units.

The above viewpoint also motivates overcomplete autoencoders, that is, autoencoders with very wide layers, in order to achieve a rich family of possible functions.

Depth can exponentially reduce the computational cost of evaluating a representation of some functions, and can also exponentially decrease the amount of training data needed to learn some functions.

Experimentally, deep auto-encoders yield much better compression than corresponding shallow or linear auto-encoders (Hinton and Salakhutdinov, 2006).

A common strategy for training a deep autoencoder is to greedily pre-train the deep architecture by training a stack of shallow auto-encoders, so we often encounter shallow auto-encoders, even when the ultimate goal is to train a deep auto-encoder.
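Below is a hedged sketch of the greedy layer-wise strategy mentioned in the last paragraph: each shallow auto-encoder is trained on the codes produced by the previous one, and the trained encoders are then stacked to form a deep encoder. To keep the example short, each shallow (linear) auto-encoder is "trained" in closed form using PCA as a stand-in for gradient-based training; the sizes and data are arbitrary assumptions.

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(200, 20)                     # toy data: 200 examples, 20 features
layer_sizes = [12, 6]                      # code sizes of the two stacked AEs

def train_shallow_autoencoder(data, n_code):
    """Stand-in for training a shallow linear auto-encoder: its optimal encoder
    spans the top principal directions, so we take them from an SVD."""
    centered = data - data.mean(axis=0)
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    return Vt[:n_code]                     # encoder weights (n_code x n_in)

# Greedy pre-training: each new shallow AE is fit on the previous layer's codes.
encoders, codes = [], X
for n_code in layer_sizes:
    W = train_shallow_autoencoder(codes, n_code)
    encoders.append(W)
    codes = codes @ W.T                    # representation fed to the next AE

# The stacked encoders now define a deep encoder x -> h.
def deep_encoder(x):
    h = x
    for W in encoders:
        h = h @ W.T
    return h

print("deep code shape:", deep_encoder(X).shape)   # (200, 6)
```

In practice each stage would be a non-linear auto-encoder trained by gradient descent, and the stacked network would then be fine-tuned end to end.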
15.4 Reconstruction Distribution

The above parts (encoder function f, decoder function g, reconstruction loss L) make sense when the loss L is simply the squared reconstruction error, but there are many cases where this is not appropriate, e.g., when x is a vector of discrete variables or when P(x | h) is not well approximated by a Gaussian distribution.¹ Just like in the case of other types of neural networks (starting with the feedforward neural networks, Section 6.3.2), it is convenient to define the loss L as a negative log-likelihood over some target random variables. This probabilistic interpretation is particularly important for the discussion in Sections 20.9.3, 20.11 and 20.12 about generative extensions of auto-encoders and stochastic recurrent networks, where the output of the auto-encoder is interpreted as a probability distribution P(x | h) for reconstructing x, given hidden units h. This distribution captures not just the expected reconstruction but also the uncertainty about the original x (which gave rise to h, either deterministically or stochastically, given h). In the simplest and most ordinary cases, this distribution factorizes, i.e., P(x | h) = Π_i P(x_i | h). This covers the usual cases of x_i | h being Gaussian (for unbounded real values) and x_i | h having a Bernoulli distribution (for binary values x_i), but one can readily generalize this to other distributions, such as mixtures (see Sections 3.10.6 and 6.3.2).

Thus we can generalize the notion of decoding function g(h) to a decoding distribution P(x | h). Similarly, we can generalize the notion of encoding function f(x) to an encoding distribution Q(h | x), as illustrated in Figure 15.3. We use this to capture the fact that noise is injected at the level of the representation h, now considered like a latent variable. This generalization is crucial in the development of the variational auto-encoder (Section 20.9.3) and the generalized stochastic networks (Section 20.12).

We also find a stochastic encoder and a stochastic decoder in the RBM, described in Section 20.2. In that case, the encoding distribution Q(h | x) and P(x | h) match, in the sense that Q(h | x) = P(h | x), i.e., there is a unique joint distribution which has both Q(h | x) and P(x | h) as conditionals. This is not true in general for two independently parametrized conditionals like Q(h | x) and P(x | h), although the work on generative stochastic networks (Alain et al., 2015) shows that learning will tend to make them compatible asymptotically (with enough capacity and examples).

¹ See the link between squared error and normal density in Sections 5.6 and 6.3.2.
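To illustrate the reconstruction distributions just discussed, here is a small sketch (an illustration with assumed values, not the book's code) computing the negative log-likelihood −log P(x | h) for two common factorized choices: a unit-variance Gaussian (which recovers squared error up to a constant) and a Bernoulli (which gives cross-entropy).

```python
import numpy as np

def gaussian_nll(x, mean):
    """-log P(x|h) for factorized Gaussian P(x_i|h) = N(mean_i, 1):
    half the squared error plus a constant."""
    return 0.5 * np.sum((x - mean) ** 2) + 0.5 * len(x) * np.log(2 * np.pi)

def bernoulli_nll(x, p):
    """-log P(x|h) for factorized Bernoulli with means p: cross-entropy."""
    return -np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))

# Real-valued x: the decoder output is interpreted as the Gaussian mean.
x_real = np.array([0.2, -1.3, 0.7])
decoder_mean = np.array([0.1, -1.0, 0.5])
print("Gaussian NLL :", gaussian_nll(x_real, decoder_mean))

# Binary x: decoder outputs in (0, 1) are interpreted as Bernoulli means.
x_bin = np.array([1.0, 0.0, 1.0])
decoder_p = np.array([0.9, 0.2, 0.6])
print("Bernoulli NLL:", bernoulli_nll(x_bin, decoder_p))
```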
15.5 Linear Factor Models

Now that we have introduced the notion of a probabilistic decoder, let us focus on a very special case where the latent variable h generates x via a linear transformation plus noise, i.e., classical linear factor models, which do not necessarily have a corresponding parametric encoder.

The idea of discovering explanatory factors that have a simple joint distribution among themselves is old, e.g., see Factor Analysis (see below), and has been explored first in the context where the relationship between factors and data is linear, i.e., we assume that the data was generated as follows. First, sample the real-valued factors,

    h ∼ P(h),      (15.2)

and then sample the real-valued observable variables given the factors:

    x = W h + b + noise,      (15.3)

where the noise is typically Gaussian and diagonal (independent across the dimensions of x).
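A short sketch of the ancestral sampling process just described (Eqs. 15.2–15.3), under the common assumption of a standard Gaussian prior on h and diagonal Gaussian noise; the particular sizes and parameter values are arbitrary.

```python
import numpy as np

rng = np.random.RandomState(0)
d_h, d_x = 3, 5                          # number of factors and observed variables
W = rng.randn(d_x, d_h)                  # factor loadings
b = rng.randn(d_x)                       # offset
noise_std = 0.1 * np.ones(d_x)           # per-variable noise (diagonal Gaussian)

# Eq. 15.2: sample the factors h ~ P(h) (here a standard Gaussian).
h = rng.randn(d_h)

# Eq. 15.3: sample the observables x = W h + b + noise.
x = W @ h + b + noise_std * rng.randn(d_x)
print("sampled factors h:", h)
print("sampled observation x:", x)
```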
15.6 Probabilistic PCA and Factor Analysis

Probabilistic PCA (Principal Components Analysis), factor analysis and other linear factor models are special cases of the above equations (15.2 and 15.3) and only differ in the choices made for the prior (over latent variables, not parameters) and noise distributions.

In factor analysis (Bartholomew, 1987; Basilevsky, 1994), the latent variable prior is just the unit variance Gaussian

    h ∼ N(0, I),

while the observed variables x_i are assumed to be conditionally independent, given h, i.e., the noise is assumed to be coming from a diagonal covariance Gaussian distribution, with covariance matrix ψ = diag(σ²), where σ² = (σ²_1, σ²_2, ...) is a vector of per-variable variances.

The role of the latent variables is thus to capture the dependencies between the different observed variables x_i. Indeed, it can easily be shown that x is just a Gaussian-distributed (multivariate normal) random variable, with

    x ∼ N(b, W Wᵀ + ψ),

where we see that the weights W induce a dependency between two variables x_i and x_j through a kind of auto-encoder path, whereby x_i influences ĥ_k = W_k x via w_{ki} (for every k) and ĥ_k influences x_j via w_{kj}.

In order to cast PCA in a probabilistic framework, we can make a slight modification to the factor analysis model, making the conditional variances equal to each other. In that case the covariance of x is just W Wᵀ + σ²I, where σ² is now a scalar, i.e.,

    x ∼ N(b, W Wᵀ + σ²I)

or equivalently

    x = W h + b + σz,

where z ∼ N(0, I) is white noise. Tipping and Bishop (1999) then show an iterative EM algorithm for estimating the parameters W and σ².

What the probabilistic PCA model is basically saying is that the covariance is mostly captured by the latent variables h, up to some small residual reconstruction error σ². As shown by Tipping and Bishop (1999), probabilistic PCA becomes PCA as σ → 0. In that case, the conditional expected value of h given x becomes an orthogonal projection onto the space spanned by the d columns of W, like in PCA. See Section 17.1 for a discussion of the inference mechanism associated with PCA (probabilistic or not), i.e., recovering the expected value of the latent factors h given the observed input x. That section also explains the very insightful geometric and manifold interpretation of PCA.

However, as σ → 0, the density model becomes very sharp around these d dimensions spanned by the columns of W, as discussed in Section 17.1, which would not make it a very faithful model of the data in general (not just because the data may live on a higher-dimensional manifold, but more importantly because the real data manifold may not be a flat hyperplane - see Chapter 17 for more).
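The following sketch (an illustration under the model's own assumptions, not code from the text) generates data from the probabilistic PCA model above and recovers W and σ² from the eigendecomposition of the sample covariance, which per Tipping and Bishop (1999) gives the maximum-likelihood solution; sizes and parameter values are arbitrary.

```python
import numpy as np

rng = np.random.RandomState(0)
d, q, sigma_true = 6, 2, 0.1             # data dim, latent dim, true noise std

# Generate data from the probabilistic PCA model: x = W h + b + sigma * z.
W_true = rng.randn(d, q)
b = rng.randn(d)
H = rng.randn(5000, q)
X = H @ W_true.T + b + sigma_true * rng.randn(5000, d)

# Maximum-likelihood estimates from the sample covariance: sigma^2 is the
# average discarded eigenvalue, and W spans the top-q eigenvectors, rescaled.
S = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(S)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]     # descending order
sigma2_ml = eigvals[q:].mean()
W_ml = eigvecs[:, :q] * np.sqrt(eigvals[:q] - sigma2_ml)

print("true sigma^2:", sigma_true ** 2, " ML estimate:", sigma2_ml)
# W is identifiable only up to a rotation, so compare the implied covariances.
print("max |W W^T difference|:",
      np.abs(W_ml @ W_ml.T - W_true @ W_true.T).max())
```

The rotational ambiguity visible in the last comparison is exactly the identifiability issue that motivates the non-Gaussian prior of ICA, discussed next.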
15.6.1 ICA

Independent Component Analysis (ICA) is among the oldest representation learning algorithms (Herault and Ans, 1984; Jutten and Herault, 1991; Comon, 1994; Hyvärinen, 1999; Hyvärinen et al., 2001). It is an approach to modeling linear factors that seeks non-Gaussian projections of the data. Like probabilistic PCA and factor analysis, it also fits the linear factor model of Eqs. 15.2 and 15.3. What is particular about ICA is that unlike PCA and factor analysis it does not assume that the latent variable prior is Gaussian. It only assumes that it is factorized, i.e.,

    P(h) = Π_i P(h_i).      (15.4)

Since there is no parametric assumption behind the prior, we are really in front of a so-called semi-parametric model, with parts of the model being parametric (the linear mapping from h to x) and parts left essentially unconstrained (the prior on h).

To see why a Gaussian prior would not suffice, note that if h = Uz with U an orthonormal (rotation) square matrix, i.e., z = Uᵀh, then, although h might have a Normal(0, I) distribution, the z also have unit covariance, i.e., they are uncorrelated:

    Var[z] = E[z zᵀ] = E[Uᵀ h hᵀ U] = Uᵀ Var[h] U = Uᵀ U = I.

In other words, imposing independence among Gaussian factors does not allow one to disentangle them, and we could as well recover any linear rotation of these factors. It means that, given the observed x, even though we might assume the right generative model, PCA cannot recover the original generative factors. However, if we assume that the latent variables are non-Gaussian, then we can recover them, and this is what ICA is trying to achieve. In fact, under these generative model assumptions, the true underlying factors can be recovered (Comon, 1994). Indeed, many ICA algorithms are looking for projections of the data s = Vx such that they are maximally non-Gaussian. An intuitive explanation for these approaches is that although the true latent variables h may be non-Gaussian, almost any linear combination of them will look more Gaussian, because of the central limit theorem. Since linear combinations of the x_i's are also linear combinations of the h_j's, to recover the h_j's we just need to find the linear combinations that are maximally non-Gaussian (while keeping these different projections orthogonal to each other).

There is an interesting connection between ICA and sparsity, since the dominant form of non-Gaussianity in real data is due to sparsity, i.e., concentration of probability at or near 0. Non-Gaussian distributions typically have more mass around zero, although one can also get non-Gaussianity by increasing skewness, asymmetry, or kurtosis.

Just as PCA can be generalized to the non-linear auto-encoders described later in this chapter, ICA can be generalized to a non-linear generative model, e.g., x = f(h) + noise. See Hyvärinen and Pajunen (1999) for the initial work on non-linear ICA and its successful use with ensemble learning by Roberts and Everson (2001) and Lappalainen et al. (2000).
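The following sketch (illustrative only) checks the central-limit intuition described above: independent, sparse (Laplacian) sources have large excess kurtosis, while a random linear mixture of them looks much more Gaussian (excess kurtosis closer to 0). The sizes, the Laplace choice and the use of kurtosis as the non-Gaussianity measure are assumptions of the example.

```python
import numpy as np

rng = np.random.RandomState(0)
n_sources, n_samples = 4, 100000

def excess_kurtosis(s):
    """E[s^4]/E[s^2]^2 - 3: zero for a Gaussian, 3 for a Laplace distribution."""
    s = s - s.mean()
    return np.mean(s ** 4) / np.mean(s ** 2) ** 2 - 3.0

# Independent sparse sources (a factorized, non-Gaussian prior).
H = rng.laplace(size=(n_samples, n_sources))

# Observed data: a random linear mixture x = W h.
W = rng.randn(n_sources, n_sources)
X = H @ W.T

print("excess kurtosis of one source      :", excess_kurtosis(H[:, 0]))
print("excess kurtosis of one mixed signal:", excess_kurtosis(X[:, 0]))
# ICA looks for projections s = V x that make this kurtosis large again.
```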
15.6.2 Sparse Coding as a Generative Model

Sparse coding can be viewed as a linear factor model whose prior on the latent factors strongly favors values at or near zero. The factorized Laplace prior is

    P(h) = Π_i P(h_i) = Π_i (λ/2) e^{−λ|h_i|},      (15.5)

and the factorized Student-t prior is

    P(h) = Π_i P(h_i) ∝ Π_i 1 / (1 + h_i²/ν)^{(ν+1)/2}.      (15.6)

Both of these densities have a strong preference for near-zero values but, unlike the Gaussian, have heavy enough tails to accommodate occasional large values.
15.7 Reconstruction Error as Log-Likelihood

Although traditional auto-encoders (like traditional neural networks) were introduced with an associated training loss, just like for neural networks, that training loss can generally be given a probabilistic interpretation as a conditional log-likelihood of the original input x, given the representation h.

We have already covered negative log-likelihood as a loss function for general feedforward neural networks in Section 6.3.2. Like prediction error for regular feedforward neural networks, reconstruction error for auto-encoders does not have to be squared error.¹ When we view the loss as negative log-likelihood, we interpret the reconstruction error as

    L = − log P(x | h),

where h is the representation, which may generally be obtained through an encoder taking x as input.

¹ For example, squared error corresponds to choosing a factorized Gaussian reconstruction distribution with mean g(f(x)), and cross-entropy if we choose a factorized Bernoulli reconstruction distribution.
15.8 Sparse Representations

Sparse auto-encoders are auto-encoders which learn a sparse representation, i.e., one whose elements are often either zero or close to zero. Sparse coding was introduced in Section 15.6.2 as a linear factor model in which the prior P(h) on the representation h = f(x) encourages values at or near 0. In Section 15.8.1, we see how ordinary auto-encoders can be prevented from learning a useless identity transformation by using a sparsity penalty rather than a bottleneck. The main difference between a sparse auto-encoder and sparse coding is that sparse coding has no explicit parametric encoder, whereas sparse auto-encoders have one. The encoder of sparse coding is the algorithm that performs the approximate inference, i.e., looks for

    h*(x) = arg max_h log P(h | x) = arg min_h ||x − (b + W h)||² / (2σ²) − log P(h),      (15.7)

where σ² is a reconstruction variance parameter (which should equal the average squared reconstruction error¹) and P(h) is a sparse prior that puts more probability mass around h = 0, such as the Laplacian prior, with factorized marginals

    P(h_i) = (λ/2) e^{−λ|h_i|},      (15.8)

or the Student-t prior, with factorized marginals

    P(h_i) ∝ 1 / (1 + h_i²/ν)^{(ν+1)/2}.      (15.9)

The advantages of such a non-parametric encoder and the sparse coding approach over sparse auto-encoders are that

1. it can in principle minimize the combination of reconstruction error and log-prior better than any parametric encoder, and

2. it performs what is called explaining away (see Figure 13.8), i.e., it allows one to choose some explanations (hidden factors) while inhibiting the others.

The disadvantages are that

1. the computing time for encoding a given input x, i.e., performing inference (computing the representation h that goes with the given x), can be substantially larger than with a parametric encoder (because an optimization must be performed for each example x), and

2. the resulting encoder function could be non-smooth and possibly too non-linear (with two nearby x's being associated with very different h's), potentially making it more difficult for the downstream layers to generalize properly.

In Section 15.8.2, we describe PSD (Predictive Sparse Decomposition), which combines a non-parametric encoder (as in sparse coding, with the representation obtained via an optimization) and a parametric encoder (like in the sparse auto-encoder). Section 15.9 introduces the Denoising Auto-Encoder (DAE), which puts pressure on the representation by requiring it to extract information about the underlying distribution and where it concentrates, so as to be able to denoise a corrupted input. Section 15.10 describes the Contractive Auto-Encoder (CAE), which optimizes an explicit regularization penalty that aims at making the representation as insensitive as possible to the input, while keeping the information sufficient to reconstruct the training examples.

¹ σ² also controls the strength of the sparsity prior relative to the reconstruction term, but it can be lumped into the regularizer, e.g., into the λ defined in Eq. 15.8.
15.8.1 Sparse Auto-Encoders

A sparse auto-encoder is simply an auto-encoder whose training criterion involves a sparsity penalty Ω(h) in addition to the reconstruction error:

    L = − log P(x | g(h)) + Ω(h).      (15.10)

For instance, with the factorized Laplace prior of Eq. 15.8,

    − log P(h) = Σ_i ( λ|h_i| + log 2 − log λ ) = const + Ω(h),      (15.11)

where the constant term depends only on λ and not on h (we typically ignore it in the training criterion because we consider λ as a hyperparameter rather than a parameter). Similarly (as per Eq. 15.9), the sparsity penalty corresponding to the Student-t prior (Olshausen and Field, 1997) is

    Ω(h) = Σ_i ((ν + 1)/2) log(1 + h_i²/ν),      (15.12)

and the KL-divergence penalty mentioned earlier, with target sparsity level t, is

    Ω(h) = − Σ_i ( t log h_i + (1 − t) log(1 − h_i) ),      (15.13)

where 0 < h_i < 1, usually obtained with h_i = sigmoid(a_i). This is just the cross-entropy between the Bernoulli distribution with probability p = h_i and the target Bernoulli distribution with probability p = t.
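Here is a small sketch (with assumed shapes and hyperparameters) of the penalties in Eqs. 15.11 and 15.13 added to a squared reconstruction error, i.e., the sparse auto-encoder criterion of Eq. 15.10 evaluated on one example; gradient-based training of the parameters is omitted.

```python
import numpy as np

rng = np.random.RandomState(0)
n_in, n_code = 6, 10
lam, t = 0.1, 0.05                        # L1 strength and target sparsity level

W_enc = rng.randn(n_code, n_in) * 0.1
W_dec = rng.randn(n_in, n_code) * 0.1

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

x = rng.randn(n_in)
h = sigmoid(W_enc @ x)                    # code in (0, 1)
r = W_dec @ h                             # reconstruction

recon = np.sum((r - x) ** 2)              # reconstruction error
omega_l1 = lam * np.sum(np.abs(h))        # Eq. 15.11 (Laplace prior, up to const)
omega_kl = -np.sum(t * np.log(h) + (1 - t) * np.log(1 - h))   # Eq. 15.13

print("loss with L1 penalty:", recon + omega_l1)
print("loss with KL penalty:", recon + omega_kl)
```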
One wa
way
y to ac
achieve
hieve actual zer
zeros
os in h for sparse (and denoising) auto-enco
auto-encoders
ders
was in
intro
tro
troduced
duced in Glorot et al. (2011c). The idea is to use a half-rectier (a.k.a.
Oneaswarectier)
y to achieve
os in h for
sparse
(andintroduced
denoising)inauto-enco
simply
oractual
ReLUzer
(Rectied
Linear
Unit,
Glorot etders
al.
was introfor
duced
Glorot et al.
(2011c).
The
idea in
is to
useand
a half-rectier
(a.k.a.
(2011b)
deepinsupervised
netw
networks
orks and
earlier
Nair
Hin
Hinton
ton (2010a)
in
simply
as rectier)
(Rectied
Linear Unit, introduced in Glorot et al. (2011b) for deep supervised networks and earlier in Nair and Hinton (2010a) in the context of RBMs) as the output non-linearity of the encoder. With a prior that actually pushes the representations to zero (like the absolute value penalty), one can thus indirectly control the average number of zeros in the representation. ReLUs were first successfully used for deep feedforward networks in Glorot et al. (2011a), achieving for the first time the ability to train fairly deep supervised networks without the need for unsupervised pre-training, and this turned out to be an important component in the 2012 object recognition breakthrough with deep convolutional networks (Krizhevsky et al., 2012b).

Interestingly, the regularizer used in sparse auto-encoders does not conform to the classical interpretation of regularizers as priors on the parameters. That classical interpretation of the regularizer comes from the MAP (Maximum A Posteriori) point estimation (see Section 5.5.1) of parameters, associated with the Bayesian view of parameters as random variables and considering the joint distribution of data x and parameters θ (see Section 5.7):

$$ \arg\max_{\theta} P(\theta \mid x) = \arg\max_{\theta} \big( \log P(x \mid \theta) + \log P(\theta) \big) $$

where the first term on the right is the usual data log-likelihood term and the second term, the log-prior over parameters, incorporates the preference over particular values of θ (for example, a Gaussian prior on the weights yields the familiar L2 weight decay penalty).

With regularized auto-encoders such as sparse auto-encoders and contractive auto-encoders, instead, the regularizer corresponds to a log-prior over the representation, or over latent variables. In the case of sparse auto-encoders, predictive sparse decomposition and contractive auto-encoders, the regularizer specifies a preference over functions of the data, rather than over parameters. This makes such a regularizer data-dependent, unlike the classical parameter log-prior. Specifically, in the case of the sparse auto-encoder, it says that we prefer an encoder whose output produces values closer to 0. Indirectly (when we marginalize over the training distribution), this is also indicating a preference over parameters, of course.
15.8.2 Predictive Sparse Decomposition

TODO: we have too many forward refs to this section. There are 150 lines about PSD in this section and at least 20 lines of forward references to this section in this chapter, some of which are just 100 lines away.

Predictive sparse decomposition (PSD) is a variant that combines sparse coding and a parametric encoder (Kavukcuoglu et al., 2008b), i.e., it has both a parametric encoder and iterative inference. It has been applied to unsupervised feature learning for object recognition in images and video (Kavukcuoglu et al., 2009, 2010b; Jarrett et al., 2009a; Farabet et al., 2011), as well as for audio (Henaff et al., 2011). The representation is considered to be a free variable (possibly a latent variable if we choose a probabilistic interpretation) and the training criterion combines a sparse coding criterion with a term that encourages the optimized sparse representation h (after inference) to be close to the output of the encoder f(x):

$$ L = \arg\min_{h} \; \|x - g(h)\|^2 + \lambda |h|_1 + \gamma \|h - f(x)\|^2 \qquad (15.14) $$

where f is the encoder and g is the decoder. Like in sparse coding, for each example x an iterative optimization is performed in order to obtain a representation h. However, because the iterations can be initialized from the output of the encoder, i.e., with h = f(x), only a few steps (e.g. 10) are necessary to obtain good results. Simple gradient descent on h has been used by the authors. After h is settled, both g and f are updated towards minimizing the above criterion. The first two terms are the same as in L1 sparse coding, while the third one encourages f to predict the outcome of the sparse coding optimization, making it a better choice for the initialization of the iterative optimization. Hence f can be used as a parametric approximation to the non-parametric encoder implicitly defined by sparse coding. It is one of the first instances of learned approximate inference (see also Sec. 19.8). Note that this is different from separately doing sparse coding (i.e., training g) and then training an approximate inference mechanism f, since both the encoder and decoder are trained together to be compatible with each other. Hence the decoder will be learned in such a way that inference will tend to find solutions that can be well approximated by the approximate inference.

TODO: this is probably too much forward reference; when we bring these things in we can remind people that they resemble PSD, but it doesn't really help the reader to say that the thing we are describing now is similar to things they haven't seen yet.

A similar example is the variational auto-encoder, in which the encoder acts as approximate inference for the decoder, and both are trained jointly (Section 20.9.3). See also Section 20.9.4 for a probabilistic interpretation of PSD in terms of a variational lower bound on the log-likelihood.
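As a concrete illustration of the procedure described above, the following is a minimal NumPy sketch of one PSD training step, under assumptions of our own: a linear decoder g, a tanh encoder f, plain (sub)gradient descent for the inner inference on h initialized at f(x), and made-up sizes, step sizes and penalty weights. It is not the original authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
n_in, n_code = 64, 128                              # hypothetical sizes
W_f = rng.normal(scale=0.1, size=(n_code, n_in))    # encoder weights, f(x) = tanh(W_f x)
W_g = rng.normal(scale=0.1, size=(n_in, n_code))    # decoder weights, g(h) = W_g h
lam, gamma = 0.5, 1.0                               # assumed weights of |h|_1 and ||h - f(x)||^2


def f(x):
    return np.tanh(W_f @ x)


def g(h):
    return W_g @ h


def infer_h(x, n_steps=10, lr=0.01):
    """Inner loop of Eq. 15.14: a few (sub)gradient steps on h, initialized at f(x)."""
    fx = f(x)
    h = fx.copy()
    for _ in range(n_steps):
        grad = (-2 * W_g.T @ (x - g(h))     # from ||x - g(h)||^2
                + lam * np.sign(h)          # subgradient of lam * |h|_1
                + 2 * gamma * (h - fx))     # from gamma * ||h - f(x)||^2
        h -= lr * grad
    return h


def psd_step(x, lr=0.001):
    """After h is settled, update decoder and encoder towards minimizing Eq. 15.14."""
    global W_f, W_g
    h = infer_h(x)
    # Decoder gradient: only the reconstruction term depends on W_g.
    W_g -= lr * (-2 * np.outer(x - g(h), h))
    # Encoder gradient: only the prediction term gamma * ||h - f(x)||^2 depends on W_f.
    fx = f(x)
    delta = -2 * gamma * (h - fx) * (1 - fx ** 2)   # back-prop through tanh
    W_f -= lr * np.outer(delta, x)


x = rng.normal(size=n_in)
psd_step(x)
```

After training, f(x) alone can be used as a fast feed-forward approximation to the sparse code, which is the "learned approximate inference" viewpoint mentioned above.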
15.9 Denoising Auto-Encoders

The Denoising Auto-Encoder (DAE) was first proposed (Vincent et al., 2008, 2010) as a means of forcing an auto-encoder to learn to capture the data distribution without an explicit constraint on either the dimension or the sparsity of the learned representation. It was motivated by the idea that in order to fully capture a complex distribution, an auto-encoder needs to have at least as many hidden units as needed by the complexity of that distribution. Hence its dimensionality should not be restricted to the input dimension.

The principle of the denoising auto-encoder is deceptively simple and illustrated in Figure 15.6: the encoder sees as input a corrupted version of the input, but the decoder tries to reconstruct the clean uncorrupted input.

Mathematically, and following the notations used in this chapter, this can be formalized as follows. We introduce a corruption process $C(\tilde{x} \mid x)$ which represents a conditional distribution over corrupted samples $\tilde{x}$, given a data sample x. The auto-encoder then learns a reconstruction distribution $P(x \mid \tilde{x})$ estimated from training pairs $(x, \tilde{x})$, as follows:

1. Sample a training example x from the data generating distribution (the training set).

2. Sample a corrupted version $\tilde{x}$ from the conditional distribution $C(\tilde{x} \mid x)$.

3. Use $(x, \tilde{x})$ as a training example for estimating the auto-encoder reconstruction distribution $P(x \mid \tilde{x}) = P(x \mid g(h))$ with h the output of encoder $f(\tilde{x})$ and g(h) the output of the decoder.

Typically we can simply perform gradient-based approximate minimization (such as minibatch gradient descent) on the negative log-likelihood $-\log P(x \mid h)$, i.e., the denoising reconstruction error, using back-propagation to compute gradients, just like for regular feedforward neural networks (the only difference being the corruption of the input and the choice of target output).

We can view this training objective as performing stochastic gradient descent on the denoising reconstruction error, but where the noise now has two sources:

1. the choice of training sample x from the data set, and

2. the random corruption applied to x to obtain $\tilde{x}$.

We can therefore consider that the DAE is performing stochastic gradient descent on the following expectation:

$$ \mathbb{E}_{x \sim Q(x)} \, \mathbb{E}_{\tilde{x} \sim C(\tilde{x} \mid x)} \big[ -\log P(x \mid g(f(\tilde{x}))) \big] $$

where Q(x) is the training distribution.
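The three sampling steps and the gradient-based minimization described above fit in a few lines. The sketch below is a minimal illustration, not the reference implementation of Vincent et al.: it assumes additive Gaussian corruption for $C(\tilde{x} \mid x)$, a one-hidden-layer auto-encoder with sigmoid code and linear reconstruction, squared error as the negative log-likelihood (i.e., a Gaussian $P(x \mid g(h))$), and made-up sizes and learning rate.

```python
import numpy as np

rng = np.random.default_rng(2)
n_in, n_hid = 50, 200                     # hypothetical sizes
W = rng.normal(scale=0.05, size=(n_hid, n_in))
b = np.zeros(n_hid)
V = rng.normal(scale=0.05, size=(n_in, n_hid))
c = np.zeros(n_in)
sigma_noise = 0.3                         # std of the Gaussian corruption C(x_tilde | x)


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def dae_step(x, lr=0.01):
    """One stochastic step: corrupt, encode the corrupted input, reconstruct the clean one."""
    x_tilde = x + sigma_noise * rng.normal(size=x.shape)   # step 2: sample from C(x_tilde | x)
    h = sigmoid(W @ x_tilde + b)                           # encoder sees the corrupted input
    r = V @ h + c                                          # reconstruction g(h)
    # Squared denoising reconstruction error against the *clean* x.
    err = r - x
    loss = np.sum(err ** 2)
    # Back-propagation, just as for a regular feedforward network.
    grad_V = 2 * np.outer(err, h)
    grad_c = 2 * err
    delta_h = (V.T @ (2 * err)) * h * (1 - h)
    grad_W = np.outer(delta_h, x_tilde)
    grad_b = delta_h
    for param, grad in ((W, grad_W), (b, grad_b), (V, grad_V), (c, grad_c)):
        param -= lr * grad                                 # in-place parameter update
    return loss


x = rng.normal(size=n_in)                 # step 1: a training example from the data set
print(f"denoising loss: {dae_step(x):.3f}")
```

Note that the target of the reconstruction is the clean x while the encoder only ever sees $\tilde{x}$; averaging this loss over draws of x and of the corruption is exactly the expectation displayed above.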
Figure 15.7: A denoising auto-encoder is trained to reconstruct the clean data point x from its corrupted version $\tilde{x}$. In the figure, we illustrate the corruption process $C(\tilde{x} \mid x)$ (by a grey circle of equiprobable corruptions, and a grey arrow for the corruption process) acting on examples x (red crosses) lying near a low-dimensional manifold near which probability concentrates. When the denoising auto-encoder is trained to minimize the average of squared errors $\|g(f(\tilde{x})) - x\|^2$, the reconstruction $g(f(\tilde{x}))$ estimates $\mathbb{E}[x \mid \tilde{x}]$, which approximately points orthogonally towards the manifold, since it estimates the center of mass of the clean points x which could have given rise to $\tilde{x}$. The auto-encoder thus learns a vector field $g(f(x)) - x$ (the green arrows), and it turns out that this vector field estimates the gradient field $\frac{\partial \log Q(x)}{\partial x}$ (up to a multiplicative factor that is the average root mean square reconstruction error), where Q is the unknown data generating distribution.
15.9.1

A first result in this direction was proven by Vincent (2011a), showing that minimizing a squared reconstruction error in a denoising auto-encoder with Gaussian noise was related to score matching (Hyvärinen, 2005a), making the denoising criterion a regularized form of score matching called denoising score matching (Kingma and LeCun, 2010a). Score matching is an alternative to maximum likelihood and provides a consistent estimator. It is discussed further in Section 18.4.

More precisely, the main theorem for the denoising version states that

$$ g(f(x)) - x \;\text{ is a consistent estimator of }\; \frac{\partial \log Q(x)}{\partial x} \qquad (15.15) $$

where Q(x) is the data generating distribution, so long as f and g have sufficient capacity to represent the true score (and assuming that the expected training criterion can be minimized, as usual when proving consistency associated with a training objective).

Note that in general, there is no guarantee that the reconstruction g(f(x)) minus the input x corresponds to the gradient of something (the estimated score should be the gradient of the estimated log-density with respect to the input x).
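As a sanity check of this claim, consider the simplest case one can work out in closed form; this worked example is added here for illustration and is not from the original text. If $x \sim \mathcal{N}(0, \sigma^2)$ and $\tilde{x} = x + \epsilon$ with $\epsilon \sim \mathcal{N}(0, \sigma_n^2)$, the reconstruction that minimizes the expected squared denoising error is the conditional mean, and the resulting vector field is proportional to the score of the corrupted density $q(\tilde{x}) = \mathcal{N}(\tilde{x}; 0, \sigma^2 + \sigma_n^2)$, which approaches the score of Q as $\sigma_n \to 0$:

$$ g^{*}(f(\tilde{x})) = \mathbb{E}[x \mid \tilde{x}] = \frac{\sigma^2}{\sigma^2 + \sigma_n^2}\,\tilde{x},
\qquad
g^{*}(f(\tilde{x})) - \tilde{x} = -\frac{\sigma_n^2}{\sigma^2 + \sigma_n^2}\,\tilde{x}
= \sigma_n^2 \, \frac{\partial}{\partial \tilde{x}} \log q(\tilde{x}). $$

Here the estimate is off by a scalar factor (the corruption variance $\sigma_n^2$), in line with the remark in the caption of Figure 15.7 that the learned vector field estimates the gradient field only up to a multiplicative factor.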
15.10 Contractive Auto-Encoders

The Contractive Auto-Encoder or CAE (Rifai et al., 2011a,c) introduces an explicit regularizer on the code h = f(x), encouraging the derivatives of f to be as small as possible:

$$ \Omega(h) = \left\| \frac{\partial f(x)}{\partial x} \right\|_F^2 \qquad (15.16) $$

which is the squared Frobenius norm (sum of squared elements) of the Jacobian matrix of partial derivatives associated with the encoder function. Whereas the denoising auto-encoder learns to contract the reconstruction function (the composition of the encoder and decoder), the CAE learns to specifically contract the encoder. See Figure 17.13 for a view of how contraction near the data points makes the auto-encoder capture the manifold structure.

If it weren't for the opposing force of reconstruction error, which attempts to make the code h keep all the information necessary to reconstruct training examples, the CAE penalty would yield a code h that is constant and does not depend on the input x. The compromise between these two forces yields an auto-encoder whose derivatives are tiny in most directions, except those that are needed to reconstruct training examples, i.e., the directions that are tangent to the manifold near which data concentrate. Indeed, in order to distinguish (and thus, reconstruct correctly) two nearby examples on the manifold, one must assign them a different code, i.e., f(x) must vary as x moves from one to the other, i.e., in the direction of a tangent to the manifold.
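For a one-layer sigmoid encoder, the penalty of Eq. 15.16 has a cheap closed form, since the Jacobian is the weight matrix rescaled row-wise by the derivative of the sigmoid. The sketch below illustrates this with made-up sizes; it is an illustration of the penalty only, not the training code of Rifai et al.

```python
import numpy as np

rng = np.random.default_rng(3)
n_in, n_hid = 20, 50                      # hypothetical sizes
W = rng.normal(scale=0.1, size=(n_hid, n_in))
b = np.zeros(n_hid)


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def cae_penalty(x):
    """Squared Frobenius norm of the encoder Jacobian for h = sigmoid(W x + b)."""
    h = sigmoid(W @ x + b)
    # Jacobian: J = diag(h * (1 - h)) @ W, so ||J||_F^2 = sum_j (h_j (1 - h_j))^2 * ||W_j||^2.
    return np.sum((h * (1 - h)) ** 2 * np.sum(W ** 2, axis=1))


def cae_penalty_check(x, eps=1e-5):
    """Finite-difference computation of the same quantity, for illustration."""
    h0 = sigmoid(W @ x + b)
    J = np.empty((n_hid, n_in))
    for i in range(n_in):
        dx = np.zeros(n_in)
        dx[i] = eps
        J[:, i] = (sigmoid(W @ (x + dx) + b) - h0) / eps
    return np.sum(J ** 2)


x = rng.normal(size=n_in)
print(cae_penalty(x), cae_penalty_check(x))   # the two values should nearly agree
```

In training, this penalty is added (with a weight) to the reconstruction error, and it is the trade-off between the two terms that produces the behaviour described above: near-zero derivatives except along the tangent directions of the data manifold.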
Figure 15.9: Average (over test examples) of the singular value spectrum of the Jacobian matrix for the encoder f learned by a regular auto-encoder (AE) versus a contractive auto-encoder (CAE). This illustrates how the contractive regularizer yields a smaller set of directions in input space (those corresponding to a large singular value of the Jacobian) which provoke a response in the representation h, while the representation remains almost insensitive for most directions of change in the input.

What is interesting is that this penalty forces the representation to be invariant more strongly in directions orthogonal to the manifold. This can be seen clearly by comparing the singular value spectrum of the Jacobian for different auto-encoders, as shown in Figure 15.9. We see that the CAE manages to concentrate the sensitivity of the representation in fewer dimensions than a regular (or sparse) auto-encoder. Figure 17.3 illustrates tangent vectors obtained by a CAE on the MNIST digits dataset, showing that the leading tangent vectors correspond to small deformations such as translation. More impressively, Figure 15.10 shows tangent vectors learned on 32×32 color (RGB) CIFAR-10 images by a CAE,