$\{\psi_C\}_{C \in \mathcal{C}}$, called potential functions, with

$\forall x, \hat{x} \in \Lambda^{|V|}: \; (x_c)_{c \in C} = (\hat{x}_c)_{c \in C} \Rightarrow \psi_C(x) = \psi_C(\hat{x})$   (1)

and

$p(x) = \frac{1}{Z} \prod_{C \in \mathcal{C}} \psi_C(x)$ .   (2)

The normalization constant $Z = \sum_x \prod_{C \in \mathcal{C}} \psi_C(x_C)$ is called the partition function.
If p is strictly positive, the same holds for the potential functions. Thus we can write

$p(x) = \frac{1}{Z} \prod_{C \in \mathcal{C}} \psi_C(x_C) = \frac{1}{Z} e^{\sum_{C \in \mathcal{C}} \ln \psi_C(x_C)} = \frac{1}{Z} e^{-E(x)}$ ,   (3)

where we call $E := -\sum_{C \in \mathcal{C}} \ln \psi_C(x_C)$ the energy function. Thus, the probability distribution of every MRF can be expressed in the form given by (3), which is also called a Gibbs distribution.
2.2 Unsupervised Learning
Unsupervised learning means learning (important aspects of) an unknown distribution q based on sample data. This includes finding new representations of data that foster learning, generalization, and communication. If we assume that the structure of the graphical model is known and the energy function belongs to a known family of functions parameterized by $\theta$, unsupervised learning of a data distribution with an MRF means adjusting the parameters $\theta$. We write $p(x \,|\, \theta)$ when we want to emphasize the dependency of a distribution on its parameters.
We consider training data $S = \{x_1, \dots, x_\ell\}$ of $\ell$ i.i.d. samples and fit the model by maximum likelihood, i.e., by maximizing $\mathcal{L}(\theta \,|\, S) = \prod_{i=1}^{\ell} p(x_i \,|\, \theta)$. Maximizing the likelihood is the same as maximizing the log-likelihood given by

$\ln \mathcal{L}(\theta \,|\, S) = \ln \prod_{i=1}^{\ell} p(x_i \,|\, \theta) = \sum_{i=1}^{\ell} \ln p(x_i \,|\, \theta)$ .   (4)
For the Gibbs distribution of an MRF it is in general not possible to find the maximum likelihood parameters analytically. Thus, numerical approximation methods have to be used, for example gradient ascent, which is described below.
Maximizing the likelihood corresponds to minimizing the distance between the unknown distribution q underlying S and the distribution p of the MRF in terms of the Kullback-Leibler divergence (KL-divergence), which for a finite state space $\Omega$ is given by:

$\mathrm{KL}(q \,\|\, p) = \sum_{x \in \Omega} q(x) \ln \frac{q(x)}{p(x)} = \sum_{x \in \Omega} q(x) \ln q(x) - \sum_{x \in \Omega} q(x) \ln p(x)$   (5)
The KL-divergence is a (non-symmetric) measure of the difference between two distributions. It is non-negative, and zero if and only if the distributions are the same. As becomes clear from equation (5), the KL-divergence decomposes into the negative entropy of q and a second term. Only the latter depends on the parameters subject to optimization. Approximating the expectation over q in this term by the training samples from q results in the log-likelihood. Therefore, maximizing the log-likelihood corresponds to minimizing the KL-divergence.
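To make the decomposition in (5) concrete, here is a minimal NumPy sketch; the two distributions over a four-element state space are made up for illustration:

```python
import numpy as np

q = np.array([0.1, 0.4, 0.4, 0.1])      # distribution underlying the data
p = np.array([0.25, 0.25, 0.25, 0.25])  # model distribution

# KL(q||p) = sum_x q(x) ln(q(x)/p(x)), cf. equation (5)
kl = np.sum(q * np.log(q / p))

# Decomposition into the negative entropy of q and the second term
assert np.isclose(kl, np.sum(q * np.log(q)) - np.sum(q * np.log(p)))
print(kl)  # non-negative, and zero if and only if q equals p
```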
Optimization by Gradient Ascent. If it is not possible to find parameters maximizing the likelihood analytically, the usual way to find them is gradient ascent on the log-likelihood. This corresponds to iteratively updating the parameters $\theta^{(t)}$ to $\theta^{(t+1)}$ based on the gradient of the log-likelihood. Let us consider the following update rule:

$\theta^{(t+1)} = \theta^{(t)} + \underbrace{\eta \, \frac{\partial}{\partial \theta^{(t)}} \left( \sum_{i=1}^{N} \ln \mathcal{L}(\theta^{(t)} \,|\, x_i) \right) - \lambda \theta^{(t)} + \nu \Delta\theta^{(t-1)}}_{:= \Delta\theta^{(t)}}$   (6)
If the constants $\lambda \in \mathbb{R}_0^+$ and $\nu \in \mathbb{R}_0^+$ are set to zero, we have vanilla gradient ascent. The constant $\eta \in \mathbb{R}^+$ is the learning rate. As we will see later, it can be desirable to strive for models with weights having small absolute values. To achieve this, we can optimize an objective function consisting of the log-likelihood minus half of the norm of the parameters $\|\theta\|^2/2$ weighted by $\lambda$. This method, called weight decay, penalizes weights with large magnitude. It leads to the $-\lambda\theta^{(t)}$ term in our update rule (6). In a Bayesian framework, weight decay
can be interpreted as assuming a zero-mean Gaussian prior on the parameters.
The update rule can be further extended by a momentum term $\nu \Delta\theta^{(t-1)}$, weighted by the parameter $\nu$. Using a momentum term helps against oscillations in the iterative update procedure and can speed up the learning process, as known from feed-forward neural network training [31].
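As an illustration of update rule (6), the following sketch performs gradient ascent with weight decay and momentum on a made-up one-dimensional objective; the function name and hyperparameter values are ours, not from the text:

```python
import numpy as np

def ascent_step(theta, grad, delta_prev, eta=0.1, lam=0.01, nu=0.9):
    """One application of update rule (6): learning rate eta times the
    log-likelihood gradient, minus weight decay lam * theta, plus the
    momentum term nu * (previous update)."""
    delta = eta * grad - lam * theta + nu * delta_prev
    return theta + delta, delta

# Toy usage: maximize f(theta) = -theta^2, whose gradient is -2 * theta
theta, delta = np.array([2.0]), np.array([0.0])
for _ in range(100):
    theta, delta = ascent_step(theta, -2.0 * theta, delta)
print(theta)  # approaches the maximizer 0
```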
Introducing Latent Variables. Suppose we want to model an m-dimensional unknown probability distribution q (e.g., each component of a sample corresponds to one of m pixels of an image). Typically, not all variables $X = (X_v)_{v \in V}$ in an MRF need to correspond to some observed component, and the number of nodes is larger than m. We split X into visible (or observed) variables $V = (V_1, \dots, V_m)$ corresponding to the components of the observations and latent (or hidden) variables $H = (H_1, \dots, H_n)$ given by the remaining $n = |V| - m$ variables. Using latent variables allows one to describe complex distributions over the visible variables by means of simple (conditional) distributions. In this case the Gibbs distribution of an MRF describes the joint probability distribution of $(V, H)$, and one is usually interested in the marginal distribution of V, which is given by
$p(v) = \sum_{h} p(v, h) = \frac{1}{Z} \sum_{h} e^{-E(v,h)}$ ,   (7)

where $Z = \sum_{v,h} e^{-E(v,h)}$. While the visible variables correspond to the components of an observation, the latent variables introduce dependencies between the visible variables (e.g., between pixels of an input image).
Log-Likelihood Gradient of MRFs with Latent Variables. Restricted Boltzmann machines are MRFs with hidden variables, and RBM learning algorithms are based on gradient ascent on the log-likelihood. For a model of the form (7) with parameters $\theta$, the log-likelihood given a single training example v is

$\ln \mathcal{L}(\theta \,|\, v) = \ln p(v \,|\, \theta) = \ln \frac{1}{Z} \sum_{h} e^{-E(v,h)} = \ln \sum_{h} e^{-E(v,h)} - \ln \sum_{v,h} e^{-E(v,h)}$   (8)
and for the gradient we get:

$\frac{\partial \ln \mathcal{L}(\theta \,|\, v)}{\partial \theta} = \frac{\partial}{\partial \theta} \left( \ln \sum_{h} e^{-E(v,h)} \right) - \frac{\partial}{\partial \theta} \left( \ln \sum_{v,h} e^{-E(v,h)} \right)$
$= -\frac{1}{\sum_{h} e^{-E(v,h)}} \sum_{h} e^{-E(v,h)} \frac{\partial E(v,h)}{\partial \theta} + \frac{1}{\sum_{v,h} e^{-E(v,h)}} \sum_{v,h} e^{-E(v,h)} \frac{\partial E(v,h)}{\partial \theta}$
$= -\sum_{h} p(h \,|\, v) \frac{\partial E(v,h)}{\partial \theta} + \sum_{v,h} p(v,h) \frac{\partial E(v,h)}{\partial \theta}$   (9)
In the last step we used that the conditional probability can be written in the following way:

$p(h \,|\, v) = \frac{p(v,h)}{p(v)} = \frac{\frac{1}{Z} e^{-E(v,h)}}{\frac{1}{Z} \sum_{h} e^{-E(v,h)}} = \frac{e^{-E(v,h)}}{\sum_{h} e^{-E(v,h)}}$   (10)
Note that the last expression of equality (9) is the difference of two expectations: the expected values of the energy function under the model distribution and under the conditional distribution of the hidden variables given the training example. Directly calculating these sums, which run over all values of the respective variables, leads to a computational complexity which is in general exponential in the number of variables of the MRF. To avoid this computational burden, the expectations can be approximated by samples drawn from the corresponding distributions based on MCMC techniques.
3 Markov Chains and Markov Chain Monte Carlo
Techniques
Markov chains play an important role in RBM training because they provide a
method to draw samples from complex probability distributions like the Gibbs
distribution of an MRF. This section will serve as an introduction to some funda-
mental concepts of Markov chain theory. A detailed introduction can be found,
for example, in [6] and the aforementioned textbooks [5,22]. The section will
describe Gibbs sampling as an MCMC technique often used for MRF training
and in particular for training RBMs.
3.1 Definition of a Markov Chain and Convergence to Stationarity
A Markov chain is a time-discrete stochastic process for which the Markov property holds, that is, a family of random variables $X = \{X^{(k)} \,|\, k \in \mathbb{N}_0\}$ which take values in a (in the following considerations finite) set $\Omega$ and for which $\forall k \geq 0$ and $\forall j, i, i_0, \dots, i_{k-1} \in \Omega$ it holds

$p_{ij}^{(k)} := P\left( X^{(k+1)} = j \,|\, X^{(k)} = i, X^{(k-1)} = i_{k-1}, \dots, X^{(0)} = i_0 \right)$   (11)
$= P\left( X^{(k+1)} = j \,|\, X^{(k)} = i \right)$ .   (12)
This means that the next state of the system depends only on the current state and not on the sequence of events that preceded it. If for all $k \geq 0$ the $p_{ij}^{(k)}$ have the same value $p_{ij}$, the chain is called homogeneous and the matrix $\mathbf{P} = (p_{ij})_{i,j \in \Omega}$ is called the transition matrix of the homogeneous Markov chain.
If the starting distribution $\mu^{(0)}$ (i.e., the probability distribution of $X^{(0)}$) is given by the probability vector $\mu^{(0)} = (\mu^{(0)}(i))_{i \in \Omega}$, with $\mu^{(0)}(i) = P(X^{(0)} = i)$, the distribution $\mu^{(k)}$ of $X^{(k)}$ is given by $\mu^{(k)T} = \mu^{(0)T} \mathbf{P}^k$.
A distribution $\pi$ for which it holds $\pi^T = \pi^T \mathbf{P}$ is called a stationary distribution. If the Markov chain at any time k reaches the stationary distribution $\mu^{(k)} = \pi$, all subsequent states will be distributed accordingly, that is, $\mu^{(k+n)} = \pi$ for
all $n \in \mathbb{N}$. A sufficient (but not necessary) condition for a distribution $\pi$ to be stationary w.r.t. a Markov chain described by the transition probabilities $p_{ij}$, $i, j \in \Omega$ is that $\forall i, j \in \Omega$ it holds:

$\pi(i) \, p_{ij} = \pi(j) \, p_{ji}$ .   (13)

This is called the detailed balance condition.
Especially relevant are Markov chains for which it is known that there exists a unique stationary distribution. For finite $\Omega$ this is the case if the Markov chain is irreducible. A Markov chain is irreducible if one can get from any state in $\Omega$ to any other in a finite number of transitions, or, more formally, $\forall i, j \in \Omega \; \exists k > 0$ with $P(X^{(k)} = j \,|\, X^{(0)} = i) > 0$.
A chain is called aperiodic if for all $i \in \Omega$ the greatest common divisor of $\{k \,|\, P(X^{(k)} = i \,|\, X^{(0)} = i) > 0 \wedge k \in \mathbb{N}_0\}$ is 1. One can show that an irreducible and aperiodic Markov chain on a finite state space is guaranteed to converge to its stationary distribution (see, e.g., [6]). That is, for an arbitrary starting distribution $\mu$ it holds

$\lim_{k \to \infty} d_V(\mu^T \mathbf{P}^k, \pi^T) = 0$ ,   (14)
where $d_V$ is the distance of variation. For two distributions $\alpha$ and $\beta$ on a finite state space $\Omega$, the distance of variation is defined as

$d_V(\alpha, \beta) = \frac{1}{2} \|\alpha - \beta\|_1 = \frac{1}{2} \sum_{x \in \Omega} |\alpha(x) - \beta(x)|$ .   (15)

To ease the notation, we allow both row and column probability vectors as arguments of the functions in (15).
Markov chain Monte Carlo methods make use of this convergence theorem for producing samples from certain probability distributions by setting up a Markov chain that converges to the desired distribution. Suppose you want to sample from a distribution q with a finite state space. Then you construct an irreducible and aperiodic Markov chain with stationary distribution $\pi = q$. This is a non-trivial task. If t is large enough, a sample $X^{(t)}$ from the constructed chain is then approximately a sample from $\pi$ and therefore from q. Gibbs sampling [13] is such an MCMC method and will be described in the following section.
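The convergence statement (14) can be observed numerically. The sketch below uses an arbitrary strictly positive 3-state transition matrix (our choice, purely illustrative), obtains the stationary distribution as the left eigenvector of P for eigenvalue 1, and watches the distance of variation (15) shrink:

```python
import numpy as np

# An arbitrary irreducible, aperiodic Markov chain on three states
P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])

# Stationary distribution pi: left eigenvector of P with eigenvalue 1
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmax(np.real(evals))])
pi /= pi.sum()

mu = np.array([1.0, 0.0, 0.0])  # arbitrary starting distribution
for k in range(20):
    print(k, 0.5 * np.abs(mu - pi).sum())  # d_V from (15), decays to 0
    mu = mu @ P
```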
3.2 Gibbs Sampling
Gibbs Sampling belongs to the class of Metropolis-Hastings algorithms [14]. It is
a simple MCMC algorithm for producing samples from the joint probability dis-
tribution of multiple random variables. The basic idea is to update each variable
subsequently based on its conditional distribution given the state of the others.
We will describe it in detail by explaining how Gibbs sampling can be used to
simulate the Gibbs distribution of an MRF.
We consider an MRF $X = (X_1, \dots, X_N)$ w.r.t. a graph $G = (V, E)$, where $V = \{1, \dots, N\}$ for the sake of clearness of notation. The random variables $X_i$, $i \in V$ take values in a finite set $\Lambda$, and $\pi(x) = \frac{1}{Z} e^{-E(x)}$ is the joint probability distribution of X. Furthermore, if we assume that the MRF changes its state during time, we can consider $X = \{X^{(k)} \,|\, k \in \mathbb{N}_0\}$ as a Markov chain taking values in $\Omega = \Lambda^N$, where $X^{(k)} = (X_1^{(k)}, \dots, X_N^{(k)})$ describes the state of the MRF at time $k \geq 0$. At each transition step we now pick a random variable $X_i$, $i \in V$ with a probability $q(i)$ given by a strictly positive probability distribution q on V and sample a new value for $X_i$ based on its conditional probability distribution given the state $(x_v)_{v \in V \setminus \{i\}}$ of all other variables $(X_v)_{v \in V \setminus \{i\}}$, i.e., based on $\pi(X_i \,|\, (x_v)_{v \in V \setminus \{i\}}) = \pi(X_i \,|\, (x_w)_{w \in N_i})$, where $N_i$ denotes the neighborhood of node i. Therefore, the transition probability $p_{xy}$ for two states x, y of the MRF X with $x \neq y$ is given by:

$p_{xy} = \begin{cases} q(i) \, \pi\left( y_i \,|\, (x_v)_{v \in V \setminus \{i\}} \right), & \text{if } \exists i \in V \text{ so that } \forall v \in V \text{ with } v \neq i: x_v = y_v \\ 0, & \text{else.} \end{cases}$   (16)
And the probability that the state of the MRF x stays the same is given by:

$p_{xx} = \sum_{i \in V} q(i) \, \pi\left( x_i \,|\, (x_v)_{v \in V \setminus \{i\}} \right)$ .   (17)
It is easy to see that the joint distribution $\pi$ of the MRF is the stationary distribution of the Markov chain defined by these transition probabilities, by showing that the detailed balance condition (13) holds: For $x = y$ this follows directly. If x and y differ in the value of more than one random variable, it follows from the fact that $p_{yx} = p_{xy} = 0$. Assume that x and y differ only in the state of exactly one variable $X_i$, i.e., $y_j = x_j$ for $j \neq i$ and $y_i \neq x_i$; then it holds:
$\pi(x) p_{xy} = \pi(x) \, q(i) \, \pi\left( y_i \,|\, (x_v)_{v \in V \setminus \{i\}} \right)$
$= \pi\left( x_i, (x_v)_{v \in V \setminus \{i\}} \right) q(i) \, \frac{\pi\left( y_i, (x_v)_{v \in V \setminus \{i\}} \right)}{\pi\left( (x_v)_{v \in V \setminus \{i\}} \right)}$
$= \pi\left( y_i, (x_v)_{v \in V \setminus \{i\}} \right) q(i) \, \frac{\pi\left( x_i, (x_v)_{v \in V \setminus \{i\}} \right)}{\pi\left( (x_v)_{v \in V \setminus \{i\}} \right)}$
$= \pi(y) \, q(i) \, \pi\left( x_i \,|\, (x_v)_{v \in V \setminus \{i\}} \right) = \pi(y) p_{yx}$ .   (18)
Since $\pi$ is strictly positive, so are the conditional probability distributions of the single variables. Thus, it follows that every single variable $X_i$ can take every state $x_i \in \Lambda$ in a single transition step, and thus every state of the whole MRF can reach any other in $\Omega = \Lambda^N$ in a finite number of steps, and the Markov chain is irreducible. Furthermore, it follows from the positivity of the conditional distributions that $p_{xx} > 0$ for all $x \in \Omega$ and thus that the Markov chain is aperiodic. Aperiodicity and irreducibility guarantee that the chain converges to the stationary distribution $\pi$.
In practice the single random variables to be updated are usually not chosen at random based on a distribution q but subsequently in a fixed predefined order. The corresponding algorithm is often referred to as the periodic Gibbs sampler. If $\mathbf{P}$ is the transition matrix of the Gibbs chain, the convergence rate of the periodic Gibbs sampler to the stationary distribution $\pi$ of the MRF is bounded by the following inequality (see for example [6]):

$\frac{1}{2} \left\| \mu \mathbf{P}^k - \pi \right\| \leq \frac{1}{2} \left\| \mu - \pi \right\| \left( 1 - e^{-N\Delta} \right)^k$ ,   (19)

where $\Delta = \sup_{l \in V} \delta_l$ and $\delta_l = \sup\{ |E(x) - E(y)|; \; x_i = y_i \, \forall i \in V \text{ with } i \neq l \}$. Here $\mu$ is an arbitrary starting distribution and $\frac{1}{2}\|\mu - \pi\|$ is the distance in variation as defined in (15).
4 Restricted Boltzmann Machines
An RBM (also denoted as a Harmonium [34]) is an MRF associated with a bipartite undirected graph as shown in Fig. 1. It consists of m visible units $V = (V_1, \dots, V_m)$ to represent observable data and n hidden units $H = (H_1, \dots, H_n)$ to capture dependencies between observed variables. In binary RBMs, our focus in this tutorial, the random variables $(V, H)$ take values $(v, h) \in \{0,1\}^{m+n}$ and the joint probability distribution under the model is given by the Gibbs distribution $p(v, h) = \frac{1}{Z} e^{-E(v,h)}$ with the energy function

$E(v, h) = -\sum_{i=1}^{n} \sum_{j=1}^{m} w_{ij} h_i v_j - \sum_{j=1}^{m} b_j v_j - \sum_{i=1}^{n} c_i h_i$ .   (20)
For all $i \in \{1, \dots, n\}$ and $j \in \{1, \dots, m\}$, $w_{ij}$ is a real-valued weight associated with the edge between the units $V_j$ and $H_i$, and $b_j$ and $c_i$ are real-valued bias terms associated with the jth visible and the ith hidden variable, respectively.
Fig. 1. The undirected graph of an RBM with n hidden and m visible variables
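For concreteness, a direct transcription of the energy function (20) might look as follows; the model sizes and parameter values are placeholders of ours:

```python
import numpy as np

def energy(v, h, W, b, c):
    """Energy (20) of a binary RBM: E(v,h) = -h^T W v - b^T v - c^T h.
    W has shape (n, m): hidden index i first, visible index j second."""
    return -h @ W @ v - b @ v - c @ h

# Placeholder model with m = 4 visible and n = 3 hidden units
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(3, 4))
b, c = np.zeros(4), np.zeros(3)
v = rng.integers(0, 2, size=4)
h = rng.integers(0, 2, size=3)
print(energy(v, h, W, b, c))
```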
The graph of an RBM has only connections between the layer of hidden and visible variables, but not between two variables of the same layer. In terms of probability this means that the hidden variables are independent given the state of the visible variables and vice versa:

$p(h \,|\, v) = \prod_{i=1}^{n} p(h_i \,|\, v) \quad \text{and} \quad p(v \,|\, h) = \prod_{j=1}^{m} p(v_j \,|\, h)$ .   (21)
The absence of connections between hidden variables makes the marginal distribution (7) of the visible variables easy to calculate:

$p(v) = \sum_{h} p(v, h) = \frac{1}{Z} \sum_{h} e^{-E(v,h)}$
$= \frac{1}{Z} \sum_{h_1} \sum_{h_2} \cdots \sum_{h_n} e^{\sum_{j=1}^{m} b_j v_j} \prod_{i=1}^{n} e^{h_i \left( c_i + \sum_{j=1}^{m} w_{ij} v_j \right)}$
$= \frac{1}{Z} e^{\sum_{j=1}^{m} b_j v_j} \sum_{h_1} e^{h_1 \left( c_1 + \sum_{j=1}^{m} w_{1j} v_j \right)} \sum_{h_2} e^{h_2 \left( c_2 + \sum_{j=1}^{m} w_{2j} v_j \right)} \cdots \sum_{h_n} e^{h_n \left( c_n + \sum_{j=1}^{m} w_{nj} v_j \right)}$
$= \frac{1}{Z} e^{\sum_{j=1}^{m} b_j v_j} \prod_{i=1}^{n} \sum_{h_i} e^{h_i \left( c_i + \sum_{j=1}^{m} w_{ij} v_j \right)}$
$= \frac{1}{Z} \prod_{j=1}^{m} e^{b_j v_j} \prod_{i=1}^{n} \left( 1 + e^{c_i + \sum_{j=1}^{m} w_{ij} v_j} \right)$   (22)
This equation shows why a (marginalized) RBM can be regarded as a product of
experts model [15,39], in which a number of experts for individual components
of the observations are combined multiplicatively.
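Equation (22) also makes the unnormalized log-probability of a visible vector cheap to evaluate. A sketch, reusing the parameter conventions of the energy example above:

```python
import numpy as np

def log_p_unnormalized(v, W, b, c):
    """log(Z * p(v)) via the product form (22):
    sum_j b_j v_j + sum_i log(1 + exp(c_i + sum_j w_ij v_j)).
    The partition function Z itself remains intractable in general."""
    return b @ v + np.sum(np.log1p(np.exp(c + W @ v)))
```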
Any distribution on $\{0,1\}^m$ can be modeled arbitrarily well by an RBM with m visible and $k + 1$ hidden units, where k denotes the cardinality of the support set of the target distribution, that is, the number of input elements from $\{0,1\}^m$ that have a non-zero probability of being observed [24]. It has been shown recently that even fewer units can be sufficient, depending on the patterns in the support set [30].
The RBM can be interpreted as a stochastic neural network, where nodes and edges correspond to neurons and synaptic connections, respectively. The conditional probability of a single variable being one can be interpreted as the firing rate of a (stochastic) neuron with sigmoid activation function $\sigma(x) = 1/(1 + e^{-x})$, because it holds:

$p(H_i = 1 \,|\, v) = \sigma\left( \sum_{j=1}^{m} w_{ij} v_j + c_i \right)$   (23)

and

$p(V_j = 1 \,|\, h) = \sigma\left( \sum_{i=1}^{n} w_{ij} h_i + b_j \right)$ .   (24)
To see this, let $v_{-l}$ denote the state of all visible units except the lth one, and let us define

$\alpha_l(h) := -\sum_{i=1}^{n} w_{il} h_i - b_l$   (25)

and

$\beta(v_{-l}, h) := -\sum_{i=1}^{n} \sum_{j=1, j \neq l}^{m} w_{ij} h_i v_j - \sum_{j=1, j \neq l}^{m} b_j v_j - \sum_{i=1}^{n} c_i h_i$ .   (26)

Then $E(v, h) = \beta(v_{-l}, h) + v_l \alpha_l(h)$, where $v_l \alpha_l(h)$ collects all terms involving $v_l$, and we can write [2]:
$p(V_l = 1 \,|\, h) = p(V_l = 1 \,|\, v_{-l}, h) = \frac{p(V_l = 1, v_{-l}, h)}{p(v_{-l}, h)}$
$= \frac{e^{-E(v_l = 1, v_{-l}, h)}}{e^{-E(v_l = 1, v_{-l}, h)} + e^{-E(v_l = 0, v_{-l}, h)}}$
$= \frac{e^{-\beta(v_{-l}, h) - 1 \cdot \alpha_l(h)}}{e^{-\beta(v_{-l}, h) - 1 \cdot \alpha_l(h)} + e^{-\beta(v_{-l}, h) - 0 \cdot \alpha_l(h)}}$
$= \frac{e^{-\beta(v_{-l}, h)} \, e^{-\alpha_l(h)}}{e^{-\beta(v_{-l}, h)} \, e^{-\alpha_l(h)} + e^{-\beta(v_{-l}, h)}}$
$= \frac{e^{-\alpha_l(h)}}{e^{-\alpha_l(h)} + 1} = \frac{1}{1 + e^{\alpha_l(h)}}$
$= \sigma(-\alpha_l(h)) = \sigma\left( \sum_{i=1}^{n} w_{il} h_i + b_l \right)$   (27)
The independence between the variables in one layer makes Gibbs sampling especially easy: Instead of sampling new values for all variables subsequently, the states of all variables in one layer can be sampled jointly. Thus, Gibbs sampling can be performed in just two substeps: sampling a new state h for the hidden neurons based on $p(h \,|\, v)$ and sampling a state v for the visible layer based on $p(v \,|\, h)$. This is also referred to as block Gibbs sampling.
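A block Gibbs step for a binary RBM thus reduces to two vectorized sampling operations. A minimal sketch based on the conditionals (23) and (24):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_step(v, W, b, c, rng):
    """One block Gibbs step: sample all h_i jointly from p(h|v), cf. (23),
    then all v_j jointly from p(v|h), cf. (24), using factorization (21)."""
    h = (rng.random(W.shape[0]) < sigmoid(W @ v + c)).astype(int)
    v = (rng.random(W.shape[1]) < sigmoid(W.T @ h + b)).astype(int)
    return v, h
```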
As mentioned in the introduction, an RBM can be reinterpreted as a standard feed-forward neural network with one layer of non-linear processing units. From this perspective the RBM is viewed as a deterministic function $\{0,1\}^m \to \mathbb{R}^n$ that maps an input $v \in \{0,1\}^m$ to $y \in \mathbb{R}^n$ with $y_i = p(H_i = 1 \,|\, v)$. That is, an observation is mapped to the expected value of the hidden neurons given the observation.
4.1 The Gradient of the Log-Likelihood
The log-likelihood gradient of an MRF can be written as the sum of two expectations, see (9). For RBMs the first term of (9) (i.e., the expectation of the energy gradient under the conditional distribution of the hidden variables given a training sample v) can be computed efficiently because it factorizes nicely. For example, w.r.t. the parameter $w_{ij}$ we get:
$\sum_{h} p(h \,|\, v) \frac{\partial E(v, h)}{\partial w_{ij}} = -\sum_{h} p(h \,|\, v) \, h_i v_j = -\sum_{h} \prod_{k=1}^{n} p(h_k \,|\, v) \, h_i v_j$
$= -\sum_{h_i} \sum_{h_{-i}} p(h_i \,|\, v) \, p(h_{-i} \,|\, v) \, h_i v_j = -\sum_{h_i} p(h_i \,|\, v) \, h_i v_j \underbrace{\sum_{h_{-i}} p(h_{-i} \,|\, v)}_{=1}$
$= -p(H_i = 1 \,|\, v) \, v_j = -\sigma\left( \sum_{j=1}^{m} w_{ij} v_j + c_i \right) v_j$   (28)
Since the second term in (9) can also be written as $\sum_{v} p(v) \sum_{h} p(h \,|\, v) \frac{\partial E(v,h)}{\partial \theta}$ or $\sum_{h} p(h) \sum_{v} p(v \,|\, h) \frac{\partial E(v,h)}{\partial \theta}$, we get for the derivative w.r.t. $w_{ij}$:

$-\sum_{h} p(h \,|\, v) \frac{\partial E(v, h)}{\partial w_{ij}} + \sum_{v,h} p(v, h) \frac{\partial E(v, h)}{\partial w_{ij}}$
$= \sum_{h} p(h \,|\, v) \, h_i v_j - \sum_{v} p(v) \sum_{h} p(h \,|\, v) \, h_i v_j$
$= p(H_i = 1 \,|\, v) \, v_j - \sum_{v} p(v) \, p(H_i = 1 \,|\, v) \, v_j$ .   (29)
For the mean of this derivative over a training set $S = \{v_1, \dots, v_\ell\}$, often the following notations are used:

$\frac{1}{\ell} \sum_{v \in S} \frac{\partial \ln \mathcal{L}(\theta \,|\, v)}{\partial w_{ij}} = \frac{1}{\ell} \sum_{v \in S} \left[ -\mathbb{E}_{p(h \,|\, v)}\left[ \frac{\partial E(v, h)}{\partial w_{ij}} \right] + \mathbb{E}_{p(h, v)}\left[ \frac{\partial E(v, h)}{\partial w_{ij}} \right] \right]$
$= \frac{1}{\ell} \sum_{v \in S} \left[ \mathbb{E}_{p(h \,|\, v)}[h_i v_j] - \mathbb{E}_{p(h, v)}[h_i v_j] \right] = \langle h_i v_j \rangle_{p(h \,|\, v) q(v)} - \langle h_i v_j \rangle_{p(h, v)}$   (30)

with q denoting the empirical distribution. This gives the often stated rule:

$\sum_{v \in S} \frac{\partial \ln \mathcal{L}(\theta \,|\, v)}{\partial w_{ij}} \propto \langle h_i v_j \rangle_{\text{data}} - \langle h_i v_j \rangle_{\text{model}}$   (31)
Analogously to (29) we get the derivatives w.r.t. the bias parameter $b_j$ of the jth visible variable

$\frac{\partial \ln \mathcal{L}(\theta \,|\, v)}{\partial b_j} = v_j - \sum_{v} p(v) \, v_j$   (32)
and w.r.t. the bias parameter $c_i$ of the ith hidden variable

$\frac{\partial \ln \mathcal{L}(\theta \,|\, v)}{\partial c_i} = p(H_i = 1 \,|\, v) - \sum_{v} p(v) \, p(H_i = 1 \,|\, v)$ .   (33)
To avoid the exponential complexity of summing over all values of the visible variables (or over all values of the hidden ones, if one decides to factorize over the visible variables beforehand) when calculating the second term of the log-likelihood gradient, i.e., the second terms of (29), (32), and (33), one can approximate this expectation by samples from the model distribution. These samples can be obtained by Gibbs sampling. This requires running the Markov chain long enough to ensure convergence to stationarity. Since the computational costs of such an MCMC approach are still too large to yield an efficient learning algorithm, common RBM learning techniques, as described in the following section, introduce additional approximations.
5 Approximating the RBM Log-Likelihood Gradient
All common training algorithms for RBMs approximate the log-likelihood gra-
dient given some data and perform gradient ascent on these approximations.
Selected learning algorithms will be described in the following section, starting
with contrastive divergence learning.
5.1 Contrastive Divergence
Obtaining unbiased estimates of the log-likelihood gradient using MCMC methods typically requires many sampling steps. However, recently it was shown that estimates obtained after running the chain for just a few steps can be sufficient for model training [15]. This leads to contrastive divergence (CD) learning, which has become a standard way to train RBMs [15,4,18,3,17].
The idea of k-step contrastive divergence learning (CD-k) is quite simple: Instead of approximating the second term in the log-likelihood gradient by a sample from the RBM distribution (which would require running a Markov chain until the stationary distribution is reached), a Gibbs chain is run for only k steps (usually k = 1). The Gibbs chain is initialized with a training example $v^{(0)}$ of the training set and yields the sample $v^{(k)}$ after k steps. Each step t consists of sampling $h^{(t)}$ from $p(h \,|\, v^{(t)})$ and subsequently sampling $v^{(t+1)}$ from $p(v \,|\, h^{(t)})$. The gradient (equation (9)) of the log-likelihood w.r.t. $\theta$ for one training pattern $v^{(0)}$ is then approximated by
$\mathrm{CD}_k(\theta, v^{(0)}) = -\sum_{h} p(h \,|\, v^{(0)}) \frac{\partial E(v^{(0)}, h)}{\partial \theta} + \sum_{h} p(h \,|\, v^{(k)}) \frac{\partial E(v^{(k)}, h)}{\partial \theta}$ .   (34)

The derivatives in the direction of the single parameters are obtained by estimating the expectations over $p(v)$ in (29), (32) and (33) by the single sample $v^{(k)}$. A batch version of CD-k can be seen in algorithm 1.
Algorithm 1. k-step contrastive divergence
Input: RBM $(V_1, \dots, V_m, H_1, \dots, H_n)$, training batch S
Output: gradient approximation $\Delta w_{ij}$, $\Delta b_j$ and $\Delta c_i$ for $i = 1, \dots, n$, $j = 1, \dots, m$
1  init $\Delta w_{ij} = \Delta b_j = \Delta c_i = 0$ for $i = 1, \dots, n$, $j = 1, \dots, m$
2  forall the $v \in S$ do
3    $v^{(0)} \leftarrow v$
4    for $t = 0, \dots, k - 1$ do
5      for $i = 1, \dots, n$ do sample $h_i^{(t)} \sim p(h_i \,|\, v^{(t)})$
6      for $j = 1, \dots, m$ do sample $v_j^{(t+1)} \sim p(v_j \,|\, h^{(t)})$
7    for $i = 1, \dots, n$, $j = 1, \dots, m$ do
8      $\Delta w_{ij} \leftarrow \Delta w_{ij} + p(H_i = 1 \,|\, v^{(0)}) \cdot v_j^{(0)} - p(H_i = 1 \,|\, v^{(k)}) \cdot v_j^{(k)}$
9      $\Delta b_j \leftarrow \Delta b_j + v_j^{(0)} - v_j^{(k)}$
10     $\Delta c_i \leftarrow \Delta c_i + p(H_i = 1 \,|\, v^{(0)}) - p(H_i = 1 \,|\, v^{(k)})$
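A NumPy transcription of algorithm 1 might look as follows; this is a sketch that reuses the sigmoid and gibbs_step helpers from the block Gibbs example in section 4:

```python
import numpy as np

def cd_k(batch, W, b, c, k, rng):
    """k-step contrastive divergence (algorithm 1): accumulates the
    gradient approximation over a batch of training vectors."""
    dW, db, dc = np.zeros_like(W), np.zeros_like(b), np.zeros_like(c)
    for v0 in batch:
        v = v0.copy()
        for _ in range(k):               # lines 4-6: run the Gibbs chain
            v, h = gibbs_step(v, W, b, c, rng)
        p0 = sigmoid(W @ v0 + c)         # p(H_i = 1 | v^(0))
        pk = sigmoid(W @ v + c)          # p(H_i = 1 | v^(k))
        dW += np.outer(p0, v0) - np.outer(pk, v)  # line 8
        db += v0 - v                              # line 9
        dc += p0 - pk                             # line 10
    return dW, db, dc
```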
Since $v^{(k)}$ is not a sample from the stationary model distribution, the approximation (34) is biased. Obviously, the bias vanishes as $k \to \infty$. That CD is a biased approximation becomes also clear by realizing that it does not maximize the likelihood of the data under the model but the difference of two KL-divergences [15]:

$\mathrm{KL}(q \,\|\, p) - \mathrm{KL}(p_k \,\|\, p)$ ,   (35)

where q is the empirical distribution and $p_k$ is the distribution of the visible variables after k steps of the Markov chain. If the chain already reached stationarity, it holds $p_k = p$, and thus $\mathrm{KL}(p_k \,\|\, p) = 0$ and the approximation error of CD vanishes.
The theoretical results from [3] give a good understanding of the CD approximation and the corresponding bias by showing that the log-likelihood gradient can, based on a Markov chain, be expressed as a sum of terms containing the k-th sample:
Theorem 1 (Bengio and Delalleau [3]). For a converging Gibbs chain $v^{(0)} \Rightarrow h^{(0)} \Rightarrow v^{(1)} \Rightarrow h^{(1)} \Rightarrow \dots$ starting at data point $v^{(0)}$, the log-likelihood gradient can be written as

$\frac{\partial}{\partial \theta} \ln p(v^{(0)}) = -\sum_{h} p(h \,|\, v^{(0)}) \frac{\partial E(v^{(0)}, h)}{\partial \theta} + \mathbb{E}_{p(v^{(k)} \,|\, v^{(0)})}\left[ \sum_{h} p(h \,|\, v^{(k)}) \frac{\partial E(v^{(k)}, h)}{\partial \theta} \right] + \mathbb{E}_{p(v^{(k)} \,|\, v^{(0)})}\left[ \frac{\partial \ln p(v^{(k)})}{\partial \theta} \right]$   (36)

and the final term converges to zero as k goes to infinity.
The first two terms in equation (36) just correspond to the expectation of the CD approximation (under $p_k$), and the bias is given by the final term.
The approximation error does not only depend on the value of k, but also on the rate of convergence, or the mixing rate, of the Gibbs chain. The rate describes how fast the Markov chain approaches the stationary distribution. The mixing rate of the Gibbs chain of an RBM decreases as the magnitude of the model parameters grows [15,7,3,12]. This becomes clear by considering that the conditional probabilities $p(v_j \,|\, h)$ and $p(h_i \,|\, v)$ are given by $\sigma\left( \sum_{i=1}^{n} w_{ij} h_i + b_j \right)$ and $\sigma\left( \sum_{j=1}^{m} w_{ij} v_j + c_i \right)$, respectively. If the absolute values of the parameters are high, the conditional probabilities can get close to one or zero. If this happens, the states get more and more predictable and the Markov chain changes its state slowly. An empirical analysis of the dependency between the size of the bias and the magnitude of the parameters can be found in [3].
An upper bound on the expectation of the CD approximation error under the empirical distribution is given by the following theorem [12]:

Theorem 2 (Fischer and Igel [12]). Let p denote the marginal distribution of the visible units of an RBM and let q be the empirical distribution defined by a set of samples $v_1, \dots, v_\ell$. Then an upper bound on the expectation of the CD-k error for the derivative w.r.t. a single parameter $\theta_a$ is given by

$\mathbb{E}_{q(v^{(0)})}\left[ \left| \mathbb{E}_{p(v^{(k)} \,|\, v^{(0)})}\left[ \frac{\partial \ln p(v^{(k)})}{\partial \theta_a} \right] \right| \right] \leq \frac{1}{2} \|q - p\| \left( 1 - e^{-(m+n)\Delta} \right)^k$   (37)

with

$\Delta = \max\left\{ \max_{l \in \{1, \dots, m\}} \vartheta_l, \; \max_{l \in \{1, \dots, n\}} \xi_l \right\}$ ,

where

$\vartheta_l = \max\left\{ \left| \sum_{i=1}^{n} I_{\{w_{il} > 0\}} w_{il} + b_l \right|, \; \left| \sum_{i=1}^{n} I_{\{w_{il} < 0\}} w_{il} + b_l \right| \right\}$

and

$\xi_l = \max\left\{ \left| \sum_{j=1}^{m} I_{\{w_{lj} > 0\}} w_{lj} + c_l \right|, \; \left| \sum_{j=1}^{m} I_{\{w_{lj} < 0\}} w_{lj} + c_l \right| \right\}$ .
The bound (and probably also the bias) depends on the absolute values of the RBM parameters, on the size of the RBM (the number of variables in the graph), and on the distance in variation between the modeled distribution and the starting distribution of the Gibbs chain.
As a consequence of the approximation error, CD learning does not necessarily lead to a maximum likelihood estimate of the model parameters. Yuille [42] specifies conditions under which CD learning is guaranteed to converge to the maximum likelihood solution, which need not hold for RBM training in general. Examples of energy functions and Markov chains for which CD-1 learning does not converge are given in [27]. The empirical comparisons of the CD approximation and the true gradient, conducted in [7] and [3] for RBMs small enough that the gradient is still tractable, show that the bias can lead to a convergence to parameters that do not reach the maximum likelihood.
The bias, however, can also lead to a distortion of the learning process: After some learning iterations the likelihood can start to diverge (see figure 2) in the sense that the model systematically gets worse if k is not large [11]. This is especially bad because the log-likelihood is not tractable in reasonably sized RBMs, and so the misbehavior cannot be detected and used as a stopping criterion. Because the effect depends on the magnitude of the weights, weight decay can help to prevent it. However, the weight decay parameter $\lambda$, see equation (6), is difficult to tune. If it is too small, weight decay has no effect. If it is too large, the learning converges to models with low likelihood [11].
[Figure 2: two plots of the log-likelihood (y-axis, from -300 to -150) over training iterations (x-axis, 0 to 20000); see the caption below.]
Fig. 2. Evolution of the log-likelihood during batch training of an RBM. In the left plot, CD-k with different values for k (from bottom to top k = 1, 2, 5, 10, 20, 100) was used. In the right plot, we employed parallel tempering (PT, see section 5.3) with different numbers M of temperatures (from bottom to top M = 4, 5, 10, 50). In PT the inverse temperatures were equally distributed in the interval [0, 1], which may not be optimal [9]. The training set was given by a 4 x 4 variant of the Bars-and-Stripes benchmark problem [28]. The learning rate was $\eta = 0.1$ for CD and $\eta = 0.05$ for PT, and neither weight decay nor a momentum term were used ($\lambda = \nu = 0$). Shown are medians over 25 runs.
More recently proposed learning algorithms try to yield better approximations
of the log-likelihood gradient by sampling from Markov chains with increased
mixing rate.
5.2 Persistent Contrastive Divergence
The idea of persistent contrastive divergence (PCD, [36]) is described in [41] for log-likelihood maximization of general MRFs and is applied to RBMs in [36]. The PCD approximation is obtained from the CD approximation (34) by replacing the sample $v^{(k)}$ by a sample from a Gibbs chain that is independent of the sample $v^{(0)}$ from the training distribution. The algorithm corresponds to standard CD learning without reinitializing the visible units of the Markov chain with a training sample each time we want to draw a sample $v^{(k)}$ approximately from the RBM distribution. Instead one keeps "persistent" chains which are run for k Gibbs steps after each parameter update (i.e., the initial state of the current Gibbs chain is equal to $v^{(k)}$ from the previous update step). The fundamental idea underlying PCD is that one can assume that the chains stay close to the stationary distribution if the learning rate is sufficiently small and thus the model changes only slightly between parameter updates [41,36]. The number of persistent chains used for sampling (or the number of samples used to approximate the second term of the gradient (9)) is a hyperparameter of the algorithm. In the canonical form, there exists one Markov chain per training example in a batch.
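The change relative to CD-k is small. A sketch (again relying on the hypothetical gibbs_step and sigmoid helpers introduced earlier) that keeps one persistent chain per training example:

```python
import numpy as np

def pcd_k(batch, chains, W, b, c, k, rng):
    """Persistent CD: the negative phase uses persistent Gibbs chains
    (`chains`, one per example in the batch) that are advanced k steps
    per update and carried over, never reinitialized with training data."""
    dW, db, dc = np.zeros_like(W), np.zeros_like(b), np.zeros_like(c)
    for idx, v0 in enumerate(batch):
        v = chains[idx]
        for _ in range(k):
            v, h = gibbs_step(v, W, b, c, rng)
        chains[idx] = v                  # persist the chain state
        p0 = sigmoid(W @ v0 + c)
        pk = sigmoid(W @ v + c)
        dW += np.outer(p0, v0) - np.outer(pk, v)
        db += v0 - v
        dc += p0 - pk
    return dW, db, dc
```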
The PCD algorithm was further refined in a variant called fast persistent contrastive divergence (FPCD, [37]). Fast PCD tries to reach a faster mixing of the Gibbs chain by introducing additional parameters $w_{ij}^f$, $b_j^f$, $c_i^f$ (for $i = 1, \dots, n$ and $j = 1, \dots, m$) referred to as fast parameters. This new set of parameters is only used for sampling and not in the model itself. When calculating the conditional distributions for Gibbs sampling, the regular parameters are replaced by the sum of the regular and the fast parameters, i.e., Gibbs sampling is based on the probabilities $p(H_i = 1 \,|\, v) = \sigma\left( \sum_{j=1}^{m} (w_{ij} + w_{ij}^f) v_j + (c_i + c_i^f) \right)$ and $p(V_j = 1 \,|\, h) = \sigma\left( \sum_{i=1}^{n} (w_{ij} + w_{ij}^f) h_i + (b_j + b_j^f) \right)$ instead of the conditional probabilities given by (23) and (24). The learning update rule for the fast parameters equals the one for the regular parameters, but with an independent, large learning rate leading to faster changes, as well as a large weight decay parameter. Weight decay can also be used for the regular parameters, but it was suggested that regularizing just the fast weights is sufficient [37].
Neither PCD nor FPCD seems to increase the mixing rate (or decrease the bias of the approximation) sufficiently to avoid the divergence problem, as can be seen in the empirical analysis in [11].
5.3 Parallel Tempering
One of the most promising sampling techniques used for RBM training so far is parallel tempering (PT, [33,10,8]). It introduces supplementary Gibbs chains that sample from more and more smoothed replicas of the original distribution. This can be formalized in the following way: Given an ordered set of M temperatures $T_1, T_2, \dots, T_M$ with $1 = T_1 < T_2 < \dots < T_M$, we define a set of M Markov chains with stationary distributions

$p_r(v, h) = \frac{1}{Z_r} e^{-\frac{1}{T_r} E(v,h)}$   (38)

for $r = 1, \dots, M$, where $Z_r = \sum_{v,h} e^{-\frac{1}{T_r} E(v,h)}$ is the corresponding partition function, and $p_1$ is exactly the model distribution.
Algorithm 2. k-step parallel tempering with M temperatures
Input: RBM $(V_1, \dots, V_m, H_1, \dots, H_n)$, training batch S, current state $v_r$ of the Markov chain with stationary distribution $p_r$ for $r = 1, \dots, M$
Output: gradient approximation $\Delta w_{ij}$, $\Delta b_j$ and $\Delta c_i$ for $i = 1, \dots, n$, $j = 1, \dots, m$
1  init $\Delta w_{ij} = \Delta b_j = \Delta c_i = 0$ for $i = 1, \dots, n$, $j = 1, \dots, m$
2  forall the $v \in S$ do
3    for $r = 1, \dots, M$ do
4      $v_r^{(0)} \leftarrow v_r$
5      for $i = 1, \dots, n$ do sample $h_{r,i}^{(0)} \sim p(h_{r,i} \,|\, v_r^{(0)})$
6      for $t = 0, \dots, k - 1$ do
7        for $j = 1, \dots, m$ do sample $v_{r,j}^{(t+1)} \sim p(v_{r,j} \,|\, h_r^{(t)})$
8        for $i = 1, \dots, n$ do sample $h_{r,i}^{(t+1)} \sim p(h_{r,i} \,|\, v_r^{(t+1)})$
9      $v_r \leftarrow v_r^{(k)}$
   /* swapping order below works well in practice [26] */
10   for $r \in \{s \,|\, 2 \leq s \leq M \text{ and } s \bmod 2 = 0\}$ do
11     swap $(v_r^{(k)}, h_r^{(k)})$ and $(v_{r-1}^{(k)}, h_{r-1}^{(k)})$ with probability given by (40)
12   for $r \in \{s \,|\, 3 \leq s \leq M \text{ and } s \bmod 2 = 1\}$ do
13     swap $(v_r^{(k)}, h_r^{(k)})$ and $(v_{r-1}^{(k)}, h_{r-1}^{(k)})$ with probability given by (40)
14   for $i = 1, \dots, n$, $j = 1, \dots, m$ do
15     $\Delta w_{ij} \leftarrow \Delta w_{ij} + p(H_i = 1 \,|\, v) \cdot v_j - p(H_i = 1 \,|\, v_1^{(k)}) \cdot v_{1,j}^{(k)}$
16     $\Delta b_j \leftarrow \Delta b_j + v_j - v_{1,j}^{(k)}$
17     $\Delta c_i \leftarrow \Delta c_i + p(H_i = 1 \,|\, v) - p(H_i = 1 \,|\, v_1^{(k)})$
In each step of the algorithm, we run k (usually k = 1) Gibbs sampling steps in each tempered Markov chain, yielding samples $(v_1, h_1), \dots, (v_M, h_M)$. After this, two neighboring Gibbs chains with temperatures $T_r$ and $T_{r-1}$ may exchange particles $(v_r, h_r)$ and $(v_{r-1}, h_{r-1})$ with an exchange probability based on the Metropolis ratio,

$\min\left\{ 1, \; \frac{p_r(v_{r-1}, h_{r-1}) \, p_{r-1}(v_r, h_r)}{p_r(v_r, h_r) \, p_{r-1}(v_{r-1}, h_{r-1})} \right\}$ ,   (39)
which gives for RBMs

$\min\left\{ 1, \; \exp\left( \left( \frac{1}{T_r} - \frac{1}{T_{r-1}} \right) \left( E(v_r, h_r) - E(v_{r-1}, h_{r-1}) \right) \right) \right\}$ .   (40)
After performing these swaps between chains, which increase the mixing rate, we take the (possibly exchanged) sample $v_1$ of the original chain (with temperature $T_1 = 1$) as a sample from the model distribution. This procedure is repeated L times, yielding samples $v_{1,1}, \dots, v_{1,L}$ used for the approximation of the expectation under the RBM distribution in the log-likelihood gradient (i.e., for the approximation of the second term in (9)). Usually L is set to the number of samples in the (mini) batch of training data, as shown in algorithm 2.
Compared to CD, PT introduces computational overhead, but results in a faster mixing Markov chain and thus a less biased gradient approximation. The evolution of the log-likelihood during training using PT with different values of M can be seen in figure 2.
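The exchange probability (40) is cheap to compute from the energies of the two candidate states. A small sketch (the function name is ours):

```python
import numpy as np

def swap_probability(E_r, E_r_prev, T_r, T_r_prev):
    """Probability (40) of swapping the particles of the chains with
    temperatures T_r and T_{r-1}, given the energies of their states."""
    return min(1.0, np.exp((1.0 / T_r - 1.0 / T_r_prev) * (E_r - E_r_prev)))
```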
6 RBMs with Real-Valued Variables
So far, we considered only observations represented by binary vectors, but often
one would like to model distributions over continuous data. There are several
ways to define RBMs with real-valued visible units. As demonstrated by [18], one can model a continuous distribution with a binary RBM by a simple trick. The input data is scaled to the interval [0, 1] and modeled by the probability of the visible variables being one. That is, instead of sampling binary values, the expectation $p(V_j = 1 \,|\, h)$ is regarded as the current state of the variable $V_j$. Except for the continuous values of the visible variables and the resulting changes in the sampling procedure, the learning process remains the same. By keeping the energy function as given in (20) and just replacing the state space $\{0,1\}^m$ of V by $[0,1]^m$, the conditional distributions of the visible variables belong to the class of truncated exponential distributions. This can be shown in the same way as the sigmoid function for binary RBMs is derived in (27). Visible neurons with a Gaussian distributed conditional are, for example, gained by augmenting the energy with quadratic terms $\sum_j d_j v_j^2$ weighted by parameters $d_j$, $j = 1, \dots, m$.
In contrast to the universal approximation capabilities of standard RBMs on $\{0,1\}^m$, the subset of real-valued distributions that can be modeled by an RBM with real-valued visible units is rather constrained [38].
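A sketch of the [0, 1]-scaling trick as a variant of the block Gibbs step (the hidden units stay binary; only the visible "state" becomes the expectation):

```python
import numpy as np

def gibbs_step_real(v, W, b, c, rng):
    """Block Gibbs variant for inputs scaled to [0, 1]: instead of
    sampling binary visible values, the expectation p(V_j = 1 | h)
    is taken as the current state of the visible variables."""
    p_h = 1.0 / (1.0 + np.exp(-(W @ v + c)))
    h = (rng.random(p_h.shape) < p_h).astype(int)
    v = 1.0 / (1.0 + np.exp(-(W.T @ h + b)))  # expectation, not a sample
    return v, h
```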
More generally, it is possible to cover continuous-valued variables by extending the definition of an RBM to any MRF whose energy function is such that $p(h \,|\, v) = \prod_i p(h_i \,|\, v)$ and $p(v \,|\, h) = \prod_j p(v_j \,|\, h)$. As follows directly from the Hammersley-Clifford theorem, and as also discussed in [18], this holds for any energy function of the form

$E(v, h) = \sum_{i,j} \phi_{i,j}(h_i, v_j) + \sum_{j} \omega_j(v_j) + \sum_{i} \gamma_i(h_i)$   (41)

with real-valued functions $\phi_{i,j}$, $\omega_j$, and $\gamma_i$, $i = 1, \dots, n$ and $j = 1, \dots, m$, fulfilling the constraint that the partition function Z is finite. Welling et al. [40] come to almost the same generalized form of the energy function in their framework for constructing exponential family harmoniums from arbitrary marginal distributions $p(v_j)$ and $p(h_i)$ from the exponential family.
7 Loosening the Restrictions
In this closing section, we will give a very brief outlook on selected extensions
of RBMs that loosen the imposed restrictions on the bipartite network topology
by introducing dependencies on further random variables or by allowing for
arbitrary connections between nodes in the model.
Conditional RBMs. Several generalizations and extensions of RBMs exist. A notable example are conditional RBMs (see, e.g., [35,29]). In these models, some of the parameters in the RBM energy are replaced by parametrized functions of some conditioning random variables; see [2] for an introduction.
Boltzmann Machines. Removing the "R" from the RBM brings us back to where everything started, the general Boltzmann machine [1]. These are MRFs consisting of a set of hidden and visible variables where the energy is given by

$E(v, h) = -\sum_{i=1}^{n} \sum_{j=1}^{m} h_i w_{ij} v_j - \sum_{k=1}^{m} \sum_{l<k} v_k u_{kl} v_l - \sum_{k=1}^{n} \sum_{l<k} h_k y_{kl} h_l - \sum_{j=1}^{m} b_j v_j - \sum_{i=1}^{n} c_i h_i$ .   (42)

The graph corresponds to the one of an RBM with additional connections between the variables of one layer. These dependencies make sampling more complex (in Gibbs sampling, each variable has to be updated individually) and thus training more difficult. However, specialized learning algorithms for particular deep graph structures have been developed [32].
8 Next Steps
The goal of this tutorial was to introduce RBMs from the probabilistic graphical
model perspective. The text is meant to supplement existing tutorials, and it
is biased in the sense that it focuses on material that we found helpful in our
work. We hope that the reader is now equipped to move on to advanced models
building on RBMs, in particular to deep learning architectures, where [2] may
serve as an excellent starting point.
All experiments in this tutorial can be reproduced using the open source
machine learning library Shark [20], which implements most of the models and
algorithms that were discussed.
Acknowledgments. The authors acknowledge support from the German Fed-
eral Ministry of Education and Research within the National Network Computa-
tional Neuroscience under grant number 01GQ0951 (Bernstein Fokus Learning
behavioral models: From human experiment to technical assistance).
References
1. Ackley, D.H., Hinton, G.E., Sejnowski, T.J.: A learning algorithm for Boltzmann machines. Cognitive Science 9, 147-169 (1985)
2. Bengio, Y.: Learning deep architectures for AI. Foundations and Trends in Machine Learning 2(1), 1-127 (2009)
3. Bengio, Y., Delalleau, O.: Justifying and generalizing contrastive divergence. Neural Computation 21(6), 1601-1621 (2009)
4. Bengio, Y., Lamblin, P., Popovici, D., Larochelle, H.: Greedy layer-wise training of deep networks. In: Schölkopf, B., Platt, J., Hoffman, T. (eds.) Advances in Neural Information Processing Systems (NIPS 19), pp. 153-160. MIT Press (2007)
5. Bishop, C.M.: Pattern Recognition and Machine Learning. Springer (2006)
6. Brémaud, P.: Markov Chains: Gibbs Fields, Monte Carlo Simulation, and Queues. Springer (1999)
7. Carreira-Perpiñán, M.