Noise-Contrastive Estimation: A New Estimation Principle For Unnormalized Statistical Models
Michael Gutmann, Aapo Hyvärinen
2.3 Properties of the estimator

We characterize here the behavior of the estimator θ̂T when the sample size T becomes arbitrarily large. The weak law of large numbers shows that in that case, the objective function JT(θ) converges in probability to J,

J(θ) = (1/2) E { ln[h(x; θ)] + ln[1 − h(y; θ)] }.   (12)

Let us denote by J̃ the objective J seen as a function of f(.) = ln pm(.; θ), i.e.

J̃(f) = (1/2) E { ln[r(f(x) − ln pn(x))] + ln[1 − r(f(y) − ln pn(y))] }.   (13)

We start the characterization of the estimator θ̂T with a description of the optimization landscape of J̃.³ The following theorem shows that the data pdf pd(.) can be found by maximization of J̃, i.e. by learning a classifier under the ideal situation of an infinite amount of data.

Theorem 1 (Nonparametric estimation). J̃ attains a maximum at f(.) = ln pd(.). There are no other extrema if the noise density pn(.) is chosen such that it is nonzero whenever pd(.) is nonzero.

A fundamental point in the theorem is that the maximization is performed without any normalization constraint for f(.). This is in stark contrast to MLE, where exp(f) must integrate to one. With our objective function, no such constraint is necessary: the maximizing pdf is found to have unit integral automatically. The positivity condition for pn(.) in the theorem tells us that the data pdf pd(.) cannot be inferred in relevant regions of the data space where there are no contrastive noise samples. This situation can easily be avoided by taking, for example, a Gaussian distribution for the contrastive noise.

In practice, the amount of data is limited and a finite number of parameters θ ∈ Rᵐ specify pm(.; θ). This has in general two consequences: First, it restricts the space in which the data pdf pd(.) is searched for. Second, it may introduce local maxima into the optimization landscape. For the characterization of the estimator in this situation, it is normally assumed that pd(.) follows the model, i.e. there is a θ⋆ such that pd(.) = pm(.; θ⋆).

Our second theorem tells us that θ̂T, the value of θ which (globally) maximizes JT, converges to θ⋆ and thus leads to the correct estimate of pd(.) as the sample size T increases. For unnormalized models, the log-normalization constant is part of the parameters. This means that the maximization of our objective function leads to the correct estimates for both the parameters α in the unnormalized pdf p0m(.; α) and the log-normalization constant c, which is impossible when using likelihood.

Theorem 2 (Consistency). If conditions (a) to (c) are fulfilled, then θ̂T converges in probability to θ⋆, i.e. θ̂T →(P) θ⋆.

(a) pn(.) is nonzero whenever pd(.) is nonzero

(b) sup_θ |JT(θ) − J(θ)| →(P) 0

(c) I = ∫ g(x) g(x)ᵀ P(x) pd(x) dx has full rank, where

P(x) = pn(x) / (pd(x) + pn(x)),   g(x) = ∇θ ln pm(x; θ)|θ⋆

Condition (a) is inherited from Theorem 1, and is easily fulfilled by choosing, for example, the noise to be Gaussian. Conditions (b) and (c) have their counterparts in MLE, see e.g. (Wasserman, 2004). We need in (b) uniform convergence in probability of JT to J; in MLE, uniform convergence of the log-likelihood to the Kullback-Leibler distance is likewise required. Condition (c) assures that for large sample sizes, the objective function JT becomes peaked enough around the true value θ⋆. This imposes, through the vector g, a condition on the model pm(.; θ). A similar constraint is required in MLE. For the estimation of normalized models pm(.; α), where the normalization constant is not part of the parameters, the vector g(x) is the score function as in MLE. Furthermore, if P(x) were a constant, I would be proportional to the Fisher information matrix.

The following theorem describes the distribution of the estimation error (θ̂T − θ⋆) for large sample sizes.

Theorem 3 (Asymptotic normality). √T (θ̂T − θ⋆) is asymptotically normal with mean zero and covariance matrix Σ,

Σ = I⁻¹ − 2 I⁻¹ [∫ g(x) P(x) pd(x) dx] [∫ g(x)ᵀ P(x) pd(x) dx] I⁻¹.   (14)

When we are estimating a normalized model pm(.; α), we observe here again some similarities to MLE by considering the hypothetical case that P(x) is a constant: the integral in the brackets is then zero because it is proportional to the expectation of the score function, which is zero. The covariance matrix Σ is thus, up to a scaling constant, equal to the Fisher information matrix.

Theorem 3 leads to the following corollary:

Corollary 1. For large sample sizes T, the mean squared error E ||θ̂T − θ⋆||² behaves like tr(Σ)/T.

³ Proofs are omitted due to a lack of space.
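For concreteness, the sample objective JT whose large-T limit is Eq. (12) can be sketched in a few lines of Python. The callables `log_pm` and `log_pn` are hypothetical placeholders we introduce for the model and noise log-densities; they are not part of the paper:

```python
import numpy as np

def log_r(z):
    """Numerically stable log of the logistic function r(z) = 1/(1 + exp(-z))."""
    return -np.logaddexp(0.0, -z)

def J_T(theta, X, Y, log_pm, log_pn):
    """Sample objective whose large-T limit is Eq. (12).

    X: data samples, Y: noise samples (equal count T).
    With G(u) = ln p_m(u; theta) - ln p_n(u), the classifier output is
    h(u; theta) = r(G(u)), so ln h = log_r(G) and ln(1 - h) = log_r(-G).
    """
    Gx = log_pm(X, theta) - log_pn(X)
    Gy = log_pm(Y, theta) - log_pn(Y)
    return 0.5 * (np.mean(log_r(Gx)) + np.mean(log_r(-Gy)))
```

A quick sanity check on this sketch: when the model density equals the noise density, G ≡ 0 everywhere and JT = −ln 2, i.e. the classifier assigns posterior probability 50% to each class, consistent with the discussion of Theorem 1.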
2.4 Choice of the contrastive noise distribution

The noise distribution pn(.), which is used for contrast, is a design parameter. In practice, we would like to have a noise distribution which fulfills the following:

(1) It is easy to sample from, since the method relies on a set of samples Y from the noise distribution.

(2) It allows for an analytical expression for the log-pdf, so that we can evaluate the objective function in Eq. (3) without any problems.

(3) It leads to a small mean squared error E ||θ̂T − θ⋆||².

Our result on consistency (Theorem 2) also includes some technical constraints on pn(.), but they are so mild that, given an estimation problem at hand, many distributions will satisfy them. In principle, one could minimize the mean squared error (MSE) in Corollary 1 with respect to the noise distribution pn(.). However, this turns out to be quite difficult, and sampling from such a distribution might not be straightforward either. In practice, a well-known noise distribution which satisfies points (1) and (2) above seems to be a good choice. Some examples are a Gaussian or uniform distribution, a Gaussian mixture distribution, or an ICA distribution.

Intuitively, the noise distribution should be close to the data distribution, because otherwise the classification problem might be too easy and would not require the system to learn much about the structure of the data. This intuition is partly justified by the following theoretical result: if the noise equals the data distribution, then Σ in Theorem 3 equals two times the Cramér-Rao bound. Thus, for a noise distribution that is close to the data distribution, we have some guarantee that the MSE is reasonably close to the theoretical optimum.⁴ As a consequence, one could choose a noise distribution by first estimating a preliminary model of the data, and then use this preliminary model as the noise distribution.

3 Simulations

3.1 Simulations with artificial data

We illustrate noise-contrastive estimation with the estimation of an ICA model (Hyvärinen et al., 2001), and compare its performance with other estimation methods, namely MLE, MLE where the normalization (partition function) is calculated with importance sampling (see e.g. (Wasserman, 2004) for an introduction to importance sampling), contrastive divergence (Hinton, 2002), and score matching (Hyvärinen, 2005). MLE gives the performance baseline. It can, however, only be used if an analytical expression for the partition function is available. The other methods can all be used to learn unnormalized models.

3.1.1 Data and unnormalized model

Data x ∈ R⁴ is generated via the ICA model

x = As,   (15)

where A = (a₁, ..., a₄) is a 4 × 4 mixing matrix. All four independent sources in s follow a Laplacian density of unit variance and zero mean. The data log-pdf ln pd(.) is thus

ln pd(x) = −√2 Σᵢ₌₁⁴ |b⋆ᵢ x| + (ln |det B⋆| − ln 4),   (16)

where b⋆ᵢ is the i-th row of the matrix B⋆ = A⁻¹. The unnormalized model is

ln p0m(x; α) = −√2 Σᵢ₌₁⁴ |bᵢ x|.   (17)

The parameters α ∈ R¹⁶ are the row vectors bᵢ. For noise-contrastive estimation, we also consider the normalization constant to be a parameter and work with

ln pm(x; θ) = ln p0m(x; α) + c.   (18)

The scalar c is an estimate of the negative log-partition function. The total set of parameters for noise-contrastive estimation is thus θ = {α, c}, while for the other methods the parameters are given by α. The true values of the parameters are the vectors b⋆ᵢ for α and c⋆ = ln |det B⋆| − ln 4 for c.

3.1.2 Estimation methods

For noise-contrastive estimation, we choose the contrastive noise y to be Gaussian with the same mean and covariance matrix as x. The parameters θ are then estimated by learning to discriminate between the data x and the noise y, i.e. by maximizing JT in Eq. (3). The optimization is done with a conjugate gradient algorithm (Rasmussen, 2006).

We now give a short overview of the estimation methods that we used for comparison and comment on our implementation:

In MLE, the parameters α are chosen such that the probability of the observed data is maximized, i.e.

J_MLE(α) = (1/T) Σₜ ln p0m(x(t); α) − ln Z(α),   (19)

⁴ At first glance, this might be counterintuitive. In the setting of logistic regression, however, we will then have to learn that the two distributions are equal and that the posterior probability for any point belonging to either of the two classes is 50%, which is a well-defined problem.
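The setup of Sections 3.1.1 and 3.1.2 can be sketched end to end in Python: generate data from the ICA model of Eq. (15), draw moment-matched Gaussian contrastive noise, and maximize JT over θ = {B, c}. This is only an illustrative sketch under our own assumptions: SciPy's conjugate-gradient routine stands in for (Rasmussen, 2006), and all variable names are ours, not the paper's.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
d, T = 4, 2000

# Data x = A s, Eq. (15): Laplacian sources with zero mean and unit variance
# (variance of Laplace(scale=b) is 2 b^2, so b = 1/sqrt(2)).
A = rng.standard_normal((d, d))
S = rng.laplace(scale=1.0 / np.sqrt(2.0), size=(d, T))
X = A @ S

# Contrastive noise: Gaussian with the same mean and covariance as the data.
mu, C = X.mean(axis=1), np.cov(X)
Y = rng.multivariate_normal(mu, C, size=T).T
Cinv, logdetC = np.linalg.inv(C), np.linalg.slogdet(C)[1]

def log_pn(U):
    """Gaussian noise log-density, evaluated column-wise on U (d x T)."""
    Z = U - mu[:, None]
    return (-0.5 * np.einsum('it,ij,jt->t', Z, Cinv, Z)
            - 0.5 * (d * np.log(2.0 * np.pi) + logdetC))

def neg_J_T(theta):
    """Negative NCE objective for the unnormalized ICA model, Eq. (17)-(18)."""
    B, c = theta[:d * d].reshape(d, d), theta[-1]
    log_pm = lambda U: -np.sqrt(2.0) * np.abs(B @ U).sum(axis=0) + c
    log_r = lambda z: -np.logaddexp(0.0, -z)   # stable log-logistic
    Gx, Gy = log_pm(X) - log_pn(X), log_pm(Y) - log_pn(Y)
    return -0.5 * (log_r(Gx).mean() + log_r(-Gy).mean())

theta0 = np.concatenate([0.1 * rng.standard_normal(d * d), [0.0]])
res = minimize(neg_J_T, theta0, method='CG', options={'maxiter': 60})
B_hat, c_hat = res.x[:d * d].reshape(d, d), res.x[-1]
```

With enough data and iterations, B_hat would approach B⋆ = A⁻¹ up to the usual ICA permutation and sign indeterminacies, and c_hat would estimate c⋆ = ln |det B⋆| − ln 4.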
[Figure 1, panels (a)-(d): plots omitted. Legend entries: MLE; NCE with Gaussian noise (NCE GN), including its norm-const estimate; score matching (SM); importance sampling (Imp samp); contrastive divergence (CD). Axes: estimation error against log10 sample size in (a), (c), (d), and against log10 computation time [s] in (b).]
Figure 1: Noise-contrastive estimation of an ICA model and comparison to other estimation methods. Figure (a) shows the mean squared error (MSE) for the estimation methods as a function of the sample size. Figure (b) shows the estimation error as a function of the computation time. Among the methods for unnormalized models, noise-contrastive estimation (NCE) requires the least computation time to reach a required level of precision. Figure (c) shows that NCE with Laplacian contrastive noise leads to a better estimate than NCE with Gaussian noise. Figure (d) shows that Corollary 1 correctly describes the behavior of the MSE for large sample sizes.
Simulation and plotting details: Figures (a)-(c) show the median of the simulation results. In (d), we took the
average. For each sample size T , averaging is based on 500 random mixing matrices A with condition numbers less
than 10. In figures (a),(c) and (d), we started for each mixing matrix the optimization at 5 different initializations
in order to avoid local optima. This was not possible for contrastive divergence (CD) as it does not have a proper
objective function. This might be the reason for the higher error variance of CD that we have pointed out in
the main text. For NCE and score matching (SM), we relied on the built-in criteria of (Rasmussen, 2006) to
determine convergence. For the other methods, we did not use a particular convergence criterion: They were all
given a sufficiently long running time to assure that the algorithms had converged. In figure (b), we performed
only one optimization run per mixing matrix to make the comparison fair. Note that, while the other figures
show the MSE at time of convergence of the algorithms, this figure shows the behavior of the error during the
runs. In more detail, the curves in the figure show the, on average, minimal possible estimation error at a given
time. For any given method at hand, the curve was created as follows: We monitored for each mixing matrix
the estimation error and the elapsed time during the runs. This was done for all the sample sizes that we used
in figure (a). For any fixed time t0 , we obtained in that way a set of estimation errors, one for each sample size.
We retained then the smallest error. This gives the minimal possible estimation error that can be achieved by
time t0 . Taking the median over all 500 mixing matrices yielded, for a given method, the curve shown in the
figure. Note that, by construction, the curves in the figure do not depend on the stopping criterion. Comparing
figure (b) with figure (a) shows furthermore that the curves in (b) flatten out at error levels that correspond to
the MSE in figure (a) for the sample size T = 16000, i.e. log10(T) ≈ 4.2. This was the largest sample size used
in the simulations. For larger sample sizes, the curves would flatten out at lower error levels.
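The construction of the minimal-error curves in figure (b) can be paraphrased in code. This is our own sketch of the procedure described above; the data structures (one `(times, errors)` trace per run) are illustrative assumptions:

```python
import numpy as np

def min_error_by_time(runs, t_grid):
    """Smallest estimation error achieved by each time t0 in t_grid.

    runs: list of (times, errors) pairs, one per sample size for a single
    mixing matrix; times is sorted ascending, errors[k] is the error
    monitored at times[k].
    """
    best = np.full(len(t_grid), np.inf)
    for times, errors in runs:
        running_min = np.minimum.accumulate(errors)  # best error so far in this run
        # index of the last monitored point no later than each t0
        idx = np.searchsorted(times, t_grid, side='right') - 1
        valid = idx >= 0                             # t0 before the run started: skip
        best[valid] = np.minimum(best[valid], running_min[idx[valid]])
    return best
```

The published curve for a method would then be the median of these per-matrix curves over all 500 mixing matrices.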
3.2 Simulations with natural images

We use here noise-contrastive estimation to learn the statistical structure of natural images. Current models of natural images can be broadly divided into patch-based models and Markov random field (MRF) based models. Patch-based models are mostly two-layer models (Osindero et al., 2006; Köster & Hyvärinen, 2007; Karklin & Lewicki, 2005), although in (Osindero & Hinton, 2008) a three-layer model is presented. Most of these models are unnormalized. Score matching and contrastive divergence have typically been used to estimate them. For the learning of MRFs from natural images, contrastive divergence has been used in (Roth & Black, 2009), while (Köster et al., 2009) employs score matching.

3.2.1 Patch-model

Natural image data was obtained by sampling patches of size 30 × 30 pixels from images of van Hateren's database which depict wild-life scenes only. As preprocessing, we removed the DC component of each patch, whitened the data, and reduced the dimensions from 900 to 225. The dimension reduction implied that we retained 92% of the variance of the data. As a novel preprocessing step, we then further normalized each image patch so that it had zero DC value and unit variance. The whitened data was thus projected onto a sphere. Projection onto a sphere can be considered as a form of divisive normalization (Lyu & Simoncelli, 2009). For the contrastive noise, we used a uniform distribution on the sphere.

Our model for a patch x is

log pm(x; θ) = Σₙ fth( ln[vₙ (Wx)² + 1] + bₙ ) + c,

where the squaring operation is applied to every element of the vector Wx, and fth(.) is a smooth thresholding function.⁵ The parameters θ of the model are the matrix W ∈ R²²⁵ˣ²²⁵, the 225 row vectors vₙ ∈ R²²⁵ and the equal number of bias terms bₙ which define the thresholds, as well as c for the normalization of the pdf. The only constraint we impose is that the vectors vₙ are limited to be non-negative.

We learned the model in three steps: First, we learned all the parameters while keeping the second layer (the matrix V with row vectors vₙ) fixed to identity. The second step was learning of V with the other parameters held fixed. Initializing V randomly to small values proved helpful. When the objective function reached again the level it had at the end of the first step, we switched, as the third step, to concurrent learning of all parameters. For the optimization, we used a conjugate gradient algorithm (Rasmussen, 2006).

Figure 2 (a) shows the estimation results. The first layer features wᵢ (rows of W) are Gabor-like ("simple cells"). The second layer weights vᵢ pool together features of similar orientation and frequency, which are not necessarily centered at the same location ("complex cells"). The results correspond to those reported in (Köster & Hyvärinen, 2007) and (Osindero et al., 2006).

3.2.2 Markov random field

We used basically the same data, preprocessing and contrastive noise as for the patch-based model. In order to train an MRF with clique size 15 pixels, we used, however, image patches of size 45 × 45 pixels.⁶ Furthermore, for whitening, we employed a whitening filter of size 9 × 9 pixels. No redundancy reduction was performed.

Denote by I(ξ) the pixel value of an image I(.) at position ξ. Our model for an image I(.) is

log pm(I; θ) = Σ_{ξ,i} fth( Σ_{ξ′} wᵢ(ξ′) Iw(ξ + ξ′) + bᵢ ) + c,

where Iw(.) is the image I(.) filtered with the whitening filter. The parameters θ of the model are the filters wᵢ(.) (size 7 × 7 pixels), the thresholds bᵢ for i = 1 ... 25, and c for the normalization of the pdf.

Figure 2 shows the learned filters wᵢ(.) after convolution with the whitening filter. The filters are rather high-frequency and Gabor-like. This is different compared to (Roth & Black, 2009), where the filters had no clear structure. In (Köster et al., 2009), the filters, which were shown in the whitened space, were also Gabor-like. However, unlike in our model, a norm constraint on the filters was there necessary to obtain several non-vanishing filters.

4 Conclusion

We proposed here a new estimation principle, noise-contrastive estimation, which consistently estimates sophisticated statistical models that do not need to be normalized (e.g. energy-based models or Markov random fields). In fact, the normalization constant can be estimated as any other parameter of the model. One benefit of having an estimate of the normalization constant at hand is that it could be used to compare the likelihood of several distinct models. Furthermore, the principle shows a new connection between unsupervised and supervised learning.

For a tractable ICA model, we compared noise-contrastive estimation with other methods that can

⁵ fth(u) = 0.25 ln(cosh(2u)) + 0.5u + 0.17
⁶ Although the MRF is a model for an entire image, training can be done with image patches, see (Köster et al., 2009).
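The smooth thresholding function of footnote 5 and the two-layer patch energy can be written out as a short sketch. The shapes and function names are our own illustrative assumptions, and the sketch omits how the non-negativity of V is enforced during optimization:

```python
import numpy as np

def f_th(u):
    """Smooth thresholding function of footnote 5:
    f_th(u) = 0.25 ln(cosh(2u)) + 0.5 u + 0.17.
    Approximately constant for u << 0 and approximately linear for u >> 0.
    Uses ln cosh(z) = logaddexp(z, -z) - ln 2 for numerical stability.
    """
    u = np.asarray(u, dtype=float)
    return 0.25 * (np.logaddexp(2.0 * u, -2.0 * u) - np.log(2.0)) + 0.5 * u + 0.17

def log_pm_patch(x, W, V, b, c):
    """Unnormalized two-layer patch model:
    log p_m(x; theta) = sum_n f_th( ln(v_n . (W x)^2 + 1) + b_n ) + c,
    where the squaring is elementwise and V (rows v_n) is non-negative.
    """
    s = (W @ x) ** 2                 # squared first-layer outputs
    return f_th(np.log(V @ s + 1.0) + b).sum() + c
```

The thresholding shape is what makes the second layer act like a complex-cell pooling stage: small pooled energies are flattened toward a constant, while large ones pass through roughly linearly.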
(a) Patch-model (patch size: 30 pixels) (b) MRF (clique size: 15 pixels)
Figure 2: Noise-contrastive estimation of models for natural images. (a) Random selection of 2 × 10 out of 225 pooling patterns. Every vector vi corresponds to a pooling pattern. The patches in pooling pattern i0 show the wn, and the black bar under each patch indicates the strength vi0(n) by which a certain wn is pooled by vi0. (b) Learned filters wi(.) in the original space, i.e. after convolution with the whitening filter. The black bar under each patch indicates the norm of the filter.
be used to estimate unnormalized models. Noise-contrastive estimation is found to compare favorably: it offers the best trade-off between computational and statistical efficiency. We then applied noise-contrastive estimation to the learning of an energy-based two-layer model and a Markov random field model of natural images. The results confirmed the validity of the estimation principle: For the two-layer model, we obtained simple and complex cell properties in the first two layers. For the Markov random field, highly structured Gabor-like filters were obtained. Moreover, the two-layer model could be readily extended to have more layers. An important potential application of our estimation principle thus lies in deep learning.

In previous work, we used classification based on logistic regression to learn features from images (Gutmann & Hyvärinen, 2009). However, only one layer of Gabor features was learned in that paper, and, importantly, such learning was heuristic and not connected to estimation theory. Here, we showed an explicit connection to statistical estimation and provided a formal analysis of the learning in terms of estimation theory. This connection leads to further extensions of the principle which will be treated in future work.

References

Gutmann, M., & Hyvärinen, A. (2009). Learning features by contrasting natural images with noise. Proc. Int. Conf. on Artificial Neural Networks (ICANN2009).

Hinton, G. (2002). Training products of experts by minimizing contrastive divergence. Neural Computation, 14, 1771–1800.

Hyvärinen, A. (2005). Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6, 695–709.

Hyvärinen, A., Karhunen, J., & Oja, E. (2001). Independent component analysis. Wiley-Interscience.

Karklin, Y., & Lewicki, M. (2005). A hierarchical Bayesian model for learning nonlinear statistical regularities in nonstationary natural signals. Neural Computation, 17, 397–423.

Köster, U., & Hyvärinen, A. (2007). A two-layer ICA-like model estimated by score matching. Proc. Int. Conf. on Artificial Neural Networks (ICANN2007).

Köster, U., Lindgren, J., & Hyvärinen, A. (2009). Estimating Markov random field potentials for natural images. Int. Conf. on Independent Component Analysis and Blind Source Separation (ICA2009).

Lyu, S., & Simoncelli, E. (2009). Nonlinear extraction of independent components of natural images using radial Gaussianization. Neural Computation, 21, 1485–1519.

MacKay, D. (2002). Information theory, inference & learning algorithms. Cambridge University Press.

Osindero, S., & Hinton, G. (2008). Modeling image patches with a directed hierarchy of Markov random fields. In Advances in Neural Information Processing Systems 20, 1121–1128. MIT Press.

Osindero, S., Welling, M., & Hinton, G. E. (2006). Topographic product models applied to natural scene statistics. Neural Computation, 18(2).

Rasmussen, C. (2006). Conjugate gradient algorithm, version 2006-09-08. Available online.

Roth, S., & Black, M. (2009). Fields of experts. International Journal of Computer Vision, 82, 205–229.

Teh, Y., Welling, M., Osindero, S., & Hinton, G. (2004). Energy-based models for sparse overcomplete representations. Journal of Machine Learning Research, 4, 1235–1260.

Wasserman, L. (2004). All of statistics. Springer.