Reducing The Dimensionality of Data With Neural Networks

G. E. Hinton* and R. R. Salakhutdinov

High-dimensional data can be converted to low-dimensional codes by training a multilayer neural network with a small central layer to reconstruct high-dimensional input vectors. Gradient descent can be used for fine-tuning the weights in such "autoencoder" networks, but this works well only if the initial weights are close to a good solution. We describe an effective way of initializing the weights that allows deep autoencoder networks to learn low-dimensional codes that work much better than principal components analysis as a tool to reduce the dimensionality of data.

Dimensionality reduction facilitates the classification, visualization, communication, and storage of high-dimensional data. A simple and widely used method is principal components analysis (PCA), which finds the directions of greatest variance in the data set and represents each data point by its coordinates along each of these directions. We describe a nonlinear generalization of PCA that uses an adaptive, multilayer "encoder" network

28 JULY 2006 VOL 313 SCIENCE

to transform the high-dimensional data into a low-dimensional code and a similar "decoder" network to recover the data from the code. Starting with random weights in the two networks, they can be trained together by minimizing the discrepancy between the original data and its reconstruction. The required gradients are easily obtained by using the chain rule to backpropagate error derivatives first through the decoder network and then through the encoder network (1). The whole system is called an "autoencoder" and is depicted in Fig. 1.

Department of Computer Science, University of Toronto, 6 King's College Road, Toronto, Ontario M5S 3G4, Canada.
*To whom correspondence should be addressed; E-mail: hinton@cs.toronto.edu

It is difficult to optimize the weights in nonlinear autoencoders that have multiple hidden layers (2–4). With large initial weights, autoencoders typically find poor local minima; with small initial weights, the gradients in the early layers are tiny, making it infeasible to
train autoencoders with many hidden layers. If
the initial weights are close to a good solution,
gradient descent works well, but finding such
initial weights requires a very different type of
algorithm that learns one layer of features at a time. We introduce this "pretraining" procedure for binary data, generalize it to real-valued data, and show that it works well for a variety of data sets.

[Figure 1: diagram in three panels labeled Pretraining, Unrolling, and Fine-tuning. Pretraining: a stack of four RBMs with layer sizes 2000, 1000, 500, and 30 and weights W1 to W4. Unrolling: an encoder using W1 to W4, a 30-unit code layer, and a decoder using the transposed weights. Fine-tuning: the same network with adjusted weights, W1 + ε1 through W4 + ε4 in the encoder and the transposed weights plus ε5 through ε8 in the decoder.]

Fig. 1. Pretraining consists of learning a stack of restricted Boltzmann machines (RBMs), each having only one layer of feature detectors. The learned feature activations of one RBM are used as the "data" for training the next RBM in the stack. After the pretraining, the RBMs are "unrolled" to create a deep autoencoder, which is then fine-tuned using backpropagation of error derivatives.

An ensemble of binary vectors (e.g., images) can be modeled using a two-layer network called a "restricted Boltzmann machine" (RBM) (5, 6) in which stochastic, binary pixels are connected to stochastic, binary feature detectors using symmetrically weighted connections. The pixels correspond to "visible" units of the RBM because their states are observed; the feature detectors correspond to "hidden" units. A joint configuration (v, h) of the visible and hidden units has an energy (7) given by

$$E(\mathbf{v}, \mathbf{h}) = -\sum_{i \in \text{pixels}} b_i v_i - \sum_{j \in \text{features}} b_j h_j - \sum_{i,j} v_i h_j w_{ij} \tag{1}$$

where $v_i$ and $h_j$ are the binary states of pixel $i$ and feature $j$, $b_i$ and $b_j$ are their biases, and $w_{ij}$ is the weight between them. The network assigns a probability to every possible image via this energy function, as explained in (8). The probability of a training image can be raised by

Fig. 2. (A) Top to bottom: Random samples of curves from the test data set; reconstructions produced by the six-dimensional deep autoencoder; reconstructions by "logistic PCA" (8) using six components; reconstructions by logistic PCA and standard PCA using 18 components. The average squared error per image for the last four rows is 1.44, 7.64, 2.45, 5.90. (B) Top to bottom: A random test image from each class; reconstructions by the 30-dimensional autoencoder; reconstructions by 30-dimensional logistic PCA and standard PCA. The average squared errors for the last three rows are 3.00, 8.01, and 13.87. (C) Top to bottom: Random samples from the test data set; reconstructions by the 30-dimensional autoencoder; reconstructions by 30-dimensional PCA. The average squared errors are 126 and 135.
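The energy function of Eq. 1 can be illustrated with a few lines of code. The following is a minimal sketch in plain Python, not the authors' implementation; the function name `rbm_energy`, the argument names, and the toy values are all assumptions made for illustration.

```python
def rbm_energy(v, h, b_vis, b_hid, w):
    """Energy of a joint configuration (v, h), following Eq. 1.

    v, h   : lists of binary states (0 or 1) for pixels and features
    b_vis  : visible biases b_i
    b_hid  : hidden biases b_j
    w      : w[i][j] is the weight between pixel i and feature j
    """
    visible_term = sum(b * s for b, s in zip(b_vis, v))
    hidden_term = sum(b * s for b, s in zip(b_hid, h))
    interaction = sum(
        v[i] * h[j] * w[i][j] for i in range(len(v)) for j in range(len(h))
    )
    return -visible_term - hidden_term - interaction


# Toy configuration (hypothetical values, chosen only for illustration).
v = [1, 0, 1]
h = [1, 1]
b_vis = [0.5, -0.2, 0.1]
b_hid = [0.3, -0.4]
w = [[0.2, -0.1], [0.4, 0.3], [-0.5, 0.6]]

energy = rbm_energy(v, h, b_vis, b_hid, w)
print(round(energy, 6))
```

Configurations with lower energy receive higher probability under the model, which is why learning lowers the energy of training images.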


adjusting the weights and biases to lower the energy of that image and to raise the energy of similar, "confabulated" images that the network would prefer to the real data. Given a training image, the binary state $h_j$ of each feature detector $j$ is set to 1 with probability $\sigma(b_j + \sum_i v_i w_{ij})$, where $\sigma(x)$ is the logistic function $1/[1 + \exp(-x)]$, $b_j$ is the bias of $j$, $v_i$ is the state of pixel $i$, and $w_{ij}$ is the weight between $i$ and $j$. Once binary states have been chosen for the hidden units, a "confabulation" is produced by setting each $v_i$ to 1 with probability $\sigma(b_i + \sum_j h_j w_{ij})$, where $b_i$ is the bias of $i$. The states of the hidden units are then updated once more so that they represent features of the confabulation. The change in a weight is given by

$$\Delta w_{ij} = \varepsilon \left( \langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{recon}} \right) \tag{2}$$

where $\varepsilon$ is a learning rate, $\langle v_i h_j \rangle_{\text{data}}$ is the fraction of times that the pixel $i$ and feature detector $j$ are on together when the feature detectors are being driven by data, and $\langle v_i h_j \rangle_{\text{recon}}$ is the corresponding fraction for confabulations. A simplified version of the same learning rule is used for the biases. The learning works well even though it is not exactly following the gradient of the log probability of the training data (6).

A single layer of binary features is not the best way to model the structure in a set of images. After learning one layer of feature detectors, we can treat their activities, when they are being driven by the data, as data for learning a second layer of features. The first layer of feature detectors then becomes the visible units for learning the next RBM. This layer-by-layer learning can be repeated as many

Fig. 3. (A) The two-dimensional codes for 500 digits of each class produced by taking the first two principal components of all 60,000 training images. (B) The two-dimensional codes found by a 784-1000-500-250-2 autoencoder. For an alternative visualization, see (8).

Fig. 4. (A) The fraction of retrieved documents in the same class as the query when a query document from the test set is used to retrieve other test set documents, averaged over all 402,207 possible queries. (B) The codes produced by two-dimensional LSA. (C) The codes produced by a 2000-500-250-125-2 autoencoder.
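The learning rule of Eq. 2 can be sketched for a single training vector as follows. This is a hedged illustration in plain Python rather than the authors' code: the names `cd1_update` and `sample_bernoulli`, the single-example estimate of the two correlations (the text describes fractions accumulated over data), the specific bias updates, and the toy sizes are all assumptions.

```python
import math
import random


def sigmoid(x):
    """Logistic function 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def sample_bernoulli(p, rng):
    """Return 1 with probability p, else 0."""
    return 1 if rng.random() < p else 0


def cd1_update(v_data, w, b_vis, b_hid, lr, rng):
    """Apply one update in the spirit of Eq. 2, in place, for one vector.

    w[i][j] is the weight between pixel i and feature j.
    Returns the matrix of weight changes.
    """
    n_vis, n_hid = len(b_vis), len(b_hid)
    # Hidden states driven by the data.
    h_data = [
        sample_bernoulli(
            sigmoid(b_hid[j] + sum(v_data[i] * w[i][j] for i in range(n_vis))), rng
        )
        for j in range(n_hid)
    ]
    # Confabulation: visible states sampled from the hidden states.
    v_recon = [
        sample_bernoulli(
            sigmoid(b_vis[i] + sum(h_data[j] * w[i][j] for j in range(n_hid))), rng
        )
        for i in range(n_vis)
    ]
    # Hidden states updated once more, now driven by the confabulation.
    h_recon = [
        sample_bernoulli(
            sigmoid(b_hid[j] + sum(v_recon[i] * w[i][j] for i in range(n_vis))), rng
        )
        for j in range(n_hid)
    ]
    # Single-sample estimate of <v_i h_j>_data - <v_i h_j>_recon (assumption).
    delta = [
        [lr * (v_data[i] * h_data[j] - v_recon[i] * h_recon[j]) for j in range(n_hid)]
        for i in range(n_vis)
    ]
    for i in range(n_vis):
        for j in range(n_hid):
            w[i][j] += delta[i][j]
    # Simplified rule for the biases (assumed form).
    for i in range(n_vis):
        b_vis[i] += lr * (v_data[i] - v_recon[i])
    for j in range(n_hid):
        b_hid[j] += lr * (h_data[j] - h_recon[j])
    return delta


# Toy usage with hypothetical sizes and a fixed seed.
rng = random.Random(0)
n_vis, n_hid = 6, 3
w = [[rng.gauss(0.0, 0.1) for _ in range(n_hid)] for _ in range(n_vis)]
b_vis = [0.0] * n_vis
b_hid = [0.0] * n_hid
lr = 0.1
delta = cd1_update([1, 0, 1, 1, 0, 0], w, b_vis, b_hid, lr, rng)
```

In practice the two correlations would be averaged over many training vectors before the weights are changed; the single-vector form is used here only to keep the sketch short.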


times as desired. It can be shown that adding an extra layer always improves a lower bound on the log probability that the model assigns to the training data, provided the number of feature detectors per layer does not decrease and their weights are initialized correctly (9). This bound does not apply when the higher layers have fewer feature detectors, but the layer-by-layer learning algorithm is nonetheless a very effective way to pretrain the weights of a deep autoencoder. Each layer of features captures strong, high-order correlations between the activities of units in the layer below. For a wide variety of data sets, this is an efficient way to progressively reveal low-dimensional, nonlinear structure.

After pretraining multiple layers of feature detectors, the model is "unfolded" (Fig. 1) to produce encoder and decoder networks that initially use the same weights. The global fine-tuning stage then replaces stochastic activities by deterministic, real-valued probabilities and uses backpropagation through the whole autoencoder to fine-tune the weights for optimal reconstruction.

For continuous data, the hidden units of the first-level RBM remain binary, but the visible [...]

[...] $p_i$ is the intensity of pixel $i$ and $\hat{p}_i$ is the intensity of its reconstruction.

The autoencoder consisted of an encoder with layers of size (28 × 28)-400-200-100-50-25-6 and a symmetric decoder. The six units in the code layer were linear and all the other units were logistic. The network was trained on 20,000 images and tested on 10,000 new images. The autoencoder discovered how to convert each 784-pixel image into six real numbers that allow almost perfect reconstruction (Fig. 2A). PCA gave much worse reconstructions. Without pretraining, the very deep autoencoder always reconstructs the average of the training data, even after prolonged fine-tuning (8). Shallower autoencoders with a single hidden layer between the data and the code can learn without pretraining, but pretraining greatly reduces their total training time (8). When the number of parameters is the same, deep autoencoders can produce lower reconstruction errors on test data than shallow ones, but this advantage disappears as the number of parameters increases (8).

Next, we used a 784-1000-500-250-30 autoencoder to extract codes for all the handwritten digits in the MNIST training set (11). [...]

[...]tion task, the best reported error rates are 1.6% for randomly initialized backpropagation and 1.4% for support vector machines. After layer-by-layer pretraining in a 784-500-500-2000-10 network, backpropagation using steepest descent and a small learning rate achieves 1.2% (8). Pretraining helps generalization because it ensures that most of the information in the weights comes from modeling the images. The very limited information in the labels is used only to slightly adjust the weights found by pretraining.

It has been obvious since the 1980s that backpropagation through deep autoencoders would be very effective for nonlinear dimensionality reduction, provided that computers were fast enough, data sets were big enough, and the initial weights were close enough to a good solution. All three conditions are now satisfied. Unlike nonparametric methods (15, 16), autoencoders give mappings in both directions between the data and code spaces, and they can be applied to very large data sets because both the pretraining and the fine-tuning scale linearly in time and space with the number of training cases.
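The unrolling of a pretrained stack into encoder and decoder networks that initially share weights can be sketched as follows. This is an illustrative sketch in plain Python under stated assumptions, not the authors' code: the helper names (`unroll`, `reconstruct`, `layer_forward`), the tuple layout `(w, b_vis, b_hid)` for each RBM, and the tiny 4-3-2 toy stack are hypothetical. The linear code layer and logistic units elsewhere follow the description of the curves autoencoder in the text.

```python
import math


def sigmoid(x):
    """Logistic function 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def layer_forward(x, w, b, linear=False):
    """Deterministic pass through one layer; w[i][j] links input i to output j."""
    pre = [b[j] + sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(b))]
    return pre if linear else [sigmoid(a) for a in pre]


def transpose(w):
    """Transpose a weight matrix stored as a list of rows."""
    return [list(col) for col in zip(*w)]


def unroll(rbms):
    """Turn a bottom-to-top list of (w, b_vis, b_hid) into encoder and decoder.

    The encoder reuses each RBM's weights with its hidden biases; the decoder
    uses the transposed weights, in reverse order, with the visible biases.
    """
    encoder = [(w, b_hid) for (w, b_vis, b_hid) in rbms]
    decoder = [(transpose(w), b_vis) for (w, b_vis, b_hid) in reversed(rbms)]
    return encoder, decoder


def reconstruct(x, encoder, decoder):
    """Encode x to a code (linear code layer), then decode it back."""
    for k, (w, b) in enumerate(encoder):
        x = layer_forward(x, w, b, linear=(k == len(encoder) - 1))
    code = x
    for w, b in decoder:
        x = layer_forward(x, w, b)
    return code, x


# Hypothetical pretrained 4-3-2 stack with made-up weights and zero biases.
rbms = [
    (
        [[0.1, -0.2, 0.3], [0.0, 0.4, -0.1], [0.2, 0.2, 0.2], [-0.3, 0.1, 0.0]],
        [0.0] * 4,
        [0.0] * 3,
    ),
    ([[0.5, -0.5], [0.1, 0.2], [-0.2, 0.3]], [0.0] * 3, [0.0] * 2),
]
encoder, decoder = unroll(rbms)
code, recon = reconstruct([1.0, 0.0, 1.0, 0.0], encoder, decoder)
```

Fine-tuning would then adjust the encoder and decoder weights separately by backpropagating the reconstruction error, so the two halves stop sharing weights after this initialization.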

