Autoencoder
Machine Learning: Produced by Qiangfu Zhao (Since 2018), All rights reserved (C)
What is an autoencoder? (1/2)
• An autoencoder is a special MLP with one or more hidden layers.
• The teacher signal is equal to the input, i.e., the network is trained to output x̂ ≈ x.
[Figure: an MLP mapping the input x through the hidden layer(s) to the output x̂]
The main purpose of using an autoencoder is to find a new
(internal or latent) representation for the given feature space,
with the hope of obtaining the true factors that control the
distribution of the input data.
What is an autoencoder? (2/2)
• Using principal component analysis (PCA), we can obtain a linear autoencoder.
• Each hidden unit corresponds to an eigenvector of the covariance matrix of the input data.
[Figure: network with input layer, hidden layer(s), and output layer]
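• A minimal MATLAB sketch of this idea, assuming a recent MATLAB, a toy data matrix X with one sample per row, and Nh hidden units (all names and data below are only for illustration):

% PCA as a linear autoencoder (illustrative sketch, assumed variable names):
% the encoder projects onto the top Nh eigenvectors of the covariance matrix,
% and the decoder projects back to the input space.
X  = rand(100, 4);            % toy data, one sample per row
Nh = 2;                       % number of hidden units (principal components)
mu = mean(X, 1);
[V, D] = eig(cov(X), 'vector');
[~, idx] = sort(D, 'descend');
W  = V(:, idx(1:Nh));         % top Nh eigenvectors = hidden units
H  = (X - mu) * W;            % hidden (latent) representation
Xhat = H * W' + mu;           % linear reconstruction of X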
Training of autoencoder (1/4)
• By implementing an autoencoder as an MLP, we can find a more compact representation of the given problem space.
• Compared with a classification MLP, an autoencoder can be trained with unlabelled data.
• However, training an autoencoder is still supervised learning, because the input itself serves as the teacher signal.
• The BP (back-propagation) algorithm can also be used for training.
Training of autoencoder (2/4)
• The n-th input is denoted by x_n, and the corresponding output is denoted by x̂_n.
• Normally, the objective (loss) function used for training an
autoencoder is defined as follows:
E(w) = \sum_{n=1}^{N} \| x_n - \hat{x}_n \|^2    (1)
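• A minimal MATLAB sketch of Eq. (1), assuming a trained autoencoder enc and a data matrix X with one sample per column (the variable names are assumptions):

Xhat = predict(enc, X);            % reconstructions x̂_n, same size as X
E    = sum(sum((X - Xhat).^2));    % sum of squared reconstruction errors, Eq. (1)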
Training of autoencoder (3/4)
• Train an autoencoder for the well-known IRIS database using Matlab.
• The hidden layer size is 5.
• We can also specify other parameters. For details, visit the web page given below.
Training of autoencoder (4/4)
X = iris_dataset;
Nh = 5;
enc = trainAutoencoder(X, Nh);
view(enc)
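• A possible follow-up (a sketch, not part of the original code): reconstruct the data with the trained autoencoder and measure the error.

Xhat = predict(enc, X);                 % reconstructed IRIS samples
err  = mean(sum((X - Xhat).^2, 1));     % mean squared reconstruction error per sample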
Internal representation of data (1/2)
X = digitTrainCellArrayData;                        % 5,000 handwritten digit images
Nh = 36;
enc = trainAutoencoder(X, Nh, 'MaxEpochs', 500);    % 500 iterations, as stated below
plotWeights(enc);
• An autoencoder is trained for the dataset containing 5,000 handwritten digits.
• We used 500 iterations for training and a hidden layer size of 36.
• The figure on the right shows the weights of the hidden layer (one image per hidden neuron).
• These are similar to the eigenfaces.
Training with l2-norm regularization (1/2)
• The main purpose of an autoencoder is to reconstruct the input space using a smaller number of basis functions.
• If we use the objective function given by (1), the results may not generalize well to test data.
• To improve the generalization ability, a common practice is to introduce a penalty into the objective function as follows:
E(w) = \sum_{n=1}^{N} \| x_n - \hat{x}_n \|^2 + \lambda \| w \|^2    (2)

\| w \|^2 = \sum_{k=1}^{L} \sum_{j=1}^{N_k} \sum_{i=1}^{N_{k-1}} \left( w_{ji}^{(k)} \right)^2    (3)

• where L is the number of layers, N_k is the number of neurons in layer k, and w_{ji}^{(k)} is the weight connecting neuron i of layer k-1 to neuron j of layer k.
• The effect of introducing this l2-norm penalty is to make the solution "smoother".
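• In MATLAB, λ in Eq. (2) corresponds to the 'L2WeightRegularization' option of trainAutoencoder. A minimal sketch (X, Nh, and the value 0.004 are example choices):

enc = trainAutoencoder(X, Nh, 'L2WeightRegularization', 0.004);   % lambda = 0.004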
Training with l2-norm regularization (2/2)
Reconstruction error: \| X - \hat{X} \| = 0.0261
For this example, we cannot see the positive effect clearly. Generally speaking, however, if the inputs are noisy, regularization can yield better results.
Training with sparsity regularization (1/6)
• In nearest neighbor-based approximation, each datum is approximated by one of the already observed data (i.e., the nearest one).
• In PCA, each datum is approximated by a point in a linear space spanned by the basis vectors (eigenvectors).
• Using an autoencoder, each datum is approximated by a linear combination of the hidden neuron outputs (see the sketch below).
• Usually, the basis functions are global in the sense that ANY given datum can be approximated well using the same set of basis functions.
• Usually, the number N_b of basis functions equals the rank r of the linear space. For PCA, N_b ≪ r because we use only the "principal" basis functions.
[Lv and ZHAO, 2007] https://www.emeraldinsight.com/doi/abs/10.1108/17427370710847327
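• A minimal MATLAB sketch of the autoencoder case (assuming a trained autoencoder enc and a single datum x; the variable names are assumptions): the reconstruction is built only from the datum's hidden representation, i.e., from the learned basis functions.

h    = encode(enc, x);     % latent code of one datum (one value per hidden neuron)
xhat = decode(enc, h);     % reconstruction built from the hidden-neuron "basis functions"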
Training with sparsity regularization (2/6)
Training with sparsity regularization (3/6)
• For sparse representation, we introduce another penalty in
the objective function as follows:
E(w) = \sum_{n=1}^{N} \| x_n - \hat{x}_n \|^2 + \lambda \| w \|^2 + \beta \cdot F_{sparsity}    (5)
• To define the sparsity term F_{sparsity}, we need the average output activation value of each hidden neuron, given by

\hat{\rho}_j = \frac{1}{N} \sum_{n=1}^{N} g\left( u_j^{(1)}(x_n) \right)    (6)

• where N is the number of training data, and u_j^{(1)}(x_n) is the effective input of the j-th hidden neuron for the n-th training datum x_n.
• A neuron is very "active" if \hat{\rho}_j is high. To obtain a sparse neural network, it is necessary to make the neurons less active. This way, we can reconstruct any given datum using a smaller number of hidden neurons (basis functions).
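• A minimal MATLAB sketch of Eq. (6), assuming a trained autoencoder enc and a data matrix X with one sample per column (the names are assumptions):

H      = encode(enc, X);    % hidden activations, one column per training datum
rhoHat = mean(H, 2);        % average activation of each hidden neuron, Eq. (6)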
Training with sparsity regularization (4/6)
F_{sparsity} = \sum_{j=1}^{N_h} KL(\rho \parallel \hat{\rho}_j) = \sum_{j=1}^{N_h} \left[ \rho \log \frac{\rho}{\hat{\rho}_j} + (1-\rho) \log \frac{1-\rho}{1-\hat{\rho}_j} \right]    (7)
• where ρ is a sparsity parameter to be specified by the user: the smaller ρ is, the sparser the representation.
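• A minimal MATLAB sketch of Eq. (7) with example values (rho and rhoHat below are assumptions; rhoHat would normally come from Eq. (6)):

rho    = 0.10;                       % target sparsity proportion (user-specified)
rhoHat = [0.05; 0.20; 0.12; 0.08];   % example average activations, one per hidden neuron
Fsparsity = sum( rho*log(rho./rhoHat) + (1-rho)*log((1-rho)./(1-rhoHat)) );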
Training with sparsity regularization (5/6)
enc = trainAutoencoder(XTrain, Nh, ...
    'L2WeightRegularization', 0.004, ...   % lambda in Eq. (5')
    'SparsityRegularization', 4, ...       % beta in Eq. (5')
    'SparsityProportion', 0.10);           % rho in Eqs. (6)-(7)
E(w) = \sum_{n=1}^{N} \| x_n - \hat{x}_n \|^2 + \lambda \| w \|^2 + \beta \cdot F_{sparsity}    (5')
• By specifying a small sparsity proportion ρ, we can get an autoencoder with less-active hidden neurons.
• If we use a proper norm of w (e.g., the l1-norm), we can also reduce the number of non-zero weights of the hidden neurons and make the network even sparser.
Training with sparsity regularization (6/6)
Reconstruction error: \| X - \hat{X} \| = 0.0268
Training of deep network (2/5)
% Greedy layer-wise training: two autoencoders followed by a softmax output layer.
autoenc1 = trainAutoencoder(X, Nh1, 'DecoderTransferFunction', 'purelin');
features1 = encode(autoenc1, X);                 % first-level representation of X
autoenc2 = trainAutoencoder(features1, Nh2, ...
    'DecoderTransferFunction', 'purelin', 'ScaleData', false);
features2 = encode(autoenc2, features1);         % second-level representation
softnet = trainSoftmaxLayer(features2, T, 'LossFunction', 'crossentropy');
deepnet = stack(autoenc1, autoenc2, softnet);    % stack the encoders and the softmax layer
deepnet = train(deepnet, X, T);                  % fine-tune the whole network with BP
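• A possible follow-up (a sketch; the evaluation calls below are standard toolbox usage, not from the original slide):

Y = deepnet(X);          % class scores produced by the stacked network
plotconfusion(T, Y);     % confusion matrix of the fine-tuned deep network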
Training of deep network (3/5)
Training of deep network (4/5)
• To summarize, we can design a deep neural network (with K layers, not including the input layer) as follows (see the sketch after this list):
– Step 1: i = 1; X(1) = X(0); % X(0) is the given data
– Step 2: Train an autoencoder A(i) based on X(i);
– Step 3: X(i+1) = encode(A(i), X(i));
– Step 4: i = i + 1; if i < K, return to Step 2;
– Step 5: Train a regression (output) layer R using BP
• Training data: X(K)
• Teacher signal: Provided in the training set
– Step 6: Stack [A(1), A(2), …, A(K-1), R] to form a deep MLP.
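• A loop-based MATLAB sketch of these steps (assumptions: X0 is the given data, T the teacher signal, Nh a vector of hidden-layer sizes with numel(Nh) = K-1, and the output layer is a softmax layer as in the earlier example):

Xi = X0;                                  % X(1) = X(0), the given data
A  = cell(1, numel(Nh));
for i = 1:numel(Nh)                       % Steps 2-4: train A(i) and encode
    A{i} = trainAutoencoder(Xi, Nh(i));
    Xi   = encode(A{i}, Xi);              % X(i+1)
end
R = trainSoftmaxLayer(Xi, T, 'LossFunction', 'crossentropy');   % Step 5
deepnet = stack(A{:}, R);                 % Step 6: stack [A(1), ..., A(K-1), R]
deepnet = train(deepnet, X0, T);          % optional fine-tuning with BP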
Training of deep network (5/5)
• We can also train a deep autoencoder by modifying the algorithm slightly, as follows:
– Step 1: i = 1; X(1) = X(0); % X(0) is the given data
– Step 2: Train an autoencoder A(i) based on X(i);
– Step 3: X(i+1) = encode(A(i), X(i));
– Step 4: i = i + 1; if i < K, return to Step 2;
• K is the specified number of layers
– Step 5: Train a regression layer R using BP
• Training data: X(K)
• Teacher signal: X(0)
– Step 6: Stack [A(1), A(2), …, A(K-1), R] to form a deep autoencoder.
Homework
• Try the Matlab program for digit reconstruction given on the following web page:
https://www.mathworks.com/help/nnet/ref/trainautoencoder.html
• See what happens if we change the parameter for l2-norm regularization; and
• See what happens if we change the parameter for sparsity regularization.
• You may plot
– The weights of the hidden neurons (as images);
– The outputs of the hidden neurons; or
– The reconstructed data.
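• A possible starting point (a sketch; the parameter values and variable names below are only examples, not prescribed by the homework):

X = digitTrainCellArrayData;              % handwritten digit images
for lambda = [0.001 0.004 0.016]          % try several l2-regularization strengths
    enc = trainAutoencoder(X, 36, ...
        'L2WeightRegularization', lambda, ...
        'SparsityRegularization', 4, ...
        'SparsityProportion', 0.10);
    figure; plotWeights(enc);             % compare the learned hidden-layer weights
end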