
Deep Learning


Deep Learning Overview

Foreword

This chapter describes the basics of deep learning, including the development history of deep learning, the components and types of deep learning neural networks, and common problems in deep learning projects.

1 Huawei Confidential
Objectives

On completion of this course, you will be able to:


 Describe the definition and development of neural networks.
 Learn about the important components of deep learning neural networks.
 Understand training and optimization of neural networks.
 Describe common problems in deep learning.

2 Huawei Confidential
Contents

1. Deep Learning Summary

2. Training Rules

3. Activation Function

4. Normalizer

5. Optimizer

6. Types of Neural Networks

7. Common Problems

3 Huawei Confidential
Traditional Machine Learning and Deep Learning
 As a model based on unsupervised feature learning and feature hierarchy learning, deep
learning has great advantages in fields such as computer vision, speech recognition, and
natural language processing.

 Hardware requirements: Traditional machine learning has low hardware requirements; given the limited amount of computation, a GPU is generally not needed for parallel computing. Deep learning has higher requirements; executing matrix operations on massive data requires a GPU for parallel computing.
 Data: Traditional machine learning is suitable for training on small amounts of data, but its performance cannot keep improving as the data amount increases. Deep learning can reach high performance when high-dimensional weight parameters and massive training data are provided.
 Problem solving: Traditional machine learning breaks a problem down level by level; deep learning uses end-to-end (E2E) learning.
 Feature engineering: Traditional machine learning relies on manual feature selection; deep learning extracts features automatically by the algorithm.
 Interpretability: Traditional machine learning features are easy to explain; deep learning features are hard to explain.

4 Huawei Confidential
Traditional Machine Learning

(Figure: the traditional machine learning workflow: issue analysis and problem locating → data cleansing → feature extraction → feature selection → model training → inference, prediction, and identification. Question: can we use an algorithm to execute this procedure automatically?)

5 Huawei Confidential
Deep Learning
 Generally, the deep learning architecture is a deep neural network. "Deep" in
"deep learning" refers to the number of layers of the neural network.

(Figure: a biological neuron (dendrite, synapse, nucleus, axon) compared with a perceptron (input layer, output layer) and a deep neural network (input layer, hidden layers, output layer).)

6 Huawei Confidential
Neural Network
 There is not yet a universally accepted definition of the neural network. Hecht-Nielsen, a neural network researcher in the U.S., defines a neural network as a computer system composed of simple, highly interconnected processing elements that process information by their dynamic response to external inputs.
 In terms of its source, features, and explanation, a neural network can be simply described as an information processing system designed to imitate the structure and functions of the human brain.
 Artificial neural network (neural network for short): formed by artificial neurons connected to each other, a neural network abstracts and simplifies the microstructure and functions of the human brain. It is an important approach to simulating human intelligence, and it reflects several basic features of human brain functions, such as concurrent information processing, learning, association, pattern classification, and memory.

7 Huawei Confidential
Development History of Neural Networks

(Timeline: 1958 perceptron (golden age); 1970 the XOR problem (AI winter); 1986 MLP; 1995 SVM; 2006 deep networks.)

8 Huawei Confidential
Single-Layer Perceptron
 Input vector: X = [x_0, x_1, …, x_n]^T
 Weight vector: W = [ω_0, ω_1, …, ω_n]^T, in which ω_0 is the offset (bias).
 Weighted sum: net = Σ_(i=0)^n ω_i x_i = W^T X
 Activation function: O = sign(net) = 1 if net > 0, and −1 otherwise.
 The preceding perceptron is equivalent to a classifier. It uses the high-dimensional 𝑋 vector as the input and
performs binary classification on input samples in the high-dimensional space. When 𝑾𝑻 𝐗 > 0, O = 1. In this
case, the samples are classified into a type. Otherwise, O = −1. In this case, the samples are classified into the
other type. The boundary of these two types is 𝑾𝑻 𝐗 = 0, which is a high-dimensional hyperplane.

 The decision boundary in each dimension: a classification point on a line, Ax + B = 0; a classification line in a plane, Ax + By + C = 0; a classification plane in 3-D space, Ax + By + Cz + D = 0; and, in general, a classification hyperplane, W^T X + b = 0.
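A minimal sketch of this perceptron classifier; the weights and the sample are illustrative values chosen by hand, not taken from the slides:

```python
import numpy as np

def perceptron_predict(W, X):
    """Single-layer perceptron: O = sign(W^T X), with x0 = 1 as the bias input."""
    net = W @ X                      # net = sum_i w_i * x_i
    return 1 if net > 0 else -1

# Toy example: the hand-picked weights define the boundary -1 + x1 + x2 = 0.
W = np.array([-1.0, 1.0, 1.0])       # w0 is the offset (bias)
X = np.array([1.0, 0.8, 0.9])        # x0 = 1, followed by the feature values
print(perceptron_predict(W, X))      # -> 1: the sample lies on the positive side
```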
9 Huawei Confidential
XOR Problem
 In 1969, Minsky, an American mathematician and AI pioneer, proved that a
perceptron is essentially a linear model that can only deal with linear
classification problems, but cannot process non-linear data.

(Figures: AND and OR are linearly separable; XOR is not.)

10 Huawei Confidential
Feedforward Neural Network

(Figure: a feedforward network with an input layer, two hidden layers, and an output layer.)

11 Huawei Confidential
Solution of XOR

(Figure: a network with one hidden layer and weights w0 to w5 solves XOR by combining two linear decision boundaries; one possible weight assignment is sketched below.)
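One possible weight assignment that solves XOR with a single hidden layer; the specific thresholds below are an illustrative choice, not read from the figure:

```python
import numpy as np

def step(x):
    """Hard threshold activation: 1 if the input is positive, otherwise 0."""
    return (np.asarray(x) > 0).astype(float)

def xor_net(x1, x2):
    """Two hidden units compute OR and AND; the output unit combines them into XOR."""
    h = step(np.array([x1 + x2 - 0.5,        # h1 = OR(x1, x2)
                       x1 + x2 - 1.5]))      # h2 = AND(x1, x2)
    return step(h[0] - h[1] - 0.5)           # output = h1 AND (NOT h2) = XOR(x1, x2)

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, int(xor_net(a, b)))          # prints 0, 1, 1, 0
```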

12 Huawei Confidential
Impacts of Hidden Layers on a Neural Network

(Figures: decision boundaries learned with 0, 3, and 20 hidden layers.)

13 Huawei Confidential
Contents

1. Deep Learning Summary

2. Training Rules

3. Activation Function

4. Normalizer

5. Optimizer

6. Types of Neural Networks

7. Common Problems

14 Huawei Confidential
Gradient Descent and Loss Function
 The gradient of the multivariate function o = f(x) = f(x_0, x_1, …, x_n) at X′ = [x_0′, x_1′, …, x_n′]^T is:

∇f(x_0, x_1, …, x_n) = [∂f/∂x_0, ∂f/∂x_1, …, ∂f/∂x_n]^T |_(X=X′),

The direction of the gradient vector is the fastest growing direction of the function. As a result, the direction
of the negative gradient vector −𝛻𝑓 is the fastest descent direction of the function.
 During the training of the deep learning network, target classification errors must be parameterized. A loss
function (error function) is used, which reflects the error between the target output and actual output of
the perceptron. For a single training sample x, the most common error function is the Quadratic cost
function.
E(w) = (1/2) Σ_(d∈D) (t_d − o_d)²,

In the preceding function, 𝑑 is one neuron in the output layer, D is all the neurons in the output layer, 𝑡𝑑 is
the target output, and 𝑜𝑑 is the actual output.
 The gradient descent method enables the loss function to search along the negative gradient direction and
update the parameters iteratively, finally minimizing the loss function.
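A minimal numerical sketch of gradient descent on the quadratic cost of a single linear neuron o = w·x; the sample value and learning rate are illustrative assumptions:

```python
# Minimize E(w) = 1/2 * (t - w*x)^2 by stepping against the gradient dE/dw = -(t - w*x)*x.
x, t = 2.0, 6.0          # one training sample: input x and target t (t/x = 3 is the optimum)
w, eta = 0.0, 0.1        # initial weight and learning rate
for _ in range(50):
    grad = -(t - w * x) * x
    w -= eta * grad      # w <- w - eta * dE/dw
print(round(w, 4))       # -> approximately 3.0, the minimizer of E
```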
15 Huawei Confidential
Extrema of the Loss Function
 Purpose: The loss function 𝐸(𝑊) is defined on the weight space. The objective is to search for the weight
vector 𝑊 that can minimize 𝐸(𝑊).
 Limitation: There is no effective mathematical method for finding the extremum on the complex, high-dimensional surface of E(W) = (1/2) Σ_(d∈D) (t_d − o_d)².

(Figure: example of gradient descent on a bivariate paraboloid.)

16 Huawei Confidential
Common Loss Functions in Deep Learning
 Quadratic cost function:

E(W) = (1/2) Σ_(d∈D) (t_d − o_d)²

 Cross entropy error function:

E(W) = −(1/n) Σ_x Σ_(d∈D) [t_d ln o_d + (1 − t_d) ln(1 − o_d)]

 The cross entropy error function depicts the distance between two probability
distributions, which is a widely used loss function for classification problems.
 Generally, the mean square error function is used to solve the regression problem, while
the cross entropy error function is used to solve the classification problem.
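A minimal sketch computing both loss functions for a small batch; the target and output values are illustrative:

```python
import numpy as np

def quadratic_cost(t, o):
    """E = 1/2 * sum_d (t_d - o_d)^2, averaged over the samples in the batch."""
    return 0.5 * np.mean(np.sum((t - o) ** 2, axis=-1))

def cross_entropy(t, o, eps=1e-12):
    """E = -1/n * sum_x sum_d [t_d*ln(o_d) + (1 - t_d)*ln(1 - o_d)]."""
    o = np.clip(o, eps, 1 - eps)             # avoid log(0)
    return -np.mean(np.sum(t * np.log(o) + (1 - t) * np.log(1 - o), axis=-1))

t = np.array([[1.0, 0.0], [0.0, 1.0]])       # target outputs
o = np.array([[0.9, 0.2], [0.3, 0.7]])       # actual outputs
print(quadratic_cost(t, o), cross_entropy(t, o))
```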

17 Huawei Confidential
Batch Gradient Descent Algorithm (BGD)
 In the training sample set 𝐷, each sample is recorded as < 𝑋, 𝑡 >, in which 𝑋 is the input vector, 𝑡
the target output, 𝑜 the actual output, and 𝜂 the learning rate.
 Initializes each 𝑤𝑖 to a random value with a small absolute value.
 Before the end condition is met:
 Initializes each ∆𝑤𝑖 to zero.
 For each < 𝑋, 𝑡 > in D:
− Input 𝑋 to this unit and calculate the output 𝑜.

− For each 𝑤𝑖 in this unit: ∆𝑤𝑖 += −𝜂 · (1/n) · Σ_(d∈D) ∂C(t_d, o_d)/∂𝑤𝑖.

 For each 𝑤𝑖 in this unit: 𝑤𝑖 += ∆𝑤𝑖 .

 The gradient descent algorithm of this version is not commonly used because:
 The convergence process is very slow as all training samples need to be calculated every time the weight
is updated.

18 Huawei Confidential
Stochastic Gradient Descent Algorithm (SGD)
 To address the BGD algorithm defect, a common variant called Incremental Gradient Descent
algorithm is used, which is also called the Stochastic Gradient Descent (SGD) algorithm. One
implementation is called Online Learning, which updates the gradient based on each sample:

∆𝑤𝑖 = −𝜂 · (1/n) · Σ_(d∈D) ∂C(t_d, o_d)/∂𝑤𝑖  ⟹  ∆𝑤𝑖 = −𝜂 · Σ_(d∈D) ∂C(t_d, o_d)/∂𝑤𝑖.

 ONLINE-GRADIENT-DESCENT
 Initializes each 𝑤𝑖 to a random value with a small absolute value.
 Before the end condition is met:
 Generates a random <X, t> from D and does the following calculation:
 Input X to this unit and calculate the output o.

 For each 𝑤𝑖 in this unit: 𝑤𝑖 += −𝜂 · Σ_(d∈D) ∂C(t_d, o_d)/∂𝑤𝑖.

19 Huawei Confidential
Mini-Batch Gradient Descent Algorithm (MBGD)
 To address the defects of the previous two gradient descent algorithms, the Mini-batch Gradient
Descent Algorithm (MBGD) was proposed and has been most widely used. A small number of
Batch Size (BS) samples are used at a time to calculate ∆𝑤𝑖 , and then the weight is updated
accordingly.
 Batch-gradient-descent
 Initializes each 𝑤𝑖 to a random value with a small absolute value.
 Before the end condition is met:
 Initializes each ∆𝑤𝑖 to zero.
 For each < 𝑋, 𝑡 > in the BS samples in the next batch in 𝐷:
− Input 𝑋 to this unit and calculate the output 𝑜.

− For each 𝑤𝑖 in this unit: ∆𝑤𝑖 += −𝜂 · (1/n) · Σ_(d∈D) ∂C(t_d, o_d)/∂𝑤𝑖.

 For each 𝑤𝑖 in this unit: 𝑤𝑖 += ∆𝑤𝑖


 After the last batch of an epoch, the training samples are shuffled into a random order for the next epoch (see the sketch below).
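A minimal sketch contrasting the three schemes on a linear model with quadratic cost; the data, learning rate, and batch size are illustrative assumptions (batch_size = 1 gives SGD and batch_size = len(X) gives BGD):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                 # 100 samples, 3 features
w_true = np.array([1.0, -2.0, 0.5])
t = X @ w_true                                # targets of the training set
w, eta, batch_size = np.zeros(3), 0.05, 10

def grad(w, Xb, tb):
    """Gradient of the quadratic cost 1/2 * mean (t - X w)^2 over one batch."""
    return -(Xb.T @ (tb - Xb @ w)) / len(tb)

for epoch in range(100):
    perm = rng.permutation(len(X))            # shuffle the samples each epoch
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]  # the BS samples of the next batch
        w -= eta * grad(w, X[idx], t[idx])    # mini-batch update
print(np.round(w, 3))                          # -> close to w_true
```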
20 Huawei Confidential
Backpropagation Algorithm (1)
 Signals are propagated in the forward direction, and errors are propagated in the backward direction.
 In the training sample set D, each sample is recorded as <X, t>, in which X is the input vector, t the target output, o the actual output, and w the weight coefficient.
 Loss function:

E(w) = (1/2) Σ_(d∈D) (t_d − o_d)²

(Figure: a network with input layer (x1, x2, x3), a hidden layer, and output layer (o1, o2); signals flow in the forward propagation direction, errors in the backpropagation direction.)

21 Huawei Confidential
Backpropagation Algorithm (2)
 According to the following formulas, errors in the input, hidden, and output layers are
accumulated to generate the error in the loss function.
 wc is the weight coefficient between the hidden layer and the output layer, while wb is the weight
coefficient between the input layer and the hidden layer. 𝑓 is the activation function, D is the
output layer set, and C and B are the hidden layer set and input layer set respectively. Assume
that the loss function is a quadratic cost function:
 Output layer error: E = (1/2) Σ_(d∈D) (t_d − o_d)²

 Error expanded to the hidden layer:
E = (1/2) Σ_(d∈D) (t_d − f(net_d))² = (1/2) Σ_(d∈D) (t_d − f(Σ_(c∈C) w_c y_c))²

 Error expanded to the input layer:
E = (1/2) Σ_(d∈D) (t_d − f(Σ_(c∈C) w_c f(net_c)))² = (1/2) Σ_(d∈D) (t_d − f(Σ_(c∈C) w_c f(Σ_(b∈B) w_b x_b)))²
22 Huawei Confidential
Backpropagation Algorithm (3)
 To minimize error E, the gradient descent iterative calculation can be used to
solve wc and wb, that is, calculating wc and wb to minimize error E.
 Formula:
∆w_c = −η ∂E/∂w_c,  c ∈ C
∆w_b = −η ∂E/∂w_b,  b ∈ B
 If there are multiple hidden layers, the chain rule is applied to take derivatives layer by layer, and the parameters are optimized by iteration.

23 Huawei Confidential
Backpropagation Algorithm (4)
 For a neural network with any number of layers, the training formulas can be arranged as follows:

∆w_jk^l = η · δ_k^(l+1) · f_j(z_j^l)

δ_j^l = f_j′(z_j^l) · (t_j − f_j(z_j^l)),            if l ∈ outputs,   (1)
δ_j^l = (Σ_k δ_k^(l+1) · w_jk^l) · f_j′(z_j^l),      otherwise.        (2)

 The BP algorithm trains the network as follows:

 Take the next training sample <X, t>, input X to the network, and obtain the actual output o.
 Calculate the output layer δ according to the output layer error formula (1).
 Calculate the δ of each hidden layer iteratively, from the output layer toward the input layer, according to the hidden layer error propagation formula (2).
 According to the δ of each layer, update the weights of all layers (a sketch follows below).
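A compact sketch of this procedure for one hidden layer with sigmoid activations and quadratic cost; the layer sizes, toy XOR data, and learning rate are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W1, b1 = rng.normal(scale=0.5, size=(2, 3)), np.zeros(3)   # input (2) -> hidden (3)
W2, b2 = rng.normal(scale=0.5, size=(3, 1)), np.zeros(1)   # hidden (3) -> output (1)
eta = 0.5
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
T = np.array([[0.], [1.], [1.], [0.]])                     # XOR targets as a toy task

for _ in range(10000):
    # Forward pass
    h = sigmoid(X @ W1 + b1)                  # hidden activations
    o = sigmoid(h @ W2 + b2)                  # actual outputs
    # Backward pass
    delta_o = (T - o) * o * (1 - o)           # formula (1): f'(z) * (t - f(z)) at the output layer
    delta_h = (delta_o @ W2.T) * h * (1 - h)  # formula (2): errors propagated to the hidden layer
    # Update the weights of each layer with its delta
    W2 += eta * h.T @ delta_o; b2 += eta * delta_o.sum(axis=0)
    W1 += eta * X.T @ delta_h; b1 += eta * delta_h.sum(axis=0)

print(np.round(o.ravel(), 2))                 # should approach [0, 1, 1, 0]
```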

24 Huawei Confidential
Contents

1. Deep Learning Summary

2. Training Rules

3. Activation Function

4. Normalizer

5. Optimizer

6. Types of Neural Networks

7. Common Problems

25 Huawei Confidential
Activation Function
 Activation functions are important for the neural network model to learn and
understand complex non-linear functions. They allow introduction of non-linear
features to the network.
 Without activation functions, output signals are only simple linear functions.
The complexity of linear functions is limited, and the capability of learning
complex function mappings from data is low.

Activation function: output = f(w_1 x_1 + w_2 x_2 + w_3 x_3 + …) = f(W^T · X)

26 Huawei Confidential
Sigmoid

f(x) = 1 / (1 + e^(−x))

27 Huawei Confidential
Tanh

tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x))

28 Huawei Confidential
Softsign

f(x) = x / (|x| + 1)

29 Huawei Confidential
Rectified Linear Unit (ReLU)
y = max(0, x), that is, y = x for x ≥ 0 and y = 0 for x < 0

30 Huawei Confidential
Softplus

f(x) = ln(e^x + 1)
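A minimal sketch of the activation functions listed above, using NumPy only:

```python
import numpy as np

def sigmoid(x):  return 1.0 / (1.0 + np.exp(-x))
def tanh(x):     return np.tanh(x)                  # (e^x - e^-x) / (e^x + e^-x)
def softsign(x): return x / (np.abs(x) + 1.0)
def relu(x):     return np.maximum(0.0, x)
def softplus(x): return np.log1p(np.exp(x))         # ln(e^x + 1)

x = np.array([-2.0, 0.0, 2.0])
for f in (sigmoid, tanh, softsign, relu, softplus):
    print(f.__name__, np.round(f(x), 3))
```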

31 Huawei Confidential
Softmax
 Softmax function:

σ(z)_j = e^(z_j) / Σ_k e^(z_k)

 The Softmax function is used to map a K-dimensional vector of arbitrary real


values to another K-dimensional vector of real values, where each vector
element is in the interval (0, 1). All the elements add up to 1.
 The Softmax function is often used as the output layer of a multiclass
classification task.
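A minimal, numerically stable sketch of the Softmax mapping; subtracting the maximum logit does not change the result because the shift cancels in the ratio:

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)                 # stability trick: shift by the largest element
    e = np.exp(z)
    return e / e.sum()                # sigma(z)_j = e^(z_j) / sum_k e^(z_k)

print(softmax(np.array([2.0, 1.0, 0.1])))   # three values in (0, 1) summing to 1
```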

32 Huawei Confidential
Contents

1. Deep Learning Summary

2. Training Rules

3. Activation Function

4. Normalizer

5. Optimizer

6. Types of Neural Networks

7. Common Problems

33 Huawei Confidential
Normalizer
 Regularization is an important and effective technique for reducing generalization error in machine learning. It is especially useful for deep learning models, which tend to overfit because of their large number of parameters. Researchers have therefore proposed many effective techniques to prevent overfitting, including:
 Adding constraints to parameters, such as 𝐿1 and 𝐿2 norms
 Expanding the training set, such as adding noise and transforming data
 Dropout
 Early stopping

34 Huawei Confidential
Penalty Parameters
 Many regularization methods restrict the learning capability of a model by adding a penalty term Ω(θ) to the objective function J. The regularized objective function J̃ is:

J̃(θ; X, y) = J(θ; X, y) + αΩ(θ),

 where α ∈ [0, ∞) is a hyperparameter that weights the relative contribution of the norm penalty term Ω against the standard objective function J(X; θ). If α is set to 0, no regularization is performed. The penalty grows as α increases.

35 Huawei Confidential
𝐿1 Regularization
 Add the L1 norm constraint to the model parameters, that is,

J̃(w; X, y) = J(w; X, y) + α‖w‖_1,

 If a gradient method is used to solve the problem, the parameter gradient is

∇J̃(w) = α · sign(w) + ∇J(w).

36 Huawei Confidential
𝐿2 Regularization
 Add the L2 norm penalty term to prevent overfitting:

J̃(w; X, y) = J(w; X, y) + (α/2)‖w‖_2²,

 A parameter optimization method can be inferred using an optimization


technology (such as a gradient method):

w ← (1 − εα)w − ε∇J(w),
 where 𝜀 is the learning rate. Compared with a common gradient optimization
formula, this formula multiplies the parameter by a reduction factor.
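A minimal sketch of one update step with the L1 and L2 penalties, following the two formulas above; alpha, epsilon, and the toy gradient are illustrative values:

```python
import numpy as np

def l1_step(w, grad_J, alpha, eps):
    """w <- w - eps * (alpha * sign(w) + grad J(w))."""
    return w - eps * (alpha * np.sign(w) + grad_J)

def l2_step(w, grad_J, alpha, eps):
    """w <- (1 - eps*alpha) * w - eps * grad J(w): the weight is shrunk each step."""
    return (1 - eps * alpha) * w - eps * grad_J

w = np.array([0.5, -0.2, 0.0])
grad_J = np.array([0.1, -0.3, 0.2])          # gradient of the unregularized objective
print(l1_step(w, grad_J, alpha=0.01, eps=0.1))
print(l2_step(w, grad_J, alpha=0.01, eps=0.1))
```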

37 Huawei Confidential
L1 vs. L2
 The major differences between 𝐿2 and 𝐿1 :
 According to the preceding analysis, 𝐿1 can generate a more sparse model than 𝐿2 . When the value of parameter 𝑤 is
small, 𝐿1 regularization can directly reduce the parameter value to 0, which can be used for feature selection.
 From the perspective of probability, many norm constraints are equivalent to adding prior probability distribution to
parameters. In 𝐿2 regularization, the parameter value complies with the Gaussian distribution rule. In 𝐿1 regularization,
the parameter value complies with the Laplace distribution rule.

(Figure: the L1 and L2 norm constraint regions.)
38 Huawei Confidential
Dataset Expansion
 The most effective way to prevent overfitting is to enlarge the training set: the larger the training set, the smaller the probability of overfitting. Dataset expansion is a time-saving alternative to collecting new data, but the method varies from field to field.
 A common method in the object recognition field is to rotate or scale images. (The prerequisite for such transformations is that the category of the image must not be changed by the transformation; in handwritten digit recognition, for example, the digits 6 and 9 can easily turn into each other after rotation.)
 Random noise is added to the input data in speech recognition.
 A common practice of natural language processing (NLP) is replacing words with their synonyms.
 Noise injection can add noise to the input, the hidden layer, or the output layer. For example, for Softmax classification, noise can be added to the labels using the label smoothing technique: the hard labels 0 and 1 are replaced with ε/k and 1 − (k − 1)ε/k respectively.

39 Huawei Confidential
Dropout
 Dropout is a simple and widely used regularization method (common since 2014). During training, Dropout randomly discards some inputs to a layer, and the parameters corresponding to the discarded inputs are not updated in that pass. Dropout can be viewed as an integration (ensemble) method: it obtains sub-networks by randomly dropping inputs and combines the results of all these sub-networks (a sketch follows below).
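A minimal sketch of (inverted) dropout applied to a layer's activations during training; the keep probability is an illustrative choice:

```python
import numpy as np

def dropout(h, keep_prob=0.8, training=True):
    """Randomly zero activations during training; scale so the expected value is unchanged."""
    if not training:
        return h                                   # no dropout at inference time
    mask = (np.random.rand(*h.shape) < keep_prob)  # which units are kept in this pass
    return h * mask / keep_prob                    # dropped units contribute no update

h = np.ones((2, 5))
print(dropout(h))          # roughly 80% of entries become 1/0.8 = 1.25, the rest 0
```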

40 Huawei Confidential
Early Stopping
 A test on the validation set can be inserted during training. When the loss on the validation set starts to increase, training is stopped early.

(Figure: training and validation loss curves; early stopping occurs where the validation loss begins to rise. A sketch of the check follows below.)
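A minimal sketch of the early stopping check on a validation-loss curve; the loss values and the patience setting are illustrative:

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the best epoch, stopping once the validation loss has not improved
    for `patience` consecutive epochs."""
    best, best_epoch, bad = float("inf"), 0, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch, bad = loss, epoch, 0
        else:
            bad += 1
            if bad >= patience:       # validation loss keeps increasing: stop early
                break
    return best_epoch                 # roll back to the weights saved at this epoch

# Illustrative validation-loss curve: it decreases, then starts to increase.
losses = [1.0, 0.8, 0.6, 0.55, 0.54, 0.56, 0.58, 0.60, 0.63]
print(early_stopping_epoch(losses))   # -> 4, the epoch with the minimum loss
```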

41 Huawei Confidential
Contents

1. Deep Learning Summary

2. Training Rules

3. Activation Function

4. Normalizer

5. Optimizer

6. Types of Neural Networks

7. Common Problems

42 Huawei Confidential
Optimizer
 There are various optimized versions of gradient descent algorithms. In object-
oriented language implementation, different gradient descent algorithms are
often encapsulated into objects called optimizers.
 Purposes of the algorithm optimization include but are not limited to:
 Accelerating algorithm convergence.
 Preventing or jumping out of local extreme values.
 Simplifying manual parameter setting, especially the learning rate (LR).
 Common optimizers: common GD optimizer, momentum optimizer, Nesterov,
AdaGrad, AdaDelta, RMSProp, Adam, AdaMax, and Nadam.

43 Huawei Confidential
Momentum Optimizer
 The most basic improvement is to add a momentum term to ∆w_ji. Assume that the weight correction in the n-th iteration is ∆w_ji(n). The weight correction rule is:

∆w_ji^l(n) = −η · δ_i^(l+1) · x_j^l(n) + α · ∆w_ji^l(n − 1)

 where α is a constant (0 ≤ α < 1) called the momentum coefficient, and α · ∆w_ji(n − 1) is the momentum term.

 Imagine a small ball rolling down from a random point on the error surface. Introducing the momentum term is equivalent to giving the ball inertia (see the sketch below).
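A minimal sketch of the momentum update rule above; eta, alpha, and the toy gradient are illustrative values:

```python
import numpy as np

def momentum_step(w, v, grad, eta=0.01, alpha=0.9):
    """v <- -eta * grad + alpha * v (the inertia term); then w <- w + v."""
    v = -eta * grad + alpha * v
    return w + v, v

w, v = np.zeros(3), np.zeros(3)
grad = np.array([0.2, -0.1, 0.4])
w, v = momentum_step(w, v, grad)
print(w, v)
```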


44 Huawei Confidential
Advantages and Disadvantages of Momentum Optimizer
 Advantages:
 Enhances the stability of the gradient correction direction and reduces abrupt changes.
 In regions where the gradient direction is stable, the ball rolls faster and faster (with an upper speed limit, because α < 1), which helps it cross flat regions quickly and accelerates convergence.
 A ball with inertia is more likely to roll past narrow local minima.

 Disadvantages:
 The learning rate 𝜂 and momentum 𝛼 need to be manually set, which often requires more experiments to
determine the appropriate value.

45 Huawei Confidential
AdaGrad Optimizer (1)
 The common feature of the random gradient descent algorithm (SGD), small-batch gradient descent
algorithm (MBGD), and momentum optimizer is that each parameter is updated with the same LR.
 According to the approach of AdaGrad, different learning rates need to be set for different parameters.
g_t = ∂C(t, o)/∂w_t                         (gradient calculation)
r_t = r_(t−1) + g_t²                         (square gradient accumulation)
∆w_t = −η / (ε + √r_t) · g_t                 (compute the update)
w_(t+1) = w_t + ∆w_t                         (apply the update)

 g_t is the gradient in the t-th step, and r is a gradient accumulation variable whose initial value is 0 and which keeps increasing. η is the global learning rate, which needs to be set manually. ε is a small constant, about 10⁻⁷, added for numerical stability.

46 Huawei Confidential
AdaGrad Optimizer (2)
 The AdaGrad optimization algorithm shows that the 𝑟 continues increasing while the
overall learning rate keeps decreasing as the algorithm iterates. This is because we hope
LR to decrease as the number of updates increases. In the initial learning phase, we are
far away from the optimal solution to the loss function. As the number of updates
increases, we are closer to the optimal solution, and therefore LR can decrease.
 Pros:
 The learning rate is automatically updated. As the number of updates increases, the learning
rate decreases.

 Cons:
 The denominator keeps accumulating so that the learning rate will eventually become very
small, and the algorithm will become ineffective.

47 Huawei Confidential
RMSProp Optimizer
 The RMSProp optimizer is an improved AdaGrad optimizer. It introduces an attenuation coefficient to ensure
a certain attenuation ratio for 𝑟 in each round.
 The RMSProp optimizer solves the problem that the AdaGrad optimizer ends the optimization process too
early. It is suitable for non-stable target handling and has good effects on the RNN.
g_t = ∂C(t, o)/∂w_t                          (gradient calculation)
r_t = β · r_(t−1) + (1 − β) · g_t²           (decayed square gradient accumulation)
∆w_t = −η / (ε + √r_t) · g_t                 (compute the update)
w_(t+1) = w_t + ∆w_t                         (apply the update)

 g_t is the gradient in the t-th step, and r is the gradient accumulation variable with initial value 0; because of the attenuation factor β, r does not necessarily keep increasing. η is the global learning rate, which needs to be set manually, and ε is a small constant, about 10⁻⁷, added for numerical stability. A sketch comparing the AdaGrad and RMSProp accumulation rules follows below.
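A minimal sketch of the two accumulation rules side by side, following the formulas above; eta, beta, and epsilon use illustrative defaults:

```python
import numpy as np

def adagrad_step(w, r, grad, eta=0.01, eps=1e-7):
    r = r + grad ** 2                          # r keeps growing over the iterations
    return w - eta / (eps + np.sqrt(r)) * grad, r

def rmsprop_step(w, r, grad, eta=0.01, beta=0.9, eps=1e-7):
    r = beta * r + (1 - beta) * grad ** 2      # decayed accumulation: r can also shrink
    return w - eta / (eps + np.sqrt(r)) * grad, r

w, grad = np.zeros(2), np.array([0.3, -0.5])
w_ada, r_ada = adagrad_step(w, np.zeros(2), grad)
w_rms, r_rms = rmsprop_step(w, np.zeros(2), grad)
print(w_ada, w_rms)
```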

48 Huawei Confidential
Adam Optimizer (1)
 Adaptive Moment Estimation (Adam): Developed based on AdaGrad and
AdaDelta, Adam maintains two additional variables 𝑚𝑡 and 𝑣𝑡 for each variable
to be trained:
m_t = β_1 · m_(t−1) + (1 − β_1) · g_t
v_t = β_2 · v_(t−1) + (1 − β_2) · g_t²

 Where 𝑡 represents the 𝑡-th iteration and 𝑔𝑡 is the calculated gradient. 𝑚𝑡 and 𝑣𝑡
are moving averages of the gradient and square gradient. From the statistical
perspective, 𝑚𝑡 and 𝑣𝑡 are estimates of the first moment (the average value)
and the second moment (the uncentered variance) of the gradients respectively,
which also explains why the method is so named.

49 Huawei Confidential
Adam Optimizer (2)
 If m_t and v_t are initialized as zero vectors, they are biased toward 0 during the initial iterations, especially when β_1 and β_2 are close to 1. To solve this problem, the bias-corrected estimates m̂_t and v̂_t are used:

m̂_t = m_t / (1 − β_1^t)
v̂_t = v_t / (1 − β_2^t)

 The weight update rule of Adam is as follows:

w_(t+1) = w_t − η · m̂_t / (√v̂_t + ϵ)

 Although the rule involves manually setting η, β_1, and β_2, the setting is much simpler. According to experiments, the default settings are β_1 = 0.9, β_2 = 0.999, ϵ = 10⁻⁸, and η = 0.001. In practice, Adam converges quickly. When convergence saturates, the learning rate η can be reduced; after several such reductions, a satisfying local extremum is usually obtained. The other parameters do not need to be adjusted. A sketch of the update follows below.
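A minimal sketch of the Adam update with bias correction, using the default settings quoted above:

```python
import numpy as np

def adam_step(w, m, v, grad, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad            # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grad ** 2       # second-moment (uncentered variance) estimate
    m_hat = m / (1 - beta1 ** t)                  # bias correction; t starts at 1
    v_hat = v / (1 - beta2 ** t)
    w = w - eta * m_hat / (np.sqrt(v_hat) + eps)  # parameter update
    return w, m, v

w, m, v = np.zeros(2), np.zeros(2), np.zeros(2)
for t in range(1, 4):                             # a few steps with a fixed illustrative gradient
    w, m, v = adam_step(w, m, v, np.array([0.1, -0.2]), t)
print(w)
```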
50 Huawei Confidential
Optimizer Performance Comparison

(Figures: comparison of the optimization algorithms on loss-function contour maps and at a saddle point.)

51 Huawei Confidential
Contents

1. Deep Learning Summary

2. Training Rules

3. Activation Function

4. Normalizer

5. Optimizer

6. Types of Neural Networks

7. Common Problems

52 Huawei Confidential
Convolutional Neural Network
 A convolutional neural network (CNN) is a feedforward neural network. Its artificial
neurons may respond to surrounding units within the coverage range. CNN excels at
image processing. It includes a convolutional layer, a pooling layer, and a fully
connected layer.
 In the 1960s, Hubel and Wiesel studied neurons in the cat visual cortex that are responsible for local sensitivity and direction selection, and found that their unique network structure could simplify feedback neural networks. This work inspired the CNN.
 Now, CNN has become one of the research hotspots in many scientific fields, especially
in the pattern classification field. The network is widely used because it can avoid
complex pre-processing of images and directly input original images.

53 Huawei Confidential
Main Concepts of CNN
 Local receptive field: It is generally considered that human perception of the outside
world is from local to global. Spatial correlations among local pixels of an image are
closer than those among distant pixels. Therefore, each neuron does not need to
know the global image. It only needs to know the local image. The local information is
combined at a higher level to generate global information.
 Parameter sharing: One or more filters/kernels may be used to scan input images.
Parameters carried by the filters are weights. In a layer scanned by filters, each filter
uses the same parameters during weighted computation. Weight sharing means that
when each filter scans an entire image, parameters of the filter are fixed.

54 Huawei Confidential
Architecture of Convolutional Neural Network
(Figure: a CNN pipeline. The input image passes through a convolutional layer (convolution + nonlinearity) producing three feature images, a pooling layer (max pooling), another convolutional layer producing five feature images, another pooling layer, vectorization, and a fully connected layer, ending in a multi-category output such as P(bird), P(sunset), P(dog), P(cat).)
55 Huawei Confidential
Single-Filter Calculation (1)
 Description of convolution calculation
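In place of the missing figure, a minimal sketch of a single-filter 2D convolution (valid padding, stride 1); the input and kernel values are illustrative:

```python
import numpy as np

def conv2d_single(image, kernel):
    """Slide the kernel over the image and take the weighted sum at each position."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.array([[1.0, 0.0], [0.0, -1.0]])
print(conv2d_single(image, kernel))          # a 4x4 feature map
```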

56 Huawei Confidential
Single-Filter Calculation (2)
 Demonstration of the convolution calculation

Han Bingtao, 2017, Convolutional Neural Network

57 Huawei Confidential
Convolutional Layer
 The basic architecture of a CNN is multi-channel convolution consisting of multiple single convolutions. The
output of the previous layer (or the original image of the first layer) is used as the input of the current layer.
It is then convolved with the filter in the layer and serves as the output of this layer. The convolution kernel
of each layer is the weight to be learned. As in a fully connected network, after the convolution is complete, a bias is added and the result is passed through an activation function before being input to the next layer.

(Figure: in a convolutional layer, the input tensor is convolved with kernels W1…Wn, biases b1…bn are added, and an activation function produces the output feature maps F1…Fn of the output tensor.)
58 Huawei Confidential
Pooling Layer
 Pooling combines nearby units to reduce the size of the input on the next layer, reducing dimensions.
Common pooling includes max pooling and average pooling. When max pooling is used, the maximum value
in a small square area is selected as the representative of this area, while the mean value is selected as the
representative when average pooling is used. The side of this small area is the pool window size. The
following figure shows the max pooling operation whose pooling window size is 2.

(Figure: max pooling with a 2×2 window sliding over the input; a sketch follows below.)
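A minimal sketch of max pooling with a 2×2 window and stride 2; the input values are illustrative:

```python
import numpy as np

def max_pool2d(x, size=2):
    """Keep the maximum of each non-overlapping size-by-size block."""
    H, W = x.shape
    x = x[:H - H % size, :W - W % size]                 # crop so the shape divides evenly
    blocks = x.reshape(H // size, size, W // size, size)
    return blocks.max(axis=(1, 3))

x = np.array([[1, 3, 2, 1],
              [4, 6, 5, 0],
              [7, 2, 8, 3],
              [1, 9, 4, 4]], dtype=float)
print(max_pool2d(x))            # -> [[6, 5], [9, 8]]
```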

59 Huawei Confidential
Fully Connected Layer
 The fully connected layer is essentially a classifier. The features extracted on the
convolutional layer and pooling layer are straightened and placed at the fully
connected layer to output and classify results.
 Generally, the Softmax function is used as the activation function of the final
fully connected output layer to combine all local features into global features
and calculate the score of each type.

σ(z)_j = e^(z_j) / Σ_k e^(z_k)

60 Huawei Confidential
Recurrent Neural Network
 The recurrent neural network (RNN) is a neural network that captures dynamic
information in sequential data through periodical connections of hidden layer nodes. It
can classify sequential data.
 Unlike other feedforward neural networks, the RNN can keep a context state and even store, learn, and express related information in context windows of any length. Unlike traditional neural networks, it is not limited to spatial boundaries but also operates along time sequences; in other words, there is an edge between the hidden layer of the current moment and the hidden layer of the next moment.
 The RNN is widely used in scenarios related to sequences, such as videos consisting of
image frames, audio consisting of clips, and sentences consisting of words.

61 Huawei Confidential
Recurrent Neural Network Architecture (1)
 X_t is the input of the input sequence at time t.
 S_t is the memory unit of the sequence at time t, which caches previous information:

S_t = tanh(U X_t + W S_(t−1)).

 O_t is the output of the hidden layer of the sequence at time t:

O_t = tanh(V S_t)

 After O_t passes through multiple hidden layers, the final output of the sequence at time t is obtained (a sketch of the recurrence follows below).
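A minimal sketch of this recurrence over a short random sequence; the dimensions and the matrices U, W, V are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
U = rng.normal(scale=0.5, size=(4, 3))    # input (3)  -> state (4)
W = rng.normal(scale=0.5, size=(4, 4))    # state (4)  -> state (4)
V = rng.normal(scale=0.5, size=(2, 4))    # state (4)  -> output (2)

S = np.zeros(4)                           # initial memory state S_0
for X_t in rng.normal(size=(5, 3)):       # a sequence of 5 input vectors
    S = np.tanh(U @ X_t + W @ S)          # S_t = tanh(U X_t + W S_(t-1))
    O = np.tanh(V @ S)                    # O_t = tanh(V S_t)
    print(np.round(O, 3))
```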

62 Huawei Confidential
Recurrent Neural Network Architecture (2)

LeCun, Bengio, and G. Hinton, 2015, A Recurrent Neural Network and the
Unfolding in Time of the Computation Involved in Its Forward Computation

63 Huawei Confidential
Types of Recurrent Neural Networks

Andrej Karpathy, 2015, The Unreasonable Effectiveness of Recurrent Neural Networks

64 Huawei Confidential
Backpropagation Through Time (BPTT)
 BPTT:
 BPTT extends traditional backpropagation along the time sequence.
 At any time step, the error of the memory unit has two sources: the error of the hidden layer output at time t, and the error propagated back from the memory unit at the next time step t + 1.
 The longer the time sequence, the more likely it is that the loss at the last time step, differentiated with respect to the weight w used at the first time step, causes a vanishing gradient or exploding gradient problem.
 The total gradient of the weight w is the accumulation of its gradients over all time steps.

 Steps of BPTT:
 Compute the output value of each neuron through forward propagation.
 Compute the error value δ_j of each neuron through backpropagation.
 Compute the gradient of each weight.
 Update the weights using the SGD algorithm.

65 Huawei Confidential
Recurrent Neural Network Problem
 S_t = σ(U X_t + W S_(t−1)), expanded along the time sequence:

S_t = σ(U X_t + W · σ(U X_(t−1) + W · σ(U X_(t−2) + W · …)))

 Although the standard RNN structure solves the problem of information memory, the information attenuates during long-term memory.
 Many tasks need information to be kept for a long time. For example, a hint given at the beginning of a piece of speculative fiction may not be answered until the end.
 The RNN may not be able to keep information for long because of the limited capacity of its memory unit.
 We expect that memory units can remember key information.
66 Huawei Confidential
Long Short-term Memory Network

Colah, 2015, Understanding LSTM Networks


67 Huawei Confidential
Gated Recurrent Unit (GRU)

68 Huawei Confidential
Generative Adversarial Network (GAN)
 A Generative Adversarial Network (GAN) is a framework that trains a generator G and a discriminator D through an adversarial process: the discriminator learns to tell whether a sample comes from the generator (fake) or from the real data. GAN is trained with the standard backpropagation algorithm.
 (1) Generator G: The input is noise z, which follows a manually chosen prior probability distribution, such as a uniform or Gaussian distribution. The generator adopts the network structure of a multilayer perceptron (MLP); its parameters represent a differentiable mapping G(z) from the input noise space to the sample space.
 (2) Discriminator D: The input is the real sample x and the fake sample G(z), which are labeled real and fake respectively. The discriminator can also be an MLP with its own parameters. The output is the probability, D(x) or D(G(z)), that the input sample is real.
 GAN can be applied to scenarios such as image generation, text generation, speech enhancement, and image super-resolution.

69 Huawei Confidential
GAN Architecture
 Generator/Discriminator

70 Huawei Confidential
Generative Model and Discriminative Model
 Generative network:
 Generates sample data.
 Input: Gaussian white noise vector z.
 Output: sample data vector x = G(z; θ_G).

 Discriminator network:
 Determines whether sample data is real.
 Input: real sample data x_real and generated sample data x = G(z).
 Output: the probability y = D(x; θ_D) that the sample is real.

(Figure: z → G → x; x and x_real are fed to D, which outputs y.)
71 Huawei Confidential
Training Rules of GAN
 Optimization objective:
 Value function

min_G max_D V(D, G) = E_(x∼p_data(x))[log D(x)] + E_(z∼p_z(z))[log(1 − D(G(z)))]

 In the early training stage, when the outcome of G is still poor, D rejects the generated samples with high confidence because they are obviously different from the training data. In this case, log(1 − D(G(z))) saturates (its gradient is close to 0, so iteration cannot proceed). Therefore, G is instead trained by minimizing −log(D(G(z))), as sketched below.
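A minimal sketch of the two losses implied by the value function, including the non-saturating generator loss described above; the discriminator outputs are illustrative numbers:

```python
import numpy as np

def discriminator_loss(d_real, d_fake, eps=1e-12):
    """Maximize log D(x) + log(1 - D(G(z))), i.e. minimize its negative."""
    return -np.mean(np.log(d_real + eps) + np.log(1 - d_fake + eps))

def generator_loss(d_fake, eps=1e-12):
    """Non-saturating form: minimize -log D(G(z)) instead of log(1 - D(G(z)))."""
    return -np.mean(np.log(d_fake + eps))

d_real = np.array([0.9, 0.8])    # D's outputs on real samples
d_fake = np.array([0.1, 0.3])    # D's outputs on generated samples
print(discriminator_loss(d_real, d_fake), generator_loss(d_fake))
```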

72 Huawei Confidential
Contents

1. Deep Learning Summary

2. Training Rules

3. Activation Function

4. Normalizer

5. Optimizer

6. Types of Neural Networks

7. Common Problems

73 Huawei Confidential
Data Imbalance (1)
 Problem description: In the dataset consisting of various task categories, the number of
samples varies greatly from one category to another. One or more categories in the
predicted categories contain very few samples.
 For example, in an image recognition experiment, more than 2,000 categories among a
total of 4251 training images contain just one image each. Some of the others have 2-5
images.
 Impacts:
 Because the numbers of samples are unbalanced, the model or algorithm never examines the categories with very few samples adequately, so the result is not optimal.
 Because the few available samples may not be representative of their class, it may be impossible to obtain adequate samples for validation and testing.

74 Huawei Confidential
Data Imbalance (2)

 Random undersampling: delete redundant samples from the majority category.
 Random oversampling: copy (resample) samples from the minority category.
 Synthetic Minority Oversampling Technique (SMOTE): sample minority instances and merge (interpolate) them to synthesize new samples (a sketch of simple oversampling follows below).
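A minimal sketch of random oversampling with NumPy; the class labels and counts are illustrative (SMOTE would interpolate between neighboring minority samples instead of copying them):

```python
import numpy as np

def random_oversample(X, y, minority_label, seed=0):
    """Copy minority samples (with replacement) until both classes have the same count."""
    rng = np.random.default_rng(seed)
    minority = np.where(y == minority_label)[0]
    majority = np.where(y != minority_label)[0]
    extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
    idx = np.concatenate([majority, minority, extra])
    return X[idx], y[idx]

X = np.arange(10).reshape(10, 1)
y = np.array([0] * 8 + [1] * 2)            # imbalanced: 8 samples vs 2 samples
X_bal, y_bal = random_oversample(X, y, minority_label=1)
print(np.bincount(y_bal))                  # -> [8 8]
```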

75 Huawei Confidential
Vanishing Gradient and Exploding Gradient Problem (1)
 Vanishing gradient: As network layers increase, the derivative value of
backpropagation decreases, which causes a vanishing gradient problem.
 Exploding gradient: As network layers increase, the derivative value of
backpropagation increases, which causes an exploding gradient problem.
 Cause: consider a deep network in which each layer has a single neuron, y_i = σ(z_i) = σ(w_i x_i + b_i), where σ is the sigmoid function.

(Figure: a chain of such neurons with weights w2, w3, w4, biases b1, b2, b3, and cost C.)

 Backpropagation can be deduced as follows:

∂C/∂b_1 = ∂C/∂y_4 · ∂y_4/∂z_4 · ∂z_4/∂x_4 · ∂x_4/∂z_3 · ∂z_3/∂x_3 · ∂x_3/∂z_2 · ∂z_2/∂x_2 · ∂x_2/∂z_1 · ∂z_1/∂b_1
        = ∂C/∂y_4 · σ′(z_4) w_4 · σ′(z_3) w_3 · σ′(z_2) w_2 · σ′(z_1)

76 Huawei Confidential
Vanishing Gradient and Exploding Gradient Problem (2)

 The maximum value of σ′(x) is 1/4.
 The network weight w is usually smaller than 1, so |σ′(z) w| ≤ 1/4. According to the chain rule, as the number of layers increases, the derivative ∂C/∂b_1 shrinks toward 0, resulting in the vanishing gradient problem.
 When the network weights are large, so that |σ′(z) w| > 1, the exploding gradient problem occurs.
 Solution: for example, gradient clipping can be used to alleviate the exploding gradient problem (a sketch follows below), while the ReLU activation function and LSTM can be used to alleviate the vanishing gradient problem.
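A minimal sketch of gradient clipping by global norm, one common way to implement the clipping mentioned above; the threshold is an illustrative choice:

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=5.0):
    """Rescale all gradients together if their combined L2 norm exceeds max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads

grads = [np.array([3.0, 4.0]), np.array([12.0])]   # global norm = 13
print(clip_by_global_norm(grads))                  # rescaled so the norm becomes 5
```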

77 Huawei Confidential
Overfitting
 Problem description: The model performs well in the training set, but badly in
the test set.
 Root cause: there are too many feature dimensions, model assumptions, and parameters, and too much noise, but too little training data. As a result, the fitted model predicts the training set perfectly while predicting new data in the test set poorly: the model over-fits the training data without considering its generalization capability.
 Solution: For example, data augmentation, regularization, early stopping, and
dropout

78 Huawei Confidential
Summary

 This chapter describes the definition and development of neural networks, perceptrons and their training rules, common types of neural networks (CNN, RNN, and GAN), and common problems of neural networks in AI engineering together with their solutions.

79 Huawei Confidential
