Brief Introduction To Artificial Neural Networks
Xihe Ge is a mechanical engineering postgraduate in the "Future industry and smart systems" discipline at ENS
Paris-Saclay, France. Prof. Laurent Oudre is a full professor at ENS Paris-Saclay in the Centre Borelli (UMR
9010) laboratory; his research activities focus on signal processing, pattern recognition and machine learning for
time series.
1 Introduction
In recent years, with the rapid development of data processing techniques and the influx of venture capital, artificial
intelligence (AI) has proven its effectiveness in automating tasks and has begun to profoundly impact all aspects
of our society, including academia, industry and public life.
In 2011, IBM's Watson, a prominent question-answering computer system, beat the two most successful human
contestants of the popular American quiz show Jeopardy!, prompting public debate about "the potential thinking
ability of machines". In 2016, after the world Go champion Lee Sedol was defeated (1-4) by Google's Go program
AlphaGo, the terms "artificial intelligence (AI)", "machine learning (ML)" and "artificial neural networks (ANN)"
drew the attention of the media and the public once again. One year later, the next-generation program,
AlphaGo Master, won 3-0 against Ke Jie, the top-ranked human player in the world, opening
a new era of competitive games dominated by AI.
This article will first introduce the definitions, applications and widely used methods of AI to give an overall and
intuitive understanding. Furthermore, it will show how neurons in the human brain inspired the origin of
artificial neural networks. Then, it will provide a general introduction and summary of the related key
techniques, including the framework, model training and optimization.
• Health
Nowadays, smartwatches can monitor our health data such as electrocardiography, blood oxygen level and
sleep status. AI can then comprehensively analyze our health state and provide personalized health
advice. The watches can also monitor walking stability to predict whether older people are at risk of falling
and help prevent accidents.
In cancer diagnosis, AI can learn from vast numbers of computed tomography images from around the world and
has now achieved remarkable predictive accuracy in cancer detection, comparable to that of
experienced experts in this field. Thus an excellent AI model, once validated, can be applied to patients
worldwide who lack access to an experienced doctor nearby.
• Language
AI is also making a big difference in the field of languages. Thanks to today's translation software, all texts
can be read by everyone, regardless of the original language, which makes it easier for people to communicate
with each other in the business world when collaborating or in the personal world when travelling.
Voice assistants are now commonplace: they can answer all kinds of questions and control various
smart products on a simple voice command, making our daily lives far more convenient.
• Finance
In the past, banks' lending departments had to check borrowers' qualifications constantly,
which was time-consuming and costly. AI can help financial institutions analyze daily activity records
to establish comprehensive credit scores and behavior models, so that people in genuine need can obtain financial
support, while automatically conducting error audits, improving anti-fraud capabilities, reducing the risk of
bad debts and shortening lending times.
Besides the aforementioned applications, other fields, such as the Internet industry, public safety, customer ser-
vice, education, culture, tourism, gaming, logistics, new energy, pharmacy, manufacturing and construction, are also
undergoing dramatic changes with the help of digital transformation and AI.
Figure 1: "NN" ⊆ "ML" ⊆ "AI"
Figure 2: Three types of machine learning algorithms: unsupervised learning (unlabeled data), supervised learning (labeled data) and reinforcement learning (action/mistake)
Machine learning is the best-known AI method; it allows a system or piece of software to learn knowledge from acquired
data. As shown in Figure 1, neural networks (NN) are included in machine learning (ML), which is itself included in
artificial intelligence (AI). Machine learning is a set of methods to achieve AI, drawing on probability theory,
statistics, approximation theory, convex analysis and many other disciplines.
The "data-driven" approach is the core idea of machine learning, allowing algorithms to make predictions and
decisions based on data analysis and interpretation. At present, machine learning can be broadly divided into
three categories (Figure 2):
• Supervised learning
Labeled data designates data elements that have been tagged with one or several labels identifying
specific properties, characteristics, classifications or contained objects. Supervised learning establishes a
mapping model from inputs to outputs based on labeled example data sets, then predicts the output for a new
input according to the input-output relationships seen before (see the sketch after this list).
• Unsupervised learning
Unlabeled data refers to data elements that have not been tagged with labels identifying characteristics,
properties or classifications. Unsupervised learning can automatically discover features and structure in
unlabeled data, group the data into clusters, identify association rules and reduce dimensions,
realizing tasks such as segmentation, pattern detection and anomaly detection.
• Reinforcement learning
Unlike the two methods above, labeled data is no longer mandatory in reinforcement learning. The model
takes actions, receives quantified rewards from the environment, and updates its parameters accordingly.
In other words, reinforcement learning is a "trial-and-error" approach that constantly
interacts with the environment to obtain the best strategy by maximizing the reward. "State, action, reward"
are its three key elements: the model observes the outcome of each decision, which guides the next decision
toward the final goal. Games and robotics are currently the most common application areas of this
method.
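To make the contrast between the first two paradigms concrete, here is a minimal sketch, assuming scikit-learn and NumPy are available; the toy data and model choices are illustrative only:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

# Toy dataset: two Gaussian blobs of 50 points each.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2)) + np.repeat([[0, 0], [4, 4]], 50, axis=0)
y = np.repeat([0, 1], 50)  # labels: only the supervised model sees these

# Supervised learning: learn the mapping from inputs X to known labels y,
# then predict the label of a previously unseen input.
clf = LogisticRegression().fit(X, y)
print(clf.predict([[4.1, 3.9]]))  # -> [1]

# Unsupervised learning: no labels; the algorithm groups the data itself.
clusters = KMeans(n_clusters=2, n_init=10).fit_predict(X)
print(clusters[:5], clusters[-5:])  # two discovered clusters
```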
The artificial neural network (ANN) is currently one of the most representative supervised learning algorithms.
This article will now focus on it, discussing its origin and how it works.
[Figure: the perceptron model; a surrounding stimulus enters as inputs $x_n$, weighted by the synapse weights $w_n$, summed with a bias $b$ into the potential $z$, then passed through an activation threshold to produce the output $a$]
$$S'(x) = \frac{e^{-x}}{(1+e^{-x})^2} = S(x)\,(1 - S(x)) \qquad (3)$$
It can be observed that when the input of the sigmoid function is very large or very small, the output enters a "flat"
area and the gradient "vanishes", whereas when the input is close to 0, the sigmoid derivative is at its
largest.
In addition, the sigmoid function outputs a half-saturated state (output = 0.5) as soon as the activation threshold
is reached (input = 0), which does not correspond to real biological neural networks. Typically, only 1%-4%
of the neurons in the brain are active (output > 0) simultaneously, and they are not easily saturated.
Therefore, the rectified linear unit function (ReLU) (Figure 5, panel 4) is more in line with physiological models and
better suited to avoiding exploding and vanishing gradients. Besides, it also simplifies the calculation and accelerates
the convergence, because its derivative always equals 1 if the input is positive:

$$\mathrm{ReLU}(x) = \max(0, x) \qquad (4)$$

Figure 5: Step function, Sigmoid function, its derivative and ReLU function
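As a small numerical check of these properties, the sketch below (assuming NumPy) evaluates the sigmoid derivative of equation (3) far from and near zero, and the ReLU of equation (4):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # equation (3): S'(x) = S(x)(1 - S(x))

def relu(x):
    return np.maximum(0.0, x)  # equation (4): max(0, x)

# The sigmoid gradient "vanishes" in the flat areas (large |x|) and is
# largest near 0; the ReLU derivative stays at 1 for any positive input.
for x in (-10.0, 0.0, 10.0):
    print(x, sigmoid_derivative(x))  # ~4.5e-5, 0.25, ~4.5e-5
print(relu(np.array([-2.0, 3.0])))   # [0. 3.]
```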
The above non-linear activation functions introduce non-linear characteristics into the ANN model. Under
certain conditions, and given enough neurons and hidden layers, they allow the model to approximate any continuous
non-linear mapping between input and output to any accuracy [4].
[Figure 6: a fully-connected network with an input layer ($x_1$, $x_2$), two hidden layers ($a_1$-$a_4$) and an output layer ($y_1$, $y_2$), connected by weights $w_1$-$w_{12}$ and biases $b_1$-$b_6$. Legend: input, sum, activation function, bias weight, predictive value, error (MSE), true value]
Once we understand how individual perceptrons receive, process and transmit signals, we can see how the values
are calculated layer by layer from the input to the output of a fully-connected model, realizing forward
propagation:
$$\begin{aligned}
a_1 &= \sigma(z_1) = \sigma(x_1 w_1 + x_2 w_3 + b_1)\\
a_2 &= \sigma(z_2) = \sigma(x_1 w_2 + x_2 w_4 + b_2)\\
a_3 &= \sigma(z_3) = \sigma(a_1 w_5 + a_2 w_7 + b_3)\\
&\;\;\vdots\\
a_n &= \sigma(z_n) = \sigma(a \ast W + b_n)
\end{aligned} \qquad (5)$$

where $a_n$ is the output of the current neuron, $\sigma$ is the activation function, $a$ and $W$ are the matrix forms of the outputs and their corresponding weights from the previous layer, and $b_n$ is the bias.
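Equation (5) translates directly into a few lines of NumPy. The following sketch uses hypothetical weight and bias values for the small network of Figure 6, arranged so that x @ W1 reproduces $z_1 = x_1 w_1 + x_2 w_3 + b_1$, and so on:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameters: column j of each matrix holds the weights
# feeding neuron j of the next layer.
W1 = np.array([[0.1, 0.2],    # w1, w2
               [0.3, 0.4]])   # w3, w4
b1 = np.array([0.5, 0.5])     # b1, b2
W2 = np.array([[0.1, 0.2],    # w5, w6
               [0.3, 0.4]])   # w7, w8
b2 = np.array([0.5, 0.5])     # b3, b4
W3 = np.array([[0.1, 0.2],    # w9,  w10
               [0.3, 0.4]])   # w11, w12
b3 = np.array([0.5, 0.5])     # b5, b6

x = np.array([1.0, 2.0])        # inputs x1, x2
a12 = sigmoid(x @ W1 + b1)      # a1, a2
a34 = sigmoid(a12 @ W2 + b2)    # a3, a4
y_hat = sigmoid(a34 @ W3 + b3)  # predictions y1, y2
print(y_hat)
```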
The gradient indicates the steepest rising direction of the function at the given point, so the local minimum
can be reached by applying this method iteratively (Figure 7). In other words, the fastest way to descend is to find
the steepest direction at the current position and go down in that direction, then find the steepest direction at the
next position and advance in that new direction, until the local lowest point is finally reached. Therefore, once the
gradient is obtained, the loss value can be decreased by iteratively updating the ANN's internal parameters $\theta$
toward the local minimum of the loss function [5]:

$$\theta \leftarrow \theta - \eta \nabla L(\theta)$$

where $\eta$ is the learning rate controlling the step size.
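A one-dimensional sketch of this procedure (toy loss function, hypothetical learning rate) shows how iterating the update walks the parameter down to a local minimum:

```python
# Minimal 1-D gradient descent: repeatedly step against the gradient,
# i.e. away from the steepest rising direction, toward a local minimum.
def loss(w):
    return (w - 3.0) ** 2  # toy loss with its minimum at w = 3

def gradient(w):
    return 2.0 * (w - 3.0)  # dL/dw

w = 0.0     # initial parameter value
eta = 0.1   # learning rate (step size)
for _ in range(50):
    w = w - eta * gradient(w)  # the update w <- w - eta * dL/dw
print(w, loss(w))  # w is now very close to 3
```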
Figure 8: Backpropagation (the loss is propagated backward through the network of Figure 6, from the outputs $y_1$, $y_2$ toward the weights $w_1$-$w_{12}$ and biases $b_1$-$b_6$)
In order to obtain the parameters minimizing the loss value, it is necessary to calculate the gradient at each
iteration, which is complicated due to the massive number of internal parameters ($w_j$ and $b_k$) of the ANN model.
However, these parameters are often related to each other across successive layers, which is why the backpropagation
method was invented by Rumelhart et al. in 1986 [6]: it calculates the gradient layer by layer, backward from the
last one. Backpropagation divides the error amount among the connections; it is a milestone in the algorithmic
development of ANNs, reducing repeated calculations and significantly improving the efficiency of the gradient
computation. This method is essential to understand but slightly difficult; if you are not interested in the detailed
algorithm for optimizing the weights to minimize the loss value, you can skip ahead to Section 3.8.
The BP method is mainly based on the derivative chain rule. For example, for differentiable functions $h$, $g$, $k$ and variables $z$, $y$, $x$, $s$:

• case 1: if $z = h(y)$ and $y = g(x)$, then $\frac{dz}{dx} = \frac{dz}{dy}\frac{dy}{dx}$

• case 2: if $z = k(x, y)$, $x = g(s)$ and $y = h(s)$, then $\frac{dz}{ds} = \frac{\partial z}{\partial x}\frac{dx}{ds} + \frac{\partial z}{\partial y}\frac{dy}{ds}$
• $\frac{\partial L}{\partial w_9}$ calculation (see Figure 8; similar process for $\frac{\partial L}{\partial w_{10}}$, $\frac{\partial L}{\partial w_{11}}$, $\frac{\partial L}{\partial w_{12}}$, $\frac{\partial L}{\partial b_5}$, $\frac{\partial L}{\partial b_6}$)

$$\frac{\partial L}{\partial w_9} = \frac{\partial L}{\partial \hat{y}_1} \frac{\partial \hat{y}_1}{\partial z_5} \frac{\partial z_5}{\partial w_9} \qquad (9)$$

1. $\frac{\partial L}{\partial \hat{y}_1} = \hat{y}_1 - y_1$, because $L = \frac{1}{2n}\sum_{i=1}^{n}(\hat{y}_i - y_i)^2$

2. $\frac{\partial \hat{y}_1}{\partial z_5} = \hat{y}_1(1 - \hat{y}_1)$, because $\hat{y}_1 = S(z_5) = \frac{1}{1+e^{-z_5}}$ and $S'(z_5) = \frac{e^{-z_5}}{(1+e^{-z_5})^2} = S(z_5)(1 - S(z_5))$

3. $\frac{\partial z_5}{\partial w_9} = a_3$, because $z_5 = a_3 w_9 + a_4 w_{11} + b_5$
• $\frac{\partial L}{\partial w_5}$ calculation (similar process for $\frac{\partial L}{\partial w_6}$, $\frac{\partial L}{\partial w_7}$, $\frac{\partial L}{\partial w_8}$, $\frac{\partial L}{\partial b_3}$, $\frac{\partial L}{\partial b_4}$)

$$\frac{\partial L}{\partial w_5} = \left(\frac{\partial L}{\partial z_5}\frac{\partial z_5}{\partial a_3} + \frac{\partial L}{\partial z_6}\frac{\partial z_6}{\partial a_3}\right)\frac{\partial a_3}{\partial z_3}\frac{\partial z_3}{\partial w_5}$$

1. $\frac{\partial L}{\partial z_5} = \frac{\partial L}{\partial \hat{y}_1}\frac{\partial \hat{y}_1}{\partial z_5}$ and $\frac{\partial L}{\partial z_6} = \frac{\partial L}{\partial \hat{y}_2}\frac{\partial \hat{y}_2}{\partial z_6}$ are already calculated above

2. $\frac{\partial z_5}{\partial a_3} = w_9$ and $\frac{\partial z_6}{\partial a_3} = w_{10}$

3. $\frac{\partial a_3}{\partial z_3} = a_3(1 - a_3)$

4. $\frac{\partial z_3}{\partial w_5} = a_1$
• $\frac{\partial L}{\partial w_1}$ calculation (similar process for $\frac{\partial L}{\partial w_2}$, $\frac{\partial L}{\partial w_3}$, $\frac{\partial L}{\partial w_4}$, $\frac{\partial L}{\partial b_1}$, $\frac{\partial L}{\partial b_2}$)

Similarly, according to the chain rule and the gradients calculated above, we have:

$$\frac{\partial L}{\partial w_1} = \left(\frac{\partial L}{\partial z_3}\frac{\partial z_3}{\partial a_1} + \frac{\partial L}{\partial z_4}\frac{\partial z_4}{\partial a_1}\right)\frac{\partial a_1}{\partial z_1}\frac{\partial z_1}{\partial w_1} = \left(\frac{\partial L}{\partial z_3} w_5 + \frac{\partial L}{\partial z_4} w_6\right) a_1(1 - a_1)\, x_1$$

These chain-rule formulas can be checked numerically, as in the sketch below.
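The following minimal sketch (assuming NumPy; hypothetical random parameters, and the loss taken as $L = \frac{1}{2}\sum_i(\hat{y}_i - y_i)^2$ so that $\partial L/\partial \hat{y}_i = \hat{y}_i - y_i$, as above) implements the backward pass for the network of Figure 8 and verifies one gradient, $\partial L/\partial w_9$, against a finite-difference approximation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameters for the 2-2-2-2 network of Figure 8.
rng = np.random.default_rng(0)
W1, W2, W3 = (rng.normal(size=(2, 2)) for _ in range(3))
b1, b2, b3 = (rng.normal(size=2) for _ in range(3))
x, y = np.array([0.5, -1.0]), np.array([1.0, 0.0])

def forward():
    a12 = sigmoid(x @ W1 + b1)               # a1, a2
    a34 = sigmoid(a12 @ W2 + b2)             # a3, a4
    return a12, a34, sigmoid(a34 @ W3 + b3)  # y1_hat, y2_hat

def loss():
    return 0.5 * np.sum((forward()[2] - y) ** 2)

a12, a34, y_hat = forward()

# Backward pass, layer by layer, exactly as derived above.
d_zout = (y_hat - y) * y_hat * (1 - y_hat)  # dL/dz5, dL/dz6
dW3 = np.outer(a34, d_zout)                 # dL/dw9 .. dL/dw12
d_zhid = (d_zout @ W3.T) * a34 * (1 - a34)  # dL/dz3, dL/dz4
dW2 = np.outer(a12, d_zhid)                 # dL/dw5 .. dL/dw8
d_zin = (d_zhid @ W2.T) * a12 * (1 - a12)   # dL/dz1, dL/dz2
dW1 = np.outer(x, d_zin)                    # dL/dw1 .. dL/dw4

# Finite-difference check of dL/dw9 (stored in dW3[0, 0]).
base, eps = loss(), 1e-6
W3[0, 0] += eps
print(dW3[0, 0], (loss() - base) / eps)  # the two values should agree
```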
The calculation above is only for one data point. The parameter gradients of the whole dataset's loss function are the
summation of each data point's results:

$$\frac{\partial L(\theta)}{\partial w_j} = \frac{\partial \sum_{n=1}^{N} L_n(\theta)}{\partial w_j} = \sum_{n=1}^{N} \frac{\partial L_n(\theta)}{\partial w_j} \qquad (15)$$

where $n$ is the index of each data point and $N$ represents the data volume.
The same gradient calculation principle is used for an ANN with a large number of layers and neurons:

$$\nabla L(\theta) = \begin{bmatrix} \dfrac{\partial L(\theta)}{\partial w_1}\\ \vdots\\ \dfrac{\partial L(\theta)}{\partial w_j}\\ \dfrac{\partial L(\theta)}{\partial b_1}\\ \vdots\\ \dfrac{\partial L(\theta)}{\partial b_k} \end{bmatrix} \qquad (16)$$
3.8 Hyperparameter optimization

Besides the internal parameters ($w_j$ and $b_k$) learned during training, an ANN also has hyperparameters that must be chosen before training:

• Network structure
Such as the number of layers, the number of neurons per layer, the type of activation function, etc.

• Optimization parameters
Such as the optimization method, learning rate, batch size, etc.
Several methods exist to search for a good set of hyperparameters:

• Grid search
If a hyperparameter is continuous, such as the learning rate, it first needs to be discretized according to the
"empirical values" used in other models. Then all combinations of the hyperparameter values can be tried
in order to select the best-performing set.
• Random search
Some hyperparameters have a limited impact, while others have a much more significant influence on the
model's performance. Grid search makes unnecessary attempts at unimportant hyperparameters, whereas
random search generates random combinations of the hyperparameters to find the optimal one (see the
sketch after this list).
• Bayesian optimization
Unlike the two methods above, when several sets of hyperparameters have already been tested, it is reasonable to
take advantage of these existing results to determine the next one. Bayesian optimization first establishes
a probabilistic proxy function that fits the tested results to approximate the real performance function. It then
searches the proxy function for a new set of hyperparameters with the highest probability of optimal performance
and tests its real performance. After several iterations, the probabilistic proxy function becomes closer
to the real performance function, especially in the highest-performance area. Finally, the proxy function may
help us find the optimal set of hyperparameters.
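As a minimal illustration of random search, the sketch below draws random combinations from a discretized search space; train_and_evaluate is a hypothetical stand-in that would, in practice, train the network and return a validation score:

```python
import random

random.seed(0)

def train_and_evaluate(lr, n_hidden, batch_size):
    # Hypothetical stand-in: a real version would train an ANN with these
    # hyperparameters and return its validation accuracy.
    return random.random()

# Discretized search space ("empirical values").
search_space = {
    "lr": [1e-4, 1e-3, 1e-2, 1e-1],
    "n_hidden": [16, 32, 64, 128],
    "batch_size": [16, 32, 64],
}

# 20 random trials instead of the 4 * 4 * 3 = 48 combinations that a
# grid search would have to test.
best_score, best_config = -1.0, None
for _ in range(20):
    config = {name: random.choice(values) for name, values in search_space.items()}
    score = train_and_evaluate(**config)
    if score > best_score:
        best_score, best_config = score, config

print(best_config, best_score)
```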
The computational cost of these methods grows exponentially as the number of hyperparameters increases. When
a set of hyperparameters is being tested and the learning process is clearly not progressing correctly, training can
be stopped early to discard this set and other similar ones. Ultimately, we still need to rely on engineers with
in-depth knowledge and experience to optimize the hyperparameters.
3.9 What is Deep learning?
Generally, the more parameters a model has, the better the prediction accuracy it can obtain. If the number of
internal parameters is limited, we tend to use more layers with fewer neurons per layer, rather than fewer layers
with many nodes per layer.
Deep learning is a kind of neural network composed of multiple hidden processing layers that learn data representa-
tions with multiple levels of abstraction [7]. For instance, convolutional neural networks (CNN) take face image
pixels as inputs, utilize a deep learning framework, and summarize features layer by layer. The features "converge"
progressively from the lower level (such as curve segments, oriented edges, colors) to the higher level (such as eyes,
nose, mouth and their corresponding sizes, shapes, distances and angles) and finally output a score identifying the
person, as in the sketch below.
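A minimal PyTorch sketch of such a network (hypothetical layer sizes; the channel counts and the 10-way output are illustrative, not the architecture of any specific model):

```python
import torch
import torch.nn as nn

# Each Conv2d slides one small kernel over the whole image: it looks only at
# a local neighbourhood (local perception) and reuses the same few weights
# everywhere (parameter sharing).
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),   # low level: edges, colors
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1),  # higher level: parts, shapes
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 16 * 16, 10),  # scores, e.g. over 10 candidate identities
)

scores = model(torch.randn(1, 3, 64, 64))  # one 64x64 RGB image
print(scores.shape)  # torch.Size([1, 10])
```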
The number of parameters of a CNN model can be considerably reduced thanks to ideas like local perception and
parameter sharing. Besides, with the advancement of electronics and information technology, especially the
development of GPUs (graphics processing units) and the large-scale deployment of sensors for massive data
acquisition, it has become possible to train CNN models with more and more layers and to decrease the prediction
error over time [8] (Figure 9):
Figure 9: ILSVRC classification error versus network depth: AlexNet (2012, 8 layers), VGG (2014, 19 layers, 7.3% error), GoogleNet (2014, 22 layers, 6.7% error) and Residual Net (2015, 3.57% error)
ANNs have some limitations that cannot be ignored at present, such as the cost of obtaining high-quality annotations,
the overfitting problem and the ecological impact.
As we know, for supervised learning, the training data need to be labeled. The amount and quality of the training
data largely determine the performance of the ANN model. However, high-quality annotations are now expensive
and difficult to obtain.
Overfitting is another common issue during ANN training: the model fits the training set too closely and starts to
learn its noise and useless information. This deteriorates the model's generalization performance on new datasets
and requires dedicated techniques to avoid.
In the AlphaGo example cited at the very beginning of this article, a lot of energy was consumed to train the AI
model. During the Go match against Lee Sedol, the environmental impact was not negligible: AlphaGo utilized
1202 CPUs (central processing units) and 176 GPUs for the computation!
4 Conclusion
Due to the explosion in computing power and data volume in recent years, artificial neural networks have expanded
greatly into agriculture, industry and services, accomplishing functions that cannot be achieved with
traditional methods. ANNs have already brought significant added value and efficiency gains to human
society.
In some aspects, such as large-scale data storage/processing and statistical analysis, ANNs are sometimes more
consistent than humans. They also demonstrate excellent performance in specific fields, for instance computer vision,
speech recognition and intelligent recommendation. However, in terms of innovation/design and decision-making,
they still have a long way to go, because "the machine cannot think and have consciousness like humans do".
Even though we possess many tools that allow us to infer some patterns and rules backward from a qualified model,
this technology still has very limited interpretability. It cannot yet gain people's trust in applications requiring high
reliability, accuracy and stability, such as autonomous driving, quantitative trading, medical equipment, nuclear
power plants, collaborative robots and the aerospace industry.
Therefore, to achieve an AI project, it may be critical to analyze the user needs and the profitability from a business
perspective while evaluating the feasibility from a data science perspective. We may also need the
help of efficient traditional algorithms, customized models and close cooperation with other advanced technologies
to build a well-organized ecosystem, finally reaching our society's digital and intelligent transformation.
References
[1] Nilsson, N. J. (1998). Artificial intelligence: a new synthesis. Morgan Kaufmann.
[2] Stephenson Smith, S. (2003). The new international Webster's comprehensive dictionary of the English language:
deluxe encyclopedic edition.
[3] Rosenblatt, F. (1958). The perceptron: a probabilistic model for information storage and organization in the
brain. Psychological review, 65(6), 386.
[4] Hornik, K., Stinchcombe, M., White, H. (1989). Multilayer feedforward networks are universal approximators.
Neural networks, 2(5), 359-366.
[5] Machine learning course: https://www.coursera.org/learn/machine-learning
[6] Rumelhart, D. E., Hinton, G. E., Williams, R. J. (1986). Learning representations by back-propagating errors.
Nature, 323(6088), 533-536.
[7] LeCun, Y., Bengio, Y., Hinton, G. (2015). Deep learning. Nature, 521(7553), 436-444.
[8] ILSVRC data: http://www.image-net.org/challenges/LSVRC/