UNIT I INTRODUCTION TO DEEP LEARNING
Introduction to machine learning - Linear models (SVMs, perceptrons, logistic regression) -
Introduction to Neural Nets: What a shallow network computes - Training a network: loss functions,
back propagation and stochastic gradient descent - Neural networks as universal function
approximators.
The human brain consists of a large number, more than a billion, of neural cells that process
information. Each cell works like a simple processor, and only the massive interaction between all
cells and their parallel processing makes the brain's abilities possible. Figure 1 represents a human
biological nervous unit. Various parts of the biological neural network (BNN) are marked in Figure 1.
Information flow in a neural cell
The input/output and the propagation of information are shown below.
1.3. Artificial neuron model
An artificial neuron is a mathematical function conceived as a simple model of a real (biological)
neuron.
The McCulloch-Pitts Neuron
This is a simplified model of real neurons, known as a Threshold Logic Unit.
A set of input connections brings in activations from other neurons.
A processing unit sums the inputs, and then applies a non-linear activation function (i.e.
squashing/transfer/threshold function).
An output line transmits the result to other neurons.
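As a small illustration, the threshold logic unit described above can be written in a few lines of Python. This is a minimal sketch: the weights, threshold and example inputs are made-up values for demonstration.

# Minimal sketch of a McCulloch-Pitts threshold logic unit.
# The weights, threshold and example inputs are illustrative values only.
def tlu(inputs, weights, threshold):
    # A processing unit sums the weighted input activations...
    net = sum(x * w for x, w in zip(inputs, weights))
    # ...then applies the threshold (step) activation function.
    return 1 if net >= threshold else 0

# Example: a two-input unit that behaves like an AND gate.
print(tlu([1, 1], [1, 1], threshold=2))  # 1
print(tlu([1, 0], [1, 1], threshold=2))  # 0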
1.3.1 Basic Elements of ANN:
A neuron consists of three basic components: weights, thresholds and a single activation
function. An artificial neural network (ANN) model based on biological neural systems is shown
in figure 2.
The different training/learning procedures available in ANN are:
Supervised learning
Unsupervised learning
Reinforced learning
Hebbian learning
Gradient descent learning
Competitive learning
Stochastic learning
1.4.1. Requirements of Learning Laws:
• The learning law should lead to convergence of weights
• The learning or training time should be short for capturing the information from the training
pairs
• Learning should use local information
• The learning process should be able to capture the complex non-linear mapping between
the input and output pairs
• Learning should be able to capture as many patterns as possible
• Storage of the pattern information gathered at the time of learning should be high for the
given network
1.4.1.1. Supervised learning:
Every input pattern that is used to train the network is associated with an output pattern, which is
the target or desired pattern.
A teacher is assumed to be present during the training process: a comparison is made between the
network's computed output and the correct expected output to determine the error. The error can
then be used to change network parameters, which results in an improvement in performance.
1.4.1.2 Unsupervised learning:
In this learning method the target output is not presented to the network. It is as if there is no
teacher to present the desired patterns, and hence the system learns on its own by discovering and
adapting to structural features in the input patterns.
1.4.1.3 Reinforced learning:
In this method a teacher, though available, does not present the expected answer but only indicates
whether the computed output is correct or incorrect. This information helps the network in the
learning process.
1.4.1.4 Hebbian learning:
This rule was proposed by Hebb and is based on correlative weight adjustment. It is the oldest
learning mechanism inspired by biology. In it, the input-output pattern pairs (x_i, y_i) are associated
by the weight matrix W, known as the correlation matrix.
It is computed as

W = Σ (i = 1 to n) x_i y_i^T ------------ eq(1)

Here y_i^T is the transpose of the associated output vector y_i. Numerous variants of the rule have
been proposed.
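As a small illustration of eq(1), here is a minimal NumPy sketch; the two bipolar pattern pairs are made-up values for demonstration.

import numpy as np

# Hebbian correlation learning, eq(1): W = sum over i of x_i y_i^T.
# The pattern pairs below are illustrative bipolar vectors.
X = [np.array([1, -1, 1]), np.array([-1, 1, 1])]   # input patterns x_i
Y = [np.array([1, -1]), np.array([-1, 1])]         # output patterns y_i

W = np.zeros((3, 2))
for x, y in zip(X, Y):
    W += np.outer(x, y)   # accumulate x_i y_i^T into the correlation matrix

print(W)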
1.4.1.5 Gradient descent learning:
This is based on the minimization of an error E defined in terms of the weights and the activation
function of the network. It also requires that the activation function employed by the network be
differentiable, since the weight update depends on the gradient of the error E.
Thus if Δw_ij is the weight update of the link connecting the i-th and j-th neurons of the two
neighbouring layers, then Δw_ij is defined as

Δw_ij = -η (∂E/∂w_ij) ----------- eq(2)

where η is the learning rate parameter and ∂E/∂w_ij is the error gradient with reference to the
weight w_ij.
To classify patterns into only two categories, all we need is a single output neuron. Here we will use
bipolar neurons. The simplest architecture that can do the job consists of a layer of N input neurons,
an output layer with a single output neuron, and no hidden layers. This is the same architecture as
we saw before for Hebb learning. However, we will use a different transfer function for the output
neurons, as given below in eq (7). Figure 7 represents a single-layer perceptron network.

f(net) = 1 if net > θ; 0 if -θ ≤ net ≤ θ; -1 if net < -θ --------------------- eq (7)
Equation 7 gives the bipolar activation function, which is the most common function used in
perceptron networks. Figure 7 represents a single-layer perceptron network. The inputs arising
from the problem space are collected by the sensors and fed to the association units. Association
units are the units responsible for associating the inputs based on their similarities; this unit groups
similar inputs, hence the name. A single input from each group is given to the summing unit.
Weights are randomly fixed initially and assigned to these inputs. The net value is calculated using
the expression

x = Σ wi ai – θ ___________________ eq(8)

This value is given to the activation function unit to get the final output response. The actual
output is compared with the target or desired output. If they are the same, we can stop training;
otherwise the weights have to be updated, which means there is an error. The error is given as
δ = b – s, where b is the desired/target output and s is the actual outcome of the machine. The
weights are updated based on the perceptron learning law as given in equation 9.
The weight change is given as Δw = η δ ai, so the new weight is

Wi(new) = Wi(old) + change in weight vector (Δw) _________ eq(9)
1.5.2. Perceptron Algorithm
Step 1: Initialize weights and bias. For simplicity, set weights and bias to zero. Set the learning
rate α in the range of zero to one.
• Step 2: While the stopping condition is false, do steps 3-7
• Step 3: For each training pair s:t, do steps 4-7
• Step 4: Set activations of input units xi = ai
• Step 5: Calculate the summing-part value: net = Σ ai wi - θ
• Step 6: Compute the response of the output unit based on the activation function
• Step 7: Update weights and bias if an error occurred for this pattern (if y is not equal to t):
wi(new) = wi(old) + α t xi, and bias b(new) = b(old) + α t
Else wi(new) = wi(old) and b(new) = b(old)
• Step 8: Test the stopping condition
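The steps above translate directly into Python. This is a minimal sketch, assuming bipolar targets and using the AND gate as illustrative training data.

import numpy as np

# Perceptron training following steps 1-8 above.
# Training pairs s:t = AND gate with bipolar inputs and targets (illustrative).
patterns = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]])
targets = np.array([1, -1, -1, -1])

w = np.zeros(2)      # Step 1: weights initialized to zero
b = 0.0              # bias initialized to zero
alpha = 1.0          # learning rate

changed = True
while changed:                                  # Step 2: stopping condition
    changed = False
    for x, t in zip(patterns, targets):         # Step 3: each training pair
        net = np.dot(x, w) + b                  # Steps 4-5: summing part
        y = 1 if net > 0 else -1                # Step 6: bipolar activation
        if y != t:                              # Step 7: update on error
            w = w + alpha * t * x
            b = b + alpha * t
            changed = True

print(w, b)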
1.5.3. Limitations of single layer perceptrons:
• Uses only a binary activation function
• Can be used only for linear networks
• Since it uses supervised learning, an optimal solution is provided
• Training time is long
• Cannot solve linearly inseparable problems
Figure 5: Multi-Layer Perceptron
1. Initialize the weights (Wi) and bias (b0) to small random values near zero
2. Set the learning rate η (or α) in the range of 0 to 1
3. Check the stopping condition; if it is false, do steps 4 to 8
4. For each training pair, do steps 5 to 8
5. Set activations of input units: xi = si for i = 1 to N
6. Calculate the output response:
yin = b0 + Σ xiwi
7. Apply the activation function (bipolar sigmoidal or bipolar step function)
For multi-layer networks, steps 6 and 7 are repeated based on the number of layers
8. If the target (t) is not equal to the actual output (y), update the weights and bias based
on the perceptron learning law:
Wi(new) = Wi(old) + change in weight vector
Change in weight vector = η ti xi
where η = learning rate
ti = target output of the i-th unit
xi = i-th input vector
b0(new) = b0(old) + change in bias
Change in bias = η ti
Else Wi(new) = Wi(old)
b0(new) = b0(old)
9. Test for the stopping condition
1.6. Linearly separable and linearly inseparable tasks:
Perceptrons are successful only on problems with a linearly separable solution space. Figure 9
represents both a linearly separable and a linearly inseparable problem. Perceptrons cannot handle,
in particular, tasks which are not linearly separable (known as the linear inseparability problem).
Sets of points in two-dimensional space are linearly separable if the sets can be separated by a
straight line. Generalizing, a set of points in n-dimensional space is called linearly separable if the
sets can be separated by a hyperplane, as represented in figure 9.
A single-layer perceptron can be used for linear separation, for example the AND gate. But it
cannot be used for non-linear, inseparable problems, for example the XOR gate. Consider figure 10.
Convex regions can be created by multiple decision lines arising from multi-layer networks. A
single-layer network cannot be used to solve an inseparable problem. Hence we go for a multi-layer
network, thereby creating convex regions which solve the inseparable problem.
1.6.1 Convex Region:
Select any two points in a region and draw a straight line between these two points. If the
points selected and the lines joining them both lie inside the region, then that region is known as a
convex region.
1.6.2. Types of convex regions
(a) Open Convex region (b) Closed Convex region
Figure 9 A: Circle - Closed convex region
Figure 9 B: Triangle - Closed convex region
1.7. Logistic Regression
Logistic regression is a probabilistic model that organizes the instances in terms of
probabilities. Because the classification is probabilistic, a natural method for optimizing the
parameters is to ensure that the predicted probability of the observed class for each training
occurrence is as large as possible. This goal is achieved by using the notion of maximum-likelihood
estimation in order to learn the parameters of the model. The likelihood of the training data is defined
as the product of the probabilities of the observed labels of each training instance. Clearly, larger
values of this objective function are better. By using the negative logarithm of this value, one obtains
a loss function in minimization form. Therefore, the output node uses the negative log-likelihood as
a loss function. This loss function replaces the squared error used in the Widrow-Hoff method. The
output layer can be formulated with the sigmoid activation function, which is very common in neural
network design.
Logistic regression is another supervised learning algorithm which is used to solve classification
problems. In classification problems, we have dependent variables in a binary or discrete format,
such as 0 or 1.
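To make the negative log-likelihood loss concrete, here is a minimal NumPy sketch; the data and the zero initial weights are purely illustrative.

import numpy as np

# Sigmoid output and negative log-likelihood (log loss) for logistic regression.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def negative_log_likelihood(w, X, y):
    p = sigmoid(X @ w)        # predicted probability of class 1 per instance
    # Negative log of the product of observed-label probabilities.
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))      # 5 training instances, 3 features
y = np.array([0, 1, 1, 0, 1])    # observed binary labels
w = np.zeros(3)
print(negative_log_likelihood(w, X, y))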
1.8. Support Vector Machines
Support Vector Machine, or SVM, is one of the most popular supervised learning algorithms.
It is used for classification as well as regression problems, though primarily for classification
problems in machine learning. The goal of the SVM algorithm is to create the best line or decision
boundary that can segregate n-dimensional space into classes, so that we can easily put a new data
point in the correct category in the future. This best decision boundary is called a hyperplane.
SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme
cases are called support vectors, and hence the algorithm is termed a Support Vector Machine.
Consider the diagram below, in which two different categories are classified using a decision
boundary or hyperplane.
Non-linear SVM: Non-linear SVM is used for non-linearly separable data: if a dataset cannot be
classified by using a straight line, then such data is termed non-linear data, and the classifier used
is called a non-linear SVM classifier.
Support Vectors:
The data points or vectors that are closest to the hyperplane and which affect the position of
the hyperplane are termed support vectors. Since these vectors support the hyperplane, they are
called support vectors.
1.8.2. Linear SVM:
The working of the SVM algorithm can be understood using an example.
Suppose we have a dataset that has two tags (green and blue), and the dataset has two
features, x1 and x2. We want a classifier that can classify a pair (x1, x2) of coordinates
as either green or blue. Consider the image in figure 11. Since this is a 2-D space, just
by using a straight line we can easily separate these two classes. But there can be multiple
lines that can separate these classes.
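In practice such a linear SVM can be fitted with scikit-learn; this is a minimal sketch on made-up 2-D data with two tags.

import numpy as np
from sklearn.svm import SVC

# Illustrative 2-D data: tag 0 ("blue") and tag 1 ("green").
X = np.array([[1, 2], [2, 3], [3, 3], [6, 5], [7, 7], [8, 6]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear")   # a linear kernel gives a straight-line hyperplane
clf.fit(X, y)

print(clf.support_vectors_)   # the extreme points that define the hyperplane
print(clf.predict([[4, 4]]))  # classify a new (x1, x2) point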
1.9.1. Types of Gradient Descent:
Typically, there are three types of Gradient Descent:
1. Batch Gradient Descent
2. Stochastic Gradient Descent
3. Mini-batch Gradient Descent
1.9.2. Stochastic Gradient Descent (SGD):
The word 'stochastic' means a system or a process that is linked with random probability.
Hence, in stochastic gradient descent, a few samples are selected randomly instead of the whole
data set for each iteration. In gradient descent, the term "batch" denotes the total number of
samples from the dataset used to calculate the gradient for each iteration. In typical gradient
descent optimization, like batch gradient descent, the batch is taken to be the whole dataset.
Although using the whole dataset is useful for getting to the minima in a less noisy and less
random manner, the problem arises when the dataset gets big.
Suppose you have a million samples in your dataset. If you use a typical gradient descent
optimization technique, you will have to use all of the one million samples to complete one
iteration, and this has to be done for every iteration until the minima is reached. Hence, it
becomes computationally very expensive to perform.
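A minimal NumPy sketch of stochastic gradient descent for a one-parameter linear model, using a single randomly selected sample per iteration; the data and hyperparameters are illustrative.

import numpy as np

rng = np.random.default_rng(42)

# Illustrative data generated from y = 3x plus noise.
X = rng.normal(size=1000)
y = 3.0 * X + 0.1 * rng.normal(size=1000)

w, eta = 0.0, 0.01             # initial weight and learning rate
for step in range(5000):
    i = rng.integers(len(X))   # pick one random sample (batch of size 1)
    grad = (w * X[i] - y[i]) * X[i]   # gradient of squared error on that sample
    w -= eta * grad                   # descend the stochastic gradient

print(w)   # approaches 3.0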
Reference Books:
1. B. Yegnanarayana, “Artificial Neural Networks” Prentice Hall Publications.
2. Simon Haykin, “Artificial Neural Networks”, Second Edition, Pearson Education.
3. Laurene Fausett, “Fundamentals of Neural Networks, Architectures, Algorithms and
Applications”, Prentice Hall publications.
4. Cosma Rohilla Shalizi, Advanced Data Analysis from an Elementary Point of View, 2015.
5. Deng & Yu, Deep Learning: Methods and Applications, Now Publishers, 2013.
6. Ian Goodfellow, Yoshua Bengio, Aaron Courville, Deep Learning, MIT Press, 2016.
7. Michael Nielsen, Neural Networks and Deep Learning, Determination Press, 2015.
Note: For further reference, kindly refer the class notes, PPTs, Video lectures
available in the Learning Management System (Moodle)
UNIT II INTRODUCTION TO DEEP LEARNING
History of Deep Learning - A Probabilistic Theory of Deep Learning - Backpropagation and
regularization, batch normalization - VC Dimension and Neural Nets - Deep vs Shallow Networks -
Convolutional Networks - Generative Adversarial Networks (GAN) - Semi-supervised Learning
2.1 History of Deep Learning
The chain rule that underlies the back-propagation algorithm was invented in the
seventeenth century (Leibniz, 1676; L'Hôpital, 1696)
Beginning in the 1940s, the function approximation techniques were used to motivate
machine learning models such as the perceptron
The earliest models were based on linear models. Critics including Marvin Minsky
pointed out several of the flaws of the linear model family, such as its inability to learn
the XOR function, which led to a backlash against the entire neural network approach
Efficient applications of the chain rule based on dynamic programming began to appear
in the 1960s and 1970s
Werbos (1981) proposed applying chain rule techniques for training artificial neural
networks. The idea was finally developed in practice after being independently
rediscovered in different ways (LeCun, 1985; Parker, 1985; Rumelhart et al., 1986a)
Following the success of back-propagation, neural network research gained popularity
and reached a peak in the early 1990s. Afterwards, other machine learning techniques
became more popular until the modern deep learning renaissance that began in 2006
The core ideas behind modern feedforward networks have not changed substantially
since the 1980s. The same back-propagation algorithm and the same approaches to
gradient descent are still in use.
Most of the improvement in neural network performance from 1986 to 2015 can be
attributed to two factors. First, larger datasets have reduced the degree to which statistical
generalization is a challenge for neural networks. Second, neural networks have become much
larger, because of more powerful computers and better software infrastructure. A small
number of algorithmic changes have also improved the performance of neural networks
noticeably. One of these algorithmic changes was the replacement of mean squared error with
the cross-entropy family of loss functions. Mean squared error was popular in the 1980s and
1990s but was gradually replaced by cross-entropy losses and the principle of maximum
likelihood as ideas spread between the statistics community and the machine learning
community.
The other major algorithmic change that has greatly improved the performance of
feedforward networks was the replacement of sigmoid hidden units with piecewise linear
hidden units, such as rectified linear units. Rectification using the max{0, z} function was
introduced in early neural network models and dates back at least as far as the Cognitron and
Neo-Cognitron (Fukushima, 1975, 1980).
For small datasets, Jarrett et al. (2009) observed that using rectifying nonlinearities is
even more important than learning the weights of the hidden layers. Random weights are
sufficient to propagate useful information through a rectified linear network, enabling the
classifier layer at the top to learn how to map different feature vectors to class identities. When
more data is available, learning begins to extract enough useful knowledge to exceed the
performance of randomly chosen parameters. Glorot et al. (2011a) showed that learning is far
easier in deep rectified linear networks than in deep networks that have curvature or two-sided
saturation in their activation functions.
When the modern resurgence of deep learning began in 2006, feedforward networks
continued to have a bad reputation. From about 2006 to 2012, it was widely believed that
feedforward networks would not perform well unless they were assisted by other models,
such as probabilistic models. Today it is known that, with the right resources and
engineering practices, feedforward networks perform very well. Gradient-based learning
in feedforward networks is now used as a tool to develop probabilistic models.
Feedforward networks continue to have unfulfilled potential. In the future, we expect they
will be applied to many more tasks, and that advances in optimization algorithms and model
design will improve their performance even further.
2.2 Back Propagation Networks (BPN)
2.2.1. Need for Multilayer Networks
Single-layer networks cannot be used to solve linearly inseparable problems;
they can only solve linearly separable problems
Single-layer networks cannot solve complex problems
Single-layer networks cannot be used when a large input-output data set is
available
Single-layer networks cannot capture the complex information available in
the training pairs
Hence, to overcome the above limitations, we use multi-layer networks.
2.2.2. Multi-Layer Networks
Any neural network which has at least one layer between the input and output
layers is called a multi-layer network
Layers present between the input and output layers are called hidden layers
The input layer neural units just collect the inputs and forward them to the next
higher layer
Hidden layer and output layer neural units process the information fed to
them and produce an appropriate output
Multi-layer networks provide an optimal solution for arbitrary classification
problems
Multi-layer networks use linear discriminants, where the inputs are non-linear
2.2.3. Back Propagation Networks (BPN)
Introduced by Rumelhart, Hinton, and Williams in 1986. A BPN is a multi-layer
feedforward network in which the error is back-propagated, hence the name Back
Propagation Network (BPN). It uses a supervised training process; it has a
systematic procedure for training the network and is used with error detection and
correction. The Generalized Delta Law (also called the Continuous Perceptron Law or
Gradient Descent Law) is used in this network. The generalized delta rule minimizes
the mean squared error between the target output and the computed output. The delta
law has a faster convergence rate when compared with the perceptron law, of which
it is the extended version. The limitation of this law is the local minima problem,
which reduces the convergence speed, but it is still better than the perceptron's.
Figure 1 represents a BPN network architecture. Even though multi-level perceptrons
can be used, they are less flexible and efficient than BPN. In figure 1 the weights
between the input and the hidden portion are denoted Wij, and the weights between
the first hidden layer and the next layer are denoted Vjk. This network is valid only
for differentiable output functions. The training process used in backpropagation
involves three stages, listed below:
1. Feedforward of the input training pair
2. Calculation and backpropagation of the associated error
3. Adjustment of weights
Yk = f(yink)
III. Backpropagation of Errors
Step 7: δk = (tk – Yk) f'(yink)
Step 8: δinj = Σk δk Vjk, and δj = δinj f'(zinj)
IV. Updating of Weights & Biases
Step 9: Weight corrections are ΔVjk = α δk Zj and Δwij = α δj xi
Bias corrections are ΔVok = α δk and Δwoj = α δj
New weights are
Wij(new) = Wij(old) + Δwij
Vjk(new) = Vjk(old) + ΔVjk
New biases are
Woj(new) = Woj(old) + Δwoj
Vok(new) = Vok(old) + ΔVok
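A minimal NumPy sketch of one training step covering the three stages (feedforward, backpropagation of the error, weight adjustment) for a 2-3-1 network with sigmoid activations; the sizes, data and learning rate are illustrative.

import numpy as np

def f(x):  return 1 / (1 + np.exp(-x))   # sigmoid activation
def df(x): return f(x) * (1 - f(x))      # its derivative f'

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(2, 3))   # input -> hidden weights Wij
V = rng.normal(scale=0.1, size=(3, 1))   # hidden -> output weights Vjk
x = np.array([1.0, -1.0]); t = np.array([1.0]); alpha = 0.5

# Stage 1: feedforward of the input training pair.
z_in = x @ W; Z = f(z_in)                # hidden activations Zj
y_in = Z @ V; Y = f(y_in)                # output Yk = f(y_ink)

# Stage 2: backpropagation of the associated error.
delta_k = (t - Y) * df(y_in)             # Step 7
delta_j = (delta_k @ V.T) * df(z_in)     # Step 8: sum over k of delta_k Vjk

# Stage 3: adjustment of weights (biases follow the same pattern).
V += alpha * np.outer(Z, delta_k)        # Delta Vjk = alpha * delta_k * Zj
W += alpha * np.outer(x, delta_j)        # Delta wij = alpha * delta_j * xi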
2.2.5 Merits
• Has smooth effect on weight correction
• Computing time is less if the weights are small
• 100 times faster than perceptron model
• Has a systematic weight updating procedure
2.2.6. Demerits
• Learning phase requires intensive calculations
• Selection of number of Hidden layer neurons is an issue
• Selection of number of Hidden layers is also an issue
• Network gets trapped in Local Minima
• Temporal Instability
• Network Paralysis
• Training time is more for Complex problems
2.3 Regularization
A fundamental problem in machine learning is how to make an algorithm that
will perform well not just on the training data, but also on new inputs. Many strategies
used in machine learning are explicitly designed to reduce the test error, possibly at
the expense of increased training error. These strategies are known collectively as
regularization.
Definition: - “any modification we make to a learning algorithm that is intended to
reduce its generalization error but not its training error.”
In the context of deep learning, most regularization strategies are based on
regularizing estimators.
Regularization of an estimator works by trading increased bias for reduced
variance.
An effective regularizer is one that makes a profitable trade, reducing variance
significantly while not overly increasing the bias.
Many regularization approaches are based on limiting the capacity of models, such as
neural networks, linear regression, or logistic regression, by adding a parameter norm
penalty Ω(θ) to the objective function J. We denote the regularized objective function
by J˜
J˜(θ; X, y) = J(θ; X, y) + αΩ(θ)
where α ∈ [0, ∞) is a hyperparameter that weights the relative contribution of the norm
penalty term, Ω, relative to the standard objective function J. Setting α to 0 results in no
regularization. Larger values of α correspond to more regularization.
The parameter norm penalty Ω typically penalizes only the weights of the affine transformation at
each layer and leaves the biases unregularized.
2.3.1 L2 Regularization
One of the simplest and most common kinds of parameter norm penalty is the L2 parameter
norm penalty, commonly called weight decay. This regularization strategy drives the weights
closer to the origin by adding the regularization term Ω(w) = (1/2)||w||² to the objective function.
L2 regularization is also known as ridge regression or Tikhonov regularization. To simplify, we
assume no bias parameter, so θ is just w. Such a model has the following total objective
function:

J̃(w; X, y) = (α/2) wᵀw + J(w; X, y)

We can see that the addition of the weight decay term has modified the learning rule to
multiplicatively shrink the weight vector by a constant factor on each step, just before
performing the usual gradient update. This describes what happens in a single step.
The quadratic approximation Ĵ of J in the neighbourhood of the minimizer w* is given by

Ĵ(θ) = J(w*) + (1/2)(w − w*)ᵀ H (w − w*)

where H is the Hessian matrix of J with respect to w, evaluated at w*.
The minimum of Ĵ occurs where its gradient ∇w Ĵ(w) = H(w − w*) is equal to 0.
To study the effect of weight decay, we add the weight decay gradient: the regularized solution
w̃ then satisfies (H + αI) w̃ = H w*, that is, w̃ = (H + αI)⁻¹ H w*.
As α approaches 0, the regularized solution w̃ approaches w*. But what happens as α grows?
Because H is real and symmetric, we can decompose it into a diagonal matrix Λ and an
orthonormal basis of eigenvectors, Q, such that H = QΛQᵀ. Applying the decomposition to the
above equation, we obtain

w̃ = Q(Λ + αI)⁻¹ Λ Qᵀ w*

2.3.2 L1 Regularization
L1 weight decay controls the strength of the regularization by scaling the penalty
Ω(w) = ||w||₁ = Σi |wi| using a positive hyperparameter α. Thus, the regularized objective
function J̃(w; X, y) is given by

J̃(w; X, y) = α ||w||₁ + J(w; X, y) ------------ eq(1)

By inspecting equation 1, we can see immediately that the effect of L1 regularization is quite
different from that of L2 regularization. Specifically, we can see that the regularization
contribution to the gradient no longer scales linearly with each wi; instead it is a constant factor
with a sign equal to sign(wi).
L1 regularization adds the absolute values of the weights as the penalty term in the cost function,
while L2 regularization appends the squared values of the weights to the cost function.
L1 regularization can be helpful in feature selection by eradicating unimportant features,
whereas L2 regularization is not recommended for feature selection.
L1 does not have a closed-form solution, since it includes an absolute value and is a
non-differentiable function, while L2 has a closed-form solution because it is a square of the weights.
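The contrast can be made concrete with a minimal sketch of both penalty terms added to an arbitrary data loss; the weights, α and the stand-in loss value are illustrative.

import numpy as np

def l1_penalty(w, alpha):
    # L1 adds the absolute values of the weights: alpha * sum |wi|.
    return alpha * np.sum(np.abs(w))

def l2_penalty(w, alpha):
    # L2 (weight decay) adds the squared weights: (alpha/2) * w^T w.
    return 0.5 * alpha * np.dot(w, w)

w = np.array([0.5, -1.2, 0.0, 3.0])   # illustrative weight vector
alpha = 0.01
data_loss = 1.37                      # stand-in for J(w; X, y)
print(data_loss + l1_penalty(w, alpha))   # regularized objective with L1
print(data_loss + l2_penalty(w, alpha))   # regularized objective with L2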
Image Source: https://www.analyticsvidhya.com/blog/2021/03/introduction-to-batch-normalization/
2.4 Batch Normalization
Even though the input X was normalized, the output is no longer on the same scale. The data
passes through multiple layers of the network, with (sigmoidal) activation functions applied
multiple times, which leads to an internal covariate shift in the data.
This motivates us to move towards batch normalization.
Normalization is the process of altering the input data to have a mean of zero and a standard
deviation of one.
2.4.1 Procedure to do Batch Normalization:
(1) Consider the batch input from layer h; for this layer we need to calculate the mean of the hidden
activations.
(2) After calculating the mean, the next step is to calculate the standard deviation of the hidden
activations.
(3) Now we normalize the hidden activations using these mean and standard deviation values: we
subtract the mean from each input and divide the whole value by the sum of the standard
deviation and the smoothing term (ε).
(4) As the final stage, the re-scaling and offsetting of the input is performed. Here two components
of the BN algorithm are used, γ (gamma) and β (beta). These parameters are used for re-scaling (γ)
and shifting (β) the vector containing the values from the previous operations.
These two parameters are learnable. During the training of the neural network, the optimal
values of γ and β are obtained and used. Hence we get an accurate normalization of each batch.
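A minimal NumPy sketch of steps (1)-(4) for one batch of hidden activations, following the description above; the batch values and the initial γ and β are illustrative.

import numpy as np

def batch_norm(h, gamma, beta, eps=1e-5):
    mu = h.mean(axis=0)                  # (1) mean of the hidden activations
    sigma = h.std(axis=0)                # (2) their standard deviation
    h_norm = (h - mu) / (sigma + eps)    # (3) normalize, with smoothing term
    return gamma * h_norm + beta         # (4) re-scale (gamma) and shift (beta)

h = np.array([[1.0, 50.0],
              [2.0, 60.0],
              [3.0, 70.0]])              # a batch of hidden activations
gamma, beta = np.ones(2), np.zeros(2)    # learnable parameters (initial values)
print(batch_norm(h, gamma, beta))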
2.5. Shallow Networks
Shallow neural networks, which consist of only one or two hidden layers, give us a basic idea
about deep neural networks: understanding a shallow neural network gives us an insight into
what exactly is going on inside a deep neural network. A neural network is built using various
hidden layers. Now that we know the computations that occur in a particular layer, let us
understand how the whole neural network computes the output for a given input X. These can
also be called the forward-propagation equations.
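For a shallow network with one hidden layer, the forward-propagation equations can be sketched as follows; the layer sizes and random data are illustrative.

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(4, 3)), np.zeros((4, 1))   # hidden layer parameters
W2, b2 = rng.normal(size=(1, 4)), np.zeros((1, 1))   # output layer parameters

X = rng.normal(size=(3, 5))    # 3 features, 5 examples

Z1 = W1 @ X + b1               # hidden pre-activation
A1 = np.tanh(Z1)               # hidden activation
Z2 = W2 @ A1 + b2              # output pre-activation
A2 = sigmoid(Z2)               # network output for each example
print(A2.shape)                # (1, 5)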
2.5.1 Difference Between a Shallow Net & Deep Learning Net:
1. A shallow net has one hidden layer (or very few hidden layers); a deep net has many hidden
layers, with more neurons in each layer.
2. A shallow net takes input only as vectors; a deep net can take raw data like images or text
as inputs.
3. A shallow net needs more parameters to achieve a good fit; a deep net can fit functions
better with fewer parameters than a shallow network.
Reference Books:
1. B. Yegnanarayana, “Artificial Neural Networks” Prentice Hall Publications.
2. Simon Haykin, “Artificial Neural Networks”, Second Edition, Pearson Education.
3. Laurene Fausett, “Fundamentals of Neural Networks, Architectures, Algorithms and
Applications”, Prentice Hall publications.
4. Cosma Rohilla Shalizi, Advanced Data Analysis from an Elementary Point of View, 2015.
5. Deng & Yu, Deep Learning: Methods and Applications, Now Publishers, 2013.
6. Ian Goodfellow, Yoshua Bengio, Aaron Courville, Deep Learning, MIT Press, 2016.
7. Michael Nielsen, Neural Networks and Deep Learning, Determination Press, 2015.
Note: For further reference, kindly refer the class notes, PPTs, Video lectures
available in the Learning Management System (Moodle)
UNIT III DIMENSIONALITY REDUCTION
Linear (PCA, LDA) and manifolds, metric learning - Auto encoders and dimensionality
reduction in networks - Introduction to Convnet - Architectures – AlexNet, VGG, Inception, ResNet
- Training a Convnet: weights initialization, batch normalization, hyperparameter optimization.
3.1 Linear Factor Models:
Linear factor models are used as building blocks of mixture models or of larger, deep
probabilistic models. A linear factor model is defined by the use of a stochastic linear decoder
function that generates x by adding noise to a linear transformation of h. It allows us to
discover explanatory factors that have a simple joint distribution. A linear factor model
describes the data-generation process as follows. First, we sample the explanatory factors h from
a distribution
h ∼ p(h)
Next, the observable variables are sampled given the factors: x = Wh + b + noise, where the
noise is typically Gaussian and diagonal.
Figure 3A: PCA for Data Representation Figure 3B: PCA Dimension Reduction
If the variation in a data set is caused by some natural property, or is caused by random
experimental error, then we may expect it to be normally distributed. In this case we show
the nominal extent of the normal distribution by a hyper-ellipse (the two-dimensional ellipse
in the example). The hyper ellipse encloses data points that are thought of as belonging to a
class. It is drawn at a distance beyond which the probability of a point belonging to the class
is low, and can be thought of as a class boundary.
If the variation in the data is caused by some other relationship, then PCA gives us a
way of reducing the dimensionality of a data set. Consider two variables that are nearly related
linearly as shown in figure 3B. As in figure 3A the principal direction in which the data varies
is shown by the U axis, and the secondary direction by the V axis. However, in this case all
the V coordinates are very close to zero. We may assume, for example, that they are only
non-zero because of experimental noise. Thus in the U-V axis system we can represent the
data set by the single variable U and discard V. Thus we have reduced the dimensionality of
the problem by 1.

Computing the Principal Components

The eigenvalues λ of a square matrix A are the roots of the equation

det(A − λI) = 0

where I is the n × n identity matrix. This equation is called the characteristic equation (or
characteristic polynomial) and has n roots.
Let λ be an eigenvalue of A. Then there exists a vector x such that:
Ax = λx
The vector x is called an eigenvector of A associated with the eigenvalue λ. Notice that there
is no unique solution for x in the above equation. It is a direction vector only and can be
scaled to any magnitude. To find a numerical solution for x we need to set one of its elements
to an arbitrary value, say 1, which gives us a set of simultaneous equations to solve for the
other elements. If there is no solution, we repeat the process with another element. Ordinarily
we normalize the final values so that x has length one, that is xᵀx = 1.
Suppose we have a 3 × 3 matrix A with eigenvectors x1, x2, x3, and eigenvalues λ1, λ2, λ3
so:
Ax1 = λ1x1 Ax2 = λ2x2 Ax3 = λ3x3
Putting the eigenvectors as the columns of a matrix Q gives AQ = QΛ, where Λ is the diagonal
matrix of the eigenvalues. Among its benefits, PCA also improves visualization, since
high-dimensional data can be projected onto its leading principal components.
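These steps translate directly into code; a minimal NumPy sketch of PCA by eigendecomposition of the covariance matrix, on illustrative data with two nearly linearly related variables (as in figure 3B):

import numpy as np

rng = np.random.default_rng(0)
u = rng.normal(size=100)
data = np.column_stack([u, 2 * u + 0.05 * rng.normal(size=100)])

X = data - data.mean(axis=0)             # center the data
C = np.cov(X, rowvar=False)              # covariance matrix A
eigvals, eigvecs = np.linalg.eigh(C)     # solve A x = lambda x (A symmetric)

# The principal direction U is the eigenvector with the largest eigenvalue.
U_axis = eigvecs[:, np.argmax(eigvals)]
reduced = X @ U_axis                     # keep U, discard V: now 1-dimensional
print(eigvals, reduced.shape)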
3.4 Linear Discriminant Analysis (LDA)
Linear Discriminant Analysis, as its name suggests, is a linear model for classification and
dimensionality reduction. It is most commonly used for feature extraction in pattern
classification problems.
3.4.1 Need for LDA:
Logistic regression performs well for binary classification but fails in the case of multiple
classification problems with well-separated classes, while LDA handles these quite
efficiently.
LDA can also be used in data pre-processing to reduce the number of features, just as PCA
does, which reduces the computing cost significantly.
3.4.2. Limitations:
Linear decision boundaries may not effectively separate non-linearly separable classes. More
flexible boundaries are desired.
In cases where the number of features exceeds the number of observations, LDA might not
perform as desired. This is called the Small Sample Size (SSS) problem; regularization is
required.
Advantages of LDA:
1. Simple prototype classifier: distance to the class mean is used, and it is simple to interpret.
2. Decision boundary is linear: It’s simple to implement and the classification is robust.
3. Dimension reduction: It provides informative low-dimensional view on the data, which is
both useful for visualization and feature engineering.
Shortcomings of LDA:
1. Linear decision boundaries may not adequately separate the classes. Support for more
general boundaries is desired.
2. In a high-dimensional setting, LDA uses too many parameters. A regularized version of
LDA is desired.
3. Support for more complex prototype classification is desired.
3.5. Manifold Learnings:
Manifold learning for dimensionality reduction has recently gained much attention to
assist image processing tasks such as segmentation, registration, tracking,
recognition, and computational anatomy.
The drawbacks of PCA in handling dimensionality reduction problems for non-linear,
curved surfaces necessitated the development of more advanced algorithms like
manifold learning.
There are different variants of manifold learning that solve the problem of reducing the
dimensions of data and feature-sets obtained from real-world problems that represent
uneven, curved surfaces, by sub-optimal data representation.
This kind of data representation selectively chooses data points from a low-dimensional
manifold that is embedded in a high-dimensional space, in an attempt to generalize linear
frameworks like PCA.
Locally, manifolds look like flat and featureless space that behaves like Euclidean space.
Manifold learning problems are unsupervised: the algorithm learns the high-dimensional
structure of the data from the data itself, without the use of predetermined
classifications and without losing information about characteristics of
the original variables.
The goal of the manifold-learning algorithms is to recover the original domain
structure, up to some scaling and rotation. The nonlinearity of these algorithms allows
them to reveal the domain structure even when the manifold is not linearly embedded.
It uses some scaling and rotation for this purpose.
Manifold learning algorithms are divided into two categories:
Global methods: Allow high-dimensional data to be mapped from high-dimensional
to low-dimensional space such that the global properties are preserved. Examples
include Multidimensional Scaling (MDS) and Isomap, covered in the following
sections.
Local methods: Allow high-dimensional data to be mapped to low-dimensional space such
that local properties are preserved. Examples are Locally Linear Embedding (LLE),
Laplacian Eigenmaps (LE), Local Tangent Space Alignment (LTSA), and Hessian
Eigenmapping (HLLE)
Three popular manifold learning algorithms:
IsoMap (Isometric Mapping)
Isomap seeks a lower-dimensional representation that maintains
‘geodesic distances’ between the points. A geodesic distance is a generalization
of distance for curved surfaces. Hence, instead of measuring distance in pure
Euclidean distance with the Pythagorean theorem-derived distance formula,
Isomap optimizes distances along a discovered manifold
Locally Linear Embeddings
Locally Linear Embeddings use a variety of tangent linear patches (as
demonstrated with the diagram above) to model a manifold. It can be thought of
as performing a PCA on each of these neighborhoods locally, producing a linear
hyperplane, then comparing the results globally to find the best nonlinear
embedding. The goal of LLE is to 'unroll' or 'unpack' the structure of the data in a
distorted fashion, so LLE outputs will often have a high density in the center
with extending rays
t-SNE
t-SNE is one of the most popular choices for high-dimensional
visualization, and stands for t-distributed Stochastic Neighbor Embeddings.
The algorithm converts relationships in original space into t-distributions, or
normal distributions with small sample sizes and relatively unknown standard
deviations. This makes t-SNE very sensitive to the local structure, a common
theme in manifold learning. It is considered to be the go-to visualization method
because of many advantages it possesses.
3.6.Auto Encoders:
An autoencoder is an unsupervised artificial neural network that attempts to
encode the data by compressing it into lower dimensions (the bottleneck layer, or code) and
then decoding the data to reconstruct the original input. The bottleneck layer (or code) holds the
compressed representation of the input data. In an autoencoder the number of output units must
be equal to the number of input units, since we are attempting to reconstruct the input data.
Autoencoders usually consist of an encoder and a decoder. The encoder encodes the
provided data into a lower dimension, which is the size of the bottleneck layer, and the decoder
decodes the compressed data into its original form. The number of neurons in the layers of the
encoder decreases as we move through further layers, whereas the number of neurons in the
layers of the decoder increases as we move through further layers. Three layers are used in the
encoder and decoder in the following example: the encoder contains 32, 16, and 7 units in each
layer respectively, and the decoder contains 7, 16, and 32 units in each layer respectively. The
code size (the number of neurons in the bottleneck) must be less than the number of features in
the data. Before feeding the data into the autoencoder, the data must be scaled between 0 and 1
using MinMaxScaler, since we are going to use the sigmoid activation function in the output
layer, which outputs values between 0 and 1. When we use autoencoders for dimensionality
reduction, we extract the bottleneck layer and use it to reduce the dimensions. This process can
be viewed as feature extraction.
The type of autoencoder that we are using here is a deep autoencoder, where the encoder and
the decoder are symmetrical. Autoencoders do not necessarily have a symmetrical encoder and
decoder; the encoder and decoder can be non-symmetrical as well. Common types include the
following; a code sketch of the deep autoencoder described above follows the list.
Deep Autoencoder
Sparse Autoencoder
Under complete Autoencoder
Variational Autoencoder
LSTM Autoencoder
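A minimal tf.keras sketch of the deep autoencoder described above, with a 32-16-7 encoder, a 16-32 decoder after the 7-unit bottleneck, and a sigmoid output layer; the input dimension of 64 and the MSE loss are illustrative assumptions.

from tensorflow import keras
from tensorflow.keras import layers

n_features = 64                                   # illustrative input dimension

inputs = keras.Input(shape=(n_features,))
# Encoder: 32 -> 16 -> 7 (the bottleneck, or code).
x = layers.Dense(32, activation="relu")(inputs)
x = layers.Dense(16, activation="relu")(x)
code = layers.Dense(7, activation="relu")(x)
# Decoder: 16 -> 32 -> n_features, sigmoid output in [0, 1].
x = layers.Dense(16, activation="relu")(code)
x = layers.Dense(32, activation="relu")(x)
outputs = layers.Dense(n_features, activation="sigmoid")(x)

autoencoder = keras.Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="mse")

# Extracting the bottleneck gives the reduced (7-dimensional) representation.
encoder = keras.Model(inputs, code)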
3.7. AlexNet:
The AlexNet model was proposed in 2012 in the research paper named "ImageNet
Classification with Deep Convolutional Neural Networks" by Alex Krizhevsky and his colleagues.
Then comes the fourth convolution operation, with 384 filters of size 3×3. The stride
along with the padding is 1, and the output size remains unchanged at 13×13×384.
After this, we have the final convolution layer, of size 3×3 with 256 such filters. The
stride and padding are set to 1, and the activation function is ReLU. The resulting feature
map is of shape 13×13×256.
If we look at the architecture now, the number of filters increases as we go
deeper; hence more features are extracted as we move deeper into the architecture.
Also, the filter size is reducing, which means a decrease in the feature map shape.
3.8. VGG-16
The major shortcoming of AlexNet, too many hyper-parameters, was solved by VGG
Net by replacing the large kernel-sized filters (11 and 5 in the first and second convolution
layers, respectively) with multiple 3×3 kernel-sized filters one after another.
The architecture, developed by Simonyan and Zisserman, was the first runner-up of the
Visual Recognition Challenge of 2014.
The architecture consists of 3×3 convolutional filters with a stride of 1 and 2×2 max
pooling layers with a stride of 2.
'Same' padding is used to preserve the dimensions.
There are 16 layers in the network. The input image is in RGB format with dimensions
224×224×3, followed by 5 blocks of convolution (filters: 64, 128, 256, 512, 512), each
followed by max pooling.
The output of these layers is fed into three fully connected layers and a softmax function
in the output layer.
In total there are 138 million parameters in VGG Net.
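For reference, the architecture is available pre-built in torchvision; a minimal sketch (assuming a recent torchvision, where weights=None builds the untrained architecture):

import torch
from torchvision import models

vgg = models.vgg16(weights=None)   # VGG-16 architecture, no pretrained weights

# Count the parameters: roughly 138 million, as stated above.
n_params = sum(p.numel() for p in vgg.parameters())
print(f"{n_params / 1e6:.0f}M parameters")

# A forward pass expects a batch of 224x224 RGB images.
x = torch.randn(1, 3, 224, 224)
print(vgg(x).shape)                # torch.Size([1, 1000])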
3.9 ResNet:
ResNet, the winner of the ILSVRC-2015 competition, is a deep network with over 100
layers. Residual networks (ResNets) are similar to VGG nets in their sequential approach, but
they also use "skip connections" and "batch normalization", which help to train deep layers
without hampering performance. After VGG nets, as CNNs were getting deeper, it was
becoming hard to train them because of the vanishing gradients problem, which makes the
derivative vanishingly small; the overall performance therefore saturates or even degrades. The
idea of skip connections came from highway networks, where gated shortcut connections were
used.
3.10 Inception Net:
Figure 7: InceptionNet
The Inception network, also known as GoogLeNet, was proposed by developers at Google
in "Going Deeper with Convolutions" in 2014. The motivation for InceptionNet comes from the
fact that salient parts of an image can have a large variation in size. Because of this, the selection
of the right kernel size becomes extremely difficult: big kernels are suited for globally distributed
features and small kernels for locally distributed features. InceptionNet resolves this by stacking
multiple kernels at the same level. Typically it uses 5×5, 3×3 and 1×1 filters in one go.
3.11. Hyperparameter Optimization:
Hyperparameter optimization in machine learning intends to find the hyperparameters
of a given machine learning algorithm that deliver the best performance as measured on a
validation set. Hyperparameters, in contrast to model parameters, are set by the machine
learning engineer before training. The number of trees in a random forest is a hyperparameter
while the weights in a neural network are model parameters learned during training.
Hyperparameter optimization finds a combination of hyperparameters that returns an optimal
model, one which reduces a predefined loss function and in turn increases the accuracy on given
independent data.
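One common method is an exhaustive grid search over candidate hyperparameter values, scored on validation folds. A minimal scikit-learn sketch; the model, grid and generated data are illustrative.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, random_state=0)   # illustrative data

# Candidate values for hyperparameters, set before training.
param_grid = {"n_estimators": [10, 50, 100], "max_depth": [3, 5, None]}

search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=5)   # score each combination on 5 folds
search.fit(X, y)
print(search.best_params_, search.best_score_)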
3.11.1 Hyperparameter Optimization methods
Reference Books:
1. B. Yegnanarayana, “Artificial Neural Networks” Prentice Hall Publications.
2. Simon Haykin, “Artificial Neural Networks”, Second Edition, Pearson Education.
3. Laurene Fausett, “Fundamentals of Neural Networks, Architectures, Algorithms and
Applications”, Prentice Hall publications.
4. Cosma Rohilla Shalizi, Advanced Data Analysis from an Elementary Point of View, 2015.
5. Deng & Yu, Deep Learning: Methods and Applications, Now Publishers, 2013.
6. Ian Goodfellow, Yoshua Bengio, Aaron Courville, Deep Learning, MIT Press, 2016.
7. Michael Nielsen, Neural Networks and Deep Learning, Determination Press, 2015.
Note: For further reference, kindly refer the class notes, PPTs, Video lectures
available in the Learning Management System (Moodle)
UNIT IV OPTIMIZATION AND GENERALIZATION
Optimization in deep learning - Non-convex optimization for deep networks - Stochastic
Optimization - Generalization in neural networks - Spatial Transformer Networks - Recurrent
networks, LSTM - Recurrent Neural Network Language Models - Word-Level RNNs & Deep
Reinforcement Learning - Computational & Artificial Neuroscience.
4.1 Optimization in Deep Learning:
In deep learning, the performance of the model is estimated/evaluated with the help of a loss
function. This loss is used to train the network so that it performs better. Essentially, we try
to minimize the loss function: lower loss means the model performs better. The process of
minimizing any mathematical function is called optimization.
Optimizers are algorithms or methods used to change attributes of the neural network, such
as the weights and the learning rate, so that the loss is reduced. Optimizers solve optimization
problems by minimizing the function.
The goal of an optimizer is to minimize the objective function (the loss function based on the
training data set). Put simply, optimization minimizes the training error.
4.1.1 Need for Optimization:
Presence of local minima reduces the model performance
Presence of saddle points, which create vanishing gradient or exploding gradient issues
To select appropriate weight values and other associated model parameters
To minimize the loss value (training error)
Figure 4.1: Convex Regions
4.4. Spatial Transformer Networks (STN):
Localisation Net:
Given the input feature map U, with width W, height H and C channels, the localisation
net outputs θ, the parameters of the transformation Tθ. The transformation can be learnt
as an affine transform.
Grid Generator:
Suppose we have a regular grid G; this G is a set of points with target coordinates
(xt_i, yt_i). We then apply the transformation Tθ on G, i.e. Tθ(G).
After Tθ(G), a set of points with source coordinates (xs_i, ys_i) is output.
These points have been altered based on the transformation parameters. The transformation
can be a translation, scale, rotation or more generic warping, depending on how we set θ, as
mentioned above.
Sampler:
Based on the new set of coordinates, the sampler generates a transformed output feature
map V. This V is translated, scaled, rotated, warped, projectively transformed or
affine-transformed as required. Note that an STN can be applied not only to the input
image, but also to intermediate feature maps.
An STN is a mechanism that rotates or scales an input image or a feature map
in order to focus on the target object and to remove rotational variance.
One of the most notable features of STNs is their modularity (the module can
be injected into any part of the model) and their ability to be trained with a single backprop
algorithm without modification of the initial model.
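In PyTorch, the grid generator and sampler correspond to the affine_grid and grid_sample functions; a minimal sketch applying an illustrative rotation θ to a random feature map U:

import math
import torch
import torch.nn.functional as F

U = torch.randn(1, 1, 8, 8)   # input feature map (N, C, H, W)

# Illustrative affine parameters theta: a 30-degree rotation.
a = math.radians(30)
theta = torch.tensor([[[math.cos(a), -math.sin(a), 0.0],
                       [math.sin(a),  math.cos(a), 0.0]]])

# Grid generator: apply T_theta to the regular target grid G.
grid = F.affine_grid(theta, U.size(), align_corners=False)

# Sampler: produce the transformed output feature map V.
V = F.grid_sample(U, grid, align_corners=False)
print(V.shape)   # torch.Size([1, 1, 8, 8])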
4.4.1. Advantages:
Helps in learning explicit spatial transformations like translation, rotation, scaling,
cropping, non-rigid deformations, etc. of features.
Can be used in any networks and at any layer and learnt in an end-to-end trainable
manner.
Provides improvement in the performance of existing models.
4.5. Recurrent Neural Networks:
RNNs are very powerful, because they combine two properties:
Distributed hidden state that allows them to store a lot of information about
the past efficiently.
Non-linear dynamics that allows them to update their hidden state in
complicated ways.
With enough neurons and time, RNNs can compute anything that can be computed by
your computer.
4.5.1. Need for RNN:
Normal networks cannot handle sequential data
They consider only the current input
Normal neural networks cannot memorize previous inputs
The solution to these issues is the RNN
An RNN works on the principle of saving the output of a particular layer and feeding
it back to the input in order to predict the output of the layer. We can convert a feed-
forward neural network into a recurrent neural network as shown in figure 4.4.
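The feedback principle can be written as a single recurrence. Here is a minimal NumPy sketch of an RNN cell processing a short sequence; all sizes and weights are illustrative.

import numpy as np

rng = np.random.default_rng(0)
Wx = rng.normal(scale=0.1, size=(4, 3))   # input -> hidden weights
Wh = rng.normal(scale=0.1, size=(4, 4))   # hidden -> hidden (feedback) weights
b = np.zeros(4)

h = np.zeros(4)                       # hidden state: the network's memory
sequence = rng.normal(size=(5, 3))    # 5 timesteps, 3 features each

for x_t in sequence:
    # The saved output h is fed back together with the current input.
    h = np.tanh(Wx @ x_t + Wh @ h + b)

print(h)   # final hidden state summarizing the whole sequence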
4.6. Long Short-Term Memory (LSTM) Networks
Step 1: Decide how much of the past it should remember
The first layer is the forget gate: a sigmoid layer that looks at the previous hidden state and
the current input, and decides which parts of the previous cell state to keep or discard.
Step 2: Decide how much this unit adds to the current state
In the second layer, there are two parts. One is the sigmoid function, and the other is the
tanh function. The sigmoid function decides which values to let through (0 or
1). The tanh function gives weightage to the values which are passed, deciding their level of
importance (-1 to 1).
Step 3: Decide what part of the current cell state makes it to the output
The third step is to decide what the output will be. First, we run a sigmoid layer, which
decides what parts of the cell state make it to the output. Then, we put the cell state through
tanh to push the values to be between -1 and 1 and multiply it by the output of the sigmoid
gate.
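The three steps map onto the standard LSTM gate equations; a minimal NumPy sketch of one timestep, with illustrative sizes and random weights.

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
n, m = 4, 3   # hidden size and input size (illustrative)
W = {g: rng.normal(scale=0.1, size=(n, n + m)) for g in "fico"}

def lstm_step(x, h, c):
    z = np.concatenate([h, x])
    f = sigmoid(W["f"] @ z)       # Step 1: forget gate on the old cell state
    i = sigmoid(W["i"] @ z)       # Step 2: sigmoid part, which values to let through
    c_hat = np.tanh(W["c"] @ z)   # Step 2: tanh part, candidate values in (-1, 1)
    c = f * c + i * c_hat         # updated cell state
    o = sigmoid(W["o"] @ z)       # Step 3: which parts of the state reach the output
    h = o * np.tanh(c)            # squash the cell state and gate it
    return h, c

h, c = np.zeros(n), np.zeros(n)
h, c = lstm_step(rng.normal(size=m), h, c)
print(h)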
4.6.2. Applications of LSTM include:
• Robot control
• Time series prediction
• Speech recognition
• Rhythm learning
• Music composition
• Grammar learning
• Handwriting recognition
4.7. Computational and Artificial Neuro-Science:
Computational neuroscience is the field of study in which mathematical tools and theories
are used to investigate brain function.
The term “computational neuroscience” has two different definitions:
1. using a computer to study the brain
2. studying the brain as a computer
Computational and artificial neuroscience deals with the study and understanding of how
signals are transmitted through and from the human brain. A better understanding of how
decisions are made in the human brain by processing data or signals will help us develop
intelligent algorithms or programs to solve complex problems. Hence, we need to
understand the basics of biological neural networks (BNN).
4.7.1. The Biological Neurons:
The human brain consists of a large number, more than a billion, of neural cells that
process information. Each cell works like a simple processor, and only the massive
interaction between all cells and their parallel processing makes the brain's abilities
possible. Figure 4.7 represents a human biological nervous unit, with various parts of the
biological neural network (BNN) marked.
Figure 4.7: Biological Neural Network
Dendrites are branching fibres that extend from the cell body or soma.
Soma or cell body of a neuron contains the nucleus and other structures, support
chemical processing and production of neurotransmitters.
Axon is a single fibre that carries information away from the soma to the synaptic sites
of other neurons (dendrites and somas), muscles, or glands.
Axon hillock is the site of summation for incoming information. At any moment, the
collective influence of all neurons that conduct impulses to a given neuron will determine
whether or not an action potential will be initiated at the axon hillock and propagated along
the axon.
Myelin sheath consists of fat-containing cells that insulate the axon from electrical
activity. This insulation acts to increase the rate of transmission of signals. A gap exists
between each myelin sheath cell along the axon. Since fat inhibits the propagation of
electricity, the signals jump from one gap to the next.
Nodes of Ranvier are the gaps (about 1 μm) between myelin sheath cells. Since fat
serves as a good insulator, the myelin sheaths speed the rate of transmission of an electrical
impulse along the axon.
Synapse is the point of connection between two neurons, or between a neuron and a muscle or
a gland. Electrochemical communication between neurons takes place at these junctions.
Terminal buttons of a neuron are the small knobs at the end of an axon that release
chemicals called neurotransmitters.
Information flow in a neural cell
The input/output and the propagation of information are shown below.
4.7.2. Artificial neuron model
An artificial neuron is a mathematical function conceived as a simple model of a real
(biological) neuron.
The McCulloch-Pitts Neuron
This is a simplified model of real neurons, known as a Threshold Logic Unit.
A set of input connections brings in activations from other neurons.
A processing unit sums the inputs, and then applies a non-linear activation function
(i.e. squashing/transfer/threshold function).
An output line transmits the result to other neurons.
4.7.3. Basic Elements of ANN:
A neuron consists of three basic components: weights, thresholds and a single
activation function. An artificial neural network (ANN) model based on biological
neural systems is shown in figure 4.8.
4.7.4. Applications of Computational Neuro Science:
Reference Books:
1. B. Yegnanarayana, “Artificial Neural Networks” Prentice Hall Publications.
2. Simon Haykin, “Artificial Neural Networks”, Second Edition, Pearson Education.
3. Laurene Fausett, “Fundamentals of Neural Networks, Architectures, Algorithms and
Applications”, Prentice Hall publications.
4. Cosma Rohilla Shalizi, Advanced Data Analysis from an Elementary Point of View, 2015.
5. Deng & Yu, Deep Learning: Methods and Applications, Now Publishers, 2013.
6. Ian Goodfellow, Yoshua Bengio, Aaron Courville, Deep Learning, MIT Press, 2016.
7. Michael Nielsen, Neural Networks and Deep Learning, Determination Press, 2015.
Note: For further reference, kindly refer the class notes, PPTs, Video lectures available
in the Learning Management System (Moodle)
UNIT V APPLICATIONS OF DEEP LEARNING
5.1. ImageNet
ImageNet is an image database organized according to the WordNet hierarchy (currently only
the nouns), in which each node of the hierarchy is depicted by hundreds and thousands of images.
In Machine Learning and Deep Neural Networks, machines are trained on a large dataset of
various images. Machines are required to learn useful features from these training images. Once
learned, they can use these features to classify images and perform many other tasks associated
with computer vision. ImageNet gives researchers a common set of images to benchmark their
models and algorithms.
ImageNet is useful for many computer vision applications such as object recognition, image
classification and object localization. Prior to ImageNet, a researcher wrote one algorithm to
identify dogs, another to identify cats, and so on. After training with ImageNet, the same
algorithm could be used to identify different objects. The diversity and size of ImageNet meant
that a computer looked at and learned from many variations of the same object. These variations
could include camera angles, lighting conditions, and so on. Models built from such extensive
training were better at many computer vision tasks. ImageNet convinced researchers that large
datasets were important for algorithms and models to work well.
5.1.1. Technical details of Image Net:
ImageNet did not define these subcategories on its own but derived these from
WordNet. WordNet is a database of English words linked together by semantic relationships.
Words of similar meaning are grouped together into a synonym set, simply called synset.
Hypernyms are synsets that are more general. Thus, "organism" is a hypernym of "plant".
Hyponyms are synsets that are more specific. Thus, "aquatic" is a hyponym of "plant". This
hierarchy makes it useful for computer vision tasks. If the model is not sure about a subcategory,
it can simply classify the image higher up the hierarchy, where the error probability is lower. For
example, if the model is unsure that it is looking at a rabbit, it can simply classify it as a mammal.
While WordNet has 100K+ synsets, only the nouns have been considered by ImageNet.
Humans make mistakes and therefore we must have checks in place to overcome them. Each
human is given a task of 100 images. In each task, 6 "gold standard" images are placed with
known labels. At most 2 errors are allowed on these standard images, otherwise the task has to
be restarted.
In addition, the same image is labelled by three different humans. When there's
disagreement, such ambiguous images are resubmitted to another human with tighter quality
threshold (only one allowed error on the standard images).
For public access, ImageNet provides image thumbnails and URLs from where the original
images were downloaded. Researchers can use these URLs to download the original images.
However, those who wish to use the images for non-commercial or educational purposes can
create an account on ImageNet and request access. This will allow direct download of images
from ImageNet. This is useful when the original sources of images are no longer available.
The dataset can be explored via a browser-based user interface. Alternatively, there's also
an API. Researchers may want to read the API Documentation. This documentation also shares
how to download image features and bounding boxes.
Images are not uniformly distributed across subcategories. One research team, considering
200 subcategories, found that the top 11 had 50% of the images, followed by a long tail.
When classifying people, ImageNet uses labels that are racist, misogynist and offensive.
People are treated as objects. Their photos have been used without their knowledge. About
5.8% labels are wrong. ImageNet lacks geodiversity. Most of the data represents North
America and Europe. China and India are represented in only 1% and 2.1% of the images
respectively. This implies that models trained on ImageNet will not work well when applied
for the developing world.
Another study, from 2016, found that 30% of ImageNet's image URLs are broken; this is
about 4.4 million annotations lost. Copyright laws prevent caching and redistribution of these
images by ImageNet itself.
5.2. WaveNet:
WaveNet is a deep generative model of raw audio waveforms. DeepMind showed that WaveNets
are able to generate speech which mimics any human voice and which sounds more natural
than the best existing Text-to-Speech systems, reducing the gap with human performance by
over 50%. Allowing people to converse with machines is a long-standing dream of human-
computer interaction. The ability of computers to understand natural speech has been
revolutionised in the last few years by the application of deep neural networks. However,
generating speech with computers — a process usually referred to as speech synthesis or text-
to-speech (TTS) — is still largely based on so-called concatenative TTS, where a very large
database of short speech fragments is recorded from a single speaker and then recombined to
form complete utterances. This makes it difficult to modify the voice (for example switching
to a different speaker, or altering the emphasis or emotion of their speech) without recording a
whole new database.
This has led to a great demand for parametric TTS, where all the information required to
generate the data is stored in the parameters of the model, and the contents and characteristics
of the speech can be controlled via the inputs to the model. So far, however, parametric TTS
has tended to sound less natural than concatenative. Existing parametric models typically
generate audio signals by passing their outputs through signal processing algorithms known
as vocoders. WaveNet changes this paradigm by directly modelling the raw waveform of the
audio signal, one sample at a time. As well as yielding more natural-sounding speech, using
raw waveforms means that WaveNet can model any kind of audio, including music.
WaveNet performs autoregressive learning with the help of convolutional
networks and a few architectural tricks. Basically, a convolution window slides over the audio data,
and at each step the network tries to predict the next sample value, which it has not yet seen. In
other words, it builds a network that learns the causal relationships between consecutive timesteps
(as shown in Figure 5.1).
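To make the idea concrete, here is a minimal PyTorch sketch of a dilated causal 1-D convolution, the basic building block that WaveNet stacks; the channel counts, kernel size and dilations below are illustrative assumptions, not the published WaveNet configuration.

import torch
import torch.nn as nn

class CausalConv1d(nn.Module):
    def __init__(self, channels, kernel_size, dilation):
        super().__init__()
        # Pad only on the left so the output at time t never sees inputs after t.
        self.pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):                        # x: (batch, channels, time)
        x = nn.functional.pad(x, (self.pad, 0))  # left-pad the time axis
        return self.conv(x)

# Stacking layers with dilations 1, 2, 4, ... doubles the receptive field at
# every layer, which is how WaveNet can condition on thousands of past samples.
stack = nn.Sequential(*[CausalConv1d(16, 2, 2 ** i) for i in range(6)])
out = stack(torch.randn(1, 16, 8000))            # e.g. half a second at 16 kHz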
Typically, speech audio has a sampling rate of 22 kHz or 16 kHz. For a few seconds of speech, this
means more than 100K values for a single example, which is enormous for the network to
consume. Hence, we need to restrict the size, preferably to around 8K samples. At the end, the values
are predicted over Q channels (e.g., Q = 256 or 65,536), and compared against the original audio data
compressed to Q distinct values. For that, μ-law (mu-law) quantization can be used: it maps the
values to the integer range [0, Q-1]. The loss can then be computed either by cross-entropy
or by a discretized logistic mixture.
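As a minimal sketch of this compression step, the following Python code applies μ-law companding and quantization to a waveform assumed to be normalized to [-1, 1], with Q = 256.

import numpy as np

def mu_law_encode(audio, Q=256):
    mu = Q - 1
    # Compress the dynamic range, then map [-1, 1] to integer bins [0, Q-1].
    compressed = np.sign(audio) * np.log1p(mu * np.abs(audio)) / np.log1p(mu)
    return ((compressed + 1) / 2 * mu + 0.5).astype(np.int64)

def mu_law_decode(bins, Q=256):
    mu = Q - 1
    compressed = 2 * bins.astype(np.float64) / mu - 1
    return np.sign(compressed) * np.expm1(np.abs(compressed) * np.log1p(mu)) / mu

x = np.sin(np.linspace(0, 20, 16000))  # a toy one-second "waveform" at 16 kHz
q = mu_law_encode(x)                   # integer class targets in [0, 255]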
The above diagram (Figure 5.3) shows the phases or logical steps involved in natural language
processing.
5.4. Word2Vec:
Word embedding is one of the most popular representations of document vocabulary. It is
capable of capturing the context of a word in a document, semantic and syntactic similarity, relation
with other words, etc. What exactly are word embeddings? Loosely speaking, they are vector
representations of a particular word. Having said this, how do we generate them?
More importantly, how do they capture context? Word2Vec is one of the most popular
techniques to learn word embeddings using a shallow neural network. It was developed by Tomas
Mikolov and colleagues at Google in 2013.
The purpose and usefulness of Word2vec is to group the vectors of similar words
together in vector space. That is, it detects similarities mathematically. Word2vec creates vectors
that are distributed numerical representations of word features, features such as the context of
individual words. It does so without human intervention.
Given enough data, usage and contexts, Word2vec can make highly accurate guesses
about a word’s meaning based on past appearances. Those guesses can be used to establish a
word’s association with other words (e.g. “man” is to “boy” what “woman” is to “girl”), or
cluster documents and classify them by topic. Those clusters can form the basis of search,
sentiment analysis and recommendations in such diverse fields as scientific research, legal
discovery, e-commerce and customer relationship management. Measured by cosine similarity, no
similarity is expressed as a 90-degree angle (a cosine of 0), while total similarity (a cosine of 1) is a
0-degree angle, i.e., complete overlap.
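As a toy illustration, the following Python sketch computes the cosine similarity of two hypothetical 3-dimensional word vectors; real Word2Vec embeddings typically have hundreds of dimensions.

import numpy as np

def cosine_similarity(a, b):
    # 1.0 means the vectors point the same way; 0.0 means orthogonal (90 degrees).
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

king = np.array([0.9, 0.1, 0.4])   # toy 3-d embeddings for illustration only
queen = np.array([0.8, 0.2, 0.5])
print(cosine_similarity(king, queen))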
Word2vec is a two-layer neural net that processes text by “vectorizing” words. Its input
is a text corpus and its output is a set of vectors: feature vectors that represent words in that
corpus. While Word2vec is not a deep neural network, it turns text into a numerical form that
deep neural networks can understand.
Word2vec’s applications extend beyond parsing sentences in the wild. It can be applied
just as well to genes, code, likes, playlists, social media graphs and other verbal or symbolic
series in which patterns may be discerned.
Figure 5.4: Two models of Word2Vec (A- CBOW & B- Skip-Gram model)
Word2vec is similar to an autoencoder, encoding each word in a vector, but rather than
training against the input words through reconstruction, as a restricted Boltzmann machine does,
word2vec trains words against other words that neighbour them in the input corpus. It does so in
one of two ways: either using context to predict a target word (a method known as continuous
bag of words, or CBOW), or using a word to predict a target context, which is called skip-gram.
When the feature vector assigned to a word cannot be used to accurately predict that
word’s context, the components of the vector are adjusted. Each word’s context in the corpus is
the teacher sending error signals back to adjust the feature vector. The vectors of words judged
similar by their context are nudged closer together by adjusting the numbers in the vector.
Similar things and ideas are shown to be “close”. Their relative meanings have been
translated to measurable distances. Qualities become quantities, and algorithms can do their
work. But similarity is just the basis of many associations that Word2vec can learn. For example,
it can gauge relations between words of one language, and map them to another.
The main idea of Word2Vec is to design a model whose parameters are the word vectors,
and then to train the model on a certain objective. At every iteration we run the model, evaluate the
errors, and follow an update rule that penalizes the model parameters that caused the error. Thus,
we learn our word vectors.
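In practice this training loop rarely needs to be written by hand. The following is a minimal sketch using the Gensim library's Word2Vec implementation on a toy corpus; all parameter values are illustrative, not recommendations.

from gensim.models import Word2Vec

sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["the", "man", "walks", "the", "dog"],
]

model = Word2Vec(
    sentences,
    vector_size=50,   # dimensionality of the word vectors
    window=2,         # context window on each side of the target word
    min_count=1,      # keep even rare words in this tiny corpus
    sg=1,             # 1 = skip-gram, 0 = CBOW (the two models in Figure 5.4)
)

print(model.wv["king"])                        # the learned vector
print(model.wv.most_similar("king", topn=2))   # nearest neighbours by cosine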
Content Sources:
(1) https://towardsdatascience.com/introduction-to-word-embedding-and-word2vec-652d0c2060fa
(2) https://wiki.pathmind.com/word2vec
Figure 5.5: General representation of Bone Joint detection system
Figure 5.7: CNN based Knee Joint Detection Model
Figure 5.7 shows the full model of the joint detection procedure. The convolution filter
moves to the right with a certain stride value until it parses the complete width. It then
hops down to the beginning (left) of the image with the same stride value and repeats the
process until the entire image is traversed. The kernel has the same depth as the input
image. The objective of the convolution operation is to extract high-level features, such
as edges, from the input image. Stride is the number of pixels by which the filter shifts over the
input matrix. When the stride is 1, we move the filter 1 pixel at a time; when the stride is 2,
we move the filter 2 pixels at a time, and so on.
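A quick way to see the effect of stride is the standard output-size formula, output = floor((input - kernel + 2*padding) / stride) + 1, sketched below in Python:

def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    # Number of positions the kernel can occupy along one spatial dimension.
    return (input_size - kernel_size + 2 * padding) // stride + 1

# A 224x224 image with a 3x3 kernel:
print(conv_output_size(224, 3, stride=1))  # 222 -> filter shifts 1 pixel at a time
print(conv_output_size(224, 3, stride=2))  # 111 -> filter shifts 2 pixels at a time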
Pooling layers reduce the number of parameters when the images are too large.
Spatial pooling, also called subsampling or downsampling, reduces the
dimensionality of each feature map but retains the important information. This decreases the
computational power required to process the data.
Types of Pooling:
• Max Pooling
• Average Pooling
• Sum Pooling
After the final pooling layer (a sketch follows this list):
• The image is flattened into a column vector.
• The flattened output is fed to a feed-forward neural network, and backpropagation is
applied in every iteration of training.
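Here is a minimal NumPy sketch of 2x2 max pooling followed by flattening on a single-channel feature map; average pooling would simply take the mean of each block instead of the maximum.

import numpy as np

def max_pool_2x2(fmap):
    h, w = fmap.shape
    # Crop any odd row/column, split into 2x2 blocks, keep each block's maximum.
    return fmap[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

fmap = np.arange(16, dtype=float).reshape(4, 4)  # toy 4x4 feature map
pooled = max_pool_2x2(fmap)                      # -> 2x2 map, 4x fewer values
flat = pooled.reshape(-1)                        # column vector for the FC layer
print(pooled, flat)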
Over a series of epochs, the model becomes able to distinguish between dominating and
certain low-level features in images and classify them using the Softmax
classification technique. The feature map matrix is converted into a vector (x1, x2, x3, ...),
and these features are combined together to create the model.
Finally, an activation function such as softmax or sigmoid is used to classify the outputs as
Normal or Abnormal.
5.5.1 Steps Involved:
• Provide the input image to the convolution layer.
• Choose parameters and apply filters with strides, and padding if required. Perform
convolution on the image and apply ReLU activation to the resulting matrix.
• Perform pooling to reduce the dimensionality.
• Add as many convolutional layers as needed.
• Flatten the output and feed it into a fully connected layer (FC layer).
• Output the class using an activation function (logistic regression with a cost function)
and classify the image (a sketch of these steps follows).
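As a minimal sketch, the steps above can be assembled into a small Keras model for the Normal/Abnormal classification described earlier; every layer size below is an illustrative assumption, not a prescribed architecture.

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, 3, padding="same", activation="relu",
                           input_shape=(128, 128, 1)),  # convolution + ReLU
    tf.keras.layers.MaxPooling2D(2),                    # reduce dimensionality
    tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu"),
    tf.keras.layers.MaxPooling2D(2),
    tf.keras.layers.Flatten(),                          # column vector
    tf.keras.layers.Dense(64, activation="relu"),       # fully connected layer
    tf.keras.layers.Dense(1, activation="sigmoid"),     # Normal vs Abnormal
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])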
5.6. Other Applications:
Similarly, for other applications such as facial recognition and scene matching,
appropriate deep learning algorithms such as AlexNet, VGG, Inception and ResNet,
or LSTM/RNN-based networks, can be used. These networks should be explained
with the necessary diagrams and appropriate explanations.
Reference Books:
1. B. Yegnanarayana, “Artificial Neural Networks” Prentice Hall Publications.
2. Simon Haykin, “Neural Networks: A Comprehensive Foundation”, Second Edition, Pearson Education.
3. Laurene Fausett, “Fundamentals of Neural Networks, Architectures, Algorithms and
Applications”, Prentice Hall publications.
4. Cosma Rohilla Shalizi, Advanced Data Analysis from an Elementary Point of View, 2015.
5. Deng & Yu, Deep Learning: Methods and Applications, Now Publishers, 2013.
6. Ian Goodfellow, Yoshua Bengio, Aaron Courville, Deep Learning, MIT Press, 2016.
7. Michael Nielsen, Neural Networks and Deep Learning, Determination Press, 2015.
Note: For further reference, kindly refer to the class notes, PPTs and video lectures available
in the Learning Management System (Moodle).