Deep Learning
UNIT-I
SYLLABUS:
Introduction: Various paradigms of learning problems, Perspectives and Issues in deep
learning framework, review of fundamental learning techniques. Feed forward neural
network: Artificial Neural Network, activation function, multi-layer neural network
These paradigms provide different lenses through which to understand and study the
learning process, whether in the context of human cognition, education, or artificial
intelligence. Different paradigms may be more suitable for different types of learning
tasks or situations.
Perspectives:
1. Representation Learning:
• Perspective: Deep learning is often celebrated for its ability to
automatically learn hierarchical representations from data. This means that the
model can learn to extract meaningful features at different levels of
abstraction.
2. End-to-End Learning:
• Perspective: Deep learning enables end-to-end learning, where the
model learns to perform a task without explicit feature engineering. This can
simplify the development process and improve performance.
3. Scalability:
• Perspective: Deep learning models can scale with the amount of data
and computational resources available. This scalability is particularly
beneficial for handling large and complex datasets.
4. Transfer Learning:
• Perspective: Transfer learning is a powerful perspective in deep
learning, allowing models pre-trained on one task to be fine-tuned for another
task. This leverages knowledge gained from one domain to improve
performance in another.
5. Diversity of Architectures:
• Perspective: Deep learning encompasses a wide range of architectures,
including convolutional neural networks (CNNs) for image tasks, recurrent
neural networks (RNNs) for sequence tasks, and transformer architectures for
natural language processing. This diversity allows for specialized solutions
for different types of data and tasks.
1. Feedforward Neural Networks:
• The most basic type of neural network, with an input layer, one or more hidden layers, and an output layer. Feedforward neural networks serve as the foundation for more advanced architectures in deep learning and have applications in various domains, including image and speech recognition, natural language processing, and regression tasks (a minimal sketch follows after this list).
2. Recurrent Neural Networks (RNNs):
• Networks with connections that form cycles, allowing them to capture
sequential dependencies in data. Suitable for tasks involving time-series
data.
3. Convolutional Neural Networks (CNNs):
• Specialized networks designed for processing grid-like data, such as
images. They use convolutional layers to automatically learn hierarchical
features.
4. Radial Basis Function Networks (RBFNs):
• Networks that use radial basis functions as activation functions, often
employed in pattern recognition and approximation tasks.
5. Generative Adversarial Networks (GANs):
• A pair of networks, a generator and a discriminator, that are trained
simultaneously through adversarial training. GANs are used for generating new
data instances.
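For a concrete picture of the feedforward case above, here is a minimal NumPy sketch of a forward pass through a single hidden layer; the layer sizes, random weights, and sigmoid activation are illustrative assumptions, not taken from the notes.

```python
import numpy as np

def sigmoid(z):
    # Squashes each value into the range (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative sizes: 3 inputs -> 4 hidden units -> 2 outputs.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)   # hidden-layer weights and biases
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)   # output-layer weights and biases

x = np.array([0.5, -1.2, 3.0])                  # one input example
h = sigmoid(W1 @ x + b1)                        # hidden-layer activations
y = sigmoid(W2 @ h + b2)                        # network output
print(y)
```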
4. Activation Functions:
• Activation functions introduce non-linearity into a neural network, allowing it to learn complex patterns; common choices include sigmoid, tanh, ReLU, and softmax.
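A minimal sketch (illustrative, not from the syllabus text) of these common activation functions in NumPy:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # output in (0, 1)

def tanh(z):
    return np.tanh(z)                 # output in (-1, 1)

def relu(z):
    return np.maximum(0.0, z)         # zero for negative inputs

def softmax(z):
    e = np.exp(z - np.max(z))         # subtract max for numerical stability
    return e / e.sum()                # outputs sum to 1

z = np.array([-1.0, 0.0, 2.0])
print(sigmoid(z), tanh(z), relu(z), softmax(z))
```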
UNIT-II
6. Validation:
• Evaluate on Validation Set: Periodically assess the model's performance on
the validation set to avoid overfitting.
3. Regularization Techniques:
• L1 and L2 Regularization: Add regularization terms to the loss function to
penalize large weights. Helps prevent overfitting.
• Dropout: Randomly deactivate neurons during training to improve
generalization.
4. Early Stopping:
• Monitor the model's performance on the validation set during training and stop when it no longer improves, keeping the best weights seen so far.
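A minimal self-contained sketch of the early-stopping rule; the validation losses and patience value below are made-up numbers for illustration:

```python
# Illustrative validation losses over epochs (made-up numbers).
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.53, 0.51, 0.56, 0.58, 0.60]

best_loss, best_epoch, patience, bad_epochs = float("inf"), -1, 3, 0

for epoch, val_loss in enumerate(val_losses):
    if val_loss < best_loss:                 # improvement: remember this epoch's model
        best_loss, best_epoch, bad_epochs = val_loss, epoch, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:           # stop after 3 epochs with no improvement
            break

print(f"stop at epoch {epoch}, restore weights from epoch {best_epoch}")
```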
By applying these techniques, practitioners aim to strike a balance between fitting the
training data well and ensuring the neural network generalizes effectively to unseen
data. The selection and careful tuning of these components contribute to the success of
risk minimization during neural network training.
REGULARIZATION
Regularization is a technique used in machine learning to prevent overfitting and
improve the generalization of a model. Overfitting occurs when a model learns the
training data too well, including its noise and outliers, and performs poorly on new,
unseen data.
There are several types of regularization, and two common ones are L1 regularization
and L2 regularization:
1. L1 Regularization (Lasso): It adds the absolute values of the coefficients as
a penalty term to the objective function. It tends to produce sparse weight vectors,
encouraging some weights to become exactly zero, effectively performing feature
selection.
2. L2 Regularization (Ridge): It adds the squared values of the coefficients as
a penalty term to the objective function. It discourages large weights but does not
usually lead to sparse weight vectors.
Regularized Loss = Loss + λ × Regularization Term
Here, the Loss term represents the original loss function used for training the model.
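As a hedged illustration of how the penalty term is added in practice (the weight values, λ, and data loss below are made up for the example):

```python
import numpy as np

w = np.array([0.5, -2.0, 0.0, 3.0])    # illustrative model weights
lam = 0.01                              # regularization strength λ
data_loss = 0.42                        # placeholder value for the original loss

l1_penalty = lam * np.sum(np.abs(w))    # L1 (Lasso): encourages sparse weights
l2_penalty = lam * np.sum(w ** 2)       # L2 (Ridge): discourages large weights

loss_l1 = data_loss + l1_penalty        # Regularized Loss = Loss + λ × penalty
loss_l2 = data_loss + l2_penalty
print(loss_l1, loss_l2)
```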
CONDITIONAL RANDOM FIELDS (CRFs)
1. Features: Features are functions of the input and output variables that capture the dependencies in the data. The choice of features is crucial in CRFs, and they are used to define the energy function.
2. Training: CRFs are trained by maximizing the conditional likelihood of the
training data. Gradient-based optimization methods, such as stochastic gradient
descent, are often employed to estimate the parameters.
Conditional Random Fields have been successfully applied to various tasks, including
part-of-speech tagging, named entity recognition, image segmentation, and biological
sequence analysis. They provide a way to model dependencies between output
variables in a principled probabilistic framework.
LINEAR CHAIN:
A "linear chain" in the context of machine learning and graphical models typically
refers to a sequence of interconnected variables. This structure is commonly
encountered in tasks where there is a temporal or sequential relationship among the
variables. One of the most prevalent examples is in the field of natural language
processing (NLP), where words or tokens in a sentence follow a linear order.
When dealing with linear chains in the context of probabilistic graphical models, two
main types are often discussed: Hidden Markov Models (HMMs) and linear chain
Conditional Random Fields (CRFs).
1. Hidden Markov Models (HMMs): HMMs are a type of probabilistic model
that deals with sequences of observations. They assume that there is an underlying
sequence of hidden states, and each hidden state generates an observation. The
transitions between hidden states form a linear chain. HMMs are widely used for
tasks such as part-of-speech tagging, speech recognition, and bioinformatics.
2. Linear Chain Conditional Random Fields (CRFs): As mentioned earlier,
CRFs are probabilistic graphical models used for structured prediction tasks. Linear
chain CRFs specifically model dependencies in sequences. They are often applied to
problems where the output labels have a sequential or temporal order. For example,
in named entity recognition, the goal is to label each word in a sentence with its entity
type, and the dependencies between adjacent words form a linear chain.
The linear chain structure simplifies the modeling and inference processes, making it
computationally more feasible. Both HMMs and linear chain CRFs have been used in
various applications where sequential relationships need to be considered, and they
have proven effective in capturing dependencies within sequences of data.
PARTITION FUNCTION:
The partition function is a concept commonly used in statistical mechanics and
probability theory, particularly in the context of Gibbs distributions and Boltzmann
distributions. It plays a crucial role in determining the probabilities of different
states in a system.
1. Statistical Mechanics:
• In statistical mechanics, the partition function is associated with the
probability distribution of different microstates of a physical system. It is
denoted by Z and is defined as the sum (or integral) of the exponential of the
negative energy over all possible states of the system.
Mathematically, for a system with discrete energy levels, the partition function is
given by:
Z = ∑_(all states) e^(−βE)
• Here, E is the energy of a state, β is the inverse temperature, and the
sum is taken over all possible states of the system.
• The partition function is fundamental for calculating thermodynamic
properties such as free energy, entropy, and specific heat.
2. Probability Theory:
• In probability theory, particularly in the context of graphical models
like Markov Random Fields (MRFs) and Conditional Random Fields (CRFs),
the partition function is used to normalize the distribution over possible
configurations.
• In the case of CRFs, for example, the partition function ensures that
the probabilities assigned to all possible output sequences sum to 1. It helps in
defining a valid probability distribution over the output space.
• Mathematically, for a conditional distribution P(Y|X) in a CRF, the partition function is often denoted as Z(X); it sums the unnormalized scores over all possible output sequences Y, so that P(Y|X) = score(Y, X) / Z(X) is a valid distribution.
In both cases, the partition function serves to normalize the distribution and ensure that it represents a valid probability distribution. The specific form of the partition function depends on the context in which it is used, such as statistical mechanics or probabilistic graphical models.
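A small numerical sketch of the Boltzmann case above; the energy levels and temperature are invented values for illustration:

```python
import numpy as np

energies = np.array([0.0, 1.0, 2.0])     # illustrative energies of three states
beta = 1.0                                # inverse temperature β

weights = np.exp(-beta * energies)        # unnormalized Boltzmann weights e^(−βE)
Z = weights.sum()                         # partition function: sum over all states
probs = weights / Z                       # normalized probabilities

print(Z, probs, probs.sum())              # probs.sum() == 1.0
```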
MARKOV NETWORK
A Markov Network, also known as a Markov Random Field (MRF), is a type of
probabilistic graphical model that represents dependencies between variables in a
structured way. Markov Networks are commonly used in various fields, including
machine learning, computer vision, and statistical physics.
3. Cliques and Factorization:
• A clique is a subset of nodes such that there is an edge between every pair of nodes in the subset. The joint distribution is factorized as the product of factors defined over the cliques of the graph.
• Mathematically, for a set of variables X, the factor associated with a clique C is denoted as ϕ_C(X_C).
4. Potential Functions:
• The factors are often represented by potential functions, which assign a
non-negative value to each possible assignment of values to the variables in the
clique. The joint distribution is then proportional to the product of these
potential functions.
• For a set of variables X, the potential function associated with a clique C is often denoted as ϕ_C(X_C), and the joint distribution is given by the normalized product of these potential functions: P(X) = (1/Z) ∏_C ϕ_C(X_C), where Z is the partition function.
5. Global Markov Property:
• The global Markov property states that a variable is conditionally
independent of all other variables in the network given its neighbors. This is
a consequence of the local Markov property and the factorization of the joint
distribution.
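To make the clique factorization concrete, here is a tiny hypothetical example with two binary variables and a single pairwise potential; the potential values are invented for illustration:

```python
from itertools import product

# Illustrative pairwise potential φ(x1, x2) over two binary variables.
phi = {(0, 0): 3.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 3.0}   # favors agreement

Z = sum(phi[x] for x in product([0, 1], repeat=2))           # partition function
joint = {x: phi[x] / Z for x in phi}                         # P(x1, x2) = φ(x1, x2) / Z

print(Z, joint)   # the probabilities sum to 1
```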
BELIEF PROPAGATION
Belief Propagation (BP) is an algorithm used for making approximate inferences in
graphical models, such as Bayesian Networks and Markov Random Fields. It is
particularly useful for solving problems related to marginal probabilities and making
predictions in models with complex dependencies.
HIDDEN MARKOV MODELS (HMMs)
Here are the key components and concepts associated with Hidden Markov Models:
1. States:
• An HMM consists of a set of hidden states, which represent the
underlying, unobservable processes or conditions of the system. The system
is assumed to be in one of these states at any given time.
2. Transitions:
• Transitions between hidden states are governed by transition probabilities, which specify the likelihood of moving from one hidden state to another at each time step.
Hidden Markov Models are versatile and can be used for various applications, such as
speech recognition, part-of-speech tagging, bioinformatics (e.g., gene prediction), and
financial modeling. Their ability to model sequential data and handle uncertainty
makes them valuable in scenarios where the underlying processes are not directly
observable.
ENTROPY IN DEEP LEARNING
In the context of deep learning and neural networks, the concept of entropy is often
associated with two main areas: softmax layer and generative models.
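For instance, a hedged sketch (not from the notes) of the cross-entropy computed from a softmax output layer:

```python
import numpy as np

logits = np.array([2.0, 1.0, 0.1])        # illustrative raw scores from the output layer
target = np.array([1.0, 0.0, 0.0])        # one-hot label: the true class is the first one

exp = np.exp(logits - logits.max())       # softmax with numerical stability
probs = exp / exp.sum()

cross_entropy = -np.sum(target * np.log(probs))   # low when the true class gets high probability
print(probs, cross_entropy)
```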
UNIT-III
Deep learning:
Deep learning is a branch of machine learning which is based on artificial neural
networks. It is capable of learning complex patterns and relationships within data. In
deep learning, we don't need to explicitly program everything. It has become increasingly popular in recent years due to advances in processing power and the availability of large datasets. Deep learning is based on artificial neural networks (ANNs), also known as deep neural networks (DNNs). These neural networks are
inspired by the structure and function of the human brain’s biological neurons, and
they are designed to learn from large amounts of data.
1. Deep Learning is a subfield of Machine Learning that involves the use of
neural networks to model and solve complex problems. Neural networks are
modeled after the structure and function of the human brain and consist of layers
of interconnected nodes that process and transform data.
2. The key characteristic of Deep Learning is the use of deep neural networks,
which have multiple layers of interconnected nodes. These networks can learn
complex representations of data by discovering hierarchical patterns and features
in the data. Deep Learning algorithms can automatically learn and improve from
data without the need for manual feature engineering.
3. Deep Learning has achieved significant success in various fields, including
image recognition, natural language processing, speech recognition, and
recommendation systems. Some of the popular Deep Learning architectures
include Convolutional Neural Networks (CNNs), Recurrent Neural Networks
(RNNs), and Deep Belief Networks (DBNs).
4. Training deep neural networks typically requires a large amount of data and
computational resources. However, the availability of cloud computing and the
development of specialized hardware, such as Graphics Processing Units (GPUs),
has made it easier to train deep neural networks.
In summary, Deep Learning is a subfield of Machine Learning that involves the use
of deep neural networks to model and solve complex problems. Deep Learning has
achieved significant success in various fields, and its use is expected to continue to grow as more data and more powerful computing resources become available.
DROPOUT
Dropout refers to data, or noise, that is intentionally dropped from a neural network to improve processing and time to results.
A neural network is software attempting to emulate the actions of the human brain.
The human brain contains billions of neurons that fire electrical and chemical signals
to each other to coordinate thoughts and life functions. A neural network uses a
software equivalent of these neurons, called units. Each unit receives signals from
other units and then computes an output that it passes onto other neuron/units, or
nodes, in the network.
The challenge for software-based neural networks is they must find ways to reduce the
noise of billions of neuron nodes communicating with each other, so the networks'
processing capabilities aren't overrun. To do this, a network eliminates all
communications that are transmitted by its neuron nodes not directly related to the
problem or training that it's working on. The term for this neuron node elimination
is dropout.
Dropout layers
When data scientists apply dropout to a neural network, they consider the nature of
this random processing. They make decisions about which data noise to exclude and
then apply dropout to the different layers of a neural network as follows:
• Intermediate or hidden layers. These are the layers of processing after data ingestion. These layers are hidden because we can't exactly see what they do. The layers, which could be one or many, process data and then pass along intermediate -- but not final -- results that they send to other neurons for additional processing. Because much of this intermediate processing will end up as noise, data scientists use dropout to exclude some of it.
• Output layer. This is the final, visible processing output from all neuron
units. Dropout is not used on this layer.
[Figure omitted: the different layers of a neural network before and after dropout has been applied.]
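A minimal sketch of how dropout is commonly applied to a hidden layer during training (the so-called inverted-dropout form; the activations and drop rate are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
h = rng.normal(size=8)                 # illustrative hidden-layer activations
drop_rate = 0.5                        # probability of dropping each unit

mask = rng.random(h.shape) >= drop_rate          # randomly keep roughly half of the units
h_train = (h * mask) / (1.0 - drop_rate)         # rescale so the expected value is unchanged

# At inference time dropout is turned off and all units are used.
h_test = h
print(h_train, h_test)
```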
Examples and uses of dropout
Here's another real-world example that shows how dropout works: A biochemical
company wants to design a new molecular structure that will enable it to produce a
revolutionary form of plastic. The company already knows the individual elements
that will comprise the molecule. What it doesn't know is the correct formulation of
these elements.
CNN architecture
Convolutional Neural Network consists of multiple layers like the input layer,
Convolutional layer, Pooling layer, and fully connected layers.
The Convolutional layer applies filters to the input image to extract features, the
Pooling layer downsamples the image to reduce computation, and the fully
connected layer makes the final prediction. The network learns the optimal filters
through backpropagation and gradient descent.
How do convolutional neural networks work?
Convolutional neural networks are distinguished from other neural networks by their
superior performance with image, speech, or audio signal inputs. They have three
main types of layers, which are:
• Convolutional layer
• Pooling layer
• Fully-connected (FC) layer
The convolutional layer is the first layer of a convolutional network. While
convolutional layers can be followed by additional convolutional layers or pooling
layers, the fully-connected layer is the final layer. With each layer, the CNN increases
in its complexity, identifying greater portions of the image. Earlier layers focus on
simple features, such as colors and edges. As the image data progresses through the
layers of the CNN, it starts to recognize larger elements or shapes of the object until it
finally identifies the intended object.
Convolutional layer
The convolutional layer is the core building block of a CNN, and it is where the
majority of computation occurs. It requires a few components, which are input data, a
filter, and a feature map. Let’s assume that the input will be a color image, which is
made up of a matrix of pixels in 3D. This means that the input will have three
dimensions—a height, width, and depth—which correspond to RGB in an image. We
also have a feature detector, also known as a kernel or a filter, which will move across
the receptive fields of the image, checking if the feature is present. This process is
known as a convolution.
Note that the weights in the feature detector remain fixed as it moves across the
image, which is also known as parameter sharing. Some parameters, like the weight
values, adjust during training through the process of backpropagation and gradient
descent. However, there are three hyperparameters which affect the volume size of
the output that need to be set before the training of the neural network begins. These
include:
1. The number of filters affects the depth of the output. For example, three
distinct filters would yield three different feature maps, creating a depth of three.
2. Stride is the distance, or number of pixels, that the kernel moves over the input matrix. While stride values of two or greater are rare, a larger stride yields a smaller output.
3. Zero-padding is usually used when the filters do not fit the input image. This sets all elements that fall outside of the input matrix to zero, producing a larger or equally sized output. Common types of padding include valid padding (no padding), same padding (padding so that the output is the same size as the input), and full padding (extra padding that increases the size of the output).
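To illustrate how a filter slides over the input, here is a hedged sketch of a single-channel convolution; the 4x4 input, 2x2 filter, stride of 1, and absence of padding are made-up choices:

```python
import numpy as np

image = np.arange(16, dtype=float).reshape(4, 4)   # illustrative 4x4 single-channel input
kernel = np.array([[1.0, 0.0],
                   [0.0, -1.0]])                   # illustrative 2x2 feature detector
stride = 1

out_size = (image.shape[0] - kernel.shape[0]) // stride + 1   # 3x3 feature map here
feature_map = np.zeros((out_size, out_size))

for i in range(out_size):
    for j in range(out_size):
        patch = image[i*stride:i*stride+2, j*stride:j*stride+2]   # receptive field
        feature_map[i, j] = np.sum(patch * kernel)                # dot product with the filter

print(feature_map)
```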
Pooling layer
• Max pooling: As the filter moves across the input, it selects the pixel with the
maximum value to send to the output array. As an aside, this approach tends to
be used more often compared to average pooling.
• Average pooling: As the filter moves across the input, it calculates the
average value within the receptive field to send to the output array.
While a lot of information is lost in the pooling layer, it also has a number of benefits
to the CNN. They help to reduce complexity, improve efficiency, and limit risk of
overfitting.
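A small NumPy sketch of 2x2 max pooling and average pooling applied to a hypothetical 4x4 feature map:

```python
import numpy as np

fmap = np.arange(16, dtype=float).reshape(4, 4)   # illustrative 4x4 feature map

# Reshape into a grid of 2x2 windows, then reduce each window to a single value.
blocks = fmap.reshape(2, 2, 2, 2).swapaxes(1, 2)  # shape (2, 2, 2, 2)
max_pooled = blocks.max(axis=(2, 3))              # keeps the largest value in each window
avg_pooled = blocks.mean(axis=(2, 3))             # keeps the average of each window

print(max_pooled)   # 2x2 output
print(avg_pooled)
```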
Fully-connected layer
The name of the fully-connected layer aptly describes itself. As mentioned earlier, the
pixel values of the input image are not directly connected to the output layer in
partially connected layers. However, in the fully-connected layer, each node in the
output layer connects directly to a node in the previous layer.
This layer performs the task of classification based on the features extracted through
the previous layers and their different filters. While convolutional and pooling
layers tend to use ReLu functions, FC layers usually leverage a softmax activation
function to classify inputs appropriately, producing a probability from 0 to 1.
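Putting the three layer types together, here is a hedged sketch of a tiny CNN in PyTorch; the layer sizes, 32x32 RGB input, and 10 output classes are illustrative assumptions, and the notes do not prescribe a particular framework:

```python
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, kernel_size=3, padding=1)   # convolutional layer
        self.pool = nn.MaxPool2d(2)                             # 2x2 max pooling
        self.fc = nn.Linear(8 * 16 * 16, num_classes)           # fully-connected layer

    def forward(self, x):
        x = torch.relu(self.conv(x))          # ReLU after convolution
        x = self.pool(x)                      # downsample 32x32 -> 16x16
        x = x.flatten(1)                      # flatten for the FC layer
        return torch.softmax(self.fc(x), 1)   # class probabilities between 0 and 1

out = TinyCNN()(torch.randn(1, 3, 32, 32))    # one fake 32x32 RGB image
print(out.shape, out.sum())                   # probabilities sum to 1 per example
```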
A recurrent neural network (RNN) is a type of artificial neural network which uses
sequential data or time series data. These deep learning algorithms are commonly used
for ordinal or temporal problems, such as language translation, natural language
processing (nlp), speech recognition, and image captioning; they are incorporated into
popular applications such as Siri, voice search, and Google Translate. Like
feedforward and convolutional neural networks (CNNs), recurrent neural networks
utilize training data to learn. They are distinguished by their “memory” as they take
information from prior inputs to influence the current input and output. While
traditional deep neural networks assume that inputs and outputs are independent of
each other, the output of recurrent neural networks depends on the prior elements within the sequence.
The Recurrent Neural Network consists of multiple fixed activation function units,
one for each time step. Each unit has an internal state which is called the hidden
state of the unit. This hidden state signifies the past knowledge that the network
currently holds at a given time step. This hidden state is updated at every time step to
signify the change in the knowledge of the network about the past. The hidden state
is updated using the following recurrence relation:-
The formula for calculating the current state:
h_t = f(h_(t-1), x_t)
where:
h_t -> current state
h_(t-1) -> previous state
x_t -> input state
Formula for applying the activation function (tanh):
h_t = tanh(W_hh · h_(t-1) + W_xh · x_t)
where:
W_hh -> weight at recurrent neuron
W_xh -> weight at input neuron
The formula for calculating output:
y_t = W_hy · h_t
where:
y_t -> output
W_hy -> weight at output layer
These parameters are updated using Backpropagation. However, since RNN works
on sequential data here we use an updated backpropagation which is known as
Backpropagation through time.
5. Once all the time steps are completed the final current state is used to
calculate the output.
6. The output is then compared to the actual output i.e the target output and the
error is generated.
7. The error is then back-propagated to the network to update the weights and
hence the network (RNN) is trained using Backpropagation through time.
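A minimal NumPy sketch of the recurrence h_t = tanh(W_hh · h_(t-1) + W_xh · x_t) unrolled over a short made-up sequence; the sizes and random weights are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
W_xh = rng.normal(size=(4, 3)) * 0.1   # input -> hidden weights
W_hh = rng.normal(size=(4, 4)) * 0.1   # hidden -> hidden (recurrent) weights
W_hy = rng.normal(size=(2, 4)) * 0.1   # hidden -> output weights

xs = [rng.normal(size=3) for _ in range(5)]   # a toy sequence of 5 time steps
h = np.zeros(4)                               # initial hidden state

for x_t in xs:
    h = np.tanh(W_hh @ h + W_xh @ x_t)        # h_t = tanh(W_hh·h_(t-1) + W_xh·x_t)

y = W_hy @ h                                   # output from the final hidden state
print(y)
```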
Types Of RNN
There are four types of RNNs based on the number of inputs and outputs in the
network.
1. One to One
2. One to Many
3. Many to One
4. Many to Many
One to One
This type of RNN behaves the same as a simple neural network; it is also known as a Vanilla Neural Network. In this neural network, there is only one input and one output.
One To Many
In this type of RNN, there is one input and many outputs associated with it. One of
the most used examples of this network is Image captioning where given an image
we predict a sentence having Multiple words.
Many to One
In this type of network, many inputs are fed to the network at several states of the network, generating only one output. This type of network is used in problems like sentiment analysis, where we give multiple words as input and predict only the sentiment of the sentence as output.
Many to Many
In this type of neural network, there are multiple inputs and multiple outputs
corresponding to a problem. One Example of this Problem will be language
translation. In language translation, we provide multiple words from one language as
input and predict multiple words from the second language as output.
UNIT-IV
PROBABILISTIC NEURAL NETWORK (PNN)
The following are the major types of difficulties that researchers have attempted to address with PNN:
the hidden neuron’s category. The values for the class that the pattern neurons
represent are added together.
Decision Layer
The output layer compares the weighted votes accumulated in the pattern layer for
each target category and utilizes the largest vote to predict the target category.
Advantages
• PNNs are much faster to train than multilayer perceptron networks and are relatively insensitive to outliers.
Disadvantages
• When it comes to classifying new cases, PNNs are slower than multilayer
perceptron networks.
• PNN requires extra memory to store the model.