Unit 2
BACKPROPAGATION
Backpropagation computes the gradient of the loss function with respect to the network weights. It is far more efficient than naively computing the gradient with respect to each weight separately. This efficiency makes it practical to use gradient methods to train multi-layer networks and to update weights to minimize loss; variants such as gradient descent or stochastic gradient descent are commonly used.
It is used in many data-mining applications of neural networks, such as character recognition and signature verification.
Features of Backpropagation:
Working of Backpropagation:
Neural networks use supervised learning to generate output vectors from the input vectors that the network operates on. The network compares the generated output to the desired output and produces an error signal if the result does not match the desired output vector. It then adjusts the weights according to this error signal to bring the output closer to the desired one.
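To make this loop concrete, here is a minimal sketch of backpropagation for a single sigmoid neuron trained with a squared-error loss; the inputs, initial weights, target, and learning rate are illustrative values, not figures from the text.

    import numpy as np

    # Minimal sketch of backpropagation for one sigmoid neuron with a
    # squared-error loss. All values (x, w, b, target, lr) are illustrative.

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    x = np.array([0.5, 0.2])    # input vector
    w = np.array([0.1, -0.3])   # weights, initialized arbitrarily
    b = 0.0                     # bias
    target = 1.0                # desired output
    lr = 0.5                    # learning rate

    for step in range(100):
        # Forward pass: generate the output for the given input.
        y = sigmoid(np.dot(w, x) + b)
        # Compare the generated output with the desired output.
        error = y - target
        # Backward pass: the chain rule gives the gradient of the loss
        # with respect to each weight and the bias.
        dy_dz = y * (1.0 - y)            # derivative of the sigmoid
        grad_w = error * dy_dz * x
        grad_b = error * dy_dz
        # Adjust the weights to reduce the error.
        w -= lr * grad_w
        b -= lr * grad_b

    print(y)    # approaches the desired output of 1.0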
Backpropagation Algorithm:
Types of Backpropagation
Disadvantages:
It is sensitive to noisy data and irregularities; noisy data can lead to inaccurate results.
Performance is highly dependent on the input data.
Training can take a considerable amount of time.
The matrix-based approach is preferred over the mini-batch approach.
GRADIENT DESCENT
Gradient descent is an optimization algorithm commonly used in machine learning to minimize the cost or loss function during the training of a model. It is a numerical optimization algorithm that aims to find the optimal parameters of a neural network, its weights and biases, by minimizing a defined cost function.
The learning happens during backpropagation while training the neural-network-based model: gradient descent is used to optimize the weights and biases based on the cost function, which evaluates the difference between the actual and predicted outputs.
The cost function represents the discrepancy between the predicted output of the
model and the actual output. The goal of gradient descent is to find the set of
parameters that minimizes this discrepancy and improves the model’s
performance.
1. Batch gradient descent updates the model’s parameters using the entire training set in each iteration, while stochastic gradient descent updates the parameters using only one training sample at a time.
2. The cost function measures how well the model fits the training data and is
defined based on the difference between the predicted and actual values.
3. The gradient of the cost function is the derivative with respect to the
model’s parameters and points in the direction of the steepest ascent.
4. The algorithm starts with an initial set of parameters and updates them in
small steps to minimize the cost function.
5. In each iteration of the algorithm, the gradient of the cost function with
respect to each parameter is computed.
6. The gradient tells us the direction of the steepest ascent; by moving in the opposite direction, we move in the direction of the steepest descent.
7. The size of the step is controlled by the learning rate, which determines
how quickly the algorithm moves towards the minimum.
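The steps above can be sketched in a few lines of code. This is a minimal illustration of batch gradient descent on a one-variable linear regression with a mean-squared-error cost; the data, learning rate, and iteration count are placeholder values.

    import numpy as np

    # Minimal sketch of batch gradient descent for linear regression with a
    # mean-squared-error cost. The data and settings below are illustrative.

    X = np.array([1.0, 2.0, 3.0, 4.0])   # training inputs
    y = np.array([3.0, 5.0, 7.0, 9.0])   # targets (generated from y = 2x + 1)
    w, b = 0.0, 0.0                      # initial parameters
    learning_rate = 0.05
    n_iterations = 1000

    for _ in range(n_iterations):
        errors = (w * X + b) - y
        # Gradient of the cost with respect to each parameter
        # (the direction of steepest ascent).
        grad_w = 2.0 * np.mean(errors * X)
        grad_b = 2.0 * np.mean(errors)
        # Step in the opposite direction (steepest descent),
        # scaled by the learning rate.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b

    print(round(w, 2), round(b, 2))      # approaches w = 2, b = 1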
Large volumes of data are necessary for deep learning. Additionally, more accurate and powerful models need more parameters, which in turn call for more data.
When we hear "Big Data," we might wonder how it differs from the more
common "data." The term "data" refers to any unprocessed character or symbol
that can be recorded on media or transmitted via electronic signals by a computer.
Raw data, however, is useless until it is processed somehow.
Before we jump into the challenges of Big Data, let’s start with the five ‘V’s of Big
Data.
The Five ‘V’s of Big Data
Big Data is simply a catch-all term used to describe data too large and complex to store in traditional databases. The five ‘V’s of Big Data are Volume, Velocity, Variety, Veracity, and Value.
Challenges of Big Data
A. Storage
With vast amounts of data generated daily, the greatest challenge is storage
(especially when the data is in different formats) within legacy systems.
Unstructured data cannot be stored in traditional databases.
B. Processing
Processing big data refers to reading, transforming, extracting, and formatting useful information from raw information. Getting information in and out in unified formats continues to present difficulties.
C. Security
Security is a big concern for organizations. Non-encrypted information is at risk of
theft or damage by cyber-criminals. Therefore, data security professionals must
balance access to data against maintaining strict security protocols.
D. Finding and Fixing Data Quality Issues
Many of you are probably dealing with challenges related to poor data quality, but solutions are available. The following approaches help fix data problems:
A. Correct the information in the original database.
B. Repair the original data source to resolve any data inaccuracies.
C. Use highly accurate methods of determining who someone is, so that records can be matched reliably.
What is Overfitting?
You only get accurate predictions if the machine learning model generalizes to all
types of data within its domain. Overfitting occurs when the model cannot
generalize and fits too closely to the training dataset instead. Overfitting happens
due to several reasons, such as:
• The training data size is too small and does not contain enough data samples to
accurately represent all possible input data values.
• The training data contains large amounts of irrelevant information, called noisy
data.
• The model trains for too long on a single sample set of data.
• The model complexity is high, so it learns the noise within the training data.
Overfitting examples
Consider a use case where a machine learning model has to analyze photos and
identify the ones that contain dogs in them. If the machine learning model was
trained on a data set that contained majority photos showing dogs outside in parks ,
it may learn to use grass as a feature for classification, and may not recognize a
dog inside a room.
The best method to detect overfit models is to test them on more data that comprehensively represents the possible input data values and types.
Typically, part of the training data is used as test data to check for overfitting. A
high error rate in the testing data indicates overfitting. One method of testing for
overfitting is given below.
K-fold cross-validation
Cross-validation is one of the testing methods. In this method, data scientists
divide the training set into K equally sized subsets or sample sets called folds. The
training process consists of a series of iterations. During each iteration, the steps
are:
1. Keep one subset as the validation data and train the machine learning model
on the remaining K-1 subsets.
2. Observe how the model performs on the validation sample.
3. Score model performance based on output data quality.
Iterations repeat until you test the model on every sample set. You then average the
scores across all iterations to get the final assessment of the predictive model.
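A minimal sketch of this procedure using scikit-learn is shown below; the synthetic dataset, the logistic-regression model, and the choice of K = 5 are illustrative assumptions.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import KFold

    # Sketch of K-fold cross-validation; the data and model are placeholders.
    X, y = make_classification(n_samples=200, n_features=10, random_state=0)
    kfold = KFold(n_splits=5, shuffle=True, random_state=0)

    scores = []
    for train_idx, val_idx in kfold.split(X):
        # Train on K-1 folds and validate on the held-out fold.
        model = LogisticRegression(max_iter=1000)
        model.fit(X[train_idx], y[train_idx])
        scores.append(model.score(X[val_idx], y[val_idx]))

    # Average the per-fold scores for the final assessment.
    print(np.mean(scores))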
How can you prevent overfitting?
1. Early stopping
Early stopping pauses the training phase before the machine learning model learns the noise in the data. However, getting the timing right is important; if training is paused too early, the model will still not give accurate results.
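As a rough illustration, the sketch below applies patience-based early stopping to a one-parameter model trained on noisy synthetic data; the dataset, learning rate, and patience value are all illustrative.

    import numpy as np

    # Sketch of patience-based early stopping on a one-weight linear model.
    # The synthetic data, learning rate, and patience are illustrative.

    rng = np.random.default_rng(0)
    X = rng.normal(size=100)
    y = 3.0 * X + rng.normal(scale=0.5, size=100)   # noisy targets
    X_train, y_train = X[:80], y[:80]
    X_val, y_val = X[80:], y[80:]

    w, lr = 0.0, 0.01
    best_val, best_w = float("inf"), w
    patience, patience_left = 5, 5

    for epoch in range(500):
        # One gradient step on the training data.
        grad = 2.0 * np.mean((w * X_train - y_train) * X_train)
        w -= lr * grad
        # Check performance on held-out validation data.
        val_loss = np.mean((w * X_val - y_val) ** 2)
        if val_loss < best_val:
            best_val, best_w, patience_left = val_loss, w, patience
        else:
            patience_left -= 1
            if patience_left == 0:   # no improvement for several epochs in a row
                break                # stop before the model fits the noise

    print(best_w, best_val)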
2. Pruning
You might identify several features or parameters that impact the final prediction
when you build a model. Feature selection—or pruning—identifies the most
important features within the training set and eliminates irrelevant ones.
For example, to predict if an image is an animal or human, you can look at various
input parameters like face shape, ear position, body structure, etc. You may
prioritize face shape and ignore the shape of the eyes.
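A minimal sketch of feature selection with scikit-learn follows; the synthetic dataset and the choice to keep the five highest-scoring features are illustrative.

    from sklearn.datasets import make_classification
    from sklearn.feature_selection import SelectKBest, f_classif

    # Sketch of pruning via feature selection: keep only the most
    # informative features. The data and k=5 are illustrative.
    X, y = make_classification(n_samples=300, n_features=20,
                               n_informative=5, random_state=0)

    selector = SelectKBest(score_func=f_classif, k=5)
    X_reduced = selector.fit_transform(X, y)

    print(X.shape, "->", X_reduced.shape)       # (300, 20) -> (300, 5)
    print(selector.get_support(indices=True))   # indices of the kept features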
3. Regularization
Regularization is a collection of training/optimization techniques that seek to
reduce overfitting. These methods try to eliminate those factors that do not impact
the prediction outcomes by grading features based on importance.
For example, mathematical calculations apply a penalty value to features with
minimal impact. Consider a statistical model attempting to predict the housing
prices of a city in 20 years. Regularization would give a lower penalty value to
features like population growth and average annual income but a higher penalty
value to the average annual temperature of the city.
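As a small illustration, the sketch below adds an L2 (ridge) penalty to a mean-squared-error loss; the arrays and the penalty strength lam are illustrative, and L2 is just one of several regularization techniques.

    import numpy as np

    # Sketch of L2 (ridge) regularization: a penalty proportional to the
    # squared weights is added to the data loss, discouraging the model
    # from relying heavily on low-impact features. Values are illustrative.

    def regularized_loss(w, X, y, lam=0.1):
        data_loss = np.mean((X @ w - y) ** 2)   # how well the model fits
        penalty = lam * np.sum(w ** 2)          # grows with large weights
        return data_loss + penalty

    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 3))
    y = X @ np.array([2.0, 0.0, -1.0])          # second feature has no effect

    w = np.array([1.5, 0.8, -0.5])              # candidate weights
    print(regularized_loss(w, X, y))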
4. Ensembling
Ensembling combines predictions from several separate machine learning
algorithms. Some models are called weak learners because their results are often
inaccurate. Ensemble methods combine all the weak learners to get more accurate
results. They use multiple models to analyze sample data and pick the most
accurate outcomes.
The two main ensemble methods are bagging and boosting. Boosting trains
different machine learning models one after another to get the final result, while
bagging trains them in parallel.
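A brief sketch with scikit-learn is given below; the synthetic dataset and the shallow decision tree used as the weak learner are illustrative, and the estimator parameter name follows recent scikit-learn releases.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    # Sketch of the two main ensemble methods: bagging trains weak learners
    # in parallel on resampled data, boosting trains them one after another.
    X, y = make_classification(n_samples=500, n_features=20, random_state=0)
    weak_learner = DecisionTreeClassifier(max_depth=1)   # a weak learner

    bagging = BaggingClassifier(estimator=weak_learner, n_estimators=50,
                                random_state=0)
    boosting = AdaBoostClassifier(estimator=weak_learner, n_estimators=50,
                                  random_state=0)

    print("bagging :", cross_val_score(bagging, X, y, cv=5).mean())
    print("boosting:", cross_val_score(boosting, X, y, cv=5).mean())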
5. Data augmentation
Data augmentation is a machine learning technique that changes the sample data
slightly every time the model processes it. You can do this by changing the input
data in small ways.
When done in moderation, data augmentation makes the training samples appear unique to the model and prevents it from memorizing their specific characteristics. For example, you can apply transformations such as translation, flipping, and rotation to input images.
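For instance, a minimal sketch of such transformations with NumPy is shown below; the random 28x28 "image" is only a placeholder.

    import numpy as np

    # Sketch of simple image augmentation: each pass over the data applies
    # a small random transformation. The 28x28 image is a placeholder.

    rng = np.random.default_rng(0)
    image = rng.random((28, 28))

    def augment(img):
        choice = rng.integers(3)
        if choice == 0:
            return np.fliplr(img)                # horizontal flip
        if choice == 1:
            return np.rot90(img)                 # 90-degree rotation
        return np.roll(img, shift=2, axis=1)     # small translation

    augmented = [augment(image) for _ in range(4)]
    print([a.shape for a in augmented])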
Model Parameters:
Model parameters are configuration variables that are internal to the model, and a model learns them on its own. Examples include the weights or coefficients of the independent variables in a linear regression or SVM model, the weights and biases of a neural network, and the cluster centroids in clustering. Some key points about model parameters are as follows:
a) They are used by the model for making predictions.
b) They are learned by the model from the data itself
c) These are usually not set manually.
d) These are the part of the model and key to a machine learning Algorithm.
Model Hyperparameters:
Hyperparameters are parameters that are explicitly defined by the user to control the learning process. Some key points about model hyperparameters are as follows:
These are usually defined manually by the machine learning engineer.
One cannot know the exact best value of a hyperparameter for a given problem; the best value can be determined either by rule of thumb or by trial and error.
An example of a hyperparameter is the learning rate used when training a neural network.
Categories of Hyperparameters
Broadly hyperparameters can be divided into two categories, which are given
below:
A. Hyperparameter for Optimization
B. Hyperparameter for Specific Models
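To make the distinction between parameters and hyperparameters concrete, here is a minimal sketch, assuming a toy one-weight linear model and synthetic data: the learning rate is an optimization hyperparameter chosen by the user and tuned by trial and error, while the weight w is a model parameter learned from the data.

    import numpy as np

    # Hyperparameter (set by the user) vs. parameter (learned from data).
    # The data and candidate learning rates are illustrative.

    X = np.linspace(0.0, 1.0, 50)
    y = 4.0 * X                       # true relationship: w = 4

    def train(learning_rate, n_epochs=100):
        w = 0.0                       # model parameter: learned, not set manually
        for _ in range(n_epochs):
            grad = 2.0 * np.mean((w * X - y) * X)
            w -= learning_rate * grad
        return w, np.mean((w * X - y) ** 2)

    # Trial-and-error search over the learning-rate hyperparameter.
    for lr in (0.001, 0.01, 0.1, 1.0):
        w, loss = train(lr)
        print(f"learning_rate={lr}: learned w={w:.3f}, loss={loss:.5f}")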
Black Box:-
Neural networks, particularly deep neural networks, are often described as black boxes because their internal workings can be complex and difficult to interpret. Here are a few reasons why neural networks are referred to as black boxes:
Complexity of Internal Operations: Neural networks consist of many layers and nodes, each with numerous parameters. The interactions and transformations that occur within the network during the learning process are complex and not easily understandable. The sheer number of parameters and the non-linear nature of the operations make it challenging to intuitively grasp how the network arrives at a specific output.
Lack of Interpretability: Understanding why a neural network makes a specific decision or prediction can be difficult. While it may be possible to observe the input and output, the exact reasons for the network's decision can be hard to pin down. This lack of interpretability is why the network is treated as a black box whose internal processes are not transparent.
High Dimensionality: Neural networks often operate in high-dimensional
spaces, making it impractical for humans to visualize or comprehend the
relationships between inputs and outputs. As a result, the inner workings of
the network remain obscure.
Non-linearity: Neural networks apply non-linear transformations to the
input data, and this non-linearity contributes to the complexity of their
behavior. Understanding how small changes in input affect the output is not
straightforward due to these non-linear transformations.
Learning from Data: Neural networks learn from data, adjusting their
parameters based on patterns and relationships within the training set. While
this ability to learn complex patterns is a strength, it also means that the
learned representations may not be easily interpretable by humans.
Lack of Flexibility:-
Deep learning is generally known for its flexibility and ability to learn complex representations from data. Deep neural networks, which form the foundation of deep learning, can automatically extract hierarchical features from raw input data, making them suitable for a wide range of tasks such as image recognition, natural language processing, and reinforcement learning. Even so, this flexibility has limits:
1. Data Dependence: Deep learning models often require large amounts of labeled data to perform well. This dependency on data can be seen as a limitation, especially in scenarios where obtaining labeled data is expensive or time-consuming.
Multitasking:-
Deep learning models can be adapted for multitasking or handling multiple tasks
simultaneously. Multitasking in deep learning refers to training a model to perform
more than one distinct task using a shared set of parameters. There are a few ways
in which deep learning models can be designed to support multitasking:
2. Joint Training: Instead of training separate models for each task, a deep
learning model can be trained jointly on multiple tasks. During training, the model
is presented with data from all tasks, and the optimization process updates the
shared parameters to improve performance on all tasks simultaneously.
3. Transfer Learning: Transfer learning involves training a model on one task and
then transferring the learned knowledge to another related task. This can be
considered a form of multitasking, where the model leverages knowledge gained
from one task to improve performance on another. Pre-trained models, such as
those trained on large image datasets, are often fine-tuned for specific tasks in this
way.
Neuron
A neuron is a fundamental unit of a neural network, inspired by the structure and functioning of biological neurons in the human brain. Neurons in deep learning are also referred to as nodes or artificial neurons. They play a crucial role in processing information and making predictions within a neural network.
Here is a basic overview of the key components of a neuron in deep learning:
1.Connection Strength:
A weight represents the strength or intensity of the connection between two
neurons. It determines the impact of the input signal from one neuron on the output
signal of another.
2.Learnable Parameter:
In the training phase of a neural network, the weights are learnable parameters.
They are initialized randomly and then adjusted iteratively during the training
process to minimize the difference between the predicted output and the actual
target output.
3. Influence on Neuron Activation:
The weighted sum of inputs to a neuron, including the associated weights, is
computed as part of the neuron's activation. This weighted sum is then passed
through an activation function to determine the neuron's output.
4.Role in Learning:
During the training process, the neural network adjusts the weights to minimize the
error in its predictions. This is typically done using optimization algorithms like
gradient descent, where the weights are updated in the direction that reduces the
error.
5.Modeling Relationships:
The weights allow the neural network to model complex relationships and patterns
in the input data. By adjusting the weights, the network can learn to give more or
less importance to specific features, capturing the underlying structure of the data.
6. Bias Term:
In addition to weights, a neuron may have an associated bias term. The bias allows
the neuron to produce an output even when all inputs are zero, providing flexibility
in modeling.
Bias
Bias is an additional parameter associated with each neuron in a neural network. While weights represent the strength of connections between neurons, biases provide neurons with the flexibility to produce an output even when all inputs are zero. The bias term allows the neural network to model more complex relationships and capture patterns that might not be evident from the raw input data alone.
Here are the key points about bias in deep learning:
1.Introduction of Offsets:
The bias term introduces an offset or constant value to the input of a neuron. This
is particularly useful when all the input values are zero, preventing the neuron from
being stuck at zero output.
2.Learnable Parameter:
Similar to weights, the bias is a learnable parameter that is adjusted during the
training process. The neural network learns the appropriate values for biases to
minimize the difference between its predictions and the actual target values.
3.Impact on Activation:
The bias term is added to the weighted sum of inputs before passing through the activation function. Mathematically, this can be expressed as:
output = f(w1*x1 + w2*x2 + ... + wn*xn + b)
where the w's are the weights, the x's are the inputs, b is the bias, and f is the activation function (see the code sketch after this list).
4.Flexibility in Modeling:
Biases provide each neuron with a certain degree of independence from the input
data. This flexibility is important for the network to adapt and capture relationships
that may not be clear in the raw input features.
5.Role in Training:
During the training process, biases are adjusted along with weights to optimize the
network's performance on a specific task. Optimization algorithms, such as
gradient descent, are used to update both weights and biases iteratively.
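The sketch below puts the pieces of this section together for a single neuron: a weighted sum of the inputs plus the bias, passed through a sigmoid activation; the input, weight, and bias values are illustrative.

    import numpy as np

    # Minimal sketch of one artificial neuron: weighted sum plus bias,
    # passed through an activation function. Values are illustrative.

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    x = np.array([0.0, 0.0, 0.0])     # inputs (all zero on purpose)
    w = np.array([0.4, -0.2, 0.7])    # connection strengths (weights)
    b = 0.5                           # bias term

    z = np.dot(w, x) + b              # weighted sum of inputs plus bias
    output = sigmoid(z)

    # Even with all-zero inputs, the bias shifts the weighted sum away
    # from zero, so the neuron is not stuck at sigmoid(0).
    print(output)                     # about 0.62 rather than 0.5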
Activation Function?
An activation function maps a neuron's inputs to its output, which is what lets a neural network find patterns and relationships in sets of data in a way loosely inspired by the human brain. Different activation functions are used depending on the desired impact and performance of the neural network. Activation functions are applied across the three kinds of layers in a network: the input layer, hidden layers, and the output layer.
Activation functions are important because they determine whether a neural network behaves linearly or non-linearly. They allow information to be represented in a way that lets patterns and relationships in data be extracted. Since not all data is linear, non-linear activation functions allow users to find patterns in multidimensional information, which makes the analysis of images, audio, and video possible.
Linear
A linear activation function is represented by f(x) = x; it simply passes its input through, so it cannot model complex data. This means that complex patterns and information cannot be found using linear activation functions alone. Linear functions are suited to simple sets of data that can be easily interpreted.
Binary Step
A binary step activation function outputs one of two values depending on whether its input crosses a threshold. It can handle somewhat more complex data than a linear function, but it cannot be used for multi-class classification problems.
Non-Linear
Non-linear functions are the most widely used and make it possible for a neural network to separate complex information. There are several kinds of non-linear functions, chosen depending on the results needed. The most common non-linear functions are Sigmoid, Tanh, and ReLU.
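A short sketch of these three functions in NumPy is given below; the sample inputs are illustrative.

    import numpy as np

    # The three most common non-linear activation functions.
    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))   # squashes values into (0, 1)

    def tanh(z):
        return np.tanh(z)                 # squashes values into (-1, 1)

    def relu(z):
        return np.maximum(0.0, z)         # zero for negative inputs

    z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
    print("sigmoid:", np.round(sigmoid(z), 3))
    print("tanh   :", np.round(tanh(z), 3))
    print("ReLU   :", relu(z))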
To update parameters, weights and biases are adjusted. A bias can be viewed as the weight on an extra input that is always assigned the value of 1. After the parameters are updated, the process is run again. Once the error is at a minimum, the model is ready to start predicting.
Types of Backpropagation
Static Backpropagation
Static backpropagation maps a static input directly to a static output; it is used for problems such as optical character recognition.
Recurrent Backpropagation
Recurrent backpropagation, used in areas such as data mining, feeds the activations forward until a fixed value is reached. Once the fixed value is found, the error is computed and then propagated backward through the network.
The difference between the two types of backpropagation is that static mapping is immediate, while recurrent backpropagation takes a longer time to map.