Unit-1 DL

Introduction to machine learning


Machine Learning (ML) is a subfield of artificial intelligence (AI) that enables computers to learn from data
and make decisions or predictions without being explicitly programmed. It involves creating algorithms that
can analyze and interpret patterns in data to improve over time. Here's a breakdown of the key concepts:

1. Types of Machine Learning:

• Supervised Learning: In this approach, the model is trained on labeled data, where the input data is
paired with the correct output. The goal is to learn the mapping from inputs to outputs. Example:
Predicting house prices based on features like location, size, etc.

• Unsupervised Learning: Here, the model is trained on data without explicit labels. It tries to find
hidden patterns or groupings in the data. Example: Clustering customers based on purchasing
behavior.

• Semi-supervised Learning: This method combines a small amount of labeled data with a large
amount of unlabeled data. It can improve learning accuracy compared to unsupervised learning
alone.

• Reinforcement Learning: This type involves an agent that learns to make decisions by interacting
with its environment. It receives feedback in the form of rewards or penalties based on its actions.
Example: Training an AI to play a game like chess.

2. Key Components:

• Data: The foundation of machine learning. Models learn from the data you provide, so the quality
and quantity of the data are critical.

• Model: A mathematical representation or algorithm that is trained on the data to make predictions
or decisions.

• Training: The process of teaching the model by feeding it data and adjusting it to improve accuracy.

• Testing: Evaluating the model's performance using new, unseen data to check how well it
generalizes to new inputs.

3. Common Algorithms:

• Linear Regression: Predicts continuous values based on a linear relationship between the input and
output.

• Decision Trees: A tree-like structure that helps in decision-making by splitting the data based on
different criteria.

• Support Vector Machines (SVM): Used for classification tasks, separating data into different classes
with a hyperplane.
• Neural Networks: A network of interconnected nodes (or neurons) that can model complex
relationships in the data, forming the foundation for deep learning.

• K-Nearest Neighbors (KNN): A simple classification algorithm that assigns a class based on the
majority class among the nearest neighbors.
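To make the supervised workflow above concrete, here is a minimal sketch of training and testing a Linear Regression model, assuming the scikit-learn library is available; the house sizes and prices are made-up toy values, not data from this unit:

# Fit a linear regression on toy house-price data (scikit-learn assumed available)
from sklearn.linear_model import LinearRegression

X = [[1000], [1500], [2000], [2500]]   # house size in square feet (toy training data)
y = [200000, 280000, 360000, 440000]   # observed prices (toy training data)

model = LinearRegression()
model.fit(X, y)                        # training: learn the linear mapping from size to price
print(model.predict([[1800]]))         # testing: predict the price of an unseen house

The same fit/predict pattern carries over to the other algorithms listed above (decision trees, SVMs, KNN).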

4. Applications of Machine Learning:

• Healthcare: Predicting diseases, drug discovery, medical image analysis.

• Finance: Fraud detection, stock price prediction, credit scoring.

• Retail: Recommendation systems, customer segmentation, demand forecasting.

• Autonomous Vehicles: Object detection, route optimization, self-driving technology.

• Natural Language Processing (NLP): Speech recognition, sentiment analysis, machine translation.

5. Challenges in Machine Learning:

• Overfitting: When a model performs well on training data but poorly on unseen data because it has
memorized the data instead of generalizing.

• Underfitting: When a model is too simple and cannot capture the underlying patterns in the data.

• Data Quality: Inaccurate, missing, or biased data can lead to poor model performance.

• Computational Resources: Some machine learning models, especially deep learning, require
significant computational power and time.

6. The Future of Machine Learning:

• With advancements in algorithms, computing power, and access to big data, machine learning is
continuously evolving. It holds the potential to revolutionize industries, automate complex tasks,
and drive innovations in AI.

Machine learning is a rapidly growing field with applications in virtually every sector, transforming how we
work, communicate, and solve problems.

Linear Models: An Overview


Linear models are a class of algorithms that make predictions based on a linear combination of input
features. They are foundational in machine learning and include models such as Support Vector Machines
(SVMs), Perceptrons, and Logistic Regression. Each model has its own characteristics and is suited to
specific types of problems.
1. Support Vector Machines (SVMs)
Support Vector Machine or SVM is one of the most popular Supervised Learning algorithms, which is used
for Classification as well as Regression problems. However, primarily, it is used for Classification problems in
Machine Learning.

The goal of the SVM algorithm is to create the best line or decision boundary that can segregate n-
dimensional space into classes so that we can easily put the new data point in the correct category in the
future. This best decision boundary is called a hyperplane.

SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases are
called support vectors, and hence the algorithm is termed Support Vector Machine. Consider the below
diagram in which two different categories are classified using a decision boundary or hyperplane:

Example: SVM can be understood with the example used for the KNN classifier. Suppose we see a strange
cat that also has some features of dogs, and we want a model that can accurately identify whether it is a
cat or a dog. Such a model can be created using the SVM algorithm. We first train the model with many
images of cats and dogs so that it learns their different features, and then test it on this strange creature.
The SVM creates a decision boundary between the two classes (cat and dog) and chooses the extreme cases
(support vectors) of each class. On the basis of the support vectors, it will classify the new example as a cat.
Consider the below diagram:
The SVM algorithm can be used for face detection, image classification, text categorization, etc.

Types of SVM

SVM can be of two types:

o Linear SVM: Linear SVM is used for linearly separable data. If a dataset can be classified into two
classes using a single straight line, the data is termed linearly separable, and the classifier used is
called a Linear SVM classifier.

o Non-linear SVM: Non-linear SVM is used for non-linearly separable data. If a dataset cannot be
classified using a straight line, the data is termed non-linear, and the classifier used is called a
Non-linear SVM classifier.

Hyperplane and Support Vectors in the SVM algorithm:

Hyperplane: There can be multiple lines/decision boundaries that segregate the classes in n-dimensional
space, but we need to find the best decision boundary for classifying the data points. This best boundary is
known as the hyperplane of SVM.

The dimension of the hyperplane depends on the number of features in the dataset: if there are 2 features
(as shown in the image), the hyperplane is a straight line, and if there are 3 features, the hyperplane is a
2-dimensional plane.

We always create the hyperplane with maximum margin, i.e., the maximum distance between the
hyperplane and the nearest data points of each class.

Support Vectors:

The data points or vectors that are closest to the hyperplane and affect its position are termed support
vectors. Since these vectors support the hyperplane, they are called support vectors.

How does SVM work?

Linear SVM:
The working of the SVM algorithm can be understood with an example. Suppose we have a dataset with
two tags (green and blue) and two features, x1 and x2. We want a classifier that can classify a pair (x1, x2)
of coordinates as either green or blue. Consider the below image:

Since this is a 2-D space, a straight line can easily separate the two classes. But there can be multiple lines
that separate these classes. Consider the below image:

Hence, the SVM algorithm helps to find the best line or decision boundary; this best boundary or region is
called a hyperplane. The SVM algorithm finds the points of both classes that lie closest to the boundary.
These points are called support vectors. The distance between the support vectors and the hyperplane is
called the margin, and the goal of SVM is to maximize this margin. The hyperplane with maximum margin is
called the optimal hyperplane.
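As a concrete illustration of the margin and support vectors described above, here is a hedged sketch using
scikit-learn's SVC (assumed available); the six (x1, x2) points and their blue/green labels are invented toy data:

# A linear SVM on toy 2-D points; the fitted hyperplane maximizes the margin
from sklearn.svm import SVC

X = [[1, 2], [2, 3], [3, 3], [6, 5], [7, 8], [8, 8]]   # toy (x1, x2) coordinates
y = [0, 0, 0, 1, 1, 1]                                 # 0 = blue, 1 = green (toy labels)

clf = SVC(kernel='linear')        # linear kernel: a straight-line decision boundary
clf.fit(X, y)
print(clf.support_vectors_)       # the points closest to the hyperplane
print(clf.predict([[4, 4]]))      # classify a new point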

Non-Linear SVM:

If the data is linearly arranged, we can separate it using a straight line, but for non-linear data we cannot
draw a single straight line. Consider the below image:

To separate these data points, we need to add one more dimension. For linear data we used the two
dimensions x and y, so for non-linear data we add a third dimension z, calculated as:

z = x² + y²

By adding the third dimension, the sample space becomes as shown in the image below:

SVM will now divide the dataset into classes in the following way. Consider the below image:

Since we are in 3-D space, the decision boundary looks like a plane parallel to the x-axis. If we convert it
back to 2-D space with z = 1, the boundary becomes a circle of radius 1. Hence we obtain a circular
boundary of radius 1 in the case of non-linear data.
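The dimension-adding trick described above can be sketched directly in NumPy (assumed available); the
inner and outer points below are illustrative, and the threshold on z stands in for the separating plane:

# The z = x^2 + y^2 mapping: circular 2-D data becomes linearly separable in 3-D
import numpy as np

X = np.array([[0.5, 0.2], [-0.3, 0.4],     # points near the origin (one class, toy data)
              [2.0, 1.5], [-1.8, 2.1]])    # points far from the origin (other class, toy data)

z = X[:, 0]**2 + X[:, 1]**2                # third dimension z = x^2 + y^2
print(z)                                   # small z for inner points, large z for outer points
print(z > 1.0)                             # a plane z = 1 in 3-D separates the classes,
                                           # which is the circle of radius 1 back in 2-D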

2. Perceptrons
Perceptron in Machine Learning

In Machine Learning and Artificial Intelligence, the Perceptron is one of the most commonly encountered
terms. It is a natural first step when learning Machine Learning and Deep Learning technologies, and it
consists of a set of weights, input values or scores, and a threshold. The Perceptron is a building block of an
Artificial Neural Network. It was invented by Frank Rosenblatt in the mid-20th century (1957) as a model
that performs simple calculations on input data. The Perceptron is a linear Machine Learning algorithm
used for supervised learning of binary classifiers. The algorithm processes the training examples one by
one, adjusting its weights as it learns. In this section, "Perceptron in Machine Learning," we will discuss the
Perceptron and its basic functions in brief. Let's start with a basic introduction to the Perceptron.
What is the Perceptron model in Machine Learning?

The Perceptron is a Machine Learning algorithm for supervised learning of binary classification tasks. A
Perceptron can also be understood as an artificial neuron or neural network unit that performs simple
computations on input data.

The Perceptron model is treated as one of the simplest types of Artificial Neural Networks, and it is a
supervised learning algorithm for binary classifiers. We can consider it a single-layer neural network with
four main parameters: input values, weights and bias, net sum, and an activation function.

What is Binary classifier in Machine Learning?

In Machine Learning, a binary classifier is a function that decides whether an input, represented as a vector
of numbers, belongs to one of two specific classes.

Binary classifiers are often linear classifiers. In simple words, a linear classifier is a classification algorithm
whose prediction is based on a linear predictor function combining a weight vector with the feature
vector.

Basic Components of Perceptron

Mr. Frank Rosenblatt invented the perceptron model as a binary classifier which contains three main
components. These are as follows:
o Input Nodes or Input Layer:

This is the primary component of Perceptron which accepts the initial data into the system for further
processing. Each input node contains a real numerical value.

o Weight and Bias:

The weight parameter represents the strength of the connection between units and is another important
component of the Perceptron. The larger a weight, the greater the influence of the associated input on the
output. The bias can be thought of as the intercept term in a linear equation.

o Activation Function:

These are the final and important components that help to determine whether the neuron will fire or not.
Activation Function can be considered primarily as a step function.

Types of Activation functions:

o Sign function

o Step function, and

o Sigmoid function
The choice of activation function depends on the problem at hand and the desired form of the outputs.
Different activation functions (e.g., Sign, Step, and Sigmoid) can be used in perceptron models, and the
choice affects properties such as how fast learning proceeds and whether gradients vanish or explode.

How does Perceptron work?

In Machine Learning, Perceptron is considered as a single-layer neural network that consists of four main
parameters named input values (Input nodes), weights and Bias, net sum, and an activation function. The
perceptron model begins with the multiplication of all input values and their weights, then adds these
values together to create the weighted sum. Then this weighted sum is applied to the activation function 'f'
to obtain the desired output. This activation function is also known as the step function and is represented
by 'f'.

This step function or Activation function plays a vital role in ensuring that output is mapped between
required values (0,1) or (-1,1). It is important to note that the weight of input is indicative of the strength of
a node. Similarly, an input's bias value gives the ability to shift the activation function curve up or down.

Perceptron model works in two important steps as follows:

Step-1

In the first step, multiply all input values with the corresponding weight values and add them to determine
the weighted sum. Mathematically, the weighted sum is calculated as follows:

∑wi*xi = x1*w1 + x2*w2 + … + xn*wn

A special term called the bias 'b' is added to this weighted sum to improve the model's performance:

∑wi*xi + b

Step-2

In the second step, an activation function is applied with the above-mentioned weighted sum, which gives
us output either in binary form or a continuous value as follows:
Y = f(∑wi*xi + b)
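The two steps can be written out as a minimal sketch in Python; the input values, weights, and bias below
are illustrative numbers, not values from the text:

# Perceptron forward pass: weighted sum plus bias, then a step activation
def step(value):
    return 1 if value > 0 else 0                               # activation function f

def perceptron(x, w, b):
    weighted_sum = sum(wi * xi for wi, xi in zip(w, x)) + b    # sum of wi*xi, plus bias b
    return step(weighted_sum)                                  # Y = f(sum(wi*xi) + b)

print(perceptron(x=[1.0, 0.5], w=[0.4, -0.2], b=0.1))          # illustrative inputs -> output 1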

Types of Perceptron Models

Based on the layers, Perceptron models are divided into two types. These are as follows:

1. Single-layer Perceptron Model

2. Multi-layer Perceptron model

Single Layer Perceptron Model:

This is one of the simplest types of Artificial Neural Networks (ANN). A single-layered perceptron model
consists of a feed-forward network and includes a threshold transfer function. The main objective of the
single-layer perceptron model is to classify linearly separable objects with binary outcomes.

In a single-layer perceptron model, the algorithm has no prior knowledge of the data, so it begins with
randomly initialized weight parameters. It then sums up all the weighted inputs. If this total is more than a
pre-determined threshold value, the model is activated and shows the output value as +1.

If the output matches the desired (pre-determined) value, the model's performance is considered
satisfactory and the weights are left unchanged. If the output is wrong, however, the weights must be
adjusted to reduce the error and reach the desired output.

"Single-layer perceptron can learn only linearly separable patterns."

Multi-Layered Perceptron Model:

A multi-layer perceptron model has the same basic structure as a single-layer perceptron model but
contains one or more hidden layers.

The multi-layer perceptron model is commonly trained with the Backpropagation algorithm, which executes
in two stages as follows:

o Forward Stage: Activations flow from the input layer through the hidden layers and terminate at
the output layer.

o Backward Stage: Weight and bias values are modified as required by the model. The error between
the actual and desired output is propagated backward, starting at the output layer and ending at
the input layer.

Hence, a multi-layered perceptron model can be considered an artificial neural network with multiple
layers in which the activation function need not be a simple step function as in a single-layer perceptron
model. Instead, non-linear activation functions such as sigmoid, TanH, or ReLU can be used.

A multi-layer perceptron model has greater processing power and can process linear and non-linear
patterns. Further, it can also implement logic gates such as AND, OR, XOR, NAND, NOT, XNOR, NOR.

Advantages of Multi-Layer Perceptron:

o A multi-layered perceptron model can be used to solve complex non-linear problems.


o It works well with both small and large input data.

o It helps us to obtain quick predictions after the training.

o It helps to obtain the same accuracy ratio with large as well as small data.

Disadvantages of Multi-Layer Perceptron:

o In Multi-layer perceptron, computations are difficult and time-consuming.

o In a multi-layer Perceptron, it is difficult to determine how much each independent variable affects
the dependent variable.

o The model functioning depends on the quality of the training.

Perceptron Function

The perceptron function 'f(x)' maps the input 'x' to an output using the learned weight coefficients 'w' and
the bias 'b'.

Mathematically, we can express it as follows:

f(x)=1; if w.x+b>0

otherwise, f(x)=0

o 'w' represents real-valued weights vector

o 'b' represents the bias

o 'x' represents a vector of input x values.

Characteristics of Perceptron

The perceptron model has the following characteristics.

1. Perceptron is a machine learning algorithm for supervised learning of binary classifiers.

2. In Perceptron, the weight coefficient is automatically learned.

3. Initially, weights are multiplied with input features, and the decision is made whether the neuron is
fired or not.

4. The activation function applies a step rule to check whether the weighted sum is greater than
zero.

5. The linear decision boundary is drawn, enabling the distinction between the two linearly separable
classes +1 and -1.

6. If the added sum of all input values is more than the threshold value, it must have an output signal;
otherwise, no output will be shown.

Limitations of Perceptron Model

A perceptron model has limitations as follows:


o The output of a perceptron can only be a binary number (0 or 1) due to the hard limit transfer
function.

o Perceptron can only be used to classify linearly separable sets of input vectors. If the input vectors
are not linearly separable, it is not easy to classify them correctly.

3. Logistic Regression
o Logistic regression is one of the most popular Machine Learning algorithms, which comes under the
Supervised Learning technique. It is used for predicting the categorical dependent variable using a
given set of independent variables.

o Logistic regression predicts the output of a categorical dependent variable; therefore the outcome
must be a categorical or discrete value. It can be Yes or No, 0 or 1, True or False, etc., but instead of
giving the exact values 0 and 1, it gives probabilistic values which lie between 0 and 1.

o Logistic Regression is very similar to Linear Regression except in how it is used: Linear Regression is
used for solving regression problems, whereas Logistic Regression is used for solving classification
problems.

o In Logistic regression, instead of fitting a regression line, we fit an "S" shaped logistic function, which
predicts two maximum values (0 or 1).

o The curve from the logistic function indicates the likelihood of something such as whether the cells
are cancerous or not, a mouse is obese or not based on its weight, etc.

o Logistic Regression is a significant machine learning algorithm because it has the ability to provide
probabilities and classify new data using continuous and discrete datasets.

o Logistic Regression can be used to classify the observations using different types of data and can
easily determine the most effective variables used for the classification. The below image shows the
logistic function:
Note: Logistic regression uses a regression-style predictive model, which is why it is called logistic
regression; however, because it is used to classify samples, it falls under classification algorithms.

Logistic Function (Sigmoid Function):

o The sigmoid function is a mathematical function used to map the predicted values to probabilities.

o It maps any real value into another value within a range of 0 and 1.

o The output value of logistic regression must be between 0 and 1 and cannot go beyond this limit,
so it forms a curve like the letter "S". The S-shaped curve is called the sigmoid function or the
logistic function.

o In logistic regression, we use a threshold value that defines the decision between 0 and 1: values
above the threshold are mapped to 1, and values below the threshold are mapped to 0.
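A small sketch of the sigmoid and the threshold rule just described; the input value 2.0 below is an
illustrative weighted sum, not a value from the text:

# Sigmoid maps any real value into (0, 1); a threshold of 0.5 turns it into a class label
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

probability = sigmoid(2.0)                        # e.g. z = w.x + b = 2.0 (illustrative)
predicted_class = 1 if probability >= 0.5 else 0  # above the threshold -> class 1
print(probability, predicted_class)               # about 0.88 -> 1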

Assumptions for Logistic Regression:

o The dependent variable must be categorical in nature.

o The independent variables should not exhibit multicollinearity (they should not be highly correlated
with each other).

Logistic Regression Equation:

The Logistic Regression equation can be obtained from the Linear Regression equation. The mathematical
steps to obtain it are given below:

o We know the equation of a straight line can be written as:

y = b0 + b1*x1 + b2*x2 + … + bn*xn

o In Logistic Regression y can only be between 0 and 1, so let's divide the above equation by (1−y):

y / (1−y);   0 for y = 0, and infinity for y = 1

o But we need a range between −infinity and +infinity, so taking the logarithm of the equation, it
becomes:

log[ y / (1−y) ] = b0 + b1*x1 + b2*x2 + … + bn*xn

The above equation is the final equation for Logistic Regression.

Type of Logistic Regression:

On the basis of the categories, Logistic Regression can be classified into three types:

o Binomial: In binomial Logistic regression, there can be only two possible types of the dependent
variables, such as 0 or 1, Pass or Fail, etc.

o Multinomial: In multinomial Logistic regression, there can be 3 or more possible unordered types of
the dependent variable, such as "cat", "dogs", or "sheep"

o Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered types of dependent
variables, such as "low", "Medium", or "High".

Neural Networks:
Neural networks are machine learning models that mimic the complex functions of the human brain. These
models consist of interconnected nodes or neurons that process data, learn patterns, and enable tasks such
as pattern recognition and decision-making.

In this article, we will explore the fundamentals of neural networks, their architecture, how they work,
and their applications in various fields. Understanding neural networks is essential for anyone interested
in the advancements of artificial intelligence.

Understanding Neural Networks in Deep Learning

Neural networks are capable of learning and identifying patterns directly from data without pre-defined
rules. These networks are built from several key components:

1. Neurons: The basic units that receive inputs, each neuron is governed by a threshold and an
activation function.

2. Connections: Links between neurons that carry information, regulated by weights and biases.

3. Weights and Biases: These parameters determine the strength and influence of connections.

4. Propagation Functions: Mechanisms that help process and transfer data across layers of neurons.

5. Learning Rule: The method that adjusts weights and biases over time to improve accuracy.
Learning in neural networks follows a structured, three-stage process:

1. Input Computation: Data is fed into the network.

2. Output Generation: Based on the current parameters, the network generates an output.

3. Iterative Refinement: The network refines its output by adjusting weights and biases, gradually
improving its performance on diverse tasks.

Layers in Neural Network Architecture

1. Input Layer: This is where the network receives its input data. Each input neuron in the layer
corresponds to a feature in the input data.

2. Hidden Layers: These layers perform most of the computational heavy lifting. A neural network can
have one or multiple hidden layers. Each layer consists of units (neurons) that transform the inputs
into something that the output layer can use.

3. Output Layer: The final layer produces the output of the model. The format of these outputs varies
depending on the specific task (e.g., classification, regression).

Shallow Network
A shallow network in machine learning refers to a neural network with a simple architecture, typically
consisting of one hidden layer between the input and output layers. It is in contrast to a deep network,
which has multiple hidden layers.

Key Characteristics of a Shallow Network:

1. Architecture:

o Input layer: Accepts the features of the data.

o Hidden layer: Performs computations to identify patterns in the data.

o Output layer: Produces predictions or classifications.

o Only one hidden layer is used in a shallow network.


2. Complexity:

o Limited capacity to model highly complex patterns.

o Suitable for problems where the relationship between input and output is relatively simple.

3. Training:

o Easier and faster to train compared to deep networks due to fewer parameters.

o Requires less computational power and memory.

4. Common Uses:

o Tasks with small datasets.

o Problems where the relationships are well-understood and simple, such as linear or slightly
nonlinear problems.

Advantages of Shallow Networks:

1. Efficiency: Fewer parameters and simpler structure make training faster.

2. Interpretability: Easier to understand and analyze compared to deep networks.

3. Lower Risk of Overfitting: With fewer parameters, there’s a reduced risk of overfitting on small
datasets.

Disadvantages of Shallow Networks:

1. Limited Representational Power: They struggle to capture complex patterns and hierarchical
features in data.

2. Scalability Issues: Not suitable for tasks requiring the extraction of high-level features, such as
image recognition or natural language processing.

3. Difficulty Handling Nonlinear Relationships: When relationships are highly nonlinear, shallow
networks may fail to converge or perform well.

Example of a Shallow Network:

A single-layer perceptron or a feedforward neural network with one hidden layer can be considered a
shallow network. For example:

• Input features: Age, income, and spending score.

• Output: Predicting whether a person will buy a product (Yes/No).


While shallow networks are foundational to neural networks, deep networks are often preferred for
modern applications requiring higher complexity and accuracy.

Shallow Network (Shallow neural networks):


A shallow neural network refers to a neural network that consists of only one hidden layer between the
input and output layers. This structure is simpler compared to deep neural networks that feature multiple
hidden layers. Despite their simplicity, shallow networks are powerful tools capable of approximating any
function, given sufficient neurons in the hidden layer—a property known as the universal approximation
theorem.

Components of a Shallow Neural Network

1. Input Layer: This is where the network receives its input data. Each neuron in this layer represents a
feature of the input dataset.

2. Hidden Layer: The single hidden layer in a shallow network transforms the inputs into something
that the output layer can use. The neurons in this layer apply a set of weights to the inputs and pass
them through an activation function to introduce non-linearity to the process.

3. Output Layer: The final layer produces the output of the network. For regression tasks, this might
be a single neuron; for classification, it could be multiple neurons corresponding to the classes.

How Do Shallow Neural Networks Work?

The functionality of shallow neural networks hinges on the transformation of inputs through the hidden
layer to produce outputs. Here's a step-by-step breakdown:

• Weighted Sum: Each neuron in the hidden layer calculates a weighted sum of the inputs.

• Activation Function: The weighted sums are passed through an activation function (such
as Sigmoid, Tanh, or ReLU) to introduce non-linearity, enabling the network to learn complex
patterns.

• Output Generation: The output layer integrates the signals from the hidden layer, often through
another set of weights, to produce the final output.

Training Shallow Neural Networks

Training a shallow neural network typically involves:

• Forward Propagation: Calculating the output for a given input by passing it through the layers of the
network.

• Loss Calculation: Determining how far the network's output is from the actual desired output using
a loss function.

• Backpropagation: Calculating the gradient of the loss function with respect to each weight in the
network, which informs how the weights should be adjusted to minimize the loss.

• Weight Update: Adjusting the weights using an optimization algorithm like gradient descent.
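The four training steps above can be sketched for a one-hidden-layer network in NumPy (assumed
available); the layer sizes, sample values, and learning rate are illustrative choices, not prescribed ones:

# One training step of a shallow network: forward pass, loss, backpropagation, weight update
import numpy as np

X = np.array([[0.5, 1.0]])                                # one sample with 2 input features (toy data)
y = np.array([[1.0]])                                     # its target output (toy data)

W1, b1 = np.random.randn(2, 3) * 0.1, np.zeros((1, 3))    # input layer -> hidden layer (3 neurons)
W2, b2 = np.random.randn(3, 1) * 0.1, np.zeros((1, 1))    # hidden layer -> output layer
lr = 0.1                                                  # learning rate (illustrative)

h = np.tanh(X @ W1 + b1)                                  # forward propagation: hidden activations
y_hat = h @ W2 + b2                                       # network output (regression)
loss = np.mean((y_hat - y) ** 2)                          # loss calculation (mean squared error)

d_yhat = 2 * (y_hat - y) / y.size                         # backpropagation via the chain rule
dW2, db2 = h.T @ d_yhat, d_yhat.sum(axis=0, keepdims=True)
d_h = (d_yhat @ W2.T) * (1 - h ** 2)                      # derivative of tanh is 1 - tanh^2
dW1, db1 = X.T @ d_h, d_h.sum(axis=0, keepdims=True)

W1, b1 = W1 - lr * dW1, b1 - lr * db1                     # weight update by gradient descent
W2, b2 = W2 - lr * dW2, b2 - lr * db2
print(loss)

Repeating this step over many samples and epochs gradually reduces the loss.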
Training a Network:
Training a neural network is the process of using training data to find the appropriate weights of the
network so that it creates a good mapping from inputs to outputs.

Training a network involves the following key ingredients:

1. Loss functions
2. Backpropagation
3. Stochastic gradient descent

1. Loss Function:
Loss Functions in Training Neural Networks

A loss function is a mathematical function that measures how well a machine learning model's predictions
match the actual target values. It quantifies the error or difference between the predicted and actual
values. The goal of training a neural network is to minimize this loss by adjusting the model's weights using
an optimization algorithm like gradient descent.

Purpose of a Loss Function

• Guides Model Training: It provides a scalar value that indicates how far the model's predictions are
from the ground truth.

• Optimization Target: During training, the optimization algorithm tries to minimize the loss function
by updating model parameters.

• Indicates Progress: Lower loss values generally signify better model performance on the training
data.

Types of Loss Functions

1. Regression Loss Functions

Used when the output is continuous (e.g., predicting house prices).

• Mean Squared Error (MSE):

MSE = (1/N) ∑ (y_i − ŷ_i)²   (sum over i = 1 to N)

o Penalizes larger errors more heavily.

o Suitable for tasks where large deviations are undesirable.

• Mean Absolute Error (MAE):

MAE = (1/N) ∑ |y_i − ŷ_i|   (sum over i = 1 to N)

o Treats all errors equally.


o More robust to outliers than MSE.

• Huber Loss: Combines MSE and MAE to be robust to outliers:

L = (1/2)(y − ŷ)²          if |y − ŷ| ≤ δ
L = δ|y − ŷ| − (1/2)δ²     otherwise

2. Classification Loss Functions

Used when the output is categorical (e.g., predicting whether an email is spam or not).

• Binary Cross-Entropy (Log Loss): Used for binary classification problems:

Loss = −(1/N) ∑ [ y_i log(ŷ_i) + (1 − y_i) log(1 − ŷ_i) ]   (sum over i = 1 to N)

o Penalizes incorrect confidence in predictions.

o Works well with outputs in the range [0, 1].

• Categorical Cross-Entropy: Generalization of binary cross-entropy for multi-class classification:

Loss = −(1/N) ∑_i ∑_j y_ij log(ŷ_ij)   (sum over i = 1 to N samples and j = 1 to C classes)

o Suitable for multi-class problems.

o Often used with softmax activation.

• Hinge Loss: Used for Support Vector Machines (SVMs) and margin-based classification:

Loss = ∑ max(0, 1 − y_i · ŷ_i)   (sum over i = 1 to N)

3. Custom Loss Functions

• Tailored for specific problems or tasks.

• Examples: IoU loss for object detection, style transfer losses, etc.
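The regression and classification formulas above can be evaluated directly with NumPy (assumed
available); the target, prediction, and probability arrays below are toy values:

# Computing MSE, MAE, and binary cross-entropy on toy predictions
import numpy as np

y_true = np.array([3.0, 5.0, 2.5])                 # actual values (toy data)
y_pred = np.array([2.5, 5.0, 4.0])                 # predicted values (toy data)
mse = np.mean((y_true - y_pred) ** 2)              # Mean Squared Error
mae = np.mean(np.abs(y_true - y_pred))             # Mean Absolute Error

labels = np.array([1, 0, 1])                       # binary class targets (toy data)
probs = np.array([0.9, 0.2, 0.6])                  # predicted probabilities in [0, 1]
bce = -np.mean(labels * np.log(probs) + (1 - labels) * np.log(1 - probs))   # Binary Cross-Entropy

print(mse, mae, bce)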

How Loss Functions Work in Training

1. Forward Pass:

o The network makes predictions based on current weights.

o The loss function computes the error between predictions and actual values.

2. Backward Pass:

o The loss is used to compute gradients with respect to model weights using backpropagation.

o Gradients guide how the weights should be updated.


3. Optimization:

o The optimization algorithm (e.g., stochastic gradient descent) minimizes the loss by updating
weights iteratively.

Choosing the Right Loss Function

• Regression Tasks: Use MSE or MAE depending on sensitivity to outliers.

• Binary Classification: Use Binary Cross-Entropy.

• Multi-Class Classification: Use Categorical Cross-Entropy.

• Specific Domains: Customize loss functions to align with domain-specific requirements.

Example

For a neural network predicting house prices:

• Input: Features like size, location, number of rooms.

• Output: Predicted price.

• Ground Truth: Actual price.

• Loss Function: Mean Squared Error (MSE) measures how far predictions are from actual prices.

The network learns to predict more accurate prices by minimizing the MSE over iterations.

2. Back propagation:
Backpropagation is a powerful algorithm in deep learning, primarily used to train artificial neural networks,
particularly feed-forward networks. It works iteratively, minimizing the cost function by adjusting weights
and biases.

In each epoch, the model adapts these parameters, reducing loss by following the error gradient.
Backpropagation often utilizes optimization algorithms like gradient descent or stochastic gradient
descent. The algorithm computes the gradient using the chain rule from calculus, allowing it to effectively
navigate complex layers in the neural network to minimize the cost function.
Fig. (a): A simple illustration of how backpropagation works by adjusting the weights.

Why is Backpropagation Important?

Backpropagation plays a critical role in how neural networks improve over time. Here's why:

1. Efficient Weight Update: It computes the gradient of the loss function with respect to each weight
using the chain rule, making it possible to update weights efficiently.

2. Scalability: The backpropagation algorithm scales well to networks with multiple layers and complex
architectures, making deep learning feasible.

3. Automated Learning: With backpropagation, the learning process becomes automated, and the
model can adjust itself to optimize its performance.
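To make the chain-rule idea concrete, here is a hedged one-neuron sketch that computes the gradient of a
squared loss with respect to a single weight and applies one update; all numbers are illustrative:

# Chain rule for one weight: loss -> prediction -> weighted input -> weight
x, y_true = 2.0, 1.0            # one input and its target (illustrative)
w, b = 0.5, 0.0                 # current weight and bias (illustrative)

z = w * x + b                   # weighted input
y_hat = z                       # identity activation, for simplicity
loss = (y_hat - y_true) ** 2    # squared error

dloss_dyhat = 2 * (y_hat - y_true)                 # dLoss/dy_hat
dyhat_dz = 1.0                                     # derivative of the identity activation
dz_dw = x                                          # dz/dw
dloss_dw = dloss_dyhat * dyhat_dz * dz_dw          # chain rule: dLoss/dw
w = w - 0.1 * dloss_dw                             # gradient-descent update with learning rate 0.1
print(loss, dloss_dw, w)

Backpropagation applies exactly this bookkeeping layer by layer, from the output back to the input.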

3. Stochastic gradient descent


Stochastic Gradient Descent is an optimization algorithm used to minimize a loss function during training
of machine learning models, particularly neural networks. It updates model parameters iteratively by
approximating the gradient of the loss function using a single or small batch of data points.

Gradient Descent: A Recap

The goal of gradient descent is to minimize the loss function L(θ), where θ represents the model
parameters (e.g., weights and biases). The update rule is:

θ ← θ − η · ∂L/∂θ

• η: Learning rate, controls the size of the step in the direction of the negative gradient.

• ∂L/∂θ: Gradient of the loss with respect to θ.

Variants of Gradient Descent

1. Batch Gradient Descent:

o Uses the entire dataset to compute the gradient in each iteration.

o Update rule: θ ← θ − η · (1/N) ∑ ∂L_i/∂θ   (sum over i = 1 to N)

o Pros:

▪ Provides stable convergence.

o Cons:

▪ Computationally expensive for large datasets.

▪ Memory-intensive.

2. Stochastic Gradient Descent (SGD):

o Uses a single random data point to compute the gradient in each iteration.

o Update rule: θ ← θ − η · ∂L_i/∂θ, where i is a randomly selected data point.

o Pros:

▪ Computationally efficient for large datasets.

▪ Can escape local minima due to noise in updates.

o Cons:

▪ Noisy updates may lead to less stable convergence.

3. Mini-Batch Gradient Descent:

o Uses a small subset (mini-batch) of data points to compute the gradient.

o Update rule: θ ← θ − η · (1/B) ∑ ∂L_i/∂θ   (sum over i = 1 to B), where B is the batch
size.

o Balances trade-offs: Less noisy than SGD, faster than batch gradient descent.
How SGD Works

1. Initialize Parameters: Start with random weights θ.

2. Random Sampling: Randomly shuffle the training data.

3. Update Parameters:

o For each training example (x_i, y_i), compute the gradient of the loss function with respect
to θ.

o Update weights using: θ ← θ − η · ∂L(x_i, y_i)/∂θ

4. Repeat until convergence or a stopping criterion is met (e.g., a fixed number of epochs).
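The loop above can be sketched for a simple linear model y = w*x + b with a squared loss; the dataset,
learning rate, and number of epochs below are illustrative:

# Stochastic gradient descent: one randomly ordered example per parameter update
import random

data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 8.1)]   # (x_i, y_i) pairs (toy data)
w, b, lr = 0.0, 0.0, 0.01                                  # initialize parameters; eta = 0.01

for epoch in range(100):
    random.shuffle(data)                                   # random sampling of the training data
    for x, y in data:
        y_hat = w * x + b
        grad_w = 2 * (y_hat - y) * x                       # gradient of (y_hat - y)^2 w.r.t. w
        grad_b = 2 * (y_hat - y)                           # gradient of (y_hat - y)^2 w.r.t. b
        w -= lr * grad_w                                   # theta <- theta - eta * gradient
        b -= lr * grad_b

print(w, b)   # for this toy data, w should end up near 2 and b near 0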

Advantages of SGD

1. Efficiency:

o Processes one example at a time, making it feasible for large datasets.

o Requires less memory than batch gradient descent.

2. Faster Updates:

o Updates weights more frequently, allowing for faster learning at the start.

3. Escapes Local Minima:

o The randomness introduces noise, helping SGD escape shallow local minima and explore a
broader parameter space.

Disadvantages of SGD

1. Noisy Updates:

o The randomness can cause fluctuations, making it harder to converge smoothly.

2. Tuning Challenges:

o Requires careful selection of the learning rate η.

o Sensitive to hyperparameters.

3. Suboptimal Convergence:

o May not reach the exact global minimum due to the inherent noise.

Improvements Over SGD


Modern optimizers build upon SGD to improve convergence and stability:

1. Momentum:

o Adds a fraction of the previous update to the current update, smoothing the trajectory:

v_t = γ · v_{t−1} + η · ∂L/∂θ
θ ← θ − v_t

2. RMSProp:

o Adapts the learning rate based on recent gradients.

3. Adam Optimizer:

o Combines momentum and RMSProp for efficient and robust updates.
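For momentum in particular, the update can be written out in a few lines; the coefficient, learning rate, and
gradient value below are illustrative:

# Momentum: the velocity accumulates past gradients, and the parameter moves by the velocity
gamma, eta = 0.9, 0.01          # momentum coefficient and learning rate (illustrative)
velocity, theta = 0.0, 1.0      # initial velocity and parameter (illustrative)
gradient = 0.5                  # dL/dtheta at the current theta (illustrative)

velocity = gamma * velocity + eta * gradient   # v_t = gamma * v_(t-1) + eta * dL/dtheta
theta = theta - velocity                       # theta <- theta - v_t
print(theta, velocity)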

Conclusion

Stochastic Gradient Descent is a cornerstone of machine learning optimization, offering computational
efficiency and simplicity. Despite its noisy updates, techniques like momentum and learning rate scheduling
make it a powerful tool for training neural networks effectively.

Neural networks as universal function approximators


Neural Networks as Universal Function Approximators

Neural networks are powerful computational models capable of approximating any function under certain
conditions. This concept is formally established in the Universal Approximation Theorem, which states that
a sufficiently large neural network can approximate any continuous function on a compact domain to an
arbitrary degree of accuracy, provided it has enough neurons and a suitable activation function.

Key Points of the Universal Approximation Theorem

1. Scope:

o Applies to feedforward neural networks with at least one hidden layer.

o The theorem guarantees the ability to approximate functions, but it does not specify the
efficiency of doing so.

2. Requirements:

o Nonlinear Activation Function: The hidden layer must use a non-linear activation function
(e.g., sigmoid, ReLU).

o Sufficient Neurons: The number of neurons in the hidden layer must be large enough to
capture the function's complexity.
3. Function Type:

o The theorem applies to continuous functions on compact (closed and bounded) subsets of R^n.

Intuition Behind the Theorem

Neural networks approximate a target function by learning a composition of simpler functions. Each layer
applies a linear transformation followed by a non-linear activation function, enabling the network to model
complex, non-linear relationships.

1. Linear Approximation:

o Without non-linear activations, a neural network can only represent linear functions.

2. Non-Linearity Adds Flexibility:

o The non-linear activations allow the network to "bend" the input space and approximate
more complex functions.

3. Layer Composition:

o By stacking layers, the network builds hierarchical representations, progressively learning
features of increasing abstraction.

Implications of the Theorem

1. Expressive Power:

o Neural networks can approximate any function, but the complexity (e.g., the number of
neurons and layers) depends on the target function.

2. Practical Considerations:

o Parameter Explosion: A single-layer network may need an extremely large number of
neurons to approximate complex functions.

o Depth vs. Width: Adding depth (more layers) can achieve the same approximation with
fewer neurons per layer, often more efficiently.

3. Approximation vs. Generalization:

o While neural networks can approximate any function, training them to generalize well on
unseen data is a separate challenge.

Proof Sketch
The proof typically uses the sigmoid activation function and shows that linear combinations of sigmoidal
functions can approximate any continuous function:

1. The sigmoid function can approximate a step function.

2. Step functions can be combined to form piecewise constant functions.

3. Piecewise constant functions can approximate continuous functions as the steps become smaller.
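The first step of this sketch can be checked numerically: as the steepness k grows, sigmoid(k*(x − c))
approaches a step at x = c. A small NumPy illustration (the values of k and c are arbitrary):

# A steep sigmoid approximates a step function
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.linspace(-1, 1, 5)                         # a few sample points around the step location
c = 0.0                                           # step location
for k in (1, 10, 100):                            # increasing steepness
    print(k, np.round(sigmoid(k * (x - c)), 3))   # outputs move toward 0 or 1 as k grows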

Example

Suppose you want to approximate f(x) = sin(x) over the interval [0, π]:

• A neural network with a single hidden layer and enough neurons can represent sin(x) by
combining multiple sigmoidal or ReLU-based activations.

• The approximation improves as the number of neurons increases.
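A hedged sketch of this example using scikit-learn's MLPRegressor (assumed available); the hidden-layer
size, activation, and iteration count are illustrative choices:

# Approximating sin(x) on [0, pi] with a single-hidden-layer network
import numpy as np
from sklearn.neural_network import MLPRegressor

X = np.linspace(0, np.pi, 200).reshape(-1, 1)     # sample points in [0, pi]
y = np.sin(X).ravel()                             # target function values

net = MLPRegressor(hidden_layer_sizes=(50,), activation='tanh', max_iter=5000, random_state=0)
net.fit(X, y)
print(np.max(np.abs(net.predict(X) - y)))         # worst-case error; shrinks with more neurons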

Challenges

1. Training:

o Convergence to the target function requires effective optimization algorithms.

o Poor initialization or hyperparameter tuning can prevent successful learning.

2. Overfitting:

o A neural network that approximates the training data perfectly might fail to generalize to
unseen data.

3. Computational Cost:

o Large networks with many neurons or layers require significant computational resources.

Neural Networks in Practice

• While the Universal Approximation Theorem provides theoretical assurance, practical
implementation focuses on balancing:

o Model Complexity: Sufficient to capture the target function.

o Training Efficiency: Achieving convergence within reasonable time and resources.

o Generalization: Performing well on unseen data.

• Modern architectures like Convolutional Neural Networks (CNNs) and Transformers exploit this
universal approximation capability in specialized ways for tasks like image recognition, language
modeling, and more.
Conclusion

Neural networks as universal function approximators demonstrate their theoretical potential to model any
problem domain. However, practical success relies on architectural design, training strategies, and data
quality. This powerful characteristic underpins their widespread use in modern machine learning
applications.
