
UNIT – II

Supervised Learning – I (Regression/Classification)
Regression models: Simple Linear Regression, Multiple Linear Regression. Cost Function, Gradient Descent, Performance Metrics: Mean Absolute Error (MAE), Mean Squared Error (MSE), R-Squared Error, Adjusted R-Squared. Classification models: Decision Trees (ID3, CART), Naive Bayes, K-Nearest Neighbours (KNN), Logistic Regression, Multinomial Logistic Regression, Support Vector Machines (SVM) – Nonlinearity and Kernel Methods.

Linear regression:
The linear regression algorithm models a linear relationship between a dependent variable (y) and one or more independent variables (x), hence the name linear regression.
The linear regression model provides a sloped straight line representing the relationship between the variables.

Mathematically, we can represent a linear regression as:


y = a0 + a1x + ε
Here,
y = dependent variable (target variable)
x = independent variable (predictor variable)
a0 = intercept of the line (gives an additional degree of freedom)
a1 = linear regression coefficient (scale factor applied to each input value)
ε = random error
The values of the x and y variables are the training data used to fit the linear regression model.
Regression Models
Linear regression can be further divided into two types of algorithm:
o Simple Linear Regression:
If a single independent variable is used to predict the value of a numerical dependent
variable, then such a Linear Regression algorithm is called Simple Linear Regression.
o Multiple Linear regression:
If more than one independent variable is used to predict the value of a numerical
dependent variable, then such a Linear Regression algorithm is called Multiple Linear
Regression.
2.1 Linear regression
In simple terms, linear regression answers the question "How can I use X to predict Y?", where X is some information that you have and Y is some information that you want.
Say you want to sell a house and want to know how much you can sell it for. The information you have about the house is your X, and the selling price you want to estimate is your Y.
Linear regression creates an equation into which you input your given numbers (X) and which outputs the target variable that you want to find (Y).
Linear Regression model representation
Linear regression is such a useful and established algorithm that it is both a statistical model and a machine learning model. Linear regression tries to draw a best-fit line that is close to the data by finding the slope and intercept.
The linear regression equation is:
y = a + bx
In this equation:
• y is the output variable. It is also called the target variable in machine learning or the
dependent variable.
• x is the input variable. It is also referred to as the feature in machine learning or it is
called the independent variable.
• a is the constant (the intercept)
• b is the coefficient of the independent variable (the slope)
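
The intercept a and coefficient b can be estimated from data by ordinary least squares. Below is a minimal sketch in Python using NumPy; the house-size and price numbers are made up for illustration.

import numpy as np

# Illustrative training data: house size in square metres (x) vs. selling price (y)
x = np.array([50, 60, 80, 100, 120], dtype=float)
y = np.array([150, 180, 230, 285, 340], dtype=float)

# Ordinary least squares estimates of the slope (b) and intercept (a)
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()

print(f"y = {a:.2f} + {b:.2f}x")
print("predicted price for a 90 m^2 house:", a + b * 90)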
2.2 Multiple linear regression
Multiple Linear Regression assumes there is a linear relationship between two or more
independent variables and one dependent variable.
The Formula for multiple linear regression:
Y = B0 + B1X1 + B2X2 + … + BnXn + e
• Y = the predicted value of the dependent variable
• B0 = the y-intercept (value of y when all other parameters are set to 0)
• B1X1= the regression coefficient (B1) of the first independent variable (X1)
• BnXn = the regression coefficient of the last independent variable
• e = model error
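
As a sketch, the coefficients B0…Bn can be estimated with scikit-learn's LinearRegression (assuming scikit-learn is available); the two-feature data below is made up for illustration.

import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative data: two independent variables (X1, X2) and one dependent variable Y
X = np.array([[1, 2], [2, 1], [3, 4], [4, 3], [5, 5]], dtype=float)
Y = np.array([6, 7, 14, 15, 20], dtype=float)

model = LinearRegression().fit(X, Y)
print("B0 (intercept):", model.intercept_)
print("B1..Bn (coefficients):", model.coef_)
print("prediction for X1=2, X2=3:", model.predict([[2, 3]]))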
2.3 Cost-function
The cost function is defined as the measurement of the difference, or error, between the actual values and the values predicted by the model at its current parameter values; it is expressed as a single real number.
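
For linear regression, a common cost function is the mean squared error cost J(a0, a1) = (1/2m) · Σ(ŷᵢ − yᵢ)², where m is the number of training examples. A minimal sketch (the 1/2 factor is a convention that simplifies the gradient; all names are illustrative):

import numpy as np

def mse_cost(a0, a1, x, y):
    """Mean squared error cost for simple linear regression y ≈ a0 + a1*x."""
    m = len(x)
    predictions = a0 + a1 * x
    return np.sum((predictions - y) ** 2) / (2 * m)

# Example: a perfect fit has zero cost
x = np.array([1.0, 2.0, 3.0])
y = np.array([3.0, 6.0, 9.0])
print(mse_cost(0.0, 3.0, x, y))   # 0.0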
2.4 Gradient Descent
Gradient descent is one of the most commonly used optimization algorithms for training machine learning models; it works by iteratively minimizing the error between actual and predicted results. Gradient descent is also used to train neural networks.

2.4.1 Types of Gradient Descent


Based on the error in various training models, the Gradient Descent learning algorithm can be
divided into Batch gradient descent, stochastic gradient descent, and mini-batch
gradient descent. Let's understand these different types of gradient descent:
1. Batch Gradient Descent:

Batch gradient descent (BGD) computes the error for each point in the training set and updates the model only after evaluating all training examples; this procedure is known as a training epoch. In simple words, we have to sum over all examples for each update.
Advantages of Batch gradient descent:
o It produces less noise in comparison to the other gradient descent variants.
o It produces stable gradient descent convergence.
o It is computationally efficient, as all resources are used across all training samples.

2. Stochastic gradient descent

Stochastic gradient descent (SGD) is a type of gradient descent that runs one training
example per iteration.
3. MiniBatch Gradient Descent:

Mini-batch gradient descent combines batch gradient descent and stochastic gradient descent: it splits the training dataset into small batches and performs an update after each batch.
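
As a sketch, the three variants differ only in how many examples are used per parameter update. In the illustrative function below, the batch_size parameter selects the variant: batch_size = m gives batch gradient descent, batch_size = 1 gives stochastic gradient descent, and values in between give mini-batch gradient descent.

import numpy as np

def gradient_descent(x, y, lr=0.01, epochs=100, batch_size=None):
    """Fit y ≈ a0 + a1*x by gradient descent on the MSE cost."""
    m = len(x)
    batch_size = batch_size or m          # default: batch gradient descent
    a0, a1 = 0.0, 0.0
    for _ in range(epochs):
        idx = np.random.permutation(m)    # shuffle once per epoch
        for start in range(0, m, batch_size):
            batch = idx[start:start + batch_size]
            xb, yb = x[batch], y[batch]
            error = (a0 + a1 * xb) - yb
            # Gradients of the MSE cost with respect to a0 and a1
            a0 -= lr * error.mean()
            a1 -= lr * (error * xb).mean()
    return a0, a1

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])
print(gradient_descent(x, y, lr=0.05, epochs=500))   # approaches a0 = 1, a1 = 2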

• Mean Absolute Error (MAE) represents the average of the absolute differences between the original and predicted values in the data set.

• Mean Squared Error (MSE) represents the average of the squared differences between the original and predicted values in the data set. It measures the variance of the residuals.

• Root Mean Squared Error (RMSE) is the square root of the Mean Squared Error. It measures the standard deviation of the residuals.

• The coefficient of determination, or R-squared, represents the proportion of the variance in the dependent variable that is explained by the linear regression model. It is a scale-free score, i.e., irrespective of whether the values are small or large, the value of R-squared will be at most one.

• Adjusted R-squared is a modified version of R-squared that is adjusted for the number of independent variables in the model; it will always be less than or equal to R². In the formula below, n is the number of observations in the data and k is the number of independent variables:

Adjusted R² = 1 − [(1 − R²)(n − 1) / (n − k − 1)]

2.5 Evaluation Metrics


• Mean Squared Error (MSE) and Root Mean Squared Error (RMSE) penalize large prediction errors more heavily vis-à-vis Mean Absolute Error (MAE). However, RMSE is more widely used than MSE to evaluate the performance of a regression model against other models, as it has the same units as the dependent variable (Y-axis).
• MSE is a differentiable function, which makes it easy to perform mathematical operations on, in comparison to a function like MAE that is not differentiable everywhere. Therefore, in many models RMSE is used as the default metric for the loss function, despite being harder to interpret than MAE.
• Lower values of MAE, MSE, and RMSE imply higher accuracy of a regression model, whereas a higher value of R-squared is considered desirable.
• R-squared and Adjusted R-squared are used to explain how well the independent variables in the linear regression model explain the variability in the dependent variable. The R-squared value always increases with the addition of independent variables, which might lead to redundant variables being added to the model. Adjusted R-squared solves this problem.
• Adjusted R squared takes into account the number of predictor variables, and it is used
to determine the number of independent variables in our model. The value of Adjusted R
squared decreases if the increase in the R square by the additional variable isn’t
significant enough.

• For comparing the accuracy among different linear regression models, RMSE is a
better choice than R Squared.
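
As a sketch, all of these metrics can be computed directly with NumPy (scikit-learn also provides ready-made versions); the values below are illustrative.

import numpy as np

y_true = np.array([3.0, 5.0, 7.5, 9.0, 11.0])
y_pred = np.array([2.8, 5.4, 7.0, 9.5, 10.5])

n, k = len(y_true), 1                       # k = number of independent variables (assumed 1 here)
mae = np.mean(np.abs(y_true - y_pred))      # Mean Absolute Error
mse = np.mean((y_true - y_pred) ** 2)       # Mean Squared Error
rmse = np.sqrt(mse)                         # Root Mean Squared Error
ss_res = np.sum((y_true - y_pred) ** 2)     # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot                    # coefficient of determination
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(f"MAE={mae:.3f} MSE={mse:.3f} RMSE={rmse:.3f} R2={r2:.3f} AdjR2={adj_r2:.3f}")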

2.6 Decision Trees


In simple words, a decision tree is a structure that contains nodes (rectangular boxes) and edges (arrows) and is built from a dataset (a table whose columns represent features/attributes and whose rows correspond to records). Each node is either used to make a decision (a decision node) or to represent an outcome (a leaf node).
Decision tree Example

Consider a decision tree that classifies whether a person is Fit or Unfit.
The decision nodes here are questions like "Is the person less than 30 years of age?", "Does the person eat junk food?", etc., and the leaves are one of the two possible outcomes, viz. Fit and Unfit.
Looking at the decision tree, we can make the following decisions: if a person is less than 30 years of age and doesn't eat junk food, then he is Fit; if a person is less than 30 years of age and eats junk food, then he is Unfit; and so on.
The initial node is called the root node, the final nodes are called the leaf nodes, and the rest of the nodes are called intermediate or internal nodes. The root and intermediate nodes represent the decisions, while the leaf nodes represent the outcomes.
2.6.1 ID3

ID3 stands for Iterative Dichotomiser 3 and is named such because the algorithm iteratively
(repeatedly) dichotomizes(divides) features into two or more groups at each step.

Invented by Ross Quinlan, ID3 uses a top-down greedy approach to build a decision tree. In
simple words, the top-down approach means that we start building the tree from the top and
the greedy approach means that at each iteration we select the best feature at the present
moment to create a node.
ID3 is generally used only for classification problems with nominal features.
ID3 Steps
1. Calculate the Information Gain of each feature.
2. Considering that all rows don’t belong to the same class, split the dataset S into subsets
using the feature for which the Information Gain is maximum.
3. Make a decision tree node using the feature with the maximum Information gain.
4. If all rows belong to the same class, make the current node a leaf node with the class as its label.
5. Repeat for the remaining features until we run out of all features, or the decision tree
has all leaf nodes.
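
A minimal sketch of the entropy and information-gain computations behind steps 1 and 2, assuming nominal features stored as lists of rows (the toy data is illustrative):

import math
from collections import Counter

def entropy(labels):
    """Entropy(S) = -sum(p_i * log2(p_i)) over the classes in S."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, labels, feature_index):
    """Gain(S, A) = Entropy(S) - sum(|Sv|/|S| * Entropy(Sv)) over values v of feature A."""
    total = len(labels)
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[feature_index], []).append(label)
    remainder = sum(len(s) / total * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder

# Illustrative toy data: one feature ("eats junk food?") predicting Fit/Unfit
rows = [["yes"], ["yes"], ["no"], ["no"]]
labels = ["Unfit", "Unfit", "Fit", "Fit"]
print(information_gain(rows, labels, 0))   # 1.0: the feature fully separates the classes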
2.6.2 CART Algorithm
The CART algorithm works via the following process:
• The best split point of each input is obtained.
• Based on the best split points of each input in Step 1, the new “best” split point is
identified.
• Split the chosen input according to the “best” split point.
• Continue splitting until a stopping rule is satisfied or no further desirable splitting is
available.

The CART algorithm uses Gini impurity to split the dataset into a decision tree. It does this by searching for the best homogeneity of the sub-nodes, with the help of the Gini index criterion.
Gini index/Gini impurity
The Gini index is a metric for classification tasks in CART. It is based on the sum of squared probabilities of each class. It measures the probability of a specific element being wrongly classified when chosen randomly, and is a variation of the Gini coefficient. It works on categorical variables and provides outcomes of either "success" or "failure", hence it conducts binary splitting only.
The degree of the Gini index varies from 0 to 1,

• A value of 0 indicates that all the elements belong to a single class, i.e., the node is pure.
• A value of 1 indicates that the elements are randomly distributed across various classes.
• A value of 0.5 indicates that the elements are uniformly distributed across two classes.
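
Concretely, the Gini impurity of a set S is Gini(S) = 1 − Σ pᵢ², where pᵢ is the fraction of elements belonging to class i. A minimal sketch:

from collections import Counter

def gini_impurity(labels):
    """Gini(S) = 1 - sum(p_i^2) over the classes in S."""
    total = len(labels)
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())

print(gini_impurity(["yes", "yes", "yes"]))        # 0.0 -> pure node
print(gini_impurity(["yes", "no", "yes", "no"]))   # 0.5 -> maximum impurity for two classes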
Classification tree
A classification tree is an algorithm where the target variable is categorical. The algorithm is
then used to identify the “Class” within which the target variable is most likely to fall.
Classification trees are used when the dataset needs to be split into classes that belong to the response variable (like yes or no).
Regression tree
A regression tree is an algorithm where the target variable is continuous and the tree is used to predict its value. Regression trees are used when the response variable is continuous; for example, when the response variable is the temperature of the day.
Pseudo-code of the CART algorithm
d = 0, endtree = 0
Node(0) = 1, Node(1) = 0, Node(2) = 0
while endtree < 1
    if Node(2^d - 1) + Node(2^d) + ... + Node(2^(d+1) - 2) = 2 - 2^(d+1)
        endtree = 1
    else
        do i = 2^d - 1, 2^d, ..., 2^(d+1) - 2
            if Node(i) > -1
                Split tree
            else
                Node(2i+1) = -1
                Node(2i+2) = -1
            end if
        end do
    end if
    d = d + 1
end while
CART model representation

CART models are formed by picking input variables and evaluating split points on those
variables until an appropriate tree is produced.
Steps to create a Decision Tree using the CART algorithm:
• Greedy algorithm: the input space is divided using a greedy method known as recursive binary splitting. This is a numerical procedure in which all the values are lined up and several split points are tried and assessed using a cost function.
• Stopping criterion: as it works its way down the tree with the training data, the recursive binary splitting method described above must know when to stop splitting. The most frequent halting method is to require a minimum number of training instances assigned to each leaf node. If the count is smaller than the specified threshold, the split is rejected and the node is taken as a final leaf node.
• Tree pruning: a decision tree's complexity is defined as the number of splits in the tree. Trees with fewer branches are preferred, as they are simpler to grasp and less prone to overfitting the data. The quickest and simplest pruning approach is to work through each leaf node in the tree and evaluate the effect of deleting it using a hold-out test set.
• Data preparation for the CART: No special data preparation is required for the
CART algorithm.
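
As a sketch, scikit-learn's DecisionTreeClassifier builds a CART-style tree (Gini criterion, binary splits); the dataset and hyperparameters below are illustrative.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# CART-style tree: Gini criterion with a stopping criterion (min_samples_leaf)
# and a depth limit to keep the tree simple
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, min_samples_leaf=5)
tree.fit(X, y)
print("training accuracy:", tree.score(X, y))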
2.7 Naïve Bayes Classifier Algorithm
o Naïve Bayes algorithm is a supervised learning algorithm, which is based on Bayes
theorem and used for solving classification problems.
o It is mainly used in text classification that includes a high-dimensional training
dataset.
o The Naïve Bayes classifier is one of the simplest and most effective classification
algorithms; it helps in building fast machine learning models that can make
quick predictions.
o It is a probabilistic classifier, which means it predicts on the basis of the
probability of an object.
o Some popular applications of the Naïve Bayes algorithm are spam filtering, sentiment
analysis, and classifying articles.
Why is it called Naïve Bayes?

The Naïve Bayes algorithm is comprised of two words, Naïve and Bayes, which can be
described as:
o Naïve: It is called naïve because it assumes that the occurrence of a certain feature is independent of the occurrence of other features. For example, if a fruit is identified on the basis of color, shape, and taste, then a red, spherical, and sweet fruit is recognized as an apple. Hence each feature individually contributes to identifying it as an apple, without depending on the others.
o Bayes: It is called Bayes because it depends on the principle of Bayes' Theorem.
Bayes' Theorem:
o Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to
determine the probability of a hypothesis with prior knowledge. It depends on the
conditional probability.
o The formula for Bayes' theorem is given as:

P(A|B) = P(B|A) · P(A) / P(B)
Where,
P(A|B) is Posterior probability: Probability of hypothesis A on the observed event B.
P(B|A) is Likelihood probability: Probability of the evidence given that the hypothesis is true.
P(A) is Prior Probability: Probability of hypothesis before observing the evidence.
P(B) is Marginal Probability: Probability of Evidence.
Working of Naïve Bayes' Classifier:
Working of Naïve Bayes' Classifier can be understood with the help of the below example:
Suppose we have a dataset of weather conditions and a corresponding target variable "Play". Using this dataset, we need to decide whether or not we should play on a particular day, according to the weather conditions. To solve this problem, we need to follow the steps below:
1. Convert the given dataset into frequency tables.
2. Generate Likelihood table by finding the probabilities of given features.
3. Now, use Bayes theorem to calculate the posterior probability.
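
A minimal sketch of these three steps on a tiny, made-up weather dataset (Laplace smoothing and normalization of the posterior are omitted for clarity):

from collections import Counter, defaultdict

# Illustrative dataset: (weather, play)
data = [("Sunny", "No"), ("Sunny", "No"), ("Overcast", "Yes"), ("Rainy", "Yes"),
        ("Rainy", "Yes"), ("Rainy", "No"), ("Overcast", "Yes"), ("Sunny", "Yes")]

# Step 1: frequency tables
class_counts = Counter(play for _, play in data)
feature_counts = defaultdict(Counter)
for weather, play in data:
    feature_counts[play][weather] += 1

# Steps 2-3: likelihood table and (unnormalized) posterior P(play | weather)
def posterior(weather, play):
    prior = class_counts[play] / len(data)                           # P(play)
    likelihood = feature_counts[play][weather] / class_counts[play]  # P(weather | play)
    return likelihood * prior

for play in class_counts:
    print(play, posterior("Sunny", play))   # pick the class with the larger value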

2.8 K-Nearest Neighbor(KNN) Algorithm
o K-Nearest Neighbour is one of the simplest Machine Learning algorithms based on
Supervised Learning technique.
o The K-NN algorithm assumes similarity between the new case/data and the available cases, and puts the new case into the category that is most similar to the available categories.
o The K-NN algorithm stores all the available data and classifies a new data point based on similarity. This means that when new data appears, it can easily be classified into a well-suited category using the K-NN algorithm.
o K-NN algorithm can be used for Regression as well as for Classification but mostly it
is used for the Classification problems.
o K-NN is a non-parametric algorithm, which means it does not make any
assumption on underlying data.
o It is also called a lazy learner algorithm because it does not learn from the training set immediately; instead, it stores the dataset and, at the time of classification, performs an action on it.
o At the training phase, the KNN algorithm just stores the dataset, and when it gets new data, it classifies that data into the category most similar to it.
Why do we need a K-NN Algorithm?
Suppose there are two categories, Category A and Category B, and we have a new data point x1. In which of these categories will this data point lie? To solve this type of problem, we need a K-NN algorithm. With the help of K-NN, we can easily identify the category or class of a particular data point.

How does K-NN work?
The K-NN working can be explained on the basis of the below algorithm:
o Step-1: Select the number K of neighbors.
o Step-2: Calculate the Euclidean distance from the new point to every data point.
o Step-3: Take the K nearest neighbors as per the calculated Euclidean distances.
o Step-4: Among these K neighbors, count the number of data points in each
category.
o Step-5: Assign the new data point to the category for which the number of
neighbors is maximum.
o Step-6: Our model is ready.
Suppose we have a new data point that we need to put in the required category.

o Firstly, we will choose the number of neighbors; here we choose K = 5.
o Next, we will calculate the Euclidean distance between the data points. The Euclidean distance between two points (x1, y1) and (x2, y2), familiar from geometry, is calculated as:

d = √((x2 − x1)² + (y2 − y1)²)

o By calculating the Euclidean distances, we find the nearest neighbors: three nearest neighbors in Category A and two nearest neighbors in Category B.
o Since the majority of the 5 nearest neighbors are from Category A, this new data point must belong to Category A.
How to select the value of K in the K-NN Algorithm?
Below are some points to remember while selecting the value of K in the K-NN algorithm:
o There is no particular way to determine the best value for "K", so we need to try some
values to find the best out of them. The most preferred value for K is 5.

o A very low value of K, such as K = 1 or K = 2, can be noisy and subject to the effects of outliers in the model.
o Large values of K are good for smoothing out noise, but a value that is too large may cause the model to miss local patterns.
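
As a sketch, scikit-learn's KNeighborsClassifier implements these steps (it uses Euclidean distance by default); the two-category data is made up for illustration.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Illustrative points for Category A (label 0) and Category B (label 1)
X = np.array([[1, 1], [1, 2], [2, 1], [6, 6], [7, 6], [6, 7]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

knn = KNeighborsClassifier(n_neighbors=5)   # K = 5, Euclidean distance by default
knn.fit(X, y)
print("predicted category for (2, 2):", knn.predict([[2.0, 2.0]]))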
2.9 Logistic Regression
o Logistic regression is one of the most popular Machine Learning algorithms, which
comes under the Supervised Learning technique. It is used for predicting the
categorical dependent variable using a given set of independent variables.
o Logistic regression predicts the output of a categorical dependent variable; therefore the outcome must be a categorical or discrete value, such as Yes or No, 0 or 1, True or False, etc. However, instead of giving exact values of 0 and 1, it gives probabilistic values that lie between 0 and 1.
o Logistic Regression is much like Linear Regression, except in how they are used: Linear Regression is used for solving regression problems, whereas Logistic Regression is used for solving classification problems.
o In Logistic regression, instead of fitting a regression line, we fit an "S" shaped logistic
function, which predicts two maximum values (0 or 1).
o The curve from the logistic function indicates the likelihood of something such as
whether the cells are cancerous or not, a mouse is obese or not based on its weight,
etc.
o Logistic Regression is a significant machine learning algorithm because it has the
ability to provide probabilities and classify new data using continuous and discrete
datasets.
o Logistic Regression can be used to classify the observations using different types of
data and can easily determine the most effective variables used for the classification.
Logistic Function (Sigmoid Function):
o The sigmoid function is a mathematical function used to map predicted values to probabilities:

f(x) = 1 / (1 + e^(−x))

o It maps any real value to another value within the range of 0 and 1.
o The output of logistic regression must be between 0 and 1 and cannot go beyond this limit, so it forms an "S"-shaped curve. The S-form curve is called the sigmoid function or the logistic function.
o In logistic regression, we use the concept of a threshold value, which defines the probability of either 0 or 1: values above the threshold tend to 1, and values below the threshold tend to 0.
Assumptions for Logistic Regression:
o The dependent variable must be categorical in nature.
o The independent variables should not exhibit multicollinearity.
Logistic Regression Equation:
The Logistic regression equation can be obtained from the Linear Regression equation. The
mathematical steps to get Logistic Regression equations are given below:
o We know the equation of a straight line can be written as:

y = b0 + b1x1 + b2x2 + … + bnxn

o In logistic regression, y can only be between 0 and 1, so let's divide the above equation by (1 − y); the ratio y / (1 − y) is 0 for y = 0 and infinity for y = 1.
o But we need a range between −infinity and +infinity, so we take the logarithm of the ratio, and the equation becomes:

log[y / (1 − y)] = b0 + b1x1 + b2x2 + … + bnxn

The above equation is the final equation for Logistic Regression.
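
A minimal sketch of the sigmoid mapping and a fitted logistic regression classifier using scikit-learn; the hours-studied data and the 0.5 threshold are illustrative.

import numpy as np
from sklearn.linear_model import LogisticRegression

def sigmoid(z):
    """Map any real value to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative data: hours studied (X) vs. pass/fail (y)
X = np.array([[1], [2], [3], [4], [5], [6]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression().fit(X, y)
probs = clf.predict_proba([[3.5]])[:, 1]        # probability of class 1
print("P(pass | 3.5 hours):", probs)
print("class (threshold 0.5):", (probs >= 0.5).astype(int))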


Type of Logistic Regression:
On the basis of the categories, Logistic Regression can be classified into three types:
o Binomial: In binomial Logistic regression, there can be only two possible types of the
dependent variables, such as 0 or 1, Pass or Fail, etc.
o Multinomial: In multinomial Logistic regression, there can be 3 or more possible
unordered types of the dependent variable, such as "cat", "dog", or "sheep".
o Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered types
of dependent variables, such as "low", "Medium", or "High".
2.10 Multinomial Logistic Regression
Multinomial Logistic Regression is a classification technique that extends the logistic regression algorithm to problems with more than two possible outcome classes, given one or more independent variables.
Example for Multinomial Logistic Regression:
(a) Which Flavor of ice cream will a person choose?
Dependent Variable:
• Vanilla
• Chocolate
• Butterscotch
• Black Currant
Independent Variables:
• Gender
• Age
• Occasion
• Happiness
• Etc.
Multinomial Logistic Regression is also known as multiclass logistic regression, softmax
regression, polytomous logistic regression, multinomial logit, maximum entropy (MaxEnt)
classifier and conditional maximum entropy model.

Page 40
Dependent Variable:
The dependent variable can have two or more possible outcomes/classes.
The dependent variable is nominal in nature, meaning there is no ordering among the target classes, i.e., the classes cannot be meaningfully ordered.
The dependent variable to be predicted belongs to a limited, predefined set of items.
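
At the core of multinomial logistic regression is the softmax function, which turns one linear score per class into a probability distribution over the classes. A minimal sketch (the flavors and scores are illustrative):

import numpy as np

def softmax(scores):
    """P(class k) = exp(s_k) / sum_j exp(s_j), computed in a numerically stable way."""
    exps = np.exp(scores - np.max(scores))
    return exps / exps.sum()

flavors = ["Vanilla", "Chocolate", "Butterscotch", "Black Currant"]
scores = np.array([2.0, 1.0, 0.5, 0.1])   # one linear score per class
for flavor, p in zip(flavors, softmax(scores)):
    print(f"P({flavor}) = {p:.3f}")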
2.11 Support Vector Machines (SVM)
Basic Steps
The basic steps of the SVM are:
1. Select two hyperplanes (lines, in 2D) which separate the data with no points between them.
2. Maximize their distance (the margin).
3. The average line (the line halfway between the two) will be the decision boundary.
This is very nice and easy, but finding the best margin is a non-trivial optimization problem (it is easy in 2D, when we have only two attributes, but what if we have N dimensions, with N a very big number?).
Non-Linear SVM:
If data is linearly separable, then we can separate it using a straight line, but for non-linear data we cannot draw a single straight line.

So to separate these data points, we need to add one more dimension. For linear data, we have
used two dimensions x and y, so for non-linear data, we will add a third dimension z. It can
be calculated as:
z = x² + y²
By adding the third dimension, the data points become linearly separable in the new space, and SVM can now divide the datasets into classes with a flat (linear) boundary.

Since we are in 3-D space, this boundary looks like a plane parallel to the x-axis. If we convert it back to 2-D space with z = 1, it becomes a circle: hence we get a circumference of radius 1 in the case of non-linear data.


Kernel Methods
Kernels or kernel methods (also called kernel functions) are sets of different types of algorithms that are used for pattern analysis. They are used to solve a non-linear problem by using a linear classifier. Kernel methods are employed in SVMs (Support Vector Machines), which are used in classification and regression problems. The SVM uses what is called the "kernel trick", where the data is transformed and an optimal boundary is found for the possible outputs.
The Need for Kernel Method and its Working

Before we get into the working of kernel methods, it is important to understand support vector machines (SVMs), because kernels are implemented in SVM models. Support Vector Machines are supervised machine learning algorithms that are used in classification and regression problems, such as classifying an apple into the class fruit while classifying a lion into the class animal.
In 2 dimensions, the plane is the ambient space, but the line which divides or classifies the space is one dimension less than the ambient space and is called a hyperplane. But what if the input is not linearly separable? It is very difficult to solve such a classification with a linear classifier, as there is no good straight line able to separate the red and the green dots when the points are randomly distributed.
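
A minimal sketch of the kernel trick with scikit-learn: an RBF-kernel SVC separates ring-shaped (non-linear) data that a linear kernel cannot; the dataset is generated for illustration.

from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Ring-shaped data: not separable by a straight line in 2-D
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear_svm = SVC(kernel="linear").fit(X, y)
rbf_svm = SVC(kernel="rbf").fit(X, y)      # kernel trick: implicit non-linear mapping

print("linear kernel accuracy:", linear_svm.score(X, y))   # poor
print("RBF kernel accuracy:", rbf_svm.score(X, y))         # near 1.0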
