Unit II
The main difference between Regression and Classification algorithms is that Regression
algorithms are used to predict continuous values such as price, salary, age, etc.,
while Classification algorithms are used to predict/classify discrete values such as
Male or Female, True or False, Spam or Not Spam, etc.
Classification
• The Classification algorithm is a Supervised Learning technique that is used to
identify the category of new observations on the basis of training data.
• In Classification, a program learns from the given dataset or observations and
then classifies new observations into a number of classes or groups, such as Yes
or No, 0 or 1, Spam or Not Spam, cat or dog, etc. Classes can be called
targets/labels or categories.
• Classification is a process of finding a function which helps in dividing the
dataset into classes based on different parameters. In Classification, a computer
program is trained on the training dataset and based on that training, it
categorizes the data into different classes.
• Unlike regression, the output variable of Classification is a category, not a value,
such as "Green or Blue", "fruit or animal", etc. Since the Classification algorithm
is a Supervised Learning technique, it takes labeled input data, meaning each
input comes with a corresponding output.
In a classification algorithm, a discrete output function (y) is mapped to the input
variable (x). Commonly used classification algorithms include:
o Logistic Regression
o K-Nearest Neighbours
o Support Vector Machines
o Kernel SVM
o Naïve Bayes
o Decision Tree Classification
o Random Forest Classification
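For example, the following is a minimal sketch of this classification workflow in Python, assuming scikit-learn and a synthetic dataset (both are illustrative choices, not prescribed by these notes):

```python
# A minimal sketch of the supervised classification workflow described above,
# using scikit-learn (library choice is an assumption; the notes do not name one).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic labeled data: inputs X with corresponding discrete outputs y (0 or 1)
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression()        # any classifier from the list above works here
model.fit(X_train, y_train)         # learn from the training dataset
print(model.predict(X_test[:5]))    # predicted categories (discrete labels)
print(model.score(X_test, y_test))  # fraction of correctly classified samples
```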
Logistic regression:
The Logistic regression equation can be obtained from the Linear Regression
equation. The mathematical steps to get Logistic Regression equations are given
below:
o We know the equation of the straight line can be written as:
y = b0 + b1x1 + b2x2 + ... + bnxn
o In Logistic Regression y can be between 0 and 1 only, so for this let's divide
the above equation by (1 - y):
y/(1 - y); this is 0 for y = 0 and infinity for y = 1
o But we need a range between -∞ and +∞, so taking the logarithm of the
equation, it becomes:
log[y/(1 - y)] = b0 + b1x1 + b2x2 + ... + bnxn
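As a quick numeric sketch (my own illustration, not part of the derivation above), we can check in Python that the log-odds log(y/(1 - y)) really spans -∞ to +∞ as y moves from 0 to 1:

```python
import numpy as np

# Illustrative check: the log-odds transform log(y / (1 - y)) maps
# probabilities in (0, 1) onto (-inf, +inf), which is the range a linear
# model b0 + b1*x1 + ... naturally produces.
for y in [0.001, 0.25, 0.5, 0.75, 0.999]:
    print(y, np.log(y / (1 - y)))
# As y -> 0 the log-odds -> -inf; as y -> 1 it -> +inf; at y = 0.5 it is 0.
```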
On the basis of the categories, Logistic Regression can be classified into three types:
o Binomial: there can be only two possible types of the dependent variable, such as 0 or 1, Pass or Fail, etc.
o Multinomial: there can be three or more possible unordered types of the dependent variable, such as "cat", "dog", or "sheep".
o Ordinal: there can be three or more possible ordered types of the dependent variable, such as "low", "medium", or "high".
When you extend this straight line, you will get values greater than 1 and less than 0,
which do not make much sense in our classification problem and make the model hard
to interpret. That is where `Logistic Regression` comes in. If we needed
to predict sales for an outlet, then this linear model could be helpful. But here we need to
classify customers.
We need a function that transforms this straight line so that all values lie
between 0 and 1:
Ŷ = σ(z)
Ŷ = 1 / (1 + e^(-z))
After this transformation, we get a curve that stays between 0 and 1. Another
advantage of this function is that every continuous value we get lies between 0 and
1, so we can use it as a probability when making predictions. For example, if the
predicted value is on the extreme right, the probability will be close to 1, and if the
predicted value is on the extreme left, the probability will be close to 0.
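A small Python sketch of this transformation (the sample scores are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    """Squash any real-valued score z into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Scores far to the right give probabilities near 1,
# scores far to the left give probabilities near 0.
for z in [-6, -2, 0, 2, 6]:
    print(z, round(sigmoid(z), 3))
# -6 -> 0.002, 0 -> 0.5, 6 -> 0.998
```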
Selecting the right model is not enough. You need a function that measures the
performance of a Machine Learning model for given data. The Cost Function quantifies
the error between predicted values and expected values.
Another thing that changes with this transformation is the Cost Function. In Linear
Regression, we use `Mean Squared Error` as the cost function, given by:
MSE = (1/n) Σ (yi - ŷi)²
When this error function is plotted with respect to the weight parameters of the Linear
Regression model, it forms a convex curve, which makes it eligible for the Gradient
Descent optimization algorithm to minimize the error by finding the global minimum and
adjusting the weights.
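A minimal sketch of this idea in Python, assuming a hypothetical one-feature linear model and made-up data; the gradient expressions follow from differentiating the MSE:

```python
import numpy as np

# Minimise the Mean Squared Error of y_hat = w*x + b by gradient descent.
# Because the MSE surface is convex in (w, b), the updates converge to the
# global minimum. Data and learning rate are assumed for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0            # true relationship we hope to recover

w, b, lr = 0.0, 0.0, 0.05    # initial weights and learning rate
for _ in range(2000):
    y_hat = w * x + b
    # Gradients of MSE = mean((y - y_hat)^2) with respect to w and b
    grad_w = -2.0 * np.mean((y - y_hat) * x)
    grad_b = -2.0 * np.mean(y - y_hat)
    w -= lr * grad_w         # step against the gradient
    b -= lr * grad_b
print(round(w, 3), round(b, 3))   # approaches w = 2, b = 1
```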
Log Loss is the most important classification metric based on probabilities. It’s hard to
interpret raw log-loss values, but log-loss is still a good metric for comparing models.
For any given problem, a lower log loss value means better predictions.
Mathematical interpretation:
Log Loss is the negative average of the log of corrected predicted probabilities for
each instance.
-> By default, the output of the logistic regression model is the probability of the
sample being positive (indicated by 1), i.e. if a logistic regression model is trained to
classify on a `company dataset`, then the predicted probability column says what the
probability is that the person has bought a jacket. Here, in the above dataset, the probability
that the person with ID6 will buy a jacket is 0.94.
In the same way, the probability that the person with ID5 will buy a jacket (i.e. belong
to class 1) is 0.1, but the actual class for ID5 is 0, so the probability for the actual class is
(1 - 0.1) = 0.9. 0.9 is the corrected probability for ID5.
We then take the log of the corrected probability for each instance.
These log values are negative. To deal with the negative sign, we take
the negative average of these values, to maintain the common convention that lower
loss scores are better.
This gives the log loss formula:
Log Loss = -(1/N) Σi [yi·log(p(yi)) + (1 - yi)·log(1 - p(yi))]
Here yi represents the actual class and p(yi) is the predicted probability of that class.
Now let's see how the above formula works in two cases:
1. When the actual class is 1: the second term in the formula becomes 0 and we are
left with the first term, i.e. yi·log(p(yi)), since (1 - 1)·log(1 - p(yi)) is 0.
2. When the actual class is 0: the first term becomes 0 and we are left with the
second term, i.e. (1 - yi)·log(1 - p(yi)), since 0·log(p(yi)) is 0.
Wow!! We got back to the original formula for binary cross-entropy/log loss.
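A small Python sketch of this walkthrough, reusing the two rows quoted above (ID5 and ID6) plus one made-up row, and checking that the corrected-probability route and the formula give the same number:

```python
import numpy as np

# Rows: ID5 (actual class 0, predicted p = 0.1), ID6 (actual class 1,
# predicted p = 0.94), plus one illustrative extra row.
y_true = np.array([0, 1, 1])
p_pred = np.array([0.1, 0.94, 0.6])   # P(class = 1) from the model

# Corrected probability: p when the actual class is 1, (1 - p) when it is 0.
p_corrected = np.where(y_true == 1, p_pred, 1 - p_pred)
log_loss_a = -np.mean(np.log(p_corrected))

# Same quantity via the binary cross-entropy formula.
log_loss_b = -np.mean(y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred))

print(round(log_loss_a, 4), round(log_loss_b, 4))   # identical values
```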
We could start by assuming p(x) to be a linear function. However, the problem is that p
is a probability that should vary from 0 to 1, whereas a linear function of x is unbounded.
To address this problem, we bound p(x) to the range (0, 1) using the logit transformation:
we model the log-odds log[p(x)/(1 - p(x))] as a linear function of x:
log[p(x)/(1 - p(x))] = α0 + α·x
After solving for p(x):
p(x) = 1 / (1 + e^-(α0 + α·x))
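As a sanity check (my addition, assuming SymPy is available), solving the logit equation symbolically recovers this form:

```python
import sympy as sp

# Illustrative check: solving log(p / (1 - p)) = z for p
# recovers the sigmoid form quoted above.
p, z = sp.symbols('p z')
solution = sp.solve(sp.Eq(sp.log(p / (1 - p)), z), p)
print(solution)   # [exp(z)/(exp(z) + 1)], i.e. 1 / (1 + exp(-z))
```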
To make logistic regression a linear classifier, we can choose a certain threshold,
e.g. 0.5. The misclassification rate is then minimized if we predict y = 1 when p ≥
0.5 and y = 0 when p < 0.5. Here, 1 and 0 are the two classes.
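A one-line sketch of this thresholding in Python (the probabilities are illustrative):

```python
import numpy as np

# Turning predicted probabilities into class labels with a 0.5 threshold.
p = np.array([0.1, 0.94, 0.5, 0.49])
y_pred = (p >= 0.5).astype(int)
print(y_pred)   # [0 1 1 0]
```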
Since logistic regression predicts probabilities, we can fit it using maximum likelihood.
For each training data point x with label y, the probability of y is p if y = 1,
or 1 - p if y = 0. The likelihood of the whole training set can therefore be written as:
L = Πi p(xi)^yi · (1 - p(xi))^(1 - yi)
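A small Python sketch of this likelihood on made-up labels and probabilities, also checking it against the log form (note that maximizing this likelihood is equivalent to minimizing the log loss above):

```python
import numpy as np

# Bernoulli likelihood of the training labels under predicted probabilities
# (values illustrative): L = prod(p_i^y_i * (1 - p_i)^(1 - y_i)).
y = np.array([0, 1, 1])
p = np.array([0.1, 0.94, 0.6])

likelihood = np.prod(p**y * (1 - p)**(1 - y))
log_likelihood = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
print(likelihood, np.exp(log_likelihood))   # same number both ways
```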
Gradient Descent is known as one of the most commonly used optimization algorithms
to train machine learning models by means of minimizing errors between actual and
expected results. Further, gradient descent is also used to train Neural Networks.
Gradient descent can be used to find a local minimum or local maximum of a function
as follows:
o If we move towards the negative gradient, i.e. away from the gradient of the function
at the current point, we will reach a local minimum of that function.
o If we move towards the positive gradient, i.e. towards the gradient of the
function at the current point, we will reach a local maximum of that function.
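A tiny sketch of the first rule in Python, on the made-up function f(x) = (x - 3)², whose gradient is 2(x - 3):

```python
# Stepping against the gradient of f(x) = (x - 3)^2 walks downhill to the
# local (here, global) minimum. Start point and learning rate are assumed.
def grad(x):
    return 2.0 * (x - 3.0)      # derivative of (x - 3)^2

x, lr = 0.0, 0.1
for _ in range(100):
    x -= lr * grad(x)           # move against the gradient
print(round(x, 3))              # converges to 3, where f is smallest
```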