The document provides an overview of logistic regression, including its prerequisites, objectives, and key concepts such as odds, probability, log odds, and the sigmoid function. It explains the differences between logistic and linear regression, emphasizing that logistic regression is used for binary classification problems. Additionally, it discusses the cost function used in logistic regression, which is the Cross-Entropy or Log Loss function, and outlines the optimization of weight parameters using gradient descent.

Uploaded by Amit Kumar

Logistic Regression

Prerequisites-
- Descriptive statistics
- Linear regression

Objectives-
- Understand the prerequisite terms odds, probability, log odds, logit function and sigmoid.
- Understand what logistic regression is and its cost function.
- Learn how to optimize weights using gradient descent.
Odds and Probability-
The chances of success or failure of an event are often expressed as odds or probabilities. These two terms convey the same information but differ in how they are expressed. They are defined as follows:
Odds-
Odds are defined as the ratio of the chances in favor of an event to the chances against it. The value of odds lies between 0 and ∞.

odds(A) = chances in favor(A) / chances against(A)
Example: The odds of drawing an Ace from a deck of 52 cards are:

odds(A) = 4/48
Probability-
Probability is defined as the ratio of the chances in favor of an event to the total number of trials. Probabilities are expressed either as percentages or decimals, and their value lies between 0 and 1.

Probability P(A) = chances in favor(A) / total trials
Example: The probability of drawing an Ace from a deck of 52 cards is:

P(A) = 4/52 = 0.077 or 7.7%
Relationship-

Proprietary content. © Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited. This file is meant for personal use by amit.singh204@gmail.com only. Sharing or publishing the contents in part or full is liable for legal action.
odds(A) = Probability of event A occurring / Probability of event A not occurring

odds(A) = P(A) / (1 − P(A))

Probability P(A) = odds(A) / (1 + odds(A))
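The two conversion formulas above can be sketched as a pair of small Python helpers (the function names are illustrative, not from any library):

```python
def prob_to_odds(p):
    """Convert a probability in (0, 1) to odds: P(A) / (1 - P(A))."""
    return p / (1 - p)

def odds_to_prob(odds):
    """Convert odds in (0, inf) back to a probability: odds / (1 + odds)."""
    return odds / (1 + odds)

# Drawing an Ace from a 52-card deck: P = 4/52, odds = 4/48
print(prob_to_odds(4 / 52))   # ≈ 0.0833 (= 4/48)
print(odds_to_prob(4 / 48))   # ≈ 0.0769 (= 4/52)
```

Note that the two functions are inverses of each other, matching the relationship stated above.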
Log odds and logit-
We now know that the odds are the ratio of the probability of an event occurring to the probability of that event not occurring. Taking the log of the odds gives the log odds, defined as:

log odds(A) = log( P(A) / (1 − P(A)) )

When the argument of the log odds is a probability P, the function is called the logit function; that is, the logit of a probability is the log of the odds.

log(odds) = logit(P) = log( P / (1 − P) )
Logit Function- The logit function is mainly used when working with probabilities. It is the log of the odds that Y equals one of the categories. The value of the logit function varies over (−∞, ∞): it approaches ∞ as the probability approaches 1 and goes to −∞ as the probability approaches 0. The logit function is very important in statistics because it maps the probability range (0, 1) to the full range of real numbers.

logit(z) = log( z / (1 − z) )

Figure 1: Logit Function

Sigmoid function- The sigmoid function is defined as the inverse of the logit function, which means that for a probability value P we have:

P = sigmoid(logit(P))

Since the sigmoid inverts the logit, it maps any arbitrary real number into the range (0, 1). The function is defined as:

sigmoid(z) = 1 / (1 + e^(−z))

Figure 2: Sigmoid function
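The inverse relationship P = sigmoid(logit(P)) can be checked with a minimal sketch:

```python
import math

def logit(p):
    """Map a probability in (0, 1) to the full real line: log(p / (1 - p))."""
    return math.log(p / (1 - p))

def sigmoid(z):
    """Map any real number back into (0, 1); the inverse of logit."""
    return 1 / (1 + math.exp(-z))

# sigmoid(logit(P)) recovers P, as stated above
for p in (0.1, 0.5, 0.9):
    print(p, sigmoid(logit(p)))
```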

Logistic Regression-
In linear regression the target variable y is continuous, but suppose y is a categorical variable with two classes; then linear regression cannot be used to predict the target variable. Logistic regression is used to solve such problems.
Precisely, logistic regression is defined as a statistical approach for classifying labels. In its basic form it is used to classify binary data. Logistic regression is very similar to linear regression in that the explanatory variables (X) are combined with weight values to predict a target variable, here of binary class (y). The main difference between linear regression and logistic regression is the target value: logistic regression models the target values as 0 or 1, whereas linear regression models them as numeric values. The logistic regression model is expressed as:

y = 1 / (1 + e^(−(w0 + w1·x))) = e^(w0 + w1·x) / (e^(w0 + w1·x) + 1)
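The model above is just the sigmoid applied to a linear combination of the input and the weights. A minimal sketch, with weights w0 and w1 chosen purely for illustration (not fitted to any data):

```python
import math

def predict_proba(x, w0, w1):
    """Logistic regression model: P(y = 1 | x) = sigmoid(w0 + w1 * x)."""
    return 1 / (1 + math.exp(-(w0 + w1 * x)))

# Illustrative weights
w0, w1 = -4.0, 0.8
print(predict_proba(10.0, w0, w1))  # score w0 + w1*x = 4, probability ≈ 0.982
print(predict_proba(0.0, w0, w1))   # score = -4, probability ≈ 0.018
```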

Examples-

1. Churn prediction- Churn is the probability that a client abandons a service or stops paying a particular service provider. The ratio of clients who abandon the service during a particular time interval is called the churn rate. Churn prediction is treated as a binary classification problem: will a client churn in the future, based on his/her attributes? For example, predicting whether a particular client churns on the basis of the monthly charge of the service:

P(churn = 1 | monthly charge) = 1 / (1 + e^(−(w0 + w1·monthly charge)))

churn = 1 if P(churn = 1 | monthly charge) > 0.5
churn = 0 if P(churn = 1 | monthly charge) ≤ 0.5

2. Spam detection- Predict whether an email is spam or not.

3. Banking- Predict whether a particular customer will default on a loan or not.

Cost Function-
Linear regression uses mean squared error as its cost function, but unfortunately this cannot be used with logistic regression. Logistic regression uses the Cross-Entropy or Log-Loss function as its cost function, defined for the two-class classification problem as:

cost(w0, w1) = −(1/m) Σ_{i=1}^{m} [ y_i·log(a_i) + (1 − y_i)·log(1 − a_i) ]

where

a_i = sigm(yhat_i)

sigm(yhat_i) = 1 / (1 + e^(−yhat_i))
The cost function can be divided into two separate cases:

cost(w0, w1) = −(1/m) Σ_{i=1}^{m} log(a_i)   if y = 1

and

cost(w0, w1) = −(1/m) Σ_{i=1}^{m} log(1 − a_i)   if y = 0
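The cross-entropy cost above can be sketched in a few lines of Python (the tiny data set in the usage line is made up for illustration):

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def cross_entropy_cost(w0, w1, xs, ys):
    """Log-loss: -(1/m) * sum(y*log(a) + (1-y)*log(1-a)), with a = sigmoid(w0 + w1*x)."""
    m = len(xs)
    total = 0.0
    for x, y in zip(xs, ys):
        a = sigmoid(w0 + w1 * x)
        total += y * math.log(a) + (1 - y) * math.log(1 - a)
    return -total / m

# With zero weights every a_i = 0.5, so the cost is log(2) ≈ 0.693
print(cross_entropy_cost(0.0, 0.0, [1, 2], [0, 1]))
```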

Optimization of coefficients (weight parameters)- Again, gradient descent is used to optimize the values of the weight parameters. Recall the cost function:

Cost = (1/m) Σ_{i=1}^{m} [ −y_i·log(a_i) − (1 − y_i)·log(1 − a_i) ]

where

a_i = 1 / (1 + e^(−yhat_i))

and

yhat_i = w0 + w1·x_i

Now let us find the derivative of the cost function (by the chain rule of partial derivatives):

∂Cost/∂w_j = (∂Cost/∂a_i) · (∂a_i/∂yhat_i) · (∂yhat_i/∂w_j)

So,

∂Cost/∂a_i = (a_i − y_i) / (a_i·(1 − a_i))

∂a_i/∂yhat_i = a_i·(1 − a_i)

and

∂yhat_i/∂w_j = x_i

By using all of the above, the generalized formula is expressed as:

∂Cost/∂w_j = (1/m) Σ_{i=1}^{m} (a_i − y_i)·x_i   (with x_i = 1 for w0)

Parameter update:

w_j = w_j − lrate · ∂Cost/∂w_j
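The gradient formula and the update rule together give one gradient-descent step, sketched below (the study-hours numbers in the usage line come from the worked example in this note):

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def gradient_step(w0, w1, xs, ys, lrate):
    """One update: w_j -= lrate * (1/m) * sum((a_i - y_i) * x_i), with x_i = 1 for w0."""
    m = len(xs)
    a = [sigmoid(w0 + w1 * x) for x in xs]
    grad0 = sum(ai - yi for ai, yi in zip(a, ys)) / m
    grad1 = sum((ai - yi) * xi for ai, yi, xi in zip(a, ys, xs)) / m
    return w0 - lrate * grad0, w1 - lrate * grad1

# Starting from w0 = w1 = 1 with lrate = 0.01 on the study-hours data
print(gradient_step(1.0, 1.0, [6, 5, 4, 7, 8, 2], [1, 0, 0, 1, 1, 0], 0.01))
```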

Example- Consider an example where we are interested in the effect of study hours per day on examination results, and want to predict whether a student will pass or fail for given study hours. We have sample data for six students, with their grades and total study hours per day.

Table-1:

Study Hours per day | Grades
6 | 1 (Pass)
5 | 0 (Fail)
4 | 0 (Fail)
7 | 1 (Pass)
8 | 1 (Pass)
2 | 0 (Fail)

Figure-1: Scatter graph of grades vs. study hours

To solve the problem using logistic regression, let us model the linear equation as:

y(Grades) = w0 + w1·X(Study Hours per day)

and predict the result using:

P(result = 1 | study hours) = 1 / (1 + e^(−(w0 + w1·X(Study Hours per day))))

with

yhat_i = w0 + w1·x_i

and

a_i = sigm(yhat_i) = 1 / (1 + e^(−yhat_i))

Cost Function:

Cost(w0, w1) = (1/m) Σ_{i=1}^{m} [ −y_i·log(a_i) − (1 − y_i)·log(1 − a_i) ]

Gradients:

∂Cost(w0, w1)/∂w0 = (1/m) Σ_{i=1}^{m} (a_i − y_i)

and

∂Cost(w0, w1)/∂w1 = (1/m) Σ_{i=1}^{m} (a_i − y_i)·x_i

Parameter updates:

w0 = w0 − lrate · ∂Cost(w0, w1)/∂w0

and

w1 = w1 − lrate · ∂Cost(w0, w1)/∂w1

We have:

X: 6 5 4 7 8 2
y: 1 0 0 1 1 0

Iteration #1:
Let w0 = 1 and w1 = 1, with lrate = 0.01.

yhat_i = w0 + w1·x_i and a_i = sigm(yhat_i) = 1 / (1 + e^(−yhat_i))

yhat: 7 6 5 8 9 3

a: 0.999 0.997 0.993 0.999 0.999 0.953

So,

∂Cost(w0, w1)/∂w0 = [(0.999 − 1) + (0.997 − 0) + (0.993 − 0) + (0.999 − 1) + (0.999 − 1) + (0.953 − 0)] / 6

∂Cost(w0, w1)/∂w0 = 0.490

and

∂Cost(w0, w1)/∂w1 = [(0.999 − 1)·6 + (0.997 − 0)·5 + (0.993 − 0)·4 + (0.999 − 1)·7 + (0.999 − 1)·8 + (0.953 − 0)·2] / 6

∂Cost(w0, w1)/∂w1 = 1.807

Parameter update:

w0 = w0 − lrate · ∂Cost(w0, w1)/∂w0 = 1 − 0.01·(0.490) = 0.995

w1 = w1 − lrate · ∂Cost(w0, w1)/∂w1 = 1 − 0.01·(1.807) = 0.982
Iteration #2:
Now w0 = 0.995 and w1 = 0.982, with lrate = 0.01.

yhat: 6.887 5.905 4.923 7.869 8.851 2.959

a: 0.999 0.997 0.993 0.999 0.999 0.951

∂Cost(w0, w1)/∂w0 = [(0.999 − 1) + (0.997 − 0) + (0.993 − 0) + (0.999 − 1) + (0.999 − 1) + (0.951 − 0)] / 6 = 0.490

and

∂Cost(w0, w1)/∂w1 = [(0.999 − 1)·6 + (0.997 − 0)·5 + (0.993 − 0)·4 + (0.999 − 1)·7 + (0.999 − 1)·8 + (0.951 − 0)·2] / 6 = 1.806

Parameter update:

w0 = w0 − lrate · ∂Cost(w0, w1)/∂w0 = 0.995 − 0.01·(0.490) = 0.990

w1 = w1 − lrate · ∂Cost(w0, w1)/∂w1 = 0.982 − 0.01·(1.806) = 0.964

and so on…
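The iterations above can be reproduced with a minimal training loop in pure Python, using the study-hours data:

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# Study-hours data: X = hours per day, y = 1 (pass) / 0 (fail)
X = [6, 5, 4, 7, 8, 2]
y = [1, 0, 0, 1, 1, 0]

w0, w1, lrate, m = 1.0, 1.0, 0.01, len(X)
for it in range(2):  # two iterations, as worked out above; increase to train further
    a = [sigmoid(w0 + w1 * x) for x in X]
    grad0 = sum(ai - yi for ai, yi in zip(a, y)) / m
    grad1 = sum((ai - yi) * xi for ai, yi, xi in zip(a, y, X)) / m
    w0 -= lrate * grad0
    w1 -= lrate * grad1

print(w0, w1)  # w0 ≈ 0.990, w1 ≈ 0.964 after two iterations
```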

Evaluation of the logistic regression model- The performance of classification algorithms is judged by the confusion matrix, which comprises the classification counts of actual versus predicted labels. The confusion matrix for binary classification is given by:

Figure 3: Confusion Matrix

Confusion matrix cells are populated by the terms:

True Positive (TP)- Values predicted as True that are actually True.
True Negative (TN)- Values predicted as False that are actually False.
False Positive (FP)- Values predicted as True but actually False.
False Negative (FN)- Values predicted as False but actually True.

Classification performance metrics are based on the confusion matrix values. The most popular metrics are:
Precision- A measure of the correctness achieved in prediction.

precision = TP / (TP + FP)

Recall (sensitivity)- A measure of completeness: the fraction of actual positive observations that are predicted correctly.

recall = TP / (TP + FN)

Specificity- A measure of how many observations of the negative category are predicted correctly.

specificity = TN / (TN + FP)

F1-Score- A way to combine the precision and recall metrics into a single term. The F1-score is defined as the harmonic mean of precision and recall.

F1-score = (2 · precision · recall) / (precision + recall)
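The four metrics can be computed directly from confusion-matrix counts; the counts in the usage line are hypothetical, chosen only to exercise the formulas:

```python
def classification_metrics(tp, tn, fp, fn):
    """Precision, recall, specificity and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # sensitivity / true positive rate
    specificity = tn / (tn + fp)     # true negative rate
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, specificity, f1

# Hypothetical counts for illustration
p, r, s, f1 = classification_metrics(tp=40, tn=45, fp=5, fn=10)
print(p, r, s, f1)
```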
ROC Curve- The Receiver Operating Characteristic (ROC) curve measures the performance of models by evaluating the trade-off between sensitivity (true positive rate) and 1 − specificity (false positive rate).
AUC- The area under the curve (AUC) is another measure for classification models, based on the ROC curve. It measures accuracy as the area under the ROC curve.


Figure 4: ROC Curve

Pros and cons of Logistic Regression:

Pros- The logistic regression classification model is simple and easily scalable to multiple classes.
Cons- The classifier constructs linear boundaries, and interpretation of the coefficient values is difficult.
********

