Logistic Regression
Prerequisites-
- Descriptive statistics
- Linear regression
Objectives-
- Understand the prerequisite terms: odds, probability, log odds, the logit function and the sigmoid.
- Learn what logistic regression is and what its cost function is.
- Learn how to optimize weights using gradient descent.
Odds and Probability-
The chances of success or failure of an event are often expressed as odds and probabilities. The two terms convey the same information but differ in expression. They are defined as follows:
Odds-
Odds are defined as the ratio of the chances in favor of an event to the chances against it. The value of the odds may lie between 0 and ∞.

$$\text{odds}(A) = \frac{\text{chances in favor}(A)}{\text{chances against}(A)}$$
Example: The odds of drawing an ace from a deck of 52 cards are:

$$\text{odds}(A) = \frac{4}{48}$$
Probability-
Probability is defined as the ratio of the chances in favor of an event to the total number of trials. Probabilities are expressed either as percentages or as decimals, and the value lies between 0 and 1.

$$P(A) = \frac{\text{chances in favor}(A)}{\text{total trials}}$$
Example: The probability of drawing an ace from a deck of 52 cards is:

$$P(A) = \frac{4}{52} = 0.077 \text{ or } 7.7\%$$
Relationship-
Proprietary content. © Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited. This file is meant for personal use by amit.singh204@gmail.com only. Sharing or publishing the contents in part or full is liable for legal action.
$$\text{odds}(A) = \frac{\text{probability of event } A \text{ occurring}}{\text{probability of event } A \text{ not occurring}} = \frac{P(A)}{1 - P(A)}$$

$$P(A) = \frac{\text{odds}(A)}{1 + \text{odds}(A)}$$
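These two conversions can be sketched in Python (the function names here are illustrative, not from the text):

```python
def odds_from_prob(p):
    """Convert a probability P(A) to odds: P(A) / (1 - P(A))."""
    return p / (1 - p)

def prob_from_odds(odds):
    """Convert odds back to a probability: odds / (1 + odds)."""
    return odds / (1 + odds)

# Drawing an ace: 4 chances in favor, 48 against, 52 trials in total.
p_ace = 4 / 52
print(round(odds_from_prob(p_ace), 4))   # odds = 4/48
print(round(prob_from_odds(4 / 48), 4))  # back to 4/52
```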
Log odds and logit-
We now know that the odds are the ratio of the probability of an event occurring to the probability of that event not occurring. Taking the log of the odds gives the log odds, defined as:

$$\log(\text{odds}(A)) = \log\left(\frac{P(A)}{1 - P(A)}\right)$$

When the argument of this function is a probability P, it is called the logit function; that is, the logit of a probability is the log of the odds.

$$\log(\text{odds}) = \text{logit}(P) = \log\left(\frac{P}{1 - P}\right)$$
Logit Function- The logit function is mainly used when working with probabilities. It is the log of the odds that Y equals one of the categories. The value of the logit function varies over (−∞, ∞): it approaches ∞ as the probability approaches 1 and goes to −∞ as the probability approaches 0. The logit function is very important in statistics because it maps probability values in the range (0, 1) to the full range of real numbers.

$$\text{logit}(z) = \log\left(\frac{z}{1 - z}\right)$$
Sigmoid function- The sigmoid function is defined as the inverse of the logit function, which means that for a probability value P we have:

$$P = \text{sigmoid}(\text{logit}(P))$$

Being the inverse of the logit, the sigmoid maps any arbitrary real number into the range (0, 1). The function is defined as:

$$\text{sigmoid}(z) = \frac{1}{1 + e^{-z}}$$
Figure-2: Sigmoid function
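The inverse relationship between the logit and the sigmoid can be checked numerically; this is a minimal sketch using Python's math module:

```python
import math

def logit(p):
    """Map a probability in (0, 1) to the full real line."""
    return math.log(p / (1 - p))

def sigmoid(z):
    """Inverse of logit: map any real number back into (0, 1)."""
    return 1 / (1 + math.exp(-z))

# sigmoid undoes logit, so the original probability is recovered.
for p in (0.1, 0.5, 0.9):
    assert abs(sigmoid(logit(p)) - p) < 1e-12

# The logit grows without bound near the ends of (0, 1).
print(logit(0.999))   # large positive value
print(logit(0.001))   # large negative value
```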
Logistic Regression-
In linear regression the target variable y is continuous. Suppose instead that y is a categorical variable with two classes; then linear regression cannot be used to predict the target variable. Logistic regression is used to solve such problems.
Precisely, logistic regression is defined as a statistical approach for classifying labels. In its basic form it is used to classify binary data. Logistic regression is very similar to linear regression in that the explanatory variables (X) are combined with weight values to predict a target variable of a binary class (y). The main difference between linear regression and logistic regression is the target value: logistic regression models the target values as 0 or 1, whereas linear regression models them as numeric values. The logistic regression model is expressed as:

$$y = \frac{1}{1 + e^{-(w_0 + w_1 x)}} = \frac{e^{(w_0 + w_1 x)}}{e^{(w_0 + w_1 x)} + 1}$$
Examples-
1. Churn prediction- Churn is the probability that a client abandons a service or stops paying a particular service provider. The fraction of clients that abandon the service during a particular time interval is called the churn rate. Churn prediction is treated as a binary classification problem: whether a client or customer will churn the service in the future, based on his/her attributes. For example, whether a particular client churns may be predicted from the monthly charge of the service:

$$P(\text{churn} = 1 \mid \text{monthly charge}) = \frac{1}{1 + e^{-(w_0 + w_1 \cdot \text{monthly charge})}}$$
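As a sketch of how such a model would be used once fitted (the weight values below are made up for illustration, not estimated from real churn data):

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# Hypothetical fitted weights (illustrative only): w1 > 0 means a
# higher monthly charge pushes the churn probability up.
w0, w1 = -4.0, 0.05

def p_churn(monthly_charge):
    """P(churn = 1 | monthly charge) under the logistic model."""
    return sigmoid(w0 + w1 * monthly_charge)

print(round(p_churn(20), 3))    # low charge  -> low churn probability
print(round(p_churn(120), 3))   # high charge -> high churn probability
```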
Cost Function-
Linear regression uses mean squared error as its cost function, but unfortunately this cannot be used with logistic regression. Logistic regression instead uses the Cross-Entropy or Log-Loss function as its cost function, defined for the two-class classification problem as:

$$\text{cost}(w_0, w_1) = -\frac{1}{m}\sum_{i=1}^{m}\left\{y_i \log(a_i) + (1 - y_i)\log(1 - a_i)\right\}$$

where

$$a_i = \text{sigmoid}(\hat{y}_i) = \frac{1}{1 + e^{-\hat{y}_i}}$$
The cost function can be split into two separate cases:

$$\text{cost}(w_0, w_1) = -\frac{1}{m}\sum_{i=1}^{m}\log(a_i) \quad \textbf{if } y = 1$$

and

$$\text{cost}(w_0, w_1) = -\frac{1}{m}\sum_{i=1}^{m}\log(1 - a_i) \quad \textbf{if } y = 0$$
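The log-loss above can be written as a short Python function (the helper names are illustrative):

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def cost(w0, w1, X, y):
    """Cross-entropy (log-loss) cost for the two-class problem."""
    m = len(X)
    total = 0.0
    for xi, yi in zip(X, y):
        ai = sigmoid(w0 + w1 * xi)  # a_i = sigmoid(yhat_i)
        total += yi * math.log(ai) + (1 - yi) * math.log(1 - ai)
    return -total / m

# Toy data: one well-separated positive and one negative example.
X, y = [2.0, -2.0], [1, 0]
print(cost(0.0, 1.0, X, y))   # small cost: predictions match labels
print(cost(0.0, -1.0, X, y))  # large cost: predictions are flipped
```

Note how the cost is low when the sigmoid outputs agree with the labels and grows quickly when they disagree, which is exactly the behaviour the two cases above describe.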
Optimization of coefficients or weight parameters- Again, gradient descent is used to optimize the values of the weight parameters. The cost function is:

$$\text{Cost} = \frac{1}{m}\sum_{i=1}^{m}\left\{-y_i \log a_i - (1 - y_i)\log(1 - a_i)\right\}$$

where

$$a_i = \frac{1}{1 + e^{-\hat{y}_i}}$$

and

$$\hat{y}_i = w_0 + w_1 x_i$$
Now let us find the derivative of the cost function (by the chain rule of partial derivatives):

$$\frac{\partial \text{Cost}}{\partial w_i} = \frac{\partial \text{Cost}}{\partial a_i} \cdot \frac{\partial a_i}{\partial \hat{y}_i} \cdot \frac{\partial \hat{y}_i}{\partial w_i}$$

So,

$$\frac{\partial \text{Cost}}{\partial a_i} = \frac{a_i - y_i}{a_i(1 - a_i)}$$

$$\frac{\partial a_i}{\partial \hat{y}_i} = a_i(1 - a_i)$$

and

$$\frac{\partial \hat{y}_i}{\partial w_i} = x_i$$
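Multiplying the three factors, the $a_i(1-a_i)$ terms cancel, leaving a compact gradient:

```latex
\frac{\partial \text{Cost}}{\partial w_1}
  = \frac{1}{m}\sum_{i=1}^{m}
    \frac{a_i - y_i}{a_i(1 - a_i)} \cdot a_i(1 - a_i) \cdot x_i
  = \frac{1}{m}\sum_{i=1}^{m} (a_i - y_i)\,x_i
```

For $w_0$ the last factor is $\partial \hat{y}_i / \partial w_0 = 1$, so its gradient is simply $\frac{1}{m}\sum_{i=1}^{m}(a_i - y_i)$.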
Parameter update:

$$w_i = w_i - \text{lrate} \cdot \frac{\partial \text{Cost}}{\partial w_i}$$
Example- Consider an example where we are interested in the effect of study hours per day on the examination result, and want to predict whether a student will pass or fail for a given number of study hours. We have sample data on six students with their results and total study hours per day.

Table-1:
Study hours per day    Result
6                      1 (Pass)
5                      0 (Fail)
4                      0 (Fail)
7                      1 (Pass)
8                      1 (Pass)
2                      0 (Fail)

Figure-1: Scatter graph of Fail/Pass against study hours
To solve the problem using logistic regression, let us model the linear part as:

$$\hat{y}_i = w_0 + w_1 x_i \quad (x = \text{study hours per day})$$

and

$$a_i = \text{sigmoid}(\hat{y}_i) = \frac{1}{1 + e^{-\hat{y}_i}}$$
Cost Function:

$$\text{Cost}(w_0, w_1) = \frac{1}{m}\sum_{i=1}^{m}\left\{-y_i \log a_i - (1 - y_i)\log(1 - a_i)\right\}$$
Gradients:

$$\frac{\partial \text{Cost}(w_0, w_1)}{\partial w_0} = \frac{1}{m}\sum_{i=1}^{m}(a_i - y_i)$$

and

$$\frac{\partial \text{Cost}(w_0, w_1)}{\partial w_1} = \frac{1}{m}\sum_{i=1}^{m}(a_i - y_i)\,x_i$$
Parameter updates:

$$w_0 = w_0 - \text{lrate} \cdot \frac{\partial \text{Cost}(w_0, w_1)}{\partial w_0}$$

and

$$w_1 = w_1 - \text{lrate} \cdot \frac{\partial \text{Cost}(w_0, w_1)}{\partial w_1}$$
We have,
X: 6 5 4 7 8 2
y: 1 0 0 1 1 0
Iteration #1:
Let $w_0 = 1$ and $w_1 = 1$, with lrate = 0.01.

$$\hat{y}_i = w_0 + w_1 x_i \quad \text{and} \quad a_i = \text{sigmoid}(\hat{y}_i) = \frac{1}{1 + e^{-\hat{y}_i}}$$

yhat: 7 6 5 8 9 3
So,

$$\frac{\partial \text{Cost}(w_0, w_1)}{\partial w_0} = \frac{(0.999 - 1) + (0.997 - 0) + (0.993 - 0) + (0.999 - 1) + (0.999 - 1) + (0.995 - 0)}{6}$$
$$\frac{\partial \text{Cost}(w_0, w_1)}{\partial w_0} = 0.497$$
and

$$\frac{\partial \text{Cost}(w_0, w_1)}{\partial w_1} = \frac{(0.999 - 1) \cdot 6 + (0.997 - 0) \cdot 5 + (0.993 - 0) \cdot 4 + (0.999 - 1) \cdot 7 + (0.999 - 1) \cdot 8 + (0.995 - 0) \cdot 2}{6}$$

$$\frac{\partial \text{Cost}(w_0, w_1)}{\partial w_1} = 1.821$$
Parameter update:

$$w_0 = w_0 - \text{lrate} \cdot \frac{\partial \text{Cost}(w_0, w_1)}{\partial w_0} = 1 - 0.01 \cdot 0.497 = 0.995$$

$$w_1 = w_1 - \text{lrate} \cdot \frac{\partial \text{Cost}(w_0, w_1)}{\partial w_1} = 1 - 0.01 \cdot 1.821 = 0.982$$
Iteration #2:
Now $w_0 = 0.995$ and $w_1 = 0.982$, with lrate = 0.01.
yhat: 6.887 5.905 4.923 7.869 8.851 2.959
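The iterations above can be reproduced with a short batch-gradient-descent sketch. It uses exact (unrounded) values of $a_i$, so the updated weights agree with the hand calculation to three decimals:

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

X = [6, 5, 4, 7, 8, 2]   # study hours per day
y = [1, 0, 0, 1, 1, 0]   # 1 = Pass, 0 = Fail

w0, w1, lrate = 1.0, 1.0, 0.01
m = len(X)

for it in range(2):
    a = [sigmoid(w0 + w1 * xi) for xi in X]           # a_i = sigmoid(yhat_i)
    grad0 = sum(ai - yi for ai, yi in zip(a, y)) / m  # dCost/dw0
    grad1 = sum((ai - yi) * xi for ai, yi, xi in zip(a, y, X)) / m  # dCost/dw1
    w0 -= lrate * grad0
    w1 -= lrate * grad1
    print(f"iteration {it + 1}: w0 = {w0:.3f}, w1 = {w1:.3f}")

# After iteration 1: w0 ≈ 0.995, w1 ≈ 0.982, matching the hand calculation.
```

Running the loop for more iterations continues to shrink the cost in the same way, one small step per pass over the six samples.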
Evaluation of Logistic regression model- The performance of a classification algorithm is judged by the confusion matrix, which comprises the classification counts of actual versus predicted labels. The confusion matrix for binary classification is given by:

                    Predicted: Positive      Predicted: Negative
Actual: Positive    True Positive (TP)       False Negative (FN)
Actual: Negative    False Positive (FP)      True Negative (TN)

From these counts, precision = TP / (TP + FP) and recall (also called sensitivity or true positive rate) = TP / (TP + FN). Similarly,
$$\text{specificity} = \frac{TN}{TN + FP}$$
F1-Score- The F1-score is a way to combine the precision and recall metrics into a single term. It is defined as the harmonic mean of precision and recall.

$$F1\text{-}score = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$
ROC Curve- The Receiver Operating Characteristic (ROC) curve measures the performance of a model by evaluating the trade-off between sensitivity (true positive rate) and 1 − specificity (false positive rate) across classification thresholds.
AUC- The area under the curve (AUC) is another measure for classification models, based on the ROC: it judges accuracy by the area under the ROC curve.
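The metrics above can be computed directly from the confusion-matrix counts; this is a minimal sketch (the helper name binary_metrics is illustrative):

```python
def binary_metrics(y_true, y_pred):
    """Confusion-matrix counts and the derived metrics for two classes."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)            # sensitivity / true positive rate
    specificity = tn / (tn + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall,
            "specificity": specificity, "f1": f1}

# Toy labels chosen to give 2 TP, 1 FN, 2 TN and 1 FP.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]
print(binary_metrics(y_true, y_pred))
```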