Reference Material: Logistic Regression
Prerequisites-
- Descriptive statistics
- Linear regression
Objectives-
- To understand the prerequisite terms: odds, probability, log odds, the logit function, and the sigmoid function.
- To understand what logistic regression is and its corresponding cost function.
- To learn how to optimize weights using gradient descent.
Odds and Probability-
The chances of success or failure of an event are often expressed as odds and probabilities. Probability describes how likely an event is to occur. Odds is the ratio of the probability of success of an event to the probability of its failure.
kumar.ashish2050@gmail.com
Probability- Probability is defined as the ratio of the number of ways favourable to the occurrence of an event A to the total number of outcomes of the experiment. Probabilities are expressed either as percentages or decimals, and the value lies between 0 and 1.
Probability P(A) = (chances in favor of A) / (total trials)
Example: The probability of drawing an Ace from a deck of 52 cards is:
P(A) = 4/52 = 0.077, or 7.7%
Odds- Odds is defined as the ratio of the chances in favor of an event to the chances against it. The value of odds may lie between 0 and ∞.
odds(A) = (chances in favor of A) / (chances against A)
Mathematically, odds = P / (1 − P), where P denotes the probability of success of the desired event.
The logit function is the log-odds function modelled for a probability P:
log(odds) = logit(P) = log(P / (1 − P))
Logit Function- The logit function is mainly used when working with probabilities. It is the log of the odds that Y equals one of the categories. The value of the logit function varies over (−∞, ∞): it approaches ∞ as the probability approaches 1, and −∞ as the probability approaches 0. The logit function is very important in statistics because it maps probability values in the range (0, 1) to the full range of real numbers.
logit(z) = log(z / (1 − z))
Sigmoid function- The sigmoid function can be thought of as the inverse of the logit function. This means that for a probability value P we have:
P = sigmoid(logit(P))
Sigmoid (sometimes also called the S-curve) performs the inverse of logit, which means it maps any arbitrary real number into the range (0, 1). The function is defined as:
sigmoid(z) = 1 / (1 + e^(−z))
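A minimal sketch in plain Python (standard library only) showing that sigmoid and logit are inverses of each other:

```python
import math

def sigmoid(z):
    # Maps any real number z into the range (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    # Log-odds: maps a probability p in (0, 1) onto the full real line
    return math.log(p / (1.0 - p))

# sigmoid is the inverse of logit: P = sigmoid(logit(P))
p = 0.8
assert abs(sigmoid(logit(p)) - p) < 1e-12
```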
1. Churn prediction- Churn is the probability that a client abandons a service or stops paying that particular service provider. The ratio of clients that abandon the service during a particular time interval is called the churn rate. Churn prediction is treated as a binary classification problem: whether or not a client is going to churn. For example, whether a particular client churns can be modelled on the basis of the monthly charge of the service.
P(churn = 1 | monthly charge) = 1 / (1 + e^(−(w0 + w1 · monthly charge)))
Here, we have set the probability cut-off at 0.5 for predicting the target classes.
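As an illustration, here is a small sketch of this churn model in Python; the weight values w0 and w1 are made up for the example (in practice they would be learned from data):

```python
import math

def churn_probability(monthly_charge, w0, w1):
    # P(churn = 1 | monthly charge): sigmoid of the linear score
    score = w0 + w1 * monthly_charge
    return 1.0 / (1.0 + math.exp(-score))

# Hypothetical weights, for illustration only
w0, w1 = -3.0, 0.05

p = churn_probability(monthly_charge=80.0, w0=w0, w1=w1)  # sigmoid(1.0) ≈ 0.731
prediction = 1 if p >= 0.5 else 0  # 0.5 probability cut-off → predicts churn
```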
where
a_i = sigm(yhat_i)
and
sigm(yhat_i) = 1 / (1 + e^(−yhat_i))
The cost function can be divided into two separate functions as:
cost(w0, w1) = −(1/m) Σ_{i=1}^{m} log(a_i)   if y = 1
and
cost(w0, w1) = −(1/m) Σ_{i=1}^{m} log(1 − a_i)   if y = 0
where
a_i = 1 / (1 + e^(−yhat_i))
and
yhat_i = w0 + w1 x_i
Now let us find the derivative of the cost function (by the chain rule of partial derivatives). So,
∂Cost/∂a_i = (a_i − y_i) / (a_i (1 − a_i))
and
∂a_i/∂yhat_i = a_i (1 − a_i)
and
∂yhat_i/∂w_1 = x_i,   ∂yhat_i/∂w_0 = 1
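Multiplying these three partial derivatives together (chain rule) recovers the gradient used for the parameter updates, a worked combination written in LaTeX:

```latex
\frac{\partial \mathrm{Cost}}{\partial w_1}
  = \frac{1}{m}\sum_{i=1}^{m}
    \frac{\partial \mathrm{Cost}}{\partial a_i}\,
    \frac{\partial a_i}{\partial \hat{y}_i}\,
    \frac{\partial \hat{y}_i}{\partial w_1}
  = \frac{1}{m}\sum_{i=1}^{m}
    \frac{a_i - y_i}{a_i(1 - a_i)} \cdot a_i(1 - a_i) \cdot x_i
  = \frac{1}{m}\sum_{i=1}^{m} (a_i - y_i)\,x_i
```

The a_i(1 − a_i) factors cancel, which is why the final gradient is so simple.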
Parameter update:
w_i = w_i − lrate · (∂Cost/∂w_i)
Example- Consider an example where we are interested in the effect of study hours per day on the examination result, and in predicting whether a student will pass or fail for a given number of study hours. We have sample data for six students: their study hours per day and pass/fail results.
Study hours per day | Fail/Pass
6 | 1 (Pass)
5 | 0 (Fail)
4 | 0 (Fail)
7 | 1 (Pass)
8 | 1 (Pass)
2 | 0 (Fail)
Figure-1: Scatter graph of study hours vs. pass/fail.
To solve the problem using logistic regression, let us model the linear equation as:
y(Result) = w0 + w1 · x(Study hours per day)
𝑦ℎ𝑎𝑡𝑖 = 𝑤0 + 𝑤1 𝑥𝑖
and
a_i = sigm(yhat_i) = 1 / (1 + e^(−yhat_i))
Cost Function:
Cost(w0, w1) = (1/m) Σ_{i=1}^{m} { −y_i log(a_i) − (1 − y_i) log(1 − a_i) }
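This combined cost can be sketched in plain Python (a direct translation of the formula, not an optimized implementation):

```python
import math

def cost(w0, w1, X, y):
    # Mean binary cross-entropy over the m training samples
    m = len(X)
    total = 0.0
    for xi, yi in zip(X, y):
        a = 1.0 / (1.0 + math.exp(-(w0 + w1 * xi)))  # a_i = sigm(yhat_i)
        total += -yi * math.log(a) - (1 - yi) * math.log(1 - a)
    return total / m

# Cost at the initial weights for the study-hours data
c = cost(1.0, 1.0, [6, 5, 4, 7, 8, 2], [1, 0, 0, 1, 1, 0])
```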
Gradients:
∂Cost(w0, w1)/∂w0 = (1/m) Σ_{i=1}^{m} (a_i − y_i)
and
∂Cost(w0, w1)/∂w1 = (1/m) Σ_{i=1}^{m} (a_i − y_i) x_i
Parameter updates:
𝜕𝐶𝑜𝑠𝑡(𝑤0 , 𝑤1 )
𝑤0 = 𝑤0 − 𝑙𝑟𝑎𝑡𝑒
𝜕𝑤0
and
𝜕𝐶𝑜𝑠𝑡(𝑤0 , 𝑤1 )
𝑤1 = 𝑤1 − 𝑙𝑟𝑎𝑡𝑒
𝜕𝑤1
We have,
X: 6 5 4 7 8 2
y: 1 0 0 1 1 0
Iteration #1:
Let w0 = 1 and w1 = 1, with lrate = 0.01
yhat_i = w0 + w1 x_i and a_i = sigm(yhat_i) = 1 / (1 + e^(−yhat_i))
yhat: 7 6 5 8 9 3
a: 0.9991 0.9975 0.9933 0.9997 0.9999 0.9526
So,
∂Cost(w0, w1)/∂w0 = [(0.9991 − 1) + (0.9975 − 0) + (0.9933 − 0) + (0.9997 − 1) + (0.9999 − 1) + (0.9526 − 0)] / 6
∂Cost(w0, w1)/∂w0 = 0.490
and
∂Cost(w0, w1)/∂w1 = [(0.9991 − 1)·6 + (0.9975 − 0)·5 + (0.9933 − 0)·4 + (0.9997 − 1)·7 + (0.9999 − 1)·8 + (0.9526 − 0)·2] / 6
∂Cost(w0, w1)/∂w1 = 1.810
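The first gradient-descent step can be reproduced in a few lines of plain Python (a sketch of one iteration, not a full training loop):

```python
import math

# Data from the example: study hours per day and pass/fail labels
X = [6, 5, 4, 7, 8, 2]
y = [1, 0, 0, 1, 1, 0]

w0, w1, lrate = 1.0, 1.0, 0.01
m = len(X)

# Forward pass
yhat = [w0 + w1 * x for x in X]                 # [7, 6, 5, 8, 9, 3]
a = [1.0 / (1.0 + math.exp(-z)) for z in yhat]  # sigmoid of each score

# Gradients of the cost with respect to w0 and w1
g0 = sum(ai - yi for ai, yi in zip(a, y)) / m
g1 = sum((ai - yi) * xi for ai, yi, xi in zip(a, y, X)) / m

# Parameter update
w0 -= lrate * g0
w1 -= lrate * g1
```

Repeating these steps over many iterations drives the weights toward values that minimize the cost.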