UNIT-III Lecture Notes
UNIT-III
Logistic Regression: Model Theory, Model fit Statistics, Model Construction, Analytics
applications to various Business Domains etc.
Regression- Concepts:
The above graph presents the linear relationship between the output (y) variable and the predictor (X) variables.
The blue line is referred to as the best fit straight line.
Based on the given data points, we attempt to plot a line that fits the points the best.
To calculate the best-fit line, linear regression uses the traditional slope-intercept form, which is given below,
Yi = β0 + β1Xi
This algorithm explains the linear relationship between the dependent (output) variable y and the independent (predictor) variable X using a straight line Y = B0 + B1X.
The goal of the linear regression algorithm is to get the best values for B0 and B1 to find
the best fit line.
The best fit line is a line that has the least error which means the error between predicted
values and actual values should be minimum.
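For illustration, here is a minimal Python sketch (not part of the original notes; the small x and y arrays are made-up data) of fitting the best-fit line Y = B0 + B1X by ordinary least squares:
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# closed-form least squares estimates of the slope and intercept
B1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
B0 = y.mean() - B1 * x.mean()

y_pred = B0 + B1 * x            # predicted values on the fitted line
residuals = y - y_pred          # errors between actual and predicted values
print(B0, B1, residuals)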
Random Error(Residuals):
In regression, the difference between the observed value of the dependent variable (yi) and the predicted value (ypredicted) is called the residual.
εi = yi – ypredicted
where ypredicted = B0 + B1Xi
Mathematically, the best fit line is obtained by minimizing the Residual Sum of
Squares(RSS)
The cost function helps to work out the optimal values for B0 and B1, which provides the
best fit line for the data points.
In Linear Regression, the Mean Squared Error (MSE) cost function is generally used, which is the average of the squared errors between ypredicted and yi:
MSE = (1/n) Σ (yi – ypredicted)²
Using the MSE function, we’ll update the values of B0 and B1 such that the MSE value
settles at the minima.
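As an illustration, here is a rough Python sketch (an assumed example, using the same made-up data as above) of updating B0 and B1 with gradient descent so that the MSE settles near its minimum:
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

B0, B1, lr = 0.0, 0.0, 0.01
for _ in range(5000):
    error = (B0 + B1 * x) - y              # ypredicted - yi for each point
    B0 -= lr * 2 * error.mean()            # gradient of MSE with respect to B0
    B1 -= lr * 2 * (error * x).mean()      # gradient of MSE with respect to B1

mse = np.mean(((B0 + B1 * x) - y) ** 2)
print(B0, B1, mse)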
The coefficient of determination (R2) measures how well the fitted line explains the variation in the data:
R2 = 1 – (RSS / TSS)
Residual Sum of Squares (RSS) is defined as the sum of the squares of the residuals over all data points in the plot/data. It is a measure of the difference between the expected and the actual observed output:
RSS = Σ (yi – ypredicted)²
Total Sum of Squares (TSS) is defined as the sum of the squared deviations of the data points from the mean of the response variable. Mathematically, TSS is:
TSS = Σ (yi – ȳ)²
Root Mean Squared Error (RMSE) and Residual Standard Error (RSE):
The Root Mean Squared Error is the square root of the variance of the residuals.
It specifies the absolute fit of the model to the data, i.e., how close the observed data points are to the predicted values.
Mathematically it can be represented as:
RMSE = sqrt( Σ (yi – ypredicted)² / n )
To make this estimate unbiased, one has to divide the sum of the squared residuals by the
degrees of freedom rather than the total number of data points in the model.
This term is then called the Residual Standard Error (RSE).
Mathematically it can be represented as:
RSE = sqrt( Σ (yi – ypredicted)² / (n – k – 1) ), where k is the number of predictors in the model.
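Here is a small Python sketch (illustrative; it assumes y is the vector of observed values, y_pred the model's fitted values, and k the number of predictors) computing the fit statistics defined above:
import numpy as np

def fit_statistics(y, y_pred, k=1):
    n = len(y)
    rss = np.sum((y - y_pred) ** 2)          # Residual Sum of Squares
    tss = np.sum((y - np.mean(y)) ** 2)      # Total Sum of Squares
    r2 = 1 - rss / tss                       # coefficient of determination
    rmse = np.sqrt(rss / n)                  # Root Mean Squared Error
    rse = np.sqrt(rss / (n - k - 1))         # Residual Standard Error (df = n - k - 1)
    return r2, rmse, rse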
Regression is a parametric approach, which means that it makes assumptions about the
data for the purpose of analysis.
For successful regression analysis, it’s essential to validate the following assumptions.
2. Independence of residuals:
The error terms should not be dependent on one another (like in time-series data
wherein the next value is dependent on the previous one).
There should be no correlation between the residual terms; the presence of such correlation is known as autocorrelation.
There should not be any visible patterns in the error terms.
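One simple, illustrative way to check this assumption in Python (a sketch, not a procedure prescribed by the notes) is to look at the lag-1 correlation of the residual series, which should be close to zero if the error terms are independent:
import numpy as np

def lag1_autocorrelation(residuals):
    r = np.asarray(residuals) - np.mean(residuals)
    return np.sum(r[1:] * r[:-1]) / np.sum(r ** 2)    # close to 0 suggests no autocorrelation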
All the four assumptions made for Simple Linear Regression still hold true for Multiple Linear
Regression along with a few new additional assumptions.
1. Overfitting: When more and more variables are added to a model, the model may
become far too complex and usually ends up memorizing all the data points in the
training set. This phenomenon is known as the overfitting of a model. This usually leads
to high training accuracy and very low test accuracy.
There have always been situations where a model performs well on training data but not
on the test data.
While training models on a dataset, overfitting and underfitting are the most common problems people face.
Before understanding overfitting and underfitting one must know about bias and
variance.
Bias:
Bias is a measure of how accurate the model is likely to be on future, unseen data.
Complex models, assuming there is enough training data available, can do predictions
accurately.
Whereas the models that are too naive, are very likely to perform badly with respect to
predictions.
Simply put, bias is the error a model makes because of the simplifying assumptions it imposes on the training data.
Generally, linear algorithms have a high bias, which makes them fast to learn and easier to understand but, in general, less flexible, implying lower predictive performance on complex problems.
Variance:
Variance is the sensitivity of the model towards the training data; that is, it quantifies how much the model will react when the input data is changed.
Ideally, the model should not change too much from one training dataset to the next, which means the algorithm is good at picking out the hidden underlying patterns between the input and output variables.
Ideally, a model should have lower variance which means that the model doesn’t change
drastically after changing the training data(it is generalizable).
Having higher variance will make a model change drastically even on a small change in
the training dataset.
The aim of any supervised machine learning algorithm is to achieve low bias and low variance, so that the model is robust.
Overfitting:
When a model learns each and every pattern and noise in the data to such an extent that it affects the performance of the model on unseen future data, it is referred to as overfitting.
The model fits the data so well that it interprets noise as patterns in the data.
When a model has low bias and high variance, it ends up memorizing the data, causing overfitting.
Overfitting causes the model to become specific rather than generic.
This usually leads to high training accuracy and very low test accuracy.
Detecting overfitting is useful, but it doesn’t solve the actual problem. There are several ways to prevent overfitting, which are stated below (a brief code sketch of two of them follows the list):
Cross-validation
If the training data is too small to train on, add more relevant and clean data.
If the training data is too large, do some feature selection and remove unnecessary
features.
Regularization
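Below is a brief, illustrative Python sketch (an assumed example using scikit-learn and made-up data) of two of the remedies above: k-fold cross-validation to detect overfitting and Ridge (L2) regularization to reduce it:
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))                 # made-up data: 100 rows, 10 features
y = X[:, 0] * 3.0 + rng.normal(size=100)       # only the first feature truly matters

model = Ridge(alpha=1.0)                       # alpha controls the L2 penalty strength
scores = cross_val_score(model, X, y, cv=5)    # 5-fold cross-validated R2 scores
print(scores.mean())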
Underfitting:
When a model is too simple to capture the underlying patterns in the data, it performs poorly on both the training data and the test data; this is referred to as underfitting. It usually results from high bias and low variance.
In simple linear regression or multiple linear regression, we make some basic assumptions on the error term.
Assumptions:
In practice, of course, we have a collection of observations, but we do not know the values of the coefficients β0, β1, …, βk. These need to be estimated from the data.
The least squares principle provides a way of choosing the coefficients effectively by
minimising the sum of the squared errors.
That is, we choose the values of β0, β1, …, βk that minimise
Σ εi² = Σ (yi – β0 – β1x1,i – … – βkxk,i)²
This is called least squares estimation because it gives the least value for the sum of squared errors.
Finding the best estimates of the coefficients is often called “fitting” the model to the data.
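A minimal Python sketch (made-up data, not from the notes) of least squares estimation for multiple linear regression, choosing the coefficients that minimise the sum of squared errors via numpy's least squares solver:
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))                         # 50 observations, k = 3 predictors
y = 2.0 + X @ np.array([1.5, -0.7, 0.3]) + rng.normal(scale=0.5, size=50)

X_design = np.column_stack([np.ones(len(X)), X])     # add a column of 1s for beta_0
beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)  # minimises the sum of squared errors
print(beta)                                          # [beta_0, beta_1, beta_2, beta_3]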
In most practical problems, the analyst has a rather large pool of possible candidate
regressors(variables), of which only a few are likely to be important.
Finding an appropriate subset of regressors(variables) for the model is often called the
variable selection problem or Variable Rationalization.
The basic steps for variable selection are as follows:
1. Specify the maximum model to be considered.
2. Specify a criterion for selecting a model.
3. Specify a strategy for selecting variables.
4. Conduct the specified analysis.
5. Evaluate the validity of the model chosen.
Step 1: Specifying the maximum Model: The maximum model is defined to be the largest
model (the one having the most predictor variables) considered at any point in the process of
model selection.
Step 2: Specifying a Criterion for Selecting a Model: There are several criteria that can be used to evaluate subset regression models, such as the F-test statistic, the coefficient of determination, the residual mean square, and Mallow’s Cp statistic.
Step 3: Specifying a Strategy for Selecting Variables:
(a) All possible regression procedure: The all possible regression procedure requires that we fit each possible regression equation associated with each possible combination of the k independent variables.
(c)Forward Selection Procedure: The procedure begins with the assumption that there are no
regressors in the model other than the intercept. Forward selection is a type of stepwise
regression which begins with an empty model and adds in variables one by one. In each forward
step, you add the one variable that gives the single best improvement to your model.
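A simplified Python sketch (an assumed implementation, not the notes’ procedure verbatim) of forward selection: start with an empty model and, at each step, add the predictor that most improves the cross-validated R2 of a linear regression:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def forward_selection(X, y, max_features=None):
    remaining = list(range(X.shape[1]))
    selected, best_score = [], -np.inf
    while remaining and (max_features is None or len(selected) < max_features):
        # score every candidate model that adds exactly one more variable
        scores = [(cross_val_score(LinearRegression(), X[:, selected + [j]], y, cv=5).mean(), j)
                  for j in remaining]
        score, j = max(scores)
        if score <= best_score:        # stop when no single variable improves the model
            break
        selected.append(j)
        remaining.remove(j)
        best_score = score
    return selected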
Step 4: Conducting the Specified Analysis: Conduct the analysis using the above-specified strategies and criteria, and choose the best model that fits the data.
Step 5: Evaluating the Validity of the Chosen Model: Check the validity of the chosen model by using evaluation metrics such as R-squared, RMSE, RSS, and so on.
To select the best regression equation, carry out the following steps:
6) Perform thorough analyses of the “best models” (usually three to five models).
8) Discuss with the subject-matter experts the relative advantages and disadvantages.
Logistic Regression - Model Theory:
The Logistic Regression equation can be obtained from the Linear Regression equation. The mathematical steps to get the Logistic Regression equation are given below.
The Linear Regression equation is:
y = b0 + b1x1 + b2x2 + … + bnxn
In Logistic Regression, y can be between 0 and 1 only, so let's divide the above equation by (1 – y):
y / (1 – y); this is 0 for y = 0 and infinity for y = 1
But we need a range between –infinity and +infinity, so we take the logarithm of the equation, and it becomes:
log( y / (1 – y) ) = b0 + b1x1 + b2x2 + … + bnxn
The standard logistic function is simply the inverse of the logit equation above. If we solve for y from the logit equation, the formula of the logistic function is:
y = 1 / (1 + e^-(b0 + b1x1 + b2x2 + … + bnxn))
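A small Python sketch of the logit and its inverse, the standard logistic (sigmoid) function, as derived above:
import numpy as np

def logit(y):
    return np.log(y / (1 - y))         # maps probabilities in (0, 1) to (-infinity, +infinity)

def logistic(z):
    return 1 / (1 + np.exp(-z))        # inverse of the logit: maps any real number back to (0, 1)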
Model Fit Statistic- Maximum Likelihood Estimation (The Best Fit Model):
Like linear regression, the logistic regression algorithm finds the best values of
coefficients (b0, b1, …,bn) to fit the training dataset.
The standard way to determine the best fit for logistic regression is maximum likelihood
estimation (MLE)
In this estimation method, we use a likelihood function that measures how well a set of
parameters fit a sample of data.
The parameter values that maximize the likelihood function are the maximum likelihood
estimates.
Model Construction:
Let’s see a simple example with a small dataset of five observations, each with one input variable x1 and a binary output y.
With one input variable x1, the logistic regression formula becomes:
log(p/(1-p)) = w0 + w1*x1
or
p = 1/(1 + e^(-(w0 + w1*x1)))
Since y is binary with values 0 or 1, a Bernoulli random variable can be used to model its probability:
P(y=1) = p
P(y=0) = 1 – p
Or:
P(y) = (p^y)*(1-p)^(1-y)
with y being either 0 or 1
How do we model the distribution of multiple observations like P(y0, y1, y2, y3, y4)?
Let’s assume these observations are mutually independent from each other. Then we can write
the joint distribution of the training dataset as:
P(y0, y1, y2, y3, y4) = P(y0) * P(y1) * P(y2) * P(y3) * P(y4)
Let’s assume P(yi = 1) = pi for i = 0, 1, 2, 3, 4. Then we can rewrite the formula as the likelihood function:
L(w0, w1) = (p0^y0)*(1-p0)^(1-y0) * (p1^y1)*(1-p1)^(1-y1) * … * (p4^y4)*(1-p4)^(1-y4)
We can calculate the p estimate for each observation based on the logistic function formula. The observed labels in the example dataset are:
y0 = 0
y1 = 0
y2 = 0
y3 = 1
y4 = 1
The goal is to choose the values of w0 and w1 that result in the maximum likelihood based on
the training dataset.
Note that it’s computationally more convenient to optimize the log-likelihood function. Since
the natural logarithm is a strictly increasing function, the same w0 and w1 values that maximize
L would also maximize l = log(L).
In machine learning, we prefer to minimize cost/loss functions, so we often define the cost function as the negative of the average log-likelihood.
cost function = – avg(l(w0, w1)) = – 1/5 * l(w0, w1) = – 1/5 * (y0*log(p0) + (1-y0)*log(1-p0) +
y1*log(p1) + (1-y1)*log(1-p1) + … + (1-y4)*log(1-p4))
Maximizing the (log) likelihood is the same as minimizing the cross entropy loss function.
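An illustrative Python sketch of this cost function for the five-observation example (the x1 values below are assumed for illustration only, since the notes do not list them):
import numpy as np

x1 = np.array([0.2, 0.5, 0.8, 1.0, 1.2])      # assumed inputs, for illustration only
y = np.array([0, 0, 0, 1, 1])                 # the labels from the example above

def cost(w0, w1):
    p = 1 / (1 + np.exp(-(w0 + w1 * x1)))     # p_i from the logistic function
    log_likelihood = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    return -log_likelihood / len(y)           # negative average log-likelihood (cross entropy)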
Optimization Methods:
Unlike OLS estimation for the linear regression, we don’t have a closed-form solution for
the MLE. But we do know that the cost function is convex, which means a local
minimum is also the global minimum.
To minimize this cost function, Python libraries such as scikit-learn (sklearn) use
numerical methods similar to Gradient Descent. And since sklearn uses gradients to
minimize the cost function, it’s better to scale the input variables and/or use
regularization to make the algorithm more stable.
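An illustrative fit using scikit-learn (the x1 values are assumed, since the notes’ exact dataset is not reproduced here), with the input scaled as recommended above:
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X = np.array([[0.2], [0.5], [0.8], [1.0], [1.2]])   # assumed x1 values, for illustration
y = np.array([0, 0, 0, 1, 1])

model = make_pipeline(StandardScaler(), LogisticRegression())   # scale, then fit
model.fit(X, y)
print(model.predict_proba([[0.9]]))                 # probabilities of class 0 and class 1 for x1 = 0.9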
Model Interpretations:
Using the Logistic Regression algorithm in Python’s sklearn on our example dataset, we find the best estimates to be w0 = -4.411 and w1 = 4.759.
We can plot the logistic regression with the sample dataset. As you can see, the output y only has
two values of 0 and 1, while the logistic function has an S shape.
Since w1 = 4.759, with a one-unit increase in x1, the log odds are expected to increase by 4.759.
Given a new observation, how would we predict which class y = 0 or 1 it belongs to?
For example, say a new observation has input variable x1 = 0.9. By using the logistic regression equation estimated from MLE, we can calculate the probability p that it belongs to class y = 1.
If we use 50% as the threshold, we would predict that this observation is in class 0, since p <
50%.
Since the logistic regression has an S shape, the larger x1, the more likely the observation has
class y = 1.
At the threshold of probability p=50%, the odds are p/(1-p) = 50%/50% = 1. So the log(odds) =
log(1) = 0.
Setting w0 + w1*x1 = 0 and solving for x1, we get x1 = -w0/w1 = 4.411/4.759 ≈ 0.927. That is the threshold of x1 for prediction, i.e., when x1 > 0.927, the observation will be classified as y = 1.
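A quick Python check of these numbers using the estimated coefficients: the decision threshold is x1 = -w0/w1, and an observation with x1 = 0.9 falls just below it, so p < 0.5 and the predicted class is 0:
import numpy as np

w0, w1 = -4.411, 4.759
threshold = -w0 / w1                          # about 0.927
p = 1 / (1 + np.exp(-(w0 + w1 * 0.9)))        # about 0.47 for x1 = 0.9
print(threshold, p, int(p >= 0.5))            # threshold, probability, predicted class (0)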
Types of Logistic Regression: Logistic Regression can be classified into three types:
Binomial: In binomial Logistic Regression, there can be only two possible types of the dependent variable, such as 0 or 1, Pass or Fail, etc.
Multinomial: In multinomial Logistic Regression, there can be 3 or more possible unordered types of the dependent variable, such as "cat", "dog", or "sheep".
Ordinal: In ordinal Logistic Regression, there can be 3 or more possible ordered types of the dependent variable, such as "low", "medium", or "high".
Common applications of Logistic Regression include:
Fraud Detection
Customer Churn Prediction
Cancer Diagnosis
Analytics Applications to Various Business Domains:
Finance
Business Analytics (BA) is of utmost importance to the finance sector. Data Scientists are in high demand in investment banking, portfolio management, financial planning, budgeting, forecasting, etc.
For example: Companies these days have a large amount of financial data. The use of intelligent Business Analytics tools can help them use this data to determine product prices. Also, on the basis of historical information, Business Analysts can study the trends in the performance of a particular stock and advise the client on whether to retain it or sell it.
Marketing
Business Analytics helps in studying the buying patterns and behaviour of consumers, analysing trends, identifying the target audience, employing advertising techniques that appeal to the consumers, forecasting supply requirements, etc.
For example: Use Business Analytics to gauge the effectiveness and impact of a
marketing strategy on the customers. Data can be used to build loyal customers by giving
them exactly what they want as per their specifications.
HR Professionals
HR professionals can make use of data to find information about educational background
of high performing candidates, employee attrition rate, number of years of service of
employees, age, gender, etc. This information can play a pivotal role in the selection
procedure of a candidate.
For example: An HR manager can predict the employee retention rate on the basis of data given by Business Analytics.
CRM
Business Analytics helps one analyse the key performance indicators, which further helps in decision-making and in devising strategies to boost relationships with consumers. Demographics, data about other socio-economic factors, purchasing patterns, lifestyle, etc., are of prime importance to the CRM department.
For example: The company wants to improve its service in a particular geographical
segment. With data analytics, one can predict the customer’s preferences in that particular
segment, what appeals to them, and accordingly improve relations with customers.
Manufacturing
Business Analytics can help you in supply chain management, inventory management, measuring performance against targets, risk mitigation plans, improving efficiency on the basis of product data, etc.
For example: The manager wants information on the performance of a machine that has been in use for the past 10 years. The historical data will help evaluate the performance of the machine and decide whether the cost of maintaining it will exceed the cost of buying new machinery.
Credit Card Companies
Credit card transactions of a customer can reveal many factors: financial health, lifestyle, purchase preferences, behavioural trends, etc.
For example: Credit card companies can help the retail sector by locating the target audience. According to the transaction reports, retail companies can predict the choices of consumers, their spending patterns, their preference for buying competitors’ products, etc. This historical as well as real-time information helps them direct their marketing strategies so that they hit the mark and reach the right audience.
*****