
UNIT-III Lecture Notes


Data Analytics (R20) Department of CSE(AI&ML)

UNIT-III

Regression: Concepts, BLUE property assumptions, Least Square Estimation, Variable Rationalization, and Model Building.

Logistic Regression: Model Theory, Model Fit Statistics, Model Construction, Analytics applications to various Business Domains.

Regression- Concepts:

 Regression analysis is a form of predictive modelling technique which investigates the relationship between a dependent variable and one or more independent variables.
 Linear regression models the linear relationship between the independent (predictor) variable on the X-axis and the dependent (output) variable on the Y-axis.
 If there is a single input variable X (the independent variable), such linear regression is called simple linear regression.

[Figure: scatter plot of data points with the fitted regression line.]

 The graph presents the linear relationship between the output (y) variable and the predictor (X) variable.
 The blue line is referred to as the best fit straight line.
 Based on the given data points, we attempt to plot the line that fits the points best.
 To calculate the best fit line, linear regression uses the traditional slope-intercept form given below,

Yi = β0 + β1Xi


where Yi = Dependent variable, β0 = Constant/Intercept, β1 = Slope/Coefficient, Xi = Independent variable.

This algorithm explains the linear relationship between the dependent(output) variable y and the
independent(predictor) variable X using a straight line Y= B0 + B1 X.

 The goal of the linear regression algorithm is to get the best values for B0 and B1 to find the best fit line.
 The best fit line is the line with the least error, meaning the error between predicted and actual values should be minimum. (A minimal fitting sketch follows below.)
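As a quick illustration, the following is a minimal sketch of fitting the best fit line in Python with NumPy; the X and Y values here are made-up sample data, not values from these notes.

import numpy as np

# Made-up sample data (illustrative assumption)
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])

# np.polyfit with degree 1 returns [slope, intercept] for the least squares line
b1, b0 = np.polyfit(X, Y, deg=1)
print(f"B0 (intercept) = {b0:.3f}, B1 (slope) = {b1:.3f}")

Y_pred = b0 + b1 * X  # predicted values on the best fit line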

Random Error (Residuals):

 In regression, the difference between the observed value of the dependent variable (yi) and the predicted value (ypredicted) is called the residual.
εi = yi – ypredicted

where ypredicted = B0 + B1 Xi
 Mathematically, the best fit line is obtained by minimizing the Residual Sum of Squares (RSS),

RSS = Σ (yi – ypredicted)²

Cost function for Linear Regression:

 The cost function helps to work out the optimal values for B0 and B1, which provides the
best fit line for the data points.
 In Linear Regression, generally the Mean Squared Error (MSE) cost function is used, which is the average of the squared errors between ypredicted and yi.


 Using the simple linear equation y = mx + b, MSE is calculated as:

MSE = (1/n) Σ (yi – (m xi + b))²

where n is the number of data points.

 Using the MSE function, we'll update the values of B0 and B1 such that the MSE value settles at the minima, as in the gradient descent sketch below.
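The sketch below shows one way this update could be done with batch gradient descent; the learning rate, iteration count, and data are illustrative assumptions.

import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # made-up data
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])

def mse(y_true, y_pred):
    # Mean Squared Error: average of the squared differences
    return np.mean((y_true - y_pred) ** 2)

b0, b1, lr = 0.0, 0.0, 0.01  # start at zero; lr is an assumed learning rate
for _ in range(5000):
    y_pred = b0 + b1 * X
    # Gradients of MSE with respect to B0 and B1
    b0 -= lr * (-2 * np.mean(y - y_pred))
    b1 -= lr * (-2 * np.mean((y - y_pred) * X))

print(b0, b1, mse(y, b0 + b1 * X))  # MSE settles near its minimum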

Evaluation Metrics for Linear Regression:

1. Coefficient of Determination or R-Squared (R2)
2. Root Mean Squared Error (RMSE) and Residual Standard Error (RSE)

Coefficient of Determination or R-Squared (R2):

 R-Squared is a number that indicates the proportion of variation in the dependent variable that is explained/captured by the developed model.
 It always ranges between 0 & 1.
 Overall, the higher the value of R-squared, the better the model fits the data.
 Mathematically it can be represented as,

R2 = 1 – ( RSS/TSS )

 Residual Sum of Squares (RSS) is defined as the sum of squares of the residuals for each data point in the plot/data:

RSS = Σ (yi – ypredicted)²

 It is a measure of the difference between the expected and the actual observed output.

 Total Sum of Squares (TSS) is defined as the sum of squared deviations of the data points from the mean of the response variable.
 Mathematically TSS is,

TSS = Σ (yi – ȳ)²

where ȳ (y bar) is the mean of the sample data points.

 The significance of R-squared can be illustrated by comparing fitted plots: the higher the R-squared, the more closely the regression line tracks the observed data points.


Root Mean Squared Error (RMSE) and Residual Standard Error (RSE):

 The Root Mean Squared Error is the square root of the variance of the residuals.
 It specifies the absolute fit of the model to the data i.e. how close the observed data points
are to the predicted values.
 Mathematically it can be represented as,

RMSE = √( Σ (yi – ypredicted)² / n )

 To make this estimate unbiased, one has to divide the sum of the squared residuals by the
degrees of freedom rather than the total number of data points in the model.
 This term is then called the Residual Standard Error(RSE).
 Mathematically it can be represented as,

RSE = √( RSS / (n – k – 1) )

where n is the number of data points and k is the number of predictors (so n – 2 for simple linear regression).

 R-squared is a better measure than RMSE, because the value of Root Mean Squared Error depends on the units of the variables (i.e. it is not a normalized measure) and can change with a change in the units of the variables. (A sketch computing all three metrics follows below.)
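A minimal sketch of computing these three metrics in Python follows; the observed and predicted values are made up, and the n – 2 degrees of freedom assumes simple linear regression.

import numpy as np

y      = np.array([1.2, 1.9, 3.2, 3.8, 5.1])  # observed values (illustrative)
y_pred = np.array([1.1, 2.1, 3.0, 4.0, 4.9])  # model predictions (illustrative)

rss = np.sum((y - y_pred) ** 2)        # Residual Sum of Squares
tss = np.sum((y - np.mean(y)) ** 2)    # Total Sum of Squares

r_squared = 1 - rss / tss
rmse = np.sqrt(np.mean((y - y_pred) ** 2))
rse = np.sqrt(rss / (len(y) - 2))      # divide by degrees of freedom, not n

print(r_squared, rmse, rse)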

Assumptions of Linear Regression:

 Regression is a parametric approach, which means that it makes assumptions about the
data for the purpose of analysis.
 For successful regression analysis, it’s essential to validate the following assumptions.

1. Linearity of residuals: There needs to be a linear relationship between the dependent variable and the independent variable(s).

2. Independence of residuals:
 The error terms should not be dependent on one another (like in time-series data
wherein the next value is dependent on the previous one).
 There should be no correlation between the residual terms; the presence of such correlation is known as autocorrelation.
 There should not be any visible patterns in the error terms.

3. Normal distribution of residuals:


 The residuals should follow a normal distribution with a mean equal to zero or close to zero.
 This is checked in order to verify whether the selected line is actually the line of best fit or not.
 If the error terms are non-normally distributed, it suggests that there are a few unusual data points that must be studied closely to make a better model.


4. The equal variance of residuals:


 The error terms must have constant variance. This phenomenon is known as
Homoscedasticity.
 The presence of non-constant variance in the error terms is referred to
as Heteroscedasticity.
 Generally, non-constant variance arises in the presence of outliers or extreme leverage values. (A visual diagnostic sketch for these assumptions follows below.)
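The following is a rough diagnostic sketch for checking these assumptions visually with matplotlib; the observed and predicted values are made-up assumptions, standing in for a previously fitted model.

import numpy as np
import matplotlib.pyplot as plt

y      = np.array([1.2, 1.9, 3.2, 3.8, 5.1])  # observed (illustrative)
y_pred = np.array([1.1, 2.1, 3.0, 4.0, 4.9])  # predicted (illustrative)
residuals = y - y_pred

print("mean of residuals:", residuals.mean())  # should be close to zero

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(y_pred, residuals)   # look for no pattern and a constant spread
ax1.axhline(0, color="red")
ax1.set(xlabel="Predicted values", ylabel="Residuals", title="Residuals vs Predicted")
ax2.hist(residuals, bins=5)      # should look roughly normal, centred on zero
ax2.set(title="Distribution of residuals")
plt.show()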

Multiple Linear Regression:

 Multiple linear regression is a technique to understand the relationship between a single dependent variable and multiple independent variables.
 The formulation for multiple linear regression is also similar to simple linear regression, with the small change that instead of having one beta variable, you will now have betas for all the variables used. The formula is given as:
Y = B0 + B1X1 + B2X2 + … + BpXp + ε
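A minimal multiple linear regression sketch with scikit-learn is shown below; the two-predictor dataset is a made-up illustration.

import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data with two predictors X1 and X2
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0],
              [5.0, 5.0]])
y = np.array([3.1, 3.9, 7.2, 7.8, 10.1])

model = LinearRegression().fit(X, y)
print("B0 (intercept):", model.intercept_)
print("B1, B2 (coefficients):", model.coef_)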

Considerations of Multiple Linear Regression:

All four assumptions made for Simple Linear Regression still hold true for Multiple Linear Regression, along with a few additional considerations:

1. Overfitting: When more and more variables are added to a model, the model may
become far too complex and usually ends up memorizing all the data points in the
training set. This phenomenon is known as the overfitting of a model. This usually leads
to high training accuracy and very low test accuracy.


2. Multicollinearity: It is the phenomenon where, in a model with several independent variables, some of those variables are interrelated.
3. Feature Selection: With more variables present, selecting the optimal set of predictors
from the pool of given features (many of which might be redundant) becomes an
important task for building a relevant and better model.

Overfitting and Underfitting in Linear Regression:

 There have always been situations where a model performs well on training data but not
on the test data.
 While training models on a dataset, overfitting and underfitting are the most common problems people face.
 Before understanding overfitting and underfitting one must know about bias and
variance.

Bias:

 Bias is a measure to determine how accurate the model is likely to be on future unseen data.
 Complex models, assuming there is enough training data available, can make predictions accurately.
 Models that are too naive, on the other hand, are very likely to perform badly with respect to predictions.
 Simply put, bias is the error arising from overly simple assumptions made while learning from the training data.
 Generally, linear algorithms have a high bias, which makes them fast to learn and easier to understand but, in general, less flexible, implying lower predictive performance on complex problems.

Variance:

 Variance is the sensitivity of the model towards training data, that is it quantifies how
much the model will react when input data is changed.
 Ideally, the model shouldn't change too much from one training dataset to the next, which means the algorithm is good at picking out the hidden underlying patterns between the inputs and the output variables.
 Ideally, a model should have lower variance which means that the model doesn’t change
drastically after changing the training data(it is generalizable).
 Having higher variance will make a model change drastically even on a small change in
the training dataset.

Bias Variance Tradeoff:

 The aim of any supervised machine learning algorithm is to achieve low bias and low variance, as such a model is more robust.


 In this way, the algorithm can achieve better performance.


 There is no escape from the relationship between bias and variance in machine learning.

 There is an inverse relationship between bias and variance:


o An increase in bias will decrease the variance.
o An increase in the variance will decrease the bias.
 There is a trade-off that plays between these two concepts and the algorithms must find a
balance between bias and variance.

Overfitting:

 When a model learns each and every pattern and noise in the data to such an extent that it affects the performance of the model on unseen future data, it is referred to as overfitting.
 The model fits the data so well that it interprets noise as patterns in the data.
 When a model has low bias and higher variance it ends up memorizing the data and
causing overfitting.
 Overfitting causes the model to become specific rather than generic.
 This usually leads to high training accuracy and very low test accuracy.
 Detecting overfitting is useful, but it doesn’t solve the actual problem. There are several
ways to prevent overfitting, which are stated below:

 Cross-validation
 If the training data is too small, add more relevant and clean data.
 If there are too many features, do some feature selection and remove unnecessary ones.
 Regularization (see the sketch below)
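The sketch below illustrates two of these remedies, cross-validation and (Ridge) regularization, with scikit-learn; the synthetic dataset and the alpha value are illustrative assumptions.

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))   # 10 features, most of them irrelevant noise
y = 2 * X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=50)

model = Ridge(alpha=1.0)        # alpha controls the regularization strength
scores = cross_val_score(model, X, y, cv=5, scoring="r2")
print("cross-validated R^2 per fold:", scores)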


Underfitting:

 Underfitting is not discussed as often as overfitting.
 When a model fails to learn from the training dataset and is also not able to generalize to the test dataset, it is referred to as underfitting.
 This type of problem can be detected very easily by the performance metrics.
 When a model has high bias and low variance, it ends up not generalizing the data, causing underfitting.
 It is unable to find the hidden underlying patterns in the data.
 This usually leads to low training accuracy and low test accuracy.
 The ways to prevent underfitting are stated below,

 Increase the model complexity


 Increase the number of features in the training data
 Remove noise from the data.

Best Linear Unbiased Estimator (BLUE) Property Assumptions:

In simple linear regression or multiple linear regression, we make some basic assumptions on the error term.

Assumptions:

1. Error has zero mean


2. Error has constant variance
3. Errors are uncorrelated
4. Errors are normally distributed

Under the first three assumptions (the Gauss-Markov conditions), the least squares estimator is the Best Linear Unbiased Estimator (BLUE): among all linear unbiased estimators of the coefficients, it has the smallest variance. The normality assumption is additionally used for confidence intervals and hypothesis tests.

Least Squares Estimation:

 In practice, of course, we have a collection of observations but we do not know the values
of the coefficients β0,β1,…,βk
 These need to be estimated from the data.
 The least squares principle provides a way of choosing the coefficients effectively by
minimising the sum of the squared errors.
 That is, we choose the values of β0,β1,…,βk that minimise the sum of squared errors,

Σ εi² = Σ ( yi – β0 – β1x1,i – … – βkxk,i )²
 This is called least squares estimation because it gives the least value for the sum of
squared errors.
 Finding the best estimates of the coefficients is often called "fitting" the model to the data, as in the sketch below.
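A minimal sketch of least squares estimation with NumPy is given below; np.linalg.lstsq minimises the sum of squared errors directly, and the data is a made-up illustration.

import numpy as np

# Made-up data with two regressors
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0],
              [5.0, 5.0]])
y = np.array([3.1, 3.9, 7.2, 7.8, 10.1])

# Prepend a column of ones so the first coefficient plays the role of β0
X_design = np.column_stack([np.ones(len(X)), X])
beta, rss, rank, _ = np.linalg.lstsq(X_design, y, rcond=None)
print("least squares estimates (β0, β1, β2):", beta)
print("residual sum of squares:", rss)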


Variable Rationalization and Model Building:

 In most practical problems, the analyst has a rather large pool of possible candidate
regressors(variables), of which only a few are likely to be important.
 Finding an appropriate subset of regressors(variables) for the model is often called the
variable selection problem or Variable Rationalization.
 The basic steps for variable selection are as follows:
1. Specify the maximum model to be considered.
2. Specify a criterion for selecting a model.
3. Specify a strategy for selecting variables.
4. Conduct the specified analysis.
5. Evaluate the validity of the model chosen.

Step 1: Specifying the maximum Model: The maximum model is defined to be the largest
model (the one having the most predictor variables) considered at any point in the process of
model selection.
Step 2: Specifying a Criterion for Selecting a Model: There are several criteria that can be used to evaluate subset regression models, such as the F-test statistic, the coefficient of determination, the residual mean square, and Mallows' Cp statistic.

Step 3: Specifying a Strategy for Selecting Variables:

(a) All possible regression procedure: The all possible regression procedure requires that we
fit each possible regression equation associated with each possible combination of the k
independent variables.

(b) Backward Elimination Procedure: Backward elimination is one of several computer-based iterative variable-selection procedures. It begins with a model containing all the independent variables of interest. Then, at each step, the variable with the smallest F-statistic is deleted (if that F is not higher than the chosen cutoff level).

(c)Forward Selection Procedure: The procedure begins with the assumption that there are no
regressors in the model other than the intercept. Forward selection is a type of stepwise
regression which begins with an empty model and adds in variables one by one. In each forward
step, you add the one variable that gives the single best improvement to your model.

(d) Stepwise Regression Procedure: Stepwise regression is a modified version of forward selection that permits re-examination, at every step, of the variables incorporated in the model in previous steps. A variable that entered at an early stage may become superfluous at a later stage because of its relationship with other variables subsequently added to the model. (A selection sketch follows below.)
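As a rough illustration of forward selection and backward elimination, the sketch below uses scikit-learn's SequentialFeatureSelector; note that it scores candidate variables by cross-validation rather than the F-statistic described above, so it is an analogous procedure, not an exact implementation. The data is synthetic.

import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 8))                                     # 8 candidate regressors
y = 3 * X[:, 0] + 2 * X[:, 3] + rng.normal(scale=0.5, size=60)   # only 2 matter

for direction in ("forward", "backward"):
    selector = SequentialFeatureSelector(
        LinearRegression(), n_features_to_select=2, direction=direction
    )
    selector.fit(X, y)
    print(direction, "selected columns:", np.flatnonzero(selector.get_support()))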

Step 4: Conduct the specified analysis


Conduct the analysis using the strategies and criteria specified above, and choose the best model that fits the data.

Step 5: Evaluate the Validity of the model chosen:

Check the validity of the chosen model by using evaluation metrics such as R-squared, RMSE, RSS, and so on.

To select the best regression equation, carry out the following steps:

1) Fit the largest model possible to the data.

2) Perform a thorough analysis of this model.

3) Determine if a transformation of the response or some of the regressors is necessary.

4) Determine if the all possible regressions procedure is feasible.

5) Compare and contrast the best models recommended by each criterion.

6) Perform thorough analyses of the "best models" (usually three to five models).

7) Explore the need for further transformations.

8) Discuss with the subject-matter experts the relative advantages and disadvantages of the final set of models.

Logistic Regression- Model Theory:

 Logistic Regression is one of the most popular supervised learning techniques.
 It is used for predicting a categorical dependent variable using a given set of independent variables.
 Logistic regression predicts the output of a categorical dependent variable. Therefore the outcome must be a categorical or discrete value.
 It can be either Yes or No, 0 or 1, True or False, etc., but instead of giving the exact values 0 and 1, it gives probabilistic values which lie between 0 and 1.
 Logistic Regression is very similar to Linear Regression except in how they are used: Linear Regression is used for solving regression problems, whereas Logistic Regression is used for solving classification problems.
 In Logistic Regression, instead of fitting a regression line, we fit an "S"-shaped logistic function, which predicts two maximum values (0 or 1).
 The curve from the logistic function indicates the likelihood of something, such as whether cells are cancerous or not, or whether a mouse is obese or not based on its weight, etc.


Logistic Function (Sigmoid Function):


 The sigmoid function is a mathematical function used to map the predicted values to
probabilities.
 It maps any real value into another value within a range of 0 and 1.
 The value of the logistic regression must be between 0 and 1, which cannot go beyond this
limit, so it forms a curve like the "S" form. The S-form curve is called the Sigmoid function
or the logistic function.
 In logistic regression, we use the concept of a threshold value, which defines the probability of either 0 or 1: values above the threshold tend to 1, and values below the threshold tend to 0.

Assumptions for Logistic Regression:

 The dependent variable must be categorical in nature.
 The independent variables should not exhibit multicollinearity.

Logistic Regression Equation:

The Logistic regression equation can be obtained from the Linear Regression equation. The
mathematical steps to get Logistic Regression equations are given below:

 We know the equation of a straight line can be written as:

y = b0 + b1*x1 + b2*x2 + … + bn*xn

 In Logistic Regression y can be between 0 and 1 only, so let's divide the above equation by (1-y):

y/(1-y), which is 0 for y = 0 and infinity for y = 1


 But we need a range between -[infinity] and +[infinity]; taking the logarithm of the equation, it becomes:

log( y/(1-y) ) = b0 + b1*x1 + b2*x2 + … + bn*xn

 The standard logistic function is simply the inverse of the logit equation above. If we
solve for y from the logit equation, the formula of the logistic function is below:

y = 1/(1 + e^(-(b0 + b1*x1 + b2*x2 + … + bn*xn)))


where e is the base of the natural logarithms

 The logistic function is a type of sigmoid function.

sigmoid(h) = 1/(1 + e^(-h))


where h = b0 + b1*x1 + b2*x2 + … + bn*xn for logistic function.
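A minimal sketch of the sigmoid and its inverse (the logit) in Python:

import numpy as np

def sigmoid(h):
    # Maps any real value into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-h))

def logit(p):
    # Inverse of the sigmoid: the log odds of p
    return np.log(p / (1.0 - p))

h = np.array([-3.0, 0.0, 3.0])
p = sigmoid(h)
print(p)         # roughly [0.047, 0.5, 0.953]
print(logit(p))  # recovers h: [-3., 0., 3.]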

Model Fit Statistics - Maximum Likelihood Estimation (The Best Fit Model):
 Like linear regression, the logistic regression algorithm finds the best values of
coefficients (b0, b1, …,bn) to fit the training dataset.
 The standard way to determine the best fit for logistic regression is maximum likelihood estimation (MLE).
 In this estimation method, we use a likelihood function that measures how well a set of
parameters fit a sample of data.
 The parameter values that maximize the likelihood function are the maximum likelihood
estimates.
Model Construction:
Let’s see a simple example with the following dataset:

Observation #    Input x1    Binary Output y
0                0.5         0
1                1.0         0
2                0.65        0
3                0.75        1
4                1.2         1

With one input variable x1, the logistic regression formula becomes:


log(p/(1-p)) = w0 + w1*x1
or
p = 1/(1 + e^(-(w0 + w1*x1)))

Since y is binary with values 0 or 1, a Bernoulli random variable can be used to model its probability:

P(y=1) = p
P(y=0) = 1 – p

Or:

P(y) = (p^y)*(1-p)^(1-y)
with y being either 0 or 1

This distribution formula is only for a single observation.

How do we model the distribution of multiple observations like P(y0, y1, y2, y3, y4)?

Let’s assume these observations are mutually independent from each other. Then we can write
the joint distribution of the training dataset as:

P(y0, y1, y2, y3, y4) = P(y0) * P(y1) * P(y2) * P(y3) * P(y4)

To make it more specific, each observed y has a different probability of being 1.

Let’s assume P(yi = 1) = pi for i = 0,1,2,3,4. Then we can rewrite the formula as below:

P(y0) * P(y1) * P(y2) * P(y3) * P(y4) = p0^(y0)*(1-p0)^(1-y0) * p1^(y1)*(1-p1)^(1-y1) * … * p4^(y4)*(1-p4)^(1-y4)

We can calculate the p estimate for each observation based on the logistic function formula:

 p0 = 1/(1 + e^(-(w0 + w1*0.5)))
 p1 = 1/(1 + e^(-(w0 + w1*1.0)))
 p2 = 1/(1 + e^(-(w0 + w1*0.65)))
 p3 = 1/(1 + e^(-(w0 + w1*0.75)))
 p4 = 1/(1 + e^(-(w0 + w1*1.2)))

We also have the values of the output variable y:

 y0 = 0
 y1 = 0
 y2 = 0
 y3 = 1
 y4 = 1


Log Likelihood Function in statistics:

So we have all the p0 – p4 and y0 – y4 values from the training dataset.

Our likelihood becomes a function of the parameters w0 and w1:

L(w0, w1) = p0^(y0)*(1-p0)^(1-y0) * p1^(y1)*(1-p1)^(1-y1) * … * p4^(y4)*(1-p4)^(1-y4)

The goal is to choose the values of w0 and w1 that result in the maximum likelihood based on
the training dataset.

Note that it’s computationally more convenient to optimize the log-likelihood function. Since
the natural logarithm is a strictly increasing function, the same w0 and w1 values that maximize
L would also maximize l = log(L).

So in statistics, we often try to maximize the function below:

l(w0, w1) = log(L(w0, w1)) = y0*log(p0) + (1-y0)*log(1-p0) + y1*log(p1) + (1-y1)*log(1-p1) + … + (1-y4)*log(1-p4)

Cost Function (Cross Entropy Loss) :

In machine learning we prefer the idea of minimizing cost/loss functions, so we often define the cost function as the negative of the average log-likelihood.

cost function = – avg(l(w0, w1)) = – 1/5 * l(w0, w1) = – 1/5 * (y0*log(p0) + (1-y0)*log(1-p0) +
y1*log(p1) + (1-y1)*log(1-p1) + … + (1-y4)*log(1-p4))

This is also called the average of the cross entropy loss.

Maximizing the (log) likelihood is the same as minimizing the cross entropy loss function.
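The sketch below evaluates this cost for the five-observation dataset above at two candidate (w0, w1) values; the first candidate is an arbitrary illustration.

import numpy as np

x1 = np.array([0.5, 1.0, 0.65, 0.75, 1.2])
y  = np.array([0, 0, 0, 1, 1])

def cross_entropy(w0, w1):
    p = 1.0 / (1.0 + np.exp(-(w0 + w1 * x1)))  # pi for each observation
    log_likelihood = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    return -log_likelihood / len(y)            # negative average log-likelihood

print(cross_entropy(0.0, 1.0))        # an arbitrary starting guess
print(cross_entropy(-4.411, 4.759))   # lower cost at the MLE estimates quoted below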

Optimization Methods:

 Unlike OLS estimation for the linear regression, we don’t have a closed-form solution for
the MLE. But we do know that the cost function is convex, which means a local
minimum is also the global minimum.
 To minimize this cost function, Python libraries such as scikit-learn (sklearn) use
numerical methods similar to Gradient Descent. And since sklearn uses gradients to
minimize the cost function, it’s better to scale the input variables and/or use
regularization to make the algorithm more stable.
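For intuition, the following is a bare-bones gradient descent sketch for this cost; the learning rate and iteration count are illustrative assumptions, and real libraries use more sophisticated solvers.

import numpy as np

x1 = np.array([0.5, 1.0, 0.65, 0.75, 1.2])
y  = np.array([0, 0, 0, 1, 1])

w0, w1, lr = 0.0, 0.0, 0.5
for _ in range(50000):
    p = 1.0 / (1.0 + np.exp(-(w0 + w1 * x1)))
    # Gradient of the average cross entropy loss with respect to w0 and w1
    w0 -= lr * np.mean(p - y)
    w1 -= lr * np.mean((p - y) * x1)

print(w0, w1)  # should approach the MLE estimates of roughly -4.4 and 4.8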

Model Interpretations:

By using the Logistic Regression implementation in Python's sklearn, we can find that the best estimates are w0 = -4.411 and w1 = 4.759 for our example dataset.
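A sketch of that fit is below; note that sklearn's LogisticRegression regularizes by default, so penalty=None (supported in recent versions) is used here to approximate plain maximum likelihood.

import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.5], [1.0], [0.65], [0.75], [1.2]])  # the example dataset
y = np.array([0, 0, 0, 1, 1])

model = LogisticRegression(penalty=None).fit(X, y)
print("w0 =", model.intercept_[0])  # approximately -4.41
print("w1 =", model.coef_[0, 0])    # approximately  4.76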

We can plot the logistic regression with the sample dataset. As you can see, the output y only has
two values of 0 and 1, while the logistic function has an S shape.

We can also make some interpretations with the parameter w1.

Recall that we have:

log(odds of y=1) = log(p/(1-p)) = w0 + w1*x1


where p = P(y = 1)

Since w1 = 4.759, with a one-unit increase in x1, the log odds are expected to increase by 4.759.

How to use Logistic Regression Models to Predict?:

As mentioned earlier, we often use logistic regression models for predictions.

Given a new observation, how would we predict which class y = 0 or 1 it belongs to?

For example, say a new observation has input variable x1 = 0.9. By using the logistic regression equation estimated from MLE, we can calculate the probability p that it belongs to y = 1.

p = 1/(1 + e^(-(-4.411 + 4.759*0.9))) = 46.8%

If we use 50% as the threshold, we would predict that this observation is in class 0, since p <
50%.

Since the logistic function has an S shape, the larger x1 is, the more likely the observation belongs to class y = 1.


What’s the threshold of x1 for us to classify the observation as y = 1?

At the threshold of probability p=50%, the odds are p/(1-p) = 50%/50% = 1. So the log(odds) =
log(1) = 0.

Since log(odds) follows the linear equation, we have:

log(odds) = 0 = -4.411 + 4.759*x1

Solving for x1, we get 0.927. That’s the threshold of x1 for prediction, i.e., when x1 > 0.927, the
observation will be classified as y = 1.
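Both calculations can be checked numerically with the estimates quoted above:

import numpy as np

w0, w1 = -4.411, 4.759  # MLE estimates from the example above

# Probability that a new observation with x1 = 0.9 belongs to class y = 1
p = 1.0 / (1.0 + np.exp(-(w0 + w1 * 0.9)))
print(f"P(y=1 | x1=0.9) = {p:.3f}")  # about 0.468, so predict class 0

# Decision threshold: the x1 where log(odds) = 0, i.e. p = 0.5
print(f"x1 threshold = {-w0 / w1:.3f}")  # about 0.927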

Types of Logistic Regression: Logistic Regression can be classified into three types:

 Binomial: In binomial Logistic regression, there can be only two possible types of the
dependent variables, such as 0 or 1, Pass or Fail, etc.
 Multinomial: In multinomial Logistic regression, there can be 3 or more possible unordered types of the dependent variable, such as "cat", "dog", or "sheep".
 Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered types of the dependent variable, such as "low", "medium", or "high".

Applications of Logistic Regression:

 Fraud Detection
 Customer Churn Prediction
 Cancer diagnosis

Analytics applications to various Business Domains:

 Finance
BA is of utmost importance to the finance sector. Data Scientists are in high demand in
investment banking, portfolio management, financial planning, budgeting, forecasting,
etc.
For example: Companies these days have a large amount of financial data. The use of intelligent Business Analytics tools can help them use this data to determine product prices. Also, on the basis of historical information, Business Analysts can study the trends in the performance of a particular stock and advise the client on whether to retain or sell it.
 Marketing
Business Analytics helps in studying the buying patterns of consumers, analysing trends, identifying the target audience, employing advertising techniques that appeal to the consumers, forecasting supply requirements, etc.
For example: Use Business Analytics to gauge the effectiveness and impact of a
marketing strategy on the customers. Data can be used to build loyal customers by giving
them exactly what they want as per their specifications.


 HR Professionals
HR professionals can make use of data to find information about educational background
of high performing candidates, employee attrition rate, number of years of service of
employees, age, gender, etc. This information can play a pivotal role in the selection
procedure of a candidate.
For example: An HR manager can predict the employee retention rate on the basis of data provided by Business Analytics.
 CRM
Business Analytics helps one analyse the key performance indicators, which further helps in making decisions and strategies to boost the relationship with the consumers. The demographics, data about other socio-economic factors, purchasing patterns, lifestyle, etc., are of prime importance to the CRM department.
For example: The company wants to improve its service in a particular geographical
segment. With data analytics, one can predict the customer’s preferences in that particular
segment, what appeals to them, and accordingly improve relations with customers.
 Manufacturing
Business Analytics can help you in supply chain management, inventory management, measuring performance against targets, risk mitigation plans, improving efficiency on the basis of product data, etc.
For example: The manager wants information on the performance of a machine that has been used for the past 10 years. The historical data will help evaluate the performance of the machinery and decide whether the cost of maintaining the machine will exceed the cost of buying a new machine.
 Credit Card Companies
Credit card transactions of a customer can determine many factors: financial health, life
style, preferences of purchases, behavioral trends, etc.
For example: Credit card companies can help the retail sector by locating the target
audience. According to the transactions reports, retail companies can predict the choices
of the consumers, their spending pattern, preference over buying competitor’s products,
etc. This historical as well as real-time information helps them direct their marketing strategies in such a way that they hit the mark and reach the right audience.

*****

Dr. S Rao Chintalapudi
