Unit-3 Data Analysis

The document discusses regression concepts, including linear and logistic regression, and the assumptions necessary for effective model building. It covers various types of regression techniques, their applications in different fields, and the BLUE property assumptions for Ordinary Least Squares (OLS) estimators. Additionally, it provides examples of calculating the line of best fit using the Least Squares method and highlights the importance of understanding relationships between variables for prediction and analysis.

UNIT-3: Regression – Concepts, BLUE Property Assumptions, Least Squares Estimation, Variable Rationalization, and Model Building, etc.

Logistic Regression: Model Theory, Model Fit Statistics, Model Construction, Analytics Applications to Various Business Domains, etc.
Regression Concepts:
3.1 Regression: It is a supervised machine learning technique where the output variable is
continuous. It is a statistical technique used to model and analyze the relationship between a
dependent variable (the outcome you want to predict or explain) and one or more independent
variables (the predictors).
Regression analysis is a statistical process for estimating the relationships between a dependent
variable (also called the criterion or response variable) and one or more independent variables
(also called predictor variables).
 Regression describes how an independent variable is numerically related to the dependent
variable.
 Regression can be used for prediction, estimation and hypothesis testing, and modeling causal
relationships.
When is Regression chosen?
 A regression problem is one where the output variable is a real or continuous value, such as
“salary” or “weight”.
 Many different models can be used; the simplest is linear regression, which tries to fit the data
with the best hyperplane passing through the points.
 Mathematically a linear relationship represents a straight line when plotted as a graph.
 A non-linear relationship where the exponent of any variable is not equal to 1 creates a curve.
Ex: Predict sales of product, stock price, temperature, house price.
Related Concepts in Regression:
1. Dependent Variable (Response Variable): This is the variable you are trying to predict or
explain. It depends on other variables. For example, if you are predicting house prices, the
house price is the dependent variable.
2. Independent Variable (Predictor or Explanatory Variable): These are the variables that help
explain or predict the dependent variable. For example, in predicting house prices,
independent variables might include house size, location, number of bedrooms, etc.
3. Regression Line: In simple linear regression, the relationship between the independent and
dependent variables is represented by a straight line, called the regression line. It shows the
best possible fit for the data.
4. Regression Equation: In simple linear regression, the relationship between two variables is
described by this equation: y=β0+β1x+ ϵ
Where:
● y is the dependent variable (the one you're trying to predict).
● x is the independent variable (the predictor).
● β0 is the intercept (the value of y when x=0).
● β1 is the slope (how much y changes for each unit change in x).
● ϵ (epsilon) is the error term (the difference between the actual and predicted
values).
5. Residuals (Errors): Residuals represent the difference between the actual values of the
dependent variable and the values predicted by the regression model. They measure how
well the model fits the data.
6. R-squared (Coefficient of Determination): This statistic measures how well the independent
variables explain the variation in the dependent variable. It ranges from 0 to 1, where a
value closer to 1 indicates a better fit.
7. Multicollinearity: In multiple regressions, multicollinearity occurs when two or more
independent variables are highly correlated, making it difficult to determine their individual
effects on the dependent variable.
8. Overfitting: Overfitting happens when a model is too complex and fits the training data too
well, capturing noise instead of the underlying pattern. This leads to poor performance on
new, unseen data.
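To make the regression equation, residuals, and R-squared above concrete, here is a minimal NumPy sketch on a small made-up dataset (illustrative only, not part of the original worked material):

```python
# A minimal sketch: fit a simple regression line, then compute the
# residuals and R-squared described in concepts 4-6 above.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # independent variable (predictor)
y = np.array([2.1, 4.3, 6.2, 8.1, 10.3])  # dependent variable (response)

# Least-squares estimates of slope (beta1) and intercept (beta0)
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()

y_hat = beta0 + beta1 * x        # predicted values
residuals = y - y_hat            # errors: actual - predicted

# R-squared: proportion of variance in y explained by the model
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print(f"y_hat = {beta0:.2f} + {beta1:.2f}x, R^2 = {r_squared:.3f}")
```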
Types of Regression:
1. Linear Regression: The simplest form of regression, it assumes a linear relationship between
the dependent and independent variables. Changes in the independent variable lead to
proportional changes in the dependent variable.

i. Simple Linear Regression: Involves one independent variable.


This is the simplest form of linear regression, and it involves only one independent
variable and one dependent variable. The equation for simple linear regression is:
y=β0+β1X
where:
● Y is the dependent variable
● X is the independent variable
● β0 is the intercept
● β1 is the slope
ii. Multiple Linear Regression: Involves two or more independent variables.
This involves more than one independent variable and one dependent variable. The
equation for multiple linear regression is:
y = β0 + β1X1 + β2X2 + … + βnXn
where:
● Y is the dependent variable
● X1, X2, …, Xn are the independent variables
● β0 is the intercept
● β1, β2, …, βn are the slopes
● Best Fit Line: Our primary objective in linear regression is to locate the best-fit line, which
means the error between the predicted and actual values should be kept to a minimum; the
best-fit line is the one with the least error.
● The best Fit Line equation provides a straight line that represents the relationship between
the dependent and independent variables. The slope of the line indicates how much the
dependent variable changes for a unit change in the independent variable(s).

The goal of the algorithm is to find the best Fit Line equation that can predict the values based on
the independent variables.

Example: Simple Linear Regression


Suppose we have the following data points: (x1,y1),(x2,y2),…,(xn,yn)
We want to fit a line y = B0 + B1X. The LSE approach finds the values of B1 (slope) and B0 (intercept)
such that the sum of squared residuals is minimized, giving the line that best fits the data.
The slope of the line of best fit can be calculated from the formula:
B1 = Σ(X − xi)(Y − yi) / Σ(X − xi)²
where X and Y are the means of the x and y values.
Problem 1: Find the line of best fit for the following data points using the Least Square method:
(x,y) = (1,3), (2,4), (4,8), (6,10), (8,15).
Solution:
Here, we have x as the independent variable and y as the dependent variable. First, we calculate the
means of x and y values denoted by X and Y respectively.
X = (1+2+4+6+8)/5 = 4.2
Y = (3+4+8+10+15)/5 = 8
xi       yi       X - xi    Y - yi    (X - xi)(Y - yi)    (X - xi)²
1        3         3.2       5              16.0            10.24
2        4         2.2       4               8.8             4.84
4        8         0.2       0               0.0             0.04
6        10       -1.8      -2               3.6             3.24
8        15       -3.8      -7              26.6            14.44
Sum (Σ)            0         0              55.0            32.80

B1 = 55/32.8 = 1.68 (rounded to 2 decimal places)


Now, the intercept will be calculated from the formula:
B0 = Y - B1·X
B0 = 8 - 1.68 × 4.2 = 0.94
Thus, the equation of the line of best fit becomes Y = 0.94 + 1.68X.
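As a quick check of Problem 1, the same least-squares estimates can be reproduced with NumPy (an assumed tool, not part of the original solution):

```python
# Verify Problem 1 numerically.
import numpy as np

x = np.array([1, 2, 4, 6, 8], dtype=float)
y = np.array([3, 4, 8, 10, 15], dtype=float)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
print(b1, b0)   # ~1.677 and ~0.957; the notes round the slope to 1.68 first,
                # which is why they report an intercept of 0.94

# np.polyfit returns the same least-squares line (highest degree first)
slope, intercept = np.polyfit(x, y, 1)
print(slope, intercept)
```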
Problem 2: Find the line of best fit for the following data of heights and weights of students of a
school using the Least Square method:
● Height (in centimeters): [160, 162, 164, 166, 168]
● Weight (in kilograms): [52, 55, 57, 60, 61]
Solution:
Here, we denote Height as x (independent variable) and Weight as y (dependent variable). Now, we
calculate the means of x and y values denoted by X and Y respectively.
X = (160 + 162 + 164 + 166 + 168) / 5 = 164
Y = (52 + 55 + 57 + 60 + 61) / 5 = 57
xi       yi       X - xi    Y - yi    (X - xi)(Y - yi)    (X - xi)²
160      52        4         5              20               16
162      55        2         2               4                4
164      57        0         0               0                0
166      60       -2        -3               6                4
168      61       -4        -4              16               16
Sum (Σ)            0         0              46               40

Now, the slope of the line of best fit can be calculated from the formula as follows:

B1 = 46/40 = 1.15
Now, the intercept will be calculated from the formula as follows:
B0 = Y - B1X
B0 = 57 - 1.15 × 164 = -131.6
Thus, the equation of the line of best fit becomes y = -131.6 + 1.15X.
2. Logistic Regression: (Classification Technique) Used when the dependent variable is binary
or categorical (e.g., yes/no, 0/1). Despite its name, logistic regression is used for
classification, not regression. It estimates the probability that a given input belongs to a
certain class using a logistic function.
3. Polynomial Regression: A form of regression where the relationship between the
dependent and independent variables is modeled as an nth-degree polynomial. It helps
capture non-linear relationships between the variables.
4. Ridge and Lasso Regression: These are regularization techniques used to prevent overfitting
by adding penalty terms to the regression objective (a short code sketch contrasting the two appears after this list).
i. Ridge Regression: Shrinks the coefficients by adding a penalty proportional to the
square of their magnitude. It reduces large coefficients but doesn't set any to zero.
ii. Lasso Regression: Adds a penalty proportional to the absolute value of the
coefficients, which can lead to some coefficients being reduced to zero. This
effectively performs feature selection, removing irrelevant variables.
5. Stepwise Regression: (Dimensionality Reduction)A method used to select the most
significant variables for the model by adding or removing predictors based on their statistical
significance. It can be forward (adding variables) or backward (removing variables).
6. Non-Linear Regression: Used when the relationship between the dependent and
independent variables is not linear. It models more complex relationships using curves or
other non-linear functions.
7. Poisson Regression: Used when the dependent variable represents count data (e.g., the
number of occurrences of an event in a fixed time period).
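The sketch below (illustrative only, using synthetic data and arbitrary alpha values) contrasts ordinary least squares with the Ridge and Lasso variants described in item 4 above:

```python
# Compare OLS, Ridge, and Lasso coefficients on synthetic data where only
# the first two of five features actually influence the response.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=100)

for name, model in [("OLS", LinearRegression()),
                    ("Ridge", Ridge(alpha=1.0)),
                    ("Lasso", Lasso(alpha=0.1))]:
    model.fit(X, y)
    print(name, np.round(model.coef_, 3))

# Ridge shrinks all coefficients toward zero; Lasso typically drives the
# irrelevant coefficients to exactly zero, performing feature selection.
```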
Assumptions of Linear Regression:
For linear regression to give reliable results, some assumptions need to be met:
1. Linearity: The relationship between the independent and dependent variables should be
linear.
2. Independence: The observations should be independent of each other.
3. Homoscedasticity: The variance of the residuals should be constant across all levels of the
independent variables.
4. Normality: The residuals should be normally distributed.
5. No Multicollinearity: Independent variables should not be highly correlated with each
other.
Applications of Regression:
● Finance: Predicting stock prices, returns, or risk factors.
● Marketing: Estimating the effect of advertising on sales.
● Healthcare: Predicting patient outcomes based on health metrics.
● Economics: Modeling relationships between economic indicators (e.g., unemployment rates
and GDP).
● Social Sciences: Studying relationships between social behaviors and outcomes.
Errors:
Sum of all errors: ∑error = ∑(Y - Yhat), where each error is Actual - Predicted
Sum of the absolute values of all errors: ∑|error| = ∑|Y - Yhat|
Sum of the squares of all errors: ∑error² = ∑(Y - Yhat)²
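A tiny sketch (with assumed example values) showing how these three error aggregates differ in practice:

```python
# Compute the three error aggregates for a small set of predictions.
import numpy as np

y_actual = np.array([10.0, 12.0, 15.0, 18.0])
y_hat    = np.array([ 9.5, 12.5, 14.0, 18.5])

errors = y_actual - y_hat
print("Sum of errors:         ", errors.sum())           # positives and negatives cancel
print("Sum of absolute errors:", np.abs(errors).sum())
print("Sum of squared errors: ", (errors ** 2).sum())    # large errors weigh more
```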
Regression is a powerful statistical tool for understanding relationships between variables and
making predictions. Different types of regression techniques are used depending on the nature of
the data and the specific goal of the analysis. Proper use of regression can provide deep insights into
patterns and trends within datasets.
3.2 BLUE Property Assumption:
The BLUE property assumption is a foundational concept in regression analysis, especially in the
context of the Ordinary Least Squares (OLS) method. According to the Gauss-Markov Theorem,
under a set of specific assumptions, the OLS estimator is the Best Linear Unbiased Estimator (BLUE)
for estimating the coefficients in a linear regression model. Below, I will explain the components of
the BLUE property and the necessary assumptions in greater detail.
Components of the BLUE Property:
1. Best: The OLS estimator has the smallest possible variance among all linear and unbiased
estimators. In other words, OLS provides the most efficient estimates compared to other
unbiased methods. Lower variance means that the OLS estimates are more reliable because
they have less variability from sample to sample.
2. Linear: The OLS estimator is a linear combination of the observed values of the dependent
variable. This means that the estimated coefficients (β^0,β^1,…,β^k) are calculated by
applying linear operations (addition and scalar multiplication) to the observed data.
3. Unbiased: The OLS estimator, on average, provides the true value of the parameter being
estimated. This means that if we repeatedly draw random samples from the population and
apply the OLS method, the average of the estimated coefficients will equal the true
population parameters.
o Mathematically, this means that E(β^) = β, where E(β^) is the expected value of the
OLS estimator, and β is the true parameter value.
4. Estimator: The OLS method provides estimates of the unknown parameters (like β0,β1,…,βk
) in a regression model. These estimates are derived from the data and aim to describe the
relationship between the dependent and independent variables.
Example:
Imagine you want to estimate the average height of students in a class. You measure the
heights of 10
randomly chosen students:
1. Unbiased: If you were to repeat this process many times, the average height you calculate
should be close to the true average height of the entire class. This means your method
doesn’t
systematically overestimate or underestimate the average.
2. Linear: You calculate the average height by simply adding the heights of the students and
dividing by 10.
3. Best: Among all the ways to estimate the average height (like using weights or different
formulas), the simple average gives you the least variability in your estimates when repeated
multiple times, making it the best in terms of precision.
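The "unbiased" part of this example can be illustrated with a short simulation (a minimal sketch, not part of the original notes, with arbitrarily chosen true parameters): averaged over many random samples, the OLS slope estimate lands on the true slope used to generate the data.

```python
# Monte Carlo illustration of OLS unbiasedness.
import numpy as np

rng = np.random.default_rng(42)
true_b0, true_b1 = 2.0, 0.5
estimates = []

for _ in range(5000):                        # repeated random samples
    x = rng.uniform(0, 10, size=30)
    eps = rng.normal(0, 1, size=30)          # errors: mean 0, constant variance
    y = true_b0 + true_b1 * x + eps
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    estimates.append(b1)

print(np.mean(estimates))                    # ~0.5, matching the true slope
```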
Assumptions for the BLUE Property:
The OLS estimator is guaranteed to be BLUE if the following five assumptions (also called the Gauss-
Markov assumptions) hold true in a linear regression model:
1. Linearity of the model in parameters
The model must be linear in terms of the parameters (coefficients), though not necessarily in terms
of the variables themselves. The model can be written as:
Y = β0 + β1X1 + β2X2 + … + βkXk + ϵ
The simple linear regression form of the OLS model is Y = β0 + β1X + ϵ, where ϵ is the error term.
The above equation is based on the following assumptions:
a. Randomness of ϵ
b. The mean of ϵ is zero
c. The variance of ϵ is constant
d. ϵ is normally distributed
e. Errors ϵ of different observations are independent.
Here, Y is the dependent variable, X1, X2, …, Xk are the independent variables, β0, β1, …, βk are the
parameters to be estimated, and ϵ is the error term (the difference between the observed and
predicted values).
The linearity assumption ensures that the relationship between the dependent and independent
variables can be described using a straight line or a linear combination of the independent variables.
2. Exogeneity (Zero Conditional Mean of the Errors)
The error term (ϵ) must have an expected value of zero given any values of the independent
variables. Mathematically:
E(ϵ∣X)=0
This means that the error term is uncorrelated with the independent variables and there is no
omitted variable bias. It implies that the independent variables in the model are not systematically
related to the unobserved factors (the errors) that influence the dependent variable.
If this assumption is violated, the OLS estimators become biased, meaning the average of the
estimates does not equal the true population parameters.
3. Homoscedasticity (Constant Variance of Errors)
The variance of the error terms should remain constant across all levels of the independent
variables. Formally, this means:
Var(ϵi | X) = σ² for all i
This is called homoscedasticity, which means that the spread (variance) of the errors is the same no
matter what value of the independent variable is used.
If the error variance changes for different values of the independent variables (a condition called
heteroscedasticity), OLS estimators are still unbiased, but they are no longer efficient (i.e., they do
not have the minimum variance). This reduces the reliability of the OLS estimates.
Homoscedasticity vs Heteroscedasticity:

 The Assumption of homoscedasticity (meaning “same variance”) is central to linear


regression models. Homoscedasticity describes a situation in which the error term (that is,
the “noise” or random disturbance in the relationship between the independent variables
and the dependent variable) is the same across all values of the independent variables.
 Heteroscedasticity (the violation of homoscedasticity) is present when the size of the error
term differs across values of an independent variable.
 The impact of violating the assumption of homoscedasticity is a matter of degree,
increasing as heteroscedasticity increases.
 Homoscedasticity means “having the same scatter.” For it to exist in a set of data, the
points must all be about the same distance from the regression line.
 The opposite is heteroscedasticity (“different scatter”), where points are at widely varying
distances from the regression line.
4. No Autocorrelation (Independence of Errors):
The error terms for different observations should be uncorrelated with each other. Mathematically,
this means:
Cov(ϵi, ϵj) = 0 for i ≠ j
This is particularly important in time series data, where consecutive error terms can be correlated.
When the errors are correlated, OLS is no longer the best estimator because it doesn’t have the
minimum variance.
Autocorrelation often arises in data where time, location, or sequence plays a role (e.g., economic
data over time). If autocorrelation exists, alternative methods like Generalized Least Squares (GLS)
should be used.
5. No Perfect Multicollinearity:
The independent variables should not be perfectly correlated with each other. Perfect
multicollinearity occurs when one independent variable can be expressed as an exact linear
combination of other independent variables in the model. Mathematically:
Xj = c1X1 + c2X2 + ⋯ + ckXk
If perfect multicollinearity exists, OLS cannot uniquely estimate the regression coefficients because it
becomes impossible to separate the individual effects of the collinear variables on the dependent
variable.
If multicollinearity is high but not perfect, OLS estimates may still be valid, but they become
unreliable because the standard errors of the coefficients are inflated, making it difficult to assess
their statistical significance.
Simple Example:
Imagine you want to estimate the average score of students in a math test based on their study
hours.
1. Linearity: You assume that more study hours lead to higher test scores in a straight-line manner
(e.g., each additional hour consistently increases the score).
2. Unbiasedness: If you calculate the average score from several samples of students, your
average should reflect the true average score of all students.
3. Homoscedasticity: The spread of test scores should be similar, regardless of how many hours
students studied. For example, whether they studied 1 hour or 5 hours, the variability in scores
should be consistent.
4. Independence (No Autocorrelation): One student’s score shouldn’t affect another’s. If
one student does well, it shouldn’t influence the test scores of other students.
5. Normality of Errors: The differences between the observed scores and the predicted scores
(based on study hours) should follow a bell-shaped curve, especially if you have a small group.
When all these assumptions are met, you can confidently use methods like linear regression to
estimate the average score based on study hours, and you can say that your estimator is BLUE!
Understanding the Importance of Each Assumption:
1. Linearity ensures that the relationship can be modeled appropriately using a linear
approach.
2. Exogeneity guarantees that the estimates are unbiased, meaning the OLS method will, on
average, produce the correct results.
3. Homoscedasticity and No Autocorrelation are important for the efficiency of the OLS
estimators. When these assumptions hold, OLS provides estimates with the lowest variance,
making it the best estimator.
4. No Perfect Multicollinearity ensures that each variable’s contribution to the dependent
variable can be separated and identified.
When the assumptions hold, OLS gives the most reliable and precise estimates for linear regression
models, making it one of the most widely used techniques in statistical modeling.
3.3 Least Square Estimation (LSE) :The Least Squares Method is a statistical technique used to find
the best-fitting line for a set of data points by minimizing the squared differences between observed
and predicted values. When data is plotted on a scatter plot, this method helps derive a straight line
that represents the data well. The goal is to use this line to predict unknown values of the
dependent variable based on known values of the independent variable. This method minimizes the
sum of the squared vertical distances (residuals) between the actual and predicted values, resulting
in a regression line or line of best fit.

Key Ideas and Terminology:


1. Modeling Relationships:
o Linear Regression: The most common application involves linear regression, where
the relationship between the dependent variable (response) and one or more
independent variables (predictors) is modeled linearly.
o Non-linear Models: While least squares are traditionally associated with linear
models, it can also be adapted for non-linear relationships by transforming the
variables or using polynomial regression.

2. Minimization of Residuals: Denote the independent variable values as xi and the dependent
ones as yi. Residuals are the errors between the observed data and the values predicted by the
model. Mathematically, for a given set of observations (x1,y1),(x2,y2),…,(xn,yn) and predicted
values y^1, y^2, …, y^n, the residual for each data point is:
ri = yi − y^i, where yi is the observed value and y^i is the predicted value from the model.
o The method works by minimizing the following cost function:
∑ i=1 to n (yi − y^i)²
where yi are the observed values and y^i are the predicted values from the model.
3. Sum of Squared Residuals (SSR): LSE minimizes the sum of squared residuals, given by:
SSR = ∑ i=1 to n (yi − y^i)²
This ensures that both large positive and large negative residuals are penalized equally, as squaring
removes any negative signs.
To assess how well the model fits the data, various metrics can be used:
 R-squared: Indicates the proportion of variance in the dependent variable
explained by the model.
 Adjusted R-squared: Adjusts R-squared for the number of predictors, useful
for comparing models with different numbers of predictors.
 Residual Analysis: Examining residuals can help identify patterns or
deviations from assumptions.

4. Best Fit Line: In the case of a linear regression model, the goal is to find a line (or hyperplane
in higher dimensions) that best fits the data. The equation of the line in simple linear
regression is:
Y=β0+β1X+ϵ
Where:
o β1 is the slope of the line (coefficient).
o β0 is the intercept.
o LSE finds the values of β0 and β1 that minimize the sum of squared residuals between
the observed y values and the predicted y^ values from this line.
5. Estimating Parameters:
In simple linear regression, the estimates for the slope B1 and intercept B0 are found
using calculus by setting the derivative of the sum of squared residuals to zero. The result of
the least squares method gives the parameters (coefficients) of the regression model, which
show the strength and direction of the relationship between the predictors and the
response variable. The formulas are:
o Slope B1:
B1 = Σ(xi − X)(yi − Y) / Σ(xi − X)²
o Intercept B0:
B0 = Y − B1·X
where X and Y are the means of the x and y values.
For multiple linear regressions, the approach generalizes by minimizing the sum of squared
residuals for a model with multiple predictors.
6. Curve Fitting: LSE can also be applied to non-linear models where the relationship between
variables is more complex. For example, in polynomial regression, the goal is to fit a
polynomial curve to the data.
Use Least Squares Estimation:
● Minimizes Error: By minimizing the sum of squared residuals, LSE produces estimates that,
on average, lead to the smallest prediction errors for the given data.
● Simple and Efficient: LSE provides an algebraic solution to finding model parameters, which
is computationally efficient for many applications, especially linear regression.
● Widely Applicable: LSE can be used for linear, polynomial, and other forms of regression and
curve fitting in fields like economics, engineering, machine learning, and more.
Advantages and Limitations:
● Advantages:
o Minimizes error in predictions.
o Simple to implement and interpret.
o Provides the "best" linear fit under standard assumptions.
● Limitations:
o Sensitive to outliers, as squaring residuals gives more weight to large errors.
o Assumes linearity between the variables (though can be extended to non-linear
relationships with more complex models).
7. Applications:
o Least squares estimators are widely used across various fields, including economics,
biology, engineering, and social sciences, for tasks like predictive modeling, trend
analysis, and causal inference.
Example in Data Analysis
Consider a dataset where you want to analyze the relationship between advertising spend and sales.
You collect data on these variables and use least squares regression to model the relationship.
1. Model Creation: You create a linear model:
Sales=β0+β1Advertising Spend
Fitting the Model: By applying the least squares method, you estimate β0 (intercept) and β1 (slope).
2. Analysis: After fitting, you find that β1=2.5, indicating that for every additional dollar spent
on advertising, sales increase by $2.50.
3. Evaluation: You check the R-squared value to see how well your model explains the
variation in sales, and perform residual analysis to validate the assumptions.
The least squares estimator is a powerful tool in data analysis, enabling researchers and analysts to
model relationships, make predictions, and derive insights from data. Its simplicity, interpretability,
and effectiveness make it a cornerstone of statistical analysis in many fields.
Example Scenario
Suppose you want to understand the relationship between the number of hours studied and exam
scores for a group of students. Here’s a small dataset:
Hours Studied (X) Exam Score (Y)
1 50
2 55
3 65
4 70
5 80
Step 1: Model Structure
We assume a linear relationship between hours studied and exam scores:
Y=β0+β1X+ϵ
Where:
 Y is the exam score,
 Xis the hours studied,
 β0 is the intercept (score when no hours are studied),
 β1 is the slope (change in score per additional hour studied),
 ϵ is the error term.
Step 2: Calculate the Estimates
1. Mean Values:
o Mean of X (Hours Studied): Xˉ = (1+2+3+4+5)/5 = 3
o Mean of Y (Exam Score): Yˉ = (50+55+65+70+80)/5 = 320/5 = 64
2. Calculate β1 (Slope):
β1 = Σ(xi − Xˉ)(yi − Yˉ) / Σ(xi − Xˉ)²
Breaking this down:
Σ(xi − Xˉ)(yi − Yˉ) = (−2)(−14) + (−1)(−9) + (0)(1) + (1)(6) + (2)(16) = 28 + 9 + 0 + 6 + 32 = 75
Σ(xi − Xˉ)² = 4 + 1 + 0 + 1 + 4 = 10
Now we can compute β1:
β1 = 75/10 = 7.5
3. Calculate β0 (Intercept):
β0 = Yˉ − β1Xˉ = 64 − 7.5 × 3 = 64 − 22.5 = 41.5
Step 3: The Regression Equation
Now we have our regression equation:
Y = 41.5 + 7.5X
Step 4: Making Predictions
Using this model, you can predict the exam score for any number of hours studied. For example, if a
student studies for 4 hours:
Y = 41.5 + 7.5 × 4 = 41.5 + 30 = 71.5
In this example, we used the least squares method to fit a linear regression model, finding that for
each additional hour studied, the exam score increases by 7.5 points, with an intercept of 41.5. This
illustrates how the least squares estimator helps in understanding relationships in data.
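The worked example can be checked with a few lines of NumPy (an assumed tool, not part of the original solution):

```python
# Verify the hours-studied / exam-score example.
import numpy as np

hours  = np.array([1, 2, 3, 4, 5], dtype=float)
scores = np.array([50, 55, 65, 70, 80], dtype=float)

b1 = np.sum((hours - hours.mean()) * (scores - scores.mean())) / np.sum((hours - hours.mean()) ** 2)
b0 = scores.mean() - b1 * hours.mean()
print(b0, b1)          # 41.5 and 7.5
print(b0 + b1 * 4)     # predicted score for 4 hours studied: 71.5
```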
3.4 Variable Rationalization, and Model Building etc: These are both fundamental steps in data
analysis and machine learning that help create predictive or explanatory models.
3.4.1. Variable Rationalization: It is the process of selecting, refining, and transforming the
variables (features) used in a model to ensure they are relevant, meaningful, and not redundant.
 The data set may have a large number of attributes. But some of those attributes can be
irrelevant or redundant. The goal of Variable Rationalization is to improve the Data
Processing in an optimal way through attribute subset selection.
 This process is to find a minimum set of attributes such that dropping of those irrelevant
attributes does not much affect the utility of data and the cost of data analysis could be
reduced.
 Mining on a reduced data set also makes the discovered pattern easier to understand. As
part of data processing, we use the methods of attribute subset selection described below.

Key Steps in Variable Rationalization/ Feature Selection and Engineering Techniques:

● Feature Selection (e.g., stepwise forward selection): This involves choosing which variables (or
features) from the dataset should be included in the model. Some features may be
irrelevant or redundant and should be removed to avoid overfitting, where the model
performs well on training data but poorly on new data.
Common methods include statistical tests, correlation analysis, and algorithms like Recursive
Feature Elimination (RFE).
● Feature Engineering: This step involves creating new features or modifying existing ones to
better represent the relationships in the data. For example, you might create interaction
terms, polynomial features, or log transformations of variables that show non-linear
relationships.
● Feature Transformation OR Normalizing and Standardizing Variables: Modifying variables
to improve their distribution or align them with the model's assumptions. Examples include
scaling, normalization, encoding categorical variables, and creating interaction terms. To
improve model performance, especially for algorithms sensitive to scale (like gradient
descent), variables may need to be normalized (scaling to a 0–1 range) or standardized
(transforming to have a mean of 0 and standard deviation of 1).
● Handling Missing Values: Missing data can bias results or reduce model accuracy. Strategies
like mean/median imputation, forward/backward filling, or more advanced techniques like
Multiple Imputation by Chained Equations (MICE) are often used.
● Dealing with Multicollinearity: It occurs when two or more independent variables are highly
correlated, which can cause instability in the model. Identifying and eliminating or
combining highly correlated variables helps improve model stability.
● Dimensionality Reduction: Reducing the number of features to avoid overfitting and
improve computation. Techniques include PCA, Singular Value Decomposition (SVD), or
more advanced algorithms like t-SNE and UMAP.
Why Variable Rationalization Matters:
● Prevents Overfitting: Fewer, more relevant variables reduce the risk of the model fitting
noise in the data rather than the actual signal.
● Improves Interpretability: With fewer, well-chosen variables, it's easier to understand the
relationships between the variables and the outcome.
● Boosts Efficiency: Reducing the number of variables makes the model more computationally
efficient, particularly when dealing with large datasets.
Example Context
Objective: Predict if a customer will default on a loan (a classification problem with a binary target:
Default or No Default).
Dataset Variables:
1. Age: Customer’s age.
2. Income: Monthly income.
3. Employment Type: Type of employment (e.g., salaried, self-employed).
4. Credit Score: Customer’s credit rating.
5. Loan Amount: Amount of loan requested.
6. Education Level: Highest level of education.
7. Marital Status: Married, single, divorced, etc.
8. Number of Dependents: Number of dependents the customer has.
Example Variable Rationalization
1.1 Feature Selection
 Purpose: Remove irrelevant variables and keep only the ones that significantly impact
default prediction.
 Example: Through exploratory data analysis, you may find that Marital Status and Education
Level have low correlation with the target variable (default rate) and decide to exclude
them.
1.2 Feature Transformation
 Purpose: Make the data suitable for the model and improve performance.
 Example: Transform Income and Loan Amount using logarithmic scaling if they show a highly
skewed distribution. This can make the data more normalized, improving the model’s
accuracy.
1.3 Handling Missing Values
 Purpose: Ensure model consistency and avoid bias.
 Example: If Credit Score has some missing values, you could fill them with the median credit
score of customers with similar income levels.
1.4 Dealing with Multicollinearity
 Purpose: Reduce redundancy among variables.
 Example: If Income and Loan Amount are highly correlated, consider combining them into a
single feature, such as Debt-to-Income Ratio, to avoid issues with multicollinearity.
1.5 Dimensionality Reduction
 Purpose: Simplify the model and reduce overfitting risk.
 Example: If the dataset includes many detailed financial metrics, you could use Principal
Component Analysis (PCA) to combine them into a few principal components that capture
the essential information.
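The rationalization steps 1.1–1.5 above could look roughly like the following pandas/scikit-learn sketch. The DataFrame "loans", its column names, and the target column "Default" are all hypothetical placeholders, not from the original dataset description:

```python
# Hedged preprocessing sketch for the loan-default example.
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

loans = pd.read_csv("loans.csv")                       # hypothetical dataset

# 1.1 Feature selection: drop variables judged irrelevant during EDA
loans = loans.drop(columns=["MaritalStatus", "EducationLevel"])

# 1.2 Feature transformation: log-scale skewed monetary variables
loans["LogIncome"] = np.log1p(loans["Income"])
loans["LogLoanAmount"] = np.log1p(loans["LoanAmount"])

# 1.3 Missing values: fill CreditScore with the median
loans["CreditScore"] = loans["CreditScore"].fillna(loans["CreditScore"].median())

# 1.4 Multicollinearity: replace correlated Income/LoanAmount with one ratio
loans["DebtToIncome"] = loans["LoanAmount"] / loans["Income"]

# 1.5 Dimensionality reduction: compress remaining numeric features with PCA
#     (in practice, scale the features before applying PCA)
numeric = loans.select_dtypes(include="number").drop(columns=["Default"], errors="ignore")
components = PCA(n_components=3).fit_transform(numeric)
```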
3.4.2. Model Building: It creates the predictive model that uses these features to make accurate
predictions and offer insights. Together, they ensure the development of an efficient, interpretable,
and accurate model.
It is the process of creating a statistical or machine learning model that best explains or predicts the
relationships between variables in a dataset or an outcome. This involves selecting an appropriate
algorithm, training the model on data, evaluating its performance, and refining it.
Key Steps in Model Building:
1. Problem Definition
2. Hypothesis Generation/selecting a model type
3. Data Collection/training the model
4. Data Exploration/Transformation
5. Predictive Modeling
6. Model Deployment
1. Defining the Problem: Clearly define what you are trying to predict or explain, setting up target
variables and objectives.
The first step in constructing a model is to understand the industrial problem in a more
comprehensive way. To identify the purpose of the problem and the prediction target, we
must define the project objectives appropriately.
Therefore, to proceed with an analytical approach, we have to recognize the obstacles first.
Remember, excellent results always depend on a better understanding of the problem.
2. Hypothesis Generation: Hypothesis generation is the guessing approach through which we derive
some essential data parameters that have a significant correlation with the prediction target.
Your hypothesis research must be in-depth, taking the perspective of every stakeholder into
account. We search for every suitable factor that can influence the outcome.
Hypothesis generation focuses on what you can create rather than what is available in the dataset.
3. Data Collection: Data collection is gathering data from relevant sources regarding the analytical
problem, and then we extract meaningful insights from the data for prediction.

The data gathered must have:


 Proficiency in answering hypothesis questions.
 Capacity to elaborate on every data parameter.
 Effectiveness to justify your research.
 Competency to predict outcomes accurately.

4. Data Exploration/Transformation
 The data you collected may be in unfamiliar shapes and sizes. It may contain unnecessary
features, null values, unanticipated small values, or immense values. So, before applying any
algorithmic model to data, we have to explore it first.
 By inspecting the data, we get to understand the explicit and hidden trends in data. We find
the relation between data features and the target variable.
 Usually, a data scientist invests 60–70% of project time in data exploration
alone.

There are several sub steps involved in data exploration:


Feature Identification:
You need to analyze which data features are available and which ones are not.
Identify independent and target variables.
Identify data types and categories of these variables.
Univariate Analysis: We inspect each variable one by one. This kind of analysis depends on the variable type
whether it is categorical and continuous.
Continuous variable: We mainly look for statistical trends like mean, median, standard deviation, skewness,
and many more in the dataset.
Categorical variable: We use a frequency table to understand the spread of data for each category. We can
measure the counts and frequency of occurrence of values.
Multivariate Analysis: Bivariate and multivariate analysis helps to discover the relation between two or more variables.
We can find the correlation in the case of continuous variables; in the case of categorical variables, we look for
association and dissociation between them.
Filling Null Values: Usually, the dataset contains null values, which lower the potential of the model. For a
continuous variable, we fill these null values using the mean or median of that specific column. For the
null values present in a categorical column, we replace them with the most frequently occurring categorical
value. Remember: don’t delete those rows, because you may lose information.
5. Predictive Modeling: It is a mathematical approach to create a statistical model to forecast future behavior
based on input test data. Steps involved in predictive modeling:
Algorithm Selection: When we have the structured dataset, and we want to estimate the continuous or
categorical outcome then we use supervised machine learning methodologies like regression and classification
techniques. When we have unstructured data and want to predict the clusters of items to which a particular
input test sample belongs, we use unsupervised algorithms. In practice, a data scientist applies multiple
algorithms to get a more accurate model.
Selecting a Model Type: Depending on the nature of the problem (e.g., regression, classification, clustering),
different models are suited for different tasks.
o Linear Models: For predicting a continuous outcome using a linear combination of features,
you might use Linear Regression.
o Logistic Models: For binary or categorical outcomes, Logistic Regression is a common choice.
o Decision Trees, Random Forests, and Gradient Boosting: For both classification and
regression tasks, tree-based models are popular because they can capture complex, non-linear
relationships.
o Support Vector Machines (SVM) and Neural Networks: These can handle more complex, high-
dimensional data but may require more tuning and computational resources
Train Model: After selecting the algorithm and preparing the data, we train the model on the input data using
the preferred algorithm. Training determines the correspondence between the independent variables and the
prediction targets.
Once a model type is selected, it is trained on the dataset by adjusting the model's parameters to minimize
error (e.g., minimizing the sum of squared residuals in regression).
● Evaluating the Model: After training, the model’s performance must be evaluated on unseen data
(validation set or test set). Common evaluation metrics include:
o For Regression: Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute
Error (MAE), R-squared.
o For Classification: Accuracy, Precision, Recall, F1-score, ROC-AUC.
● Cross-Validation: This involves splitting the data into multiple training and validation sets to assess the
model’s generalizability. It helps prevent overfitting by ensuring the model is not tailored to one
particular split of the data.
● Hyperparameter Tuning: Many models have hyperparameters (e.g., learning rate, depth of trees)
that need to be set before training. Techniques like Grid Search and Random Search are used to find
the best combination of hyperparameters.
Model Prediction: We make predictions by giving the input test data to the trained model. We measure the
accuracy using a cross-validation strategy or an ROC curve, which work well for assessing model performance on
test data.
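The evaluation bullets above (cross-validation and hyperparameter tuning) might look like the following scikit-learn sketch; the synthetic data and the parameter grid are illustrative assumptions, not part of the original notes:

```python
# Cross-validation and grid-search hyperparameter tuning on synthetic data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Cross-validation: estimate generalization performance across 5 splits
model = RandomForestClassifier(random_state=0)
print(cross_val_score(model, X, y, cv=5, scoring="accuracy").mean())

# Hyperparameter tuning: grid search over tree depth and number of trees
grid = GridSearchCV(model,
                    param_grid={"max_depth": [3, 5, None],
                                "n_estimators": [100, 300]},
                    cv=5, scoring="roc_auc")
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```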
6. Model Deployment: There is nothing better than deploying the model in a real-time environment. It helps
us to gain analytical insights into the decision-making procedure. You constantly need to update the model
with additional features for customer satisfaction.

7. Model Refinement and Iteration: Based on evaluation, the model is often refined by adjusting hyper
parameters, adding or removing features, or even switching to a different modeling technique if needed

To predict business decisions, plan market strategies, and create personalized customer recommendations, we integrate
the machine learning model into the existing production domain. When you go through the Amazon website, for example,
the product recommendations you see are based on your interests, and these services visibly increase customer
engagement. That’s how a deployed model shapes the mindset of the customer and encourages a purchase.
Key Takeaways

SUMMARY OF DA MODEL LIFE CYCLE:


 Understand the purpose of the business analytical problem.
 Generate hypotheses before looking at data.
 Collect reliable data from well-known resources.
 Invest most of the time in data exploration to extract meaningful insights from the data.
 Choose a suitable algorithm to train the model and use test data to evaluate it.
 Deploy the model into the production environment so it will be available to users and strategize to
make business decisions effectively.
Example: Creating a statistical model to predict the outcome (employee performance) based on the selected
variables.
Building the Model:
1. Select the Model Type: You choose a multiple linear regression model since you have multiple
predictors.
The model can be expressed as:
Performance Score=b0+b1×Years of Experience+b2×Education Level+b3×Training Programs+b4×Attendance
Rate
Collect Data: Gather data on employee performance scores along with their years of experience, education
levels, training attended, and attendance rates.
2. Fit the Model: Use statistical software to fit the model to your data, estimating the coefficients (b0,
b1, b2, b3, b4).
3. Evaluate the Model: Assess how well the model predicts performance using metrics like R-squared,
which indicates how much variance in performance scores is explained by the model.
4. Make Predictions: Use the fitted model to predict the performance scores for new employees based
on their characteristics.
In this example, variable rationalization helped identify important factors influencing employee performance,
while model building created a framework to make predictions based on those factors. This systematic
approach ensures your predictions are based on relevant data relationships!
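A rough sketch of fitting and using the performance model above is shown below; the DataFrame "employees", its column names, and the new-employee values are hypothetical placeholders:

```python
# Fit the multiple linear regression for employee performance and predict.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

employees = pd.read_csv("employees.csv")               # hypothetical data

X = employees[["YearsExperience", "EducationLevel",
               "TrainingPrograms", "AttendanceRate"]]
y = employees["PerformanceScore"]

model = LinearRegression().fit(X, y)                   # estimates b0..b4
print(model.intercept_, model.coef_)

# Evaluate: R-squared on the training data (a held-out split is better)
print(r2_score(y, model.predict(X)))

# Predict the performance score for a new employee (hypothetical values)
new_employee = pd.DataFrame([{"YearsExperience": 5, "EducationLevel": 3,
                              "TrainingPrograms": 2, "AttendanceRate": 0.95}])
print(model.predict(new_employee))
```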
Why Model Building is Important:
● Predictions and Insights: A well-built model can provide accurate predictions on new, unseen data,
helping with decision-making in various fields like finance, healthcare, and marketing.
● Explanatory Power: Some models, like linear regression, allow for interpretation of the relationship
between independent variables and the outcome, offering insights into the data.
● Automation: A good model can automate prediction tasks, enabling large-scale applications like real-
time recommendations, fraud detection, or inventory management.

3.5 Logistic Regression: Model Theory


Logistic regression is a type of parametric classification model which is used where the response variable is
of categorical type. The basic idea behind logistic regression is to find a relationship between the features and the
probability of a particular outcome.
Logistic regression is used for binary classification, where we use the sigmoid function, which takes the
independent variables as input and produces a probability value between 0 and 1.
For example, with two classes, Class 0 and Class 1, if the value of the logistic function for an input is greater
than 0.5 (the threshold value) then it belongs to Class 1; otherwise it belongs to Class 0. It is referred to as
regression because it is an extension of linear regression but is mainly used for classification problems.
● Logistic regression predicts the output of a categorical dependent variable. Therefore, the outcome
must be a categorical or discrete value.
● It can be either Yes or No, 0 or 1, true or False, etc. but instead of giving the exact value as 0 and 1, it
gives the probabilistic values which lie between 0 and 1.
● In Logistic regression, instead of fitting a regression line, we fit an “S” shaped logistic function, which
predicts two maximum values (0 or 1).

In medicine, for example, a frequent application is to find out which variables have an influence on a disease.
In this case, 0 could stand for not diseased and 1 for diseased. Subsequently, the influence of age, gender and
smoking status (smoker or not) on this particular disease could be examined.

Business example: For an online retailer, you need to predict which product a particular customer is most
likely to buy. For this, you receive a data set with past visitors and their purchases from the online retailer.
Medical example: You want to investigate whether a person is susceptible to a certain disease or not. For this
purpose, you receive a data set with diseased and non-diseased persons as well as other medical parameters.
Political example: Would a person vote for party A if there were elections next weekend?
Types of Logistic Regression
On the basis of the categories, Logistic Regression can be classified into three types:
Logistic regression comes in different forms, which are used depending on the structure of the target variable
and the nature of the classification task. Here are the primary types of logistic regression:
1. Binary Logistic Regression
This is the most common type of logistic regression, where the target variable has two possible outcomes
(also known as binary classification). Examples include:
 Predicting if an email is spam or not spam.
 Determining if a customer will buy a product (yes or no).
 Classifying a loan application as approved or denied.
Equation:
P(Y=1) = 1 / (1 + e^-(β0 + β1X1 + β2X2 + … + βnXn))
Where P(Y=1) represents the probability of the outcome being the positive class.
2. Multinomial Logistic Regression
In multinomial logistic regression, the target variable has more than two categories that are not ordered. This
method is used for multi-class classification problems. For instance:
 Classifying types of food (e.g., fruit, vegetable, dairy).
 Predicting the weather (e.g., sunny, rainy, cloudy).
 Categorizing the severity of a disease (e.g., mild, moderate, severe).
Multinomial logistic regression typically uses a one-vs-rest (OvR) approach, where separate binary logistic
regressions are run for each category compared to all other categories. Alternatively, some algorithms solve
this problem directly with a softmax function that assigns probabilities across multiple classes.
Equation:
P(Y=k) = e^(zk) / Σ j=1 to K e^(zj), where zk = β0k + β1kX1 + … + βnkXn
Where P(Y=k) is the probability of the target being in class k and K is the total number of classes.
3. Ordinal Logistic Regression
This type of logistic regression is used when the target variable has more than two categories that are
ordered. For instance:
 Customer satisfaction ratings (e.g., dissatisfied, neutral, satisfied).
 Education levels (e.g., high school, bachelor’s, master’s, PhD).
 Disease severity levels (e.g., low, medium, high).
In ordinal logistic regression, the model assumes that while there are distinct categories, they follow a
meaningful order. The model estimates probabilities for each category, considering the order of categories.
Equation:
ln[ P(Y ≤ k) / (1 − P(Y ≤ k)) ] = αk − (β1X1 + β2X2 + … + βnXn)
Where αk is a threshold specific to the k-th category, and P(Y ≤ k) gives the cumulative probability up to
category k.
Logistic Regression Types
Type                              Target Variable                   Example
Binary Logistic Regression        Two categories                    Spam vs. Not Spam
Multinomial Logistic Regression   Multiple categories, unordered    Fruit, Vegetable, Dairy
Ordinal Logistic Regression       Multiple categories, ordered      Low, Medium, High satisfaction level
Assumptions of Logistic Regression
Logistic regression relies on several key assumptions to ensure accurate predictions and valid model
performance. These assumptions guide the correct application of logistic regression and, if unmet, can lead to
biased or unreliable results.
Here are the main assumptions of logistic regression:
1. Binary or Ordinal Outcome (for Binary Logistic Regression)
 Binary Logistic Regression requires a binary (dichotomous) outcome variable (e.g., success/failure,
yes/no).
 For multinomial or ordinal logistic regression, the outcome variable can have multiple categories but
must still be categorical.
2. Linearity of the Logit
 Logistic regression assumes a linear relationship between the logit of the outcome (the log-odds) and
each predictor variable:
logit(p) = ln( p / (1 − p) ) = β0 + β1X1 + β2X2 + … + βnXn
This means that, while logistic regression does not require a linear relationship between predictors and the
probability of the outcome itself, the log-odds of the outcome must have a linear relationship with predictor
variables.
3. No Multicollinearity Among Predictors
 Predictor variables should not be highly correlated with each other (multicollinearity), as this can
make the estimated coefficients unreliable and lead to high standard errors.
 Techniques such as variance inflation factor (VIF) or correlation matrices can be used to check for
multicollinearity.
4. Independence of Observations
 Each observation should be independent of all others, meaning there should be no clustering or
correlation between observations.
This assumption can be violated in cases where data is collected in groups, such as repeated measures on the
same individual. In these cases, other techniques like mixed models might be more appropriate
5. No Strong Outliers Influencing the Model
 While logistic regression is less sensitive to outliers than linear regression, extreme outliers in the
predictor variables can still disproportionately impact the model.
 Checking for outliers and, if needed, using techniques like robust regression or data transformation
can help minimize their impact.
6. Large Sample Size
 Logistic regression works best with a large sample size because it estimates probabilities based on
observed data.
 Having enough observations in each category of the outcome variable is important to ensure stable
and reliable estimates, especially for rare events.
7. No Perfect Separation
 Logistic regression assumes that no predictor variable or combination of predictors perfectly predicts
the outcome.
 Perfect separation occurs when one or more predictors can perfectly distinguish the categories of the
outcome variable, leading to infinite or undefined coefficients. Regularization methods or different
algorithms may be needed if perfect separation is present.
Logistic Regression Assumptions
Assumption                      Description
Binary or Ordinal Outcome       Binary (or categorical for multinomial/ordinal logistic regression)
Linearity of the Logit          Linear relationship between predictors and the logit (log-odds) of the outcome
No Multicollinearity            Predictors should not be highly correlated with each other
Independence of Observations    Each observation should be independent of others
No Strong Outliers              Outliers in predictors should not disproportionately influence the model
Large Sample Size               Sufficient sample size, especially for each category of the binary outcome
No Perfect Separation           Predictors should not perfectly predict the outcome

Terminologies involved in Logistic Regression: Here are some common terms involved in logistic regression:
● Independent variables: The input characteristics or predictor factors applied to the dependent
variable’s predictions.
● Dependent variable: The target variable in a logistic regression model, which we are trying to predict.
● Logistic function (sigmoid): The formula used to represent how the independent and dependent
variables relate to one another. The logistic function transforms the input variables into a probability
value between 0 and 1, which represents the likelihood of the dependent variable being 1 or 0.
● Odds: The odds of an outcome are p/(1 − p), where p is the probability of the outcome. The odds ratio (OR)
quantifies the change in odds of the outcome for a one-unit increase in a predictor variable, holding all other
variables constant. For predictor xi it is derived as:
OR = e^(βi)
An OR greater than 1 implies a positive association with the outcome, while an OR less than 1 implies a
negative association.
● Log-odds: The log-odds, also known as the logit function, is the natural logarithm of the odds. In
logistic regression, the log-odds of the dependent variable are modeled as a linear combination of the
independent variables and the intercept.
● Coefficient: The logistic regression model’s estimated parameters show how the independent and
dependent variables relate to one another.
Coefficients β0,β1,…,βn represent the impact of each predictor variable on the log-odds of the
outcome.
β0 is the intercept term, the log-odds when all predictors are zero.
βi (for i=1,2,…,n) measures the change in the log-odds of the outcome for a one-unit increase in the
predictor variable xi.
● Intercept: A constant term in the logistic regression model, which represents the log odds when all
independent variables are equal to zero.
● Maximum likelihood estimation: The method used to estimate the coefficients of the logistic
regression model, which maximizes the likelihood of observing the data given the model.
● Confusion Matrix: It is a table used to evaluate the performance of a binary classifier by displaying
the counts of true positives, true negatives, false positives, and false negatives.
From this matrix, various performance metrics, like accuracy, precision, recall, and F1-score, can be
calculated.
● Multicollinearity: It occurs when predictor variables are highly correlated with each other, which can
make the model coefficients unstable and inflate the standard errors.
● Logistic regression assumes no high multicollinearity among predictors, and methods like Variance
Inflation Factor (VIF) can be used to detect it.
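Several of the terms above (coefficients, odds ratios, the 0.5 threshold, and the confusion matrix) come together in this minimal scikit-learn sketch on synthetic data (illustrative only, not from the original notes):

```python
# Fit a logistic regression, read coefficients as odds ratios, and build a
# confusion matrix on held-out data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=4, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

clf = LogisticRegression().fit(X_train, y_train)   # fitted by maximum likelihood

print("Intercept (log-odds):", clf.intercept_)
print("Odds ratios:", np.exp(clf.coef_))           # exponentiated coefficients

y_pred = clf.predict(X_test)                       # default threshold of 0.5
print(confusion_matrix(y_test, y_pred))            # counts of TN, FP, FN, TP
print("Accuracy:", accuracy_score(y_test, y_pred))
```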
3.5.2 Model Theory:
The logistic regression model theory provides a mathematical framework for predicting binary outcomes (e.g.,
yes/no, success/failure) based on one or more predictor variables. Unlike linear regression, which models
continuous outcomes, logistic regression focuses on modeling the probability of a binary response variable.
1. The Problem with Linear Regression for Binary Outcomes
In binary classification, the outcome variable Y can only take two values (0 or 1). A linear regression model
can’t be directly applied here because:
 The predictions (using a linear model) could fall outside the [0,1] range, which doesn’t make sense for
probabilities.
 Linear regression assumes a linear relationship between predictors and the outcome, which doesn’t fit
well for binary responses.
 In linear regression, the independent variables (e.g., age and gender) are used to estimate the specific
value of the dependent variable (e.g., body weight).
In logistic regression, on the other hand, the dependent variable is dichotomous (0 or 1) and the probability
that outcome 1 occurs is estimated.
Logistic Regression Equation:
•The logistic regression equation can be obtained from the linear regression equation. The mathematical
steps to get the logistic regression equation are given below:
•Instead of using the linear function directly, logistic regression passes it through the ‘sigmoid function’,
also known as the ‘logistic function’.
•The hypothesis of logistic regression limits its output to values between 0 and 1. Linear functions fail to
represent this, as they can produce values greater than 1 or less than 0, which is not possible as per the
hypothesis of logistic regression.

--- Logistic Regression Hypothesis Expectation

Logistic Function (Sigmoid Function):


•The sigmoid function is a mathematical function used to map the predicted values to probabilities:
σ(z) = 1 / (1 + e^-z)
•The sigmoid function maps any real value into another value within a range of 0 and 1, and so forms an
S-shaped curve.
•The value of the logistic regression output must be between 0 and 1, which cannot go beyond this limit,
so it forms a curve like the "S" form.
Fig: Sigmoid Function Graph (the S-shaped curve of the logistic function)

The Sigmoid function can be interpreted as a probability indicating to a Class-1 or Class-0


So the Regression model makes the following predictions as
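A minimal Python sketch (not part of the original notes; NumPy assumed available) of the sigmoid mapping and the 0.5 threshold rule described above:

```python
# A minimal sketch: the sigmoid squashes any real-valued score z into (0, 1),
# and a 0.5 threshold turns that probability into a class label.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_class(z, threshold=0.5):
    return int(sigmoid(z) >= threshold)

for z in (-4, 0, 2):
    print(z, round(float(sigmoid(z)), 3), predict_class(z))
# -4 -> 0.018 (class 0), 0 -> 0.5 (class 1), 2 -> 0.881 (class 1)
```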

iv. Hypothesis Representation or Linear Combination of Inputs: Logistic regression starts with a linear combination of the input features, so the logistic function is well suited to describe the probability P(y=1).
•When using linear regression, we used the line equation:

    y = b0 + b1x1 + b2x2 + ... + bnxn

•In the above equation y is the response variable, x1, x2, ..., xn are the predictor variables, and b0, b1, b2, ..., bn are the coefficients, which are numeric constants.
•For logistic regression, we need the maximum likelihood hypothesis.
•Applying the sigmoid function to y gives:

    P(y=1) = 1 / (1 + e^(−(b0 + b1x1 + b2x2 + ... + bnxn)))
Example: To calculate the probability of a person being sick or not using logistic regression, the model parameters b1, b2, b3 and the intercept a must first be determined. Once these have been determined, the equation takes the form:

    P(sick) = 1 / (1 + e^(−(a + b1x1 + b2x2 + b3x3)))
v. Classification Decision:
 Once we have a probability, we can classify the observation based on a threshold (usually 0.5):
o If P≥0.5 predict the positive class (1).
o If P<0.5 predict the negative class (0).
 This threshold can be adjusted depending on the specific application, especially when dealing with
imbalanced datasets or varying costs of false positives vs. false negatives.
vi. Maximum Likelihood Method:
Logistic regression uses Maximum Likelihood Estimation (MLE) to find the best coefficients β that
maximize the likelihood of observing the data we have.
MLE adjusts the coefficients so the model’s predicted probabilities best match the actual outcomes in the
training data.
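As an illustrative sketch (not from the original notes; NumPy assumed available), the log-likelihood that MLE maximizes can be written and evaluated directly. Coefficients that fit the data better give a higher (less negative) log-likelihood:

```python
# The quantity MLE maximizes: log L = sum_i [ y_i*log(p_i) + (1 - y_i)*log(1 - p_i) ],
# where p_i is the predicted probability for observation i.
import numpy as np

def log_likelihood(beta, X, y):
    """beta: coefficient vector (intercept first); X: matrix with a leading column of 1s."""
    p = 1.0 / (1.0 + np.exp(-X @ beta))          # predicted probabilities
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Toy data: a coefficient vector closer to the true pattern scores higher.
X = np.array([[1, 0.], [1, 1.], [1, 2.], [1, 3.]])   # intercept + one feature
y = np.array([0, 0, 1, 1])
print(log_likelihood(np.array([0.0, 0.0]), X, y))    # baseline guess, about -2.77
print(log_likelihood(np.array([-3.0, 2.0]), X, y))   # better fit, about -0.72
```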
Example: Predicting Admission to a College
Suppose a college wants to predict whether an applicant will be admitted (Yes = 1) or not admitted (No = 0)
based on their entrance exam score. We can model this as a binary classification problem using logistic
regression.
Step 1: Setting Up the Logistic Model
1. Input Variable (Feature):
o We have a single feature: Exam Score (let’s denote it as X).
2. Outcome (Target Variable):
o The outcome Y is binary:
 1 if the applicant is admitted,
 0 if the applicant is not admitted.
3. Model Equation:
o In logistic regression, we estimate the probability P that the applicant is admitted (i.e., Y=1)
given their exam score X.
o Logistic regression uses a linear combination of the input, transformed by the sigmoid function
to keep the result between 0 and 1.
The logit (log-odds) equation is:
z=β0+β1⋅ X
where:
o β0 is the intercept (a constant),
o β1 is the coefficient for the exam score.
4. Sigmoid Function:
o We use the sigmoid function to convert the logit z into a probability:
P(Y=1 | X) = σ(z) = 1 / (1 + e^(−z))
Step 2: Training the Model


1. Using historical data, the college estimates the values of β0 and β1 using a technique called Maximum Likelihood Estimation (MLE).
2. The result is a fitted logistic regression model with specific values for β0 and β1 that best match the observed outcomes in the training data.
Suppose after training, we get:
β0=−6, β1=0.1
So, our model equation becomes:
z=−6+0.1⋅X
Step 3: Making Predictions
Using the fitted model, we can predict the probability of admission for a given exam score. Let’s go through an
example calculation.
1. Example Prediction:
o Exam Score (X): 80
o Substitute X=80 into the model equation:
z=−6+0.1×80=−6+8=2
2. Convert Logit to Probability:
o Now, we use the sigmoid function to get the probability of admission:
P = 1 / (1 + e^(−2)) ≈ 0.88
o This gives roughly an 88% probability that an applicant with an exam score of 80 will be admitted.
3. Classification Decision:
o If we set the threshold at 0.5, we would classify this applicant as likely to be admitted (since
0.88 > 0.5).
o If an applicant had a lower score, say 50, the probability would be lower, and the classification
might switch to "not admitted."
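A short Python sketch of this calculation, using the fitted values β0 = −6 and β1 = 0.1 assumed in the example above:

```python
# Worked admission example: probability of admission for two exam scores.
import math

b0, b1 = -6.0, 0.1          # values assumed in the example above

def admission_probability(score):
    z = b0 + b1 * score                      # logit (log-odds)
    return 1.0 / (1.0 + math.exp(-z))        # sigmoid

for score in (80, 50):
    p = admission_probability(score)
    label = "admitted" if p >= 0.5 else "not admitted"
    print(score, round(p, 2), label)
# 80 -> 0.88 -> admitted; 50 -> 0.27 -> not admitted
```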
3.6. Model Fit Statistics: In logistic regression, model fit statistics help evaluate how well the model explains the relationship between the independent variables and a binary (yes/no, 0/1) outcome. These statistics assess the model's quality, accuracy, and predictive performance, focusing on its ability to classify outcomes correctly and fit the data well. Here are the key fit statistics commonly used in logistic regression:
1. Log-Likelihood: The log-likelihood measures how probable the observed data are under the model, and logistic regression chooses the parameters that maximize it. The likelihood ratio test compares the fit of the logistic regression model to a baseline model (usually the null model, which includes only the intercept). The deviance statistic is derived from the likelihood of the data given the model: the null deviance (deviance of the null model) is compared to the residual deviance (deviance of the fitted model).
Likelihood Ratio Test Statistic formula:

    LR = −2 × (Log-Likelihood of null model − Log-Likelihood of full model) = Null Deviance − Residual Deviance
● Null Deviance: The deviance of a model with only the intercept (no predictors). It shows how well the model fits with no predictors (i.e., the fit of the baseline model).
● Residual Deviance: The deviance when predictors are included. Lower residual deviance indicates a better fit.
A lower deviance indicates a better-fitting model. You can also use a chi-squared test to determine if the difference in deviance is statistically significant.
Deviance is a measure of the goodness of fit for a logistic regression model, similar to the residual sum of squares in linear regression; it is defined as −2 times the log-likelihood of the model.
Deviance Difference
The difference between the null deviance and residual deviance can be used to assess whether the inclusion
of predictors improves the model.
Δ Deviance = Null Deviance − Residual Deviance = 2 × (Log-Likelihood of full model − Log-Likelihood of null model)
If the difference is large and statistically significant (based on a chi-square test), it indicates that the predictors
improve the model.
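A hedged sketch with statsmodels (assumed available) showing how these deviance quantities can be read off a fitted model; the data are synthetic, purely to make the snippet runnable:

```python
# Null deviance, residual deviance and the likelihood-ratio statistic from statsmodels.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = (rng.random(200) < 1 / (1 + np.exp(-(0.5 + 1.5 * x)))).astype(int)

res = sm.Logit(y, sm.add_constant(x)).fit(disp=0)

null_deviance = -2 * res.llnull        # intercept-only model
residual_deviance = -2 * res.llf       # fitted model
print("Null deviance:    ", round(null_deviance, 2))
print("Residual deviance:", round(residual_deviance, 2))
print("LR statistic:     ", round(null_deviance - residual_deviance, 2))
print("LR test p-value:  ", res.llr_pvalue)
```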

2. Akaike Information Criterion (AIC): AIC is a measure of the relative quality of a model, balancing goodness
of fit with model complexity (the number of parameters).

The model with the lowest AIC is preferred, as it suggests a good balance between fit and parsimony (i.e.,
fewer parameters).

Formula: AIC = 2 × (number of parameters) − 2 × (Log-Likelihood)


AIC allows us to compare different models: the model with the lowest AIC is considered the best among the
choices.
3. Pseudo R-squared or McFadden’s R-squared
Pseudo R-squared is a measure of the goodness of fit in logistic regression models, analogous to R² in linear regression. It is not a proportion of variance explained, but it gives a rough idea of model fit.
McFadden's R-squared: The most common pseudo R-squared. It is calculated as:

    R²_McFadden = 1 − (Log-Likelihood of fitted model / Log-Likelihood of null model)

Values range from 0 to 1, where higher values indicate better model fit. However, McFadden's R² values tend to be lower than R² in linear regression, with values typically between 0.2 and 0.4 being considered excellent.
4. Bayesian Information Criterion (BIC)
Similar to AIC, BIC is a model selection criterion that penalizes the complexity of the model, but with a
stronger penalty for additional parameters. As with AIC, the model with the lowest BIC is preferred.
Formula: BIC = k × ln(n) − 2 × ln(L)
Where n is the sample size, k is the number of parameters, and L is the maximized likelihood.
A lower BIC is better, and the model with the smallest BIC is considered to have the best trade-off between fit
and complexity.
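A sketch comparing two candidate models on AIC, BIC, and McFadden's pseudo R-squared, continuing the synthetic-data idea from the previous snippet (statsmodels assumed available):

```python
# Model comparison: a model with one useful predictor vs. the same model plus noise.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x1 = rng.normal(size=300)
x2 = rng.normal(size=300)                       # an irrelevant predictor
y = (rng.random(300) < 1 / (1 + np.exp(-(0.3 + 1.2 * x1)))).astype(int)

m1 = sm.Logit(y, sm.add_constant(x1)).fit(disp=0)
m2 = sm.Logit(y, sm.add_constant(np.column_stack([x1, x2]))).fit(disp=0)

for name, m in [("x1 only", m1), ("x1 + x2", m2)]:
    print(name, "AIC:", round(m.aic, 1), "BIC:", round(m.bic, 1),
          "McFadden R2:", round(m.prsquared, 3))
# The model with the lower AIC/BIC is preferred; adding a useless predictor
# typically hurts BIC more than AIC because of its stronger penalty.
```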
5. Hosmer-Lemeshow Test: This is a statistical test used to assess the goodness of fit of a logistic regression
model, based on the comparison of observed vs. predicted probabilities.
The data is grouped into deciles based on predicted probabilities, and a chi-squared test is performed to
compare the observed and expected frequencies within each group.
A significant p-value (typically p<0.05) indicates a poor fit, while a non-significant p-value suggests that the
model fits the data well.
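A rough, hand-rolled sketch of the grouping and chi-square comparison described above (not from the original notes; the 10-group split and the df = groups − 2 convention are assumptions, and exact grouping conventions vary between implementations):

```python
# Hosmer-Lemeshow style statistic: compare observed vs. expected positives per decile.
import numpy as np
from scipy.stats import chi2

def hosmer_lemeshow(y_true, y_prob, groups=10):
    y_true, y_prob = np.asarray(y_true), np.asarray(y_prob)
    order = np.argsort(y_prob)                   # sort observations by predicted probability
    bins = np.array_split(order, groups)         # split into (roughly) equal deciles
    stat = 0.0
    for idx in bins:
        n = len(idx)
        observed = y_true[idx].sum()             # observed positives in the group
        expected = y_prob[idx].sum()             # expected positives in the group
        pi_bar = expected / n                    # mean predicted probability
        stat += (observed - expected) ** 2 / (n * pi_bar * (1 - pi_bar) + 1e-12)
    p_value = 1 - chi2.cdf(stat, df=groups - 2)
    return stat, p_value

# Usage (hypothetical): hl_stat, p = hosmer_lemeshow(y, fitted_probabilities)
```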
6. Area under Receiver Operating Characteristic (ROC) Curve: The AUC-ROC curve is a graphical plot that
shows the diagnostic ability of a binary classifier system as its discrimination threshold is varied.
The area under the curve (AUC) measures the ability of the model to discriminate between the positive and
negative classes. AUC ranges from 0 to 1, with higher values indicating better discrimination.
True Positive Rate (TPR) or Sensitivity: The y-axis on the ROC curve.
False Positive Rate (FPR): The x-axis on the ROC curve.
● Area Under the ROC Curve (AUC-ROC): A single value that summarizes the ROC curve’s performance:
● AUC = 1: Perfect model.
● AUC = 0.5: Model with no discriminative power (similar to random guessing).
● Higher AUC values indicate better model performance
7. Confusion Matrix / Error matrix/Contingency Table:
A Confusion matrix is an N x N matrix used for evaluating the performance of a classification model, where N
is the number of target classes. The matrix compares the actual target values with those predicted by the
machine learning model. This gives us a holistic view of how well our classification model is performing
and what kinds of errors it is making. It is a specific table layout that allows visualization of the performance
of an algorithm, typically a supervised learning one (in unsupervised learning it is usually called a matching
matrix).
For a binary classification problem, we would have a 2 x 2 matrix as shown below with 4 values:

                        Actual Positive        Actual Negative
  Predicted Positive    True Positive (TP)     False Positive (FP)
  Predicted Negative    False Negative (FN)    True Negative (TN)

Let’s decipher the matrix:
 The target variable has two values: Positive or Negative
 The columns represent the actual values of the target variable
 The rows represent the predicted values of the target variable
 True Positive
 True Negative
 False Positive – Type 1 Error
 False Negative – Type 2 Error
 Why we need a Confusion matrix?
 Precision vs Recall
 F1-score
Understanding True Positive, True Negative, False Positive and False Negative in a Confusion Matrix
True Positive (TP)
 The predicted value matches the actual value
 The actual value was positive and the model predicted a positive value
True Negative (TN)
 The predicted value matches the actual value
 The actual value was negative and the model predicted a negative value
False Positive (FP) – Type 1 error
 The predicted value was falsely predicted
 The actual value was negative but the model predicted a positive value
 Also known as the Type 1 error
False Negative (FN) – Type 2 error
 The predicted value was falsely predicted
 The actual value was positive but the model predicted a negative value
 Also known as the Type 2 error
 To evaluate the performance of a model, we have the performance metrics called,
 Accuracy, Precision, Recall & F1-Score metrics
Accuracy: The proportion of correctly predicted cases or instances. Accuracy is the most intuitive performance measure; it is simply the ratio of correctly predicted observations to the total observations.

    Accuracy = (TP + TN) / (TP + TN + FP + FN)

High accuracy alone does not guarantee that the model is best: accuracy is dependable only when you have a symmetric dataset where the counts of false positives and false negatives are almost the same.
Precision: (Positive Predictive Value): Proportion of correctly predicted positives out of all predicted
positives.
Precision is the ratio of correctly predicted positive observations to the total predicted positive observations.
It tells us how many of the predicted positive cases actually turned out to be positive.

    Precision = TP / (TP + FP)

Precision is a useful metric in cases where a False Positive is a higher concern than a False Negative. Precision is important in music or video recommendation systems, e-commerce websites, etc., where wrong results could lead to customer churn and be harmful to the business.
Recall (Sensitivity or True Positive Rate): The proportion of actual positives that are correctly identified or predicted. Recall is the ratio of correctly predicted positive observations to all observations in the actual positive class. Recall is a useful metric in cases where a False Negative is more costly than a False Positive. Recall is important in medical cases, where it matters little if we raise a false alarm, but actual positive cases must not go undetected!

    Recall = TP / (TP + FN)

Specificity (True Negative Rate): The proportion of actual negatives that are correctly identified.

    Specificity = TN / (TN + FP)
F1-score: It is the harmonic mean of Precision and Recall: F1 = 2 × (Precision × Recall) / (Precision + Recall). It gives a combined idea about these two metrics and is maximum when Precision is equal to Recall.
Therefore, this score takes both false positives and false negatives into account. The harmonic mean of
precision and recall, useful when the classes are imbalanced.
F1 is usually more useful than accuracy, especially if you have an uneven class distribution.
Accuracy works best if false positives and false negatives have similar cost.
If the cost of false positives and false negatives are very different, it’s better to look at both Precision and
Recall.
But there is a catch: the F1-score is harder to interpret on its own, because it does not tell us whether the classifier is favouring precision or recall. So we use it in combination with other evaluation metrics, which gives us a more complete picture of the result.
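As an illustration (scikit-learn assumed available; the labels below are made up), the confusion matrix and the derived metrics can be computed directly from true and predicted labels:

```python
# Confusion matrix counts and derived metrics for a small set of labels.
from sklearn.metrics import (confusion_matrix, accuracy_score,
                             precision_score, recall_score, f1_score)

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP, TN, FP, FN:", tp, tn, fp, fn)
print("Accuracy: ", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall:   ", recall_score(y_true, y_pred))
print("F1-score: ", f1_score(y_true, y_pred))
```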
Example:
Suppose we had a classification dataset with 1000 data points. We fit a classifier on it and get the confusion matrix below:

                        Actual Positive    Actual Negative
  Predicted Positive         560 (TP)           60 (FP)
  Predicted Negative          50 (FN)          330 (TN)

The different values of the confusion matrix would be as follows:
True Positive (TP) = 560
-Means 560 positive class data points were correctly classified
by the model.
True Negative (TN) = 330
-Means 330 negative class data points were correctly classified
by the model.
False Positive (FP) = 60
-Means 60 negative class data points were incorrectly classified
as belonging to the positive class by the model.
False Negative (FN) = 50
-Means 50 positive class data points were incorrectly classified as belonging to the negative class by the
model.
This turned out to be a pretty decent classifier for our dataset considering the relatively larger number of true
positive and true negative values.
Precisely we have the outcomes represented in Confusion Matrix as:
TP = 560, TN = 330, FP = 60, FN = 50
Accuracy:
The accuracy for our model turns out to be:
Accuracy = (TP + TN) / (TP + TN + FP + FN) = (560 + 330) / 1000 = 0.89, i.e., 89%.
Precision:
It tells us how many of the predicted positive cases actually turned out to be positive.
Precision = TP / (TP + FP) = 560 / (560 + 60) ≈ 0.903
This would determine whether our model is reliable or not.
Recall tells us how many of the actual positive cases we were able to predict correctly with our model.
Recall = TP / (TP + FN) = 560 / (560 + 50) ≈ 0.918
We can easily calculate Precision and Recall for our model by plugging the values into the above equations.
F1-Score = 2 × (Precision × Recall) / (Precision + Recall) = 2 × (0.903 × 0.918) / (0.903 + 0.918) ≈ 0.91
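The same figures can be checked with a few lines of arithmetic from the four counts given above:

```python
# Metric arithmetic for the worked example: TP=560, TN=330, FP=60, FN=50.
TP, TN, FP, FN = 560, 330, 60, 50

accuracy  = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall    = TP / (TP + FN)
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 3), round(precision, 3), round(recall, 3), round(f1, 3))
# 0.89, 0.903, 0.918, 0.911
```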

AUC (Area Under Curve) ROC (Receiver Operating Characteristics) Curves: Performance measurement is an
essential task in Data Modelling Evaluation. It is one of the most important evaluation metrics for checking any
classification model’s performance. It is also written as AUROC (Area Under the Receiver Operating
Characteristics) So when it comes to a classification problem, we can count on an AUC - ROC Curve.
When we need to check or visualize the performance of the multi-class classification problem,
we use the AUC (Area Under The Curve)
ROC (Receiver Operating Characteristics) curve.
What is the AUC - ROC Curve?
AUC - ROC curve is a performance measurement for the classification problems at various threshold settings.
ROC is a probability curve and AUC represents the degree or measure of separability.
It tells how much the model is capable of distinguishing between classes. The higher the AUC, the better the model is at predicting 0 classes as 0 and 1 classes as 1. By analogy, the higher the AUC, the better the model is at distinguishing between patients with the disease and patients without the disease.
The ROC curve is plotted with TPR against the FPR where TPR is on the y-axis and FPR is on the x-axis.

TPR (True Positive Rate) / Recall / Sensitivity = TP / (TP + FN)

Specificity = TN / (TN + FP)

FPR (False Positive Rate) = 1 − Specificity = FP / (FP + TN)
ROC curve
An ROC curve (receiver operating characteristic curve) is a graph showing the performance of a classification
model at all classification thresholds. This curve plots two parameters:
True Positive Rate and False Positive Rate
True Positive Rate (TPR) is a synonym for recall and is therefore defined as follows: TPR = TP / (TP + FN)
False Positive Rate (FPR) is defined as follows: FPR = FP / (FP + TN)
An ROC curve plots TPR vs. FPR at different classification thresholds. Lowering the classification threshold
classifies more items as positive, thus increasing both False Positives and True Positives. The following figure
shows a typical ROC curve.
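Such a curve can also be generated from a fitted model. A sketch with scikit-learn (assumed available) on synthetic data:

```python
# ROC points (FPR, TPR) at each threshold and the AUC for a small logistic model.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 2))
y = (rng.random(500) < 1 / (1 + np.exp(-(X[:, 0] - X[:, 1])))).astype(int)

clf = LogisticRegression().fit(X, y)
probs = clf.predict_proba(X)[:, 1]            # predicted probability of class 1

fpr, tpr, thresholds = roc_curve(y, probs)    # TPR vs FPR at each threshold
print("AUC:", round(roc_auc_score(y, probs), 3))
# Plotting fpr on the x-axis and tpr on the y-axis (e.g. with matplotlib) gives the ROC curve.
```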

3.7 Model Construction: Model construction in logistic regression involves several key steps that help build a statistical model for
predicting binary outcomes based on a set of independent variables. Logistic regression uses a linear
combination of the predictors but applies a non-linear transformation (the sigmoid function) to ensure the
predicted values lie between 0 and 1, representing probabilities. Here's how the model construction works:
1. Define the Problem
i. Define the Outcome: The first step is to identify the problem, which is typically a binary classification task,
where the target variable has two possible outcomes (e.g., 0 or 1, "yes" or "no", "true" or "false").
For example, in a medical context, Y could be whether a patient has a disease (1) or does not (0).
ii. Select Independent Variables (Identify predictor variables):
Identify the independent variables (predictors) that influence the dependent variable. These can be
continuous, categorical, or binary variables. Ex: predictors you will use to make predictions, such as age, income, exam score, etc.
2. Data Collection and Preprocessing
 Gather data: Collect a dataset that includes the binary outcome and predictor variables.
 Handle missing values: Decide how to address any missing values, which could involve imputation,
deleting rows, or replacing them with mean/median values.
 Encode categorical variables: Convert categorical variables into numerical form, often by using one-
hot encoding or label encoding.
 Scale numeric variables: Standardizing or normalizing predictors can improve model performance and
interpretability.
 Remove or transform outliers: Depending on the predictors, you may need to handle extreme values
that could distort the model.
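A hedged sketch of these preprocessing steps with pandas and scikit-learn (both assumed available; the column names and values are made up for illustration):

```python
# Impute missing values, scale numeric columns, and one-hot encode a categorical column.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    "age": [25, 32, None, 51],
    "income": [30000, 52000, 61000, None],
    "city": ["A", "B", "A", "C"],
    "purchased": [0, 1, 1, 0],
})

numeric = ["age", "income"]
categorical = ["city"]

preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])

X = preprocess.fit_transform(df[numeric + categorical])
y = df["purchased"]
print(X.shape)
```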
3. Define the Logistic Regression Model:
Logistic regression predicts the probability that an instance belongs to the positive class (e.g., 1), calculated by:

    P(Y=1 | X) = 1 / (1 + e^(−(β0 + β1X1 + β2X2 + ... + βnXn)))
Where:
 Y: Binary outcome (0 or 1).
 Xi: Predictor variables.
 β0: Intercept term.
 β1,β2,…,βn : Coefficients (parameters) associated with each predictor.
The model’s goal is to find the best values for the parameters β that maximize the likelihood of observing the
actual data.
4. Train the Model (Estimate Parameters)
 Maximum Likelihood Estimation (MLE) is typically used to estimate the coefficients. MLE finds the
values for β that maximize the likelihood of the observed outcomes in the training data.
 In practice, optimization algorithms (such as gradient descent or a variant of it) are used to maximize
the likelihood.
5. Model Evaluation and Validation
After fitting the model to the data, assess its performance using techniques such as:
 Confusion Matrix: Shows true positives, true negatives, false positives, and false negatives, allowing
calculation of accuracy, precision, recall, etc.
 ROC Curve and AUC: Assesses the model’s ability to discriminate between classes at various
thresholds.
 Log-Loss: Measures how close the predicted probabilities are to the actual outcomes.
 Cross-Validation: Split data into training and validation sets multiple times to ensure that the model
generalizes well to unseen data.

6. Interpret the Model Coefficients


In logistic regression, coefficients can be interpreted in terms of odds:
 Exponentiating the coefficients (i.e., eβi) gives the odds ratio for each predictor. For instance, if eβ1=2
for a predictor X1, it means a one-unit increase in X1 doubles the odds of the outcome.
 Positive coefficients indicate a positive association with the outcome (higher probability of 1), while
negative coefficients indicate a negative association.
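A small sketch of turning fitted coefficients into odds ratios via exp(β); the coefficient values below are purely illustrative, not from a real fit:

```python
# Odds ratios: exponentiate each coefficient to get the multiplicative effect on the odds.
import numpy as np

coefficients = {"intercept": -6.0, "age": 0.18, "income_k": 0.03}   # hypothetical values
odds_ratios = {name: round(float(np.exp(b)), 2) for name, b in coefficients.items()}
print(odds_ratios)
# e.g. exp(0.18) ≈ 1.20: each extra year of age multiplies the odds of the outcome by about 1.2.
```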
7. Make Predictions
 After training, you can use the model to predict the probability that new observations belong to the
positive class.
 Apply a threshold (typically 0.5) to classify observations as 0 or 1 based on the predicted probability.
Adjusting this threshold can change sensitivity and specificity, depending on the specific requirements
of the application.
8. Model Refinement (Optional)
Refine the model if performance is not satisfactory:
● Feature Engineering: Create new variables, interactions, or transformations.
● Regularization: Add regularization (L1 or L2 penalties) to handle multicollinearity, reduce overfitting, or perform feature selection.
Example of Logistic Regression Model Construction
Suppose we’re constructing a model to predict whether a person will purchase a product based on their age
and annual income.
1. Define Problem: The binary outcome is Purchase (Yes = 1, No = 0).
2. Collect and Preprocess Data: Collect data on age, income, and purchase history. Preprocess by
handling missing values, scaling age and income, and encoding the binary outcome.
3. Define the Model: The logistic regression equation is:
P(Purchase = 1) = 1 / (1 + e^(−(β0 + β1·Age + β2·Income)))
4. Train the Model: Fit the model to estimate β0, β1 and β2 using training data.
5. Evaluate and Validate: Evaluate performance with AUC, accuracy, or log-loss, and validate using cross-
validation.
6. Interpret Coefficients: Suppose eβ1=1.2 for Age, meaning each additional year of age increases the
odds of purchase by 20%.
7. Predict: For a new individual with specific age and income, calculate the probability of purchasing and
classify based on a chosen threshold.
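An end-to-end sketch of this purchase example with scikit-learn (assumed available), following the steps above on synthetic data with assumed feature names (age, income):

```python
# Fit, cross-validate, inspect odds ratios, and predict for the purchase example.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
age = rng.integers(18, 70, size=400)
income = rng.normal(50_000, 15_000, size=400)
logit = -8 + 0.05 * age + 0.0001 * income                 # assumed data-generating process
purchase = (rng.random(400) < 1 / (1 + np.exp(-logit))).astype(int)

X = np.column_stack([age, income])
model = make_pipeline(StandardScaler(), LogisticRegression())

print("CV accuracy:", round(cross_val_score(model, X, purchase, cv=5).mean(), 3))

model.fit(X, purchase)
print("Odds ratios (per std. dev. of each feature):",
      np.exp(model.named_steps["logisticregression"].coef_).round(2))
print("P(purchase) for age 40, income 60k:",
      round(model.predict_proba([[40, 60_000]])[0, 1], 2))
```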

3.8 Analytics applications to various Business Domains etc


Analytics has become a crucial tool in various business domains, helping companies make data-driven
decisions, optimize operations, improve customer satisfaction, and increase profitability. Different business
sectors leverage analytics in unique ways to gain a competitive edge. Below is an overview of how analytics is
applied across various business domains:
1. Marketing and Sales Analytics
Analytics in marketing and sales helps businesses understand customer behavior, optimize campaigns, and
drive revenue growth.
● Customer Segmentation: Grouping customers based on demographics, purchasing behavior, or
interests to target marketing efforts more effectively.
● Customer Lifetime Value (CLV): Predicting the total value a customer will bring to a company over
their entire relationship.
● Campaign Performance Analysis: Assessing the success of marketing campaigns by analyzing metrics
such as conversion rates, click-through rates (CTR), and return on investment (ROI).
● Churn Prediction: Using predictive analytics to identify customers likely to stop using a product or
service, enabling targeted retention strategies.
● Sales Forecasting: Predicting future sales based on historical data and trends, helping businesses
manage inventory, resources, and set realistic sales targets.
2. Supply Chain and Operations Analytics
Supply chain analytics is used to optimize logistics, inventory management, and production processes to
minimize costs and improve efficiency.
● Demand Forecasting: Predicting product demand to optimize inventory levels, reduce stockouts, and avoid overproduction.
● Inventory Optimization: Using data analytics to manage inventory levels efficiently, reducing carrying
costs and minimizing waste.
● Supplier Performance Monitoring: Tracking supplier performance in terms of delivery times, quality,
and cost to improve procurement processes.
● Logistics Optimization: Streamlining transportation routes, warehouse locations, and delivery
schedules to reduce shipping costs and improve delivery times.
● Risk Management: Identifying and mitigating risks in the supply chain, such as supplier delays, market
fluctuations, or disruptions.
3. Finance and Accounting Analytics
Analytics in finance helps businesses track financial performance, manage risk, and make informed investment
decisions.
● Financial Forecasting: Predicting future financial outcomes, such as revenue, expenses, and cash flow,
to assist in budgeting and financial planning.
● Risk Analysis: Assessing potential financial risks, such as market volatility, credit risk, or liquidity risk,
to protect the business.
● Fraud Detection: Using machine learning algorithms to detect unusual patterns or transactions that
may indicate fraudulent activity.
● Portfolio Management: Applying analytics to optimize investment portfolios by balancing risk and
return based on historical and real-time data.
● Cost Reduction: Analyzing operational costs to identify areas where savings can be made without
affecting quality or performance.
4. Human Resources (HR) Analytics
HR analytics, or people analytics, uses data to improve hiring, employee retention, and workforce
management.
● Talent Acquisition and Recruitment: Analyzing resumes, job performance data, and other indicators
to find the best candidates for positions.
● Employee Retention and Churn: Using predictive analytics to identify employees at risk of leaving and
implementing strategies to retain them.
● Performance Management: Monitoring employee performance and identifying areas for
improvement, training needs, or promotion opportunities.
● Workforce Planning: Predicting future workforce needs based on factors like turnover rates, market
growth, and skill gaps.
● Diversity and Inclusion Analytics: Measuring diversity metrics and tracking progress on inclusion
initiatives to improve company culture and performance.
5. Healthcare Analytics
Healthcare organizations use analytics to improve patient outcomes, manage resources, and reduce costs.
● Patient Care Analytics: Tracking patient outcomes, treatment effectiveness, and recovery rates to
improve the quality of care.
● Predictive Analytics in Disease Prevention: Predicting disease outbreaks, patient readmissions, or the
likelihood of developing certain conditions based on patient data.
● Operational Efficiency: Optimizing scheduling, staffing, and resource allocation in hospitals and clinics
to reduce wait times and improve care delivery.
● Claims and Billing Fraud Detection: Identifying patterns of fraudulent billing in healthcare claims.
● Population Health Management: Analyzing population health data to identify trends and inform
public health strategies.
6. Retail and E-commerce Analytics
Retailers use analytics to optimize pricing, inventory, and marketing to improve customer experience and
boost sales.
● Personalized Recommendations: Leveraging customer data to offer personalized product
recommendations and improve the customer experience.
● Dynamic Pricing: Adjusting prices in real-time based on demand, competition, and other factors to
maximize revenue.
● Customer Journey Analysis: Tracking how customers interact with the brand across various channels
(online and offline) to optimize marketing strategies.
● Inventory and Supply Chain Analytics: Ensuring that the right products are available at the right
locations to meet customer demand.
● Store Layout Optimization: Analyzing customer traffic patterns within physical stores to optimize
product placement and store layouts for increased sales.
7. Banking and Financial Services Analytics
Analytics in banking and financial services helps manage risk, understand customer needs, and streamline
operations.
● Credit Scoring: Using machine learning models to predict the likelihood of loan default and make more
informed lending decisions.
● Fraud Detection and Prevention: Real-time monitoring of transactions for signs of fraudulent activity.
● Customer Segmentation: Grouping customers based on their financial behavior, preferences, and risk
profiles to offer tailored services.
● Risk Management: Using analytics to predict market changes, interest rates, and other financial risks
to mitigate potential losses.
● Customer Relationship Management (CRM): Analyzing customer interactions to improve service
offerings and foster long-term relationships.
8. Telecommunications and Media Analytics
Telecom companies use analytics to improve network performance, reduce churn, and enhance customer
satisfaction.
● Churn Prediction: Identifying customers likely to leave the service and deploying retention strategies.
● Network Optimization: Analyzing network data to optimize performance, reduce downtime, and
improve service quality.
● Content Recommendation: Leveraging analytics to suggest relevant content to users based on their
preferences and viewing habits.
● Customer Experience Management: Analyzing customer feedback and service interactions to improve
satisfaction and loyalty.
● Revenue Assurance: Detecting revenue leaks and ensuring accurate billing and service delivery.
9. Manufacturing Analytics
In the manufacturing sector, analytics helps streamline operations, enhance product quality, and reduce
production costs.
● Predictive Maintenance: Using sensors and historical data to predict when machinery will fail,
allowing for proactive maintenance and reducing downtime.
● Quality Control: Analyzing production data to identify defects or inefficiencies in the manufacturing
process.
● Supply Chain Optimization: Streamlining the supply chain to reduce costs, improve delivery times, and
minimize waste.
● Production Scheduling: Optimizing production schedules based on demand forecasts, available
resources, and labor constraints.
● Inventory Management: Analyzing inventory data to ensure the right levels of raw materials and
finished goods are maintained.
10. Energy and Utilities Analytics
Energy companies use analytics to optimize resource management, improve sustainability, and predict
customer usage patterns.
● Demand Forecasting: Predicting energy demand to optimize resource allocation and reduce waste.
● Smart Grid Analytics: Analyzing data from smart meters to optimize energy distribution, detect fraud,
and improve service reliability.
● Predictive Maintenance: Predicting equipment failures and optimizing maintenance schedules to
reduce downtime.
● Energy Efficiency: Analyzing consumption data to develop programs that encourage more efficient
energy use among customers.
● Sustainability and Carbon Footprint Analysis: Tracking and analyzing environmental impact and
helping companies implement sustainable practices.
