Unit-3 Data Analysis
The goal of the algorithm is to find the best-fit line equation that can predict the values of the dependent variable based on
the independent variables.
Now, the slope of the line of best fit can be calculated from the formula as follows:
B1 = 46/40 = 1.15
Now, the intercept will be calculated from the formula as follows:
B0 = Ȳ – B1·X̄
B0 = 57 – 1.15 × 164 = -131.6
Thus, the equation of the line of best fit becomes Y = -131.6 + 1.15X
2. Logistic Regression: (Classification Technique) Used when the dependent variable is binary
or categorical (e.g., yes/no, 0/1). Despite its name, logistic regression is used for
classification, not regression. It estimates the probability that a given input belongs to a
certain class using a logistic function.
3. Polynomial Regression: A form of regression where the relationship between the
dependent and independent variables is modeled as an nth-degree polynomial. It helps
capture non-linear relationships between the variables.
4. Ridge and Lasso Regression: These are regularization techniques used to prevent overfitting
by adding penalty terms to the regression equation.
i. Ridge Regression: Shrinks the coefficients by adding a penalty proportional to the
square of their magnitude. It reduces large coefficients but doesn't set any to zero.
ii. Lasso Regression: Adds a penalty proportional to the absolute value of the
coefficients, which can lead to some coefficients being reduced to zero. This
effectively performs feature selection, removing irrelevant variables.
5. Stepwise Regression: (Dimensionality Reduction) A method used to select the most
significant variables for the model by adding or removing predictors based on their statistical
significance. It can be forward (adding variables) or backward (removing variables).
6. Non-Linear Regression: Used when the relationship between the dependent and
independent variables is not linear. It models more complex relationships using curves or
other non-linear functions.
7. Poisson Regression: Used when the dependent variable represents count data (e.g., the
number of occurrences of an event in a fixed time period).
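The regression variants above can be fitted with standard libraries; the following is a minimal Python sketch on synthetic data, assuming NumPy and scikit-learn are available (the data and settings are illustrative, not taken from the notes):

import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression, Ridge, Lasso
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))               # one independent variable
y = 3.0 + 2.0 * X[:, 0] + rng.normal(0, 1, 100)     # linear target with noise

lin = LinearRegression().fit(X, y)                  # simple linear regression
ridge = Ridge(alpha=1.0).fit(X, y)                  # ridge: L2 penalty shrinks coefficients
lasso = Lasso(alpha=0.1).fit(X, y)                  # lasso: L1 penalty can set coefficients to zero
X_poly = PolynomialFeatures(degree=2).fit_transform(X)
poly = LinearRegression().fit(X_poly, y)            # polynomial regression via expanded features

y_bin = (y > y.mean()).astype(int)                  # a binary target for classification
logit = LogisticRegression().fit(X, y_bin)          # logistic regression (classification)
print(lin.intercept_, lin.coef_)                    # estimated B0 and B1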
Assumptions of Linear Regression:
For linear regression to give reliable results, some assumptions need to be met:
1. Linearity: The relationship between the independent and dependent variables should be
linear.
2. Independence: The observations should be independent of each other.
3. Homoscedasticity: The variance of the residuals should be constant across all levels of the
independent variables.
4. Normality: The residuals should be normally distributed.
5. No Multicollinearity: Independent variables should not be highly correlated with each
other.
Applications of Regression:
● Finance: Predicting stock prices, returns, or risk factors.
● Marketing: Estimating the effect of advertising on sales.
● Healthcare: Predicting patient outcomes based on health metrics.
● Economics: Modeling relationships between economic indicators (e.g., unemployment rates
and GDP).
● Social Sciences: Studying relationships between social behaviors and outcomes.
Errors:
Sum of all errors: ∑error = ∑(Y − Ŷ), where error = Actual − Predicted
Sum of absolute values of all errors: ∑|error| = ∑|Y − Ŷ|
Sum of squares of all errors: ∑error² = ∑(Y − Ŷ)²
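A minimal Python sketch of these three error summaries, using made-up actual and predicted values for illustration:

import numpy as np
y_actual = np.array([50, 55, 65, 70, 80])   # observed Y values (illustrative)
y_hat    = np.array([48, 57, 63, 72, 79])   # predicted Yhat values (illustrative)
errors = y_actual - y_hat
print(errors.sum())           # sum of all errors, sum(Y - Yhat)
print(np.abs(errors).sum())   # sum of absolute errors
print((errors ** 2).sum())    # sum of squared errors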
Regression is a powerful statistical tool for understanding relationships between variables and
making predictions. Different types of regression techniques are used depending on the nature of
the data and the specific goal of the analysis. Proper use of regression can provide deep insights into
patterns and trends within datasets.
3.2 BLUE Property Assumption:
The BLUE property assumption is a foundational concept in regression analysis, especially in the
context of the Ordinary Least Squares (OLS) method. According to the Gauss-Markov Theorem,
under a set of specific assumptions, the OLS estimator is the Best Linear Unbiased Estimator (BLUE)
for estimating the coefficients in a linear regression model. Below, I will explain the components of
the BLUE property and the necessary assumptions in greater detail.
Components of the BLUE Property:
1. Best: The OLS estimator has the smallest possible variance among all linear and unbiased
estimators. In other words, OLS provides the most efficient estimates compared to other
unbiased methods. Lower variance means that the OLS estimates are more reliable because
they have less variability from sample to sample.
2. Linear: The OLS estimator is a linear combination of the observed values of the dependent
variable. This means that the estimated coefficients (β^0,β^1,…,β^k) are calculated by
applying linear operations (addition and scalar multiplication) to the observed data.
3. Unbiased: The OLS estimator, on average, provides the true value of the parameter being
estimated. This means that if we repeatedly draw random samples from the population and
apply the OLS method, the average of the estimated coefficients will equal the true
population parameters.
o Mathematically, this means that E(β̂) = β, where E(β̂) is the expected value of the
OLS estimator, and β is the true parameter value.
4. Estimator: The OLS method provides estimates of the unknown parameters (like β0,β1,…,βk
) in a regression model. These estimates are derived from the data and aim to describe the
relationship between the dependent and independent variables.
Example:
Imagine you want to estimate the average height of students in a class. You measure the
heights of 10
randomly chosen students:
1. Unbiased: If you were to repeat this process many times, the average height you calculate
should be close to the true average height of the entire class. This means your method
doesn’t
systematically overestimate or underestimate the average.
2. Linear: You calculate the average height by simply adding the heights of the students and
dividing by 10.
3. Best: Among all the ways to estimate the average height (like using weights or different
formulas), the simple average gives you the least variability in your estimates when repeated
multiple times, making it the best in terms of precision.
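A small simulation sketch of the "unbiased" and "best" points above, using an invented population of class heights; repeating the 10-student sample many times shows the average of the sample means matching the true class average:

import numpy as np
rng = np.random.default_rng(42)
population = rng.normal(loc=165, scale=8, size=500)   # invented class heights in cm

sample_means = [rng.choice(population, size=10, replace=False).mean()
                for _ in range(10_000)]               # repeat the 10-student measurement many times
print(population.mean())        # true average height of the class
print(np.mean(sample_means))    # average of the estimates: close to the true mean (unbiased)
print(np.std(sample_means))     # sample-to-sample variability of the estimator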
Assumptions for the BLUE Property:
The OLS estimator is guaranteed to be BLUE if the following five assumptions (also called the Gauss-
Markov assumptions) hold true in a linear regression model:
1. Linearity of the model in parameters
The model must be linear in terms of the parameters (coefficients), though not necessarily in terms
of the variables themselves. The model can be written as:
Y = β0 + β1X1 + β2X2 + ⋯ + βkXk + ϵ
The simple linear regression equation for which the OLS estimates have these properties is Y = β0 + β1X + ϵ (where ϵ is the error term).
The above equation is based on the following assumptions
a. Randomness of ϵ
b. Mean of ϵ is Zero
c. Variance of ϵ is constant
d. ϵ is normally distributed
e. Errors ϵ of different observations are independent.
Here, Y is the dependent variable, X1, X2, …, Xk are the independent variables, β0, β1, …, βk are the
parameters to be estimated, and ϵ is the error term (the difference between the observed and
predicted values).
The linearity assumption ensures that the relationship between the dependent and independent
variables can be described using a straight line or a linear combination of the independent variables.
2. Exogeneity (Zero Conditional Mean of the Errors)
The error term (ϵ) must have an expected value of zero given any values of the independent
variables. Mathematically:
E(ϵ∣X)=0
This means that the error term is uncorrelated with the independent variables and there is no
omitted variable bias. It implies that the independent variables in the model are not systematically
related to the unobserved factors (the errors) that influence the dependent variable.
If this assumption is violated, the OLS estimators become biased, meaning the average of the
estimates does not equal the true population parameters.
3. Homoscedasticity (Constant Variance of Errors)
The variance of the error terms should remain constant across all levels of the independent
variables. Formally, this means:
Var(ϵi | X) = σ² for all i
This is called homoscedasticity, which means that the spread (variance) of the errors is the same no
matter what value of the independent variable is used.
If the error variance changes for different values of the independent variables (a condition called
heteroscedasticity), OLS estimators are still unbiased, but they are no longer efficient (i.e., they do
not have the minimum variance). This reduces the reliability of the OLS estimates.
Homoscedasticity vs Heteroscedasticity:
3.3 Least Squares Estimation (LSE):
2. Minimization of Residuals: Denote the independent variable values as xi and the dependent ones as
yi. Residuals are the errors between the observed data and the values predicted by the model.
Mathematically, for a given set of observations (x1, y1), (x2, y2), …, (xn, yn) and predicted
values ŷ1, ŷ2, …, ŷn, the residual for each data point is:
ri = yi − ŷi, where yi is the observed value and ŷi is the predicted value from the model.
o The method works by minimizing the following cost function: ∑i (yi − ŷi)²,
where yi are the observed values and ŷi are the predicted values from the model.
3. Sum of Squared Residuals (SSR): Calculate the average values of xi and yi as x̄ and ȳ (these are
used later in the slope and intercept formulas). LSE minimizes the sum of squared residuals, given by:
SSR = ∑i=1..n (yi − ŷi)²
This ensures that both large positive and large negative residuals are penalized equally, as squaring
removes any negative signs.
To assess how well the model fits the data, various metrics can be used:
R-squared: Indicates the proportion of variance in the dependent variable
explained by the model.
Adjusted R-squared: Adjusts R-squared for the number of predictors, useful
for comparing models with different numbers of predictors.
Residual Analysis: Examining residuals can help identify patterns or
deviations from assumptions.
4. Best Fit Line: In the case of a linear regression model, the goal is to find a line (or hyperplane
in higher dimensions) that best fits the data. The equation of the line in simple linear
regression is:
Y=β0+β1X+ϵ
Where:
o β1 is the slope of the line (the coefficient).
o β0 is the intercept.
o LSE finds the values of β0 and β1 that minimize the sum of squared residuals between
the observed y values and the predicted ŷ values from this line.
5. Estimating Parameters:
In simple linear regression, the estimates for the slope β1 and intercept β0 are found
using calculus, by setting the derivatives of the sum of squared residuals to zero. The result of
the least squares method gives the parameters (coefficients) of the regression model, which
show the strength and direction of the relationship between the predictors and the
response variable. The formulas are:
Slope B1: B1 = ∑(xi − x̄)(yi − ȳ) / ∑(xi − x̄)²
o Intercept B0: B0 = ȳ − B1·x̄
For multiple linear regressions, the approach generalizes by minimizing the sum of squared
residuals for a model with multiple predictors.
6. Curve Fitting: LSE can also be applied to non-linear models where the relationship between
variables is more complex. For example, in polynomial regression, the goal is to fit a
polynomial curve to the data.
Why Use Least Squares Estimation:
● Minimizes Error: By minimizing the sum of squared residuals, LSE produces estimates that,
on average, lead to the smallest prediction errors for the given data.
● Simple and Efficient: LSE provides an algebraic solution to finding model parameters, which
is computationally efficient for many applications, especially linear regression.
● Widely Applicable: LSE can be used for linear, polynomial, and other forms of regression and
curve fitting in fields like economics, engineering, machine learning, and more.
Advantages and Limitations:
● Advantages:
o Minimizes error in predictions.
o Simple to implement and interpret.
o Provides the "best" linear fit under standard assumptions.
● Limitations:
o Sensitive to outliers, as squaring residuals gives more weight to large errors.
o Assumes linearity between the variables (though can be extended to non-linear
relationships with more complex models).
7. Applications:
o Least squares estimators are widely used across various fields, including economics,
biology, engineering, and social sciences, for tasks like predictive modeling, trend
analysis, and causal inference.
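For multiple linear regression (mentioned in step 5 above), the least-squares solution has the closed form of the normal equations, beta = (XᵀX)⁻¹Xᵀy. A minimal Python sketch on synthetic data (the coefficients and noise level are invented for illustration):

import numpy as np
rng = np.random.default_rng(1)
n = 200
x1 = rng.uniform(0, 10, n)
x2 = rng.uniform(0, 5, n)
y = 4.0 + 1.5 * x1 - 2.0 * x2 + rng.normal(0, 1, n)   # known "true" coefficients for reference

X = np.column_stack([np.ones(n), x1, x2])             # design matrix with an intercept column
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)          # minimizes the sum of squared residuals
print(beta_hat)                                       # approximately [4.0, 1.5, -2.0]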
Example in Data Analysis
Consider a dataset where you want to analyze the relationship between advertising spend and sales.
You collect data on these variables and use least squares regression to model the relationship.
1. Model Creation: You create a linear model:
Sales = β0 + β1 · Advertising Spend
Fitting the Model: By applying the least squares method, you estimate β0 (intercept) and β1 (slope).
2. Analysis: After fitting, you find that β1=2.5, indicating that for every additional dollar spent
on advertising, sales increase by $2.50.
3. Evaluation: You check the R-squared value to see how well your model explains the
variation in sales, and perform residual analysis to validate the assumptions.
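A sketch of this advertising example with hypothetical numbers (the spend and sales values below are invented), showing the least-squares fit plus R-squared, adjusted R-squared, and a residual check:

import numpy as np
spend = np.array([10, 15, 20, 25, 30, 35, 40], dtype=float)    # advertising spend (hypothetical)
sales = np.array([33, 44, 60, 69, 82, 95, 108], dtype=float)   # observed sales (hypothetical)

b1 = np.sum((spend - spend.mean()) * (sales - sales.mean())) / np.sum((spend - spend.mean()) ** 2)
b0 = sales.mean() - b1 * spend.mean()          # b1 comes out near 2.49 here, close to the 2.5 quoted above
pred = b0 + b1 * spend
resid = sales - pred

ss_res = np.sum(resid ** 2)                    # residual sum of squares
ss_tot = np.sum((sales - sales.mean()) ** 2)   # total sum of squares
r2 = 1 - ss_res / ss_tot
n, k = len(sales), 1                           # k = number of predictors
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
print(b0, b1, r2, adj_r2)
print(resid)                                   # residuals should show no obvious pattern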
The least squares estimator is a powerful tool in data analysis, enabling researchers and analysts to
model relationships, make predictions, and derive insights from data. Its simplicity, interpretability,
and effectiveness make it a cornerstone of statistical analysis in many fields.
Example Scenario
Suppose you want to understand the relationship between the number of hours studied and exam
scores for a group of students. Here’s a small dataset:
Hours Studied (X) Exam Score (Y)
1 50
2 55
3 65
4 70
5 80
Step 1: Model Structure
We assume a linear relationship between hours studied and exam scores:
Y=β0+β1X+ϵ
Where:
Y is the exam score,
X is the hours studied,
β0 is the intercept (score when no hours are studied),
β1 is the slope (change in score per additional hour studied),
ϵ is the error term.
Step 2: Calculate the Estimates
1. Mean Values:
o Mean of X (Hours Studied): X̄ = (1 + 2 + 3 + 4 + 5) / 5 = 3
o Mean of Y (Exam Score): Ȳ = (50 + 55 + 65 + 70 + 80) / 5 = 64
2. Calculate β1 (Slope):
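Completing the calculation with the least-squares formulas, a short Python sketch of the arithmetic (using X̄ = 3 and Ȳ = 64 from Step 1):

import numpy as np
X = np.array([1, 2, 3, 4, 5], dtype=float)
Y = np.array([50, 55, 65, 70, 80], dtype=float)
x_dev = X - X.mean()                              # deviations from Xbar = 3
y_dev = Y - Y.mean()                              # deviations from Ybar = 64
b1 = np.sum(x_dev * y_dev) / np.sum(x_dev ** 2)   # 75 / 10 = 7.5
b0 = Y.mean() - b1 * X.mean()                     # 64 - 7.5 * 3 = 41.5
print(b1, b0)   # each extra hour studied raises the predicted score by 7.5 points: Y = 41.5 + 7.5X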
3.4.1 Variable Rationalization: This is the process of selecting, transforming, and reducing the variables in a
dataset so that only relevant, well-behaved features enter the model. Common steps include:
● Feature Selection or Stepwise Forward Selection: This involves choosing which variables (or
features) from the dataset should be included in the model. Some features may be
irrelevant or redundant and should be removed to avoid overfitting, where the model
performs well on training data but poorly on new data.
Common methods include statistical tests, correlation analysis, and algorithms like Recursive
Feature Elimination (RFE).
● Feature Engineering: This step involves creating new features or modifying existing ones to
better represent the relationships in the data. For example, you might create interaction
terms, polynomial features, or log transformations of variables that show non-linear
relationships.
● Feature Transformation OR Normalizing and Standardizing Variables: Modifying variables
to improve their distribution or align them with the model's assumptions. Examples include
scaling, normalization, encoding categorical variables, and creating interaction terms. To
improve model performance, especially for algorithms sensitive to scale (like gradient
descent), variables may need to be normalized (scaling to a 0–1 range) or standardized
(transforming to have a mean of 0 and standard deviation of 1).
● Handling Missing Values: Missing data can bias results or reduce model accuracy. Strategies
like mean/median imputation, forward/backward filling, or more advanced techniques like
Multiple Imputation by Chained Equations (MICE) are often used.
● Dealing with Multicollinearity: It occurs when two or more independent variables are highly
correlated, which can cause instability in the model. Identifying and eliminating or
combining highly correlated variables helps improve model stability.
● Dimensionality Reduction: Reducing the number of features to avoid overfitting and
improve computation. Techniques include PCA, Singular Value Decomposition (SVD), or
more advanced algorithms like t-SNE and UMAP.
Why Variable Rationalization Matters:
● Prevents Overfitting: Fewer, more relevant variables reduce the risk of the model fitting
noise in the data rather than the actual signal.
● Improves Interpretability: With fewer, well-chosen variables, it's easier to understand the
relationships between the variables and the outcome.
● Boosts Efficiency: Reducing the number of variables makes the model more computationally
efficient, particularly when dealing with large datasets.
Example Context
Objective: Predict if a customer will default on a loan (a classification problem with a binary target:
Default or No Default).
Dataset Variables:
1. Age: Customer’s age.
2. Income: Monthly income.
3. Employment Type: Type of employment (e.g., salaried, self-employed).
4. Credit Score: Customer’s credit rating.
5. Loan Amount: Amount of loan requested.
6. Education Level: Highest level of education.
7. Marital Status: Married, single, divorced, etc.
8. Number of Dependents: Number of dependents the customer has.
Example Variable Rationalization
1.1 Feature Selection
Purpose: Remove irrelevant variables and keep only the ones that significantly impact
default prediction.
Example: Through exploratory data analysis, you may find that Marital Status and Education
Level have low correlation with the target variable (default rate) and decide to exclude
them.
1.2 Feature Transformation
Purpose: Make the data suitable for the model and improve performance.
Example: Transform Income and Loan Amount using logarithmic scaling if they show a highly
skewed distribution. This can make the data more normalized, improving the model’s
accuracy.
1.3 Handling Missing Values
Purpose: Ensure model consistency and avoid bias.
Example: If Credit Score has some missing values, you could fill them with the median credit
score of customers with similar income levels.
1.4 Dealing with Multicollinearity
Purpose: Reduce redundancy among variables.
Example: If Income and Loan Amount are highly correlated, consider combining them into a
single feature, such as Debt-to-Income Ratio, to avoid issues with multicollinearity.
1.5 Dimensionality Reduction
Purpose: Simplify the model and reduce overfitting risk.
Example: If the dataset includes many detailed financial metrics, you could use Principal
Component Analysis (PCA) to combine them into a few principal components that capture
the essential information.
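A minimal pandas sketch of the rationalization steps in this loan example (the column names and values below are hypothetical):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Income":        [25000, 40000, np.nan, 60000, 32000],
    "LoanAmount":    [5000, 12000, 8000, 20000, 7000],
    "CreditScore":   [650, np.nan, 720, 700, 580],
    "MaritalStatus": ["single", "married", "married", "single", "divorced"],
})

df = df.drop(columns=["MaritalStatus"])                           # 1.1 feature selection
df["Income"] = df["Income"].fillna(df["Income"].median())         # 1.3 handle missing values
df["CreditScore"] = df["CreditScore"].fillna(df["CreditScore"].median())
df["LogIncome"] = np.log1p(df["Income"])                          # 1.2 transform skewed variables
df["LogLoanAmount"] = np.log1p(df["LoanAmount"])
df["DebtToIncome"] = df["LoanAmount"] / df["Income"]              # 1.4 combine correlated variables
print(df.head())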
3.4.2. Model Building: It creates the predictive model that uses these features to make accurate
predictions and offer insights. Together, they ensure the development of an efficient, interpretable,
and accurate model.
It is the process of creating a statistical or machine learning model that best explains or predicts the
relationships between variables in a dataset or an outcome. This involves selecting an appropriate
algorithm, training the model on data, evaluating its performance, and refining it.
Key Steps in Model Building:
1. Problem Definition
2. Hypothesis Generation/selecting a model type
3. Data Collection/training the model
4. Data Exploration/Transformation
5. Predictive Modeling
6. Model Deployment
1. Defining the Problem: Clearly define what you are trying to predict or explain, setting up target
variables and objectives.
The first step in constructing a model is to understand the industrial problem in a more
comprehensive way. To identify the purpose of the problem and the prediction target, we
must define the project objectives appropriately.
Therefore, to proceed with an analytical approach, we have to recognize the obstacles first.
Remember, excellent results always depend on a better understanding of the problem.
2. Hypothesis Generation: Hypothesis generation is the guessing approach through which we derive
some essential data parameters that have a significant correlation with the prediction target.
Your hypothesis research must be in-depth, taking the perspective of every stakeholder into
account. We search for every suitable factor that can influence the outcome.
Hypothesis generation focuses on what you can create rather than what is available in the dataset.
3. Data Collection: Data collection is gathering data from relevant sources regarding the analytical
problem, and then we extract meaningful insights from the data for prediction.
4. Data Exploration/Transformation
The data you collected may be in unfamiliar shapes and sizes. It may contain unnecessary
features, null values, unanticipated small values, or immense values. So, before applying any
algorithmic model to data, we have to explore it first.
By inspecting the data, we get to understand the explicit and hidden trends in data. We find
the relation between data features and the target variable.
Usually, a data scientist invests 60–70% of the project time in data exploration alone.
7. Model Refinement and Iteration: Based on evaluation, the model is often refined by adjusting hyperparameters,
adding or removing features, or even switching to a different modeling technique if needed.
To support business decisions, plan market strategies, and personalize customer interests, we integrate the
machine learning model into the existing production domain (model deployment). When you go through the Amazon website,
the product recommendations you notice are based entirely on your interests, and you can see how such services
increase customer involvement. That is how a deployed model changes the mindset of the customer and convinces
him to purchase the product.
Key Takeaways
In medicine, for example, a frequent application is to find out which variables have an influence on a disease.
In this case, 0 could stand for not diseased and 1 for diseased. Subsequently, the influence of age, gender and
smoking status (smoker or not) on this particular disease could be examined.
Business example: For an online retailer, you need to predict which product a particular customer is most
likely to buy. For this, you receive a data set with past visitors and their purchases from the online retailer.
Medical example: You want to investigate whether a person is susceptible to a certain disease or not. For this
purpose, you receive a data set with diseased and non-diseased persons as well as other medical parameters.
Political example: Would a person vote for party A if there were elections next weekend?
Types of Logistic Regression
On the basis of the categories, Logistic Regression can be classified into three types:
Logistic regression comes in different forms, which are used depending on the structure of the target variable
and the nature of the classification task. Here are the primary types of logistic regression:
1. Binary Logistic Regression
This is the most common type of logistic regression, where the target variable has two possible outcomes
(also known as binary classification). Examples include:
Predicting if an email is spam or not spam.
Determining if a customer will buy a product (yes or no).
Classifying a loan application as approved or denied.
Equation:
P(Y=1) = 1 / (1 + e^(−(β0 + β1X1 + β2X2 + … + βnXn)))
Where P(Y=1) represents the probability of the outcome being the positive class.
2. Multinomial Logistic Regression
In multinomial logistic regression, the target variable has more than two categories that are not ordered. This
method is used for multi-class classification problems. For instance:
Classifying types of food (e.g., fruit, vegetable, dairy).
Predicting the weather (e.g., sunny, rainy, cloudy).
Categorizing the severity of a disease (e.g., mild, moderate, severe).
Multinomial logistic regression typically uses a one-vs-rest (OvR) approach, where separate binary logistic
regressions are run for each category compared to all other categories. Alternatively, some algorithms solve
this problem directly with a softmax function that assigns probabilities across multiple classes.
Equation (softmax form):
P(Y=k) = e^(zk) / ∑j=1..K e^(zj), with zk = β0k + β1kX1 + … + βnkXn
Where P(Y=k) is the probability of the target being in class k and K is the total number of classes.
3. Ordinal Logistic Regression
This type of logistic regression is used when the target variable has more than two categories that are
ordered. For instance:
Customer satisfaction ratings (e.g., dissatisfied, neutral, satisfied).
Education levels (e.g., high school, bachelor’s, master’s, PhD).
Disease severity levels (e.g., low, medium, high).
In ordinal logistic regression, the model assumes that while there are distinct categories, they follow a
meaningful order. The model estimates probabilities for each category, considering the order of categories.
Equation (cumulative logit form):
logit(P(Y ≤ k)) = αk − (β1X1 + β2X2 + … + βnXn)
Where αk is a threshold specific to the k-th category, and P(Y≤k) gives the cumulative probability up to
category k.
Logistic Regression Types
Type | Target Variable | Example
Binary Logistic Regression | Two categories | Spam vs. Not Spam
Multinomial Logistic Regression | Multiple categories, unordered | Fruit, Vegetable, Dairy
Ordinal Logistic Regression | Multiple categories, ordered | Low, Medium, High satisfaction level
Assumptions of Logistic Regression
Logistic regression relies on several key assumptions to ensure accurate predictions and valid model
performance. These assumptions guide the correct application of logistic regression and, if unmet, can lead to
biased or unreliable results.
Here are the main assumptions of logistic regression:
1. Binary or Ordinal Outcome (for Binary Logistic Regression)
Binary Logistic Regression requires a binary (dichotomous) outcome variable (e.g., success/failure,
yes/no).
For multinomial or ordinal logistic regression, the outcome variable can have multiple categories but
must still be categorical.
2. Linearity of the Logit
Logistic regression assumes a linear relationship between the logit of the outcome (the log odds) and
each predictor variable.
This means that, while logistic regression does not require a linear relationship between predictors and the
probability of the outcome itself, the log-odds of the outcome must have a linear relationship with predictor
variables.
3. No Multicollinearity Among Predictors
Predictor variables should not be highly correlated with each other (multicollinearity), as this can
make the estimated coefficients unreliable and lead to high standard errors.
Techniques such as variance inflation factor (VIF) or correlation matrices can be used to check for
multicollinearity.
4. Independence of Observations
Each observation should be independent of all others, meaning there should be no clustering or
correlation between observations.
This assumption can be violated in cases where data is collected in groups, such as repeated measures on the
same individual. In these cases, other techniques like mixed models might be more appropriate.
5. No Strong Outliers Influencing the Model
While logistic regression is less sensitive to outliers than linear regression, extreme outliers in the
predictor variables can still disproportionately impact the model.
Checking for outliers and, if needed, using techniques like robust regression or data transformation
can help minimize their impact.
6. Large Sample Size
Logistic regression works best with a large sample size because it estimates probabilities based on
observed data.
Having enough observations in each category of the outcome variable is important to ensure stable
and reliable estimates, especially for rare events.
7. No Perfect Separation
Logistic regression assumes that no predictor variable or combination of predictors perfectly predicts
the outcome.
Perfect separation occurs when one or more predictors can perfectly distinguish the categories of the
outcome variable, leading to infinite or undefined coefficients. Regularization methods or different
algorithms may be needed if perfect separation is present.
Logistic Regression Assumptions
Assumption | Description
Binary or Ordinal Outcome | Binary (or categorical for multinomial/ordinal logistic regression)
Linearity of the Logit | Linear relationship between predictors and logit (log-odds of the outcome)
No Multicollinearity | Predictors should not be highly correlated with each other
Independence of Observations | Each observation should be independent of others
No Strong Outliers | Outliers in predictors should not disproportionately influence the model
Large Sample Size | Sufficient sample size, especially for categories in binary outcome
No Perfect Separation | Predictors should not perfectly predict the outcome
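A small sketch of the multicollinearity check mentioned above, computing variance inflation factors by hand as VIF_j = 1 / (1 − R²_j), where R²_j comes from regressing predictor j on the remaining predictors (the data is synthetic, for illustration only):

import numpy as np

def vif(X):
    # X: 2-D array of predictor columns, without an intercept column
    n, p = X.shape
    out = []
    for j in range(p):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, target, rcond=None)   # regress column j on the rest
        resid = target - others @ beta
        r2 = 1 - resid @ resid / np.sum((target - target.mean()) ** 2)
        out.append(1.0 / (1.0 - r2))
    return out

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = 0.9 * x1 + rng.normal(scale=0.3, size=100)    # deliberately correlated with x1
x3 = rng.normal(size=100)
print(vif(np.column_stack([x1, x2, x3])))          # x1 and x2 show inflated VIFs; values above 5-10 are a common warning sign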
Terminologies involved in Logistic Regression: Here are some common terms involved in logistic regression:
● Independent variables: The input characteristics or predictor factors applied to the dependent
variable’s predictions.
● Dependent variable: The target variable in a logistic regression model, which we are trying to predict.
● Logistic function (sigmoid): The formula used to represent how the independent and dependent
variables relate to one another. The logistic function transforms the input variables into a probability
value between 0 and 1, which represents the likelihood of the dependent variable being 1 or 0.
● Odds: The odds of an outcome are P / (1 − P). The odds ratio (OR) quantifies the change in odds of the outcome for a one-unit increase in a
predictor variable, holding all other variables constant. It’s derived as OR = e^(βi), the exponentiated coefficient of that predictor.
An OR greater than 1 implies a positive association with the outcome, while an OR less than 1 implies a
negative association.
● Log-odds: The log-odds, also known as the logit function, is the natural logarithm of the odds. In
logistic regression, the log odds of the dependent variable are modeled as a linear combination of the
independent variables and the intercept.
● Coefficient: The logistic regression model’s estimated parameters show how the independent and
dependent variables relate to one another.
Coefficients β0,β1,…,βn represent the impact of each predictor variable on the log-odds of the
outcome.
β0 is the intercept term, the log-odds when all predictors are zero.
βi (for i=1,2,…,n) measures the change in the log-odds of the outcome for a one-unit increase in the
predictor variable xi.
● Intercept: A constant term in the logistic regression model, which represents the log odds when all
independent variables are equal to zero.
● Maximum likelihood estimation: The method used to estimate the coefficients of the logistic
regression model, which maximizes the likelihood of observing the data given the model.
● Confusion Matrix: It is a table used to evaluate the performance of a binary classifier by displaying
the counts of true positives, true negatives, false positives, and false negatives.
From this matrix, various performance metrics, like accuracy, precision, recall, and F1-score, can be
calculated.
● Multicollinearity: It occurs when predictor variables are highly correlated with each other, which can
make the model coefficients unstable and inflate the standard errors.
● Logistic regression assumes no high multicollinearity among predictors, and methods like Variance
Inflation Factor (VIF) can be used to detect it.
3.5.2 Model Theory:
The logistic regression model theory provides a mathematical framework for predicting binary outcomes (e.g.,
yes/no, success/failure) based on one or more predictor variables. Unlike linear regression, which models
continuous outcomes, logistic regression focuses on modeling the probability of a binary response variable.
1. The Problem with Linear Regression for Binary Outcomes
In binary classification, the outcome variable Y can only take two values (0 or 1). A linear regression model
can’t be directly applied here because:
The predictions (using a linear model) could fall outside the [0,1] range, which doesn’t make sense for
probabilities.
Linear regression assumes a linear relationship between predictors and the outcome, which doesn’t fit
well for binary responses.
In linear regression, the independent variables (e.g., age and gender) are used to estimate the specific
value of the dependent variable (e.g., body weight).
In logistic regression, on the other hand, the dependent variable is dichotomous (0 or 1) and the probability
that expression 1 occurs is estimated.
Logistic Regression Equation:
•The Logistic regression equation can be obtained from the Linear Regression equation. The mathematical
steps to get Logistic Regression equations are given below:
•Logistic Regression uses a more complex function than a straight line: the ‘Sigmoid function’, also known as the
‘logistic function’, is applied instead of a linear function.
•The hypothesis of logistic regression is required to limit its output to values between 0 and
1. Therefore, linear functions fail to represent it, as they can take values greater than 1 or less than 0, which is not
possible as per the hypothesis of logistic regression.
iv. Hypothesis Representation or Linear Combination of Inputs: Logistic regression starts with a linear
combination of the input features. So the logistic function is well suited to describe the probability P(y = 1).
•When using linear regression, we used a formula for the line equation as:
y = b0 + b1x1 + b2x2 + … + bnxn
•In the above equation, y is the response variable, x1, x2, …, xn are the predictor variables, and b0, b1, b2, …,
bn are the coefficients, which are numeric constants.
•For logistic regression, we need the maximum likelihood hypothesis.
•Apply the sigmoid function on y to obtain a probability:
P = 1 / (1 + e^(−y))
Example: To calculate the probability of a person being sick or not using logistic regression for the example
above, the model parameters b1, b2, b3 and a must first be determined. Once these have been determined, the
equation for the example above is:
P(sick) = 1 / (1 + e^(−(a + b1·age + b2·gender + b3·smoking status)))
V. Classification Decision:
Once we have a probability, we can classify the observation based on a threshold (usually 0.5):
o If P≥0.5 predict the positive class (1).
o If P<0.5 predict the negative class (0).
This threshold can be adjusted depending on the specific application, especially when dealing with
imbalanced datasets or varying costs of false positives vs. false negatives.
vi. Maximum Likelihood Method:
Logistic regression uses Maximum Likelihood Estimation (MLE) to find the best coefficients β that
maximize the likelihood of observing the data we have.
MLE adjusts the coefficients so the model’s predicted probabilities best match the actual outcomes in the
training data.
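A short sketch of the quantity MLE maximizes: the log-likelihood (its negative is the log-loss), written for a single predictor with illustrative data and candidate coefficients:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(b0, b1, x, y):
    p = sigmoid(b0 + b1 * x)                                # predicted probabilities
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

x = np.array([35, 45, 55, 65, 75, 85], dtype=float)        # illustrative exam scores
y = np.array([0, 0, 0, 1, 1, 1])                           # illustrative admitted (1) / not admitted (0)
# MLE searches for the (b0, b1) with the largest log-likelihood; in practice an optimizer
# such as gradient descent or Newton's method performs this search automatically.
print(log_likelihood(-6.0, 0.1, x, y))
print(log_likelihood(-3.0, 0.05, x, y))   # a worse candidate gives a lower (more negative) value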
Example: Predicting Admission to a College
Suppose a college wants to predict whether an applicant will be admitted (Yes = 1) or not admitted (No = 0)
based on their entrance exam score. We can model this as a binary classification problem using logistic
regression.
Step 1: Setting Up the Logistic Model
1. Input Variable (Feature):
o We have a single feature: Exam Score (let’s denote it as X).
2. Outcome (Target Variable):
o The outcome Y is binary:
1 if the applicant is admitted,
0 if the applicant is not admitted.
3. Model Equation:
o In logistic regression, we estimate the probability P that the applicant is admitted (i.e., Y=1)
given their exam score X.
o Logistic regression uses a linear combination of the input, transformed by the sigmoid function
to keep the result between 0 and 1.
The logit (log-odds) equation is:
z=β0+β1⋅ X
where:
o β0 is the intercept (a constant),
o β1 is the coefficient for the exam score.
4. Sigmoid Function:
o We use the sigmoid function to convert the logit z into a probability:
P(Y=1) = 1 / (1 + e^(−z))
o Plugging the estimated β0 and β1 and an exam score of X = 80 into the sigmoid gives P ≈ 0.88,
an 88% probability that an applicant with an exam score of 80 will be admitted.
3. Classification Decision:
o If we set the threshold at 0.5, we would classify this applicant as likely to be admitted (since
0.88 > 0.5).
o If an applicant had a lower score, say 50, the probability would be lower, and the classification
might switch to "not admitted."
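A sketch reproducing this admission example numerically; β0 = -6 and β1 = 0.1 are assumed illustrative values chosen to be consistent with the 88% figure above, not actual fitted coefficients:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

beta0, beta1 = -6.0, 0.1               # assumed for illustration only
for score in (80, 50):
    z = beta0 + beta1 * score          # logit (log-odds)
    p = sigmoid(z)                     # probability of admission
    decision = "admitted" if p >= 0.5 else "not admitted"
    print(score, round(p, 2), decision)   # 80 -> 0.88 admitted, 50 -> 0.27 not admitted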
3.6. Model fit Statistics: In logistic regression, model fit statistics help evaluate how well the model explains
the relationship between independent variables and a binary (yes/no, 0/1) outcome. These statistics assess
the model's quality, accuracy, and predictive performance, focusing on its ability to classify outcomes
correctly and fit the data well. Here are the key fit statistics commonly used in logistic regression:
1. Log-Likelihood: The log-likelihood measures how probable the observed data are under the fitted model;
logistic regression aims to maximize this value to find the best parameters for classifying outcomes. The
likelihood ratio test compares the fit of the logistic regression model to a baseline model (usually the null
model, which includes only the intercept). The deviance statistic is derived from the likelihood of the data
given the model: the null deviance (deviance of the null model) is compared to the residual deviance
(deviance of the fitted model).
Likelihood Ratio Test Statistic formula: LR = −2 × (Log-Likelihood of null model − Log-Likelihood of full model) = Null Deviance − Residual Deviance
● Null Deviance: Represents the deviance of a model with only the intercept (no predictors). It shows
how well the model fits with no predictors (i.e., the fit of the baseline model).
● Residual Deviance: Represents the deviance when predictors are included. Lower residual deviance
indicates a better fit, i.e., the predictors explain more of the variation in the outcome.
A lower deviance indicates a better-fitting model. You can also use a chi-squared test to determine if the
difference in deviance is statistically significant.
Deviance is a measure of the goodness of fit for a logistic regression model, similar to the residual sum of
squares in linear regression. It is computed from the log-likelihood as D = −2 × log-likelihood, so a smaller
deviance means a better fit.
Deviance Difference
The difference between the null deviance and residual deviance can be used to assess whether the inclusion
of predictors improves the model.
Δ Deviance = Null Deviance − Residual Deviance = −2 × (Log-Likelihood of null model − Log-Likelihood of full model)
If the difference is large and statistically significant (based on a chi-square test), it indicates that the predictors
improve the model.
2. Akaike Information Criterion (AIC): AIC is a measure of the relative quality of a model, balancing goodness
of fit with model complexity (the number of parameters): AIC = 2k − 2ln(L), where k is the number of parameters and L is the likelihood.
The model with the lowest AIC is preferred, as it suggests a good balance between fit and parsimony (i.e.,
fewer parameters).
3. McFadden's Pseudo R-squared: McFadden's R² = 1 − (log-likelihood of the fitted model / log-likelihood of the null model).
Values range from 0 to 1, where higher values indicate better model fit. However, McFadden's R² values tend
to be lower than R² for linear regression, with values typically between 0.2 and 0.4 being considered
excellent.
4. Bayesian Information Criterion (BIC)
Similar to AIC, BIC is a model selection criterion that penalizes the complexity of the model, but with a
stronger penalty for additional parameters. As with AIC, the model with the lowest BIC is preferred.
Formula: BIC = k · ln(n) − 2 · ln(L)
Where n is the sample size, k is the number of parameters, and L is the likelihood.
A lower BIC is better, and the model with the smallest BIC is considered to have the best trade-off between fit
and complexity.
5.Hosmer-Lemeshow Test: This is a statistical test used to assess the goodness of fit of a logistic regression
model, based on the comparison of observed vs. predicted probabilities.
The data is grouped into deciles based on predicted probabilities, and a chi-squared test is performed to
compare the observed and expected frequencies within each group.
A significant p-value (typically p<0.05) indicates a poor fit, while a non-significant p-value suggests that the
model fits the data well.
6. Area under Receiver Operating Characteristic (ROC) Curve: The AUC-ROC curve is a graphical plot that
shows the diagnostic ability of a binary classifier system as its discrimination threshold is varied.
The area under the curve (AUC) measures the ability of the model to discriminate between the positive and
negative classes. AUC ranges from 0 to 1, with higher values indicating better discrimination.
True Positive Rate (TPR) or Sensitivity: The y-axis on the ROC curve.
False Positive Rate (FPR): The x-axis on the ROC curve.
● Area Under the ROC Curve (AUC-ROC): A single value that summarizes the ROC curve’s performance:
● AUC = 1: Perfect model.
● AUC = 0.5: Model with no discriminative power (similar to random guessing).
● Higher AUC values indicate better model performance
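A minimal sketch computing the ROC curve and AUC with scikit-learn (assuming it is installed), using made-up labels and predicted probabilities:

import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1, 1, 0])                          # actual classes (made up)
y_prob = np.array([0.1, 0.3, 0.7, 0.65, 0.8, 0.6, 0.2, 0.9, 0.55, 0.35])   # model's predicted P(Y=1)

fpr, tpr, thresholds = roc_curve(y_true, y_prob)   # points of the ROC curve (FPR on x, TPR on y)
auc = roc_auc_score(y_true, y_prob)                # area under that curve (0.92 for these values)
print(auc)                                         # the closer to 1, the better the discrimination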
7. Confusion Matrix / Error matrix/Contingency Table:
A Confusion matrix is an N x N matrix used for evaluating the performance of a classification model, where N
is the number of target classes. The matrix compares the actual target values with those predicted by the
machine learning model. This gives us a holistic view of how well our classification model is performing
and what kinds of errors it is making. It is a specific table layout that allows visualization of the performance
of an algorithm, typically a supervised learning one (in unsupervised learning it is usually called a matching
matrix).
For a binary classification problem, we would have a 2 x 2 matrix as shown below with 4 values:
Let’s decipher the matrix:
The target variable has two values: Positive or Negative
The columns represent the actual values of the target variable
The rows represent the predicted values of the target variable
True Positive
True Negative
False Positive – Type 1 Error
False Negative – Type 2 Error
Why do we need a Confusion matrix? Because from it we can compute metrics such as Precision vs. Recall
and the F1-score, discussed below.
Understanding True Positive, True Negative, False Positive and False Negative in a Confusion Matrix
True Positive (TP)
The predicted value matches the actual value
The actual value was positive and the model predicted a positive value
True Negative (TN)
The predicted value matches the actual value
The actual value was negative and the model predicted a negative value
False Positive (FP) – Type 1 error
The predicted value was falsely predicted
The actual value was negative but the model predicted a positive value
Also known as the Type 1 error
False Negative (FN) – Type 2 error
The predicted value was falsely predicted
The actual value was positive but the model predicted a negative value
Also known as the Type 2 error
To evaluate the performance of a model, we have the performance metrics called,
Accuracy, Precision, Recall & F1-Score metrics
Accuracy: The proportion of correctly predicted cases or instances. Accuracy is the most intuitive performance
measure: it is simply the ratio of correctly predicted observations to the total observations.
Accuracy is a great measure, but on its own it does not tell us that the model is the best one.
Accuracy is dependable only when you have symmetric datasets, where the numbers of false positives and false
negatives are almost the same.
Precision (Positive Predictive Value): The proportion of correctly predicted positives out of all predicted
positives.
Precision is the ratio of correctly predicted positive observations to the total predicted positive observations.
It tells us how many of the predicted positive cases actually turned out to be positive.
Precision is a useful metric in cases where False Positive is a higher concern than False Negative. Precision is
important in music or video recommendation systems, e-commerce websites, etc. Wrong results could lead to
customer churns and be harmful to the business.
Recall (Sensitivity or True Positive Rate): The proportion of actual positives that are correctly identified or
predicted. Recall is the ratio of correctly predicted positive observations to all observations in the actual
positive class. Recall is a useful metric in cases where a False Negative is costlier than a False Positive. Recall is
important in medical cases, where it doesn’t matter much if we raise a false alarm, but the actual positive cases
should not go undetected!
Specificity (True Negative Rate): The proportion of actual negatives that are correctly identified.
F1-score: It is a harmonic mean of Precision and Recall. It gives a combined idea about these two metrics. It
is maximum when Precision is equal to Recall.
Therefore, this score takes both false positives and false negatives into account. The harmonic mean of
precision and recall, useful when the classes are imbalanced.
F1 is usually more useful than accuracy, especially if you have an uneven class distribution.
Accuracy works best if false positives and false negatives have similar cost.
If the cost of false positives and false negatives are very different, it’s better to look at both Precision and
Recall.
But there is a catch here: the interpretability of the F1-score is poor, meaning that we don’t know what our
classifier is maximizing – precision or recall. So, we use it in combination with other evaluation metrics, which
gives us a complete picture of the result.
Example:
Suppose we had a classification dataset with 1000 data points. We fit a classifier on it and get the below
confusion matrix:
Precision:
It tells us how many of the predicted positive cases actually turned out to be positive.
We can easily calculate Precision, Recall, and the F1-Score for our model by plugging the confusion-matrix
counts into the above equations.
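With a hypothetical confusion matrix for those 1000 points (the counts below are invented for illustration: TP = 560, TN = 330, FP = 60, FN = 50), the metrics work out as follows:

TP, TN, FP, FN = 560, 330, 60, 50                    # hypothetical counts summing to 1000

accuracy  = (TP + TN) / (TP + TN + FP + FN)          # 890 / 1000 = 0.89
precision = TP / (TP + FP)                           # 560 / 620 ≈ 0.90
recall    = TP / (TP + FN)                           # 560 / 610 ≈ 0.92
f1 = 2 * precision * recall / (precision + recall)   # ≈ 0.91
print(accuracy, precision, recall, f1)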
AUC (Area Under Curve) ROC (Receiver Operating Characteristics) Curves: Performance measurement is an
essential task in Data Modelling Evaluation. It is one of the most important evaluation metrics for checking any
classification model’s performance. It is also written as AUROC (Area Under the Receiver Operating
Characteristics) So when it comes to a classification problem, we can count on an AUC - ROC Curve.
When we need to check or visualize the performance of the multi-class classification problem,
we use the AUC (Area Under The Curve)
ROC (Receiver Operating Characteristics) curve.
What is the AUC - ROC Curve?
AUC - ROC curve is a performance measurement for the classification problems at various threshold settings.
ROC is a probability curve and AUC represents the degree or measure of separability.
It tells how much the model is capable of distinguishing between classes. Higher the AUC, the better the
model is at predicting 0 classes as 0 and 1 classes as 1. By analogy, the Higher the AUC, the better the model is
at distinguishing between patients with the disease and no disease.
The ROC curve is plotted with TPR against the FPR where TPR is on the y-axis and FPR is on the x-axis.
Specificity (True Negative Rate) = TN / (TN + FP); the False Positive Rate equals 1 − Specificity.
ROC curve
An ROC curve (receiver operating characteristic curve) is a graph showing the performance of a classification
model at all classification thresholds. This curve plots two parameters:
True Positive Rate and False Positive Rate
True Positive Rate (TPR) is a synonym for recall and is therefore defined as TPR = TP / (TP + FN).
False Positive Rate (FPR) is defined as FPR = FP / (FP + TN).
3.7 Model Construction: Model construction in logistic regression involves several key steps that help build a statistical model for
predicting binary outcomes based on a set of independent variables. Logistic regression uses a linear
combination of the predictors but applies a non-linear transformation (the sigmoid function) to ensure the
predicted values lie between 0 and 1, representing probabilities. Here's how the model construction works:
1. Define the Problem
i. Define the Outcome: The first step is to identify the problem, which is typically a binary classification task,
where the target variable has two possible outcomes (e.g., 0 or 1, "yes" or "no", "true" or "false").
For example, in a medical context, Y could be whether a patient has a disease (1) or does not (0).
ii. Select Independent Variables (Identify predictor variables):
Identify the independent variables (predictors) that influence the dependent variable. These can be
continuous, categorical, or binary variables, e.g., the age, income, or exam score you will use to make
predictions.
2. Data Collection and Preprocessing
Gather data: Collect a dataset that includes the binary outcome and predictor variables.
Handle missing values: Decide how to address any missing values, which could involve imputation,
deleting rows, or replacing them with mean/median values.
Encode categorical variables: Convert categorical variables into numerical form, often by using one-
hot encoding or label encoding.
Scale numeric variables: Standardizing or normalizing predictors can improve model performance and
interpretability.
Remove or transform outliers: Depending on the predictors, you may need to handle extreme values
that could distort the model.
3. Define the Logistic Regression Model:
Logistic regression predicts the probability that an instance belongs to the positive class (e.g., 1), calculated by:
P(Y=1 | X) = 1 / (1 + e^(−(β0 + β1X1 + β2X2 + … + βnXn)))
Where:
Y: Binary outcome (0 or 1).
Xi: Predictor variables.
β0: Intercept term.
β1,β2,…,βn : Coefficients (parameters) associated with each predictor.
The model’s goal is to find the best values for the parameters β that maximize the likelihood of observing the
actual data.
4. Train the Model (Estimate Parameters)
Maximum Likelihood Estimation (MLE) is typically used to estimate the coefficients. MLE finds the
values for β that maximize the likelihood of the observed outcomes in the training data.
In practice, optimization algorithms (such as gradient descent or a variant of it) are used to maximize
the likelihood.
5. Model Evaluation and Validation
After fitting the model to the data, assess its performance using techniques such as:
Confusion Matrix: Shows true positives, true negatives, false positives, and false negatives, allowing
calculation of accuracy, precision, recall, etc.
ROC Curve and AUC: Assesses the model’s ability to discriminate between classes at various
thresholds.
Log-Loss: Measures how close the predicted probabilities are to the actual outcomes.
Cross-Validation: Split data into training and validation sets multiple times to ensure that the model
generalizes well to unseen data.
Worked illustration (predicting a purchase from Age and Income): 4. Train the Model: Fit the model to estimate β0, β1, and β2 using training data.
5. Evaluate and Validate: Evaluate performance with AUC, accuracy, or log-loss, and validate using cross-
validation.
6. Interpret Coefficients: Suppose e^(β1) = 1.2 for Age, meaning each additional year of age increases the
odds of purchase by 20%.
7. Predict: For a new individual with specific age and income, calculate the probability of purchasing and
classify based on a chosen threshold.
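A minimal end-to-end sketch of these construction steps with scikit-learn, using synthetic Age and Income data (all values, and the relationship used to simulate labels, are assumptions for illustration):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
n = 500
age = rng.uniform(18, 70, n)
income = rng.uniform(20_000, 120_000, n)
true_logit = -8 + 0.08 * age + 0.00004 * income            # assumed relationship for the simulation
purchase = rng.binomial(1, 1 / (1 + np.exp(-true_logit)))  # binary outcome: purchased (1) or not (0)

X = np.column_stack([age, income])
scaler = StandardScaler().fit(X)                       # step 2: scale the numeric predictors
X_scaled = scaler.transform(X)

model = LogisticRegression().fit(X_scaled, purchase)   # step 4: MLE estimates of beta0, beta1, beta2
print(model.intercept_, model.coef_)                   # coefficients on the log-odds scale
print(np.exp(model.coef_))                             # odds ratios per one-standard-deviation increase

new_person = scaler.transform([[35, 50_000]])          # step 7: score a new individual
prob = model.predict_proba(new_person)[0, 1]
print(prob, "purchase" if prob >= 0.5 else "no purchase")   # classify with the 0.5 threshold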