
UNIT-3

Regression – Concepts:
Introduction:
The term regression is used to indicate the estimation or prediction of the average value of one
variable for a specified value of another variable.
Regression analysis is a very widely used statistical tool to establish a relationship model between
two variables.
Regression analysis is a statistical process for estimating the relationships between a dependent
variable (also called the criterion or response variable) and one or more independent variables
(predictor variables).
Regression describes how an independent variable is numerically related to the dependent variable.
Regression can be used for prediction, estimation and hypothesis testing, and modelling causal
relationships.
When is Regression chosen?
A regression problem is when the output variable is a real or continuous value, such as “salary” or
“weight”.
Many different models can be used; the simplest is linear regression, which tries to fit the data with
the hyperplane that best passes through the points.
Mathematically a linear relationship represents a straight line when plotted as a graph.
A non-linear relationship where the exponent of any variable is not equal to 1 creates a curve.
Types of Regression Analysis Techniques:
1. Linear Regression
2. Logistic Regression
3. Ridge Regression
4. Lasso Regression
5. Polynomial Regression
6. Bayesian Linear Regression

1. Linear Regression
Linear regression is used for predictive analysis. Linear regression is a linear approach for modelling
the relationship between the criterion or the scalar response and the multiple predictors or
explanatory variables. Linear regression focuses on the conditional probability distribution of the
response given the values of the predictors. For linear regression, there is a danger of overfitting.
The formula for linear regression is:
y = θx + b
where:
θ – the model weights or parameters (the slope)
b – the bias (the intercept)
This is the most basic form of regression analysis and is used to model a linear relationship between
a single dependent variable and one or more independent variables.
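As a minimal sketch (the data below is illustrative), the line y = θx + b can be fitted with NumPy's least-squares polynomial fit:

```python
import numpy as np

# Illustrative data: y is roughly 2x + 1 with small deviations
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 9.0, 11.1])

# Fit y = theta*x + b by least squares (degree-1 polynomial);
# polyfit returns coefficients from highest degree down: [theta, b]
theta, b = np.polyfit(x, y, deg=1)

# Predict for a new input
y_new = theta * 6.0 + b
print(theta, b, y_new)
```

Here `theta` comes out close to 2 and `b` close to 1, matching the trend in the data.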
2. Logistic Regression
Logistic regression is a supervised machine learning algorithm used for classification tasks where the
goal is to predict the probability that an instance belongs to a given class or not. Logistic regression is
a statistical algorithm which analyses the relationship between two data factors. Logistic regression
predicts the output of a categorical dependent variable.
Therefore, the outcome must be a categorical or discrete value. It can be Yes or No, 0 or 1,
True or False, etc., but instead of giving the exact values 0 and 1, it gives probabilistic values
which lie between 0 and 1. In logistic regression, instead of fitting a regression line, we fit an "S"-
shaped logistic function, which predicts two maximum values (0 or 1).
3. Ridge Regression
Ridge regression is a technique for analysing multiple regression data that suffer from
multicollinearity. When multicollinearity occurs, least squares estimates are unbiased, but their
variances are large, so they may be far from the true value. Ridge regression is a regularized linear
regression model: it reduces model complexity by adding a penalty term to the cost function. A
degree of bias is added to the regression estimates, and as a result, ridge regression reduces the
standard errors.
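A minimal NumPy sketch of the closed-form ridge solution β = (XᵀX + λI)⁻¹Xᵀy, with an illustrative penalty strength `lam`:

```python
import numpy as np

# Illustrative design matrix with two strongly correlated columns
X = np.array([[1.0, 1.0],
              [2.0, 2.1],
              [3.0, 2.9],
              [4.0, 4.2]])
y = np.array([2.0, 4.1, 6.0, 8.1])

lam = 1.0  # ridge penalty strength (a hyperparameter, chosen here for illustration)

# Ridge closed form: beta = (X^T X + lam*I)^(-1) X^T y
beta = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
print(beta)  # coefficients are shrunk toward zero relative to plain least squares
```

Compared with the plain least-squares solution on the same data, the ridge coefficients have a smaller norm, which is the stabilizing effect described above.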

4. Lasso Regression
Lasso regression is a regression analysis method that performs both variable selection
and regularization. Lasso regression uses soft thresholding. Lasso regression selects only a subset of
the provided covariates for use in the final model.
This is another regularized linear regression model; it also works by adding a penalty term to the
cost function, but it tends to zero out some features' coefficients, which makes it useful for feature
selection.
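The soft-thresholding operation mentioned above can be sketched directly; `lam` is an illustrative threshold, and coefficients whose magnitude falls below it are zeroed out, which is how lasso performs variable selection:

```python
def soft_threshold(coef, lam):
    """Shrink coef toward zero by lam; set it to exactly zero if |coef| <= lam."""
    if coef > lam:
        return coef - lam
    if coef < -lam:
        return coef + lam
    return 0.0

# Small coefficients are zeroed out, large ones are shrunk
coefs = [2.5, -0.3, 0.1, -2.0]
print([soft_threshold(c, lam=0.5) for c in coefs])  # [2.0, 0.0, 0.0, -1.5]
```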
5. Polynomial Regression
This is an extension of linear regression and is used to model a non-linear relationship between the
dependent variable and independent variables. The model form stays the same, but the input
variables now include polynomial (higher-degree) terms of some already existing features.
Linear regression alone can only fit a linear model to the data at hand, but with polynomial
features we can easily fit non-linear relationships between the target and the input features.
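A minimal sketch of this idea (illustrative data): augment the inputs with a squared term, then fit an ordinary least-squares model on the expanded features:

```python
import numpy as np

# Illustrative data following y = x^2 (a non-linear relationship)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = x ** 2

# Expand the input with a squared term, then fit an ordinary linear model
X = np.column_stack([np.ones_like(x), x, x ** 2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef)  # the x^2 coefficient dominates; intercept and x terms are near zero
```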
6. Bayesian Linear Regression
As the name suggests, this algorithm is based on Bayes' theorem. For this reason we do not use the
least squares method to determine the coefficients of the regression model. Instead, the model
weights and parameters are found from the posterior distribution of the parameters, which gives
the resulting regression model an extra degree of stability.
Advantages & Limitations of Regression:
 Fast and easy to model and is particularly useful when the relationship to be modelled is not
extremely complex and if you don’t have a lot of data.
 Very intuitive to understand and interpret.
 Linear Regression is very sensitive to outliers.

ROC (Receiver Operating Characteristic)


ROC Analysis
ROC (Receiver Operating Characteristic) analysis is a statistical method used to evaluate the
performance of a binary classification model. It's a widely used technique in machine learning, data
science, and medical diagnosis.
What does ROC analysis do?
ROC analysis assesses the ability of a classification model to distinguish between two classes,
typically positive (e.g., disease present) and negative (e.g., disease absent). The analysis provides a
graphical representation of the model's performance, which helps to:
1. Evaluate the model's accuracy
2. Compare the performance of different models
3. Identify the optimal threshold for classification
ROC Curve
A receiver operating characteristic curve, or ROC curve, is a graphical plot that illustrates the
performance of a binary classifier model. While ROC curves are typically used for binary
classification, they can also be extended to classification problems with more than two categories.
The ROC curve shows the relationship between the true positive rate (TPR) and the false
positive rate (FPR) at different threshold settings. Here the TPR refers to how often the model
correctly identifies positive cases, and the FPR refers to how often the model incorrectly identifies
negative cases as positive. Because the ROC curve is calculated from a sample of the population
rather than the entire population, it provides an estimate of the model's performance. The ROC
curve shows how the model's sensitivity (its ability to detect true positives) changes as the false
positive rate changes. When the probability distributions of true positives and false positives are
known, the ROC curve can be calculated using the cumulative distribution functions of these
probabilities. ROC analysis provides a way to compare and select models based on their
performance, without considering the costs or class distribution. It is also closely related to
the analysis of costs and benefits in decision-making, particularly in diagnostic applications.

BLUE Property Assumptions


BLUE Property
The BLUE property stands for Best Linear Unbiased Estimator. It refers to an estimator that fulfils
three key conditions:
1. Unbiasedness: The estimator should be unbiased, meaning its expected value is equal to the
true value of the parameter being estimated. In other words, the difference between the
expected value of the estimator and the true value is zero (i.e., no bias).
2. Least Variance: The estimator should have the smallest variance among all unbiased
estimators. This means that the estimator’s values are closely clustered around the true
parameter value, implying greater precision in the estimation.
3. Linearity: The estimator should be linear in nature, meaning it can be expressed as a linear
function of the observed data.
BLUE Properties of OLS Method:
In the context of Ordinary Least Squares (OLS) regression, the BLUE properties are defined as
follows:
1. Unbiased: The OLS estimators are unbiased. This means that the expected value of the
estimated coefficients (β̂0 and β̂1) equals the true values of the parameters (β0 and β1)
in the population.
2. Least Variance: Among all linear estimators that are unbiased, OLS estimators have the
smallest variance. This property ensures that the OLS estimates are the most precise among
the unbiased estimators.
3. Linearity: OLS estimators are linear functions of the observed data. In a simple linear
regression model y = β0 + β1x + μ, the estimators for β0 and β1 are linear combinations of
the observed data points (specifically, the independent variable x).
Thus, the BLUE property guarantees that OLS estimators are the best (in terms of smallest variance)
among all unbiased, linear estimators. This makes OLS an optimal method when the classical
assumptions hold (such as no autocorrelation, homoscedasticity, and random sampling).

BLUE Property Assumptions:


I. Random Sampling: Data must be randomly sampled to ensure unbiased OLS estimators.
II. Zero Conditional Mean: The error term (μ) should have an expected value of zero given the
independent variable (x): E(μ∣X) =0, ensuring no systematic correlation between errors and
predictors.
III. Homoscedasticity: The error term should have constant variance across all levels of x,
ensuring efficient estimates.
IV. No Autocorrelation: Error terms should not be correlated with one another; this is
important, especially in time series data.
V. Normality of Errors: Errors should ideally follow a normal distribution (helpful for inference,
but not required for BLUE).
VI. No Perfect Multicollinearity: Independent variables should not be perfectly correlated to
ensure stable coefficient estimation.
VII. Linearity: The relationship between y and x should be linear for accurate estimation.

(BLUE Property Assumptions in short)


To summarize, for OLS to produce the Best Linear Unbiased Estimators (BLUE), the following
assumptions must hold:
1. Random sampling of observations.
2. The conditional mean of errors is zero (E(μ∣X)=0)
3. Homoscedasticity (constant variance of errors).
4. No autocorrelation of errors.
5. Normal distribution of errors (optional, but helpful for inference).
6. No perfect multicollinearity.
7. Linearity of the relationship between dependent and independent variables.
Ordinary Least Squares (OLS) Estimation:
Ordinary Least Squares (OLS) is a method used for estimating the parameters of a linear regression
model. In a simple linear regression model, the goal is to find the best-fitting line through the data
points that minimizes the sum of squared residuals (the differences between observed and predicted
values).
Steps of OLS Estimation:
1. Model Setup: In simple linear regression, the model is represented as:
y = β0 + β1x + μ
Where:
 y = Dependent variable
 x = Independent variable
 β0 = Intercept (constant)
 β1 = Slope (coefficient)
 μ = Error term
The goal of OLS is to estimate the values of β0 and β1 that best fit the data.
2. Objective of OLS: OLS aims to find the parameter estimates β0 and β1 that minimize the sum of
squared residuals (errors):
minimize SSR = Σ (yi − β0 − β1xi)², summed over i = 1, …, n
3. Deriving the OLS Estimators: Setting the partial derivatives of the SSR with respect to β0 and β1
to zero yields the closed-form estimators:
β̂1 = Σ (xi − x̄)(yi − ȳ) / Σ (xi − x̄)²
β̂0 = ȳ − β̂1 x̄
where x̄ and ȳ are the sample means of x and y.
4. Key Properties of OLS Estimators:


 Unbiasedness: OLS estimators are unbiased, meaning that the expected value of the
estimators equals the true population values: E(β̂0) = β0, E(β̂1) = β1
 This ensures that the OLS method produces correct estimates on average.
 Least Variance (Efficiency): OLS estimators have the smallest variance among all linear
estimators, making them efficient. This is true when the assumptions of the classical linear
regression model are satisfied.
 Best Linear Unbiased Estimator (BLUE): OLS is considered the Best Linear Unbiased
Estimator because it is both unbiased and has the smallest variance among all unbiased
linear estimators, given the classical assumptions hold.
5. Assumptions for OLS to be BLUE:
For OLS estimators to have the Best Linear Unbiased Estimator (BLUE) properties, the following
assumptions must hold:
i. Random Sampling: Observations must be randomly sampled from the population to avoid
bias.
ii. Zero Conditional Mean: The expected value of the error term, given the independent
variable, must be zero:
E(μ∣X) = 0
This ensures no systematic relationship between the errors and the predictors.
iii. Homoscedasticity: The variance of the error term μ should be constant for all values of x:
Var(μ) = σ²
iv. No Autocorrelation: The error terms should not be correlated with each other, particularly
important in time series data.
v. Normality of Errors (optional): For hypothesis testing and confidence intervals, the error
terms are ideally normally distributed, though this is not necessary for OLS to be BLUE.
vi. No Perfect Multicollinearity: There should be no perfect linear relationship among the
independent variables.
vii. Linearity: The relationship between the dependent variable and independent variables must
be linear.

6. Minimum Mean Squared Error (MSE):


OLS minimizes the Mean Squared Error (MSE), which is the average of the squared differences
between the observed values and the predicted values:
MSE = (1/n) Σ (yi − ŷi)²
The smaller the MSE, the better the model's fit to the data.
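A minimal sketch of the MSE computation (illustrative values):

```python
# MSE = (1/n) * sum((y_i - y_hat_i)^2)
def mse(y_true, y_pred):
    n = len(y_true)
    return sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred)) / n

print(mse([3.0, 5.0, 7.0], [2.0, 5.0, 9.0]))  # (1 + 0 + 4) / 3
```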
Homoscedasticity vs. Heteroscedasticity

Definition:
- Homoscedasticity refers to the situation in which the variance of the error term is constant across
all levels of the independent variable(s). In other words, the "spread" of the residuals (errors) is
the same for all values of the independent variable(s).
- Heteroscedasticity occurs when the variance of the error term differs across the values of the
independent variable(s), i.e. the error terms exhibit non-constant variance at different levels of the
independent variable(s).

Error:
- Homoscedasticity: the error term exhibits uniform variance across all values of the independent
variable.
- Heteroscedasticity: the error variance changes as the value of the independent variable changes.

Residuals:
- Homoscedasticity: the residuals (errors) are evenly scattered around the regression line.
- Heteroscedasticity: the residuals become more spread out (or tighter) as the independent variable
increases, leading to inconsistent variance.

Key Point:
- Homoscedasticity is a crucial assumption for Ordinary Least Squares (OLS) regression to provide
the best (BLUE) estimators.
- Heteroscedasticity violates this assumption and can lead to inefficiency in parameter estimation.

Visual Representation:
- Homoscedasticity: a scatter plot of residuals shows a consistent spread of points around the
horizontal axis (the residual = 0 line) at all levels of the independent variable(s).
- Heteroscedasticity: a scatter plot of residuals shows a non-uniform spread of points; the spread
can increase or decrease as the independent variable(s) increase (e.g., fanning out or narrowing
down).

Prediction errors (example with years of education):
- Homoscedasticity: prediction errors are consistently spread across all levels of education.
- Heteroscedasticity: prediction errors increase as years of education rise.

Variable Rationalization:
 The data set may have a large number of attributes. But some of those attributes can be irrelevant
or redundant. The goal of Variable Rationalization is to improve the Data Processing in an optimal
way through attribute subset selection.
 This process is to find a minimum set of attributes such that dropping of those irrelevant attributes
does not much affect the utility of data and the cost of data analysis could be reduced.
 Mining on a reduced data set also makes the discovered pattern easier to understand.
As part of Data processing, we use the below methods of Attribute subset selection
1. Stepwise Forward Selection
2. Stepwise Backward Elimination
3. Combination of Forward Selection and Backward Elimination
4. Decision Tree Induction.
All the above methods are greedy approaches for attribute subset selection.
1. Stepwise Forward Selection:
This procedure starts with an empty set of attributes as the minimal set. The most relevant
attributes are chosen (having minimum p-value) and are added to the minimal set. In each
iteration, one attribute is added to a reduced set.
2. Stepwise Backward Elimination:
Here all the attributes are considered in the initial set of attributes. In each iteration, one
attribute is eliminated from the set of attributes whose p-value is higher than significance
level.
3. Combination of Forward Selection and Backward Elimination:
The stepwise forward selection and backward elimination are combined so as to select the
relevant attributes most efficiently. This is the most common technique which is generally
used for attribute selection.
4. Decision Tree Induction:
This approach uses decision tree for attribute selection. It constructs a flow chart like
structure having nodes denoting a test on an attribute. Each branch corresponds to the
outcome of test and leaf nodes is a class prediction. The attribute that is not the part of tree
is considered irrelevant and hence discarded.

Model Building
Model Building Life Cycle in Data Analytics: When we come across a business analytical problem,
we often proceed toward execution without acknowledging the hurdles (challenges or obstacles),
and try to implement and predict outcomes before realizing the pitfalls. Below are the
problem-solving steps involved in the data science model-building life cycle; let's understand each
step in depth. The following are the steps to follow to build a data model:
1. Problem Definition
2. Hypothesis Generation
3. Data Collection
4. Data Exploration/Transformation
5. Predictive Modelling
6. Model Deployment
1.Problem Definition:
 The first step in constructing a model is to understand the industrial problem in a more
comprehensive way. To identify the purpose of the problem and the prediction target, we must
define the project objectives appropriately.
 Therefore, to proceed with an analytical approach, we have to recognize the obstacles first.
Remember, excellent results always depend on a better understanding of the problem.
2. Hypothesis Generation
Hypothesis generation is the guessing approach through which we derive some essential data
parameters that have a significant correlation with the prediction target.
 Your hypothesis research must be in-depth, looking for every perceptive of all stakeholders into
account. We search for every suitable factor that can influence the outcome.
 Hypothesis generation focuses on what you can create rather than what is available in the dataset.
3. Data Collection
 Data collection is gathering data from relevant sources regarding the analytical problem, then we
extract meaningful insights from the data for prediction.
The data gathered must have:
 Proficiency in answering the hypothesis questions.
 Capacity to elaborate on every data parameter.
 Effectiveness to justify your research.
 Competency to predict outcomes accurately.
4.Data Exploration/Transformation
 The data you collected may be in unfamiliar shapes and sizes. It may contain unnecessary features,
null values, unanticipated small values, or immense values. So, before applying any algorithmic
model to data, we have to explore it first.
 By inspecting the data, we get to understand the explicit and hidden trends in data. We find the
relation between data features and the target variable.
 Usually, a data scientist invests 60–70% of project time in data exploration alone.
 There are several sub-steps involved in data exploration:
i. Feature Identification:
You need to analyse which data features are available and which ones are not. Identify
independent and target variables. Identify data types and categories of these variables.
ii. Univariate Analysis:
- Examine each variable individually.
- For continuous variables: Analyse mean, median, standard deviation, and skewness.
- For categorical variables: Use frequency tables to understand data distribution and measure
counts and frequency.
iii. Multi-variate Analysis:
Bi-variate and multi-variate analysis help to discover the relationships between two or more
variables. For continuous variables we can compute correlations; for categorical variables we can
use cross-tabulations (two-way tables).
iv. Filling Null Values:
- Replace null values in continuous variables with the mean or median.
- Replace null values in categorical variables with the most frequent value (the mode).
- Avoid deleting rows with null values, to preserve information.
5. Predictive Modelling:
 Predictive modelling is a mathematical approach to create a statistical model to forecast future
behaviour based on input test data. Steps involved in predictive modelling:
Algorithm Selection:
- For structured data and predicting continuous/categorical outcomes, use supervised learning
(regression, classification).
- For unstructured data and clustering, use unsupervised learning.
- Apply multiple algorithms to achieve a more accurate model.
Train Model:
After assigning the algorithm and getting the data handy, we train our model using the input data
applying the preferred algorithm. It is an action to determine the correspondence between
independent variables, and the prediction targets.
Model Prediction:
- Use the trained model to make predictions on input test data.
- Evaluate accuracy using cross-validation or ROC curve analysis.
6. Model Deployment:
- Deploying a model in a real-time environment provides valuable analytical insights.
- Continuously update the model with new features to improve customer satisfaction.
- Integrate the model into production to inform business decisions, market strategies, and
personalized customer experiences.
- A well-deployed model can increase customer engagement and drive sales, as seen in personalized
product recommendations on websites like Amazon.
Model Theory, Model fit Statistics, Model Construction
Introduction:
Logistic regression
Logistic regression is one of the most popular Machine Learning algorithms, which comes under the
Supervised Learning technique. It is used for predicting the categorical dependent variable using a
given set of independent variables.
 The outcome must be a categorical or discrete value. It can be either Yes or No, 0 or 1, true or False,
etc. but instead of giving the exact value as 0 and 1, it gives the probabilistic values which lie
between 0 and 1.
 In Logistic regression, instead of fitting a regression line, we fit an "S" shaped logistic function,
which predicts two maximum values (0 or 1).
 The curve from the logistic function indicates the likelihood of an event, such as whether or not
cells are abnormal, or whether a mouse is obese based on its weight, etc.
 Logistic regression uses the concept of predictive modelling as regression; therefore, it is called
logistic regression, but is used to classify samples; therefore, it falls under the classification
algorithm.
 In logistic regression, we use the concept of a threshold value, which defines the probability of
either 0 or 1: values above the threshold tend to 1, and values below the threshold tend to 0.
Types of Logistic Regressions
On the basis of the categories, Logistic Regression can be classified into three types:
 Binomial: In binomial Logistic regression, there can be only two possible types of the dependent
variables, such as 0 or 1, Pass or Fail, etc.
 Multinomial: In multinomial Logistic regression, there can be 3 or more possible unordered types
of the dependent variable, such as "cat", "dogs", or "sheep"
 Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered types of dependent
variables, such as "low", "Medium", or "High".
Multi-collinearity:
 Multicollinearity is a statistical phenomenon in which multiple independent variables are highly
correlated with each other.
 Multicollinearity is also called collinearity, and it is an undesired situation for any statistical
regression model since it diminishes the reliability of the model itself.
Assumptions for Logistic Regression:
 The dependent variable must be categorical in nature.
 The independent variable should not have multi-collinearity.
Logistic Regression Equation:
 The Logistic regression equation can be obtained from the Linear Regression equation. The
mathematical steps to get Logistic Regression equations are given below:
 Logistic Regression uses a more complex cost function; the underlying function is the 'Sigmoid
function', also known as the 'logistic function', instead of a linear function.
 The hypothesis of logistic regression limits the output to values between 0 and 1. Linear
functions fail to represent it, since they can produce values greater than 1 or less than 0, which is
not possible under the hypothesis of logistic regression.
0 ≤ hθ(x) ≤ 1 --- Logistic Regression Hypothesis Expectation
Logistic Function (Sigmoid Function):
 The sigmoid function is a mathematical function used to map the predicted values to probabilities.
 The sigmoid function maps any real value into another value within the range 0 to 1, and so
forms an S-shaped curve.
 The value of the logistic regression must be between 0 and 1, which cannot go beyond this limit, so
it forms a curve like the "S" form.
 The below image is showing the logistic function:

The sigmoid output can be interpreted as the probability of belonging to Class 1 (versus Class 0),
so the regression model makes its predictions as:

z = sigmoid(y) = σ(y) = 1 / (1 + e^(−y))

Hypothesis representation:

hθ(x) = σ(θᵀx) = 1 / (1 + e^(−θᵀx))
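A minimal sketch of the sigmoid function in pure Python:

```python
import math

def sigmoid(y):
    """Logistic function: maps any real value into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-y))

print(sigmoid(0))        # 0.5 — exactly on the decision boundary
print(sigmoid(4) > 0.5)  # large positive input -> probability near 1 (class 1)
print(sigmoid(-4) < 0.5) # large negative input -> probability near 0 (class 0)
```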
Confusion Matrix (or) Error Matrix (or) Contingency Table:


What is a Confusion Matrix?
“A Confusion matrix is an N x N matrix used for evaluating the performance of a classification model,
where N is the number of target classes. The matrix compares the actual target values with those
predicted by the machine learning model. This gives us a holistic view of how well our classification
model is performing and what kinds of errors it is making. It is a specific table layout that allows
visualization of the performance of an algorithm, typically a supervised learning one (in unsupervised
learning it is usually called a matching matrix).” For a binary classification problem, we would have a
2 x 2 matrix as shown below with 4 values:

Let’s break the matrix:


 The target variable has two values: Positive or Negative
 The columns represent the actual values of the target variable
 The rows represent the predicted values of the target variable. The four cells are:
i. True Positive
ii. True Negative
iii. False Positive – Type 1 Error
iv. False Negative – Type 2 Error
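A minimal sketch that counts the four cells from illustrative actual and predicted labels (1 = positive class):

```python
# Count the four confusion-matrix cells for a binary problem (1 = positive)
def confusion_counts(actual, predicted):
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)  # Type 1 error
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)  # Type 2 error
    return tp, tn, fp, fn

actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]
print(confusion_counts(actual, predicted))  # (3, 3, 1, 1)
```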
Why do we need a Confusion matrix?
It lets us go beyond accuracy and compute metrics such as:
 Precision vs Recall
 F1-score
Accuracy, Precision, Recall & F1-Score metrics:
Accuracy
Accuracy is the most straightforward performance measure; it is simply the ratio of correctly
predicted observations to the total observations:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
 High accuracy alone does not necessarily mean the model is the best.
 Accuracy is dependable only when you have symmetric datasets where the numbers of false
positives and false negatives are almost the same.

Precision:
Precision is the ratio of correctly predicted positive observations to the total predicted positive
observations:
Precision = TP / (TP + FP)
It tells us how many of the predicted positive cases actually turned out to be positive.
 Precision is a useful metric in cases where a False Positive is a higher concern than a False
Negative.
 Precision is important in music or video recommendation systems, e-commerce websites, etc.,
where wrong results could lead to customer churn and be harmful to the business.

Recall (Sensitivity):
Recall is the ratio of correctly predicted positive observations to all observations in the actual
positive class:
Recall = TP / (TP + FN)
 Recall is a useful metric in cases where a False Negative is more significant than a False Positive.
 Recall is crucial because it measures the model's ability to detect actual positive cases (e.g.,
patients with a disease).
F1-Score:
F1-score is the harmonic mean of Precision and Recall:
F1 = 2 × (Precision × Recall) / (Precision + Recall)
It gives a combined idea about these two metrics. It is maximum when Precision is equal to Recall,
and it takes both false positives and false negatives into account.

 F1 is usually more useful than accuracy, especially if you have an uneven class distribution.
 Accuracy works best if false positives and false negatives have similar cost.
 If the cost of false positives and false negatives are very different, it’s better to look at both
Precision and Recall.
 But there is a catch here: the interpretability of the F1-score is poor; on its own it does not tell us
whether the classifier is maximizing precision or recall. So we use it in combination with other
evaluation metrics, which gives us a complete picture of the result.
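The four metrics above can be computed directly from illustrative confusion-matrix counts:

```python
tp, tn, fp, fn = 40, 45, 5, 10  # illustrative confusion-matrix counts

accuracy  = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall    = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(accuracy, precision, recall, f1)
```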
AUC (Area Under Curve) – ROC (Receiver Operating Characteristics) Curves:
Performance measurement is an essential task in data-model evaluation, and the AUC-ROC curve is
one of the most important evaluation metrics for checking any classification model's performance.
It is also written as AUROC (Area Under the Receiver Operating Characteristics).
What is the AUC-ROC Curve?
The AUC-ROC curve is a performance measurement for classification problems at various threshold
settings. ROC is a probability curve, and AUC represents the degree or measure of separability: it
tells how much the model is capable of distinguishing between classes. The higher the AUC, the
better the model is at predicting 0 classes as 0 and 1 classes as 1. By analogy, the higher the AUC,
the better the model is at distinguishing between patients with the disease and patients without it.
The curve can also be used to check or visualize the performance of multi-class classification
problems.
The ROC curve is plotted with TPR against the FPR, where TPR is on the y-axis and FPR is on the
x-axis.

ROC curve:
An ROC curve (receiver operating characteristic curve) is a graph showing the performance of a
classification model at all classification thresholds. This curve plots two parameters:
 True Positive Rate
 False Positive Rate
True Positive Rate (TPR) is a synonym for recall and is therefore defined as follows:
TPR = TP / (TP + FN)
False Positive Rate (FPR) is defined as follows:
FPR = FP / (FP + TN)

An ROC curve plots TPR vs. FPR at different classification thresholds. Lowering the classification
threshold classifies more items as positive, thus increasing both False Positives and True Positives.
The following figure shows a typical ROC curve.
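A minimal sketch of how one ROC point is obtained per threshold (scores and labels are illustrative); lowering the threshold moves the point up and to the right:

```python
# Sweep the classification threshold and record (FPR, TPR) points
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]  # predicted probabilities
labels = [1,   1,   0,   1,   1,   0,   0,   0]    # actual classes

def roc_point(threshold):
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(p == 1 and a == 1 for p, a in zip(preds, labels))
    fp = sum(p == 1 and a == 0 for p, a in zip(preds, labels))
    fn = sum(p == 0 and a == 1 for p, a in zip(preds, labels))
    tn = sum(p == 0 and a == 0 for p, a in zip(preds, labels))
    return fp / (fp + tn), tp / (tp + fn)  # (FPR, TPR)

for t in [0.85, 0.5, 0.15]:
    print(t, roc_point(t))  # both FPR and TPR grow as the threshold drops
```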

Application Of Modelling in Business:


 Applications of Data Modelling can be termed as Business analytics.
 Business analytics involves the collating, sorting, processing, and studying of
business-related data using statistical models and iterative methodologies. The goal
of BA is to narrow down which datasets are useful and which can increase revenue,
productivity, and efficiency.
 Business analytics (BA) is the combination of skills, technologies, and practices used
to examine an organization's data and performance as a way to gain insights and
make data-driven decisions in the future using statistical analysis.

The following applications are the most common.


1. Credit Card Companies
Credit and debit cards are an everyday part of consumer spending, and they are an ideal
way of gathering information about a purchaser’s spending habits, financial situation,
behaviour trends, demographics, and lifestyle preferences.
2. Customer Relationship Management (CRM)
Excellent customer relations are critical for any company that wants to retain customer loyalty
and stay in business for the long haul. CRM systems analyse important performance indicators
such as demographics, buying patterns, socio-economic information, and lifestyle.
3. Finance
The financial world is a volatile place, and business analytics helps to extract insights that
help organizations guide their way through tricky terrain.
4. Human Resources
Business analysts help Human Resources by analysing data on high-performing candidates.
This data includes educational background, attrition rate, and average employment length.
It helps HR forecast the best candidate-company fits.
5. Manufacturing:
Business analysts optimize operations, reduce costs, and improve efficiency by analysing data on
equipment downtime, inventory levels, and maintenance costs.
6. Marketing:
Business analysts analyse data to inform marketing strategies, optimize campaigns, and improve
customer engagement by examining marketing metrics, consumer behaviour, and market trends.
What is the Hosmer-Lemeshow Test?
The Hosmer-Lemeshow test (HL test) is a goodness of fit test for logistic regression, especially for risk
prediction models. A goodness of fit test tells you how well your data fits the model. Specifically, the
HL test calculates if the observed event rates match the expected event rates in population
subgroups.
The test is only used for binary response variables (a variable with two outcomes like alive or dead,
yes or no).
Running The Test
Data is first regrouped by ordering the predicted probabilities and forming g groups of roughly
equal size. The Hosmer-Lemeshow test statistic is then calculated with the following formula
(shown for the general g-group case):

Χ² = Σ (from j = 1 to g) (Oj − Ej)² / (Ej (1 − Ej / nj))

Where:
 Χ² = chi squared.
 nj = number of observations in the jth group.
 Oj = number of observed cases in the jth group.
 Ej = number of expected cases in the jth group.
 Σ = summation notation, summing over the g groups (commonly g = 10).
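Assuming the groups have already been formed, the statistic Χ² = Σ (Oj − Ej)² / (Ej (1 − Ej/nj)) can be computed as follows (group counts are illustrative):

```python
# Hosmer-Lemeshow statistic for pre-formed groups (illustrative numbers)
nj_list = [30, 30, 30]       # n_j: observations per group
obs = [5, 12, 21]            # O_j: observed events per group
expt = [6.0, 11.0, 20.0]     # E_j: expected events per group

chi_sq = sum((o - e) ** 2 / (e * (1 - e / n))
             for o, e, n in zip(obs, expt, nj_list))
print(chi_sq)  # compare against a chi-squared distribution with g - 2 df
```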
This test is usually run using technology. The output returns a chi-square value (a Hosmer-Lemeshow
chi-squared) and a p-value (e.g. Pr > ChiSq). Small p-values mean that the model is a poor fit.
Like most goodness of fit tests, these small p-values (usually under 5%) mean that your model
is not a good fit. But large p-values don’t necessarily mean that your model is a good fit, just that
there isn’t enough evidence to say it’s a poor fit. Many situations can cause large p-values, including
poor test power. Low power is one of the reasons this test has been highly criticized.

The HL (Hosmer-Lemeshow) test has several issues:


1. Overfitting: It doesn't account for overfitting, which can lead to inaccurate results.
2. Low power: The test often lacks the statistical power to detect significant differences.
3. Subgroup selection: There's little guidance on choosing the number of subgroups (g), which can be
arbitrary.
4. Sensitivity to g: Small changes in g can drastically alter p-values, making results unstable.
5. Arbitrary bin choices: The test requires binning data, but there's no clear guideline on how to do
this.
