DA Unit-3
Regression – Concepts:
Introduction:
The term regression is used to indicate the estimation or prediction of the average value of one
variable for a specified value of another variable.
Regression analysis is a very widely used statistical tool to establish a relationship model between
two variables.
“Regression Analysis is a statistical process for estimating the relationships between the dependent variables (criterion/response variables) and one or more independent variables (predictor variables).”
Regression describes how an independent variable is numerically related to the dependent variable.
Regression can be used for prediction, estimation and hypothesis testing, and modelling causal
relationships.
When is Regression chosen?
A regression problem is when the output variable is a real or continuous value, such as “salary” or
“weight”.
Many different models can be used; the simplest is linear regression, which tries to fit the data with the best hyperplane passing through the points.
Mathematically a linear relationship represents a straight line when plotted as a graph.
A non-linear relationship where the exponent of any variable is not equal to 1 creates a curve.
Types of Regression Analysis Techniques:
1. Linear Regression
2. Logistic Regression
3. Ridge Regression
4. Lasso Regression
5. Polynomial Regression
6. Bayesian Linear Regression
1. Linear Regression
Linear regression is used for predictive analysis. Linear regression is a linear approach for modelling the relationship between the criterion (scalar response) and one or more predictors (explanatory variables). Linear regression focuses on the conditional probability distribution of the response given the values of the predictors. For linear regression, there is a danger of overfitting.
The formula for linear regression is:
Syntax:
y = θx + b
where,
θ – It is the model weights or parameters
b – It is known as the bias.
This is the most basic form of regression analysis and is used to model a linear relationship between
a single dependent variable and one or more independent variables.
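As a minimal illustration (a sketch using scikit-learn's LinearRegression on made-up data; the values are invented for the example):

import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: a single predictor x and a roughly linear response y.
X = np.array([[1], [2], [3], [4], [5]])      # shape (n_samples, n_features)
y = np.array([2.1, 4.0, 6.2, 8.1, 9.9])      # approximately y = 2x

model = LinearRegression().fit(X, y)
print("weight (theta):", model.coef_[0])     # estimated slope
print("bias (b):", model.intercept_)         # estimated intercept
print("prediction for x=6:", model.predict([[6]])[0])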
2. Logistic Regression
Logistic regression is a supervised machine learning algorithm used for classification tasks, where the goal is to predict the probability that an instance belongs to a given class. Logistic regression is a statistical algorithm that analyses the relationship between two data factors. Logistic regression predicts the output of a categorical dependent variable.
Therefore, the outcome must be a categorical or discrete value. It can be Yes or No, 0 or 1, True or False, etc., but instead of giving the exact values 0 and 1, it gives probabilistic values that lie between 0 and 1. In Logistic regression, instead of fitting a regression line, we fit an “S”-shaped logistic function, which predicts two maximum values (0 or 1).
3. Ridge Regression
Ridge regression is a technique for analysing multiple regression data. When multicollinearity occurs, least squares estimates are unbiased, but their variances are large, so they may be far from the true value. Ridge regression is a regularized linear regression model; it tries to reduce model complexity by adding a penalty term to the cost function. A degree of bias is added to the regression estimates, and as a result, ridge regression reduces the standard errors.
Below is a simple demonstration of the Ridge regression approach (a minimal sketch using scikit-learn's Ridge on synthetic data; the alpha value is illustrative).
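from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

# Synthetic regression data with noisy features.
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=42)

# alpha controls the strength of the L2 penalty added to the cost function.
ridge = Ridge(alpha=1.0)
ridge.fit(X, y)

print("R^2 on training data:", ridge.score(X, y))
print("coefficients:", ridge.coef_.round(2))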
4. Lasso Regression
Lasso regression is a regression analysis method that performs both variable selection
and regularization. Lasso regression uses soft thresholding. Lasso regression selects only a subset of
the provided covariates for use in the final model.
This is another regularized linear regression model; it works by adding a penalty term to the cost function, but it tends to zero out some features’ coefficients, which makes it useful for feature selection.
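A comparable sketch for Lasso (again using scikit-learn on synthetic data; the alpha value is illustrative) shows how the L1 penalty drives some coefficients exactly to zero:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic data in which only 3 of the 10 features are actually informative.
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0)   # alpha controls the strength of the L1 penalty
lasso.fit(X, y)

print("coefficients:", lasso.coef_.round(2))
print("features kept:", np.sum(lasso.coef_ != 0), "of", X.shape[1])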
5. Polynomial Regression
This is an extension of linear regression and is used to model a non-linear relationship between the dependent variable and the independent variables. The syntax remains the same, but the input variables now include polynomial (higher-degree) terms of some of the existing features. Linear regression could only fit a linear model to the data at hand, but with polynomial features we can easily fit a non-linear relationship between the target and the input features.
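For instance (a sketch on made-up quadratic data, using scikit-learn's PolynomialFeatures in a pipeline; the degree and data are illustrative):

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Toy data with a clearly non-linear (quadratic) relationship.
X = np.linspace(-3, 3, 50).reshape(-1, 1)
y = 1.5 * X.ravel() ** 2 - 2.0 * X.ravel() + np.random.default_rng(1).normal(0, 1, 50)

# Degree-2 polynomial features turn x into [1, x, x^2], then a linear model is fitted.
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X, y)

print("R^2:", model.score(X, y))
print("prediction at x=2:", model.predict([[2.0]])[0])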
6. Bayesian Linear Regression
As the name suggests, this algorithm is based purely on Bayes’ Theorem. For this reason, we do not use the least squares method to determine the coefficients of the regression model. Instead, the model weights and parameters are found from their posterior distribution, which provides an extra stability factor to a regression model built with this technique.
Advantages & Limitations of Regression:
- Fast and easy to model, and particularly useful when the relationship to be modelled is not extremely complex and you don’t have a lot of data.
- Very intuitive to understand and interpret.
- Linear regression is very sensitive to outliers.
- The smaller the MSE (Mean Squared Error), the better the model’s fit to the data.
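For reference, the MSE mentioned above is the average of the squared differences between the observed values yᵢ and the model’s predictions ŷᵢ:
MSE = (1/n) * Σ (yᵢ − ŷᵢ)², summed over i = 1, …, n.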
Homoscedasticity vs. Heteroscedasticity
Definition:
- Homoscedasticity: the variance of the error term is constant across all levels of the independent variable(s). In other words, the "spread" of the residuals (errors) is the same for all values of the independent variable(s).
- Heteroscedasticity: the variance of the error term differs across the values of the independent variable(s). This means that the error terms exhibit non-constant variance at different levels of the independent variable(s).
Error variance:
- Homoscedasticity: the error term exhibits uniform variance across all values of the independent variable.
- Heteroscedasticity: the error variance changes as the value of the independent variable changes.
Residuals:
- Homoscedasticity: the residuals (errors) are evenly scattered around the regression line.
- Heteroscedasticity: the residuals can become more spread out (or tighter) as the independent variable increases, leading to inconsistent variance.
Key point:
- Homoscedasticity: this is a crucial assumption for Ordinary Least Squares (OLS) regression to provide the best linear unbiased (BLUE) estimators.
- Heteroscedasticity: this violates the assumption of homoscedasticity and can lead to inefficiency in parameter estimation.
Visual representation:
- Homoscedasticity: a scatter plot of residuals should show a consistent spread or "scatter" of points around the horizontal axis (the residual = 0 line) at all levels of the independent variable(s).
- Heteroscedasticity: a scatter plot of residuals will show a non-uniform spread of points; the spread could increase or decrease as the independent variable(s) increase (e.g., fanning out or narrowing down).
Prediction errors (e.g., with years of education as the predictor):
- Homoscedasticity: prediction errors are consistently spread across all levels of education.
- Heteroscedasticity: prediction errors increase as years of education rise.
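As a rough illustration of the distinction (a minimal sketch with simulated data and plain NumPy; the data and variable names are made up), one informal check is to fit a line and compare the residual spread for small versus large values of the predictor:

import numpy as np

rng = np.random.default_rng(0)

# Simulated data: error variance grows with x (heteroscedastic by construction).
x = rng.uniform(0, 10, 500)
y = 2.0 * x + rng.normal(0, 0.5 * x)          # noise scales with x

# Fit a simple least-squares line and compute residuals.
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)

# Informal check: compare residual spread for small vs. large x.
low, high = x < np.median(x), x >= np.median(x)
print(f"residual std (low x):  {residuals[low].std():.3f}")
print(f"residual std (high x): {residuals[high].std():.3f}")
# A much larger spread in one half suggests heteroscedasticity;
# roughly equal spreads are consistent with homoscedasticity.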
Variable Rationalization:
The data set may have a large number of attributes. But some of those attributes can be irrelevant
or redundant. The goal of Variable Rationalization is to improve the Data Processing in an optimal
way through attribute subset selection.
The aim of this process is to find a minimum set of attributes such that dropping the irrelevant attributes does not significantly affect the utility of the data, while reducing the cost of data analysis.
Mining on a reduced data set also makes the discovered pattern easier to understand.
As part of data processing, we use the following methods of attribute subset selection:
1. Stepwise Forward Selection
2. Stepwise Backward Elimination
3. Combination of Forward Selection and Backward Elimination
4. Decision Tree Induction.
All the above methods are greedy approaches for attribute subset selection.
1. Stepwise Forward Selection:
This procedure starts with an empty set of attributes as the minimal set. The most relevant
attributes are chosen (having minimum p-value) and are added to the minimal set. In each
iteration, one attribute is added to a reduced set.
2. Stepwise Backward Elimination:
Here, all the attributes are considered in the initial set. In each iteration, the attribute whose p-value is higher than the significance level is eliminated from the set of attributes.
3. Combination of Forward Selection and Backward Elimination:
The stepwise forward selection and backward elimination methods are combined so as to select the relevant attributes most efficiently. This is the most common technique used for attribute selection. (A scikit-learn sketch of forward selection and backward elimination follows this list.)
4. Decision Tree Induction:
This approach uses a decision tree for attribute selection. It constructs a flowchart-like structure in which internal nodes denote tests on attributes, each branch corresponds to the outcome of a test, and each leaf node denotes a class prediction. Any attribute that is not part of the tree is considered irrelevant and hence discarded.
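As an illustration of the first two approaches (a sketch using scikit-learn's SequentialFeatureSelector on synthetic data; the parameter values are only examples). Note that scikit-learn's selector greedily adds or removes features based on cross-validated model score rather than p-values, but the forward/backward logic is the same.

from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Synthetic data: 10 features, only 4 of which carry signal.
X, y = make_classification(n_samples=300, n_features=10, n_informative=4,
                           random_state=7)

estimator = LogisticRegression(max_iter=1000)

# Stepwise forward selection: start empty, greedily add features.
forward = SequentialFeatureSelector(estimator, n_features_to_select=4,
                                    direction="forward").fit(X, y)
print("forward selection kept:", forward.get_support(indices=True))

# Stepwise backward elimination: start with all features, greedily remove.
backward = SequentialFeatureSelector(estimator, n_features_to_select=4,
                                     direction="backward").fit(X, y)
print("backward elimination kept:", backward.get_support(indices=True))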
Model Building
Model Building Life Cycle in Data Analytics: When we come across a business analytical problem, we often proceed towards execution without acknowledging the hurdles (challenges or obstacles), and we try to implement and predict the outcomes before realizing the misfortunes. The data science model-building life cycle lays out the problem-solving steps involved; let’s understand every model-building step in depth. The following are the steps to follow to build a data model:
1. Problem Definition
2. Hypothesis Generation
3. Data Collection
4. Data Exploration/Transformation
5. Predictive Modelling
6. Model Deployment
1. Problem Definition:
The first step in constructing a model is to understand the industrial problem in a more
comprehensive way. To identify the purpose of the problem and the prediction target, we must
define the project objectives appropriately.
Therefore, to proceed with an analytical approach, we have to recognize the obstacles first.
Remember, excellent results always depend on a better understanding of the problem.
2. Hypothesis Generation
Hypothesis generation is the guessing approach through which we derive some essential data
parameters that have a significant correlation with the prediction target.
Your hypothesis research must be in-depth, taking the perspective of every stakeholder into account. We search for every suitable factor that can influence the outcome.
Hypothesis generation focuses on what you can create rather than what is available in the dataset.
3. Data Collection
Data collection is the gathering of data from relevant sources regarding the analytical problem; we then extract meaningful insights from the data for prediction.
The data gathered must have:
- Proficiency in answering the hypothesis questions.
- Capacity to elaborate on every data parameter.
- Effectiveness to justify your research.
- Competency to predict outcomes accurately.
4. Data Exploration/Transformation
The data you collected may be in unfamiliar shapes and sizes. It may contain unnecessary features,
null values, unanticipated small values, or immense values. So, before applying any algorithmic
model to data, we have to explore it first.
By inspecting the data, we get to understand the explicit and hidden trends in data. We find the
relation between data features and the target variable.
Usually, a data scientist invests 60–70% of the project time in data exploration alone.
There are several sub-steps involved in data exploration:
i. Feature Identification:
You need to analyse which data features are available and which ones are not. Identify
independent and target variables. Identify data types and categories of these variables.
ii. Univariate Analysis:
- Examine each variable individually.
- For continuous variables: Analyse mean, median, standard deviation, and skewness.
- For categorical variables: Use frequency tables to understand data distribution and measure
counts and frequency.
iii. Multi-variate Analysis:
- Bi-variate (and multi-variate) analysis helps to discover the relation between two or more variables.
- For continuous variables, we can examine the correlation; for categorical variables, we can study the association using frequency (two-way) tables.
iv. Filling Null Values:
- Replace null values in continuous variables with mean or mode.
- Replace null values in categorical variables with the most frequent value.
- Avoid deleting rows with null values to preserve information.
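A minimal pandas sketch of this null-filling step (the DataFrame and column names here are invented for illustration):

import numpy as np
import pandas as pd

# Hypothetical dataset with missing values in both column types.
df = pd.DataFrame({
    "age": [25, 32, np.nan, 41, 29],                   # continuous
    "city": ["Hyd", "Delhi", np.nan, "Hyd", "Hyd"],    # categorical
})

# Continuous variable: fill nulls with the mean (median is a common alternative).
df["age"] = df["age"].fillna(df["age"].mean())

# Categorical variable: fill nulls with the most frequent value (mode).
df["city"] = df["city"].fillna(df["city"].mode()[0])

print(df)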
5. Predictive Modelling:
Predictive modelling is a mathematical approach to create a statistical model to forecast future
behaviour based on input test data. Steps involved in predictive modelling:
Algorithm Selection:
- For structured data and predicting continuous/categorical outcomes, use supervised learning
(regression, classification).
- For unstructured data and clustering, use unsupervised learning.
- Apply multiple algorithms to achieve a more accurate model.
Train Model:
After choosing the algorithm and getting the data ready, we train our model on the input data using the preferred algorithm. Training determines the correspondence between the independent variables and the prediction targets.
Model Prediction:
- Use the trained model to make predictions on input test data.
- Evaluate accuracy using cross-validation or ROC curve analysis.
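Putting the modelling steps together (a sketch on synthetic data with scikit-learn; the algorithm and parameter choices are illustrative):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic classification data standing in for the prepared dataset.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=0)

# Algorithm selection + training (supervised learning on structured data).
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Model prediction and evaluation.
y_pred = model.predict(X_test)
print("test accuracy:", accuracy_score(y_test, y_pred))
print("5-fold CV accuracy:", cross_val_score(model, X, y, cv=5).mean())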
6. Model Deployment:
- Deploying a model in a real-time environment provides valuable analytical insights.
- Continuously update the model with new features to improve customer satisfaction.
- Integrate the model into production to inform business decisions, market strategies, and
personalized customer experiences.
- A well-deployed model can increase customer engagement and drive sales, as seen in personalized
product recommendations on websites like Amazon.
Model Theory, Model Fit Statistics, Model Construction
Introduction:
Logistic regression
Logistic regression is one of the most popular Machine Learning algorithms, which comes under the
Supervised Learning technique. It is used for predicting the categorical dependent variable using a
given set of independent variables.
The outcome must be a categorical or discrete value. It can be Yes or No, 0 or 1, True or False, etc., but instead of giving the exact values 0 and 1, it gives probabilistic values which lie between 0 and 1.
In Logistic regression, instead of fitting a regression line, we fit an "S" shaped logistic function,
which predicts two maximum values (0 or 1).
The curve from the logistic function indicates the likelihood of something, such as whether or not a cell is abnormal, or whether a mouse is obese based on its weight.
Logistic regression uses the concept of predictive modelling like regression (hence the name logistic regression), but it is used to classify samples and therefore falls under classification algorithms.
In logistic regression, we use the concept of a threshold value, which separates the predictions into 0 or 1: values above the threshold tend to 1, and values below the threshold tend to 0.
Types of Logistic Regression
On the basis of the categories, Logistic Regression can be classified into three types:
Binomial: In binomial Logistic regression, there can be only two possible types of the dependent variable, such as 0 or 1, Pass or Fail, etc.
Multinomial: In multinomial Logistic regression, there can be 3 or more possible unordered types of the dependent variable, such as "cat", "dog", or "sheep".
Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered types of the dependent variable, such as "Low", "Medium", or "High".
Multi-collinearity:
Multicollinearity is a statistical phenomenon in which multiple independent variables show high correlation with each other, i.e., they are highly inter-related.
Multicollinearity, also called collinearity, is an undesirable situation for any statistical regression model, since it diminishes the reliability of the model itself.
Assumptions for Logistic Regression:
The dependent variable must be categorical in nature.
The independent variable should not have multi-collinearity.
Logistic Regression Equation:
The Logistic regression equation can be obtained from the Linear Regression equation. The
mathematical steps to get Logistic Regression equations are given below:
Instead of a linear function, Logistic Regression uses the ‘Sigmoid function’ (also known as the ‘logistic function’) as its hypothesis, together with a more complex cost function.
The hypothesis of logistic regression limits its output to values between 0 and 1. Linear functions fail to represent it, as they can take values greater than 1 or less than 0, which is not possible under the hypothesis of logistic regression.
0 ≤ h_θ(x) ≤ 1 --- Logistic Regression Hypothesis Expectation
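A sketch of the usual derivation from the linear regression equation, in standard notation (the coefficient symbols b0 … bn are generic):
- Start from the linear regression equation: y = b0 + b1·x1 + … + bn·xn.
- Since a probability p must lie between 0 and 1, model the odds p / (1 − p) instead of p directly.
- Taking the logarithm of the odds gives the logit form: log[p / (1 − p)] = b0 + b1·x1 + … + bn·xn.
- Solving for p yields the sigmoid form: p = 1 / (1 + e^−(b0 + b1·x1 + … + bn·xn)).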
Logistic Function (Sigmoid Function):
The sigmoid function is a mathematical function used to map the predicted values to probabilities.
The sigmoid function maps any real value into another value within the range of 0 to 1, and so forms an S-shaped curve.
The output of logistic regression must be between 0 and 1 and cannot go beyond this limit, so it forms a curve like the "S" form.
(Figure: the S-shaped logistic (sigmoid) curve.)
The sigmoid output can be interpreted as the probability of belonging to Class 1 versus Class 0, so the regression model predicts Class 1 when that probability is at or above the chosen threshold and Class 0 otherwise.
Hypothesis representation:
h_θ(x) = σ(θᵀx) = 1 / (1 + e^(−θᵀx))
Precision:
Precision is the ratio of correctly predicted positive observations to the total predicted positive
observations.
It tells us, of all the cases predicted as positive, how many actually turned out to be positive.
Precision is a useful metric in cases where a False Positive is a higher concern than a False Negative.
Precision is important in music or video recommendation systems, e-commerce websites, etc., where wrong results could lead to customer churn and be harmful to the business.
Recall: (Sensitivity)
Recall is the ratio of correctly predicted positive observations to all the observations in the actual positive class.
Recall is a useful metric in cases where False Negative is more significant than the False Positive.
Recall is crucial because it measures the model's ability to detect actual positive cases (e.g.,
patients with a disease).
F1-Score:
F1-score is a harmonic mean of Precision and Recall. It gives a combined idea about these two
metrics. It is maximum when Precision is equal to Recall. Therefore, this score takes both false
positives and false negatives into account.
F1 is usually more useful than accuracy, especially if you have an uneven class distribution.
Accuracy works best if false positives and false negatives have similar cost.
If the cost of false positives and false negatives are very different, it’s better to look at both
Precision and Recall.
But there is a catch here: the interpretability of the F1-score is poor, meaning that we don’t know whether our classifier is maximizing precision or recall. So, we use it in combination with other evaluation metrics, which gives us a complete picture of the result.
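These metrics can be computed directly (a sketch with scikit-learn, using made-up true and predicted labels):

from sklearn.metrics import (precision_score, recall_score, f1_score,
                             confusion_matrix)

# Hypothetical true labels and model predictions for a binary problem.
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP, FP, FN, TN:", tp, fp, fn, tn)

print("Precision = TP / (TP + FP):", precision_score(y_true, y_pred))
print("Recall    = TP / (TP + FN):", recall_score(y_true, y_pred))
print("F1 = 2*P*R / (P + R):      ", f1_score(y_true, y_pred))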
AUC (Area Under Curve) - ROC (Receiver Operating Characteristics) Curves:
Performance measurement is an essential task in data modelling evaluation, and the AUC - ROC curve is one of the most important evaluation metrics for checking any classification model’s performance. It is also written as AUROC (Area Under the Receiver Operating Characteristics). So when it comes to a classification problem, we can count on an AUC - ROC curve. When we need to check or visualize the performance of a multi-class classification problem, we use the AUC (Area Under the Curve) ROC (Receiver Operating Characteristics) curve.
What is the AUC - ROC Curve?
The AUC - ROC curve is a performance measurement for classification problems at various threshold settings. ROC is a probability curve, and AUC represents the degree or measure of separability. It tells how much the model is capable of distinguishing between classes. The higher the AUC, the better the model is at predicting 0 classes as 0 and 1 classes as 1. By analogy, the higher the AUC, the better the model is at distinguishing between patients with the disease and patients without the disease.
The ROC curve is plotted with TPR against the FPR where TPR is on the y-axis and FPR is on the x-axis.
ROC curve: An ROC curve (receiver operating characteristic curve) is a graph showing the performance of a classification model at all classification thresholds. This curve plots two parameters: the True Positive Rate and the False Positive Rate.
True Positive Rate (TPR) is a synonym for recall and is therefore defined as TPR = TP / (TP + FN).
False Positive Rate (FPR) is defined as FPR = FP / (FP + TN).
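A short sketch of computing the TPR/FPR points and the AUC with scikit-learn (synthetic data; the classifier choice is illustrative):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score

X, y = make_classification(n_samples=400, n_features=6, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_score = model.predict_proba(X_test)[:, 1]   # predicted probability of class 1

fpr, tpr, thresholds = roc_curve(y_test, y_score)   # points of the ROC curve
print("AUC:", roc_auc_score(y_test, y_score))
# Plotting tpr against fpr (e.g. with matplotlib) gives the ROC curve.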
Hosmer-Lemeshow Goodness of Fit Test:
The Hosmer-Lemeshow test assesses model fit for logistic regression by grouping observations (typically into 10 groups by predicted probability) and comparing observed and expected counts of events:
Χ² = Σ (O_j − E_j)² / [E_j (1 − E_j / n_j)]
Where:
Χ² = the Hosmer-Lemeshow chi-squared statistic.
n_j = number of observations in the jth group.
O_j = number of observed cases in the jth group.
E_j = number of expected cases in the jth group.
Σ = summation notation; for the above formula, we’re summing from j = 1 to 10. Modify the summation for your number of groups.
This test is usually run using technology. The output returns a chi-square value (a Hosmer-Lemeshow
chi-squared) and a p-value (e.g. Pr > ChiSq). Small p-values mean that the model is a poor fit.
Like most goodness of fit tests, these small p-values (usually under 5%) mean that your model
is not a good fit. But large p-values don’t necessarily mean that your model is a good fit, just that
there isn’t enough evidence to say it’s a poor fit. Many situations can cause large p-values, including
poor test power. Low power is one of the reasons this test has been highly criticized.
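A rough numerical sketch of the statistic above (plain NumPy/SciPy, splitting observations into g groups by predicted probability; this is an illustrative implementation, not a replacement for the packaged versions in statistical software):

import numpy as np
from scipy.stats import chi2

def hosmer_lemeshow(y_true, y_prob, g=10):
    """Hosmer-Lemeshow goodness-of-fit statistic and p-value (sketch)."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)

    # Sort observations by predicted probability and split into g groups.
    order = np.argsort(y_prob)
    stat = 0.0
    for idx in np.array_split(order, g):
        n_j = len(idx)                # observations in group j
        if n_j == 0:
            continue
        o_j = y_true[idx].sum()       # observed events in group j
        e_j = y_prob[idx].sum()       # expected events in group j
        denom = e_j * (1.0 - e_j / n_j)
        if denom > 0:
            stat += (o_j - e_j) ** 2 / denom
    # Compare against a chi-squared distribution with g - 2 degrees of freedom.
    p_value = chi2.sf(stat, df=g - 2)
    return stat, p_value

# Example: probabilities from any fitted classifier could be passed in here.
rng = np.random.default_rng(0)
p = rng.uniform(0.05, 0.95, 500)
y = rng.binomial(1, p)                # outcomes generated from the same probabilities
print(hosmer_lemeshow(y, p))          # a small p-value would suggest a poor fit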