M2 : DAV (Data Analytics & Visualization)
ADC 601 : Data Analytics & Visualization Department of Artificial Intelligence & Data Science
Topics to be covered
ADC601 : Data Analytics & Visualization Department of Artificial Intelligence & Data Science Courtesy : Stat 501
Regression examples
1. Deterministic / functional
- The equation exactly describes the relationship between the two variables.
- Eg : converting temperature from Celsius to Fahrenheit: Fahr = (9/5)·Cels + 32
Regression - Types of relationships
2. Statistical
- Relationship between the variables is not perfect.
- Eg : The response variable y is the mortality due to skin cancer (number of
deaths per 10 million people) and the predictor variable x is the latitude
(degrees North) at the center of each of 48 states in the United States (U.S. Skin
Cancer data)
- Height and weight
- Alcohol consumed and blood alcohol content
- Vital lung capacity and pack-years of smoking
- Driving speed and gas mileage
Correlation and Regression
● Correlation: is there a relationship between 2 variables?
● Regression: how well does an independent variable predict the dependent
variable?
● CORRELATION != CAUSATION
● In order to infer causality: manipulate independent variable and observe
effect on dependent variable
● Correlation tells you if there is an association between x and y but it
doesn’t describe the relationship or allow you to predict one variable
from the other.
Correlation : Methods for studying Correlation
1. Scatter Diagrams
2. Karl Pearson's Coefficient of Correlation (Covariance Method)
3. Two way Frequency Table (Bivariate Correlation Method)
4. Rank Correlation Method
5. Concurrent Deviation Method
Correlation - Types of Correlation
Scatter Diagrams
Karl Pearson's Coefficient of Correlation (Covariance Method)
ADC601 : Data Analytics & Visualization Department of Artificial Intelligence & Data Science Courtesy : Wikipedia
X 39 65 62 90 82 75 25 98 36 78
Y 47 53 58 86 62 68 60 91 51 84
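As a check on the calculation for this data set, Pearson's r can be computed directly; a plain-Python sketch (pearson_r is our own helper, not part of any course library):

```python
# Karl Pearson's coefficient for the X/Y data above (covariance method).
import math

X = [39, 65, 62, 90, 82, 75, 25, 98, 36, 78]
Y = [47, 53, 58, 86, 62, 68, 60, 91, 51, 84]

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n                          # means: 65 and 66 here
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))     # co-deviation
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

r = pearson_r(X, Y)
print(round(r, 4))  # → 0.7804, a fairly high positive correlation
```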
Properties of Karl Pearson's Coefficient of Correlation
1. The correlation coefficient between X and Y is same as the correlation coefficient between Y and X
(i.e, rxy = ryx ).
2. The correlation coefficient is free from the units of measurement of X and Y.
3. The correlation coefficient is unaffected by change of scale and origin.
Interpretation of Pearson’s Correlation coefficient
The correlation coefficient lies between -1 and +1. i.e. -1 ≤ r ≤ 1
● A positive value of ‘r’ indicates positive correlation.
● A negative value of ‘r’ indicates negative correlation
● If r = +1, then the correlation is perfect positive
● If r = –1, then the correlation is perfect negative.
● If r = 0, then the variables are uncorrelated.
● If |r| ≥ 0.7, the correlation is of a high degree; in interpretation we use the adjective ‘highly’
● If X and Y are independent, then rxy = 0. However the converse need not be true
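Properties 1–3 above can be verified numerically; a plain-Python sketch with arbitrary illustration data (pearson_r is our own helper):

```python
# Numerical check of the symmetry and scale/origin-invariance properties.
import math

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

x = [1.0, 2.0, 4.0, 7.0, 11.0]
y = [2.0, 3.0, 3.5, 8.0, 9.5]
r = pearson_r(x, y)

# Property 1: symmetry, r_xy = r_yx
assert abs(pearson_r(y, x) - r) < 1e-12

# Properties 2 & 3: unaffected by change of origin and (positive) scale.
# x -> 10x + 5 is, e.g., a change of units plus a shift; note a NEGATIVE
# scale factor would flip the sign of r.
x_rescaled = [10 * a + 5 for a in x]
assert abs(pearson_r(x_rescaled, y) - r) < 1e-12
```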
ADC601 : Data Analytics & Visualization Department of Artificial Intelligence & Data Science Courtesy : Brain Kart
Limitations of Karl Pearson's Coefficient of Correlation
1. Outliers (extreme observations) strongly influence the correlation coefficient. If we see outliers in our data,
we should be careful about the conclusions we draw from the value of r. Outliers may be dropped before the
calculation to obtain a meaningful conclusion.
2. Correlation does not imply a causal relationship, i.e., that a change in one variable causes a change in the other.
Types of Regression Models
Types of Regression Models - based on # independent variables
2. Multiple Linear Regression:
● Involves one dependent variable and more than one independent variable.
● Models complex relationships by considering multiple factors simultaneously.
● Eg : predicting house prices based on features like square footage, number of bedrooms, and
location.
Types of Regression Models - based on # independent variables
3. Polynomial Regression:
Types of Regression Models - based on # independent variables
5. Logistic Regression
1. Linear Regression
2. Polynomial Regression
3. Exponential Regression
4. Logarithmic Regression
5. Power Regression (Power Law)
6. Sigmoidal (Logistic) Regression
7. Piecewise Regression
8. Quantile Regression
Types of Regression Models - based on shape of Regression Line
1. Linear Regression:
2. Polynomial Regression:
3. Exponential Regression:
4. Logarithmic Regression:
Types of Regression Models - based on shape of Regression Line
7. Piecewise Regression:
○ Regression line : multiple linear segments / piecewise continuous curves (Spline Regression)
○ Involves fitting multiple linear or nonlinear regression models to different segments of the data.
○ Useful when the relationship between variables changes at certain points or intervals.
8. Quantile Regression:
● Regression line: different for different quantiles.
● Used when the variability of the residuals is not constant across values of the independent variable.
● Estimates different quantiles of the dependent variable rather than only the mean.
Types of Regression Models - based on type of dependent variable
1. Linear Regression
2. Logistic Regression
3. Multinomial Logistic Regression
4. Ordinal Regression
5. Poisson Regression
6. Negative Binomial Regression
7. Survival Analysis (Cox Proportional-Hazards Model)
8. Robust Regression
9. Quantile Regression
10. Ridge Regression and Lasso Regression
Types of Regression Models - based on type of dependent variable
1. Linear Regression:
○ Type of Dependent Variable: Continuous
○ Predicting a continuous outcome variable,
○ Eg: predicting sales, temperature, or height.
2. Logistic Regression:
○ Type of Dependent Variable: Binary (0 or 1)
○ Used for binary classification problems,
○ Eg: predicting whether a customer will buy a product (1) or not (0).
3. Multinomial Logistic Regression:
○ Type of Dependent Variable: Categorical with more than two categories
○ Suitable when the dependent variable has more than two unordered categories,
○ Eg: predicting the type of fruit (apple, orange, banana).
Types of Regression Models - based on type of dependent variable
4. Ordinal Regression / Ordered Logistic Regression:
○ Type of Dependent Variable: Ordered categorical
○ Used when the dependent variable has ordered categories, but the intervals between them are not
assumed to be equal.
○ Eg: predicting the satisfaction level (low, medium, high).
5. Poisson Regression:
○ Type of Dependent Variable: Count data (non-negative integers)
○ Appropriate for modeling count data,
○ Eg: predict the number of customer arrivals, phone calls, or accidents in a day.
6. Negative Binomial Regression:
○ Type of Dependent Variable: Count data with overdispersion
○ Used when the count data exhibit more variability than expected in a Poisson regression, often due to
unobserved heterogeneity.
Types of Regression Models - based on type of dependent variable
7. Survival Analysis (Cox Proportional-Hazards Model):
○ Type of Dependent Variable: Time until an event occurs
○ Applied in medical research, economics, and other fields
○ to model the time until an event (e.g., death, failure, or relapse) occurs.
8. Robust Regression:
○ Type of Dependent Variable: Continuous, resistant to outliers
○ when there are outliers in the data, and traditional linear regression may be sensitive to them.
9. Quantile Regression:
○ Type of Dependent Variable: Conditional quantiles
○ Examining the effect of predictors on different quantiles of the dependent variable,
○ Provides insights into distributional changes.
10. Ridge Regression and Lasso Regression:
○ Type of Dependent Variable: Continuous, with potential multicollinearity
○ Used when dealing with multicollinearity in linear regression, and to prevent overfitting.
Simple Linear Regression - What is the “Best Fitting Line ?”
● A line that fits the data "best" is one for which the n prediction errors (residuals) are as small as possible in some overall sense.
● One such line is the one that meets the "least squares criterion": it minimizes the sum of the squared prediction errors.
Trend is positive
Trend is negative
Simple Linear Regression
ADC601 : Data Analytics & Visualization Department of Artificial Intelligence & Data Science Courtesy : Numeracy
Simple Linear Regression : Class Assignment - 1
Consider the example below where the mass, y (grams), of a chemical is related to the time,
x (seconds), for which the chemical reaction has been taking place according to the table.
Find the equation of the regression line.
Time x (seconds)   Mass y (grams)
5                  40
7                  120
12                 180
16                 210
20                 240
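The least-squares estimates for this table can be computed directly; a plain-Python sketch using the standard formulas b1 = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and b0 = ȳ − b1·x̄:

```python
# Least-squares line for Class Assignment 1 (time vs. mass).
x = [5, 7, 12, 16, 20]        # time (seconds)
y = [40, 120, 180, 210, 240]  # mass (grams)

n = len(x)
mx, my = sum(x) / n, sum(y) / n
b1 = (sum((a - mx) * (b - my) for a, b in zip(x, y))
      / sum((a - mx) ** 2 for a in x))
b0 = my - b1 * mx
print(f"y = {b0:.2f} + {b1:.2f} x")  # → y = 11.51 + 12.21 x
```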
Simple Linear Regression : Class Assignment - 1 Solution
Simple Linear Regression : Class Assignment - 2
Simple Linear Regression : Class Assignment - 2 Solution
Simple Linear Regression : Class Assignment - 3
ADC601 : Data Analytics & Visualization Department of Artificial Intelligence & Data Science Courtesy : Wall Street Mojo
Simple Linear Regression : Class Assignment - 3 Solution
Simple Linear Regression - What do b0 & b1 estimate?
Simple Linear Regression - 4 Conditions
Simple Linear Regression - What is The Common Error Variance?
Simple Linear Regression - Why should we care about σ² ?
● Here, the brand B thermometer yields more precise future predictions than the brand A thermometer.
Simple Linear Regression - How to estimate σ² ?
● Sample variance :
○ S² = Σ(xᵢ − x̄)² / (n − 1)
○ It is really close to being an average of the squared deviations.
○ Here the population mean µ is unknown, so we estimate it with the sample mean x̄.
○ That is why we divide by n − 1, and not n.
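The n − 1 point can be checked against the standard library: statistics.variance divides by n − 1, while statistics.pvariance divides by n (illustrative data):

```python
# Sample vs. population variance: dividing by n-1 vs. n (stdlib only).
import statistics

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(data)
mean = sum(data) / n
sse = sum((d - mean) ** 2 for d in data)   # sum of squared deviations

# statistics.variance divides by n-1 (sample), pvariance by n (population)
assert abs(statistics.variance(data) - sse / (n - 1)) < 1e-12
assert abs(statistics.pvariance(data) - sse / n) < 1e-12
```

In simple linear regression the analogous estimate of σ² divides the sum of squared residuals by n − 2, since two parameters (b0 and b1) are estimated from the data.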
Simple Linear Regression - Coefficient of Determination, r²
Characteristics of r²
Simple Linear Regression - Caution
1. The coefficient of determination r² and the correlation coefficient r quantify the strength of a linear
relationship. It is possible that r² = 0% and r = 0, suggesting there is no linear relation between x and y, and
yet a perfect curved (or "curvilinear") relationship exists.
2. A large r² value should not be interpreted as meaning that the estimated regression line fits the data
well. Another function might better describe the trend in the data.
Simple Linear Regression - Caution
3. The coefficient of determination r² and the correlation coefficient r can both be greatly affected by
just one data point (or a few data points).
Eg: the relationship between the number of deaths in an earthquake and its magnitude is examined.
- Data on n = 6 earthquakes were recorded, and the fitted line plot suggested that as the magnitude of
an earthquake increases, the number of deaths increases.
- Removing one unusual data point changes:
- the slope of the line from +179.5 to −87.1
- the correlation r from +0.732 to −0.960
- r² from 53.5% to 92.1%
Simple Linear Regression - Caution
4. Correlation (or association) does not imply causation.
5. Ecological correlations — correlations that are based on rates or averages — tend to overstate the
strength of an association.
6. A "statistically significant" slope b₁ does not imply that the slope is meaningfully different from 0.
7. A large r² value does not necessarily mean that a useful prediction of the response, or estimation of
the mean response, can be made; the resulting prediction and confidence intervals may still be too wide to be useful.
Simple Linear Regression - Pearson Correlation Coefficient
Hypothesis Test for the Population Correlation Coefficient
● Used to test for a linear association between two variables when it isn't obvious
which variable should be regarded as the response.
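Assuming the usual test statistic for H0: ρ = 0, t = r·√(n−2) / √(1−r²) with n − 2 degrees of freedom, the computation is a one-liner (numbers below are illustrative):

```python
# Test statistic for H0: rho = 0 (standard form, n-2 degrees of freedom).
import math

def t_stat(r, n):
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

# e.g. r = 0.78 from n = 10 pairs:
t = t_stat(0.78, 10)
print(round(t, 3))  # compare against the t-critical value with 8 df
```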
Steps for Hypothesis Testing for the Population Correlation Coefficient ρ
Multiple Linear Regression
Multiple Linear Regression - Interpretation
Multiple Linear Regression - Bivariate Model
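The bivariate-model slides above lost their figures; as a plain-Python sketch (helper names are our own, not from the slides), the model y = b0 + b1·x1 + b2·x2 can be fitted by solving the normal equations (XᵀX)b = Xᵀy:

```python
# Bivariate multiple linear regression via the normal equations,
# with a tiny Gauss-Jordan solver. Data are illustrative.

def solve3(A, v):
    """Solve a 3x3 linear system by Gauss-Jordan elimination with pivoting."""
    M = [row[:] + [v[i]] for i, row in enumerate(A)]
    for c in range(3):
        p = max(range(c, 3), key=lambda r: abs(M[r][c]))  # partial pivot
        M[c], M[p] = M[p], M[c]
        for r in range(3):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * b for a, b in zip(M[r], M[c])]
    return [M[i][3] / M[i][i] for i in range(3)]

def fit_bivariate(x1, x2, y):
    X = [[1.0, a, b] for a, b in zip(x1, x2)]             # design matrix
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(3)] for i in range(3)]
    Xty = [sum(r[i] * t for r, t in zip(X, y)) for i in range(3)]
    return solve3(XtX, Xty)

# Data generated from y = 1 + 2*x1 + 3*x2, so OLS should recover (1, 2, 3)
x1 = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [1.0, 0.0, 2.0, 1.0, 3.0, 2.0]
y = [1 + 2 * a + 3 * b for a, b in zip(x1, x2)]
b0, b1, b2 = fit_bivariate(x1, x2, y)
print(round(b0, 6), round(b1, 6), round(b2, 6))  # → 1.0 2.0 3.0
```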
Multiple Linear Regression - Class Assignment : 1
Multiple Linear Regression - Class Assignment : 2
Polynomial Regression
● Add some polynomial terms to linear regression to convert it into Polynomial regression.
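The point above can be made concrete: polynomial regression is ordinary linear regression applied to expanded features x → (1, x, x², …, x^d). A minimal sketch of the feature expansion (poly_features is a hypothetical helper name):

```python
# Polynomial regression as linear regression on expanded features.
def poly_features(xs, degree):
    """Map each x to the design-matrix row (1, x, x^2, ..., x^degree)."""
    return [[x ** k for k in range(degree + 1)] for x in xs]

rows = poly_features([1.0, 2.0, 3.0], 2)
print(rows)  # → [[1.0, 1.0, 1.0], [1.0, 2.0, 4.0], [1.0, 3.0, 9.0]]
```

The expanded rows are then fitted exactly as in multiple linear regression; the model stays linear in its coefficients even though the fitted curve is not a straight line.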
ADC601 : Data Analytics & Visualization Department of Artificial Intelligence & Data Science Courtesy : Analytics Vidya
Polynomial Regression - Evaluation Metrics
● provide information about how well the model fits the data
● used to compare different models or to select the best model for a given problem.
● Some commonly used evaluation metrics for polynomial regression include:
1. Mean Squared Error (MSE):
a. measures the average squared difference between the predicted and actual values.
b. Calculated as the sum of the squared differences divided by the number of observations.
c. The lower the MSE, the better the model performance.
2. Root Mean Squared Error (RMSE):
a. square root of the MSE and provides a measure of the average deviation of the predictions from the actual
values.
b. The lower the RMSE, the better the model performance.
3. R-squared (R2) Score:
● measures the proportion of the variance in the dependent variable that is explained by the independent
variable(s) in the model.
● It ranges from 0 to 1, with higher values indicating better model performance.
4. Adjusted R-squared Score:
● similar to the R-squared score
● takes into account the number of independent variables in the model.
● It is adjusted for degrees of freedom and penalizes the model for including unnecessary independent
variables.
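The four metrics above can be computed directly; a plain-Python sketch with illustrative numbers (here p denotes the number of independent variables):

```python
# MSE, RMSE, R^2 and adjusted R^2 for a set of predictions.
import math

y_true = [3.0, 5.0, 7.0, 9.0, 11.0]
y_pred = [2.8, 5.1, 7.3, 8.9, 11.2]
n, p = len(y_true), 1                      # p = number of predictors

ss_res = sum((t - q) ** 2 for t, q in zip(y_true, y_pred))
mse = ss_res / n                           # average squared error
rmse = math.sqrt(mse)                      # same units as y

mean_y = sum(y_true) / n
ss_tot = sum((t - mean_y) ** 2 for t in y_true)
r2 = 1 - ss_res / ss_tot                   # proportion of variance explained
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)  # penalizes extra predictors

print(round(mse, 4), round(rmse, 4), round(r2, 4), round(adj_r2, 4))
```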
Polynomial Regression - Quadratic Model
Polynomial Regression - Solved Problem
Weighted Least Square (WLS) Regression
● One of the common assumptions underlying most process modeling methods, including linear and nonlinear
least squares regression, is that:
○ each data point provides equally precise information about the deterministic part of the total process
variation;
○ i.e., the standard deviation of the error term is constant over all values of the predictor
or explanatory variables.
● This assumption clearly does not hold, even approximately, in every modeling application.
● In a weighted fit,
○ less weight is given to the less precise measurements
○ more weight to more precise measurements when estimating the unknown parameters in the model.
● Using weights that are inversely proportional to the variance at each level of the explanatory variables
yields the most precise parameter estimates possible.
● WLS minimizes ∑ wᵢ (yᵢ − ŷᵢ)², where:
○ wᵢ = weighting factor for the ith calibration standard
■ (wᵢ = 1 for unweighted least squares regression)
○ yᵢ = observed instrument response for the ith calibration standard
○ ŷᵢ = predicted (or calculated) response for the ith standard
○ ∑ = the sum over all individual values
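The criterion above has a closed-form solution for a straight line; a plain-Python sketch (wls_line and the weights are illustrative, not from the slides — with all weights equal to 1 it reduces to ordinary least squares):

```python
# Weighted least squares for a straight line: minimize sum w_i (y_i - yhat_i)^2.
def wls_line(x, y, w):
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw   # weighted means
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    b1 = (sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
          / sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x)))
    return my - b1 * mx, b1                          # (intercept, slope)

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.0, 9.8]
# Weights inversely proportional to each point's (assumed) error variance:
w = [4.0, 4.0, 1.0, 1.0, 0.25]
b0, b1 = wls_line(x, y, w)
print(round(b0, 3), round(b1, 3))
```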
Weighted Least Square (WLS) Regression - Benefits
● Weighted least squares is an efficient method that makes good use of small data sets.
● It also shares the ability to provide different types of easily interpretable statistical
intervals for estimation, prediction, calibration and optimization.
● The main advantage that weighted least squares enjoys over other methods is the ability to
handle regression situations in which the data points are of varying quality.
Weighted Least Square (WLS) Regression - Disadvantages
Ridge Regression Model
● Standardization of Variables:
○ It is common practice to standardize the predictor variables before applying Ridge Regression.
○ Standardization involves subtracting the mean and dividing by the standard deviation for each variable.
○ This ensures that all variables are on a similar scale, so the regularization term has a consistent
impact across predictors.
● No Variable Selection:
○ Ridge Regression does not perform variable selection in the same way as methods like LASSO (L1
regularization).
○ It tends to shrink all coefficients toward zero, but it rarely sets them exactly to zero.
● Cross-Validation for λ Selection:
○ The choice of the shrinkage parameter λ is critical.
○ Cross-validation techniques, such as k-fold cross-validation, are often employed to select an optimal λ
that balances model complexity and performance.
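The shrinkage behaviour described above is easiest to see with a single centered predictor, where the ridge coefficient has the closed form Σxy / (Σx² + λ); a plain-Python sketch with illustrative data:

```python
# Ridge shrinkage in the simplest case: one centered predictor.
def ridge_slope(x, y, lam):
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    return sxy / (sxx + lam)      # lam = 0 gives the OLS slope

# Centered illustrative data:
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [-3.9, -2.1, 0.1, 2.0, 3.9]

slopes = [ridge_slope(x, y, lam) for lam in (0.0, 1.0, 10.0, 100.0)]
# Larger lambda pulls the coefficient toward (but never exactly to) zero
print([round(s, 4) for s in slopes])
```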
Loess Regression Model
● Loess (Locally Weighted Scatterplot Smoothing) regression
● non-parametric regression technique used for estimating relationships between variables.
● useful when dealing with complex and non-linear relationships in the data.
● fits a smooth curve to the data by locally fitting a polynomial regression model to subsets of the
data.
● Key Characteristics:
1. Local Regression:
● by fitting a polynomial model to a subset of the data points within a specified neighborhood (window)
around each point.
● This allows the model to capture local trends in the data.
2. Weighted Regression:
● The fitting process in Loess involves assigning weights to data points based on their proximity to
the point being predicted.
● Points closer to the target point have higher weights, while those farther away have lower weights.
● This weighting emphasizes the influence of nearby points in the local regression
Loess Regression Model
3. Polynomial Fitting:
● In each local subset of the data, a polynomial regression model is fitted.
● The degree of the polynomial is typically low (e.g., quadratic or cubic) to avoid overfitting.
4. Smoothing Parameter:
● The degree of smoothing in Loess is controlled by a parameter often denoted as α or τ.
● This parameter determines the size of the local neighborhood and influences the degree of flexibility in the
fitted curve.
● A larger smoothing parameter results in a smoother curve.
5. Residual Weighting:
● beneficial when dealing with heteroscedasticity (varying levels of variability across the data).
6. Adaptive Bandwidth:
● The bandwidth or window size can vary across different regions of the dataset.
● This adaptability helps to capture local features accurately, especially in areas where the relationship
between variables changes.
Loess Regression Model
7. Iterative Process:
● The fitting process in Loess is typically iterative.
● After the initial fit, the weights are adjusted based on the residuals, and the process is repeated. This
iteration helps in refining the model and improving the fit.
8. Robustness
● Loess regression is generally robust to outliers since the influence of each point is locally determined.
● Outliers in one region may have minimal impact on the fit in other regions.
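Characteristics 1–3 above can be sketched in a few lines of plain Python: a locally weighted (tricube) degree-1 fit evaluated at one target point. This is a teaching sketch only, with hypothetical helper names — no robustness iterations or adaptive bandwidth:

```python
# Minimal Loess-style local linear fit at a single target point x0.
def loess_at(x0, x, y, bandwidth):
    # Tricube weights: w = (1 - |d|^3)^3 for |d| < 1, else 0
    w = []
    for xi in x:
        d = abs(xi - x0) / bandwidth
        w.append((1 - d ** 3) ** 3 if d < 1 else 0.0)
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    if sxx == 0:                        # degenerate neighborhood
        return my
    b1 = sum(wi * (xi - mx) * (yi - my)
             for wi, xi, yi in zip(w, x, y)) / sxx
    return my + b1 * (x0 - mx)          # local line evaluated at x0

# On exactly linear data a local linear fit reproduces the line:
x = [float(i) for i in range(10)]
y = [2.0 * xi + 1.0 for xi in x]
print(round(loess_at(4.5, x, y, bandwidth=3.0), 6))  # → 10.0
```

Repeating the fit over a grid of target points traces out the smooth Loess curve; the bandwidth plays the role of the smoothing parameter α described above.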
Hypothesis Testing for the Regression Coefficients
Hypothesis Testing for
Note :
Residuals are assumed to be normally distributed with a mean near zero and a constant
variance.
Intercept
● estimated income of $7,263 for a newborn female with no education.
● It is important to note that the available dataset does not include such a person.
● The minimum age and education in the dataset are 18 and 10 years, respectively
Hypothesis Testing for the Regression Coefficients (contd…)
Note:
● Coefficient values are only estimates based on the observed incomes in the sample, so there is some
uncertainty or sampling error in the coefficient estimates.
● Std. Error
○ provides the sampling error associated with each coefficient
○ used to perform a hypothesis test, using the t-distribution, to determine if each coefficient is statistically
different from zero.
Hypothesis Testing for the Regression Coefficients (contd…)
● If a coefficient is not statistically different from zero, the coefficient and the associated variable
should be considered for exclusion from the model.
● In this example, the associated hypothesis tests’ p-values, Pr(>|t|), are very small for the Intercept, Age, and
Education parameters
● For small p-values, as is the case for the Intercept, Age, and Education parameters, the null hypothesis
would be rejected.
● For the Gender parameter, the corresponding p-value is fairly large at 0.13.
● In other words, at a 90% confidence level, the null hypothesis would not be rejected.
● So, dropping the variable Gender from the linear regression model should be considered
Hypothesis Testing for Regression Coefficients
In this example, the p-value of 2.2e-16 is small, which indicates that the null hypothesis should be rejected.
Interaction Models
Interaction Models - Types
Qualitative Predictor variables
● also known as categorical variables,
● pose some unique challenges compared to continuous variables.
● can be either nominal or ordinal,
● their inclusion in regression models requires special considerations.
● Some techniques used with qualitative predictor variables:
1. Dummy Coding (Indicator Variables):
● common technique for representing categorical variables in regression models.
● For a categorical variable with k levels, k−1 dummy variables are created.
● Each dummy variable takes the value 0 or 1, indicating the absence or presence of a particular level of the categorical variable.
● Example: For a variable "Color" with three levels (Red, Green, Blue), you might create two dummy
variables: D1 for Green and D2 for Blue, with Red as the reference level.
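The Color example above can be sketched in pure Python; `dummy_code` is a hypothetical helper (not from the source), shown only to illustrate how k levels map to k−1 indicator columns:

```python
def dummy_code(values, levels):
    """Create k-1 dummy (0/1) variables; levels[0] is the reference level."""
    rest = levels[1:]  # one indicator column per non-reference level
    return [[1 if v == lvl else 0 for lvl in rest] for v in values]

# "Color" with three levels; Red is the reference, so each row is (D1=Green, D2=Blue)
colors = ["Red", "Green", "Blue", "Red"]
rows = dummy_code(colors, levels=["Red", "Green", "Blue"])
print(rows)  # [[0, 0], [1, 0], [0, 1], [0, 0]]
```

In practice a library routine such as `pandas.get_dummies` does the same job; the point is that the reference level (Red) is encoded implicitly as all zeros.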
Qualitative Predictor variables
2. Effect Coding:
● Similar to dummy coding,
● reference category is assigned a value of -1, and the other categories are coded as 0 or 1.
● useful when you are interested in the overall average effect of the categorical variable.
3. Contrast Coding:
○ Involves creating contrasts that represent specific comparisons of interest among the levels of the
categorical variable.
○ Popular contrast codings include treatment (dummy) coding and Helmert coding.
4. Interaction with Dummy Variables:
○ Used when both continuous and categorical predictors,
○ to capture potential differential effects.
Model Evaluation Measures
1. Mean Absolute Error (MAE):
○ Represents the average absolute differences between the observed and predicted values.
2. Mean Squared Error (MSE):
○ Measures the average squared differences between observed and predicted values.
3. Root Mean Squared Error (RMSE):
○ Square root of the mean squared error, providing a more interpretable scale.
4. R-squared (R2):
○ Represents the proportion of the variance in the dependent variable that is predictable from the
independent variables.
○ Ranges from 0 to 1, where 1 indicates a perfect fit.
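Measures 1-4 above can be computed directly from their definitions; this is a minimal sketch (the helper name `regression_metrics` and the toy data are illustrative, not from the source):

```python
import math

def regression_metrics(y_true, y_pred):
    """Compute MAE, MSE, RMSE and R-squared from observed and predicted values."""
    n = len(y_true)
    errors = [yt - yp for yt, yp in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n            # mean absolute error
    mse = sum(e * e for e in errors) / n             # mean squared error
    rmse = math.sqrt(mse)                            # same units as y
    mean_y = sum(y_true) / n
    ss_tot = sum((yt - mean_y) ** 2 for yt in y_true)
    r2 = 1 - sum(e * e for e in errors) / ss_tot     # proportion of variance explained
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2}

m = regression_metrics([3, 5, 7], [2.5, 5.0, 7.5])
# errors are [0.5, 0, -0.5], so MAE = 1/3, MSE = 1/6, R2 = 1 - 0.5/8 = 0.9375
```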
Model Evaluation Measures
5. Adjusted R-squared:
○ Adjusts R-squared for the number of predictors in the model, providing a more realistic measure.
○ Penalizes the addition of irrelevant variables that do not improve the model significantly.
6. Mean Absolute Percentage Error (MAPE):
○ Represents the average percentage difference between observed and predicted values.
○ Useful for expressing errors as a percentage of the observed values.
7. Mean Bias Deviation (MBD):
○ Represents the average difference between predicted and observed values, indicating whether the model systematically over- or under-predicts.
Model Selection Procedures
● choosing the most appropriate model from a set of candidate models.
● find a model that balances goodness of fit with simplicity to avoid overfitting.
1. Stepwise Regression
2. Subset Selection:
3. Regularization Techniques
4. Information Criteria
5. Cross-Validation
6. Bootstrap Resampling
7. Model Comparison
8. Domain Knowledge
9. Model Diagnostics
Model Selection Procedures
1. Stepwise Regression:
● Forward Selection:
■ Starts with an empty model
■ Adds predictors one at a time,
■ selecting the one that most improves the model fit at each step.
● Backward Elimination:
■ Starts with all predictors in the model
■ Removes the one that contributes least to the model fit at each step.
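The forward-selection loop above can be sketched generically; `score` stands in for any fit criterion (higher is better, e.g. negative AIC), and the toy scoring function is purely illustrative:

```python
def forward_selection(candidates, score):
    """Greedy forward selection: start empty, repeatedly add the candidate
    predictor that most improves the score; stop when no addition helps."""
    selected = []
    best = score(selected)
    remaining = list(candidates)
    while remaining:
        top_score, top = max((score(selected + [c]), c) for c in remaining)
        if top_score <= best:          # no candidate improves the fit
            break
        selected.append(top)
        remaining.remove(top)
        best = top_score
    return selected

# toy criterion: variables "a" and "b" are useful, each added variable costs 0.1
useful = {"a", "b"}
score = lambda s: len(useful.intersection(s)) - 0.1 * len(s)
chosen = forward_selection(["a", "b", "c"], score)  # picks a and b, skips c
```

Backward elimination is the mirror image: start with all candidates and repeatedly drop the variable whose removal hurts the score least.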
Model Selection Procedures
2. Subset Selection
○ Best Subset Selection:
■ Fits all possible combinations of predictors
■ selects the model with the best fit based on a criterion such as the Akaike Information Criterion
(AIC) or the Bayesian Information Criterion (BIC).
■ balance the goodness of fit and the number of parameters in the model.
■ Model with the lowest AIC or BIC is often selected.
■ Formula: AIC = 2k − 2 ln(L) and BIC = k ln(n) − 2 ln(L), where
● L is the likelihood of the model.
● k is the number of parameters in the model.
● n is the number of observations in the dataset.
○ Recursive Feature Elimination (RFE):
■ Iteratively removes the least important variable until the desired number of features is reached.
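The AIC/BIC formulas above translate directly into code; the two "fitted models" in the example are hypothetical (invented log-likelihoods, not from the source):

```python
import math

def aic(log_likelihood, k):
    """Akaike Information Criterion: AIC = 2k - 2 ln(L)."""
    return 2 * k - 2 * log_likelihood

def bic(log_likelihood, k, n):
    """Bayesian Information Criterion: BIC = k ln(n) - 2 ln(L)."""
    return k * math.log(n) - 2 * log_likelihood

# Two hypothetical models fit to the same n = 100 observations:
# model A: log-likelihood -120 with 3 parameters
# model B: log-likelihood -118 with 6 parameters (slightly better fit, more complex)
print(aic(-120, 3), aic(-118, 6))  # 246.0 248.0 -> A preferred (lower AIC)
```

Note how BIC's ln(n) factor penalizes model B's extra parameters even more heavily than AIC does.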
Model Selection Procedures
3. Regularization Techniques:
○ Ridge Regression:
■ Introduces a regularization term to the least squares equation,
■ preventing overfitting by penalizing large coefficients.
○ Lasso Regression:
■ Similar to ridge regression
■ but uses the absolute values of coefficients, promoting sparsity and variable selection.
○ Elastic Net Regression:
■ A combination of ridge and lasso regularization, balancing their strengths.
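For a single predictor with no intercept, the ridge estimate has a closed form that makes the shrinkage effect visible; this one-variable sketch (helper name and data are illustrative) shows how the penalty alpha pulls the coefficient toward zero:

```python
def ridge_coef_1d(x, y, alpha):
    """Ridge estimate for one predictor, no intercept:
    b = sum(x*y) / (sum(x^2) + alpha). alpha = 0 recovers least squares."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / (sxx + alpha)

x, y = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]   # exact relationship y = 2x
b_ols = ridge_coef_1d(x, y, 0.0)    # 2.0: ordinary least squares
b_ridge = ridge_coef_1d(x, y, 1.0)  # 28/15: shrunk toward zero by the penalty
```

Lasso replaces the squared-coefficient penalty with an absolute-value penalty, which can shrink coefficients exactly to zero and thereby select variables.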
4. Information Criteria:
○ Akaike Information Criterion (AIC):
■ Penalizes models for complexity,
■ favoring simpler models that explain the data well.
○ Bayesian Information Criterion (BIC):
■ Similar to AIC
■ But places a stronger penalty on model complexity.
Model Selection Procedures
5. Cross-Validation:
○ k-Fold Cross-Validation:
■ Divide the data into k folds,
■ train the model on k-1 folds,
■ Validate on the remaining fold.
■ Repeat this process k times, rotating the validation set.
○ Leave-One-Out Cross-Validation (LOOCV):
■ Special case of k-fold where k equals the number of observations.
■ Each observation serves as a validation set in turn.
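The fold rotation described above can be sketched as an index-splitting helper (the function name `k_fold_splits` is illustrative; contiguous folds are assumed for simplicity, without shuffling):

```python
def k_fold_splits(n, k):
    """Split indices 0..n-1 into k contiguous folds; return a list of
    (train_indices, validation_indices) pairs, one per fold."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    splits, start = [], 0
    for size in sizes:
        val = list(range(start, start + size))                      # held-out fold
        train = list(range(0, start)) + list(range(start + size, n))  # the other k-1 folds
        splits.append((train, val))
        start += size
    return splits

splits = k_fold_splits(6, 3)
# each observation appears in exactly one validation fold;
# LOOCV is simply k_fold_splits(n, n)
```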
Model Selection Procedures
6. Bootstrap Resampling:
○ Bootstrap Aggregating (Bagging):
■ Create multiple bootstrap samples from the dataset,
■ train models on each sample,
■ average their predictions to reduce variance.
○ Bootstrap Confidence Intervals:
■ Assess the stability and reliability of regression coefficients using bootstrap resampling.
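The bootstrap sampling step underlying both bagging and bootstrap confidence intervals can be sketched as follows (function name and seed handling are illustrative assumptions):

```python
import random

def bootstrap_samples(data, b, seed=0):
    """Draw b bootstrap samples: each samples with replacement and has the
    same size as the original dataset."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    n = len(data)
    return [[data[rng.randrange(n)] for _ in range(n)] for _ in range(b)]

samples = bootstrap_samples([1, 2, 3, 4], b=5)
# bagging would fit one model per sample and average the b predictions;
# a bootstrap confidence interval would take percentiles of the b coefficient estimates
```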
7. Model Comparison:
○ Compare different regression models using statistical tests or information criteria to determine the
most suitable model for your data.
8. Domain Knowledge:
○ Consider the theoretical aspects of the problem and domain knowledge to guide variable selection and
model specification.
9. Model Diagnostics:
○ Use residual analysis, leverage plots, and other diagnostic tools to identify potential issues with the
chosen model.
Leverage in Regression
● influence that a single data point can have on the overall fit of a regression model.
● It is a measure of how much a particular observation can affect the estimated regression coefficients.
● especially important in multiple linear regression, where there is more than one independent variable.
● a measure of how far an independent variable's value is from the mean of the independent variables.
● The leverage of observation i in a dataset with n observations is the i-th diagonal element of the hat matrix H = X (XᵀX)⁻¹ Xᵀ
○ also known as the projection matrix
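A minimal sketch of the hat-matrix computation, assuming numpy is available (the data are invented to show one high-leverage point):

```python
import numpy as np

def leverages(X):
    """Diagonal of the hat matrix H = X (X'X)^{-1} X'."""
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    return np.diag(H)

# design matrix: an intercept column plus one predictor; x = 10 is far from the others
x = np.array([1.0, 2.0, 3.0, 10.0])
X = np.column_stack([np.ones_like(x), x])
h = leverages(X)
# the leverages sum to the number of parameters (here 2), and the
# extreme point x = 10 has by far the largest leverage (0.97)
```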
Leverage in Regression
● Influence and Outliers:
○ Observations with high leverage can have a strong influence on the estimated coefficients.
○ If a data point has high leverage and an extreme value in the dependent variable,
■ it can significantly impact the regression model,
■ potentially leading to outliers or influential points.
● High Leverage Points:
○ Points with high leverage typically have extreme values in one or more independent variables.
○ These points have the potential to disproportionately influence the regression model,
■ especially if they deviate from the overall pattern of the data.
Leverage in Regression
● Diagnostic plots, such as leverage-residual plots, help to identify observations with high leverage.
● Observations with both high leverage and large residuals may have a substantial impact on the model.
Residual Leverage Plot (Regression Diagnostic)
Regression analysis requires some assumptions to be followed by the dataset.
● Observations are independent of each other; no observation should be correlated with another.
● The residuals are normally distributed.
● The relationship b/w the independent variable and the mean of the dependent variable is linear.
● The data is in homoscedasticity, (variance of the residual is the same for each value of the dependent
variable.)
To perform a good linear regression analysis, check whether these assumptions are violated:
● If the data contain non-linear trends then it will not be properly fitted by linear regression resulting in a high
residual or error rate.
● To check for the normality in the dataset, draw a Q-Q plot on the data.
● The presence of correlation between observations is known as autocorrelation (check with an autocorrelation plot).
● The presence of homoscedasticity (check with the Scale-Location plot or the Residuals vs Fitted plot).
Residual Leverage Plot (Regression Diagnostic)
1. Residual vs fitted plot:
○ This plot is used to check for linearity and homoscedasticity;
○ if the relationship is linear, the points should scatter around a horizontal line without much deviation.
○ If the model meets the condition for homoscedasticity, the points should be equally spread around the y = 0 line.
2. Q-Q plot:
○ This plot is used to check the normality of the residuals;
○ if the residuals are normally distributed, the scatter points will fall along the 45-degree dashed line.
Residual Leverage Plot (Regression Diagnostic)
3. Scale-Location plot:
○ A plot of the square root of the standardized residuals vs the fitted values.
○ This plot is used for checking the homoscedasticity of residuals.
○ Equally spread residuals across the horizontal line indicate the homoscedasticity of residuals.
4. Residual vs Leverage plot:
○ A plot of the standardized residuals against the leverage of each point; used to identify influential observations.
Logistic Regression
● Probability Prediction:
○ The output of the sigmoid function represents the probability that the instance belongs to the positive
class (class 1): P(y=1)=σ(z)
○ The probability of belonging to the negative class (class 0) is then 1−P(y=1).
○ Logistic regression thus predicts, based on the input variables, the probability of an event occurring.
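The probability prediction above can be sketched with the sigmoid function; the coefficients b0, b1 and the input x below are illustrative values, not fitted from any data:

```python
import math

def sigmoid(z):
    """Map the linear predictor z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# hypothetical coefficients: z = b0 + b1 * x is the linear predictor
b0, b1 = -4.0, 0.1
x = 50.0
p_pos = sigmoid(b0 + b1 * x)  # P(y = 1) = sigma(z), here sigma(1.0)
p_neg = 1.0 - p_pos           # P(y = 0); the two probabilities sum to 1
```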
Logistic Regression Applications
Logistic Regression Advantages
Logistic Regression - Problem
Generalized Linear Model
● statistical modeling framework that extends the classical linear regression model
● to handle a broader range of data types and distributions.
● Classical linear regression assumes normally distributed errors and a continuous response variable;
● GLMs relax these assumptions, accommodating various types of response variables and distributional assumptions.
● Key components of a Generalized Linear Model include:
○ Random Component:
■ This part of the model specifies the distributional family of the response variable, which can be chosen
based on the nature of the data. Examples of distribution families include Gaussian (for continuous
data), Binomial (for binary data), Poisson (for count data), and Gamma (for positively skewed
continuous data).
○ Systematic Component:
■ describes how the linear predictor is related to the predictors.
■ It includes a linear combination of the predictor variables, each multiplied by a regression coefficient.
○ Link Function:
■ connects the mean of the distribution (specified by the random component) to the linear predictor in the
systematic component.
■ The choice of link function depends on the distributional family and the characteristics of the data.
Common link functions include the identity, logit, log, and inverse.
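The three components fit together as follows for a Poisson GLM of count data; this is a minimal sketch with invented coefficients (b0, b1), showing only the prediction step, not the fitting:

```python
import math

# Random component:      y ~ Poisson(mu)
# Systematic component:  eta = b0 + b1 * x  (the linear predictor)
# Link function:         log(mu) = eta, so mu = exp(eta)
def poisson_glm_mean(x, b0, b1):
    """Predicted mean count for input x under a log-link Poisson GLM."""
    eta = b0 + b1 * x        # systematic component
    return math.exp(eta)     # inverse of the log link; always positive

mu = poisson_glm_mean(2.0, 0.5, 0.3)  # exp(0.5 + 0.3*2.0) = exp(1.1)
```

The log link guarantees a positive mean, which is why it suits count data; swapping in the logit link and a Binomial random component recovers logistic regression.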
Logistic Regression Vs Generalized Linear Model
Feature | Logistic Regression | Generalized Linear Models (GLM)
Type of Model | Specific case of GLM for binary outcomes | General framework that includes logistic regression as a special case
Dependent Variable | Binary (0 or 1) | Can handle various types (e.g., continuous, count)
Interpretability | Coefficients represent log-odds | Interpretation depends on the chosen link function
Examples | Predicting whether an email is spam or not | Predicting house prices, count of events, etc.
Linear Regression Vs Logistic Regression