Predictive Analytics Using Regression
Sumeet Gupta
Associate Professor
Indian Institute of Management Raipur
Outline
Basic Concepts
Applications of Predictive Modeling
Linear Regression in One Variable using OLS
Multiple Linear Regression
Assumptions in Regression
Explanatory Vs Predictive Modeling
Performance Evaluation of Predictive Models
Practical Exercises
Case: Nils Baker
Case: Pedigree Vs Grit
BASIC CONCEPTS
Scatterplot
A picture to explore the relationship in bivariate data
Correlation r
Measures the strength of the relationship (from −1 to 1)
Regression
Predicting one variable from the other
Y = α + βX + ε
α + βX: the population relationship, on average
ε: the randomness of individuals
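The model above can be illustrated with a small simulation. The parameter values below (α = 10, β = 2, ε ~ N(0, 3)) are assumed purely for illustration; fitting a least-squares line to the simulated individuals recovers the population relationship.

```python
# Simulate Y = alpha + beta*X + epsilon, then recover the line by least squares.
import random

random.seed(1)
alpha, beta = 10.0, 2.0                      # assumed population parameters
xs = [random.uniform(0, 10) for _ in range(200)]
ys = [alpha + beta * x + random.gauss(0, 3) for x in xs]   # epsilon ~ N(0, 3)

# Least-squares slope and intercept: b = Sxy/Sxx, a = ybar - b*xbar.
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
sxx = sum((x - mx) ** 2 for x in xs)
b = sxy / sxx
a = my - b * mx
print(a, b)   # close to the assumed alpha = 10, beta = 2
```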
[Figure: scatterplot of web portals (eBay, Yahoo!, MSN) against pages per person (0–200). Correlation r = 0.964 — a linear, increasing relationship: a straight line with scatter, tilting up and to the right.]
[Figure: dollars (billions, $0–$1,000) against number of deals (100–400). Correlation r = 0.419 — a positive, linear association: a straight line with scatter, tilting up and to the right.]
[Figure: interest rate (5.0%–6.0%) against loan fee (0%–4%). Correlation r = −0.890 — a linear, decreasing relationship: a straight line with scatter, tilting down and to the right.]
[Figure: today's stock-market change against yesterday's change (both −3% to 3%). Correlation r = 0.11 — a weak relationship, if any: the tilt is neither up nor down.]
[Figure: call price ($0–$100) against strike price ($450–$650). Correlation r = −0.895 — a negative, nonlinear (curved) relationship: the right to buy at a lower strike price has more value.]
[Figure: yield of process (120–160) against a process setting (500–900). A nonlinear, curved relationship: not a straight line. Correlation r = 0.0155, so r alone suggests no relationship (tilt neither up nor down) even though the curve is strong.]
[Figure: circuit miles (millions, 0–2,000) against investment ($millions, 0–2,000). Correlation r = 0.820 on the raw scales; r = 0.957 for log of miles against log of investment.]
[Figure: bid price ($100–$150) against coupon rate (0%–10%) for bonds. Correlation r = 0.950 overall; r = 0.994 for ordinary bonds only.]
[Figure: cost against number produced, with a visible outlier — a disaster (a fire at the factory). With the outlier included, r = −0.623; with the outlier removed, more detail is visible and r = 0.869.]
Example: Salary and Experience

Experience (years)   Salary ($thousand)
15                   30
10                   35
20                   55
 5                   22
15                   40
 5                   27

Correlation r = 0.8667
[Figure: scatterplot of salary ($thousand, 20–60) against experience (0–20 years).]
Mary's predicted value is 48.8 (i.e., $48,800)
[Figure: salary against experience with the least-squares line; Mary's prediction marked at 48.8.]
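A quick sketch of the least-squares fit for the salary data above; predicting at 20 years of experience (which appears to be Mary's, given the 48.8 on the slide) reproduces the predicted value.

```python
# Least-squares fit of salary on experience for the six employees above.
# slope b = Sxy/Sxx, intercept a = ybar - b*xbar.
x = [15, 10, 20, 5, 15, 5]        # experience (years)
y = [30, 35, 55, 22, 40, 27]      # salary ($thousand)
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
b = sxy / sxx                     # slope ~ 1.673 ($thousand per year)
a = ybar - b * xbar               # intercept ~ 15.32
print(round(a + b * 20, 1))       # prediction at 20 years -> 48.8
```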
Standard error of estimate:
Se = SY · sqrt[(1 − r²)(n − 1)/(n − 2)]
Se = 11.686 · sqrt[(1 − 0.8667²)(6 − 1)/(6 − 2)] = 6.52
Predicted salaries are about 6.52 (i.e., $6,520) away from actual
salaries
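Checking the arithmetic above directly:

```python
# Standard error of estimate: Se = Sy * sqrt((1 - r^2)(n - 1)/(n - 2)),
# using r, Sy, and n from the salary example.
from math import sqrt

r, s_y, n = 0.8667, 11.686, 6
s_e = s_y * sqrt((1 - r**2) * (n - 1) / (n - 2))
print(round(s_e, 2))  # -> 6.52
```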
Standard error of a regression coefficient, Sbj
Used in the usual way to find confidence intervals and hypothesis
tests for individual regression coefficients
Predicted Page Costs for Audubon = $38,966
Actual Page Costs are $25,315
Residual = $25,315 − $38,966 = −$13,651
Audubon has Page Costs $13,651 lower than you would expect for
a magazine with its characteristics (Audience, Percent Male, and
Median Income)
Standard Error
Standard Error of Estimate Se
Indicates the approximate size of the prediction errors
About how far are the Y values from their predictions?
For the magazine data
Se = $21,578
Actual Page Costs are about $21,578 from their predictions, for
these magazines
Coeff. of Determination
The strength of association is measured by the square of the multiple
correlation coefficient, R2, which is also called the coefficient of
multiple determination.
R² = SSreg / SSy

Adjusted for the number of predictors k and sample size n:

Adjusted R² = R² − k(1 − R²)/(n − k − 1)
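The adjustment above penalizes R² for each extra predictor. A minimal sketch: R² = 0.787 and k = 3 come from the magazine example below; the sample size n = 55 is an assumption for illustration, not stated in the slides.

```python
# Adjusted R^2 = R^2 - k(1 - R^2)/(n - k - 1)
def adjusted_r2(r2, n, k):
    return r2 - k * (1 - r2) / (n - k - 1)

# R^2 = 0.787 and k = 3 predictors are from the magazine example;
# n = 55 is an assumed sample size.
print(round(adjusted_r2(0.787, 55, 3), 3))  # -> 0.774
```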
Coeff. of Determination
Coefficient of Determination R2
Indicates the percentage of the variation in Y that is explained by
(or attributed to) all of the X variables
How well do the X variables explain Y?
For the magazine data
R2 = 0.787 = 78.7%
The X variables (Audience, Percent Male, and Median Income), taken
together, explain 78.7% of the variation in Page Costs
The F test
Is the regression significant?
Do the X variables, taken together, explain a significant amount of
the variation in Y?
The null hypothesis claims that, in the population, the X variables
do not help explain Y; all coefficients are 0:
H0: β1 = β2 = … = βk = 0
The research hypothesis claims that, in the population, at least
one of the coefficients is not 0
The F test
This is equivalent to the null hypothesis that the population R² is zero:
H0: R²pop = 0, i.e. H0: β1 = β2 = β3 = … = βk = 0
The overall test can be conducted by using an F statistic:

F = (SSreg / k) / (SSres / (n − k − 1)) = (R² / k) / ((1 − R²) / (n − k − 1))
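The second form of the F statistic needs only R², k, and n. A sketch, reusing R² = 0.787 and k = 3 from the magazine example, with n = 55 assumed for illustration:

```python
# F = (R^2/k) / ((1 - R^2)/(n - k - 1)), computed from R^2 alone.
def f_statistic(r2, n, k):
    return (r2 / k) / ((1 - r2) / (n - k - 1))

# R^2 = 0.787 and k = 3 are from the magazine example; n = 55 is assumed.
print(round(f_statistic(0.787, 55, 3), 1))  # -> 62.8
```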
Example: F test
For the magazine data, the X variables (Audience, Percent
Male, and Median Income) explain a very highly significant
proportion of the variation in Page Costs
t Tests
A t test for each regression coefficient
To be used only if the F test is significant
If F is not significant, you should not look at the t tests
Confidence interval for a coefficient: bj ± t · Sbj
t statistic: t = bj / Sbj
Example: t Tests
Testing b1, the coefficient for Audience
b1 = 3.79, t = 13.5, p = 0.000
Audience has a very highly significant effect on Page Costs, after
adjusting for the other X variables
Assumptions in Regression
Assumptions underlying the statistical techniques
Assumptions in Regression
Linearity
The independent variable has a linear relationship with the dependent
variable
Normality
The residuals or the dependent variable follow a normal distribution
Multicollinearity
When some X variables are too similar to one another
Homoskedasticity
The variability in Y values for a given set of predictors is the same
regardless of the values of the predictors
Independence among cases (Absence of correlated errors)
The cases are independent of each other
Assumptions in Regression
Normality
The residuals or the dependent variable follow a normal
distribution
If the departure from normality is substantial, the resulting
statistical tests are invalid
Graphical Analysis
Histogram and normal probability plot
Peaked and skewed distributions result in non-normality
Statistical Analysis
Kolmogorov–Smirnov test; Shapiro–Wilk test
Assumptions in Regression
Homoskedasticity
Assumption related primarily to dependence relationships between
variables
Assumptions in Regression
Homoskedasticity
Graphical Analysis
Analysis of residuals in case of Regression
Statistical Analysis
Variances within groups formed by non-metric variables
Levene Test
Box's M Test
Remedy
Data Transformation
Homoskedasticity: Graphical Analysis
[Figure: residual plots used to check for equal variance.]
Assumptions in Regression
Linearity
Assumption for all multivariate techniques based on correlational
measures of association between variables
Identification
Remedy
Data Transformations
Assumptions in Regression
Absence of Correlated Errors
Prediction errors should not be correlated with each
other
Identification
The most likely cause is the data collection process, such as
data collected in groups or sequentially over time
Assumptions in Regression
Multicollinearity
Multicollinearity arises when intercorrelations among the predictors
are high
Assumptions in Regression
Multicollinearity
The ability of an independent variable to improve the prediction of the
dependent variable is reduced when it is highly correlated with the
other independent variables
Assumptions in Regression
Multicollinearity
Measuring Multicollinearity
Tolerance
Amount of variability of the selected independent variable not explained
by the other independent variables
VIF (variance inflation factor) = 1 / Tolerance
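As a sketch of how tolerance works: with just two predictors, the tolerance of each reduces to 1 − r², where r is the correlation between them. The data below are invented for illustration, with x2 built to be nearly collinear with x1.

```python
# Tolerance with two predictors: 1 - r^2 between them; VIF = 1/tolerance.
from math import sqrt

x1 = [1, 2, 3, 4, 5, 6, 7, 8]
x2 = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]   # roughly 2 * x1
n = len(x1)
m1, m2 = sum(x1) / n, sum(x2) / n
sxy = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
sxx = sum((a - m1) ** 2 for a in x1)
syy = sum((b - m2) ** 2 for b in x2)
r = sxy / sqrt(sxx * syy)          # correlation between the two predictors
tolerance = 1 - r ** 2
vif = 1 / tolerance
print(round(tolerance, 4), round(vif, 1))  # tiny tolerance, huge VIF
```

A tolerance near 0 (equivalently, a very large VIF) flags a predictor that adds almost nothing beyond the others.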
Assumptions in Regression
Multicollinearity
Remedy for Multicollinearity
A simple procedure for adjusting for multicollinearity consists of
using only one of the variables in a highly correlated set of
variables.
Omit highly correlated independent variables and identify other
independent variables to help the prediction
Alternatively, the set of independent variables can be transformed into
a new set of predictors that are mutually independent by using
techniques such as principal components analysis.
More specialized techniques, such as ridge regression and latent root
regression, can also be used.
Assumptions in Regression
Data Transformations
To correct violations of the statistical assumptions
Positively skewed distribution → logarithmic transformation
If the residuals in regression are cone shaped:
Cone opens to the right → inverse transformation
Cone opens to the left → square root transformation
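A sketch of the first rule above: taking logs pulls in the long right tail of a positively skewed variable. The data and the skewness helper below are invented for illustration.

```python
# Log transformation of a positively skewed variable reduces its skewness.
from math import log

x = [1, 2, 2, 3, 3, 4, 5, 8, 20, 60]        # positively skewed (long right tail)
logx = [log(v) for v in x]

def skew(v):
    """Sample skewness: mean cubed deviation divided by sd^3."""
    n = len(v)
    m = sum(v) / n
    sd = (sum((a - m) ** 2 for a in v) / n) ** 0.5
    return sum((a - m) ** 3 for a in v) / (n * sd ** 3)

print(round(skew(x), 2), round(skew(logx), 2))  # skewness drops after the log
```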
Assumptions in Regression
Data Transformations
Transformations to achieve linearity
[Figure: common transformations for linearizing curved relationships.]
Assumptions in Regression
General guidelines for transformation
For a noticeable effect of transformation, the ratio of a variable's
mean to its standard deviation should be less than 4.0
When the transformation can be performed on either of two
variables, select the one with the smallest mean/SD ratio
Transformation should be applied to independent variables
except in case of heteroscedasticity
Heteroscedasticity can only be remedied by transformation of
the dependent variable in a dependent relationship
If the heteroscedastic relationship is also non-linear, the
dependent variable, and perhaps the independent variables, must
be transformed
Transformations may change the interpretation of the variables
Issues in Regression
Variable Selection
How to choose from a long list of X variables?
Too many: you waste the information in the data (estimates become less precise)
Too few: you risk ignoring useful predictive information
Model Misspecification
Perhaps the multiple regression linear model is wrong
Unequal variability? Nonlinearity? Interaction?
EXPLANATORY VS PREDICTIVE MODELING
Performance Evaluation
Prediction error for observation i = actual y value − predicted y value
Popular numerical measures of predictive accuracy:
MAE or MAD (mean absolute error / deviation)
Average Error (mean of the signed errors)
RMSE (root mean squared error)
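These measures can be sketched directly; the actual and predicted values below are invented for illustration.

```python
# MAE, average (signed) error, and RMSE for a set of predictions.
from math import sqrt

actual    = [30, 35, 55, 22, 40, 27]
predicted = [32, 31, 50, 25, 41, 24]   # hypothetical model predictions

errors = [a - p for a, p in zip(actual, predicted)]
mae  = sum(abs(e) for e in errors) / len(errors)       # mean absolute error
avg  = sum(errors) / len(errors)                        # average (signed) error
rmse = sqrt(sum(e * e for e in errors) / len(errors))   # root mean squared error
print(mae, avg, rmse)
```

Note that the average error can be near zero even when individual errors are large, which is why MAE and RMSE are the usual accuracy measures.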
CASE: PEDIGREE VS GRIT
Who is expected to obtain higher returns at their current funds, and by how much?
If hired by the firm, who is expected to obtain higher returns, and by how
much?
Can you prove at the 5% level of significance that Bob would get higher
expected returns if he had attended Princeton instead of Ohio State?
Can you prove at the 10% level of significance that Bob would get at least 1%
higher expected returns by managing a growth fund?
Is there strong evidence that fund managers with an MBA perform worse than
fund managers without an MBA? What is held constant in this comparison?
Based on your analysis of the case, which candidate do you support for
AMBTPM's job opening: Bob or Putney? Discuss
Thank You