Module 2: Multiple Linear Regression

2.1. Fundamentals, Objectives and Examples


This module begins with an introduction to the multiple linear regression model, motivated by three data examples. This lesson will cover the fundamentals and objectives of multiple linear regression.

Slide 3:

In multiple linear regression, the data consist of the response variable and a series of
predicting or explanatory variables. More specifically, we observe "n" realizations of the
response variable along with the corresponding predicting variables. The relationship
captured is the linear relationship between the response variable and the predicting
variables. In this model, the deviances, or epsilons (also called "error terms"), are the
differences between the response variable and the linear function of the x's. For estimating
multiple linear regression, we assume that the error terms have zero mean, constant
variance, and are independent. For statistical inference, we also need to assume
that the error terms are normally distributed. Let’s review these assumptions in more
detail.
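
In equation form, a compact way to write the model just described (a sketch using the slide's notation, with p predicting variables) is:

\[ y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \epsilon_i, \qquad i = 1, \ldots, n, \]

where the error terms satisfy \(E(\epsilon_i) = 0\), \(\mathrm{Var}(\epsilon_i) = \sigma^2\), and are independent; for statistical inference we additionally assume \(\epsilon_i \sim N(0, \sigma^2)\).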

The zero mean assumption means that the expected value of the errors is zero across all
errors; this also implies that the linearity assumption holds.

The constant variance assumption means that it cannot be true that the model is more
accurate for some parts of the population and less accurate for other parts. A violation
of this assumption means that the estimates are not as efficient in estimating the true
parameters, resulting in poorly calibrated confidence and prediction intervals.

The independence assumption means that the response variables are independently
drawn from the data-generating process. Violation of this assumption can lead to a
misleading assessment of the strength of the regression.

If the normality assumption is violated, hypothesis tests and confidence or prediction
intervals can be misleading.

In the linear regression model, the parameters defining the regression line, the
regression coefficients, β0, β1 through βp, are unknown parameters. We have an
additional parameter, the variance of the errors, denoted with sigma squared. Model
parameters are unknown regardless of how much data we observe. But we can derive
some approximations or estimates of the parameters given the data and the model
assumptions. The parameter estimates will take different values if one uses different
data sets, meaning that the estimates are uncertain. In a different lesson, we will
describe the distribution of the estimated regression coefficients to capture this
uncertainty.

Slide 4:

In multiple linear regression, the model can be written in a matrix form. We stack up all
the values of the response variable into a vector called here the Y vector. We define the
design matrix as a matrix consisting of columns of predicting variables, including the
column of ones corresponding to the intercept. We also stack up the regression
parameters, the betas, into one vector of parameters and we stack up the error terms
into one vector, ε.
The resulting matrix formulation of the model is provided on the slide. We will be using
this formulation when I introduce the estimation and inference approaches.
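
For reference, the matrix formulation described here can be written compactly as:

\[ Y = X\beta + \epsilon, \]

where \(Y\) is the \(n \times 1\) vector of responses, \(X\) is the \(n \times (p+1)\) design matrix (including the column of ones), \(\beta\) is the \((p+1) \times 1\) vector of regression parameters, and \(\epsilon\) is the \(n \times 1\) vector of error terms.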

Slide 5:

Even with only a small number of variables, say two predicting variables, there are a
number of different approaches to regression that will yield different results and may be
more or less useful in different scenarios. The slide presents four basic approaches
demonstrating the flexibility of linear regression. To keep the examples simple, I will
demonstrate each with just two predicting variables, but all these models can use more
than two predicting variables.

Let’s start by examining a simple "first-order" model. For this model, when we fix the
value of a predicting variable, say x1, the expected value of Y is a linear function of the
other variable, x2. If we graph a regression function as a function of only one variable,
say x2, for several values of x1, we obtain as contours of the regression function a
collection of lines.

The "linear" in linear regression refers to fitting the response as a linear function of the
observed data, but it doesn't mean necessarily that the ‘linear’ is the linear relationship
in the individual predictors. In fact, we can extend this model to a Second Order model
where we include the square of the predictors, so we include an x 1 squared and x2
squared as additional predictors. For this model, when we fix x 2, the expected change in
y for one unit increase in x1 is not β1, but β1 plus β3x1. If we graph a regression function
as a function of only one variable, say x 1, for several different values of x2, we obtain as
contours of the regression function a collection of curves rather than lines. Thus, while
the estimation of the model is the same as that for a linear model, the interpretation is
not.
In the third model, the First Order Interaction model, we extend the first-order model to
include an interaction term between the two predicting variables. The contours of the
regression function are non-parallel straight lines for any interaction model. Specifically,
when x1 is increased by one unit, the expected change in Y is β1 plus β3 times x2, thus
depending on x2.

The final example, the Second Order Interaction Model, combines aspects of these
previous examples. Here, the model includes both second-order and interaction terms.
In this model, a plot of the contours of the regression function yields non-parallel
curves.
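
In R, the four approaches could be fit along the following lines (a hedged sketch; the data frame dat and the variable names y, x1, and x2 are illustrative, not the course's own code):

m1 <- lm(y ~ x1 + x2, data = dat)                               # first-order model
m2 <- lm(y ~ x1 + x2 + I(x1^2) + I(x2^2), data = dat)           # second-order model
m3 <- lm(y ~ x1 + x2 + x1:x2, data = dat)                       # first-order model with interaction
m4 <- lm(y ~ x1 + x2 + I(x1^2) + I(x2^2) + x1:x2, data = dat)   # second-order model with interaction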

Slide 7

Here I am contrasting the surface of the response for the 1st and 2nd order regression
models with two predicting factors. While for the 1st order model the surface is linear, for the
2nd order model the surface is non-linear. Note that the 2nd order model is still expressed as a
linear model, since the 2nd order factors enter as linear terms, but because we have 2nd order
terms, the resulting surface is non-linear.
Slide 8

Here I am contrasting the surface of the response for the 1st and 2nd order regression
models with interactions, with two predicting factors. As mentioned earlier, the
regression lines are no longer parallel. The interaction terms make the contours of the
response surface non-parallel, as shown in the example on the slide. Thus, adding
interaction terms to the model makes the interpretation of the relationships between
the response and the predicting variables more challenging.

Slide 9:
In Module 1, we contrasted the simple linear regression model with the ANOVA model.
In simple linear regression, we considered modeling variations in the response
with respect to a quantitative variable whereas in ANOVA, we considered modeling
variations in the response with respect to qualitative variables. Multiple regression is a
generalization of both models.

Slide 10:
When we have both quantitative and qualitative variables in a multiple regression
model, we need to understand how to interpret the model and how to model qualitative
variables. Let’s assume a model with both quantitative and qualitative variables where
the qualitative variable has three levels. When we have a qualitative variable with k
levels, we only include k-1 dummy variables if the regression model has an intercept.
Thus, for this example, we include the two dummy variables d1 and d2 but not d3. When
adding the dummy variables, the intercept of the model will vary depending on the category
of the qualitative variable. For example, if d1 equals 0 and d2 equals 0, then
we are considering the response for the third category; thus, the model is β0 plus β1x1,
where the intercept is β0. In contrast, if d1 equals 1 and d2 equals 0, then
the intercept is β0 plus β2. If d2 equals 1, then the intercept is β0 plus β3.

Thus, for this regression model, the three models, each for one category, result in
parallel regression lines.

As we discussed in the previous slide, if we include an interaction term between the
qualitative variables and the quantitative variables, the regression lines are non-parallel
as shown in the figure.
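
A minimal R sketch of these two cases (hypothetical data frame dat with a quantitative predictor x1 and a qualitative variable group with three levels; the names are illustrative):

dat$group <- as.factor(dat$group)              # three levels; R creates k-1 = 2 dummy variables
m.parallel <- lm(y ~ x1 + group, data = dat)   # different intercepts, common slope: parallel lines
m.interact <- lm(y ~ x1 * group, data = dat)   # interaction: intercepts and slopes differ, non-parallel lines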

Slide 11
The first example builds from one of the examples in Module 1 covering the simple
linear regression; in this example, we studied the relationship between advertisement
expenditure and medical supply sales under a new advertising program, assessing
whether an increase in advertising expenditure would be expected to lead to an
increase in the sales and by how much.

Slide 12
We will now include other predicting variables to explain the sales such as:

● total amount of bonuses paid
● market share in the territory where the company's offices are located
● largest competitor’s sales
● the region in which the office is located.

In this example we have both quantitative and qualitative predicting variables. The
indicator of the region in which each office is located is a qualitative or categorical
variable. It is important to account for these additional predicting variables, in addition
to advertisement expenditure, since they control for factors that may impact sales.
Slide 13:
SAT is a standardized test used for college admissions in the United States. But back in
1982, SAT was not widely used for college admissions. In this example, we will study the
average SAT scores by state for all the states in 1982 in the United States. The average
SAT scores varied considerably by state, with mean scores ranging from 790 for South
Carolina to 1,088 for Iowa. The researchers of this study examined compositional and
demographic variables to assess to what extent the variables were tied to SAT scores.

Research questions to be addressed in these examples are:

● Which variables are associated with SAT scores?
● How do the states rank?
● Which states perform best for the amount of money they spend?

Slide 14:
The response variable is the mean SAT score, consisting of the verbal and quantitative
tests combined.

The predicting variables are:

● X1: "takers" The percentage of total eligible students, high school seniors, in
the state who took the exam.
● X2: "income" The median income of families of test takers in hundreds of
dollars.
● X3: "years" The average number of years that test takers had in social sciences,
natural sciences, and humanities combined.
● X4: "public" A percentage of test takers who attended public schools.
● X5: "expend" A state expenditure on secondary schools in hundreds of dollars
per student.
● X6 "rank" The median percentile of ranking of test takers.

Slide 15:
Bike sharing systems are of great interest due to their important role in traffic
management and the ever-increasing number of people choosing them as their preferred
mode of transport. In this study, we will address the key challenge of demand forecasting for
bikes under a bike sharing program using two years of historical data corresponding to
years 2011 and 2012 for Washington D.C., USA, one of the first bike sharing programs
in the US. The data were acquired from the UCI Machine Learning Repository. The
dataset included 17380 observations with 17 attributes.

Slide 16:
Demand for bikes is dependent upon various environmental and time factors such as
weather conditions, precipitation, day of week, season, hour of the day, and so on. We
are interested in predicting the number of bikes rented per hour, which is the response
variable. The details of attributes included in the data are listed on the slide. They
include seasonal effects such as day of the week or month of the year. We also have a
factor specifying weather conditions, coded as a categorical factor, as well as other
weather factors. Overall, we have six qualitative predicting variables. For example,
three of these predicting variables are capturing seasonality in bike rental, including
day of the week, month of the year and hour of the day. We also have three
quantitative variables accounting for variations in weather, more specifically, in
temperature, rainfall, and wind speed, all three expected to impact bike rental.

Slide 17:
At first glance, the year predicting variable may obviously seem to be a quantitative
variable, and indeed, measures of time certainly can be. But in cases where there are
only a few years across many observations (in this case we have only two different
years of data), it may be better to consider year as a qualitative variable. If the
observations are made over many years, then considering ‘year’ as a quantitative
variable might be more appropriate. Generally, we can transform a quantitative
variable into a qualitative variable, or categorical variable, when we see that
there are non-linear relationships in how the predictive variable explains the response.
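
A one-line sketch of this conversion in R (assuming a data frame bike with a column year holding the values 2011 and 2012; the names are illustrative):

bike$year <- as.factor(bike$year)   # treat year as a qualitative variable with two levels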

Slide 18:
Typically, a regression analysis is used for the following purposes:

● Prediction of the response variable,
● Modeling the associations between the predicting variables and the response variable,
● Testing hypotheses about association relationships.
Why restrict ourselves to linear models? Well, they are simpler to understand, and
they're simpler mathematically. But most importantly, they work well for a wide range
of circumstances. It's a good idea when considering this kind of model, or in fact any
statistical model, to remember the words of a famous statistician, George Box: "All
models are wrong, but some are useful." We do not believe that a linear model will
provide a true representation of reality, rather we think that it might provide a useful
representation of reality. Another useful piece of advice comes from another very
famous statistician, John Tukey. "Embrace your data, not your models."

Summary:
To conclude, in this lesson we learned the fundamentals of multiple linear regression. I
also illustrated multiple regression analysis with three examples.
2.2. Estimation & Interpretation
The topic of this lesson is estimation of multiple linear regression. Specifically, I'll
introduce the approach for estimating the regression coefficients and the variance
parameter of the error terms. We will also learn about model interpretation along with
different roles the predicting factors have in multiple linear regression.

Slide 3:

I will begin this lesson by reviewing the multiple linear regression model and the
notation of the model in matrix form. We define the design matrix as a matrix
consisting of columns of predicting variables, including the column of ones
corresponding to the intercept. We'll also stack up all the values of the response
variable into a vector called here the Y vector. The same for the regression parameters,
the betas, and the error terms. The resulting matrix formulation of the model is
provided on the slide. We will be using this formulation in the estimation approach
introduced in this lesson.
Slide 4:

Estimating the model parameters in multiple linear regression is similar to the approach
we learned for estimating the parameters of a regression model with a single predicting
variable. Specifically, we minimize the sum of least squares, where the sum of least
squares is the sum of the squared differences between the observed responses yi and
the expected responses, or the linear combination β0 + β1x1 + … + βpxp. We can re-write
it in matrix format as (Y - Xβ) transpose times (Y - Xβ), where Y is the stacked vector of
responses, X is the design matrix, and β is the stacked vector of parameters, all defined in the
previous slide.

If we use linear algebra to minimize the sum of least squares error, we obtain a system
of equations with p+1 equations and p+1 unknowns, specifically the beta coefficients,
including the intercept.

To solve this system of equations for the beta coefficients, we need to assume that XTX
is invertible. If this matrix is invertible, then the estimator is β hat as provided on the slide,
which is the inverse of XTX multiplied by the transpose of X and by Y. I will return later
to this very important condition that XTX needs to be invertible when I introduce an
important concept in multiple linear regression, multicollinearity.
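
As an illustration, the closed-form estimator can be computed directly in R and compared against lm(); this is a sketch assuming a fitted lm object called model (the object names are illustrative):

X <- model.matrix(model)                      # design matrix, including the column of ones
y <- model.response(model.frame(model))       # observed response vector
beta.hat <- solve(t(X) %*% X) %*% t(X) %*% y  # (X^T X)^{-1} X^T y
cbind(beta.hat, coef(model))                  # the two columns should agree up to numerical precision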
Slide 5:

We will review here some notation on the estimated beta coefficients. Again, we begin
with the true parameters, beta_0, beta_1 to beta_p, and their estimators, beta_0 hat,
beta_1 hat to beta_p hat, derived using the approach provided in the previous slide. The
vector of the estimated betas is a function of the design matrix X and the response
variable Y. Recall that we assume the response variable Y to be a random variable,
hence the estimated beta is also a random variable. We will derive the statistical
properties of the estimated beta in a different lesson when I introduce statistical
inference in multiple linear regression. Here I will highlight the difference between the
estimator beta hat which is a function of the random variable Y and the estimate of the
beta, where we replace Y (the random Y) with observed or realized response values
defined here by small ‘y’. The estimated values for the regression coefficients given the
observations y_1,..,y_n are fixed values, used in the interpretation of the model
relationships. In this course, we will use beta hat interchangeably to denote both the
estimator of beta, a function of the random Y, and the estimate of beta, an
approximate value that is a function of the observed y. However, you will need to
remember the difference between the two formulations as provided in this slide.
Slide 6:

The fitted values are derived from the linear model by replacing the true coefficients
with the estimated ones, specifically, given by X Beta hat where the estimated
regression coefficients are derived as in the previous slides.

We can re-write the fitted values as H times Y, where H denotes the matrix multiplying
the observed vector of response values, y; H is also called the hat matrix. The
hat matrix is only a function of the design matrix X and thus depends on the design
only.

To obtain the residuals, we take the difference between observed and fitted. We can
rewrite the residuals as (I – H) * Y, where I is the identity matrix and H again is the hat
matrix.

The estimated variance is now the sum of squared errors divided by n-p-1. The variance
estimator for multiple linear regression is similar to the estimator for simple regression
except that now we're using n-p-1 in the denominator rather than n-2.

Assuming that the error terms are normally distributed, the sampling distribution of the
estimated variance, or so-called mean squared errors or MSE, is a chi-squared
distribution with n-p-1 degrees of freedom.
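
Continuing the earlier sketch (X and y as defined there; all names are illustrative), the quantities on this slide can be computed directly:

H <- X %*% solve(t(X) %*% X) %*% t(X)    # hat matrix, a function of the design only
fitted.vals <- H %*% y                   # fitted values, H y = X beta.hat
res <- (diag(nrow(X)) - H) %*% y         # residuals, (I - H) y
n <- nrow(X); p <- ncol(X) - 1           # p predictors plus the intercept column
sigma2.hat <- sum(res^2) / (n - p - 1)   # estimated variance (MSE)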
Slide 7:

Let’s look closer at the estimator for the variance, specifically its statistical properties.

We do not have the “true” error terms because we don't know the true betas. But we
can replace the error terms with the residuals defined on the previous slide, I will call
them here the epsilon hat.

Thus, we use the sample variance of the residuals to estimate the variance of the error
terms, sigma squared.

The only difference from using the sample variance of the residuals vs the sample
variance formula you have learned in basic statistics is the use of n-p-1 in the
denominator, that is n-p-1 degrees of freedom. We will review this in the next slide.

I will note here that the sampling distribution of the variance estimator is a chi-squared
distribution. This is because the residuals also are going to be normally distributed when
the response data are normally distributed. We reviewed this distribution when I
introduced the variance estimator for the simple linear regression model.
Slide 8:

Why do we use ‘n-p-1’? This is because, when we replace the error terms with the
residuals, we also replace p+1 coefficient parameters.

More specifically, we replaced β0 with β0 hat, β1 with β1 hat, and so on. Thus, we lose
p+1 degrees of freedom.

To conclude, the sampling distribution of the variance estimator or MSE is chi-square
with n-p-1 degrees of freedom.

Slide 9:
The interpretation of the estimated regression coefficients in multiple linear regression
is similar to that in simple linear regression, except that we need to consider multiple
predicting variables in the model explaining the response jointly. As before, the
estimated intercept is an estimate of the expected value of the response variable when
all predictors equal zero. The estimated value for one of the regression coefficients βi
represents the estimated expected change in y associated with one unit of change in
the corresponding predicting variable, Xi, holding all else in the model fixed. This
interpretation applies to all regression coefficients β1 through βp.

Thus now, we need to specify that there are other predicting variables in the
model held fixed while we vary one of the predictors. This is a very important
aspect of the interpretation of the regression coefficients in multiple linear regression as
I'll explain next.

Slide 10:
Modeling the relationship of a predicting variable to a response variable can be done
with a simple linear regression model as we learned in Module 1, for example, when we
fit the relationship between advertising expenditure and sales. But it also can be done
in a multiple regression context when we add other predicting variables in the model.
Thus, I will differentiate between so-called marginal and conditional models. The
marginal model, or simple linear regression, captures the association of one
predicting variable to the response variable marginally, that means without
consideration of other factors. The conditional or multiple linear regression model
captures the association of a predicting variable to the response variable, conditional of
other predicting variables in the model.

It’s important to highlight that the estimated regression coefficients for the conditional
and marginal relationships can be different, not only in magnitude but also in sign or
direction of the relationship. Thus, the two models used to capture the relationship
between a predicting variable and a response variable will provide different estimates of
the relationship. I will illustrate this aspect with a specific example in a different lesson.

Slide 11:
But why do we need multiple linear regression when we can use simple linear
regression? Often, the relationship between a response and a predicting variable is
dependent on other factors and cannot be singled out to be estimated using a simple
linear regression. Multiple linear regression allows for quantifying the relationship of a
predicting variable to a response when other factors vary. One of the dangers of using
multiple linear regression without much knowledge of fundamentals about regression is
the interpretation of the model results. This is particularly important in the context of
making causal statements when the setup of the regression does not allow so. Causality
statements can only be made in a controlled environment such as randomized trials or
experiments. In experimental studies, analysts can change the setting of one particular
factor in the environment, holding others fixed thereby isolating its effect. But such
isolation is not possible with observational data. Most of the data used in regression
analysis will likely come from observational studies, which generate data without the
ability to control biases and correlations among the observations, unless the regression
model carefully considers biases in the observed sample. Multiple regression provides a
statistical version of this practice, controlling for the bias, through its ability to
statistically represent a conditional action that would otherwise be impossible. However,
interpretation of relationships under multiple regression will need to be carefully
considered as part of the entire multiple regression model.

Let's look at a very specific example. We take a sample of college students and
determine their college GPA as well as their high school GPA and their SAT score. We
then build a model of college GPA as a function of high school GPA and SAT:
COLGPA = 1.3 + 0.7 HSGPA - 0.0003 SAT

Based on this model, it is tempting to say that the coefficient for SAT must have the
wrong sign, because it seems to say that higher values of SAT are associated with lower
values of college GPA. However, what it says is that higher values of SAT are associated
with lower values of college GPA, on the condition that high school GPA is held fixed.
High school GPA and SAT are correlated with each other. Thus, changing SAT by one
unit, holding high school GPA fixed, may not actually happen; it may not even be possible.
This simple example illustrates the fact that we cannot make direct or causal
statements about how SAT impacts college GPA. We can only say that there is an
associative relationship. We need to be careful in interpreting the regression
coefficients when there are other predictive factors in a model that are correlated to
SAT, such as high school GPA.

To conclude this slide, the coefficients of a multiple regression must be interpreted in
the context of the other predictors in the model. We do not want to interpret them
marginally.

Slide 12:
More explicitly or practically speaking, multiple linear regression allows including
variables to explain the variability in the response variable, taking different roles.
Particularly, I differentiate factors into controlling, explanatory, or predictive factors.

Controlling variables can be used to control for selection bias in a sample. They're
used as default variables to capture more meaningful relationships with respect to other
explanatory or predicting factors. They are used in regression for observational studies,
for example, when there are known sources of selection bias in the sample data. They
are not necessarily of direct interest, but once a researcher identifies biases in the
sample, he or she will need to correct for those biases and will do so through controlling
variables.

Explanatory variables can be used to explain variability in the response variable.
They may be included in the model even if other similar variables are in the model.

Predictive variables can be used to best predict variability in the response regardless
of their explanatory power. Thus, when selecting explanatory variables, the objective is
to explain the variability in the response. Whereas when selecting predictive variables,
the objective is to predict the response.
Summary:
To conclude, in this lesson, we learned about the estimation approach in multiple
regression analysis, derived the statistical properties of the variance of the error terms
and reviewed the interpretation of multiple linear regression.
2.3. Regression Parameter Estimation: Data Example
The topic of this lesson is parameter estimation illustrated with a data example. We'll
learn how to implement a multiple linear regression model in R, and how to interpret it.

Slide 3:
Let's return to the example in which we study the relationship between advertisement
expenditure and sales. We now consider other predictive variables, divided into
quantitative and qualitative predicting variables.

Slide 4:
This is a set of questions we may be interested in addressing in this example.

A. Fit a linear regression with all the predictors and estimate the regression
coefficients. What are the estimated regression coefficients and the estimated
regression line?
B. Interpret and compare the estimated coefficients from the conditional model
versus the marginal model. Having analyzed the simple regression in Module 1,
we want to compare it to the conditional model.
C. Learn how the predictions of those two models will be different, and which of the
predictions are more meaningful.
a. What does the model predict as the advertisement expenditure increases
for an additional $1,000 using the full regression model?
b. Is the prediction different when compared to the prediction from the
simple linear model with just the advertisement expenditure variable?
D. Compare the estimated error variance under the conditional versus the marginal
model.
a. What is the estimate of the error variance?
b. Is it different from the simple linear regression model? Why?

Slide 5:
We will again use R to perform a multiple linear regression analysis. We have one
qualitative factor and thus we use the as.factor() command to convert the column
corresponding to the region into a categorical variable in order to specify in the lm()
command that this is a qualitative variable.
Using the lm() command to fit the regression model, with the response variable ‘sales’
regressed against all the predictors in the data, instead of specifying each variable
explicitly, I have included here a dot after the tilde. This tells the lm() command to
include all the columns (except the response) as predictors. When using this approach
to fit the regression, it is thus important to first convert all qualitative predictors in the
dataset using the as.factor() command. The output for this model is on the slide. Let’s
review the output.
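
A hedged sketch of what this might look like in R (the file name and column names are illustrative, not the course's exact code):

data <- read.csv("sales.csv", header = TRUE)
data$region <- as.factor(data$region)   # declare the qualitative predictor before fitting
model <- lm(sales ~ ., data = data)     # the dot includes all remaining columns as predictors
summary(model)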

Slide 6:

The highlighted column from the output consists of the estimated regression
coefficients.

The estimated β coefficient for the advertising expenditure is equal to 1.4092. We would
interpret this estimate to be the expected additional gain in sales, in thousands of
dollars, for each additional $100 expenditure in advertisement, while holding all other
predictors fixed. A reminder here; the unit for sales is $1000 and the unit for advertising
expenditure is $100.

We can contrast this with the estimated coefficient from the marginal model, which was
2.772, larger than the estimated coefficient from the conditional model. The difference
in interpretation is that the estimated coefficient from the conditional model is conditional
on the other predictors included in the model to explain the variability in the response,
while the estimated coefficient from the marginal model does not account for other
predicting factors.
Slide 7:

Comparing the predictions of these two approaches, the conditional model predicts that
an additional $1,000 in advertising expenditure will yield $14,000 in additional sales,
while the marginal model predicts a larger increase in sales, specifically, $27,700, for
the same increase in advertising expenditure. Which model is more meaningful?
Because sales vary with other factors, that means they are impacted by other factors,
the interpretation based on the conditional model is more meaningful.

Under the full model, the estimated variance is 55.57², derived as the squared residual
standard error where the residual standard error is provided in the output. Under the
simple linear model, the residual standard error was 101.4. The estimated variance
under the multiple linear regression model is thus much smaller than the estimated
variance from the simple linear regression model. This is because, when we include
multiple variables in a model, the model better explains variability in the response as
compared to the model when we include only one variable. Thus, the remaining
variability that is unexplained is smaller for the multiple regression model than for the
simple regression model.

Slide 8:
In our second example, we will examine compositional and demographic variables to
determine to what extent these characteristics impact SAT scores. All of the variables in
this study are used as explanatory factors except for two variables, which are used as
controlling variables. If we want to rank states by mean response (the mean SAT
score), we need to first control for these two factors. Moreover, if we want to study the
impact of the other explanatory factors, again, we need to control for these two factors.
The first controlling variable is rank, the ranking of the students taking the SAT, used to control
for this bias across all the states. The second controlling variable is takers. Researchers
of the study from which these data were derived noted that states with high average
SAT scores had low percentage of students taking the exam. This is because Midwest
states used to administer different tests to students going to in-state colleges. Only
their best students planning to attend out-of-state colleges took the SAT. As the
percentage of takers increases for other states, so does the likelihood that the takers
include lower-qualified students. Thus, we need to control for the selection bias due
to the ‘takers’ variable also.

Summary:
To summarize, in this lesson, I introduced the estimation approach along with the
interpretation of the fitted model in multiple regression analysis using data examples.
2.4 Statistical Inference
In this lesson, we will cover statistical inference on the regression parameters in
multiple linear regression, specifically, the statistical properties of the estimated
regression coefficients, along with procedures on how to estimate confidence intervals.
We will also learn hypothesis testing for the significance of the overall regression or
subsets of the regression coefficients.

Slide 3:

Let's begin first with the statistical properties of the regression estimators. The
expectation of the vector of the estimators of the regression coefficients is equal to the
vector of the true regression parameters. We can derive this similarly to simple linear
regression; again, this says that the estimated regression coefficients are unbiased.
The variance of the estimated regression coefficients is provided also on the slide; it is
derived as the inverse of X transpose X multiplied by sigma squared, where X is the
design matrix. Because beta hat is a vector of the estimated regression parameters,
the variance is a matrix, called covariance matrix. The diagonal of the covariance
matrix includes the variances of the coefficients, and off the diagonal includes the
covariances between pairs of the estimated regression coefficients.

I will remind you again that the condition for obtaining the estimated regression
coefficients by minimizing the sum of least squares is that X transpose X is invertible; if
this condition does not hold, the variance of the estimator is not finite. I will return to
this condition when I introduce multicollinearity.

Similar to simple linear regression, β hat is a linear combination of the Ys. Assuming the
error terms are normal, i.e., the response variables are normally distributed, then β hat
has also a normal distribution with the mean and the covariance matrix as provided on
the slide. Again, because β hat is a vector, then this is a multivariate normal
distribution.
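
In compact notation, the two properties described above are:

\[ E(\hat{\beta}) = \beta, \qquad \mathrm{Var}(\hat{\beta}) = \sigma^2 (X^T X)^{-1}, \]

and, under the normality assumption, \(\hat{\beta}\) follows a multivariate normal distribution with this mean vector and covariance matrix.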

Slide 4:

The covariance matrix denoted by Σ on the previous slide depends on sigma squared,
which is unknown. We thus replace sigma squared with its estimator given by the mean
square error or MSE, which is equal to the sum of squares of the residuals divided by
n-p-1.

Now, when we replace sigma squared with its estimator, the sampling distribution for
the individual regression coefficients becomes a t-distribution with n-p-1 degrees of
freedom, where n-p-1 comes from the degrees of freedom of the variance estimator.

Slide 5:
Again, the sampling distribution of the estimated regression coefficients is a t-
distribution, and based on this distribution we can now derive confidence intervals for
the regression coefficients. To derive a (1 – alpha) confidence interval for βj, we center it
at βj hat plus or minus the t critical point multiplied by the standard deviation of the estimator βj
hat, that is, the square root of the variance of βj hat.

We use this confidence interval to evaluate whether βj is statistically significant by
checking whether 0 is in the confidence interval. If it is not, we conclude that βj is
statistically significant. Alternatively, we can evaluate whether a coefficient is
statistically significant using hypothesis testing, as presented next.
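
Written out, the confidence interval described here is:

\[ \hat{\beta}_j \pm t_{\alpha/2,\, n-p-1} \sqrt{\widehat{\mathrm{Var}}(\hat{\beta}_j)}, \]

where \(t_{\alpha/2,\, n-p-1}\) is the critical point of the t-distribution with n-p-1 degrees of freedom.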

Slide 6:

We can use statistical inference using hypothesis testing to test for statistical
significance of individual βj, specifically, using the t-test. Importantly, this test measures
the statistical significance of βj given all other predicting variables in the model when
using a multiple linear regression. Here, the null hypothesis is that the coefficient is 0
versus the alternative hypothesis that it is not. Similar to simple linear regression, the t-
value is βj hat minus 0 divided by the standard deviation of βj hat.

If this t-value is larger in absolute value than the t critical point, we reject the null
hypothesis and conclude that the coefficient is statistically significant.
Slide 7:

How will the procedure change if we test whether the coefficient is equal to a constant
‘b’? Again, the t-value looks very similar, but we replace 0 with the null value b. We
reject the null hypothesis if the t-value in absolute value is larger than the t critical point
with n-p-1 degrees of freedom.

Alternatively, we can make a decision using the p-value, computed similarly as for the
statistical significance tests presented in Module 1. If the p-value is small, for example
smaller than 0.01, we reject the null hypothesis that βj is equal to the value b.

Slide 8:

How does the hypothesis testing procedure change if we test whether the coefficient is
statistically positive or statistically negative? If we want to test for a positive
relationship, then the p-value is the probability of the upper or right tail from the t-value
of the t-distribution with n-p-1 degrees of freedom. If we want to test for a negative relationship, the p-value
is the probability of the lower or left tail of the t-distribution.

Slide 9:

We also have an Analysis of Variance or ANOVA table for multiple linear
regression. Here we divide the variability in the response variable into the variability
due to the regression and the variability due to the errors. In the ANOVA table as
provided in the slide, we have multiple columns corresponding respectively to the
degrees of freedom, the sum of squares, the mean sum of squares and the F statistic.
Let’s understand each column.

The number of degrees of freedom for the variability due to the regression is equal to
‘p’, where ‘p’ is the number of predicting variables. The number of degrees of freedom
corresponding to the source of variability due to the residuals or error is n-p-1, which
corresponds to the degrees of freedom for the variance estimator. The number of total
degrees of freedom is the sum across those two, which is n-1.

The sum of the squared differences between the fitted values and the average across all
the responses is the sum of squares for regression or SSReg. The sum of squared
differences between the observations and the average is the sum of squares total or SST. The sum
of squared residuals or errors is SSE. The formulas for the sums of squares are provided
below the table. To obtain the corresponding mean sum of squares, we take the sum of
squares and divide it by the corresponding degrees of freedom.

The F-statistic is the ratio between the two mean sum of squares.
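
In symbols, the decomposition and the F-statistic described in this table are:

\[ \mathrm{SST} = \mathrm{SSReg} + \mathrm{SSE}, \qquad F = \frac{\mathrm{MSReg}}{\mathrm{MSE}} = \frac{\mathrm{SSReg}/p}{\mathrm{SSE}/(n-p-1)}, \]

with \(\mathrm{SSReg} = \sum_i (\hat{y}_i - \bar{y})^2\), \(\mathrm{SSE} = \sum_i (y_i - \hat{y}_i)^2\), and \(\mathrm{SST} = \sum_i (y_i - \bar{y})^2\).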
Slide 10:

We will use the ANOVA introduced in the previous slide to test the hypothesis that the
regression coefficients (excluding the intercept) are zero versus the alternative
hypothesis that at least one of the regression coefficients is not equal to zero, meaning
that at least one of the predictors included in the model has predictive power. This is
called the test for overall regression because it indicates whether the overall regression
has any predictive or explanatory power for the response variable, specifically, if at
least one of the predicting variables explains the variability in the response.

To perform this test using ANOVA, we can use the F-test just like in the ANOVA table.
The F-statistic is the ratio between the mean sum of squares for regression and mean
sum of squares of errors or residuals. We reject the null hypothesis if the F-statistic is larger than the F
critical point with p and n-p-1 degrees of freedom, with alpha being the significance
level of the test.

We can also use the p-value, which is the probability of the tail to the right of the F-statistic of
the F-distribution with p and n-p-1 degrees of freedom. If this p-value is small, then we
reject again the null hypothesis of all zero regression coefficients.

Rejecting the null hypothesis means, again, that at least one of the coefficients is
different from 0 at the alpha significance level, hence the overall regression is
statistically significant.

Slide 11:
In the next slides I will focus on another statistical inference approach in multiple linear
regression, specifically, testing for subsets of regression coefficients. For this test, we
will first learn how to decompose the variability in the response variable more
granularly. First, we use the ANOVA in multiple linear regression to decompose the
variability in the response, the sum of square total, into the sum of the square
regression of the full model plus the sum of square error of the full model. Because in
multiple linear regression, we explain the variability in the response using multiple
predicting variables, we can further decompose the sum of squares for regression into
multiple components.

First, SSReg(X1) is the sum of squares for regression due to X1 alone, specifically for the
regression explained by using X1 to predict Y, or the marginal model in X1.

The next component, SSReg(X2|X1), is the extra sum of squares explained by using X2 in
addition to X1 to predict Y. This means that we already have X1 in the model, and now we
are adding X2 to the model; this is the extra sum of squares for adding X2 to the
model.

The next component, SSReg(X3|X1,X2), is the extra sum of squares explained by
using X3 in addition to the predicting variables X1 and X2 to predict Y.

The last component, SSReg(Xp|X1,...,Xp-1) is the extra sum of squares explained by using
Xp in addition to all the other predictors in the model to predict Y. It's important to take
into account the order in which the predicting variables are added to the model: we
first add X1, then X2, then X3, and so on. If we consider a different order in which the
variables enter the model, then the decomposition of the sum of squares for regression
will be different from the one provided here.
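
In R, these sequential (order-dependent) sums of squares are what the anova() summary of a single fitted model reports; a hedged sketch with illustrative names:

m <- lm(y ~ x1 + x2 + x3, data = dat)   # predictors entered in this order
anova(m)   # rows give SSReg(x1), SSReg(x2|x1), SSReg(x3|x1,x2), then the residual SSE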
Slide 12:

Let's see how we can use this decomposition to evaluate the significance of subsets of
coefficients. For example, if we want to address the question of whether X1 alone
significantly aids in predicting Y, we can compare the sum of squares for regression to
the sum of squared errors of the model including X1.

If we want to answer whether the addition of X2 significantly contributes to the
prediction of Y after we account (or control) for the contribution of X1, we can compare
the extra sum of squares for regression, which is the additional sum of squares for
regression generated by adding X2 to the model that already has X1, versus the sum of
squared errors.

We will use the same method to determine whether the addition of X3 contributes to the
prediction of Y after we control for or explain the contribution of X1 and X2.

The same applies if we want to determine whether the addition of the last variable Xp
contributes to the prediction of Y after we control for or explain the contribution of all
the other predicting variables.
Slide 13:

Formally, let’s now consider the full model with the predicting variables divided into two
groups, X’s and Z’s, and with β regression coefficients corresponding to the X predictors
and the 𝛾 (gamma) coefficients for the Z predictors. For example, the X's can be
controlling factors, and Z's can be additional explanatory factors. We may want to test
the null hypothesis that all the gamma coefficients corresponding to the Z variables are
zero versus the alternative that at least one of the gamma coefficients is not zero. The
interpretation of the null hypothesis is that the addition of the Z variables does not
significantly explain the variability in the response given that the X variables are
already in the model.

To perform this test, we can use what we call the Partial F-test, which compares the
extra sum of squares for regression due to the addition of the Z variables to the model
that already has the X variables versus the sum of squared errors of the full model.

We reject the null hypothesis if this F-statistic is larger than the critical point of the F-
distribution with q and n-p-q-1 degrees of freedom where q is the number of additional
variables added to the model and n-p-q-1 is the number of degrees of freedom of MSE
of the full model.

More specifically, if we reject the null hypothesis, we conclude that at least one of the
gamma coefficients is statistically significantly different from zero, or that some or all Z
variables add predictive or explanatory power to the model already including the X
variables.
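
In symbols, the partial F-statistic described here can be written as:

\[ F = \frac{\left[\mathrm{SSE}(\text{reduced}) - \mathrm{SSE}(\text{full})\right]/q}{\mathrm{SSE}(\text{full})/(n-p-q-1)}, \]

where the numerator is the extra sum of squares for regression due to adding the q Z variables, divided by q, and the denominator is the MSE of the full model.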
Slide 14:

A special case of this test is for q=1 or adding only one Z variable to the model already
including the X variables. The null hypothesis is that the gamma regression coefficient
corresponding to Z is zero vs the alternative that it is different from zero. The
corresponding F-statistic is now the extra sum of squares for regression due to the
addition of the Z variable to the model divided by the MSE of the full model.

This test is in fact equivalent to the t-test for statistical significance of the gamma
regression coefficient.

Thus, the interpretation of the t-test for statistical significance is conditional on the
presence of the other predicting variables in the model. More specifically, if we reject
the null hypothesis, we conclude that the regression coefficient for which we perform
the test is statistically significant given that the other variables are in the model.
Equivalently, we interpret that the predicting variable corresponding to the regression
coefficient to be tested significantly explains the variability in the response variable
given that all the other variables are in the model. This interpretation is very important in
that we cannot and should not select the predicting variables that explain or predict
the variability in the response based on the t-tests for statistical significance because
the statistical significance depends on the other variables included in the model. I will
expand on this aspect further in Module 4 where I will introduce variable selection
approaches that can be used toward this goal.
Summary:
In this lesson we learned three different inference approaches, specifically, testing for
statistical significance of the regression coefficients, of the overall regression, and of a
subset of regression coefficients. We will illustrate these approaches with data
examples in the next lesson.
2.5. Statistical Inference: Data Examples
In this lesson, I will illustrate the implementation of statistical inference in multiple
linear regression with a data example using the R statistical software.

Slide 3:
We will apply the statistical inference approaches to the data example in which we're
interested in explaining the variations in the mean SAT score. The reason I selected this
example is the clear delineation of the variables into controlling and explanatory
variables.

Slide 4:
This is a set of questions we will address using statistical inference:

a. What is the sampling distribution of the estimated regression coefficient
corresponding to the ‘takers’ controlling factor?
b. Is this estimated coefficient statistically significant?
c. Obtain the 99% confidence interval for the same coefficient.
d. What is the F-statistic for the overall regression? Do we reject the null hypothesis
that all regression coefficients are zero?
e. Given the two controlling factors in the model, we will also test the null
hypothesis that the coefficients of the remaining explanatory variables are all zero.

Slide 5-6:
As always, first we read the data into R. The data file has a header, meaning the
columns already have names, so we set header equal to TRUE. Once again, we use the
lm() function to fit a multiple linear regression model with SAT as the response variable.
The R output of this fitted linear regression model is provided on the slide. The output
provides information about estimated regression coefficients, standard errors, T-values,
and the p-value of the statistical significance tests for the regression coefficients.
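
A hedged sketch of these two steps (the file name and the exact formula are illustrative; the variable names follow the list from an earlier lesson):

sat <- read.csv("sat.csv", header = TRUE)
model <- lm(SAT ~ takers + rank + income + years + public + expend, data = sat)
summary(model)   # estimated coefficients, standard errors, t-values, and p-values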

a. Let’s focus first on one controlling variable, ‘takers’, the first variable entering the model. The
estimated coefficient for takers is -0.48, and the standard error is 0.693. The
sampling distribution is a t-distribution with 43 degrees of freedom, corresponding to
n-p-1.
b. We can use the P-value in this output for evaluating the statistical significance of the
regression coefficient corresponding to ‘takers’, β1. The p-value for the coefficient
of ‘takers’ is greater than 0.1, meaning that we do not reject the null hypothesis that
the coefficient corresponding to this predictor is zero. However, as this is one of the
controlling factors, we would expect this predictor to impact the variation in the
mean SAT scores. Recall however that we interpret statistical significance in the
context of a multiple linear regression. That means, we conclude that the coefficient
is not statistically significant given that there are other predictors in a model, given
for example that the ‘rank’ predicting variable is in the model. We will learn later
that ‘takers’ and ‘rank’ are highly correlated.

Slide 7:

To estimate the confidence intervals for the individual regression coefficients, we use
the confint() command, requiring specification of the fitted model, the predicting
variable for which we want to estimate the confidence interval, and the confidence level
if different than the default 0.95.
The confidence interval for the regression coefficient corresponding to ‘takers’ takes
values between -2.3 and 1.3; because the interval includes the value 0, we conclude
that it is plausible for this regression coefficient to be zero given that all the other variables
are in the model.
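
A sketch of the corresponding call for question (c), assuming the fitted model object is named model:

confint(model, "takers", level = 0.99)   # 99% confidence interval for the 'takers' coefficient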

Please note that we do not discard this variable from the model because its regression
coefficient is not statistically significant. This is for two reasons. First, ‘takers’ is a
controlling factor hence we will keep it in the model even if not statistically significant.
Second, as highlighted before, we interpret statistical significance given other
predictors in the model.

Slide 8:

If we want to test for overall regression, we can use the F-test as introduced in the
previous lesson. The R output provides the F-value, which is 51.91, and the
corresponding P-value, which is approximately zero, meaning that at least one of the
predictive variables has predictive power or that the overall regression is statistically
significant.
Slide 9-10:

To test whether the explanatory factors, including income + years + public + expend,
add explanatory power in addition to the controlling factors, we can perform the partial
F-test using the R command anova(). First, we create a reduced model with only the
controlling factors: takers and rank, then use the anova() command to compare the two
models, the full model with all the variables (all the predictors) and the reduced model
with only the controlling factors.
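
A hedged sketch of this comparison (object names are illustrative; 'model' is the full fit from earlier):

reduced <- lm(SAT ~ takers + rank, data = sat)   # controlling factors only
anova(reduced, model)                            # partial F-test for the four added explanatory variables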

The output provides the partial F-value, which is 8.6221 and the P-value approximately
equal to 0. Because the P-value is approximately zero, we reject the null hypothesis that
the coefficients corresponding to the four predictors we're adding to the model are all
zero.

Slide 11:
This slide presents the formal implementation of the test following the derivations
provided in the previous lesson. We wanted to test whether the coefficients of the four
predictors we're adding to the model are equal to zero. The F-statistic is the extra sum
of squares for regression from adding the four predictors to the model that already has
the controlling factors divided by the mean sum of square of errors of the full model.
The P-value is the probability of the right tail from the F-value of the F-distribution with
4 and 43 degrees of freedom. Because the P-value is approximately zero, we reject the null
hypothesis. We conclude that at least one predictor among income, years, public and
expend is significantly associated with average SAT scores.

Summary:
To conclude, in this lesson we learned how to implement statistical inference using the
R statistical software for three different inference approaches, specifically, testing for
statistical significance of the regression coefficients, of the overall regression, and of a
subset of regression coefficients.
2.6. Regression Line & Prediction
In this lesson, we will learn about estimation and statistical inference of the regression
line in multiple linear regression. We will also learn about prediction of a new response,
again highlighting the difference between estimation and prediction in regression. I will
also illustrate these concepts with data examples.

Slide 3:

Similarly to simple linear regression, we will begin with x* which, in this case, is a vector
of values consisting of values for all predicting variables. We would like to estimate the
mean response, y given x*. The estimated regression line is the regression line where
we replace the beta coefficients with the estimated regression coefficients, more
specifically, the estimated regression line is β0 hat plus β1 hat times x1*, and so on. In
matrix format, we can write this as x* transpose times the vector of the estimated
coefficients beta hat.

Because the estimators of the beta coefficients are normally distributed, so is ŷ. If we
know the expected value and the variance of ŷ, we can use this normal distribution in
order to make statistical inferences on ŷ.
Slide 4:

The expectation of the estimated regression line or the mean response in x* is the sum
of the expectations of the estimators of betas times the corresponding predicting
values. Because estimators of betas are equal to the true parameters, that is, they are
unbiased estimators, the expectation of the mean response is the regression line itself,
thus an unbiased estimator.

The variance is also provided on the slide. The variance depends on the design matrix
through the inverse of X transpose X. If there is strong correlation between the
predictors, then the values in this matrix can be very large. Thus, the estimated
regression line will have high uncertainty under strong correlation or under near linear
dependence among the predicting variables. The variance also depends on the variance
of the error terms, sigma squared, which is unknown.

If we replace the variance of the error terms with its estimator, the MSE, the resulting
sampling distribution of the estimated regression line becomes a t-distribution with n-p-
1 degrees of freedom, where the degrees of freedom will come from the sampling
distribution of the estimated variance of the error terms, discussed in the prior lesson.
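
In symbols, the variance described above is:

\[ \mathrm{Var}(\hat{y}^*) = \sigma^2 \, x^{*T} (X^T X)^{-1} x^{*}, \]

and replacing \(\sigma^2\) with the MSE gives the estimated variance used for the t-based inference.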

Slide 5:

Similar to the derivation of the confidence interval for the regression coefficients, a
confidence interval for the mean response is centered at the estimated regression line
plus or minus the t-critical point times the standard deviation of the estimator. The
interval length depends on x*, but also on the design matrix through the inverse of the
matrix XTX. Thus, when we have correlation among the predictors, the confidence
intervals for the estimated regression line will be wide.

Slide 6:
Prediction is one of the objectives in regression analysis. While the predicted response
is derived similarly to the estimated regression line or mean response, prediction is not
the same as estimation.

This is not only due to the interpretation, but also due to the uncertainty level in the
predicted mean response. Specifically, the uncertainty in the estimation of the
regression line comes from the estimation alone of the regression coefficients. Whereas
for prediction, the uncertainty comes from the estimation of the regression coefficients
and from the variability of the new observation itself.

Slide 7:
How does this translate into the variance of the prediction? The variance consists of two components. One comes from the estimation of the regression coefficients. The other is due to the variability of a new observation at x*, which is sigma squared under the assumption of constant variance. If we add those two variances together, we obtain the variance of the predicted response.

The difference between the estimated regression line and the predicted line is in the
addition of a sigma squared which is, again, due to the variability of a new
measurement.

Slide 8:

The confidence interval for the predicted mean response or regression line looks very
much like the confidence interval for the estimated mean response, except that now we
have an additional sigma squared hat in the variability of the predicted regression line.
Note that the predicted regression line is the same as the estimated regression line at
x*. However, the prediction confidence interval is wider than the estimation confidence
interval because of the higher variability in the prediction.
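In formulas, the two intervals differ only by the extra sigma squared term under the square root (again a sketch in the notation above, with the critical point taken from the t-distribution with n-p-1 degrees of freedom):

```latex
\text{CI for the mean response: } \;
\hat{y}^{*} \pm t_{\alpha/2,\,n-p-1}\;\hat{\sigma}\sqrt{\mathbf{x}^{*\top}(X^{\top}X)^{-1}\mathbf{x}^{*}}
\\[4pt]
\text{PI for a new response: } \;
\hat{y}^{*} \pm t_{\alpha/2,\,n-p-1}\;\hat{\sigma}\sqrt{1 + \mathbf{x}^{*\top}(X^{\top}X)^{-1}\mathbf{x}^{*}}
```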

If we're interested in prediction intervals for ‘m’ different new x*s, then we'll need to
adjust the critical point for the joint prediction intervals. That is, for the joint or
simultaneous prediction intervals for a set of x*s of the predicting variables, we'll use a
different critical point as on the slide. This adjustment will make the prediction intervals
wider than if we were to consider only one prediction. I will note here, however, that the R statistical software does not use this adjustment in the predict() command, so keep in mind that prediction intervals derived with R for multiple x*s will be narrower than the simultaneous intervals introduced here.

Slide 9:

We will return now to the data example where we are interested in the relationship
between advertising expenditure and sales.

Slide 10:
The questions we will address here are:

● What is the average (mean) estimated sales and the corresponding standard
deviation across all offices with the same characteristics as those for the first office?
What is the 95% confidence interval for this mean response?
● What sales would you predict for the first office if its competitor’s sales increased to $303,000, assuming everything else is fixed? What is the standard deviation of this prediction? What is the 95% prediction interval?
Slide 11:

To address the first set of questions, we will need to get the predictors’ data for the first
office, which is the first row in the data matrix, defined here as x*. For estimating the
standard deviation of the mean response for this x*, we will use the formula introduced
in a previous slide. To estimate the variance of the residuals (sigma squared hat), we
can use the summary of the fitted model. To construct the design matrix X, we can use
the R command ‘model.matrix’ with the fitted model as input. We then assemble all
these together using the variance formula, then we take the square root of this value to
get the standard deviation.

To obtain the confidence intervals and the estimated mean response, we use the
predict() R command. In this command, we need to input the fitted model, the new
data, and we must specify the type of the interval. In this case, we want a "confidence"
interval since we are interested in the average sales across all offices with the same
characteristics as the first office.
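A minimal R sketch of this computation is below. The object names (model, advert) and the layout of the data frame (response in the first column, first office in the first row) are placeholder assumptions, not the exact code shown on the slide.

```r
# model is the fitted lm() object, e.g. model <- lm(Sales ~ ., data = advert)
x.star    <- advert[1, ]                       # data for the first office
sigma.hat <- summary(model)$sigma              # estimated sd of the error terms (sqrt of MSE)
X         <- model.matrix(model)               # design matrix, including the column of ones

# standard deviation of the estimated mean response at x*
xvec    <- c(1, as.numeric(x.star[, -1]))      # drop the response (assumed in column 1), add intercept
sd.mean <- sigma.hat * sqrt(as.numeric(t(xvec) %*% solve(t(X) %*% X) %*% xvec))

# estimated mean response with its 95% confidence interval
predict(model, newdata = x.star, interval = "confidence", level = 0.95)
```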

Slide 12:
If we want the estimated mean response for sales, the value is 934.77 units of sales.

We can get the estimated standard deviation as derived here, and the lower and upper
bound confidence interval using the predict() R command.

How do we interpret this output? For offices with the same characteristics as the first office, the estimated sales are on average $934,770, with a lower bound of $865,000 and an upper bound of about $1 million.
Slide 13:

To address the second set of questions, we will perform a similar implementation, except that we are interested in prediction. For prediction, we need to change the
competitors’ sales in the data of the first office. We will change the value for the
competitor’s sales to 303 units because their sales increased to $303,000. To estimate
the standard deviation, we can use a similar approach as in the previous slide, except
that we need to add a sigma hat squared. We will use the predict() R command for the
prediction interval, with the interval type "prediction".
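Continuing the same placeholder names, a hedged sketch of the prediction step could look like this (the column name Compet for the competitor's sales is hypothetical):

```r
x.new        <- advert[1, ]
x.new$Compet <- 303                            # competitor's sales increased to $303,000

# standard deviation of the prediction: the variance gains an extra sigma^2 term
xvec.new <- c(1, as.numeric(x.new[, -1]))      # response assumed in column 1
sd.pred  <- sigma.hat * sqrt(1 + as.numeric(t(xvec.new) %*% solve(t(X) %*% X) %*% xvec.new))

# predicted sales with its 95% prediction interval
predict(model, newdata = x.new, interval = "prediction", level = 0.95)
```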

Slide 15:
This yields a predicted value of 911.05 units of sales with a standard deviation of 62.62, and the prediction interval takes values between 775.94 and 1,046.16. This means that, if the competitor’s sales were to increase to $303,000, the predicted sales would decrease by $23,719. Because this is a prediction, the standard deviation is larger than for the estimated mean response.

Summary:
In this lesson, we learned about estimation of the regression line and prediction of new
response and illustrated these concepts using a data example.
2.7. Assumptions and Diagnostics
In this lesson, you will learn how to evaluate the assumptions of multiple linear
regression based on the statistical properties of the residuals. I'll also discuss
transformations to improve the fit of the regression.

Slide 3:
In multiple linear regression, the data consist of the response variable Y, and a set of p
predicting variables. The model is a linear relationship with respect to the predicting
variables, plus the error term:

The assumptions in multiple regression are:

● Linearity assumption, meaning the relationship between Y and Xj is linear for all
predicting variables.
● Constant variance assumption, meaning the variance of the error terms is the
same across all error terms.
● Independence assumption, meaning the error terms are independent random
variables.
● Normality assumption, meaning the error terms are normally distributed.

Slide 4-5:
Here I will contrast the properties of the error terms and of the model residuals. The
expectation of the error terms is 0, and the variance is sigma squared. If we stack up
the error terms into a vector, the expectation is a vector of zeros and the variance is a
covariance matrix equal to sigma squared times the identity matrix, meaning the error
terms are uncorrelated.

The residuals are the differences between observed and fitted values, and they are
used as proxies of the error terms. The expectation of the residuals is still the vector of
zeroes just like the error terms; this is from the first assumption, the error terms have
mean zero. However, the covariance matrix of the residuals is sigma squared times I
minus H, where big I is the identity matrix and H is the hat matrix, introduced in a
previous lesson. The hat matrix depends only on the design matrix X.

This means that the variance of the i-th residual term, epsilon hat i, is sigma squared times 1 minus hii, where hii is the i-th element on the diagonal of the hat matrix.

Thus, while the error terms have constant variance, the residuals do not. The variance of epsilon hat i, again, is sigma squared times 1 minus hii, which depends on ‘i’, and hence is not the same across all residuals since the hii’s are not equal.

Thus, if we want to use the residuals for evaluating the model assumptions, we need to
standardize them, specifically, divide the residuals by their standard deviation.
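In symbols, the standardized residuals used in the diagnostic plots are (a sketch consistent with the quantities described above):

```latex
r_{i} = \frac{\hat{\varepsilon}_{i}}{\hat{\sigma}\sqrt{1 - h_{ii}}},
\qquad \hat{\varepsilon}_{i} = y_{i} - \hat{y}_{i}.
```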

Slide 6-7:

To perform the residual analysis for evaluating the assumptions, we can use various
graphical displays, similarly to simple linear regression. For example, the scatterplots of
the residuals against each individual predicting variable can be used to evaluate
linearity. The plot of the residuals against the fitted values can be used to evaluate the
assumption of constant variance and of uncorrelated errors (a proxy for
independence). The normality plot and the histogram can be used to evaluate the
normality assumption. All in all, this means that we use diagnostics to evaluate the assumptions, which is equivalent to assessing Goodness-of-fit (GOF). I’ll come back to this later in this course when we compare Goodness-of-fit with the performance of the regression model.

I'll also point out a few important aspects of the residual analysis that you need to
remember when you evaluate the model assumptions. We evaluate the normality
assumption using the residuals, not the response variable. It is possible for
example that if you were to use the histogram of the response variable to evaluate
normality, you may find bi-modality in the distribution, which is not necessarily an
indication that the normality assumption does not hold but it could be an indication that
the response variable can be explained by a categorical variable, for example.

We also do not check whether the predicting variables are normally distributed.
However, if the distribution of a predicting variable is highly skewed, it is possible that
the linearity assumption with respect to that variable will not hold. Thus, you'll have to
consider transformations to improve the linearity between the response and the
predicting variable, but not to improve the normality of the predicting variable.

Slide 8:

Consider this plot of the residuals against one predicting variable. You can see that we have a pattern here showing that there is a nonlinear relationship between X and Y. You will have to evaluate the relationship between the response variable and each quantitative predicting variable included in the model. If we see such a pattern in the plot of the residuals against one or more of the X’s, the linearity assumption does not hold.
Slide 9:

Here is another departure from the model assumptions, this time from the constant variance assumption. This plot shows an example of the residuals against the fitted values, in which the residuals show increasing variability as the fitted values increase.

This means that the sigma squared is not constant or that the assumption of constant
variance does not hold.

Slide 10:

This is a third example of a possible departure from the model assumptions. The residuals now are grouped into two separated clusters, which means that the residuals may be correlated due to some clustering effect, for example geographic proximity of the locations where the responses were observed.

Keep in mind that residual analysis cannot be used to check for the
independence assumption. Recall, the assumption is independent errors, not
uncorrelated errors. But all we can assess with the residual analysis is uncorrelated
errors. If the data are from a randomized trial, independence is implicit. But most data in regression analysis come from observational studies, and thus independence is not guaranteed. In those cases, residual analysis is used to assess uncorrelated errors, not independent errors.

Slide 11:
For checking normality, we can use the quantile plot or normal probability plot,
plotting the theoretical quantiles of the normal distribution versus the empirical
quantiles of the residuals in such a way that the points should form a straight line.
Departures from the straight line indicate departures from normality.

If the residuals are normal, then the quantiles of the residuals will line up with the
normal quantiles, thus we should expect that they follow a straight line. Departure from
a straight line could be in the form of a tail, which is an indication of either a skewed
distribution, or a heavy-tail distribution. We reviewed this plot in Module 1 with some
examples.

Slide 12:
Another approach to check for normality is using the histogram plot. Histograms are
often used to evaluate the shape of a distribution. In this case, we would plot the histogram of the residuals and identify departures from normality. Examples of departures from the normality assumption include skewness in the shape of the distribution, multi-modality (that is, when we have two or more modes in the distribution), or gaps in the data. I suggest using both the normal probability plot and the histogram to evaluate normality.
Slide 13:

If some of the assumptions do not hold, then we interpret that the model fit is
inadequate, but it does not mean that the regression model is not useful. For
example, if the linearity does not hold with respect to one or more predicting variables,
then we could transform the predicting variables to improve the linearity assumption.
This is generally a trial-and-error exercise, although sometimes you may just need to fix a curvature in the relationship, which can be done using a power transformation or the classic log transformation.

Slide 14:

What if the normality or constant variance assumption does not hold? Often, we use a transformation that normalizes or variance-stabilizes the response variable.

That common transformation is a power transformation of y, called the Box-Cox transformation. If lambda is equal to 1, we do not transform. If lambda is equal to 0, we use the natural logarithmic transformation. If lambda is equal to -1, we use the inverse of y.
After transforming y, you will need to fit the model again and evaluate the residuals for
departures from assumptions. If the transformation(s) did not address these departures
from assumptions, you will need to consider other transformations. If you cannot
identify the appropriate transformation, you will need to consider a different modeling
approach.
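A minimal R sketch of this approach, assuming the MASS package and a fitted lm() object named model (a placeholder name); the response must be strictly positive for the Box-Cox transformation:

```r
library(MASS)

# Box-Cox: y^(lambda) = (y^lambda - 1) / lambda for lambda != 0, and log(y) for lambda = 0
bc         <- boxcox(model, lambda = seq(-2, 2, by = 0.1))  # plots the profile log-likelihood
lambda.hat <- bc$x[which.max(bc$y)]                         # lambda with the highest likelihood

# Refit with the transformed response and re-check the residuals; 'y', 'x1', 'x2',
# and 'mydata' below are placeholders for the actual variables.
# model.bc <- lm(((y^lambda.hat - 1) / lambda.hat) ~ x1 + x2, data = mydata)
```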

Slide 15:
An important aspect in regression is the presence of outliers, which are data points far from the rest of the data in x and/or y. Data points that are far from the mean of the x's are called leverage points. A data point that is far from the mean of the x's and/or from the y's is called an influential point if it influences the regression model fit significantly. Influential points can change the value of the estimated parameters, their statistical significance, their magnitude, or even their sign. It is important to note that an outlier, including a leverage point, may or may not impact the regression fit significantly; thus it may or may not be an influential point.

It is tempting to just discard outliers, but sometimes the outliers belong to the data. Other times, there are good reasons for excluding a subset of points, for example when there are errors in data entry or in the experiment. When outliers belong to the data, you will have to perform the statistical analysis with and without the outliers and assess how an outlier influences the regression fit.

Slide 16:
To identify outliers, we can compute the so-called Cook's distance, the distance
between the fitted values of the model including all the observations versus the fitted
values of the model discarding the i-th observation from the data used to fit the model.
The idea here is that the Cook's distance will measure how much the estimated
regression coefficients, their statistical significance, and the predictions change when
the i-th observation is removed.

A rule of thumb is that when the Cook's distance for a particular observation is larger
than 4 over n, it could be an indication of an outlier. Another rule of thumb is when the
Cook’s distance is close to 1. My rule of thumb is simply visualizing the Cook’s distance
plot and if a few observations show significantly larger Cook’s distances, I recommend
investigating the model with and without each individual potential outlier.
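A short R sketch of this check, with model again a placeholder for the fitted lm() object:

```r
cook <- cooks.distance(model)
n    <- length(cook)

plot(cook, type = "h", ylab = "Cook's distance")  # visualize the distances
abline(h = 4 / n, lty = 2)                        # rule-of-thumb threshold 4/n

which(cook > 4 / n)                               # observations to investigate with/without refits
```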

Note that outliers are those few observations that behave differently than the rest of the data; when I say a few, that means 1 to 3 observations. If you identify many outliers, then they are not outliers; rather, they are observations from the tail of a heavy-tailed distribution, and hence the normality assumption does not hold.

Summary:
To summarize, in this lesson, we learned about model diagnostics, including evaluation
of model assumptions and outliers as well as transformations to address violations of
model assumptions.

2.8. Model Evaluation and Multicollinearity


This lesson covers Model Evaluation and Multicollinearity. You will learn how to evaluate
the model performance and you will learn about the concept of Multicollinearity, which
is particularly important for Multiple Linear Regression when we have multiple
predictors in a model.
Slide 3:

A simple approach for evaluating linearity among multiple factors is the Correlation
Coefficient. The Correlation Coefficient is used to evaluate linear dependence,
specifically, the linear relationship between two variables. It could be between Y and X,
that is, between the response variable and a predicting factor, or it could be between
two different predicting variables. Since we can use the Correlation Coefficient to
evaluate whether we have a linear relationship between the response variable and the
predicting variables, we can use this coefficient to find a good transformation to
improve the linearity assumption if it doesn’t hold. We could try several transformations
for X, the predicting variable; we'll choose the transformation that will most improve the
Correlation Coefficient. We also can use the Correlation Coefficient to evaluate the
correlation between the predicting variables, for detecting (near) linear dependence
among the variables, or multicollinearity, as I'll discuss later in this lesson.

Slide 4:
Just like in simple linear regression, a common approach for evaluating the performance
of a Multiple Linear Regression model is the Coefficient of Determination. This is the so-called R2, which is 1 - SSErrors / SSTotal.

We interpret R2 as the proportion of total variability in Y that can be explained by the linear regression model.
Slide 5:

The R2 formula involves the so-called sums of squares. Here I will recap the sums of squares, differentiated into the sum of squared errors, the sum of squares total, and the sum of squares for regression. Unfortunately, the field of statistics abounds in
inconsistent terminology and notation. This slide provides an account of different ways
these sums of squares are denoted or defined. For consistency, I will try to stay with the
same notation throughout the course although do keep in mind all these other
notations.

Slide 6:

This slide provides a summary of various approaches to evaluate model performance, or the explanatory or predictive power of a linear model. A first approach to evaluate the
model is through the overall regression test or the F-test. For this test, the null
hypothesis is that all the regression coefficients except the intercept are 0 versus the
alternative that at least one is not 0. What this says is that, if we reject the null
hypothesis, we will conclude that at least one of the predicting variables explains the
variability in the response. I overviewed this test in a different lesson.

The coefficient of determination or R2 is another way to evaluate a linear model. However, R2 increases as we add more predicting variables. Thus, if we want to compare models with different numbers of predicting variables, we should use the adjusted R2, because the adjusted R2 is adjusted for the number of predicting variables in the model. When to use R2 versus adjusted R2? When we're interested in explaining the variability in the response, we use the R2; when we want to compare models with different numbers of predicting variables, we use the adjusted R2.
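For reference, a sketch of the two formulas with n observations and p predicting variables:

```latex
R^{2} = 1 - \frac{SSE}{SST},
\qquad
R^{2}_{adj} = 1 - \frac{SSE/(n-p-1)}{SST/(n-1)} = 1 - (1 - R^{2})\,\frac{n-1}{n-p-1}.
```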

Please note that these are not Goodness-of-fit measures. Goodness-of-fit refers to the
goodness of the fit of the data with the model structure and assumptions. The measures
on this slide can be used to evaluate the performance of how well the predicting
variables explain the response under the linearity assumption.

Slide 7:

Another approach for evaluating the model performance is by evaluating the prediction
accuracy of the model for new responses. The general approach is as follows. In the first
step, we divide the data into training and testing data since we need to leave aside
some percentage of the data for ‘new responses’ that need to be predicted. There are
two general approaches to divide the data. One is random sampling, meaning that we
randomly allocate a percentage of the data to training, say 75%, and the remaining
percentage of data to testing. Note that each time we divide the data this way, we will
have a different dataset for training and testing. The other approach is called K-fold cross validation, meaning that we divide the data into k chunks or parts of approximately equal size, then allocate k-1 of the folds to training and one fold to testing, rotating the role of the test fold so that each of the k parts is used once for testing.

In Step 2, we fit the regression model using the training data, providing the estimated
regression line.

In Step 3, we predict the “new” responses, corresponding to testing data using the
fitted model; note that while we observe the responses in the testing data, we pretend
we have not seen them before prediction so that we can compare the predicted
responses to the observed responses corresponding to the testing data. Here we use a
prediction accuracy measure to evaluate how well the assumed model predicts.
Examples of prediction accuracy measures are in the next slide.

Last, we apply Steps 1 to 3 multiple times, then average the prediction accuracy
measure over all repetitions.
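A minimal R sketch of the random-sampling version of this procedure; the data frame mydata, the response name y, and the 75% split are placeholder assumptions:

```r
set.seed(1)                                        # reproducible random split
n     <- nrow(mydata)
train <- sample(1:n, size = round(0.75 * n))       # Step 1: 75% training, 25% testing

fit  <- lm(y ~ ., data = mydata[train, ])          # Step 2: fit on the training data
pred <- predict(fit, newdata = mydata[-train, ])   # Step 3: predict the "new" responses
obs  <- mydata$y[-train]

mean((pred - obs)^2)                               # one accuracy measure (MSPE); see the next slide
# Repeat the split/fit/predict steps several times and average the accuracy measure.
```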

Slide 8:
The most common reported measures of predicting accuracy are on this slide:

● Mean squared prediction error abbreviated MSPE and computed as the mean of
the squared differences between predicted and observed;
● Mean absolute prediction errors abbreviated MAE and computed as the mean of
the absolute values of the differences between predicted and observed;
● Mean absolute percentage error abbreviated MAPE and computed as the mean of
the absolute values of the differences scaled by the observed responses;
● Precision measure (or precision error) abbreviated PM here and computed as the ratio between MSPE and the sum of squared differences between the responses and the mean of the responses;
● Confidence Interval Measure abbreviated here as CIM and computed as the
number of predictions falling outside of the prediction intervals divided by the
number of predictions made.
These measures can be used to quantify predictive accuracy, but the end goal is to compare various prediction models.
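Continuing the sketch above, these measures can be computed from the vectors of predicted and observed testing responses (pred, obs, fit, mydata, and train are the placeholder names used earlier):

```r
MSPE <- mean((pred - obs)^2)                            # mean squared prediction error
MAE  <- mean(abs(pred - obs))                           # mean absolute prediction error
MAPE <- mean(abs(pred - obs) / obs)                     # mean absolute percentage error
PM   <- sum((pred - obs)^2) / sum((obs - mean(obs))^2)  # precision measure

# CIM: share of new observations falling outside their prediction intervals
pi.bounds <- predict(fit, newdata = mydata[-train, ], interval = "prediction")
CIM <- mean(obs < pi.bounds[, "lwr"] | obs > pi.bounds[, "upr"])
```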

Slide 9:
Let's take a closer look at which of these measures work better in the context of multiple linear regression:
MSPE is appropriate for evaluating prediction accuracy for a linear model estimated using least squares, but it depends on the scale of the response data and is sensitive to outliers.

MAE is not appropriate for evaluating prediction accuracy of a linear model estimated
using least squares, and depends on scale, but it is robust to outliers.

MAPE is not appropriate to evaluate prediction accuracy of a linear model estimated using least squares, but it does not depend on scale and it is robust to outliers.

The precision error is the best of all because it is appropriate for evaluating prediction
accuracy for linear models estimated using least squares and it does not depend on
scale. The precision measure is reminiscent of the R2. It can be interpreted as the ratio of the variability in the prediction errors to the variability in the new data.

While MAE and MAPE are commonly used to evaluate prediction accuracy, I recommend using the precision measure. This recommendation has a theoretical foundation, but the intuition is that the regression model is estimated by minimizing the sum of squared errors; hence the accuracy measure should match the estimation approach and evaluate squared differences, not absolute differences, between predicted and observed values for a fair error measurement.

Last, Prediction Accuracy Evaluation is not the same as Goodness-of-fit!

Slide 10:

Now we are moving onto the concept of multicollinearity that I hinted to a few times
already in prior lessons. Recall that we assume that XTX is invertible, a condition needed
to get the estimated regression coefficients. XTX is not invertible if the columns of X are
linearly dependent, i.e., one predicting variable, corresponding to one column, is a
linear combination of the others.

Formally, if XTX is not invertible, the estimated regression coefficients do not exist, the standard errors of the estimated regression coefficients beta are infinite, and the standard error for predictions is also infinite. Most often this happens due to a specification error where one or more predictors is redundant, for example, if both the age in years and the number of rings of a tree were included in a model for evaluating characteristics of trees.

However, it is rarely the case that we have exact collinearity. What we generally have is near collinearity, meaning that the linear dependence is not exact, but some predicting variables are approximately linear combinations of the others. The result is that it may be difficult to invert XTX. The bigger problem is that the standard errors will be artificially large.

Slide 11:

From a practical point of view, multicollinearity can lead to many problems:

1. If one value of one of the predicting variables is changed only slightly, the fitted
regression coefficients can change dramatically. In statistical modeling, we say that the
model is sensitive to small changes in the input.

2. It can happen that the overall F statistic is significant, yet each of the individual t
statistics is not significant. That is, we will not be able to detect statistical significance
because the variance of the estimated coefficients would be artificially large. Another
indication of this problem is that the p-value for the F-test is considerably smaller than
those of any of the individual coefficient t-tests.
3. Another problem with multicollinearity comes from attempting to use the regression
model for prediction. In general, simple models tend to forecast better than more
complex ones, since they make fewer assumptions about what the future must look like.
That is, if a model exhibiting collinearity is used for prediction in the future, the implicit
assumption is that the relationships among the predicting variables, as well as their
relationship with the response variable, remain the same in the future. This is less likely
to be true if the predicting variables are collinear.

One problem that multicollinearity does not cause to any serious degree is inflation or
deflation of the R2, since adding unneeded variables cannot reduce R2 (it can only leave it roughly the same).

Slide 12:

An approach to diagnose collinearity is through the computation of the variance inflation factor, computed for each predicting variable. If we consider the j-th predicting variable, the variance inflation factor or VIF is equal to 1 divided by 1 - Rj2, where this R2 with subscript j is the coefficient of determination of the regression of the variable Xj, treated as the response, regressed on all the other predicting variables.

How big of a VIF or variance inflation factor indicates a problem? A common rule of thumb flags a problem when VIF > max(10, 1 / (1 - R2model)), where the R2 of the model is the coefficient of determination of the regression model including all the predicting variables. Like any other rule of thumb, you should use it as guidance rather than to automatically detect multicollinearity.
Slide 13:

Here are the steps involved in the computation of VIF for the j-th predicting variable.
Perform a regression of the predicting variable Xj as the response onto the rest of the
predicting variables and compute the R2 of this regression. Compute the VIF with the
input of this R2. If the VIF is large, then we detect collinearity with respect to the j-th
predicting variable. Please note that collinearity does not simply mean that the j-th
variable is correlated with one other predicting variable. It means that it is a linear
combination of the rest of the predicting variables. Thus, this approach goes beyond
simply evaluating the correlation of pairs of predicting variables. It evaluates the
correlation between a predicting variable and linear combinations of the other
predicting variables.
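A sketch of both routes in R: computing the VIF for one predictor by hand, and using the vif() function from the car package. The variable names x1, x2, x3, the data frame mydata, and the fitted full model named model are placeholders:

```r
# By hand for predictor x1: regress x1 on the remaining predictors
r2.j   <- summary(lm(x1 ~ x2 + x3, data = mydata))$r.squared
vif.x1 <- 1 / (1 - r2.j)

# Using the car package for all predictors at once
library(car)
vifs      <- vif(model)
threshold <- max(10, 1 / (1 - summary(model)$r.squared))  # rule-of-thumb cutoff
vifs[vifs > threshold]                                    # predictors flagged for multicollinearity
```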

Slide 14:
How do we interpret the VIF? VIF measures the proportional increase in the variance of the estimated regression coefficient corresponding to the j-th predicting variable compared to what it would have been if the predicting variables had been completely uncorrelated. Multicollinearity will not cause a problem in the regression if the variance of the corresponding regression coefficient is not significantly larger when the predicting variables are correlated than when they are not. A VIF of 1 (the minimum possible VIF) means the tested predictor is not correlated with the other predictors. The higher the VIF:

• The more correlated a predictor is with the other predictors
• The more the standard error is inflated
• The larger the confidence interval
• The less likely that a coefficient will be evaluated as statistically significant

Again, it could be that the predicting variables are correlated, but that doesn't necessarily mean it will lead to a problem in the stability of the estimated regression coefficients.

What can we do about multicollinearity? Don’t use all the variables; use variable
selection as you will learn in Module 4. Multicollinearity is just an extreme example of
the bias-variance tradeoff we face whenever we do regression. If we include too many
variables, we get poor predictions due to increased variability. Again, I will expand on
this in the next module. Stay tuned.

Summary:
In this lesson, I covered two important concepts in multiple linear regression. First, I
introduced classic approaches to evaluate model performance along with prediction
accuracy evaluation. Second, I presented the concept of multicollinearity, important
when applying multiple linear regression to models with many predicting variables. We
will practice with these concepts in the next lessons.
2.9. Model Evaluation: Data Examples
In this lecture, I will illustrate how to perform diagnostics of the assumptions and model
evaluation with a data example using the R statistical software.

Slide 3:
We'll return to the data example in which we are interested in the relationship between
advertising expenditure and sales in the presence of other predicting variables that
impact sales.

Slide 4:
The questions that we'll address in this example are as follows:

● Do the assumptions of multiple linear regression hold?


● If one or more assumptions do not hold, what can we do about it?
● Do we identify any outliers?
● What are the correlation coefficients between the quantitative predicting
variables? Is there any potential multicollinearity?
● What is the coefficient of determination? How do we interpret it?

Slide 5:
To evaluate the linearity assumption, one approach is to plot the scatterplot between
any two variables in the data, specifically, the scatterplots of the response versus the
predicting variables and the scatterplots of all the pairs of quantitative predicting
variables. We can do that using the command plot(). The input now is a matrix consisting of the columns of the response variable and the quantitative predicting variables.

The output will look just like the plots in this slide. In this output, the first row of the
scatterplots includes the scatterplots of the sales versus all four quantitative predicting
variables: sales versus advertising, sales versus bonus amount, sales versus market
share, and sales versus the largest competitor’s sales. The other set of plots consists of
the scatter plots of the predicting variables: advertising versus the bonuses, market
share, and largest competitor, and so on.

What can we learn from these plots?


● There is a strong linear relationship between Sales versus advertising
expenditure;
● There is a weaker linear relationship between Sales versus amount of bonuses
● The scatterplot of Sales versus market share shows scattered data
● The scatterplot of Sales versus largest competitor sales also shows scattered
data
The other plots, the scatterplots of the predicting variables, can be used to evaluate
correlation between predicting variables. We're going to return to correlation between
predicting variables in the next lesson. Overall, according to this set of scatterplots, we
conclude that the linearity assumption holds for all predicting variables.

Slide 7:
Another approach to evaluate the linearity assumption is through the residual plots;
specifically, plot the residuals versus individual predicting variables. This is what I am
plotting here with this R code. First, I'm extracting the standardized residuals from the
model fit using the stdres() command in R. Then I'm dividing the display into quadrants and plotting each predicting variable against the residuals. For each plot, I'm adding the 0 line to display how the residuals scatter around the 0 line. The linearity assumption holds if the residuals are randomly scattered around the 0 line.

Slide 8:
These are the resulting plots. We can see that the residuals are scattered around the 0
line for all four predicting variables, which is an indication that the assumption of
linearity holds.

Slide 9:
These are the plots used to evaluate other assumptions as well as to identify outliers.
We can use the plot of fitted values versus residuals to evaluate uncorrelated errors and
constant variance. We will use the qqnorm plot and the histogram to evaluate the
normality assumption. I'm also providing here how to obtain the Cook's distances and
how to plot them to evaluate whether we have outliers.
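A hedged sketch of what such R code could look like; model is a placeholder for the fitted lm() object, and the exact plotting options on the slide may differ:

```r
library(MASS)
res  <- stdres(model)            # standardized residuals
cook <- cooks.distance(model)

par(mfrow = c(2, 2))
plot(fitted(model), res, xlab = "Fitted values", ylab = "Std. residuals"); abline(h = 0)
qqnorm(res); qqline(res)         # normal probability plot
hist(res, main = "Histogram of standardized residuals")
plot(cook, type = "h", ylab = "Cook's distance")
```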
Slide 10:

These are the resulting plots. The first plot is the residuals versus fitted. The residuals
are spread around the 0 line, an indication that both the constant variance and the
uncorrelated errors assumptions hold. The next plot is the normal probability plot; the
points do line up on a straight line. The histogram shows that the residuals have a
symmetric distribution, except that we see a gap around 40. The last plot is the plot of
the Cook's distances. Two values are larger than the other values. It would be important
here to evaluate whether those are influential points or not. That means we would need
to perform the regression analysis with and without each individual potential outlier and
evaluate possible changes in the estimated regression coefficients in terms of
magnitude, sign, and statistical significance.

Slide 11:
In order to compute the correlation between the predicting variables, we can use the cor() command in R, which computes the correlation matrix. To use this command, we input the matrix whose columns are the four predicting variables whose correlations we want to compute. The output is the matrix of the correlation values.

For example, the value 0.418 is the correlation between advertising expenditure and
amount of ‘bonuses’. The maximum correlation among the predicting variables is 0.452,
not a strong correlation among the predicting variables. I will highlight that the correlation matrix only evaluates dependence between individual pairs of predicting variables; it cannot be used to identify linear dependence among more than two variables, and thus it is not an approach for evaluating multicollinearity.
Slide 12:

To detect multicollinearity we need to compute the VIFs. To compute the VIFs for each
individual predictor in R, we can use the vif() command, where the input is the fitted
model. In the output, the first column provides the VIF values derived using the formula
I provided in the previous lesson. We can compare these VIF values with the threshold,
the maximum between 10 and 1/(1- R2), which in this case is equal to 22.47. None of
the VIF values are larger than 22.47, which is an indication that we don't have
multicollinearity in this example.

Slide 13:
If we are interested in the R2, we can use the summary of the fitted model. The coefficient of determination in this data example is 0.955, which means that the linear regression model explains 95.5% of the variability in the sales. This is an uncommonly high R2; in practice, the R2 is most often lower than 0.5.

Summary:
To summarize, in this lesson, I illustrated how to apply model evaluation as well as to
evaluate multicollinearity using the R statistical software. This lesson concludes the set
of lessons covering fundamental concepts for multiple linear regression. In the
remaining lessons of this module, I will apply these concepts using two data examples.
2.10 Ranking States by SAT Performance: Exploratory &
Regression Analysis
In this lesson, I will illustrate multiple regression with an example related to ranking
states by SAT performance. Specifically, I'll focus on the exploratory data analysis and
on illustrating the applicability of the testing procedure for subsets of regression
coefficients.

Slide 3:
Two researchers examined compositional and demographic variables to understand to
what extent state characteristics were tied to SAT scores. Research questions to be
addressed using these data are:

 Which variables are associated with the state SAT scores?


 How do the states rank with respect to the SAT performance?

Slide 4:
In the first example, the response variable is the state average SAT score.

The predicting variables are as follows.

● Takers, the percentage of total eligible students in the state who took the exam.
● Rank, the median percentile of ranking of test takers within their secondary
school classes.
● Income, the median income of families of test takers, in hundreds of dollars.
● Years, the average number of years that a test taker has had in social sciences,
natural sciences, and humanities.
● Public, a percentage of test takers who attended public schools.
● Expenditure, a state expenditure on secondary school in hundreds of dollars
per student.

Slide 5:
Back in 1982, not all colleges required the SAT for admission, particularly those in the Midwest. As a result, the states with high average SAT scores had low percentages of takers: only the best students, those planning to attend college out of state, took the SAT exams. As the percentage of takers increased for other states, so did the likelihood that the takers included lower-qualified students.
Thus, in this example, two variables can be used to control for this selection bias: the percentage of students taking the SAT, or the ‘takers’ factor, and the median percentile ranking of test takers within their secondary school classes, or the ‘rank’ factor.

Slide 6:
We'll first read the data in R using the read.table() command in R. We can check the
dimensionality of the data file, consisting of 50 rows, each row corresponding to one
state.

Slide 7:
Exploratory data analysis allows us to explore the variables in the dataset before
beginning any formal analysis. For exploratory data analysis in this example, we'll first
examine the predicting variables through plotting their histograms as illustrated in the R
commands on the slide. With a histogram, we can see the general range of the data,
shape such as skewness, outliers, gaps, and other distributional shape characteristics.

We also plot the scatterplot matrix, specifically the scatterplots of all variables in the
data, accompanied by the correlation matrix. Again, this would give us a first look into
the linearity assumption as well as dependency between pairs of the predicting
variables.

Slide 8:

The histograms for the SAT scores and for all the other six quantitative predictors are on
this slide. The state average SAT score histogram (shown in black) displays a bi-modal
distribution. This potentially indicates a clustering of states depending on whether the colleges in those states require the SAT or not, and thus is potentially due to the selection bias. We can also see that the histogram for the ‘takers’ factor (shown in red) clearly has two clusters, which may explain the bi-modality in the SAT score as well. We also see some
potential leverage points. Alaska has almost double the amount of secondary schooling
expenditure compared to all the other states. Similarly, Louisiana has very few students
taking SAT who have come from public schools.

Slide 9:

We can also evaluate the relationships between all the variables using the scatterplot
matrix. Generally, we're looking for trends here. Does the value of one variable tend to
affect the value of another? If so, is their relationship linear? The scatterplot matrix
shows a clear relationship between SAT, the response variable, and takers and rank, which are the two controlling variables. Interestingly, Alaska stands out with a very high expenditure level yet a rather average SAT score.

Since subtle trends are often difficult to identify in the scatterplot matrices, sometimes
a correlation matrix can be useful. From the correlation matrix for these data, we note
that both the income and the years variables have moderately strong positive
correlations with the response variable SAT. Their respective correlations are 0.58 and
0.33, indicating that higher levels of income and years of education in science and
humanities are generally associated with higher trends in the SAT scores. However, this
does not imply causation. Each of these trends may be nullified or even reversed when
accounting for the other variables in the model.

Slide 10-11:

The R command used to fit a multiple linear regression model is lm(), with the input
including the response variable, in this case, the state average SAT score, and the
predicting variables joined by the plus sign. I included a portion of the output of the
model fit on the slide.

The output not only provides the estimated coefficients, but also statistical inference on
the statistical significance of the coefficients. For example, among the regression
coefficients, those that are statistically significant are for the ‘rank’, for ‘years’, and for
the expenditure predicting variables at the significance level 0.05.

In the lower part of the output, we find information about the estimated standard
deviation of the error terms, which is 26.34 with 43 degrees of freedom, corresponding
to n-p-1. The R squared is 0.878, meaning that 87.8% of the variability in the SAT scores is explained by the model.

We can also find information on the F-test for the overall regression, which is 51.91. The
p-value is very small, indicating that at least one of the variables in the model has
explanatory power on the variability of the SAT scores.

Slide 12:
We can use the anova() command for the decomposition of the regression sum of squares into the extra sums of squares obtained from adding one predicting variable at a time to the model, as we learned in the lesson where I introduced the testing procedure for a subset of regression coefficients.

Note that the order in which the predicting variables enter the model is important. The anova() command gives the sum of squares for regression explained by the first variable, the ‘takers’ variable, then the extra sum of squares for regression due to
adding the second variable, the ‘rank’ variable, then the third variable, ‘income’,
conditional on the first and second variables to be in the model and so forth.

For example, the extra sum of squares due to adding income to the model, which
includes takers and rank is 2,858, and the extra sum of squares for adding the
predictive variable ‘years’ to the model that includes takers, rank and income is 16,080.

For the SAT data example, we would like to test whether discarding the income, years, public, and expenditure variables results in a similar predictive power as the model including these variables. That is, we would like to test whether any of these variables
will improve the predictive power of the model, when added to the model including
takers and rank, the controlling factors. For this, we compute the F-value of the ratio of
two components. The numerator is the extra sums of squares for regressions due to
adding these four variables to the model, divided by four, which is the number of
predictive variables we're adding. The denominator in the partial F-test is the mean sum
of squared errors or MSE of the full model. We compute the p-value as the right tail from
the F-value of the F-distribution, with 4 and 43 degrees of freedom.

The resulting p-value is approximately equal to zero. Thus, we conclude that at least
one other predicting factor among the four predictors: income, years, public, and
expenditure, will be significantly associated to the state average SAT score.
Slide 14:

Let's overview once more this test. What we're testing here is the null hypothesis that
the regression coefficients corresponding to the four predictors are 0 versus the
alternative hypothesis that at least one of those coefficients is not 0.

How was the F statistic computed again? The numerator is the extra sum of squares
from regression due to adding income, public, years, and expenditure to the model that
already includes takers and rank, divided by 4. The denominator is the mean sum of
square error of the full model. The p-value is computed as the right tail for the F-
distribution with 4 and 43 degrees of freedom. The mathematical derivations here
directly correspond to the implementation in the previous slide.
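The same test can also be obtained directly in R by comparing the reduced and the full model with the anova() command. The variable and data frame names below (sat, takers, rank, income, years, public, expend, satdata) are assumptions and may differ from the actual column names:

```r
reduced <- lm(sat ~ takers + rank, data = satdata)
full    <- lm(sat ~ takers + rank + income + years + public + expend, data = satdata)

anova(reduced, full)   # partial F-test with 4 and 43 degrees of freedom
```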

Summary:

In summary, in this lesson, I illustrated the applicability of multiple linear regression while accounting for selection bias. We also learned how to evaluate the predictive power of a subset of predicting variables in the presence of controlling factors for selection bias. We will continue this example in the next lesson.
2.11. Ranking States by SAT Performance: Model Fit
In this lesson, we'll perform the residual analysis for the SAT Performance example. We will also use the regression analysis to rank states based on their average SAT score while controlling for the selection bias.

Slide 3:

To review, we evaluate the following assumptions graphically. Constant variance and uncorrelated errors, by plotting the response or fitted values versus the residuals. Linearity, using the plots of the predicting variables versus the residuals; in these plots, we seek a random pattern around the 0 line. Normality, using the histogram and the normal probability plot. And outliers, using the Cook's distance plot.

Slide 4:

We can obtain the residuals as provided in the first R command line on this slide; note
again that we will need to perform the residual analysis based on the standardized
residuals. The next command line is used to get the Cook's distances used to identify
outliers. The set of plots of interest are the scatterplot of the response variable or fitted
values versus residuals, the scatterplots of the quantitative predicting variables versus
the residuals, the histogram and normal probability plot, and last, the plot of the Cook’s
distance.

Slide 5:
Here are the resulting plots. In the first plot, of the residuals versus the SAT scores, there is a clustering of the residuals into two groups, possibly an indication of correlated response data due to the selection bias still being present. It is possible that the controlling factors did not fully control for the selection bias within the linear model. The second plot in the first row is the plot of takers, the percentage of students tested, versus the residuals; in this plot, there is a clear pattern, with higher values on the edges and lower values in the center. This pattern is an indication of nonlinearity with respect to this predictor. Thus, we'll need to consider a transformation of the ‘takers’ factor. In fact, the separation in the residuals in the first plot could be due to this nonlinear relationship. The QQ plot indicates that the residuals have heavy tails, and the histogram displays this as well. We do not identify outliers in the residuals, even though we might have expected some: recall that Alaska had a large expenditure, but it does not show up as an influential point based on this plot.

To address the departures from the linearity assumption, I will next fit a model with the
log-transformed predicting variable ‘takers’.

Slide 6:
This slide includes the output from the model with the transformed ‘takers’ variable.
Without the transformation, takers was not statistically significantly associated with the
SAT score given all other predicting variables in the model.

Now with the transformation, it is statistically significant at the significance level of 0.05; the p-value is 0.02. However, the predicting variable rank is no longer statistically significant.

The R2 improves slightly from 87.8% for the model without transformation to 89%. The
standard deviation of the error term decreases slightly also.

Slide 7:

The linearity assumption holds now for the controlling variables. However, we still see
some clustering in the residuals versus SAT scores, although this clustering is weaker
than for the model without transformation. The distribution of the residuals is again
heavy-tailed.

To review, the transformation has improved the linearity assumption, but we still have
heavy tailed residuals, and the Cook’s distances show that Alaska may be an outlier,
potentially an influential point for this model.

Slide 8:

One of the objectives of this analysis is to rank the states by SAT scores. A ranking that does not account for the selection bias will place the states with a lower percentage of takers and a higher median class rank at the top, but this does not necessarily mean that these states perform best in terms of state average SAT, because of the selection bias in the students taking the SAT. Thus, instead of ranking by the actual SAT score, we rank the states by how far they fall above or below their fitted regression value, using the residuals from the model with only the two controlling factors.

To do so, we're still going to fit the model including the controlling factors. Then we
obtain the order of states by the residuals of this model. We will then compare the
ranking before and after the correction for the selection bias.
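A minimal sketch of this ranking step, with the same assumed names as before:

```r
control <- lm(sat ~ takers + rank, data = satdata)    # model with only the controlling factors

new.rank <- rank(-residuals(control))                 # largest positive residual ranked 1st
old.rank <- rank(-satdata$sat)                        # ranking by the raw SAT score

cbind(old.rank, new.rank)                             # compare before and after the correction
```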

Slide 9:
This slide compares the rankings with and without the correction for the selection bias. Note how dramatically the ranking shifts once we control for the variables ‘takers’ and ‘rank’. The ‘old rank’ column provides the ranking without the correction for the selection bias.

For example, after controlling for the selection bias, Connecticut moved from 35th to 1st, and Massachusetts moved from 41st to 4th.

Now for the bottom-ranked states: after controlling for the selection bias, Mississippi moved from 16th to 46th, and Arkansas slid from 12th to 43rd.

Slide 13:

Let's now review some additional findings from this data analysis. Given all other predictors in the model, the percentage of test takers from public schools and the family income of test takers are not statistically significantly associated with the SAT score. Given all other predictors in the model, a $100,000 increase in the expenditure on secondary school results in only a 2.56-point increase in the SAT score. In contrast, given all other predictors in the model, one additional year that test takers had in social sciences, natural sciences, and humanities corresponds to a 17.2-point increase in the SAT score.

Importantly, after the transformation of the predicting variable ‘takers’, the predictors in
the model explain close to 90% of the variability in SAT score.

The ranking changes significantly after controlling for the selection bias factors. For example, Connecticut moved up to 1st from 35th, Massachusetts to 4th from 41st, and New York to 5th from 36th.

Summary:

To conclude this data example, I also illustrated the applicability of multiple linear regression in ranking while accounting for selection bias.
2.12. Predicting Demand for Rental Bikes: Exploratory Data
Analysis
In this lesson, I'll illustrate multiple linear regression with a prediction of bike share
demand. In this lesson, I will introduce the data example, along with exploratory
analysis based on visual analytics and I’ll begin by fitting the regression model.

Slide 3:
Bike sharing systems are of great interest due to their important role in traffic management and the ever-increasing number of people choosing them as their preferred mode of transport. In this study, we will address the key challenge of demand forecasting for such bikes using two years of historical data, corresponding to 2011 and 2012, from Washington D.C., one of the first bike sharing programs in the US. The data were provided by the UCI Machine Learning Repository.

I would like to also acknowledge the support from several Master of Analytics students,
some of them from the online program, in preparing this example. Their names are
provided in this slide. I hope more examples in this course will be prepared with your
support.

Slide 4:

Despite the steady growth in bike sharing programs, one of the key challenges is to estimate the demand for bikes and allocate resources accordingly, as usage rates vary from one bike sharing system to another. Thus, in this study, we will model the demand for bikes, which is the response variable.
The variation in usage could be due to multiple factors, some of which are the prevalent
weather conditions. We can expect that passengers are more likely to choose bike rides
on days when the weather is pleasant without snowfall and/or heavy winds. Another
important factor is time during the day since we should expect differences in demand
throughout the day. In this study, predicting variables include environmental and
seasonal or periodical factors such as weather conditions, precipitation, day of week,
season, hour of the day among others. The details of attributes are listed on the slide.

Slide 5:

The data consists of 17,379 observations, a large sample size. I will come back to this
aspect in a different lesson. We evaluate the distribution of the response variable using
the histogram plot.

The distribution of the demand for bikes is skewed, particularly with a large number of
zeros.

Slide 6:
Next, we will explore how the demand for bike shares differs across the qualitative predicting variables in this study. For example, here I am showing the side-by-side boxplots of the demand for bikes by the hour of the day. From this plot, we learn that the number of bike shares between midnight and 6am is extremely low, which is in line with the expectation that not many people will be commuting during these hours. As expected, the majority of the activity is concentrated between 7am and 11pm, peaking at 8am and 5pm.

Slide 7:

Here I am showing the side-by-side boxplots by the season and by the weather
condition variables separately. From these plots, we learn that the number of bikes shared during winter is the lowest, and that it decreases as the weather becomes unfavorable. While there is some variation by season and by weather condition, to assess whether these patterns are indeed statistically significant, and not merely a matter of randomness, we would need to perform an ANOVA of bike demand versus each of the two factors.

Slide 8:
Here I am showing the scatterplot by wind speed, considering wind speed to be
quantitative. We can see from this plot that the count of rental bikes seems to decrease as wind speed increases. I will add here that the relationship does not seem to be linear.

Slide 9:

These are the scatter plots for two other quantitative variables, temperature and
humidity, along with the marginal linear regression line. The count of shared bikes
seems to decrease as humidity increases although the demand varies within similar
ranges at varying humidity levels. Moreover, the count of rental bikes seems to increase
as temperature increases however with much wider variability at larger temperature
levels.

Slide 10:
Here I am providing the R code for dividing the data into testing and training data. We
will next fit the lm() model (on the training data). Later we will evaluate the prediction
for the testing data. In this code, I also convert the qualitative variables into factors.
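A hedged sketch of this step; the data frame name bike and the column names (count, season, weather, hour, month) are placeholders for the actual names in the file, and the split proportion is an assumption:

```r
# convert the qualitative variables to factors
bike$season  <- as.factor(bike$season)
bike$weather <- as.factor(bike$weather)
bike$hour    <- as.factor(bike$hour)
bike$month   <- as.factor(bike$month)

# random split into training and testing data
set.seed(1)
train.id <- sample(1:nrow(bike), size = round(0.8 * nrow(bike)))
train    <- bike[train.id, ]
test     <- bike[-train.id, ]

fit <- lm(count ~ ., data = train)   # fit the model on the training data only
```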

Slide 11:

Part of the output of the fitted regression model is on the slide. This model was fitted on
the entire data, not only on the training data. Most of the p-values are small, indicating
statistical significance of the regression coefficients. Only a few dummy variables, for
example those corresponding to the month-of-the-year qualitative variable, show a lack
of statistical significance. In the exploratory analysis, we have seen that some
qualitative variables may not marginally explain the variation in the bike share demand.
However, in this linear model, almost all variables seem to be statistically significant. I
pointed out earlier that these data consist of a relatively large number of observations.
In such a case, it is possible to identify the effect of inflated statistical significance as I
will expand in a different lesson.
Slide 12:
Next, we are identifying those predicting variables with p-values larger than the
significance level 0.05, which include dummy variables of the month qualitative
variable, specifically for months February, April, June, July, August, November, and
December, indicating that the demand is not statistically significantly different than
January (baseline), given all other predicting variables in the model.
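A short sketch of how these coefficients could be extracted from the fitted model, assuming the model object is called model1 as in the earlier sketch:

```r
# Coefficient table from the fitted model
coef_table <- summary(model1)$coefficients

# Keep the rows whose p-values exceed the 0.05 significance level
not_signif <- coef_table[coef_table[, "Pr(>|t|)"] > 0.05, ]
round(not_signif, 4)
```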

Summary:
To conclude, in this lesson, I illustrated multiple linear regression analysis with another
data example pertaining to demand of bike share. I will continue this example in the
next two lessons focusing on Goodness-of-fit, statistical inference, and prediction.
2.13. Predicting Demand for Rental Bikes: Regression Analysis
In this lesson, I'll illustrate multiple linear regression by evaluating Goodness-of-fit of
the model via residual analysis for the bike share data example.

Slide 3:

Let’s go back to the fitted model for the bike share demand presented in the previous
lesson. I am providing here the partial R output. I will highlight that models with many
qualitative variables also have many parameters to be estimated, because each
qualitative variable introduces several dummy variables as predicting variables, as in this example.

Based on the output, the estimated standard deviation of the error terms is 101.7 and
the estimated variance is the square of that value. The number of degrees of
freedom is 17,327. The R2 is 0.6884, meaning that 68.8% of the variability in the demand for bikes
is explained by the linear model including the temporal and climate factors.
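These quantities can be read off the summary() output or pulled out programmatically; a brief sketch, with model1 as the assumed model object:

```r
s <- summary(model1)
s$sigma      # estimated standard deviation of the error terms
s$sigma^2    # estimated error variance (the square of sigma)
s$df[2]      # residual degrees of freedom
s$r.squared  # coefficient of determination R2
```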
Slide 4:

I will digress for one slide here to review again the concept of coding qualitative factors.
The first approach on this slide is by converting the qualitative variable into dummy
variables. For example, for weather condition, we have 3 different labels thus 3 different
dummy variables. If we do not include an intercept, as in the first model fit called ‘fit.1’,
then we can include all 3 dummy variables as predicting variables.

The fitted model is provided here. In the R output, each dummy variable has its
individual row since it is a predicting variable on its own. The output does not provide a
row for the intercept since we didn’t include an intercept.

A second approach is to consider a model with intercept but including only two dummy
variables. The resulting model is called ‘fit.2’ here.

The output of this model includes an intercept and the first two dummy variables as
provided in the model. In this example, we chose to have the last dummy variable, or
the 3rd weather condition as the baseline.

A third approach called ‘fit.3’ is to convert the weather condition categorical variable
into a factor and fit the model with a weather condition factor rather than individual
dummy variables.

From the model output, we can see that for this model, R selects weather condition 1 as
the baseline. In the output, we have weather conditions 2 and 3 as dummy variables in
the model, but not weather condition 1.

All three model implementations for the same qualitative variable are equivalent! So it
doesn’t matter which implementation you use.
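A sketch of these equivalent parameterizations, using a hypothetical weather condition variable weathersit with labels 1, 2, and 3 and hand-built dummy variables:

```r
# Hand-built dummy variables for the three weather conditions (hypothetical names)
bike$w1 <- as.numeric(bike$weathersit == 1)
bike$w2 <- as.numeric(bike$weathersit == 2)
bike$w3 <- as.numeric(bike$weathersit == 3)

# fit.1: no intercept, all three dummy variables included
fit.1 <- lm(cnt ~ w1 + w2 + w3 - 1, data = bike)

# fit.2: intercept plus the first two dummies; the 3rd condition is the baseline
fit.2 <- lm(cnt ~ w1 + w2, data = bike)

# fit.3: weather condition as a factor; R takes the first level as the baseline
fit.3 <- lm(cnt ~ as.factor(weathersit), data = bike)

# All parameterizations produce the same fitted values
all.equal(fitted(fit.1), fitted(fit.2))
all.equal(fitted(fit.2), fitted(fit.3))
```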
Slide 5:

Here I am exploring the standardized residual versus fitted values. From this plot, we
find that the constant variance assumption does not hold; the variability in the residuals
increases as the fitted values increase, the so-called megaphone effect. Moreover,
the residuals, at low y values, seem to follow a straight-line pattern; this linear pattern
may suggest that the response variable stays constant for a range of predictor values.
This is also reaffirmed by the hourly boxplots in the exploratory data analysis, where
we see nearly constant response values for hours 0-6. In another model, we will
omit the data for hours between 0 and 6 and compare the fitted models with and
without the 0-6 hours data.
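This diagnostic plot could be produced along these lines (a sketch; model1 is the assumed name of the fitted model, and rstandard() gives the standardized residuals):

```r
# Standardized residuals against fitted values to assess constant variance
std_res <- rstandard(model1)
plot(fitted(model1), std_res,
     xlab = "Fitted values", ylab = "Standardized residuals")
abline(h = 0, lty = 2)
```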

Slide 6:
Here I am assessing the linearity assumption by plotting the residuals against four
quantitative variables. Note that we only evaluate the linearity assumption with respect
to quantitative predicting variables.

From these plots, we infer that the residuals do not show any systematic pattern with
respect to the quantitative predicting variables, supporting the linearity assumption.

Slide 7:

Next, I am exploring the residuals to evaluate the normality assumption using the
histogram and the qq normal plot. From these two plots, we can observe that the
distribution of the residuals is approximately symmetric but with heavy tails, indicating
that the distribution of the residuals looks more like a t-distribution rather than a normal
distribution. I will also highlight that the distribution of the bike share demand is skewed
as discussed in the previous lesson, but the distribution of the residuals is rather
symmetric. This points out, once more, that when we evaluate the normality assumption,
we do so on the residuals rather than on the response variable.
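A sketch of the two normality diagnostics, again assuming the fitted object is model1:

```r
# Histogram and normal QQ plot of the residuals
res <- resid(model1)
hist(res, breaks = 50, main = "Histogram of residuals", xlab = "Residuals")
qqnorm(res)
qqline(res, col = "red")   # reference line for a normal distribution
```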
Slide 8:

Last, I am exploring the presence of outliers using the Cook’s distance as shown on this
slide. There is one observation with a Cook’s Distance noticeably higher than the other
observations. However, its Cook’s distance is close to 0.004, hence small, suggesting
that there are likely no outliers. I will also note that the sample size for this data
example is rather large, hence if we were to compare with the threshold 4/n, we would
identify several outliers. Thus, avoid using this rule of thumb, particularly for data
examples with large sample sizes.
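A sketch of the Cook's distance plot; the 4/n line is drawn only to illustrate the rule of thumb that, as noted above, should not be relied upon for large samples.

```r
# Cook's distances for the fitted model
cd <- cooks.distance(model1)
plot(cd, type = "h", ylab = "Cook's distance")
abline(h = 4 / length(cd), col = "red", lty = 2)  # 4/n rule of thumb (use with caution)
```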

Slide 9:

When I evaluated the Goodness-of-fit using the residual analysis, I pointed out that the
assumption of constant variance does not hold. We learned in the previous lessons that
when this assumption does not hold, we could try to use a variance-stabilizing
transformation of the response variable. One such common transformation is the
Box-Cox transformation. Here I am applying the boxcox() command in R with the initial
model as the input.

The optimal power is equal to 0.22 in this example. However, when the response data
consist of count data per unit time, for example, the number of bikes per hour, a
theoretically recommended transformation is the square root, which would correspond
to the 0.5 power transformation. Hence, I will consider this transformation even though
it is not the optimal 0.22 power transformation.
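A sketch of this step, using boxcox() from the MASS package and then refitting the model with the square-root transformed response; the object and column names remain the assumptions used earlier, and boxcox() requires a strictly positive response.

```r
library(MASS)

# Box-Cox profile for the initial model; the maximizing lambda was about 0.22 here
bc         <- boxcox(model1, lambda = seq(-1, 1, by = 0.05))
lambda_opt <- bc$x[which.max(bc$y)]

# Refit with the square-root (0.5 power) transformed response
train$sqrt_cnt <- sqrt(train$cnt)
model2 <- lm(sqrt_cnt ~ . - cnt, data = train)
summary(model2)$r.squared
```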

Slide 10:

By using the square root transformation, fewer regression coefficients are not
statistically significant; among them is one dummy variable corresponding to day-of-the-week.
The R2 increased from 0.68 for the model without transformation to 0.78 in the model
with transformation indicating a larger proportion of the variability being explained by
the model with transformed response.

We find that VIFs of the season, month, temp, atemp factors are greater than the
threshold used to indicate multicollinearity (max(10, 1/(1-R2))). Thus, we have a problem
of multicollinearity in this linear model. We should not use all the predictors in the
model due to this multicollinearity. This is not surprising since we should expect some
level of multicollinearity between seasonal and climate factors. Please note that it is not
correct to simply discard predicting variables if their corresponding p-values indicate
a lack of statistical significance (that is, the p-values are not small). Also,
do not remove the predicting variables with high VIFs. You will need to perform a
rigorous variable selection as provided in Module 4 of this course.
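VIFs can be computed with the vif() function from the car package; a short sketch, with the threshold above written out explicitly (model2 is the assumed name of the transformed-response model):

```r
library(car)

# Variance inflation factors for the model with the transformed response
vif(model2)

# Threshold discussed above: max(10, 1/(1 - R^2)) based on the model's R^2
r2 <- summary(model2)$r.squared
max(10, 1 / (1 - r2))
```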
Slide 11:

These are the residual plots for the model with the transformation. The constant
variance assumption is still violated. The transformation has not improved the
Goodness-of-fit even though the model performance is better with respect to the
coefficient of determination. Again, you will need to be careful in interpreting R2. A
higher R2 doesn’t guarantee a better-fitting model.

Slide 12:
Another model to consider is one where we would remove the low demand data
between midnight and 6am. The model provided on this slide uses the reduced data
and also applies the transformation to the response variable. The R2 has, however,
decreased from the model fitted with the full data. The set of variables with coefficients
that are identified not to be statistically significant is again similar to that of the first
model.
Slide 13:

These are the plots for the residual analysis for the model with the reduced data. The
increasing variability of the residuals with the fitted values is still present although to a
lesser extent. Moreover, the distribution of the residuals is similar to that from the
previous model, the model with the full data and transformed response.

To conclude, the constant variance assumption is still violated even for the model
without the low demand data. The implication of violating the constant variance
assumption is that the uncertainty in predicting bike demand during periods of high demand will be
higher than estimated using the multiple regression model. In the next Module of this
course, we will learn another regression model called Poisson regression, which models
count data, for example, number of bikes rented per hour; this model allows for non-
constant variance and hence is possibly more appropriate for the data considered in this
study.

Summary:
To summarize, in this lesson, we learned how to implement Goodness-of-fit analysis
using visual analytics of the model residuals with the data example pertaining to
demand of bike share.
2.14. Predicting Demand for Rental Bikes: Prediction &
Interpretation
In this lesson, I will illustrate the implementation of prediction accuracy measures, and I
will compare the three models introduced in the previous lesson using multiple
approaches for evaluating the prediction accuracy.

Slide 3:

We derive predictions by first splitting the data into training data and testing data,
where the training data are used to fit the model and the testing data are used to
evaluate the prediction accuracy of the fitted model. On this slide, I am providing the R
commands for preparing the split of the data into training and test data, applied to
three models: the model without transformation; the model with transformation fitted to
the full data, including all hours of the day; and, last, the model with transformation
fitted to the reduced data, excluding hours 0 to 6. We then fit the three models to the
training data and use the test data for prediction using the ‘predict()’ command.
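For the first model, the prediction step could be sketched as follows, with the train/test objects from the earlier split and a 95% prediction interval:

```r
# Fit model1 on the training data and predict the test responses
model1 <- lm(cnt ~ ., data = train)
pred1  <- predict(model1, newdata = test, interval = "prediction", level = 0.95)
head(pred1)   # columns: fit (prediction), lwr and upr (prediction interval bounds)
```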
Slide 4:

On the slide is the output from the predict() command applied to the test data for
model1 only. Note that I am only providing the output for a subset of the test data. In
this output, we have three columns. The ‘fit’ column provides the predicted response
and ‘lwr’ and ‘upr’ provide the lower and upper bounds of the prediction intervals. We
can see that for all predictions, the prediction intervals are quite wide, indicating high
uncertainty in the predictions.

Slide 5:

But how good are those predictions? We can compare the predictions derived from
applying the predict() command based on the model applied to the training data to the
responses in the testing data. In the real world, we do not have the new responses at
the time of making the predictions, and thus we cannot evaluate the prediction
accuracy for the new data. But here, we first pretend we do not have the responses in
the testing data; we then generate predictions for those observations and compare them
with the observed responses. The most commonly reported measures of prediction accuracy are:

● Mean squared prediction error, abbreviated MSPE, computed as the mean of the
squared differences between the predicted and observed responses;
● Mean absolute prediction error, abbreviated MAE, computed as the mean of the
absolute differences between the predicted and observed responses;
● Mean absolute percentage error, abbreviated MAPE, computed as the mean of the
absolute differences scaled by the observed responses;
● Precision measure, abbreviated PM, computed as the ratio between the sum of
squared prediction errors and the sum of squared differences between the observed
responses and their mean;
● Confidence Interval Measure, abbreviated CIM, computed as the number of observed
test responses falling outside their prediction intervals divided by the number of
predictions.

I introduced these measures in a previous lesson along with their “appropriateness” in
evaluating prediction in linear regression models. To recall, the precision measure is the
best of all because it is appropriate for evaluating prediction accuracy for the linear
models estimated using least squares and it does not depend on scale of the data.
While MAPE is commonly used to evaluate prediction accuracy, I recommend instead
using the precision measure.

The R code for the implementation of the prediction measures is on the slide. First, we
start by deriving the predictions and the prediction intervals then define a series of
functions used throughout the prediction part of the code, including a function for each
accuracy measure. Then we apply these functions to derive the prediction accuracy
measures for model1. Similar R code applies for the other two models.
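The slide's code is not included in the transcript; a minimal sketch of such accuracy functions, under the naming assumptions used so far (test responses in test$cnt, predictions and interval bounds in pred1), is given below.

```r
# Prediction accuracy measures (a sketch following the definitions listed above)
mspe <- function(pred, obs) mean((pred - obs)^2)
mae  <- function(pred, obs) mean(abs(pred - obs))
mape <- function(pred, obs) mean(abs(pred - obs) / obs)
pm   <- function(pred, obs) sum((pred - obs)^2) / sum((obs - mean(obs))^2)
cim  <- function(lwr, upr, obs) mean(obs < lwr | obs > upr)

obs <- test$cnt
c(MSPE = mspe(pred1[, "fit"], obs),
  MAE  = mae(pred1[, "fit"], obs),
  MAPE = mape(pred1[, "fit"], obs),
  PM   = pm(pred1[, "fit"], obs),
  CIM  = cim(pred1[, "lwr"], pred1[, "upr"], obs))
```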

The results from the application of this code are on the slide. The MSPE value is quite
large; however, that does not say anything about how accurate this particular model is,
since MSPE depends on the magnitude of the data. The same holds for the MAE measure.
The MAPE is 2.72, which is quite large, meaning that the average absolute difference
between the predicted and observed values is 272% of the observed value. The precision
measure is 0.31, meaning that the variability in the prediction errors is 31% of the
variability in the new response data. This is a reasonably good value since it is smaller than 1.
Last, we find that approximately 6% of the new response data is outside of the 95%
prediction intervals.
Slide 6:

However, the prediction accuracy derived from one fitted model is highly dependent
upon the subset of observations used for training and testing. Thus, the prediction
accuracy values from the previous slide will change as we change the data split
between training and testing data. Hence it is recommended to apply the same
procedure multiple times, that is, for different splits of the data, then average the
accuracy measure across all the repetitions. Practically, we would apply a random data
split, with 80% going to the training data and 20% to the testing data, multiple times,
say for 100 iterations. Each time we have a data split, we apply the same procedure: fit the
model on the training data and evaluate predictions on the testing data then estimate
prediction accuracy.

To do so, I wrote a function called ‘pred_fun’ as provided on the slide. Note that this
function only applies to model 1; a revised version accounting for the transformation in
models 2 and 3 is provided in the R code accompanying this data example.

Then I set a matrix ‘pred1_meas’, including the prediction accuracy measures across all
100 iterations and apply the average for each measure across the 100 iterations.
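The names ‘pred_fun’ and ‘pred1_meas’ come from the slide; the body below is an assumed reconstruction rather than the course's own code, and it applies to model 1 only.

```r
# One random 80/20 split, model fit, and prediction accuracy evaluation
pred_fun <- function(data) {
  idx   <- sample(seq_len(nrow(data)), size = floor(0.8 * nrow(data)))
  train <- data[idx, ]
  test  <- data[-idx, ]
  fit   <- lm(cnt ~ ., data = train)
  pr    <- predict(fit, newdata = test, interval = "prediction")
  obs   <- test$cnt
  c(MSPE = mean((pr[, "fit"] - obs)^2),
    MAE  = mean(abs(pr[, "fit"] - obs)),
    MAPE = mean(abs(pr[, "fit"] - obs) / obs),
    PM   = sum((pr[, "fit"] - obs)^2) / sum((obs - mean(obs))^2),
    CIM  = mean(obs < pr[, "lwr"] | obs > pr[, "upr"]))
}

# Repeat the split 100 times and average each accuracy measure
pred1_meas <- t(replicate(100, pred_fun(bike)))
colMeans(pred1_meas)
```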

Here I am comparing the values of the accuracy measures based on one iteration
versus an average across all 100 iterations. We can see that the values don’t change
much. Again, this comparison is for model1 only. Similar results are for the other two
models.
Slide 7:

In the lesson where I introduced the concept of prediction evaluation, I discussed two
approaches to split the data. One approach is the random sampling approach, where we
split the data randomly. The other approach is the so-called k-fold cross validation,
when we split the data into k folds then use k-1 folds for fitting the model and the
remaining fold for prediction. The R code on the slide compares the random and the k-
fold splitting approaches.

The R command highlighted performs random sampling with 20% going to the testing
data, hence test=0.2, and is applied for 100 iterations, so n=100.

The next R command highlighted here performs k-fold cross validation with 10 folds.
Note that it’s important what k is set to; a low value of k leads to biases in the
estimation of prediction accuracy, and a high value of k leads to higher variability in the
performance metrics of the model (where performance here is evaluated based on the
prediction accuracy). Thus, it is important to use an appropriate value of k for the
model; a k between 5 and 10 is commonly used.

Next, we fit all models for the different data splits.

We then apply the ‘get_pred_meas’ function that obtains the prediction given the
corresponding testing data then outputs the accuracy measures. Again, this function
applies to model 1 only; a different version of this function for models 2 & 3 is in the
accompanying R code of this data example. We apply this function across all 100
repetitions of the random and the 10-fold divisions.
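The 10-fold split itself could be sketched as follows; this is an illustration under the same naming assumptions, not the accompanying course code.

```r
# Assign each observation to one of k = 10 folds at random
k     <- 10
folds <- sample(rep(1:k, length.out = nrow(bike)))

# For each fold: fit on the other k-1 folds, predict on the held-out fold
cv_meas <- sapply(1:k, function(j) {
  train <- bike[folds != j, ]
  test  <- bike[folds == j, ]
  fit   <- lm(cnt ~ ., data = train)
  pr    <- predict(fit, newdata = test, interval = "prediction")
  obs   <- test$cnt
  c(PM  = sum((pr[, "fit"] - obs)^2) / sum((obs - mean(obs))^2),
    CIM = mean(obs < pr[, "lwr"] | obs > pr[, "upr"]))
})
rowMeans(cv_meas)   # accuracy measures averaged over the 10 folds
```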

The results here compare the average prediction results for the two approaches. The
accuracy measures are again quite similar.
Slide 8:

On this slide, I am comparing two prediction error measures along with the R2 and the
adjusted R2 across the three models. The model with the transformed response
outperforms the other models in terms of predictive power as reflected in the Precision
Measure and in terms of the variability explained as reflected by the R2. Interestingly,
removing low demand data does not improve the model performance in terms of
predictive power. While the models can be compared in terms of their predictive power,
the constant variance assumption is violated across all three models, to a lesser
degree for the third model. In the next module, I will introduce another modeling
approach, the so-called generalized linear model, particularly the Poisson regression,
which could be used to model count data, such as count of bikes rented. We will thus
return to this data example in the next module.

Summary:
To summarize, in this lesson, you learned how to implement evaluation of models based
on their prediction accuracy. I will continue this data example by looking into statistical
significance under large sample size.

2.15. Predicting Demand for Rental Bikes: P-values & Sample Size
I will conclude this module with the data example on prediction of bike share by
illustrating the so-called statistical significance problem when applying regression to
large sample size data.
Slide 3:

I will begin with a very simple example to illustrate the inflated statistical significance
idea due to large sample size data. In this example, I am considering the very simple
problem of statistical inference using hypothesis testing for the mean parameter of data
from a normal distribution. From basic statistics, for the two-sided test with the null
hypothesis that the mean is equal to zero, the p-value is as provided on the slide. The p-
value is a function of the sample size, and it decreases with square root of n (the
sample size). For large sample sizes, the square root of n dominates the p-value, in the
sense that it makes the p-value artificially small and hence inflates the statistical
significance.

This means that conclusions based on small-sample statistical inferences using p-values
when we have large sample size data can be misleading. A p-value measures the
distance between the data and the null hypothesis on the parameter of interest, for
example, mean parameter in this case. The distance is typically measured in units of
standard deviations. Consistent estimators have standard deviations that shrink as the
sample size increases. With a very large sample, the standard deviation becomes
extremely small, so that even minuscule distances between the estimate and the null
hypothesis become statistically significant. The message here is that large samples Can
Make the Insignificant...Significant!
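A tiny simulation makes this concrete; the sketch below uses t.test() as a stand-in for the z-test on the slide and tests H0: mu = 0 on normal data whose true mean is a negligible 0.01.

```r
set.seed(1)
# The true mean is only 0.01, a practically negligible departure from the null
for (n in c(100, 10000, 1000000)) {
  x <- rnorm(n, mean = 0.01, sd = 1)
  cat("n =", n, " p-value =", signif(t.test(x, mu = 0)$p.value, 3), "\n")
}
# As n grows, the p-value shrinks and the tiny effect becomes "significant"
```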
Slide 4:

The p-value problem under large sample sizes applies in the context of statistical
significance of the regression coefficients in regression analysis. Similar to what was
discussed in the previous slide, the standard errors of the estimated regression coefficients
shrink with the square root of the sample size. The p-value is used as a measure of
statistical significance for the regression coefficients; if the p-value is small (for
example, smaller than 0.01 or 0.05), it indicates that the corresponding regression
coefficient is statistically significant. This can, however, be a misleading
conclusion when applied to large sample data.

Slide 5-6:

But large sample size is both a curse and a blessing in statistical inference. The
approach I will introduce next shows the blessing of large sample data. The approach
uses the so-called idea of sub-sampling, meaning sampling a small percentage of the
data randomly, say 10-20%, if the sample size is very large. For the sub-sampled data,
we can apply the regression analysis, estimate the regression coefficients, and obtain
the p-values since now the p-values are derived for a smaller sample data. We can
repeat the sub-sampling and model fit many times, say 100 times. The output from this
approach will consist of estimated regression coefficients and the corresponding p-
values from applying the regression model to each of the data sub-samples. That is, if we
sub-sampled the data 100 times, we will have 100 sets of the estimated coefficients.
Using this output, we can then get the so-called empirical distributions of the regression
coefficients and a range of p-values for each regression coefficient, which can be used
to make inference on the statistical significance of the regression coefficients.

First, statistical significance, or lack of it, can be identified based on the distribution of
the p-values; specifically, if the empirical distribution is approximately uniform between
0 and 1, then we do not have statistical significance. Second, statistical significance (or
lack of it) can be identified based on the confidence interval of the regression coefficient
derived from the empirical distribution. I will illustrate this approach in the next slides.

Slide 7:

This slide provides the R implementation of the approach I described in the previous
slide. Here the number of sub-samples is B equal to 100.

The sub-sample percentage is 40% of the initial sample size. We will explore different
percentages of the data, say 20%, later in this lesson to better understand the
implications of setting this tuning parameter.
Last, I set here the significance level to be alpha equal to 0.01 and we consider the p-
values smaller than or equal to this significance level to indicate statistical significance.
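A sketch of such a sub-sampling loop with B = 100 sub-samples, a 40% sub-sample fraction, and significance level 0.01; this is an assumed reconstruction under the earlier naming assumptions, not the slide's own implementation.

```r
B     <- 100    # number of sub-samples
frac  <- 0.40   # fraction of the full data in each sub-sample
alpha <- 0.01   # significance level

set.seed(2024)
# p-values of every coefficient for each of the B sub-sample fits (rows = coefficients)
pvals <- replicate(B, {
  sub <- bike[sample(seq_len(nrow(bike)), size = floor(frac * nrow(bike))), ]
  fit <- lm(cnt ~ ., data = sub)
  summary(fit)$coefficients[, "Pr(>|t|)"]
})

# For each coefficient, how often it is significant across the B sub-samples
sig_freq <- rowSums(pvals <= alpha)
sort(sig_freq, decreasing = TRUE)
```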

Slide 8:

The output for the sub-sampling approach from the previous slide is provided on the
right. In this output, I am reporting the estimated coefficients and the initial p-values
from fitting the model using the entire data. To this table, I added another column
consisting of the frequency (the number) of p-values smaller than the significance level
alpha=0.01 across the 100 repetitions. As you note from this table, I didn’t include all
the predicting factors. In fact, I only included those with a frequency of p-values
indicating statistical significance greater than or equal to 95, that is, 95% of the time. In this
analysis, I deemed those regression coefficients with 95% or more of the p-values
smaller than or equal to 0.01 to be statistically significant. Note that this percentage is
also a tuning parameter. As this percentage gets smaller, we identify more regression
coefficients to be statistically significant. We will explore the implications of this tuning
parameter again later in this lesson.

Next, I am plotting all p-values by the corresponding predicting factors. Each predicting
factor will show the variability among its corresponding p-values across 100 repetitions.
Slide 9:

This is the matrix plot of the p-values for the regression coefficients deemed to be
statistically significant according to the rule implemented in the previous slide. For
those regression coefficients identified to be statistically significant, the p-values across
the 100 subsamples are very small.

Slide 10:

On this slide I consider lack of statistical significance coded here as the regression
coefficients with 85% or less of the p-values smaller than the significance level. Among
those regression coefficients, the dummy variables corresponding to month-of-the-year
and day-of-the-week are in this group. The last column in the table corresponds to
the number of p-values across the 100 sub-samples that are smaller than the
significance level 0.01. We can see that for most regression coefficients, the number of
p-values smaller than the significance level is small, indicating lack of statistical
significance.

Again, we plot the matrix plot of the p-values of each predicting factor identified not to
be statistically significant.

Slide 11:

The corresponding matrix plot of the p-values is on the slide. We can see now that the
theoretical result discussed earlier holds; specifically, the distribution of the p-values
corresponding to a lack of statistical significance is, as expected, approximately
uniform.

Slide 12:

Here we explore the implications of one tuning parameter, the percentage of the sub-
sampled data. So far, we have sub-sampled 40% of the data, but now I am reducing it to
20%, as it is possible that the sample size corresponding to 40% of the data (about
7,000 data points) is still too large, with the statistical significance still inflated at that
sample size. The results on the slide are the coefficients for which
the number of p-values smaller than the significance level 0.01 is small, specifically
smaller than the threshold of 85; that is, the corresponding coefficient is statistically
significant in fewer than 85% of the sub-samples. We compare these results for the 40% subsample versus the 20%
subsample.

We see that as we reduce the percentage of the sub-sampled data, we identify more
regression coefficients that are not statistically significant. This could indicate that a
smaller sub-sample may be appropriate for this analysis. Across all the new regression
coefficients identified as not being statistically significant, only one, specifically
weekday5, corresponding to Friday, has a large number of significant p-values, about
81. This coefficient could be considered on the borderline of statistical significance as it
also doesn’t appear on the list derived for the 40% of the sub-sample data. As
suggested by this analysis, you should explore multiple percentages of sub-samples and
pay attention to the frequency of statistical significance.

Slide 13:
Here are a few insights based on this analysis of the statistical significance of the
regression coefficients. Most coefficients remain statistically significant in at least 95% of
the sub-samples, supporting their statistical significance. Statistical significance is not
supported for most of the month and weekday
dummy variables as well as for temperature and windspeed factors given that other
relevant factors, such as season and weather situation are in the model. While the 85%
cutoff was used for the frequency of p-values smaller than (or equal to) the significance
level 0.01, other lower cut-offs, such as 50%, can be used. Additionally, the procedure
should be applied for different percentages of sub-samples as it may impact the
identification of inflated statistical significance. Another tuning parameter that needs to
be varied is the number of sub-samples. I recommend a more thorough analysis
evaluating the sensitivity to these parameters in detecting statistical significance in
such studies.

Summary:
To summarize, in this lesson, I illustrated an important aspect in regression analysis, the
problem of inflated statistical significance due to large sample size. This lesson concludes
the data analysis of the bike share demand prediction and concludes Module 2.
