Stat 353 Study Guide
Nils DM
December 19, 2020
(a) $V(c) = 0$
(b) $V(cX) = c^2 V(X)$
(c) $V(aX + b) = a^2 V(X)$
(d) $V(X_1 + X_2) = V(X_1) + V(X_2) + 2\,Cov(X_1, X_2)$
(e) $V(aX_1 + bX_2) = a^2 V(X_1) + b^2 V(X_2) + 2ab\,Cov(X_1, X_2)$
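These identities are easy to sanity-check by simulation; a minimal R sketch (made-up data) for identity (e):

set.seed(353)
x1 <- rnorm(1e5, sd = 2)
x2 <- 0.5 * x1 + rnorm(1e5)                               # correlated with x1
a <- 3; b <- -2
var(a * x1 + b * x2)                                      # empirical V(aX1 + bX2)
a^2 * var(x1) + b^2 * var(x2) + 2 * a * b * cov(x1, x2)   # right-hand side of (e)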
2 Probability Distributions
1. Normal Distribution
5. Gamma Distribution
(a) $\Gamma(k) = \int_0^\infty t^{k-1} e^{-t}\, dt$
3 Distribution Theory
A selected list of useful identities is:
$$\frac{(n-1)s^2}{\sigma^2} \sim \chi^2_{n-1}$$
$$X \pm Y \sim \chi^2_{m \pm n}$$
$$T = \frac{Z}{\sqrt{X/K}} \sim t_K$$
7. If $X \sim F_{m,n}$, then $\frac{1}{X} \sim F_{n,m}$
4 Regression Analysis
1. Regression Analysis: A set of statistical processes for estimating the relationship between a dependent variable ($y$; response, labels) and one or more independent variables ($x$; features, predictors).
$$y = \beta_0 + \beta_1 x + \varepsilon$$
where:
$$E(Y|X) = \beta_0 + \beta_1 x, \qquad V(Y|X) = V(\beta_0 + \beta_1 x + \varepsilon) = V(\varepsilon) = \sigma^2$$
4. Important Terms and Their Formulas
(a) Total Sum of Squares: $SST = \sum_{i=1}^n (y_i - \bar{y})^2$, the sum over all squared differences between the observations and their overall mean $\bar{y}$.
(b) Residual Sum of Squares:
$$SSE/RSS/SS_{Res} = \sum_{i=1}^n e_i^2 = \sum_{i=1}^n (y_i - \hat{y}_i)^2$$
The sum of the squared residuals (or the error sum of squares). It is a measure of the discrepancy between the data and an estimation model. A small value represents a close fit.
(c) Residual Mean Square:
$$MS_{Res} = \frac{SS_{Res}}{n-2}$$
(g) The Coefficient of Determination R2 :
$$R^2 = 1 - \frac{SSE}{SST}$$
The proportion of variance in the dependent variable that is predictable from the independent variable. Values close to zero indicate a poor fit; values closer to one indicate a better fit.
5. Errors and Residuals: Two closely related and easily confused measures of deviation.
(a) $V(\hat\beta_0)$:
$$\sigma^2\left(\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}\right)$$
(b) $V(\hat\beta_1)$:
$$\frac{\sigma^2}{S_{xx}}$$
7. Confidence Interval: A way of computing a range of possible values
for an unknown parameter such as the mean.
5 Testing Hypotheses
1. Since $V(\hat\beta_1) = \frac{\sigma^2}{S_{xx}}$,
$$\widehat{V}(\hat\beta_1) = \frac{\hat\sigma^2}{S_{xx}} = \frac{MS_{Res}}{S_{xx}} = \frac{\sum_{i=1}^n (y_i - \hat{y}_i)^2/(n-2)}{\sum_{i=1}^n (x_i - \bar{x})^2}$$
Therefore:
$$S.E.(\hat\beta_1) = \sqrt{\widehat{V}(\hat\beta_1)} = \frac{\hat\sigma}{\sqrt{S_{xx}}} = \sqrt{\frac{MS_{Res}}{S_{xx}}} = \sqrt{\frac{\sum_{i=1}^n (y_i - \hat{y}_i)^2/(n-2)}{\sum_{i=1}^n (x_i - \bar{x})^2}}$$
2. Similarly:
$$V(\hat\beta_0) = \sigma^2\left(\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}\right)$$
Therefore:
$$S.E.(\hat\beta_0) = \sqrt{MS_{Res}\left(\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}\right)} = \sqrt{\frac{\sum_{i=1}^n (y_i - \hat{y}_i)^2}{n-2}\left(\frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^n (x_i - \bar{x})^2}\right)}$$
6 Testing Significance of Regression
1. For a model:
$$y = \beta_0 + \beta_1 x + \varepsilon$$
We say $x$ and $y$ have no linear relationship if the slope $\beta_1 = 0$. We can test for significance in the following two ways:
(a) Method 1:
$$t = \frac{\hat\beta_1}{S.E.(\hat\beta_1)} \sim t_{n-2}$$
(b) Method 2:
$$F = \frac{MS_R}{MS_{Res}} \sim F_{1,\, n-2}$$
where:
$$MS_R = \frac{SS_R}{1^*}, \qquad MS_{Res} = \frac{SS_{Res}}{n-2}$$
*The denominators are the degrees of freedom. This is a general formula that can be extended to multiple regression.
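In R, both methods can be read off a fitted lm object; a minimal sketch, assuming a hypothetical data frame dat with columns y and x:

fit <- lm(y ~ x, data = dat)
summary(fit)   # t statistic and p-value for H0: beta1 = 0 (Method 1)
anova(fit)     # F = MSR / MSRes on 1 and n-2 df (Method 2)

For simple linear regression the two tests agree: the F statistic is the square of the t statistic.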
7 Interval Estimation
1. A $100(1-\alpha)\%$ confidence interval for $\beta_j$ ($j = 0, 1$) is based on the same $t$ statistics used for $t$-tests of $H_0: \beta_j = 0$:
$$\hat\beta_j \pm t_{\alpha/2,\, n-2}\, S.E.(\hat\beta_j)$$
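A minimal R sketch (same hypothetical dat as above):

fit <- lm(y ~ x, data = dat)
confint(fit, level = 0.95)   # 100(1 - alpha)% CIs for beta0 and beta1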
8 Regression Through the Origin
1. Regression through the origin fixes the intercept at $\beta_0 = 0$, so the fitted line passes through the point $(0, 0)$.
$$y = \beta_1 x + \varepsilon$$
(c) Therefore:
$$\hat\beta_1 = \frac{\sum_{i=1}^n x_i y_i}{\sum_{i=1}^n x_i^2}$$
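A minimal R sketch for the no-intercept fit (hypothetical vectors x and y); dropping the intercept from the formula reproduces the closed-form estimator above:

fit0 <- lm(y ~ 0 + x)    # equivalently lm(y ~ x - 1)
coef(fit0)
sum(x * y) / sum(x^2)    # matches beta1-hat from the formula above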
$$f(y_1, \ldots, y_n) = f(y_1)\cdots f(y_n) = \prod_{i=1}^n (2\pi\sigma^2)^{-\frac{1}{2}} \exp\left\{-\frac{1}{2\sigma^2}(y_i - \beta_0 - \beta_1 x_i)^2\right\}$$
3. The log-likelihood:
4. The MLE $\tilde\beta = (\tilde\beta_0, \tilde\beta_1)$ satisfies:
$$\left.\frac{\partial \ell}{\partial \beta_j}\right|_{\tilde\beta} = \left.\frac{\partial S}{\partial \beta_j}\right|_{\tilde\beta} = 0$$
5. Therefore the MLE of $\beta$ coincides with the LSE:
$$\tilde\beta = \hat\beta$$
The MLE $\tilde\sigma^2$ satisfies:
$$\left.\frac{\partial \ell}{\partial \sigma^2}\right|_{\tilde\sigma^2} = 0$$
Which means:
$$\tilde\sigma^2 = \frac{\sum_{i=1}^n (y_i - \tilde\beta_0 - \tilde\beta_1 x_i)^2}{n} = \frac{SS_{Res}}{n} = \frac{n-2}{n}\,\hat\sigma^2$$
$\tilde\sigma^2$ is not unbiased for $\sigma^2$, since $E(\tilde\sigma^2) = \frac{n-2}{n}\sigma^2 \neq \sigma^2$.
10 Multiple Linear Regression
1. Multiple Linear Regression: A regression model with more than one regressor.
Examples:
$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon \quad \text{(two regressors)}$$
$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon, \quad E(Y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 \quad (x_1 = x,\ x_2 = x^2)$$
$$Y = \beta_0 + \beta_1 x + \beta_2 x^2 + \ldots + \beta_p x^p + \varepsilon \quad \text{(polynomial regression model with } p \text{ regressors)}$$
$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \varepsilon \quad (x_3 = x_1 x_2)$$
3. Some models are non-linear but are intrinsically linear since they
can be transformed into linear models.
4. For a model of type:
Let:
5. We can model this with matrix notation for the general formula:
$$Y = x\beta + \varepsilon$$
where:
$$Y = \begin{pmatrix} Y_1 \\ \vdots \\ Y_n \end{pmatrix}_{n\times 1}, \quad x = \begin{pmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1K} \\ 1 & x_{21} & x_{22} & \cdots & x_{2K} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{nK} \end{pmatrix}_{n\times(K+1)}, \quad \beta = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_K \end{pmatrix}_{(K+1)\times 1}, \quad \varepsilon = \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix}_{n\times 1}$$
10.1 LSE of β
1. Similar to simple linear regression, in multiple linear regression we estimate $\hat\beta$ by minimizing the SSE (residual sum of squares/error sum of squares), denoted:
$$S(\beta) = \sum_{i=1}^n \varepsilon_i^2 = \varepsilon^T\varepsilon$$
2. After we expand $S(\beta)$, take the derivative, set it equal to zero, and solve, we find that the Least Squares Estimator for $\beta$ is:
$$\hat\beta = (x^T x)^{-1} x^T y$$
$$e = y - \hat{y} = y - Hy = (I - H)y$$
where $H = x(x^Tx)^{-1}x^T$ is the hat matrix.
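These matrix formulas can be checked directly in R; a sketch with hypothetical regressors x1, x2 and response y:

X <- cbind(1, x1, x2)                            # n x (K+1) design matrix
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y     # (x'x)^{-1} x'y
H <- X %*% solve(t(X) %*% X) %*% t(X)            # hat matrix
e <- (diag(length(y)) - H) %*% y                 # residuals (I - H)y
cbind(beta_hat, coef(lm(y ~ x1 + x2)))           # agrees with lm()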
6. The expectation of $\hat\beta$:
$$E(\hat\beta) = \beta$$
7. The covariance of $\hat\beta$:
$$Cov(\hat\beta) = \sigma^2 (x^Tx)^{-1}$$
8. The trace of a $k \times k$ matrix $A$ is the sum of its diagonal elements:
$$\mathrm{trace}(A) = \sum_{i=1}^k a_{ii}$$
Hypothesis Testing
1. We can use hypothesis testing for multiple linear regression as follows. The overall test is:
$$H_0: \beta_1 = \beta_2 = \ldots = \beta_k = 0 \quad \text{vs} \quad H_1: \beta_j \neq 0 \text{ for at least one } j$$
The test for an individual coefficient is:
$$H_0: \beta_j = 0 \quad \text{vs} \quad H_1: \beta_j \neq 0, \qquad t = \frac{\hat\beta_j}{S.E.(\hat\beta_j)}$$
$$y = x\beta + \varepsilon$$
(a) $y$: $n \times 1$
(b) $x$: $n \times (K+1)$
(c) $\beta$: $(K+1) \times 1$
(d) $\varepsilon$: $n \times 1$
Where $n$ is the number of rows in the response vector $y$. In our reduced model we repartition the dimensions: let $p = K + 1$ and $r \geq 1$,
$$y_{(n\times 1)} = x_{(n\times p)}\,\beta_{(p\times 1)} + \varepsilon_{(n\times 1)} = \begin{pmatrix} x_{1\,(n\times(p-r))} & x_{2\,(n\times r)} \end{pmatrix}\begin{pmatrix} \beta_{1\,((p-r)\times 1)} \\ \beta_{2\,(r\times 1)} \end{pmatrix} + \varepsilon$$
We can then rewrite our reduced model as:
$$y = x_1\beta_1 + \varepsilon$$
and test $H_0: \beta_2 = 0$ vs $H_1: \beta_2 \neq 0$.
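In R this partial (extra sum of squares) F test can be carried out by comparing the two fits; a sketch assuming a hypothetical data frame dat where x2 and x3 form the block being tested:

full    <- lm(y ~ x1 + x2 + x3, data = dat)
reduced <- lm(y ~ x1, data = dat)    # drops the block beta_2 (here r = 2)
anova(reduced, full)                 # F test of H0: beta_2 = 0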
For example, in the model:
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \varepsilon$$
the quantities
$$SS_R(\beta_1|\beta_0, \beta_2, \beta_3), \quad SS_R(\beta_2|\beta_0, \beta_1, \beta_3), \quad SS_R(\beta_3|\beta_0, \beta_1, \beta_2)$$
measure the contributions to $SS_R$ from $\beta_1, \beta_2, \beta_3$ respectively, when the other two parameters are already in the model.
Multiple R2
1. Recall:
$$R^2 = 1 - \frac{SSE}{SST}$$
To prevent overfitting, we modify $R^2$ to account for the number of regressors used. Instead of the regular (or multiple) $R^2$, we use the Adjusted $R^2$:
$$R^2_{adj} = 1 - \frac{SS_{Res}/(n-p)}{SST/(n-1)} = 1 - \frac{SS_{Res}}{SST}\left(\frac{n-1}{n-p}\right)$$
Unlike $R^2$, which increases monotonically, $R^2_{adj}$ increases only to a point, after which it begins to decrease. Note that
$$R^2_{adj} \leq R^2$$
Confidence Intervals in Multiple Linear Regression
1. For a model:
$$y = x\beta + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2 I_{n\times n})$$
the Least Squares Estimator (LSE) of $\beta$ is
$$\hat\beta = (x^Tx)^{-1}x^Ty \sim N_p(\beta, \sigma^2(x^Tx)^{-1})$$
and
$$\hat\sigma^2 = MS_{Res} = \frac{SS_{Res}}{n-p}$$
is an unbiased estimator of $\sigma^2$.
Where:
Note: the $1+$ term results in a slightly larger S.E. value and thus a wider interval.
6. Simultaneous CIs for all or some $\beta_j$'s. Therefore:
$$F(\hat\beta) = \frac{(\hat\beta - \beta)^T x^T x (\hat\beta - \beta)}{p\, MS_{Res}} \sim F_{p,\, n-p}$$
and the region $\{\beta : F(\hat\beta) \leq F_{\alpha,\, p,\, n-p}\}$ is a $100(1-\alpha)\%$ confidence region for $\beta$.
11 Model Adequacy Checking
1. In a linear model we assume:
(a) $E(y|x)$ is linear in $x$
(b) $E(\varepsilon) = 0$
(c) $V(\varepsilon) = \sigma^2$, which does not depend on $x$
(d) $Cov(\varepsilon_i, \varepsilon_j) = 0$ for $i \neq j$
(e) $\varepsilon \sim N(0, \sigma^2)$
Violations of one or more of the above assumptions may invalidate our inferences and render the model inadequate for the data.
We perform model adequacy checking by checking for violations of
the above assumptions.
2. The main tool for model adequacy checking is to examine the residuals.
There are several types of residuals such as:
(a) Raw Residuals: The residuals we are used to seeing so far:
$$e_j = y_j - \hat{y}_j, \quad j = 1, 2, \ldots, n$$
Where:
i. $y_j$: observed value
ii. $\hat{y}_j$: fitted value using the model
(c) Studentized Residuals: Since $V(e) = \sigma^2(I - H)$,
$$V(e_j) = \sigma^2(1 - h_{jj})$$
$$\therefore \quad \widehat{V}(e_j) = \hat\sigma^2(1 - h_{jj}) = MS_{Res}(1 - h_{jj})$$
The studentized residual is:
$$r_j = \frac{e_j}{\sqrt{MS_{Res}(1 - h_{jj})}}, \quad j = 1, 2, \ldots, n, \qquad V(r_j) \approx 1$$
If $|r_j| > 3$, then observation $j$ is a possible outlier.
(d) PRESS Residuals: The idea is to fit the model omitting one observation and then compare that observation to the fitted value predicted for it. Fitting this many models is cumbersome, so instead we can calculate PRESS residuals as:
$$e_{(j)} = \frac{e_j}{1 - h_{jj}}, \quad j = 1, 2, \ldots, n, \qquad V(e_{(j)}) = \frac{\sigma^2}{1 - h_{jj}}$$
3. Residual Plots: All five assumptions that are assumed with a linear
model can be checked by residual plots. They can also be used to check
for outliers. We can construct a residual plot as follows:
Deviation from this line can potentially (and subjectively) suggest non-linearity.
(a) Fit the linear model
(b) Extract the residuals
(c) Calculate the fitted values: $\hat{y} = $ response ($y$) $-$ residuals ($e$)
(d) Plot the residuals against the fitted values: plot($\hat{y}$, $e$)
If the variance is constant, the range of the residuals should fall within
a uniform band.
A funnel-shaped plot is an indication of non-constant variance.
A quadratic shape is a sign of a missing regressor (such as $x^2$).
An unusually large (in absolute value) residual suggests an outlier. Remove it and refit the model.
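A minimal plotting sketch for a hypothetical fitted model fit:

plot(fitted(fit), resid(fit), xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)    # look for a uniform band around zero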
$$SS_{Res} = \sum_{i=1}^m \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i)^2 + \sum_{i=1}^m \sum_{j=1}^{n_i} (\bar{y}_i - \hat{y}_i)^2 = SS_{PE} + SS_{LOF}$$
(c) Where $SS_{PE}$ is the SS due to pure error and $SS_{LOF}$ is the SS due to lack of fit.
(d) We then derive the test statistic:
$$F_0 = \frac{SS_{LOF}/(m-2)}{SS_{PE}/(n-m)} \sim F_{m-2,\, n-m}$$
8. Lack of fit is not due to the presence of outliers, so using robust regression does not resolve the lack-of-fit problem.
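One way to obtain this lack-of-fit test in R, assuming x has repeated values so pure error can be estimated, is to compare the linear fit with a saturated one-way fit:

fit_lin  <- lm(y ~ x)            # linear model under test
fit_full <- lm(y ~ factor(x))    # one mean per distinct x: pure error only
anova(fit_lin, fit_full)         # F on (m - 2, n - m) df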
12 Transformations and Weightings to Correct Model Inadequacies
1. Some model inadequacies can be addressed using data transformation
and weighting.
2. When we plot residuals versus fitted values and find the variance is not
constant as yb increases, this is a sign that transformations or weightings
are needed.
Relationship                              Transformation
$\sigma^2 \propto$ constant               $y' = y$ (no transformation)
$\sigma^2 \propto E(Y)$                   $y' = \sqrt{y}$
$\sigma^2 \propto [E(Y)]^2$               $y' = \ln(y)$
$\sigma^2 \propto [E(Y)]^3$               $y' = y^{-1/2}$
$\sigma^2 \propto E(Y)(1 - E(Y))$         $y' = \sin^{-1}(\sqrt{y})$
5. Do not compare $R^2$ for the two models, since the response variable differs. $R^2$ is appropriate for assessing the fit of different models only when they all use the same response.
$$y = \beta_0 e^{\beta_1 x}\varepsilon \quad \Rightarrow \quad \log y = \log\beta_0 + \beta_1 x + \log\varepsilon$$
7. Since the precise nonlinear relationship between y and x may not be
known, we can choose a transformation by:
Then choosing the value for λ that minimizes SSRes within the CI.
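Assuming the family indexed by λ is the Box-Cox power transformation, MASS::boxcox produces the profile likelihood and an approximate CI for λ; a sketch with a hypothetical fit:

library(MASS)
fit <- lm(y ~ x, data = dat)
bc <- boxcox(fit, lambda = seq(-2, 2, 0.1))   # plots the profile with a 95% CI
bc$x[which.max(bc$y)]                         # lambda with the highest likelihood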
But $S_0(\beta)$ (the ordinary least-squares criterion) is not optimal when the variance is not constant, so we need a modified error sum of squares that accounts for $Var(\varepsilon) = \sigma^2 V$.
10. The basic idea is to scale the linear model by a matrix $A$:
$$A(y = x\beta + \varepsilon) \quad\Rightarrow\quad Ay = Ax\beta + A\varepsilon$$
Let:
(a) $y_n = Ay$
(b) $x_n = Ax$
(c) $\varepsilon_n = A\varepsilon$
$\therefore$ Our new linear model is:
$$y_n = x_n\beta + \varepsilon_n$$
where $V(\varepsilon_n) = \sigma^2 I$.
11. If we define:
$$S_n(\beta) = (y_n - x_n\beta)^T(y_n - x_n\beta)$$
we call the minimizer of $S_n(\beta)$ the Generalized Least Squares Estimator (GLSE).
Where:
(a) $K = K^T$
(b) $K > 0$
(c) $K \times K = K^2 = V$
(d) $K = V^{\frac{1}{2}}$
If we let $A = K^{-1}$, then
$$\varepsilon_n = A\varepsilon = K^{-1}\varepsilon = V^{-\frac{1}{2}}\varepsilon$$
Also:
$$E(\varepsilon_n) = 0, \qquad Var(\varepsilon_n) = K^{-1}Var(\varepsilon)K^{-1} = V^{-\frac{1}{2}}\sigma^2 V\, V^{-\frac{1}{2}} = \sigma^2 I$$
Now, $A = K^{-1} = V^{-\frac{1}{2}}$ and $A^2 = V^{-1}$.
13. $E(\hat\beta) = \beta$ (unbiased)
14. $Var(\hat\beta) = \sigma^2(x^TV^{-1}x)^{-1}$
$$\hat\beta = (x^TA^2x)^{-1}x^TA^2y = (x^TV^{-1}x)^{-1}x^TV^{-1}y$$
This $\hat\beta$ is unbiased and has smaller variance than the original LSE $\hat\beta_0$, which minimizes:
$$S(\beta) = (y - x\beta)^T(y - x\beta)$$
$$\hat\beta = (x^Twx)^{-1}x^Twy$$
Where:
$$w_i = \frac{1}{V(\varepsilon_i)}$$
We use the WLSE when the errors $\varepsilon_i$ are uncorrelated but $Var(\varepsilon_i)$ depends on $i$.
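In R the WLSE is obtained through the weights argument of lm; a sketch assuming the error variances v_eps are known (hypothetical):

w <- 1 / v_eps                       # w_i = 1 / V(eps_i)
fit_wls <- lm(y ~ x, weights = w)    # minimizes sum_i w_i (y_i - b0 - b1 x_i)^2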
19. When the $\varepsilon_i$'s are uncorrelated, and hence the WLSE is appropriate, we can find $W$ by:
(a) $E(\varepsilon) = 0$
(b) $Var(\varepsilon) = \sigma^2 I$
(c) $\hat\beta = (x^Tx)^{-1}x^Ty$
(d) $E(\hat\beta) = \beta$
(e) $Var(\hat\beta) = \sigma^2(x^Tx)^{-1}$
For a linear estimator $\tilde\beta = cy$ to be unbiased:
$$E(\tilde\beta) = \beta \;\Rightarrow\; cE(y) = \beta \;\Rightarrow\; cx\beta = \beta \text{ for all } \beta \;\Rightarrow\; cx = I$$
21. Theorem: The LSE $\hat\beta$ is the best linear unbiased estimator (BLUE) under Model I, in that:
$$Var(\tilde\beta) \geq Var(\hat\beta) \quad \text{for any linear unbiased estimator } \tilde\beta$$
(a) $y = x\beta + \varepsilon$
(b) $E(\varepsilon) = 0$
(c) $Var(\varepsilon) = \sigma^2 V \neq \sigma^2 I$
For Model II, the GLSE is:
$$\hat\beta = (x^TV^{-1}x)^{-1}x^TV^{-1}y$$
and it is BLUE under Model II: $Var(\tilde\beta) \geq Var(\hat\beta)$ for any linear unbiased estimator $\tilde\beta$.
(a) Leverage point: A point that is remote in the $x$ space but lies almost on the regression line. Leverage points don't affect the regression coefficients much, but have a dramatic effect on model summary statistics such as $R^2$ and the standard errors of the regression coefficients.
(b) Influence point: A point that is remote in the $y$ space (e.g. a data point underneath a cluster of other data points). An influence point has a noticeable impact on the model coefficients, in that it "pulls" the regression model in its direction.
(c) Leverage and influence point: A point that is remote in both the $x$ and $y$ directions.
$$\therefore \quad \sum_{i=1}^n h_{ii} = p \quad \text{and} \quad \frac{1}{n}\sum_{i=1}^n h_{ii} = \frac{p}{n}$$
If $h_{ii} > \frac{2p}{n}$, then $(x_i, y_i)$ is a leverage point.
3. We can measure influence using Cook's Distance for data points $(x_i, y_i)$, $i = 1, 2, \ldots, n$, as follows:
(a) Calculate $\hat\beta$: the LSE based on all $n$ points.
(b) Calculate $\hat\beta_{(i)}$: the LSE based on all but the $i$th point ($(x_i, y_i)$ deleted).
Idea: if $(x_i, y_i)$ is not influential, then $\hat\beta$ and $\hat\beta_{(i)}$ will be similar.
$$D_i(M, C) = \frac{(\hat\beta_{(i)} - \hat\beta)^T M (\hat\beta_{(i)} - \hat\beta)}{C}$$
If $D_i > F_{0.5,\, p,\, n-p} \approx 1$, then $(x_i, y_i)$ is influential.
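Both the leverage and influence diagnostics are available in R for a (hypothetical) fitted model fit; a sketch:

h <- hatvalues(fit)
p <- length(coef(fit)); n <- nobs(fit)
which(h > 2 * p / n)         # leverage points
D <- cooks.distance(fit)
which(D > 1)                 # influential points, using the D_i > ~1 rule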
4. Robust Fit Residuals can also be used to find influential data points, as follows:
(a) Fit $\hat\beta_R$, a robust estimate of $\beta$.
(b) Compute $\hat\sigma_R$, a robust estimate of $\sigma$.
(d) Compute $\hat{e}_R = y - \hat{y}_R$, the robust residuals.
A point $(x_i, y_i)$ is flagged as influential if $\frac{\hat{e}_{Ri}}{\hat\sigma_R}$ is large in absolute value.
14 Polynomial Regression
Polynomial models are useful when curvilinear effects are present in the true
response function. They are also useful as approximating functions to un-
known and possibly very complex nonlinear relationships. In this sense, the
polynomial model is just the Taylor series expansion of the unknown function.
1. Polynomial Regression Models: A subclass of linear regression
models of the general form y = xβ + . Some examples of polynomial
regression models with one regressor are:
(a) $y = \beta_0 + \beta_1 x + \varepsilon$ (first order)
(b) $y = \beta_0 + \beta_1 x + \beta_2 x^2 + \varepsilon$ (second order, quadratic)
(c) $y = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \varepsilon$ (third order, cubic)
Polynomial regression models with two regressors are:
(a) $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$ (first order)
(b) $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1^2 + \beta_4 x_1 x_2 + \beta_5 x_2^2 + \varepsilon$ (second order)
2. We can fit a cubic model in R as:
x2 <- x * x
x3 <- x * x * x
fit <- lm(y ~ x + x2 + x3)
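Equivalently, the powers can be built inside the formula, avoiding the helper variables (a sketch; poly with raw = TRUE gives the same raw-polynomial fit):

fit <- lm(y ~ x + I(x^2) + I(x^3))       # I() protects the powers in the formula
fit <- lm(y ~ poly(x, 3, raw = TRUE))    # same fit via a raw polynomial basis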
$$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \ldots + \beta_k x^k + \varepsilon$$
5. Backward Selection: An approach to model building where we appropriately fit the highest-order model and then delete terms, one at a time starting with the highest order, until the highest-order remaining term has a significant $t$ statistic.
Q-Q plots are useful for comparing models of varying degrees (better models will have Q-Q plots of the residuals that appear more linear).
7. Ill-Conditioning:
Important: Generally we require the function values and the first $k - 1$ derivatives to agree at the knots, so that the spline is a continuous function with $k - 1$ continuous derivatives.
(a) k = 3.
(b) h knots t1 < t2 < . . . < th .
(c) Continuous first and second derivatives
where:
$$(x - t_i)_+ = \begin{cases} x - t_i & \text{if } x > t_i \\ 0 & \text{if } x \leq t_i \end{cases}$$
Intuitively, this function "activates" the third-order polynomial portion of the model once $x$ enters the region where $x > t_i$.
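A tiny R helper for the truncated power term (a sketch; tpower is a hypothetical name):

tpower <- function(x, t) ifelse(x > t, x - t, 0)   # (x - t)_+
# e.g. the cubic term for knot t_i: tpower(x, t_i)^3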
Or:
$$y = \sum_{j=0}^{3} \beta_{0j} x^j + \sum_{k=1}^{h} \beta_k z_k + \varepsilon$$
Where:
13. General Cubic Spline: A cubic spline with no continuity of any kind imposed on $S(x)$, denoted:
$$E(y) = S(x) = \sum_{j=0}^{3} \beta_{0j} x^j + \sum_{i=1}^{h}\sum_{j=0}^{3} \beta_{ij}(x - t_i)^j_+$$
Note that S(x), S 0 (x) and S 00 (x) are not necessarily continuous at t. To
determine whether imposing continuity restrictions reduces the quality
of the fit, test the hypotheses:
(a) Moving Average and Kernel Smoothing: An approach that uses a weighted average of the data. Suppose $x_1 < x_2 < \ldots < x_n$ are equally spaced. Then
$$\hat{y}_i = \frac{y_{i-1} + y_i + y_{i+1}}{3}, \quad i = 2, 3, \ldots, n-1$$
is a three-point moving average.
Let $\tilde{y}_i$ be the kernel smoother estimate of the $i$th response; then:
$$\tilde{y}_i = \sum_{j=1}^{n} w_{ij} y_j, \quad \text{where } \sum_{j=1}^{n} w_{ij} = 1$$
Ex.
$$\hat{y}_i = \frac{1}{4}y_{i-1} + \frac{1}{2}y_i + \frac{1}{4}y_{i+1}$$
is a weighted average of adjacent points. Equal spacing is not necessary for weighted averages.
The $w_{ij}$ can be generated using a kernel function $K(\cdot)$ satisfying:
i. $K(t) \geq 0$
ii. $\int_{-\infty}^{\infty} K(t)\, dt = 1$
iii. $K(-t) = K(t)$
$\therefore$ $K(t)$ is a symmetric density function with respect to 0. We can calculate $w_{ij}$ as follows:
$$w_{ij} = \frac{K\left(\frac{x_j - x_i}{b}\right)}{\sum_{\ell=1}^{n} K\left(\frac{x_\ell - x_i}{b}\right)}$$
where $b$ is the bandwidth.
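A sketch of the kernel smoother in R using a Gaussian kernel (the bandwidth b and the function name are assumptions):

kernel_smooth <- function(x, y, b) {
  sapply(seq_along(x), function(i) {
    k <- dnorm((x - x[i]) / b)    # K((x_j - x_i)/b): symmetric, integrates to 1
    w <- k / sum(k)               # weights w_ij sum to 1
    sum(w * y)                    # y-tilde_i
  })
}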
The loess procedure then uses the weights in the neighbourhood to
generate a weighted least-squares estimate of the specific response.
The weights for the weighted least-squares portion of the estimate
are based on the distance of the points used in the estimation from
the specific location of interest.
Most software packages use the tri-cube weighting function as the default:
$$W\left(\frac{x_0 - x_j}{\Delta(x_0)}\right)$$
Where:
i. x0 : the specific location of interest
ii. ∆(x0 ): the distance the farthest point in the neighbourhood
lies from the specific location of interest
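A minimal loess sketch in R (hypothetical data frame dat); span controls the neighbourhood size, and the tri-cube weights are the default:

fit_lo <- loess(y ~ x, data = dat, span = 0.75, degree = 2)
plot(dat$x, dat$y)
lines(sort(dat$x), predict(fit_lo)[order(dat$x)])   # add the smooth curve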
15 Variable Selection and Model Building
The problem is there can be many regressors under consideration and not
all of them are needed. We want to identify the ”best subset” for the linear
model.
Big data occurs when there are more potential regressors than observations
(more columns than rows).
3. General Strategy:
(a) Fit the full model (the one with all regressors)
(b) Perform a thorough analysis of fit: R2 , F -tests, t-tests etc.
(c) Determine if transformation(s) are needed
(d) Use t-tests to edit the model (remove non-significant variables)
(e) Repeat step (b) for edited model
4. Consequence of misspecification: Suppose the true model is:
$$y = x\beta + \varepsilon$$
where $x$ is $n \times (K+1)$ and $K$ is the number of regressors. We can split the model into two chunks:
$$y = x_p\beta_p + x_r\beta_r + \varepsilon$$
where:
(a) $x_p = (x_1, \ldots, x_{p-1})$ (first $p-1$ regressors)
(b) $x_r = (x_p, \ldots, x_K)$ (last $r$ regressors)
(c) $(p-1) + r = K$
(d) LSE: $\hat\beta = \begin{pmatrix} \hat\beta_p \\ \hat\beta_r \end{pmatrix} = (x^Tx)^{-1}x^Ty$
Comparisons:
$$\hat\beta = \begin{pmatrix} \hat\beta_p \\ \hat\beta_r \end{pmatrix} \text{ is unbiased for } \beta = \begin{pmatrix} \beta_p \\ \beta_r \end{pmatrix}$$
But $\tilde\beta_p$ (the estimator from the reduced model) is not usually unbiased for $\beta_p$ (which is bad).
However, $Var(\hat\beta_p) \geq Var(\tilde\beta_p)$ (which is good).
Main takeaway: We can tell when it might be acceptable to use $\tilde\beta_p$ based on the MSE criterion. If we are analyzing a model $\hat\theta$, then:
5. The criteria used in variable/model selection are:
(a) $R^2$: The coefficient of determination.
$$R^2 = \frac{SS_R}{SST}$$
For a model with $p$ coefficients ($p - 1$ regressors):
$$R_p^2 = \frac{SS_R(p)}{SST}$$
As $p$ increases, so does $R_p^2$ (overfitting). We can graph the $R^2$ of all $\binom{k}{p-1}$ models and select our model based on the value where adding further regressors sees no realistic improvement in $R^2$ (the curve turns flat). This method is not commonly used.
(b) $R^2_{adj}$: Not a monotonically increasing function of $p$; it forms an arch. We choose $p_0$ as the model that maximizes $R^2_{adj,p}$, where:
$$R^2_{adj} = 1 - \frac{n-1}{n-p}(1 - R_p^2)$$
(c) $MS_{Res}$: The residual mean square. Let:
$$MS_{Res}(p) = \frac{SS_{Res}(p)}{n-p}$$
We choose $p$ to minimize $MS_{Res}(p)$.
Important: (b) and (c) are equivalent since:
$$R^2_{adj,p} = 1 - \frac{n-1}{SST}\, MS_{Res}(p)$$
(d) Mallows' $C_p$ Statistic: A standardized total mean square:
$$C_p = \frac{SS_{Res}(p)}{\hat\sigma^2} - n + 2p$$
where $\hat\sigma^2$ is from the full model. This works since, if we let $p_0$ be the optimal number of coefficients, then:
$$E(C_{p_0}) \approx E\left(\frac{SS_{Res}(p_0)}{\sigma^2} - n + 2p_0\right) = p_0$$
Therefore, we want to choose the model where $C_p$ is as close to $p$ as possible.
(e) Akaike's Information Criterion (AIC): The goal is to minimize:
$$AIC_p = n\log\left(\frac{SS_{Res}(p)}{n}\right) + 2p$$
where $2p$ is a penalty term.
(f) Bayesian Information Criterion (BIC): The goal is to minimize:
$$BIC_p = n\log\left(\frac{SS_{Res}(p)}{n}\right) + p\log n$$
where $p\log n$ is a penalty term.
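A sketch of criterion-based selection in R on a hypothetical full model:

full <- lm(y ~ x1 + x2 + x3 + x4 + x5, data = dat)
AIC(full); BIC(full)                                     # compare candidate models directly
step(full, direction = "backward", k = 2)                # stepwise search by AIC
step(full, direction = "backward", k = log(nrow(dat)))   # k = log(n) gives a BIC-type penalty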
(g) $PRESS_p$:
$$y = \beta_0 + \beta_j x_j + \varepsilon, \quad j = 1, 2, \ldots, 5$$
$$y = \beta_0 + \beta_1 x_3 + \beta_2 x_j + \varepsilon, \quad j = 1, 2, 4, 5$$
v. Suppose in the last step $F_{2,3}$ was chosen. Then, examine all three-variable models that contain $x_2, x_3$:
$$y = \beta_0 + \beta_1 x_3 + \beta_2 x_2 + \beta_3 x_j + \varepsilon, \quad j = 1, 4, 5$$
$$F_4 = \min\{F_1, F_2, F_3, F_4, F_5\}$$
7. Strategy for Variable Selection and Model Building:
16 Indicator Variables
1. Quantitative Variables: metrics such as temperature, distance, height
(continuous)
4. There are different methods for fitting models with qualitative variables:
(a) Method 1: Divide the observations into groups for each combination of categorical variables. Then fit a simple model for each group.
(b) Method 2: Assume a common slope and common error variance. We define the models as:
$$y = \beta_0 + \beta_1 x_1 + \varepsilon \quad (A)$$
$$y = (\beta_0 + \beta_2) + \beta_1 x_1 + \varepsilon \quad (B)$$
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon \quad (C)$$
using a dummy variable $x_2$ for the qualitative variable. We can fit model (C) in R as:
lm(y ~ x1 + x2)
Allowing different slopes as well, the models become:
$$y = \beta_0 + \beta_1 x_1 + \varepsilon \quad (A)$$
$$y = (\beta_0 + \beta_2) + (\beta_1 + \beta_3) x_1 + \varepsilon \quad (B)$$
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \varepsilon = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \varepsilon \quad (C)$$
where $x_3 = x_1 x_2$. Again, using a dummy variable $x_2$ to represent the qualitative variable, we can fit this model in R as:
lm(y ~ x1 + x2 + x3)
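Equivalently, R's formula shorthand builds the interaction model in one step:

lm(y ~ x1 * x2)    # expands to x1 + x2 + x1:x2, the same interaction model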
5. One categorical variable with three or more levels: Suppose we
have a qualitative variable with three categories:
$$x_2 = \begin{cases} 0 & \text{if } i \in \text{group A} \\ 1 & \text{if } i \in \text{group B} \\ 2 & \text{if } i \in \text{group C} \end{cases}$$
With two indicator variables instead:
        x2   x3
A        0    0
AB       1    0
O        0    1
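In practice R builds these indicator columns automatically from a factor; a sketch with a hypothetical grouping variable:

dat$group <- factor(dat$group, levels = c("A", "AB", "O"))
fit <- lm(y ~ x1 + group, data = dat)     # A is the baseline; coefficients for AB and O
model.matrix(~ group, data = dat)         # shows the 0/1 indicator coding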
44