Stat 353 Study Guide
Nils DM
December 19, 2020
(a) $V(c) = 0$
(b) $V(cX) = c^2 V(X)$
(c) $V(aX + b) = a^2 V(X)$
(d) $V(X_1 + X_2) = V(X_1) + V(X_2) + 2\,Cov(X_1, X_2)$
(e) $V(aX_1 + bX_2) = a^2 V(X_1) + b^2 V(X_2) + 2ab\,Cov(X_1, X_2)$
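These identities are easy to sanity-check by simulation; a minimal R sketch (made-up data) for identity (e):

set.seed(353)
x1 <- rnorm(1e5, sd = 2)
x2 <- 0.5 * x1 + rnorm(1e5)                               # correlated with x1
a <- 3; b <- -2
var(a * x1 + b * x2)                                      # empirical V(aX1 + bX2)
a^2 * var(x1) + b^2 * var(x2) + 2 * a * b * cov(x1, x2)   # right-hand side of (e)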
2 Probability Distributions
1. Normal Distribution
5. Gamma Distribution
(a) $\Gamma(k) = \int_0^\infty t^{k-1} e^{-t}\, dt$
3 Distribution Theory
A selected list of useful identities is:
$$\frac{(n-1)s^2}{\sigma^2} \sim \chi^2_{n-1}$$
$$X \pm Y \sim \chi^2_{m \pm n}$$
$$T = \frac{Z}{\sqrt{X/K}} \sim t_K$$
7. If $X \sim F_{m,n}$, then $\frac{1}{X} \sim F_{n,m}$
4 Regression Analysis
1. Regression Analysis: A set of statistical processes for estimating the relationship between a dependent variable ($y$; response, labels) and one or more independent variables ($x$; features, predictors).
$$y = \beta_0 + \beta_1 x + \varepsilon$$
where:
$$E(Y|X) = \beta_0 + \beta_1 x, \qquad V(Y|X) = V(\beta_0 + \beta_1 x + \varepsilon) = V(\varepsilon) = \sigma^2$$
4. Important Terms and Their Formulas
(a) Total Sum of Squares: $SST = \sum_{i=1}^n (y_i - \bar{y})^2$, the sum over all squared differences between the observations and their overall mean $\bar{y}$.
(b) Residual Sum of Squares:
$$SSE/RSS/SS_{Res} = \sum_{i=1}^n e_i^2 = \sum_{i=1}^n (y_i - \hat{y}_i)^2$$
The sum of the squared residuals (or the error sum of squares). It is a measure of the discrepancy between the data and an estimation model. A small value represents a close fit.
(c) Residual Mean Square:
$$MS_{Res} = \frac{SS_{Res}}{n-2}$$
(g) The Coefficient of Determination R2 :
$$R^2 = 1 - \frac{SSE}{SST}$$
The proportion of variance in the dependent variable that is predictable from the independent variable. Values close to zero indicate a poor fit; values closer to one indicate a better fit.
5. Errors and Residuals: Two closely related and easily confused measures of deviation.
(a) $V(\hat\beta_0)$:
$$\sigma^2\left(\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}\right)$$
(b) $V(\hat\beta_1)$:
$$\frac{\sigma^2}{S_{xx}}$$
7. Confidence Interval: A way of computing a range of possible values
for an unknown parameter such as the mean.
5 Testing Hypotheses
1. Since $V(\hat\beta_1) = \frac{\sigma^2}{S_{xx}}$,
$$\widehat{V}(\hat\beta_1) = \frac{\hat\sigma^2}{S_{xx}} = \frac{MS_{Res}}{S_{xx}} = \frac{\sum_{i=1}^n (y_i - \hat{y}_i)^2/(n-2)}{\sum_{i=1}^n (x_i - \bar{x})^2}$$
Therefore:
$$S.E.(\hat\beta_1) = \sqrt{\widehat{V}(\hat\beta_1)} = \frac{\hat\sigma}{\sqrt{S_{xx}}} = \sqrt{\frac{MS_{Res}}{S_{xx}}} = \sqrt{\frac{\sum_{i=1}^n (y_i - \hat{y}_i)^2/(n-2)}{\sum_{i=1}^n (x_i - \bar{x})^2}}$$
2. Similarly:
$$V(\hat\beta_0) = \sigma^2\left(\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}\right)$$
Therefore:
$$S.E.(\hat\beta_0) = \sqrt{MS_{Res}\left(\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}\right)} = \sqrt{\frac{\sum_{i=1}^n (y_i - \hat{y}_i)^2}{n-2}\left(\frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^n (x_i - \bar{x})^2}\right)}$$
6 Testing Significance of Regression
1. For a model:
$$y = \beta_0 + \beta_1 x + \varepsilon$$
We say $x$ and $y$ have no linear relationship if the slope $\beta_1 = 0$. We can test for significance in the following two ways:
(a) Method 1:
$$t = \frac{\hat\beta_1}{S.E.(\hat\beta_1)} \sim t_{n-2}$$
(b) Method 2:
$$F = \frac{MS_R}{MS_{Res}} \sim F_{1,\, n-2}$$
where:
$$MS_R = \frac{SS_R}{1^*}, \qquad MS_{Res} = \frac{SS_{Res}}{n-2}$$
*The denominators are the degrees of freedom. This is a general formula that can be extended to multiple regression.
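In R, both methods can be read off a fitted lm object; a minimal sketch, assuming a hypothetical data frame dat with columns y and x:

fit <- lm(y ~ x, data = dat)
summary(fit)   # t statistic and p-value for H0: beta1 = 0 (Method 1)
anova(fit)     # F = MSR / MSRes on 1 and n-2 df (Method 2)

For simple linear regression the two tests agree: the F statistic is the square of the t statistic.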
7 Interval Estimation
1. A $100(1-\alpha)\%$ confidence interval for $\beta_j$ ($j = 0, 1$) is based on the same $t$ statistics used for $t$-tests of $H_0: \beta_j = 0$:
$$\hat\beta_j \pm t_{\alpha/2,\, n-2}\, S.E.(\hat\beta_j)$$
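A minimal R sketch (same hypothetical dat as above):

fit <- lm(y ~ x, data = dat)
confint(fit, level = 0.95)   # 100(1 - alpha)% CIs for beta0 and beta1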
8 Regression Through the Origin
1. Regression through the origin fixes the intercept at $\beta_0 = 0$, so the fitted line passes through the point $(0, 0)$.
$$y = \beta_1 x + \varepsilon$$
(c) Therefore:
$$\hat\beta_1 = \frac{\sum_{i=1}^n x_i y_i}{\sum_{i=1}^n x_i^2}$$
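A minimal R sketch for the no-intercept fit (hypothetical vectors x and y); dropping the intercept from the formula reproduces the closed-form estimator above:

fit0 <- lm(y ~ 0 + x)    # equivalently lm(y ~ x - 1)
coef(fit0)
sum(x * y) / sum(x^2)    # matches beta1-hat from the formula above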
$$f(y_1, \ldots, y_n) = f(y_1)\cdots f(y_n) = \prod_{i=1}^n (2\pi\sigma^2)^{-\frac{1}{2}} \exp\left\{-\frac{1}{2\sigma^2}(y_i - \beta_0 - \beta_1 x_i)^2\right\}$$
3. The log-likelihood:
4. The MLE $\tilde\beta = (\tilde\beta_0, \tilde\beta_1)$ satisfies:
$$\left.\frac{\partial \ell}{\partial \beta_j}\right|_{\tilde\beta} = \left.\frac{\partial S}{\partial \beta_j}\right|_{\tilde\beta} = 0$$
5. Therefore the MLE of $\beta$ coincides with the LSE:
$$\tilde\beta = \hat\beta$$
The MLE $\tilde\sigma^2$ satisfies:
$$\left.\frac{\partial \ell}{\partial \sigma^2}\right|_{\tilde\sigma^2} = 0$$
Which means:
$$\tilde\sigma^2 = \frac{\sum_{i=1}^n (y_i - \tilde\beta_0 - \tilde\beta_1 x_i)^2}{n} = \frac{SS_{Res}}{n} = \frac{n-2}{n}\,\hat\sigma^2$$
$\tilde\sigma^2$ is not unbiased for $\sigma^2$, since $E(\tilde\sigma^2) = \frac{n-2}{n}\sigma^2 \neq \sigma^2$.
10 Multiple Linear Regression
1. Multiple Linear Regression: A regression model with more than one regressor.
Examples:
$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon \quad \text{(two regressors)}$$
$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon, \quad E(Y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 \quad (x_1 = x,\ x_2 = x^2)$$
$$Y = \beta_0 + \beta_1 x + \beta_2 x^2 + \ldots + \beta_p x^p + \varepsilon \quad \text{(polynomial regression model with } p \text{ regressors)}$$
$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \varepsilon \quad (x_3 = x_1 x_2)$$
3. Some models are non-linear but are intrinsically linear since they
can be transformed into linear models.
4. For a model of type:
Let:
5. We can model this with matrix notation for the general formula:
$$Y = x\beta + \varepsilon$$
where:
$$Y = \begin{pmatrix} Y_1 \\ \vdots \\ Y_n \end{pmatrix}_{n\times 1}, \quad x = \begin{pmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1K} \\ 1 & x_{21} & x_{22} & \cdots & x_{2K} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{nK} \end{pmatrix}_{n\times(K+1)}, \quad \beta = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_K \end{pmatrix}_{(K+1)\times 1}, \quad \varepsilon = \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix}_{n\times 1}$$
10.1 LSE of β
1. Similar to simple linear regression, in multiple linear regression we estimate $\hat\beta$ by minimizing the SSE (residual sum of squares/error sum of squares), denoted:
$$S(\beta) = \sum_{i=1}^n \varepsilon_i^2 = \varepsilon^T\varepsilon$$
2. After we expand $S(\beta)$, take the derivative, set it equal to zero, and solve, we find that the Least Squares Estimator for $\beta$ is:
$$\hat\beta = (x^T x)^{-1} x^T y$$
$$e = y - \hat{y} = y - Hy = (I - H)y$$
where $H = x(x^Tx)^{-1}x^T$ is the hat matrix.
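These matrix formulas can be checked directly in R; a sketch with hypothetical regressors x1, x2 and response y:

X <- cbind(1, x1, x2)                            # n x (K+1) design matrix
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y     # (x'x)^{-1} x'y
H <- X %*% solve(t(X) %*% X) %*% t(X)            # hat matrix
e <- (diag(length(y)) - H) %*% y                 # residuals (I - H)y
cbind(beta_hat, coef(lm(y ~ x1 + x2)))           # agrees with lm()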
6. The expectation of $\hat\beta$:
$$E(\hat\beta) = \beta$$
7. The covariance of $\hat\beta$:
$$Cov(\hat\beta) = \sigma^2 (x^Tx)^{-1}$$
8. The trace of a $k \times k$ matrix $A$ is the sum of its diagonal elements:
$$\mathrm{trace}(A) = \sum_{i=1}^k a_{ii}$$
Hypothesis Testing
1. We can use hypothesis testing for multiple linear regression as follows. The overall test is:
$$H_0: \beta_1 = \beta_2 = \ldots = \beta_k = 0 \quad \text{vs} \quad H_1: \beta_j \neq 0 \text{ for at least one } j$$
The test for an individual coefficient is:
$$H_0: \beta_j = 0 \quad \text{vs} \quad H_1: \beta_j \neq 0, \qquad t = \frac{\hat\beta_j}{S.E.(\hat\beta_j)}$$
$$y = x\beta + \varepsilon$$
(a) $y$: $n \times 1$
(b) $x$: $n \times (K+1)$
(c) $\beta$: $(K+1) \times 1$
(d) $\varepsilon$: $n \times 1$
Where $n$ is the number of rows in the response vector $y$. In our reduced model we repartition the dimensions: let $p = K + 1$ and $r \geq 1$,
$$y_{(n\times 1)} = x_{(n\times p)}\,\beta_{(p\times 1)} + \varepsilon_{(n\times 1)} = \begin{pmatrix} x_{1\,(n\times(p-r))} & x_{2\,(n\times r)} \end{pmatrix}\begin{pmatrix} \beta_{1\,((p-r)\times 1)} \\ \beta_{2\,(r\times 1)} \end{pmatrix} + \varepsilon$$
We can then rewrite our reduced model as:
$$y = x_1\beta_1 + \varepsilon$$
and test $H_0: \beta_2 = 0$ vs $H_1: \beta_2 \neq 0$.
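In R this partial (extra sum of squares) F test can be carried out by comparing the two fits; a sketch assuming a hypothetical data frame dat where x2 and x3 form the block being tested:

full    <- lm(y ~ x1 + x2 + x3, data = dat)
reduced <- lm(y ~ x1, data = dat)    # drops the block beta_2 (here r = 2)
anova(reduced, full)                 # F test of H0: beta_2 = 0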
For example, in the model:
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \varepsilon$$
the quantities
$$SS_R(\beta_1|\beta_0, \beta_2, \beta_3), \quad SS_R(\beta_2|\beta_0, \beta_1, \beta_3), \quad SS_R(\beta_3|\beta_0, \beta_1, \beta_2)$$
measure the contributions to $SS_R$ from $\beta_1, \beta_2, \beta_3$ respectively, when the other two parameters are already in the model.
Multiple R2
1. Recall:
$$R^2 = 1 - \frac{SSE}{SST}$$
To prevent overfitting, we modify $R^2$ to account for the number of regressors used. Instead of the regular (or multiple) $R^2$, we use the Adjusted $R^2$:
$$R^2_{adj} = 1 - \frac{SS_{Res}/(n-p)}{SST/(n-1)} = 1 - \frac{SS_{Res}}{SST}\left(\frac{n-1}{n-p}\right)$$
Unlike $R^2$, which increases monotonically, $R^2_{adj}$ increases only to a point, after which it begins to decrease. Note that
$$R^2_{adj} \leq R^2$$
Confidence Intervals in Multiple Linear Regression
1. For a model:
$$y = x\beta + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2 I_{n\times n})$$
the Least Squares Estimator (LSE) of $\beta$ is
$$\hat\beta = (x^Tx)^{-1}x^Ty \sim N_p(\beta, \sigma^2(x^Tx)^{-1})$$
and
$$\hat\sigma^2 = MS_{Res} = \frac{SS_{Res}}{n-p}$$
is an unbiased estimator of $\sigma^2$.
Where:
Note: the $1+$ term results in a slightly larger S.E. value and thus a wider interval.
6. Simultaneous CIs for all or some $\beta_j$'s. Therefore:
$$F(\hat\beta) = \frac{(\hat\beta - \beta)^T x^T x (\hat\beta - \beta)}{p\, MS_{Res}} \sim F_{p,\, n-p}$$
and the region $\{\beta : F(\hat\beta) \leq F_{\alpha,\, p,\, n-p}\}$ is a $100(1-\alpha)\%$ confidence region for $\beta$.
11 Model Adequacy Checking
1. In a linear model we assume:
(a) $E(y|x)$ is linear in $x$
(b) $E(\varepsilon) = 0$
(c) $V(\varepsilon) = \sigma^2$, which does not depend on $x$
(d) $Cov(\varepsilon_i, \varepsilon_j) = 0$ for $i \neq j$
(e) $\varepsilon \sim N(0, \sigma^2)$
Violations of one or more of the above assumptions may invalidate our inferences and render the model inadequate for the data.
We perform model adequacy checking by checking for violations of
the above assumptions.
2. The main tool for model adequacy checking is to examine the residuals.
There are several types of residuals such as:
(a) Raw Residuals: The residuals we are used to seeing so far:
$$e_j = y_j - \hat{y}_j, \quad j = 1, 2, \ldots, n$$
Where:
i. $y_j$: observed value
ii. $\hat{y}_j$: fitted value using the model
(c) Studentized Residuals: Since $V(e) = \sigma^2(I - H)$,
$$V(e_j) = \sigma^2(1 - h_{jj})$$
$$\therefore \quad \widehat{V}(e_j) = \hat\sigma^2(1 - h_{jj}) = MS_{Res}(1 - h_{jj})$$
The studentized residual is:
$$r_j = \frac{e_j}{\sqrt{MS_{Res}(1 - h_{jj})}}, \quad j = 1, 2, \ldots, n, \qquad V(r_j) \approx 1$$
If $|r_j| > 3$, then observation $j$ is a possible outlier.
(d) PRESS Residuals: The idea is to fit the model omitting one observation and then compare that observation to the fitted value predicted for it. Fitting this many models is cumbersome, so instead we can calculate PRESS residuals as:
$$e_{(j)} = \frac{e_j}{1 - h_{jj}}, \quad j = 1, 2, \ldots, n, \qquad V(e_{(j)}) = \frac{\sigma^2}{1 - h_{jj}}$$
3. Residual Plots: All five assumptions that are assumed with a linear
model can be checked by residual plots. They can also be used to check
for outliers. We can construct a residual plot as follows:
Deviation from this line can potentially (and subjectively) suggest non-linearity.
(a) Fit the linear model
(b) Extract the residuals
(c) Calculate the fitted values: $\hat{y} = $ response ($y$) $-$ residuals ($e$)
(d) Plot the residuals against the fitted values: plot($\hat{y}$, $e$)
If the variance is constant, the range of the residuals should fall within
a uniform band.
A funnel-shaped plot is an indication of non-constant variance.
A quadratic shape is a sign of a missing regressor (such as $x^2$).
An unusually large (in absolute value) residual suggests an outlier. Remove it and refit the model.
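A minimal plotting sketch for a hypothetical fitted model fit:

plot(fitted(fit), resid(fit), xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)    # look for a uniform band around zero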
$$SS_{Res} = \sum_{i=1}^m \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i)^2 + \sum_{i=1}^m \sum_{j=1}^{n_i} (\bar{y}_i - \hat{y}_i)^2 = SS_{PE} + SS_{LOF}$$
(c) Where $SS_{PE}$ is the SS due to pure error and $SS_{LOF}$ is the SS due to lack of fit.
(d) We then derive the test statistic:
$$F_0 = \frac{SS_{LOF}/(m-2)}{SS_{PE}/(n-m)} \sim F_{m-2,\, n-m}$$
8. Lack of fit is not due to the presence of outliers, so using robust regression does not resolve the lack-of-fit problem.
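One way to obtain this lack-of-fit test in R, assuming x has repeated values so pure error can be estimated, is to compare the linear fit with a saturated one-way fit:

fit_lin  <- lm(y ~ x)            # linear model under test
fit_full <- lm(y ~ factor(x))    # one mean per distinct x: pure error only
anova(fit_lin, fit_full)         # F on (m - 2, n - m) df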
12 Transformations and Weightings to Correct Model Inadequacies
1. Some model inadequacies can be addressed using data transformation
and weighting.
2. When we plot residuals versus fitted values and find the variance is not
constant as yb increases, this is a sign that transformations or weightings
are needed.
Relationship                              Transformation
$\sigma^2 \propto$ constant               $y' = y$ (no transformation)
$\sigma^2 \propto E(Y)$                   $y' = \sqrt{y}$
$\sigma^2 \propto [E(Y)]^2$               $y' = \ln(y)$
$\sigma^2 \propto [E(Y)]^3$               $y' = y^{-1/2}$
$\sigma^2 \propto E(Y)(1 - E(Y))$         $y' = \sin^{-1}(\sqrt{y})$
5. Do not compare $R^2$ for the two models, since the response variable differs. $R^2$ is appropriate for assessing the fit of different models only when they all use the same response.
$$y = \beta_0 e^{\beta_1 x}\varepsilon \quad \Rightarrow \quad \log y = \log\beta_0 + \beta_1 x + \log\varepsilon$$
7. Since the precise nonlinear relationship between y and x may not be
known, we can choose a transformation by:
Then choosing the value for λ that minimizes SSRes within the CI.
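Assuming the family indexed by λ is the Box-Cox power transformation, MASS::boxcox produces the profile likelihood and an approximate CI for λ; a sketch with a hypothetical fit:

library(MASS)
fit <- lm(y ~ x, data = dat)
bc <- boxcox(fit, lambda = seq(-2, 2, 0.1))   # plots the profile with a 95% CI
bc$x[which.max(bc$y)]                         # lambda with the highest likelihood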
But $S_0(\beta)$ (the ordinary least-squares criterion) is not optimal when the variance is not constant, so we need a modified error sum of squares that accounts for $Var(\varepsilon) = \sigma^2 V$.
10. The basic idea is to scale the linear model by a matrix $A$:
$$A(y = x\beta + \varepsilon) \quad\Rightarrow\quad Ay = Ax\beta + A\varepsilon$$
Let:
(a) $y_n = Ay$
(b) $x_n = Ax$
(c) $\varepsilon_n = A\varepsilon$
$\therefore$ Our new linear model is:
$$y_n = x_n\beta + \varepsilon_n$$
where $V(\varepsilon_n) = \sigma^2 I$.
11. If we define:
$$S_n(\beta) = (y_n - x_n\beta)^T(y_n - x_n\beta)$$
we call the minimizer of $S_n(\beta)$ the Generalized Least Squares Estimator (GLSE).
Where:
(a) $K = K^T$
(b) $K > 0$
(c) $K \times K = K^2 = V$
(d) $K = V^{\frac{1}{2}}$
If we let $A = K^{-1}$, then
$$\varepsilon_n = A\varepsilon = K^{-1}\varepsilon = V^{-\frac{1}{2}}\varepsilon$$
Also:
$$E(\varepsilon_n) = 0, \qquad Var(\varepsilon_n) = K^{-1}Var(\varepsilon)K^{-1} = V^{-\frac{1}{2}}\sigma^2 V\, V^{-\frac{1}{2}} = \sigma^2 I$$
Now, $A = K^{-1} = V^{-\frac{1}{2}}$ and $A^2 = V^{-1}$.
13. $E(\hat\beta) = \beta$ (unbiased)
14. $Var(\hat\beta) = \sigma^2(x^TV^{-1}x)^{-1}$
$$\hat\beta = (x^TA^2x)^{-1}x^TA^2y = (x^TV^{-1}x)^{-1}x^TV^{-1}y$$
This $\hat\beta$ is unbiased and has smaller variance than the original LSE $\hat\beta_0$, which minimizes:
$$S(\beta) = (y - x\beta)^T(y - x\beta)$$
$$\hat\beta = (x^Twx)^{-1}x^Twy$$
Where:
$$w_i = \frac{1}{V(\varepsilon_i)}$$
We use the WLSE when the errors $\varepsilon_i$ are uncorrelated but $Var(\varepsilon_i)$ depends on $i$.
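In R the WLSE is obtained through the weights argument of lm; a sketch assuming the error variances v_eps are known (hypothetical):

w <- 1 / v_eps                       # w_i = 1 / V(eps_i)
fit_wls <- lm(y ~ x, weights = w)    # minimizes sum_i w_i (y_i - b0 - b1 x_i)^2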
19. When the $\varepsilon_i$'s are uncorrelated, and hence the WLSE is appropriate, we can find $W$ by:
(a) $E(\varepsilon) = 0$
(b) $Var(\varepsilon) = \sigma^2 I$
(c) $\hat\beta = (x^Tx)^{-1}x^Ty$
(d) $E(\hat\beta) = \beta$
(e) $Var(\hat\beta) = \sigma^2(x^Tx)^{-1}$
For a linear estimator $\tilde\beta = cy$ to be unbiased:
$$E(\tilde\beta) = \beta \;\Rightarrow\; cE(y) = \beta \;\Rightarrow\; cx\beta = \beta \text{ for all } \beta \;\Rightarrow\; cx = I$$
21. Theorem: The LSE $\hat\beta$ is the best linear unbiased estimator (BLUE) under Model I, in that:
$$Var(\tilde\beta) \geq Var(\hat\beta) \quad \text{for any linear unbiased estimator } \tilde\beta$$
(a) $y = x\beta + \varepsilon$
(b) $E(\varepsilon) = 0$
(c) $Var(\varepsilon) = \sigma^2 V \neq \sigma^2 I$
For Model II, the GLSE is:
$$\hat\beta = (x^TV^{-1}x)^{-1}x^TV^{-1}y$$
and it is BLUE under Model II: $Var(\tilde\beta) \geq Var(\hat\beta)$ for any linear unbiased estimator $\tilde\beta$.
(a) Leverage point: A point that is remote in the $x$ space but lies almost on the regression line. Leverage points don't affect the regression coefficients much, but have a dramatic effect on model summary statistics such as $R^2$ and the standard errors of the regression coefficients.
(b) Influence point: A point that is remote in the $y$ space (e.g. a data point underneath a cluster of other data points). An influence point has a noticeable impact on the model coefficients, in that it "pulls" the regression model in its direction.
(c) Leverage and influence point: A point that is remote in both the $x$ and $y$ directions.
$$\therefore \quad \sum_{i=1}^n h_{ii} = p \quad \text{and} \quad \frac{1}{n}\sum_{i=1}^n h_{ii} = \frac{p}{n}$$
If $h_{ii} > \frac{2p}{n}$, then $(x_i, y_i)$ is a leverage point.
3. We can measure influence using Cook's Distance for data points $(x_i, y_i)$, $i = 1, 2, \ldots, n$, as follows:
(a) Calculate $\hat\beta$: the LSE based on all $n$ points.
(b) Calculate $\hat\beta_{(i)}$: the LSE based on all but the $i$th point ($(x_i, y_i)$ deleted).
Idea: if $(x_i, y_i)$ is not influential, then $\hat\beta$ and $\hat\beta_{(i)}$ will be similar.
$$D_i(M, C) = \frac{(\hat\beta_{(i)} - \hat\beta)^T M (\hat\beta_{(i)} - \hat\beta)}{C}$$
If $D_i > F_{0.5,\, p,\, n-p} \approx 1$, then $(x_i, y_i)$ is influential.
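Both the leverage and influence diagnostics are available in R for a (hypothetical) fitted model fit; a sketch:

h <- hatvalues(fit)
p <- length(coef(fit)); n <- nobs(fit)
which(h > 2 * p / n)         # leverage points
D <- cooks.distance(fit)
which(D > 1)                 # influential points, using the D_i > ~1 rule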
4. Robust Fit Residuals can also be used to find influential data points, as follows:
(a) Fit $\hat\beta_R$, a robust estimate of $\beta$.
(b) Compute $\hat\sigma_R$, a robust estimate of $\sigma$.
(d) Compute $\hat{e}_R = y - \hat{y}_R$, the robust residuals.
A point $(x_i, y_i)$ is flagged as influential if $\frac{\hat{e}_{Ri}}{\hat\sigma_R}$ is large in absolute value.
14 Polynomial Regression
Polynomial models are useful when curvilinear effects are present in the true
response function. They are also useful as approximating functions to un-
known and possibly very complex nonlinear relationships. In this sense, the
polynomial model is just the Taylor series expansion of the unknown function.
1. Polynomial Regression Models: A subclass of linear regression
models of the general form y = xβ + . Some examples of polynomial
regression models with one regressor are:
(a) $y = \beta_0 + \beta_1 x + \varepsilon$ (first order)
(b) $y = \beta_0 + \beta_1 x + \beta_2 x^2 + \varepsilon$ (second order, quadratic)
(c) $y = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \varepsilon$ (third order, cubic)
Polynomial regression models with two regressors are:
(a) $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$ (first order)
(b) $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1^2 + \beta_4 x_1 x_2 + \beta_5 x_2^2 + \varepsilon$ (second order)
2. We can fit a cubic model in R as:
x2 <- x * x
x3 <- x * x * x
fit <- lm(y ~ x + x2 + x3)
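Equivalently, the powers can be built inside the formula, avoiding the helper variables (a sketch; poly with raw = TRUE gives the same raw-polynomial fit):

fit <- lm(y ~ x + I(x^2) + I(x^3))       # I() protects the powers in the formula
fit <- lm(y ~ poly(x, 3, raw = TRUE))    # same fit via a raw polynomial basis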
$$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \ldots + \beta_k x^k + \varepsilon$$
5. Backward Selection: An approach to model building where we appropriately fit the highest-order model and then delete terms, one at a time starting with the highest order, until the highest-order remaining term has a significant $t$ statistic.
Q-Q plots are useful for comparing models of varying degrees (better models will have Q-Q plots of the residuals that appear more linear).
7. Ill-Conditioning:
Important: Generally we require the function values and the first $k - 1$ derivatives to agree at the knots, so that the spline is a continuous function with $k - 1$ continuous derivatives.
(a) k = 3.
(b) h knots t1 < t2 < . . . < th .
(c) Continuous first and second derivatives
where:
$$(x - t_i)_+ = \begin{cases} x - t_i & \text{if } x > t_i \\ 0 & \text{if } x \leq t_i \end{cases}$$
Intuitively, this function "activates" the third-order polynomial portion of the model once $x$ enters the region where $x > t_i$.
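A tiny R helper for the truncated power term (a sketch; tpower is a hypothetical name):

tpower <- function(x, t) ifelse(x > t, x - t, 0)   # (x - t)_+
# e.g. the cubic term for knot t_i: tpower(x, t_i)^3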
Or:
$$y = \sum_{j=0}^{3} \beta_{0j} x^j + \sum_{k=1}^{h} \beta_k z_k + \varepsilon$$
Where:
13. General Cubic Spline: A cubic spline with no continuity of any kind imposed on $S(x)$, denoted:
$$E(y) = S(x) = \sum_{j=0}^{3} \beta_{0j} x^j + \sum_{i=1}^{h}\sum_{j=0}^{3} \beta_{ij}(x - t_i)^j_+$$
Note that S(x), S 0 (x) and S 00 (x) are not necessarily continuous at t. To
determine whether imposing continuity restrictions reduces the quality
of the fit, test the hypotheses:
(a) Moving Average and Kernel Smoothing: An approach that uses a weighted average of the data. Suppose $x_1 < x_2 < \ldots < x_n$ are equally spaced. Then
$$\hat{y}_i = \frac{y_{i-1} + y_i + y_{i+1}}{3}, \quad i = 2, 3, \ldots, n-1$$
is a three-point moving average.
Let $\tilde{y}_i$ be the kernel smoother estimate of the $i$th response; then:
$$\tilde{y}_i = \sum_{j=1}^{n} w_{ij} y_j, \quad \text{where } \sum_{j=1}^{n} w_{ij} = 1$$
Ex.
$$\hat{y}_i = \frac{1}{4}y_{i-1} + \frac{1}{2}y_i + \frac{1}{4}y_{i+1}$$
is a weighted average of adjacent points. Equal spacing is not necessary for weighted averages.
The $w_{ij}$ can be generated using a kernel function $K(\cdot)$ satisfying:
i. $K(t) \geq 0$
ii. $\int_{-\infty}^{\infty} K(t)\, dt = 1$
iii. $K(-t) = K(t)$
$\therefore$ $K(t)$ is a symmetric density function with respect to 0. We can calculate $w_{ij}$ as follows:
$$w_{ij} = \frac{K\left(\frac{x_j - x_i}{b}\right)}{\sum_{\ell=1}^{n} K\left(\frac{x_\ell - x_i}{b}\right)}$$
where $b$ is the bandwidth.
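A sketch of the kernel smoother in R using a Gaussian kernel (the bandwidth b and the function name are assumptions):

kernel_smooth <- function(x, y, b) {
  sapply(seq_along(x), function(i) {
    k <- dnorm((x - x[i]) / b)    # K((x_j - x_i)/b): symmetric, integrates to 1
    w <- k / sum(k)               # weights w_ij sum to 1
    sum(w * y)                    # y-tilde_i
  })
}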
The loess procedure then uses the weights in the neighbourhood to
generate a weighted least-squares estimate of the specific response.
The weights for the weighted least-squares portion of the estimate
are based on the distance of the points used in the estimation from
the specific location of interest.
Most software packages use the tri-cube weighting function as the default:
$$W\left(\frac{x_0 - x_j}{\Delta(x_0)}\right)$$
Where:
i. x0 : the specific location of interest
ii. ∆(x0 ): the distance the farthest point in the neighbourhood
lies from the specific location of interest
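A minimal loess sketch in R (hypothetical data frame dat); span controls the neighbourhood size, and the tri-cube weights are the default:

fit_lo <- loess(y ~ x, data = dat, span = 0.75, degree = 2)
plot(dat$x, dat$y)
lines(sort(dat$x), predict(fit_lo)[order(dat$x)])   # add the smooth curve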
15 Variable Selection and Model Building
The problem is there can be many regressors under consideration and not
all of them are needed. We want to identify the ”best subset” for the linear
model.
Big data occurs when there are more potential regressors than observations
(more columns than rows).
3. General Strategy:
(a) Fit the full model (the one with all regressors)
(b) Perform a thorough analysis of fit: R2 , F -tests, t-tests etc.
(c) Determine if transformation(s) are needed
(d) Use t-tests to edit the model (remove non-significant variables)
(e) Repeat step (b) for edited model
4. Consequence of misspecification: Suppose the true model is:
$$y = x\beta + \varepsilon$$
where $x$ is $n \times (K+1)$ and $K$ is the number of regressors. We can split the model into two chunks:
$$y = x_p\beta_p + x_r\beta_r + \varepsilon$$
where:
(a) $x_p = (x_1, \ldots, x_{p-1})$ (first $p-1$ regressors)
(b) $x_r = (x_p, \ldots, x_K)$ (last $r$ regressors)
(c) $(p-1) + r = K$
(d) LSE: $\hat\beta = \begin{pmatrix} \hat\beta_p \\ \hat\beta_r \end{pmatrix} = (x^Tx)^{-1}x^Ty$
Comparisons:
$$\hat\beta = \begin{pmatrix} \hat\beta_p \\ \hat\beta_r \end{pmatrix} \text{ is unbiased for } \beta = \begin{pmatrix} \beta_p \\ \beta_r \end{pmatrix}$$
But $\tilde\beta_p$ (the estimator from the reduced model) is not usually unbiased for $\beta_p$ (which is bad).
However, $Var(\hat\beta_p) \geq Var(\tilde\beta_p)$ (which is good).
Main takeaway: We can tell when it might be acceptable to use $\tilde\beta_p$ based on the MSE criterion. If we are analyzing a model $\hat\theta$, then:
5. The criteria used in variable/model selection are:
(a) $R^2$: The coefficient of determination.
$$R^2 = \frac{SS_R}{SST}$$
For a model with $p$ coefficients ($p - 1$ regressors):
$$R_p^2 = \frac{SS_R(p)}{SST}$$
As $p$ increases, so does $R_p^2$ (overfitting). We can graph the $R^2$ of all $\binom{k}{p-1}$ models and select our model based on the value where adding further regressors sees no realistic improvement in $R^2$ (the curve turns flat). This method is not commonly used.
(b) $R^2_{adj}$: Not a monotonically increasing function of $p$; it forms an arch. We choose $p_0$ as the model that maximizes $R^2_{adj,p}$, where:
$$R^2_{adj} = 1 - \frac{n-1}{n-p}(1 - R_p^2)$$
(c) $MS_{Res}$: The residual mean square. Let:
$$MS_{Res}(p) = \frac{SS_{Res}(p)}{n-p}$$
We choose $p$ to minimize $MS_{Res}(p)$.
Important: (b) and (c) are equivalent since:
$$R^2_{adj,p} = 1 - \frac{n-1}{SST}\, MS_{Res}(p)$$
(d) Mallows' $C_p$ Statistic: A standardized total mean square:
$$C_p = \frac{SS_{Res}(p)}{\hat\sigma^2} - n + 2p$$
where $\hat\sigma^2$ is from the full model. This works since, if we let $p_0$ be the optimal number of coefficients, then:
$$E(C_{p_0}) \approx E\left(\frac{SS_{Res}(p_0)}{\sigma^2} - n + 2p_0\right) = p_0$$
Therefore, we want to choose the model where $C_p$ is as close to $p$ as possible.
(e) Akaike's Information Criterion (AIC): The goal is to minimize:
$$AIC_p = n\log\left(\frac{SS_{Res}(p)}{n}\right) + 2p$$
where $2p$ is a penalty term.
(f) Bayesian Information Criterion (BIC): The goal is to minimize:
$$BIC_p = n\log\left(\frac{SS_{Res}(p)}{n}\right) + p\log n$$
where $p\log n$ is a penalty term.
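A sketch of criterion-based selection in R on a hypothetical full model:

full <- lm(y ~ x1 + x2 + x3 + x4 + x5, data = dat)
AIC(full); BIC(full)                                     # compare candidate models directly
step(full, direction = "backward", k = 2)                # stepwise search by AIC
step(full, direction = "backward", k = log(nrow(dat)))   # k = log(n) gives a BIC-type penalty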
(g) $PRESS_p$:
$$y = \beta_0 + \beta_j x_j + \varepsilon, \quad j = 1, 2, \ldots, 5$$
$$y = \beta_0 + \beta_1 x_3 + \beta_2 x_j + \varepsilon, \quad j = 1, 2, 4, 5$$
v. Suppose in the last step $F_{2,3}$ was chosen. Then, examine all three-variable models that contain $x_2, x_3$:
$$y = \beta_0 + \beta_1 x_3 + \beta_2 x_2 + \beta_3 x_j + \varepsilon, \quad j = 1, 4, 5$$
$$F_4 = \min\{F_1, F_2, F_3, F_4, F_5\}$$
7. Strategy for Variable Selection and Model Building:
16 Indicator Variables
1. Quantitative Variables: metrics such as temperature, distance, height
(continuous)
4. There are different methods for fitting models with qualitative variables:
(a) Method 1: Divide the observations into groups for each combination of categorical variables. Then fit a simple model for each group.
(b) Method 2: Assume a common slope and common error variance. We define the models as:
$$y = \beta_0 + \beta_1 x_1 + \varepsilon \quad (A)$$
$$y = (\beta_0 + \beta_2) + \beta_1 x_1 + \varepsilon \quad (B)$$
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon \quad (C)$$
using a dummy variable $x_2$ for the qualitative variable. We can fit model (C) in R as:
lm(y ~ x1 + x2)
Allowing different slopes as well, the models become:
$$y = \beta_0 + \beta_1 x_1 + \varepsilon \quad (A)$$
$$y = (\beta_0 + \beta_2) + (\beta_1 + \beta_3) x_1 + \varepsilon \quad (B)$$
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \varepsilon = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \varepsilon \quad (C)$$
where $x_3 = x_1 x_2$. Again, using a dummy variable $x_2$ to represent the qualitative variable, we can fit this model in R as:
lm(y ~ x1 + x2 + x3)
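Equivalently, R's formula shorthand builds the interaction model in one step:

lm(y ~ x1 * x2)    # expands to x1 + x2 + x1:x2, the same interaction model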
5. One categorical variable with three or more levels: Suppose we
have a qualitative variable with three categories:
$$x_2 = \begin{cases} 0 & \text{if } i \in \text{group A} \\ 1 & \text{if } i \in \text{group B} \\ 2 & \text{if } i \in \text{group C} \end{cases}$$
With two indicator variables instead:
        x2   x3
A        0    0
AB       1    0
O        0    1
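In practice R builds these indicator columns automatically from a factor; a sketch with a hypothetical grouping variable:

dat$group <- factor(dat$group, levels = c("A", "AB", "O"))
fit <- lm(y ~ x1 + group, data = dat)     # A is the baseline; coefficients for AB and O
model.matrix(~ group, data = dat)         # shows the 0/1 indicator coding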
44