
Introduction to mathematical modeling

IDIR Yacine
yacine.idir@nyu.edu
January 2024

"Without data, you're just another person with an opinion."
— William Edwards Deming

Simple Linear Regression


Introduction
Let's start with an example to set the stage. For public health reasons, we are interested in the concentration of ozone O3 in the air (in micrograms per milliliter). In particular, we want to know whether it is possible to explain the maximum daily ozone level by the midday temperature T12. The data are:

Temperature at 12h:   23.8   16.3   27.2    7.1   25.1   27.5   19.4   19.8   32.2   20.7
O3 max:              115.4   76.8  113.8   81.6  115.4  125.0   83.6   75.2  136.8  102.8

Table 1: 10 daily data of temperature and ozone.

From a practical point of view, the purpose of this regression is twofold:


• to fit a model to explain O3 from T12;
• to predict O3 values for new values of T12.

Before any analysis, it is interesting to represent the data, as in Figure 1.

Figure 1: 10 daily data of temperature and ozone.
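As a quick illustration, the ten observations of Table 1 can be entered and plotted in R; the variable names T12 and O3 below are our own choice, made to match the notation of the text.

# Ten daily observations from Table 1
T12 <- c(23.8, 16.3, 27.2, 7.1, 25.1, 27.5, 19.4, 19.8, 32.2, 20.7)
O3  <- c(115.4, 76.8, 113.8, 81.6, 115.4, 125, 83.6, 75.2, 136.8, 102.8)
# Scatter plot of maximum ozone against midday temperature (as in Figure 1)
plot(T12, O3, xlab = "Temperature at 12h", ylab = "O3 max")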

To analyze the relationship between xi (temperature) and yi (ozone), we will search for a function f such that:

yi ≈ f(xi).
To refine the search, it is necessary to provide a criterion that quantifies the quality of the fit of the function f to the data. In this context, we will use the sum of the squares of the differences between the observed values and those predicted by the function f, which is searched for within a class F of candidate functions. The mathematical problem can then be written as follows:

\[
\arg\min_{f \in \mathcal{F}} \sum_{i=1}^{n} L\big(y_i - f(x_i)\big),
\]

where n represents the number of available data points (the sample size) and L(·) is the loss function, or cost function.

0.1 Modeling
In many situations, a natural first approach is to assume that the variable to explain is a linear function of the explanatory variable, that is to say, to search for f in the set F of affine functions from R to R. This is the principle of simple linear regression. We suppose that we have a sample of n points (xi, yi).

Definition 1 (Simple Linear Regression Model) A simple linear regression model is defined by an equation of the form:

∀i ∈ {1, . . . , n},  yi = β1 + β2 xi + εi.

The quantities εi, although small, ensure that the points are not exactly aligned on a straight line. These are called errors (or noise) and are assumed to be random. For the model to be relevant to our data, we must nevertheless impose assumptions on them. Here are the ones we will use in what follows:

(H1): E[εi] = 0 for all i;
(H2): Cov(εi, εj) = σ² δij for all (i, j).

The errors are therefore assumed to be centered, to have constant variance (homoscedasticity), and to be uncorrelated with each other (δij is the Kronecker delta, i.e., δij = 1 if i = j and 0 if i ≠ j). Note that the simple linear regression model from Definition 1 can also be written in vector form:

Y = β1 1 + β2 X + ε,

where:
• the vector Y = [y1, . . . , yn]′ is random, of dimension n;
• the vector 1 = [1, . . . , 1]′ is the vector of Rn whose components all equal 1;
• the vector X = [x1, . . . , xn]′ contains the given values of the explanatory variable (non-random);
• the coefficients β1 and β2 are the unknown (but non-random) parameters of the model;
• the vector ε = [ε1, . . . , εn]′ is the vector of errors (a random vector).
This vector notation is particularly convenient for the geometric interpretation of the problem. We will come back, in Section 1.3, to the use of the constant term in linear regression modeling. Note that some of our notation has already been established here.

0.2 Ordinary Least Squares


Given the points (xi, yi), the goal is now to find an affine function f such that the sum Σ L(yi − f(xi)), taken over i = 1, . . . , n, is minimal. To determine f, it is necessary to specify the cost function L. The following cost functions are commonly used:
• the quadratic cost L(u) = u²,
• the absolute cost L(u) = |u|.
The quadratic cost L(u) = u² is generally preferred over other costs, but the choice depends on the situation at hand. We then speak of the method of estimation by least squares (a terminology due to Legendre, in his 1805 article on the determination of comet orbits).

Definition 2 (Ordinary Least Squares Estimators) We call Ordinary Least Squares (OLS) estimators the values minimizing the quantity:

\[
S(\beta_1, \beta_2) = \sum_{i=1}^{n} (y_i - \beta_1 - \beta_2 x_i)^2 .
\]

In other words, the least squares line minimizes the sum of the squares of the vertical distances of the points (xi, yi) from the straight line y = β1 + β2 x.

0.2.1 Calculation of the estimators for β1 and β2


The function of two variables S is a quadratic function and its minimization
poses no problem, as we will now see.
Proposition 1 (Estimators β1 and β2) The OLS estimators can be expressed as:

\[
\hat\beta_1 = \bar{y} - \hat\beta_2 \bar{x},
\qquad
\hat\beta_2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}
= \frac{\sum_{i=1}^{n} (x_i - \bar{x})\, y_i}{\sum_{i=1}^{n} (x_i - \bar{x})^2}.
\]
Proof:

One method consists of noticing that the function S(β1, β2) is strictly convex, so it admits a minimum at a unique point (β̂1, β̂2), which is determined by setting the partial derivatives of S to zero. We obtain the “normal equations”:
\[
\frac{\partial S}{\partial \beta_1} = -2 \sum_{i=1}^{n} (y_i - \hat\beta_1 - \hat\beta_2 x_i) = 0,
\qquad
\frac{\partial S}{\partial \beta_2} = -2 \sum_{i=1}^{n} x_i\,(y_i - \hat\beta_1 - \hat\beta_2 x_i) = 0.
\]

The first equation gives:

\[
n \hat\beta_1 + \hat\beta_2 \sum_{i=1}^{n} x_i = \sum_{i=1}^{n} y_i ,
\]

from which we immediately deduce:

\[
\hat\beta_1 = \bar{y} - \hat\beta_2 \bar{x}, \qquad (1.1)
\]

where x̄ and ȳ are the empirical means of x and y, respectively. The second equation gives:

\[
\hat\beta_1 \sum_{i=1}^{n} x_i + \hat\beta_2 \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} x_i y_i ,
\]

and by replacing β̂1 by its expression (1.1), we have:


\[
\hat\beta_2
= \frac{\sum_{i=1}^{n} x_i y_i - \left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)/n}
       {\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2 / n}
= \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}. \qquad (1.2)
\]

The expression for β̂2 assumes that the denominator Σ(xi − x̄)² is nonzero. Now, this denominator vanishes only if all the xi are equal, a situation of no interest for our problem, which we therefore exclude a priori.
Remarks:
1. The relation β̂1 = ȳ − β̂2 x̄ shows that the OLS line passes through the
center of gravity of the cloud of points (xi , yi ).

2. The expressions obtained for β̂1 and β̂2 show that these two estimators
are linear with respect to the vector Y = [y1 , . . . , yn ]′ .

3. The estimator β̂2 can also be written as (exercise):

\[
\hat\beta_2 = \beta_2 + \frac{\sum (x_i - \bar{x})\,\varepsilon_i}{\sum (x_i - \bar{x})^2}. \qquad (1.3)
\]

While this decomposition is of no use for the actual computation of β̂2, since it involves the unknown quantities β2 and εi, it is nevertheless essential for proving the theoretical properties of the estimators (bias and variance). Its advantage is to make the contribution of the errors εi explicit.
Before proceeding, note that the OLS estimators are well defined only if the full rank condition holds for the matrix of regressors; as we will see later, this condition also serves in establishing the statistical properties of these estimators.
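As a quick numerical check (a sketch only, assuming the vectors T12 and O3 hold the ten observations of Table 1), the closed-form expressions of Proposition 1 can be computed directly in R and compared with the built-in lm function:

# OLS estimates from the closed-form expressions of Proposition 1
beta2_hat <- sum((T12 - mean(T12)) * (O3 - mean(O3))) / sum((T12 - mean(T12))^2)
beta1_hat <- mean(O3) - beta2_hat * mean(T12)
c(beta1_hat, beta2_hat)
# Same estimates returned by lm()
coef(lm(O3 ~ T12))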

0.2.2 Some properties of the estimators β̂1 and β̂2


Under the sole hypotheses (H1) and (H2) of centering, decorrelation and homoscedasticity of the errors of the model, we can already give some properties of the least squares estimators β̂1 and β̂2.

Theorem 1 If β̂1 and β̂2 are the ordinary least squares estimators from
Definition 2, then under the assumptions (H1) and (H2), β̂1 and β̂2 are unbiased
estimators of β1 and β2 , respectively.

That is, we have:
E[β̂1 ] = β1 , E[β̂2 ] = β2 .

Proof:

We start from expression (1.3) for β̂2:

\[
\hat\beta_2 = \beta_2 + \frac{\sum (x_i - \bar{x})\,\varepsilon_i}{\sum (x_i - \bar{x})^2}.
\]
In this expression, the errors εi are random, and since they are centered, we
deduce that E[β̂2 ] = β2 . For β̂1 , we start from the expression:

β̂1 = ȳ − β̂2 x̄,


from which we derive:

E[β̂1] = E[ȳ] − x̄·E[β̂2] = (β1 + β2 x̄) − β2 x̄ = β1.
We can also express the variances and covariance of our estimators.

Theorem 2 (Variances and covariance) The variances of the estimators are:

\[
\mathrm{Var}(\hat\beta_1) = \frac{\sigma^2 \sum x_i^2}{n \sum (x_i - \bar{x})^2}
= \sigma^2 \left( \frac{1}{n} + \frac{\bar{x}^2}{\sum (x_i - \bar{x})^2} \right)
\quad \text{and} \quad
\mathrm{Var}(\hat\beta_2) = \frac{\sigma^2}{\sum (x_i - \bar{x})^2},
\]

while their covariance is:

\[
\mathrm{Cov}(\hat\beta_1, \hat\beta_2) = - \frac{\sigma^2 \bar{x}}{\sum (x_i - \bar{x})^2}.
\]

Proof. We start again from the expression of β̂2 used in the proof of unbiasedness:

\[
\hat\beta_2 = \beta_2 + \frac{\sum (x_i - \bar{x})\,\varepsilon_i}{\sum (x_i - \bar{x})^2},
\]

where the errors εi are uncorrelated and have the same variance σ², so the variance of the sum equals the sum of the variances:

\[
\mathrm{Var}(\hat\beta_2) = \frac{\sum (x_i - \bar{x})^2 \,\sigma^2}{\left(\sum (x_i - \bar{x})^2\right)^2} = \frac{\sigma^2}{\sum (x_i - \bar{x})^2}.
\]

Moreover, since ȳ = β1 + β2 x̄ + ε̄ and Σ(xi − x̄) = 0, the covariance between ȳ and β̂2 is

\[
\mathrm{Cov}(\bar{y}, \hat\beta_2) = \frac{\sigma^2 \sum (x_i - \bar{x})}{n \sum (x_i - \bar{x})^2} = 0,
\]

from which comes the variance of β̂1:

\[
\mathrm{Var}(\hat\beta_1) = \mathrm{Var}(\bar{y} - \hat\beta_2 \bar{x})
= \mathrm{Var}(\bar{y}) + \bar{x}^2\, \mathrm{Var}(\hat\beta_2) - 2\bar{x}\, \mathrm{Cov}(\bar{y}, \hat\beta_2),
\]

that is to say:

\[
\mathrm{Var}(\hat\beta_1) = \frac{\sigma^2}{n} + \frac{\bar{x}^2 \sigma^2}{\sum (x_i - \bar{x})^2}.
\]

Finally, for the covariance of the two estimators:

\[
\mathrm{Cov}(\hat\beta_1, \hat\beta_2)
= \mathrm{Cov}(\bar{y} - \hat\beta_2 \bar{x},\, \hat\beta_2)
= \mathrm{Cov}(\bar{y}, \hat\beta_2) - \bar{x}\,\mathrm{Var}(\hat\beta_2)
= -\frac{\sigma^2 \bar{x}}{\sum (x_i - \bar{x})^2}.
\]

Remarks:
1. We have seen that the OLS line passes through the center of gravity (x̄, ȳ) of the cloud of points. Suppose that x̄ is positive; then, since the line is constrained to pass through this point, increasing the slope decreases the intercept, and vice versa. We thus recover the negative sign of the covariance between β̂1 and β̂2.
2. In inferential statistics, the variance of an estimator typically decreases inversely proportionally to the sample size, that is to say in 1/n; in other terms, its precision is generally of order 1/√n. This is not obvious at first sight from, for example, the expression obtained for the variance of β̂2:

\[
\mathrm{Var}(\hat\beta_2) = \frac{\sigma^2}{\sum (x_i - \bar{x})^2}.
\]

To see that everything behaves as usual, it suffices to note that if the (non-random) xi have empirical standard deviation σx, then in very general situations the denominator is of the order n σx², and we again find a variance in 1/n.
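These formulas are easy to check by simulation. The sketch below (all numerical values are arbitrary illustrative choices) keeps the xi fixed, simulates the model many times and compares the empirical variance of β̂2 with σ²/Σ(xi − x̄)²:

# Monte Carlo check of Var(beta2_hat) = sigma^2 / sum((x - xbar)^2)
set.seed(1)
n <- 30; beta1 <- 1; beta2 <- 2; sigma <- 3   # arbitrary illustrative values
x <- runif(n, 0, 10)                          # kept fixed across replications
b2 <- replicate(5000, {
  y <- beta1 + beta2 * x + rnorm(n, sd = sigma)
  sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
})
c(empirical = var(b2), theoretical = sigma^2 / sum((x - mean(x))^2))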

0.2.3 Calculation of residuals and residual variance

Figure 2: Representation of individuals.

In R² (the space of individuals, each observation being represented by the point (xi, yi)), β̂1 is the intercept on the y-axis and β̂2 the slope of the fitted line. This line minimizes the sum of the squared vertical distances from the points of the cloud to the fitted line. Let ŷi = β̂1 + β̂2 xi be the ordinate of the point of the least squares line at abscissa xi, also called the fitted value. The residuals are defined by (see Figure 2):

\[
e_i = y_i - \hat{y}_i = y_i - \hat\beta_1 - \hat\beta_2 x_i = (y_i - \bar{y}) - \hat\beta_2 (x_i - \bar{x}). \qquad (1.4)
\]

By construction, the sum of the residuals is zero:

\[
\sum e_i = \sum (y_i - \hat{y}_i) = \sum (y_i - \bar{y}) - \hat\beta_2 \sum (x_i - \bar{x}) = 0.
\]

Note now that the variances and covariance of the estimators β̂1 and β̂2 established in the previous section are not usable in practice, because they involve the variance σ² of the errors, which is generally unknown. However, we can give an unbiased estimator of it thanks to the residuals.

Theorem 4 (Unbiased estimator of σ²) The statistic

\[
\hat\sigma^2 = \frac{1}{n-2} \sum_{i=1}^{n} e_i^2
\]

is an unbiased estimator of σ².
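On the data of Table 1, this estimator can be computed directly from the residuals of the fitted model (a sketch, assuming the vectors T12 and O3 defined earlier):

# Unbiased estimate of sigma^2 from the residuals (Theorem 4)
fit <- lm(O3 ~ T12)
e <- residuals(fit)
sigma2_hat <- sum(e^2) / (length(e) - 2)
sigma2_hat
sqrt(sigma2_hat)   # should match the "Residual standard error" of summary(fit)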

0.2.4 Prediction
One of the purposes of regression is to make predictions, that is, to predict the
response variable y in the presence of a new value of the explanatory variable
x. So let xn+1 be a new value, for which we want to predict yn+1 . The model
remains the same:

yn+1 = β1 + β2 xn+1 + εn+1

with E[εn+1] = 0, Var(εn+1) = σ², and Cov(εn+1, εi) = 0 for i = 1, . . . , n.
It is natural to predict the corresponding value via the adjusted model:

ŷn+1 = β̂1 + β̂2 xn+1 .


Two sources of error will affect our prediction: the first is due to our ignorance of εn+1, the second to the uncertainty in the estimators β̂1 and β̂2.
Proposition 2 (Prediction Error) The prediction error ε̂n+1 = yn+1 − ŷn+1 satisfies the following properties:

\[
E[\hat\varepsilon_{n+1}] = 0,
\qquad
\mathrm{Var}(\hat\varepsilon_{n+1}) = \sigma^2 \left( 1 + \frac{1}{n} + \frac{(x_{n+1} - \bar{x})^2}{\sum (x_i - \bar{x})^2} \right).
\]
Proof. For the expectation, it suffices to use the fact that εn+1 is centered and that the estimators β̂1 and β̂2 are unbiased:

\[
E[\hat\varepsilon_{n+1}] = E[y_{n+1}] - E[\hat\beta_1] - x_{n+1} E[\hat\beta_2]
= (\beta_1 + \beta_2 x_{n+1}) - \beta_1 - \beta_2 x_{n+1} = 0.
\]


We obtain the variance of the prediction error by noting that yn+1 depends only on εn+1, while ŷn+1 depends only on the other errors (ε1, . . . , εn), so that the two terms are uncorrelated; thus:

Var(ε̂n+1 ) = Var(yn+1 − ŷn+1 ) = Var(yn+1 ) + Var(ŷn+1 ) = σ 2 + Var(ŷn+1 ).

Let us calculate the second term:

\[
\mathrm{Var}(\hat{y}_{n+1}) = \mathrm{Var}(\hat\beta_1 + \hat\beta_2 x_{n+1})
= \mathrm{Var}(\hat\beta_1) + x_{n+1}^2\, \mathrm{Var}(\hat\beta_2) + 2 x_{n+1}\, \mathrm{Cov}(\hat\beta_1, \hat\beta_2)
\]
\[
= \sigma^2 \left( \frac{1}{n} + \frac{\bar{x}^2 - 2 x_{n+1}\bar{x} + x_{n+1}^2}{\sum (x_i - \bar{x})^2} \right)
= \sigma^2 \left( \frac{1}{n} + \frac{(x_{n+1} - \bar{x})^2}{\sum (x_i - \bar{x})^2} \right).
\]
Thus, in total:

\[
\mathrm{Var}(\hat\varepsilon_{n+1}) = \sigma^2 \left( 1 + \frac{1}{n} + \frac{(x_{n+1} - \bar{x})^2}{\sum (x_i - \bar{x})^2} \right).
\]
Hence the variance increases as xn+1 moves away from the center of gravity of the cloud. In other words, prediction is more hazardous when xn+1 is “far” from the average x̄, because the variance of the prediction error can then be very large. This is intuitive: the further the new point is from the center of gravity of the cloud, the more the prediction is affected by any error in the estimated slope.

0.3 Geometric Interpretations
0.3.1 Representation of Variables
If we approach the problem from a vector perspective, we have two vectors at
our disposal: the vector X = [x1 , . . . , xn ]′ of observations for the explanatory
variable and the vector Y = [y1 , . . . , yn ]′ of observations for the variable to
be explained. These two vectors belong to the same space Rn : the space of
variables.
If we add the constant vector 1 = [1, . . . , 1]′ then, since the xi are not all equal, the vectors 1 and X are not collinear: they span a plane of Rn, which we will denote M(X). We can project the vector Y orthogonally onto the subspace M(X), of dimension 2; we denote this projection by Y′ = ProjM(X)(Y) (see Figure 3). Since Y′ belongs to M(X), it admits a unique decomposition of the form Y′ = b1 1 + b2 X. By definition of the orthogonal projection, Y′ is the unique vector of M(X) minimizing the Euclidean distance ∥Y − Y′∥, which amounts to the same thing as minimizing its square. By definition of the Euclidean norm, this quantity is:

\[
\|Y - Y'\|^2 = \sum_{i=1}^{n} \big( y_i - (b_1 + b_2 x_i) \big)^2 ,
\]

which brings us back to the method of ordinary least squares. We deduce from this that b1 = β̂1, b2 = β̂2 and Y′ = Ŷ = [ŷ1, . . . , ŷn]′, with the expressions of β̂1, β̂2 and Ŷ seen previously.

Figure 3: Representation of the projection in the space of variables.

In other words, in Rn, β̂1 and β̂2 are interpreted as the coordinates of the orthogonal projection of Y onto the subspace of Rn generated by 1 and X (see Figure 3).
Remarks:
1. This geometric vision of things may seem a bit abstract, but it is in fact the most fruitful approach to understanding multiple regression, as we will see in the following chapters.

0.3.2 The Coefficient of Determination R2
We keep the notations from the previous paragraph, with Y ′ = [ŷ1 , . . . , ŷn ]′ the
orthogonal projection of the vector Y on M (X) and

ê = Y − Y ′ = [e1 , . . . , en ]′
the vector of residuals already encountered in section 1.2.3. The Pythagorean
theorem gives us directly:

\[
\|Y - \bar{y}\|^2 = \|Y' - \bar{y}\|^2 + \|\hat{e}\|^2 ,
\]
\[
\sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n} e_i^2 ,
\]
\[
SST = SSE + SSR,
\]


where SST (respectively SSE and SSR) represents the total sum of squares
(respectively explained by the model and residual). This can be seen as a
typical decomposition of variance. It also allows introducing the coefficient of
determination in a natural way.
Definition 3 (Coefficient of Determination R²) The coefficient of determination R² is defined by:

\[
R^2 = \frac{SSE}{SST} = \frac{\|Y' - \bar{y}\|^2}{\|Y - \bar{y}\|^2} = 1 - \frac{\|\hat{e}\|^2}{\|Y - \bar{y}\|^2} = 1 - \frac{SSR}{SST}.
\]

- If R² = 1, the model explains everything, that is to say that yi = β̂1 + β̂2 xi for all i: the points of the sample are perfectly aligned on the least squares line;
- If R² = 0, this means that Σ(ŷi − ȳ)² = 0, so ŷi = ȳ for all i: the linear regression model is then totally unsuited, since the projection of Y onto M(X) reduces to the constant ȳ1 and x brings no linear information on y;
- If R² is close to zero, the linear regression model explains the variable Y poorly: x does not explain y well (at least not in an affine way).

In general, the interpretation is as follows: the linear regression model explains 100 × R² % of the total variance of the data.

Remarks:
1. R² can also be seen as the square of the empirical correlation coefficient between xi and yi (see Exercise 1.2):

\[
R^2 = \left( \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\, \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} \right)^{2} = \rho_{x,y}^2 .
\]
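The decomposition SST = SSE + SSR and the link between R² and the squared correlation are easy to verify numerically (a sketch, assuming T12, O3 and fit <- lm(O3 ~ T12) as before):

# Variance decomposition and coefficient of determination
y_hat <- fitted(fit)
SST <- sum((O3 - mean(O3))^2)
SSE <- sum((y_hat - mean(O3))^2)   # sum of squares explained by the model
SSR <- sum(residuals(fit)^2)       # residual sum of squares
c(SST = SST, SSE_plus_SSR = SSE + SSR)           # equal, up to rounding
c(R2 = SSE / SST, cor_squared = cor(T12, O3)^2)  # R^2 = squared correlation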

1 Case of Gaussian Errors
Better than the expressions of the estimators and of their variances, one would like to know their distributions: this would allow us, for example, to obtain confidence regions and to carry out hypothesis tests. To this end, it is first necessary to make a stronger assumption on our model, namely to specify the distribution of the errors. We assume here that the errors are Gaussian. The hypotheses (H1) and (H2) then become:

(H1) εi ∼ N(0, σ²);
(H2) the εi are mutually independent.
The simple regression model becomes a parametric model, whose parameters (β1, β2, σ²) take values in R × R × R+. Since the law of the εi is specified, the laws of the yi follow:

∀i ∈ {1, . . . , n},  yi ∼ N(β1 + β2 xi, σ²),


and the yi are mutually independent since the εi are. We can therefore
calculate the likelihood of the sample and the estimators that maximize this
likelihood. This is the object of the next section.

1.1 Maximum Likelihood Estimators

The likelihood is:

\[
L(\beta_1, \beta_2, \sigma^2)
= \left( \frac{1}{\sqrt{2\pi\sigma^2}} \right)^{n}
  \exp\left( - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \beta_1 - \beta_2 x_i)^2 \right)
= \left( \frac{1}{\sqrt{2\pi\sigma^2}} \right)^{n}
  \exp\left( - \frac{1}{2\sigma^2} S(\beta_1, \beta_2) \right),
\]

which gives for the log-likelihood:

\[
\log L(\beta_1, \beta_2, \sigma^2) = - \frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} S(\beta_1, \beta_2).
\]
We want to maximize this quantity with respect to the three variables (β1, β2, σ²). The first two variables appear only through the term −S(β1, β2), so maximizing in (β1, β2) amounts to minimizing S(β1, β2). But we already know that this quantity is minimal at the least squares estimators (β̂1, β̂2). The maximum likelihood estimators of β1 and β2 are therefore equal to the least squares estimators.
It then remains simply to maximize log L(β̂1, β̂2, σ²) with respect to σ². Let us calculate the derivative with respect to σ²:

\[
\frac{\partial \log L(\hat\beta_1, \hat\beta_2, \sigma^2)}{\partial \sigma^2}
= - \frac{n}{2\sigma^2} + \frac{1}{2\sigma^4} S(\hat\beta_1, \hat\beta_2)
= - \frac{n}{2\sigma^2} + \frac{1}{2\sigma^4} \sum_{i=1}^{n} (y_i - \hat\beta_1 - \hat\beta_2 x_i)^2 .
\]

From this we deduce that the maximum likelihood estimator of σ² differs from the estimator of σ² seen previously, and is given by:

\[
\hat\sigma^2_{ml} = \frac{1}{n} \sum_{i=1}^{n} e_i^2 .
\]

The maximum likelihood estimator of σ² is therefore biased: indeed, E[σ̂²_ml] = ((n − 2)/n) σ², but the bias becomes negligible as the number of observations increases.
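The bias of σ̂²_ml can be illustrated by a small simulation (all numerical values below are arbitrary illustrative choices; here σ² = 9):

# Monte Carlo illustration of the bias of the ML estimator of sigma^2
set.seed(2)
n <- 15; sigma <- 3
x <- runif(n, 0, 10)
est <- replicate(5000, {
  y <- 1 + 2 * x + rnorm(n, sd = sigma)
  e <- residuals(lm(y ~ x))
  c(ml = sum(e^2) / n, unbiased = sum(e^2) / (n - 2))
})
rowMeans(est)                                  # compare the two estimators on average
c(expected_ml = (n - 2) / n * sigma^2, true = sigma^2)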

1.1.1 Laws of Estimators and Confidence Regions


We will now see how the previous laws come into play for our estimators. To facilitate the reading of this part, let us fix the following notations:

\[
c = -\frac{\sigma^2 \bar{x}}{\sum (x_i - \bar{x})^2},
\qquad
\sigma_1^2 = \frac{\sigma^2 \sum x_i^2}{n \sum (x_i - \bar{x})^2},
\qquad
\sigma_2^2 = \frac{\sigma^2}{\sum (x_i - \bar{x})^2},
\]
\[
\hat\sigma^2 = \frac{1}{n-2} \sum e_i^2,
\qquad
\hat\sigma_1^2 = \frac{\hat\sigma^2 \sum x_i^2}{n \sum (x_i - \bar{x})^2},
\qquad
\hat\sigma_2^2 = \frac{\hat\sigma^2}{\sum (x_i - \bar{x})^2}.
\]

The quantities σ̂1² and σ̂2² correspond to estimators of the variances of β̂1 and β̂2, respectively.
Properties 1 (Laws of the estimators with known variance) With known variance σ², the laws of the OLS estimators are:

\[
\begin{pmatrix} \hat\beta_1 \\ \hat\beta_2 \end{pmatrix}
\sim \mathcal{N}\left( \begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix}, \; \sigma^2 V \right)
\quad \text{where} \quad
V = \frac{1}{\sum (x_i - \bar{x})^2}
\begin{pmatrix} \sum x_i^2 / n & -\bar{x} \\ -\bar{x} & 1 \end{pmatrix},
\quad \text{so that} \quad
\sigma^2 V = \begin{pmatrix} \sigma_1^2 & c \\ c & \sigma_2^2 \end{pmatrix}.
\]

Remark. These properties, like those to come, are no easier to prove in the context of simple linear regression than in that of multiple linear regression. This is why we postpone the proofs to Chapter 3.
The problem with the above properties is that they involve the theoretical variance σ², which is generally unknown. The natural way to proceed is to replace it by its estimator σ̂². The distributions involved are then slightly modified.
Properties 2 (Laws of the estimators with estimated variance) With estimated variance σ̂², the laws of the OLS estimators are:

\[
\text{(i)} \quad \frac{\hat\beta_1 - \beta_1}{\hat\sigma_1} \sim t_{n-2},
\qquad\qquad
\text{(ii)} \quad \frac{\hat\beta_2 - \beta_2}{\hat\sigma_2} \sim t_{n-2},
\]

where t_{n−2} denotes Student’s t-distribution with (n − 2) degrees of freedom.
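These distributions are what the t values and p-values of an lm summary are based on. As a sketch (assuming T12, O3 and fit <- lm(O3 ~ T12) as before), the standard errors, t statistics and a 95% confidence interval for β2 can be recomputed by hand and compared with summary(fit) and confint(fit):

# Standard errors, t statistics and confidence interval from Properties 2
n <- length(T12)
sigma2_hat <- sum(residuals(fit)^2) / (n - 2)
se_b1 <- sqrt(sigma2_hat * sum(T12^2) / (n * sum((T12 - mean(T12))^2)))
se_b2 <- sqrt(sigma2_hat / sum((T12 - mean(T12))^2))
b <- coef(fit)
b / c(se_b1, se_b2)                              # t statistics for H0: beta_i = 0
b[2] + c(-1, 1) * qt(0.975, n - 2) * se_b2       # 95% CI for the slope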

1.1.2 Prediction
In terms of prediction in the case of Gaussian errors, the results obtained in
section 1.2.4 for the expectation and the variance are still valid. Moreover,
since yn+1 is linear in β1 , β2 and εn+1 , we can specify its law:

\[
y_{n+1} - \hat{y}_{n+1} \sim \mathcal{N}\left( 0, \; \sigma^2 \left( 1 + \frac{1}{n} + \frac{(x_{n+1} - \bar{x})^2}{\sum (x_i - \bar{x})^2} \right) \right).
\]
Once again, we do not know σ² and we estimate it by σ̂². As yn+1 − ŷn+1 and σ̂² are independent, we can state a result giving confidence intervals for yn+1:
Proposition 3 (Law and confidence interval for prediction) With the previous notations and hypotheses, we have:

\[
\frac{y_{n+1} - \hat{y}_{n+1}}
     {\hat\sigma \sqrt{1 + \dfrac{1}{n} + \dfrac{(x_{n+1} - \bar{x})^2}{\sum (x_i - \bar{x})^2}}}
\sim t_{n-2},
\]

from which we deduce the following confidence interval for yn+1:

\[
\hat{y}_{n+1} \;\pm\; t_{n-2,\,1-\alpha/2}\; \hat\sigma \sqrt{1 + \frac{1}{n} + \frac{(x_{n+1} - \bar{x})^2}{\sum (x_i - \bar{x})^2}} .
\]

This confirms the remark already made: the further the abscissa xn+1 of the point to predict is from the average x̄, the larger the confidence interval will be.
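As a sketch (assuming T12, O3 and fit <- lm(O3 ~ T12) as before, and an arbitrary new temperature of 30 degrees), this prediction interval can be computed by hand and compared with the one returned by predict():

# 95% prediction interval for a new value x_new (illustrative choice: 30)
x_new <- 30
n <- length(T12)
s2 <- sum(residuals(fit)^2) / (n - 2)
y_new <- coef(fit)[1] + coef(fit)[2] * x_new
half <- qt(0.975, n - 2) *
  sqrt(s2 * (1 + 1/n + (x_new - mean(T12))^2 / sum((T12 - mean(T12))^2)))
c(fit = y_new, lwr = y_new - half, upr = y_new + half)
predict(fit, newdata = data.frame(T12 = x_new), interval = "prediction")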

2 Example
We will now deal with the 50 daily observations presented in Annex I. The variable to explain is the concentration of ozone, denoted O3, and the explanatory variable is the midday temperature, denoted T12. The data are processed with the R statistical software.

> a <- lm(O3 ~ T12)
> summary(a)

Call:
lm(formula = O3 ~ T12)

Residuals:
Min 1Q Median 3Q Max
-45.256 -15.326 -3.461 17.634 40.072

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 31.4150 13.0584 2.406 0.0200 *
T12 2.7010 0.6266 4.311 8.04e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 20.5 on 48 degrees of freedom
Multiple R-Squared: 0.2791, Adjusted R-squared: 0.2641
F-statistic: 18.58 on 1 and 48 DF, p-value: 8.041e-05

The output from the software gives the estimated values β̂1 and β̂2 of the parameters, their standard errors, and the statistics of the tests of the hypotheses H0 : βi = 0 with the associated p-values. We reject H0 for both estimated parameters.
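From the same fitted object, confidence intervals for the coefficients follow directly; the manual computation below simply reuses the estimate and standard error of T12 reported above (2.7010 and 0.6266, with 48 degrees of freedom):

# 95% confidence intervals for the coefficients
confint(a, level = 0.95)
# Manual interval for the slope, from the values printed by summary(a)
2.7010 + c(-1, 1) * qt(0.975, 48) * 0.6266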

Your example:
- Generate a vector X with any distribution.
- Generate a normal vector Y, with mean a · X + b and variance σ².
- Code the function S(a, b).
- Minimize this function with respect to a and b using the function optim.
- Compare your result with the function ”lm” (a possible sketch is given below).
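A possible way to carry out these steps is sketched below; the distribution chosen for X, the values of a, b and σ, and the starting point given to optim are all arbitrary illustrative choices.

# Simulate data from Y = a*X + b + noise (arbitrary illustrative values)
set.seed(3)
X <- rnorm(100, mean = 5, sd = 2)
a_true <- 2; b_true <- 1; sigma <- 1.5
Y <- a_true * X + b_true + rnorm(100, sd = sigma)

# Sum of squares S(a, b), written as a function of theta = c(a, b)
S <- function(theta) sum((Y - theta[1] * X - theta[2])^2)

# Numerical minimisation with optim, starting from (0, 0)
opt <- optim(c(0, 0), S)
opt$par            # estimated (a, b)
coef(lm(Y ~ X))    # lm returns (intercept, slope), i.e. (b, a): same values, different order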

3 Exercises
Exercise 1.1 (MCQ)
1. In a simple regression, if the R² value is 1, are the points aligned?
A. Yes;
B. No;
C. It depends.

2. Does the OLS regression line of a simple regression always pass through
the point (x̄, ȳ)?
A. Always;
B. Never;
C. Sometimes;

3. We have performed a simple regression and we receive a new observation xn+1, for which we calculate the corresponding prediction ŷn+1. The variance of the predicted value is minimal when:
A. xn+1 = 0;
B. xn+1 = x̄;
C. there is no relation between the two.

Exercise 1.2 (R2 and empirical correlation) Recall the formula defining
the coefficient of determination R2 and develop it to show that it is equal to the
square of the empirical correlation between x and y, denoted rxy , that is to say
that we have:
\[
R^2 = r_{xy}^2 = \left( \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\, \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} \right)^{2} .
\]

Exercise 1.3 (Weights of fathers and sons) The statistical study below
deals with the weights of fathers and their firstborn son.
Father 65 63 67 64 68 62 70 66 68 67 69 71
Son 68 66 68 65 69 66 68 65 71 67 68 70
Here are the numerical results we have obtained:

\[
\sum s_i = 800, \quad \sum s_i^2 = 53418, \quad \sum s_i f_i = 54107, \quad \sum f_i = 811, \quad \sum f_i^2 = 54849.
\]

1. Calculate the least squares line of the weights of the sons as a function of
the weight of the fathers.

2. Calculate the least squares line of the weight of the fathers based on the
weight of the sons.

3. Show that the product of the slopes of the two lines is equal to the square
of the correlation coefficient between the si and fi (or the coefficient of
determination).

Exercise 1.4 (Height of a tree) We wish to express the height (in feet)
of a tree based on its diameter x in inches using a simple linear model. For
this purpose, we have measured 20 couples (diameter,height) and carried out
the following calculations: x̄ = 4.53, ȳ = 8.65

\[
\frac{1}{20} \sum (x_i - \bar{x})^2 = 10.97, \qquad \frac{1}{20} \sum (y_i - \bar{y})^2 = 2.24, \qquad \frac{1}{20} \sum (x_i - \bar{x})(y_i - \bar{y}) = 3.77.
\]

We denote by β̂0 + β̂1 x the regression line. Calculate β̂0 and β̂1 .

1. Comment on the quality of the fit of the data to the model. Express the coefficient of determination r² and the empirical correlation r.

2. Provide a measure of the quality of the fit.

We are given the standard error estimates of β̂0 , β̂1 , and β̂2 , which are 0.162,
0.162, and 0.05, respectively. We assume the errors εi are Gaussian, centered,
with the same variance and independent. Test H0 : β1 = 0 against H1 : β1 ̸= 0.
Why is this test interesting in our context? What do you think of the result?

Exercise 1.5 (Regression Line and Outlier Points)


Twelve people are enrolled in a training program. At the beginning of the
training, the trainees take a test A to assess their skills. At the end of the
training, they take a test B of the same level. The results are given in the
following table:
Test A 3 4 6 7 9 10 9 11 12 13 15 4
Test B 8 9 10 13 15 14 13 16 13 19 6 19

1. Represent the point cloud. Determine the regression line. Calculate the
coefficient of determination R2 . Comment on it.
2. Two trainees seem to stand out from the others. Remove them and de-
termine the regression line on the remaining ten points. Calculate the
coefficient of determination R2 . Comment on it.

Exercise 1.6 (The Height of Eucalyptus)
We wish to explain the height y (in meters) of a tree as a function of its circumference x (in centimeters) measured at 1.30 m from the ground. We recorded n = 1429 pairs (xi, yi), the cloud of points being represented in Figure 4. We obtained (x̄, ȳ) = (47.3, 21.2), and:

\[
\sum (x_i - \bar{x})^2 = 102924, \qquad \sum (y_i - \bar{y})^2 = 8857, \qquad \sum (x_i - \bar{x})(y_i - \bar{y}) = 26466.
\]

Figure 4: Point cloud for eucalyptus.

1. Calculate the least squares line for the model y = β1 + β2 x + e and represent it on Figure 4.
2. Calculate the coefficient of determination R2 . Comment on the quality of
the fit of the data to the model.
With these estimators, the sum of squared residuals is Σ(yi − ŷi)² = 2052. If we assume that the errors εi are Gaussian, centered, independent, and have the same variance σ², deduce an unbiased estimator of σ²:

\[
\hat\sigma^2 = \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n-2} .
\]

Provide an estimator σ̂1² of the variance of β̂1.

Exercise 1.7 (Forrest Gump for ever)


The term ”threshold frequency” of an amateur athlete refers to the heart rate reached after three quarters of an hour of sustained running. It is measured with a heart rate monitor. We seek to know whether age influences the threshold frequency. We have data for 20 pairs (age xi, threshold frequency yi). We obtained (x̄, ȳ) = (35, 170) and:

\[
\sum (x_i - \bar{x})^2 = 1991, \qquad \sum (y_i - \bar{y})^2 = 189.2, \qquad \sum (x_i - \bar{x})(y_i - \bar{y}) = -195.4.
\]

1. Calculate the least squares line for the model y = β1 + β2 x + e.


2. Calculate the coefficient of determination R2 . Comment on the quality of
the fit of the data to the model.
3. With these estimators, the sum of squared residuals is Σ(yi − ŷi)² = 170. Assuming that the errors εi are Gaussian, centered, independent, and have the same variance σ², deduce an unbiased estimator of σ²:

\[
\hat\sigma^2 = \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n-2} .
\]

4. Provide an estimator σ̂2² of the variance of β̂2.

Exercise 1.8 (Comparison of estimators)


We consider the following statistical model:

yi = βxi + ei , i = 1, . . . , n,
where we assume that the errors ei are such that E[ei ] = 0 and Cov(ei , ej ) =
δij σ 2 .

1. Going back to the definition of least squares, show that the least squares
estimator of β is
\[
\hat\beta = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2} .
\]

2. Show that the line passing through the origin and the center of gravity of
the point cloud is
\[
y = \beta^* x, \quad \text{with} \quad \beta^* = \frac{\sum_{i=1}^{n} y_i}{\sum_{i=1}^{n} x_i} .
\]

3. Show that β̂ and β ∗ are both unbiased estimators of β.

Exercise 1.9 (Simple Regression) We have n points (xi , yi ), i = 1, . . . , n,


and we know that there exists a relationship of the form yi = α + βxi + εi, where the errors εi are centered variables, uncorrelated and with the same variance σ².

1. Recall the formulas for the least squares estimators of α̂ and β̂, as well as
their respective variances.
2. In this question, we assume α is known but β is not.
(a) By returning to the definition of least squares, calculate the least squares estimator β̃ of β.
(b) Calculate the variance of β̃. Show that it is smaller than the variance of β̂.
3. In this question, we assume α is unknown but β is known.
(a) By returning to the definition of least squares, calculate the least squares estimator α̃ of α.
(b) Calculate the variance of α̃. Show that it is smaller than the variance of α̂.

Exercise 1.10 (Forces of Friction and Speed) In the 17th century,


Huygens was interested in the forces of resistance of an object moving in a
fluid (air, water, etc.). He initially hypothesized that the forces of friction were
proportional to the speed of the object, and, after experimental validation, he
suggested that friction forces were proportional to the square of the speed. An
experiment is carried out in which we vary the speed v of an object and measure
the forces of friction F . One tests the relation of proportionality between these
forces of friction and the speed.
1. What model(s) would you test?
2. How would you go about determining the suitable model?
Exercise 1.12 (Price of an Apartment as a Function of its Surface Area) In June 2005, a survey carried out on a set of 108 apartments offered for rent (surface in m²) in the Rennes agglomeration gave the following results (see Figure 5):
1. From the listing of Table 1.2, provide an estimate of the correlation coefficient between the price p and the surface area s of the apartments.

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 134.3450 45.4737 2.954 0.00386
Surface 6.6570 0.6525 10.203 < 2e-16

Residual standard error: 77.93 on 106 degrees of freedom


Multiple R-Squared: 0.4955, Adjusted R-squared: 0.4907
F-statistic: 104.1 on 1 and 106 DF, P-value: < 2.2e-16

Table 1.2 - Rent price as a function of surface area: results of the simple linear regression (R software).

Figure 5: Rental prices of apartments as a function of their surface area.

1. Propose a model allowing the study of the relationship between the price
of apartments and their surface area. Specify the hypotheses of the model.

2. Based on Table 1.2, does the surface area play a role in the rental price of
type 3 apartments? Consider the simple linear model.
3. What is the estimate of the coefficient β (coefficient of the surface area in
the model)? Comment on the precision of this estimate.

4. The average surface area of the 108 apartments is 68.74 m2 and the average
price of the apartments is 509.95 euros. What is the estimated average
rental price per square meter? How does this average price differ from the
estimation of β?
5. In the pool of data you have, how do you know which apartments are
”good deals” from the point of view of surface area?

