Introduction To Mathematical Modeling: Simple Linear Regression
IDIR Yacine
yacine.idir@nyu.edu
January 2024
Temperature at 12h 23.8 16.3 27.2 7.1 25.1 27.5 19.4 19.8 32.2 20.7
O3 max 115.4 76.8 113.8 81.6 115.4 125 83.6 75.2 136.8 102.8
Figure 1: 10 daily observations of temperature and ozone.
We seek a function f relating the maximum ozone concentration y_i to the temperature x_i, that is, such that
\[ y_i \approx f(x_i). \]
To refine the search, it is necessary to provide a criterion that quantifies the quality of the fit of a function f to the data. In this context, we will use the sum of the squares of the differences between the observed values and those predicted by f, where f ranges over a given class of functions F. The mathematical problem can then be written as follows:
\[ \arg\min_{f \in F} \sum_{i=1}^{n} L\bigl(y_i - f(x_i)\bigr), \]
where n represents the number of available data points (size of the sample) and
L(·) is the loss function or cost function.
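To make the criterion concrete, here is a small R sketch (my own illustration, not part of the original text) that evaluates it with the squared loss L(u) = u^2 for an arbitrary affine candidate f, using the 10 (temperature, ozone) pairs of Figure 1; the candidate's coefficients are made-up values.

# Criterion sum_i L(y_i - f(x_i)) with the squared loss, on the data of Figure 1
t12 <- c(23.8, 16.3, 27.2, 7.1, 25.1, 27.5, 19.4, 19.8, 32.2, 20.7)
o3  <- c(115.4, 76.8, 113.8, 81.6, 115.4, 125, 83.6, 75.2, 136.8, 102.8)

squared_loss <- function(u) u^2
criterion <- function(f) sum(squared_loss(o3 - f(t12)))

f_guess <- function(x) 30 + 3 * x   # arbitrary affine candidate (not the optimum)
criterion(f_guess)                  # sum of squared errors of this candidate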
0.1 Modeling
In many situations, a natural first approach is to assume that the variable to explain is a linear function of the explanatory variable, that is, to search for f in the set F of affine functions from R to R. This is the principle of simple linear regression. We suppose that we have a sample of n points (xi , yi ).
∀i ∈ {1, . . . , n}, yi = β1 + β2 xi + εi
The quantities εi , although small, ensure that the points are not exactly aligned on a straight line. These are called errors (or noise) and are assumed to be random. For the model to be relevant to our data, we must nevertheless impose assumptions on them. Here are the ones we will use in what follows:
(H1) E[εi ] = 0 for all i,
(H2) Cov(εi , εj ) = δij σ 2 for all i, j.
In vector notation, the model can be written:
Y = β1 1 + β2 X + ε,
where:
• the vector Y = [y1 , . . . , yn ]′ is random of dimension n,
• the vector 1 = [1, . . . , 1]′ is the vector of Rn whose components all equal
1,
• the vector X = [x1 , . . . , xn ]′ is the given explanatory variable (non-random),
• the coefficients β1 and β2 are the unknown (but not random) parameters
of the model,
• the vector ε = [ε1 , . . . , εn ]′ is the vector of errors (a random vector).
This vector notation is particularly convenient for the geometric interpretation of the problem. We will return in Section 1.3 to the use of the constant term in linear regression modeling. Note that we have already established some notation.
Definition 2 (Least squares estimators) The ordinary least squares (OLS) estimators β̂1 and β̂2 are the values of β1 and β2 minimizing the quantity:
\[ S(\beta_1, \beta_2) = \sum_{i=1}^{n} (y_i - \beta_1 - \beta_2 x_i)^2. \]
In other words, the least squares line minimizes the sum of the squares of the vertical distances of the points (xi , yi ) from the straight line y = β1 + β2 x. The first method consists of noticing that the function S(β1 , β2 ) is strictly convex, so it admits a minimum at a unique point (β̂1 , β̂2 ), which is determined by setting the partial derivatives of S to zero. We obtain the "normal equations":
\[ \frac{\partial S}{\partial \beta_1} = -2 \sum_{i=1}^{n} \bigl(y_i - \hat\beta_1 - \hat\beta_2 x_i\bigr) = 0, \qquad \frac{\partial S}{\partial \beta_2} = -2 \sum_{i=1}^{n} x_i \bigl(y_i - \hat\beta_1 - \hat\beta_2 x_i\bigr) = 0. \]
The first equation gives
\[ \hat\beta_1 = \bar y - \hat\beta_2 \bar x, \tag{1.1} \]
where x̄ and ȳ are the empirical means of x and y respectively. The second equation gives:
\[ \hat\beta_1 \sum_{i=1}^{n} x_i + \hat\beta_2 \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} x_i y_i. \tag{1.2} \]
Substituting (1.1) into (1.2) yields
\[ \hat\beta_2 = \frac{\sum_{i=1}^{n} (x_i - \bar x)(y_i - \bar y)}{\sum_{i=1}^{n} (x_i - \bar x)^2}. \]
This expression for β̂2 assumes that the denominator Σ(xi − x̄)2 is non-zero. Now, the denominator vanishes only if all the xi are equal, a situation of no interest for our problem, so we exclude it a priori.
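As a quick numerical check (my own sketch, not part of the original text), the closed-form estimates above can be computed directly in R on the 10 points of Figure 1 and compared with the built-in function lm:

# Closed-form OLS estimates on the data of Figure 1, checked against lm()
t12 <- c(23.8, 16.3, 27.2, 7.1, 25.1, 27.5, 19.4, 19.8, 32.2, 20.7)
o3  <- c(115.4, 76.8, 113.8, 81.6, 115.4, 125, 83.6, 75.2, 136.8, 102.8)

beta2_hat <- sum((t12 - mean(t12)) * (o3 - mean(o3))) / sum((t12 - mean(t12))^2)
beta1_hat <- mean(o3) - beta2_hat * mean(t12)
c(beta1_hat, beta2_hat)   # intercept and slope from the formulas
coef(lm(o3 ~ t12))        # same values from lm()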
Remarks:
1. The relation β̂1 = ȳ − β̂2 x̄ shows that the OLS line passes through the
center of gravity of the cloud of points (xi , yi ).
2. The expressions obtained for β̂1 and β̂2 show that these two estimators
are linear with respect to the vector Y = [y1 , . . . , yn ]′ .
Theorem 1 If β̂1 and β̂2 are the ordinary least squares estimators from
Definition 2, then under the assumptions (H1) and (H2), β̂1 and β̂2 are unbiased
estimators of β1 and β2 , respectively.
That is, we have:
E[β̂1 ] = β1 , E[β̂2 ] = β2 .
Proof (variances of the estimators). We start again from the expression of β̂2 used in the proof of unbiasedness:
\[ \hat\beta_2 = \beta_2 + \frac{\sum_{i=1}^{n} (x_i - \bar x)\,\varepsilon_i}{\sum_{i=1}^{n} (x_i - \bar x)^2}, \]
where the errors εi are uncorrelated and of the same variance σ 2 so the variance
of the sum equals the sum of the variances:
\[ \mathrm{Var}(\hat\beta_2) = \frac{\sum_{i=1}^{n} (x_i - \bar x)^2\,\sigma^2}{\Bigl(\sum_{i=1}^{n} (x_i - \bar x)^2\Bigr)^2} = \frac{\sigma^2}{\sum_{i=1}^{n} (x_i - \bar x)^2}. \]
Moreover, since β̂1 = ȳ − β̂2 x̄, we need the covariance between ȳ and β̂2 , which is written as:
\[ \mathrm{Cov}(\bar y, \hat\beta_2) = \mathrm{Cov}\!\left(\bar y,\ \frac{\sum_{i}(x_i-\bar x)\varepsilon_i}{\sum_{i}(x_i-\bar x)^2}\right) = \frac{\sigma^2(\bar x - \bar x)}{\sum_{i}(x_i-\bar x)^2} = 0, \]
so that
\[ \mathrm{Var}(\hat\beta_1) = \mathrm{Var}(\bar y - \hat\beta_2 \bar x) = \mathrm{Var}(\bar y) + \bar x^2\,\mathrm{Var}(\hat\beta_2) - 2\bar x\,\mathrm{Cov}(\bar y, \hat\beta_2) = \frac{\sigma^2}{n} + \frac{\bar x^2 \sigma^2}{\sum_{i}(x_i-\bar x)^2}, \]
that is to say:
\[ \mathrm{Var}(\hat\beta_1) = \frac{\sigma^2 \sum_{i} x_i^2}{n \sum_{i}(x_i-\bar x)^2}. \]
Remarks:
1. We have seen that the OLS line passes through the center of gravity (x̄, ȳ) of the cloud of points. Suppose that x̄ is positive: if we increase the slope while keeping the line through this point, the intercept decreases, and vice versa. We thus recover the negative sign of the covariance between β̂1 and β̂2 .
2. In inferential statistics, the variance of an estimator typically decreases inversely proportionally to the sample size, that is to say in 1/n; in other terms, its precision is generally of order 1/√n. This is not immediately apparent in the expression obtained for the variance of β̂2 , but note that the denominator Σ(xi − x̄)2 typically grows proportionally to n, so the usual rate is recovered (a small simulation illustrating this is sketched below):
\[ \mathrm{Var}(\hat\beta_2) = \frac{\sigma^2}{\sum_{i=1}^{n} (x_i - \bar x)^2}. \]
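Below is a small simulation (my own illustration; all numerical choices are arbitrary) suggesting the 1/√n precision: the standard deviation of β̂2 roughly halves when the sample size is multiplied by four.

# Monte Carlo check of the 1/sqrt(n) precision of beta2_hat
set.seed(1)
sim_beta2 <- function(n, beta1 = 1, beta2 = 2, sigma = 1, reps = 2000) {
  replicate(reps, {
    x <- runif(n)
    y <- beta1 + beta2 * x + rnorm(n, sd = sigma)
    sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # beta2_hat
  })
}
sd(sim_beta2(25))    # precision with n = 25
sd(sim_beta2(100))   # roughly half as large with n = 100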
0.2.3 Calculation of residuals and residual variance
In R2 (the plane of the points (xi , yi )), β̂1 is the intercept and β̂2 the slope of the fitted line. This line minimizes the sum of the squared vertical distances from the points of the cloud to the fitted line. Let ŷi = β̂1 + β̂2 xi be the ordinate of the point of the least squares line at abscissa xi , also called the fitted value. The residuals are defined by (see figure 1.2):
\[ e_i = y_i - \hat y_i = y_i - \hat\beta_1 - \hat\beta_2 x_i, \qquad i = 1, \dots, n. \]
Note now that the variances and covariance of the estimators β̂1 and β̂2 established in the previous section are not usable in practice, because they involve the variance σ 2 of the errors, which is generally unknown. However, we can give an unbiased estimator of σ 2 thanks to the residuals.
Theorem 4 (Unbiased estimator of σ 2 ) The statistic
\[ \hat\sigma^2 = \frac{1}{n-2} \sum_{i=1}^{n} e_i^2 \]
is an unbiased estimator of σ 2 .
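The estimator of Theorem 4 is easy to check numerically; the R sketch below (simulated data and values of my own choosing) computes it from the residuals of a fitted model and compares it with the squared residual standard error reported by summary().

# sigma_hat^2 = sum(e_i^2) / (n - 2) from the residuals of lm()
set.seed(2)
x <- runif(30, 0, 10)
y <- 5 + 2 * x + rnorm(30, sd = 3)
fit <- lm(y ~ x)
e <- residuals(fit)
sigma2_hat <- sum(e^2) / (length(y) - 2)
sigma2_hat
summary(fit)$sigma^2   # identical value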
0.2.4 Prediction
One of the purposes of regression is to make predictions, that is, to predict the
response variable y in the presence of a new value of the explanatory variable
x. So let xn+1 be a new value, for which we want to predict yn+1 . The model
remains the same:
\[ y_{n+1} = \beta_1 + \beta_2 x_{n+1} + \varepsilon_{n+1}, \]
with E[εn+1 ] = 0, Var(εn+1 ) = σ 2 and Cov(εi , εn+1 ) = 0 for i = 1, . . . , n.
It is natural to predict the corresponding value via the fitted model:
\[ \hat y_{n+1} = \hat\beta_1 + \hat\beta_2 x_{n+1}. \]
Its variance is
\[ \mathrm{Var}(\hat y_{n+1}) = \mathrm{Var}(\hat\beta_1) + x_{n+1}^2\,\mathrm{Var}(\hat\beta_2) + 2 x_{n+1}\,\mathrm{Cov}(\hat\beta_1, \hat\beta_2) = \sigma^2 \left( \frac{1}{n} + \frac{(x_{n+1}-\bar x)^2}{\sum_{i}(x_i-\bar x)^2} \right), \]
and the variance of the prediction error ε̂n+1 = yn+1 − ŷn+1 is
\[ \mathrm{Var}(\hat\varepsilon_{n+1}) = \sigma^2 \left( 1 + \frac{1}{n} + \frac{(x_{n+1}-\bar x)^2}{\sum_{i}(x_i-\bar x)^2} \right). \]
Thus, the variance increases as xn+1 moves away from the center of gravity of the cloud. In other words, prediction is more perilous when xn+1 is "far" from the mean x̄, because the variance of the prediction error can then be very large. This can be understood intuitively: one more observation at such an xn+1 would shift the center of gravity of the cloud, and this shift impacts the prediction error.
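The following R sketch (simulated data of my own choosing) illustrates this: the prediction interval returned by predict() is clearly wider for a new x far from the mean of the observed x's than for one at the mean.

# Prediction interval width at the center of the design vs. far away from it
set.seed(3)
x <- runif(40, 0, 20)
y <- 10 + 1.5 * x + rnorm(40, sd = 4)
fit <- lm(y ~ x)
new <- data.frame(x = c(mean(x), max(x) + 10))        # at x_bar vs. far away
pred <- predict(fit, newdata = new, interval = "prediction")
pred[, "upr"] - pred[, "lwr"]   # the second interval is clearly wider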
0.3 Geometric Interpretations
0.3.1 Representation of Variables
If we approach the problem from a vector perspective, we have two vectors at
our disposal: the vector X = [x1 , . . . , xn ]′ of observations for the explanatory
variable and the vector Y = [y1 , . . . , yn ]′ of observations for the variable to
be explained. These two vectors belong to the same space Rn : the space of
variables.
If we add the vector 1 = [1, . . . , 1]′ , we see that, since the xi are not all equal, the vectors 1 and X are not collinear: they generate a plane in Rn that we will denote by M (X). One can project the vector Y orthogonally onto the subspace M (X), of dimension 2; this projection is denoted ProjM (X) (Y ) (see Figure 1.3). Since (1, X) is a basis of M (X), every vector Y ′ of M (X) admits a unique decomposition of the form Y ′ = β1 1 + β2 X. By definition of the orthogonal projection, ProjM (X) (Y ) is the unique vector of M (X) minimizing the Euclidean distance ∥Y − Y ′ ∥ over Y ′ ∈ M (X), which comes back to the same as minimizing its square. By definition of the Euclidean norm, this quantity is:
\[ \|Y - Y'\|^2 = \sum_{i=1}^{n} \bigl(y_i - (\beta_1 + \beta_2 x_i)\bigr)^2, \]
which brings us back to the method of ordinary least squares. We deduce from this that the coefficients of the projection are precisely the least squares estimators, and that Ŷ = ProjM (X) (Y ) = β̂1 1 + β̂2 X = [ŷ1 , . . . , ŷn ]′ , with the expressions of β̂1 , β̂2 and ŷi seen previously.
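A short R sketch (my own check, with simulated data) makes this geometric statement concrete: the fitted values of lm() coincide with the orthogonal projection of Y onto the plane spanned by 1 and X.

# Fitted values = orthogonal projection of y onto span(1, x)
set.seed(4)
x <- runif(15)
y <- 1 + 2 * x + rnorm(15, sd = 0.5)
A <- cbind(1, x)                          # basis of M(X)
P <- A %*% solve(t(A) %*% A) %*% t(A)     # orthogonal projection matrix onto M(X)
max(abs(P %*% y - fitted(lm(y ~ x))))     # numerically zero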
0.3.2 The Coefficient of Determination R2
We keep the notations from the previous paragraph, with Y ′ = [ŷ1 , . . . , ŷn ]′ the
orthogonal projection of the vector Y on M (X) and
ê = Y − Y ′ = [e1 , . . . , en ]′
the vector of residuals already encountered in section 1.2.3. The Pythagorean
theorem gives us directly:
n
X n
X n
X
(yi − ȳ)2 = (ŷi − ȳ)2 + e2i
i=1 i=1 i=1
Remarks:
1. R2 can also be seen as the square of the empirical correlation coefficient
between xi and yi (see exercise 1.2):
\[ R^2 = \left( \frac{\sum_{i=1}^{n} (x_i - \bar x)(y_i - \bar y)}{\sqrt{\sum_{i=1}^{n} (x_i - \bar x)^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar y)^2}} \right)^2 = \rho_{x,y}^2 \]
(a quick numerical check in R is sketched below).
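Here is the check (simulated data of my own choosing): in simple regression, the R² reported by summary(lm(...)) equals the squared empirical correlation between x and y.

# R^2 equals the squared empirical correlation in simple regression
set.seed(5)
x <- rnorm(50)
y <- 3 - x + rnorm(50)
fit <- lm(y ~ x)
summary(fit)$r.squared
cor(x, y)^2            # same value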
1 Case of Gaussian Errors
Beyond the expressions for the estimators and for their variances, one would like to know their distributions: this would allow, for example, obtaining confidence regions and carrying out hypothesis tests. To this end, it is first necessary to make a stronger assumption on our model, namely to specify the distribution of the errors. We suppose here that the errors are Gaussian. The hypotheses (H1) and (H2) then become:
(H1) εi ∼ N (0, σ 2 )
(H2) εi are mutually independent
The simple regression model becomes a parametric model, where the parameters (β1 , β2 , σ 2 ) take values in R × R × R+ . Since the law of the εi is known, the laws of the yi follow:
\[ y_i \sim \mathcal{N}(\beta_1 + \beta_2 x_i,\ \sigma^2), \quad \text{the } y_i \text{ being mutually independent}. \]
For fixed σ 2 , maximizing the likelihood in (β1 , β2 ) amounts to minimizing S(β1 , β2 ), that is, it is attained for β1 = β̂1 and β2 = β̂2 : the maximum likelihood estimators of β1 and β2 are equal to the least squares estimators.
That being said, it remains simply to maximize log L(β̂1 , β̂2 , σ 2 ) with respect
to σ 2 . Let’s calculate the derivative with respect to σ 2 :
\[ \frac{\partial \log L(\hat\beta_1, \hat\beta_2, \sigma^2)}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4} S(\hat\beta_1, \hat\beta_2) = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4} \sum_{i=1}^{n} (y_i - \hat\beta_1 - \hat\beta_2 x_i)^2. \]
This derivative vanishes for σ 2 = S(β̂1 , β̂2 )/n, so the maximum likelihood estimator of σ 2 is (1/n) Σ e2i ; it is biased, and in practice one prefers the unbiased estimator of Theorem 4. In the following we use the notations
\[ c = \mathrm{Cov}(\hat\beta_1, \hat\beta_2) = \frac{-\sigma^2 \bar x}{\sum_{i}(x_i-\bar x)^2}, \qquad \sigma_1^2 = \mathrm{Var}(\hat\beta_1) = \frac{\sigma^2 \sum_{i} x_i^2}{n \sum_{i}(x_i-\bar x)^2}, \qquad \sigma_2^2 = \mathrm{Var}(\hat\beta_2) = \frac{\sigma^2}{\sum_{i}(x_i-\bar x)^2}, \]
together with their estimated counterparts, obtained by replacing σ 2 with the unbiased estimator
\[ \hat\sigma^2 = \frac{1}{n-2} \sum_{i=1}^{n} e_i^2: \qquad \hat\sigma_1^2 = \frac{\hat\sigma^2 \sum_{i} x_i^2}{n \sum_{i}(x_i-\bar x)^2}, \qquad \hat\sigma_2^2 = \frac{\hat\sigma^2}{\sum_{i}(x_i-\bar x)^2}. \]
The quantities σ̂1² and σ̂2² correspond to the estimators of the variances of β̂1 and β̂2 respectively.
Properties 1 (Laws of the estimators with known variance) The laws of the OLS estimators with known variance σ 2 are:
\[ \hat\beta = \begin{pmatrix} \hat\beta_1 \\ \hat\beta_2 \end{pmatrix} \sim \mathcal{N}\!\left( \begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix},\ \sigma^2 V \right), \quad \text{where} \quad V = \frac{1}{\sum_{i}(x_i-\bar x)^2} \begin{pmatrix} \sum_{i} x_i^2 / n & -\bar x \\ -\bar x & 1 \end{pmatrix} = \frac{1}{\sigma^2} \begin{pmatrix} \sigma_1^2 & c \\ c & \sigma_2^2 \end{pmatrix}. \]
Remark. These properties, like those to come, are no easier to prove in the context of simple linear regression than in that of multiple linear regression. This is why we postpone the proofs to Chapter 3.
The problem with the above properties is that they involve the theoretical
variance σ 2 , generally unknown. The natural way to proceed is to replace it
by its estimator σ̂ 2 . The distributions of the estimators are then slightly modified.
Properties 2 (Laws of the estimators with estimated variance) The laws of the OLS estimators with estimated variance σ̂ 2 are:
\[ \text{(i)}\ \frac{\hat\beta_1 - \beta_1}{\hat\sigma_1} \sim t_{n-2}, \qquad \text{(ii)}\ \frac{\hat\beta_2 - \beta_2}{\hat\sigma_2} \sim t_{n-2}, \]
where tn−2 denotes the Student distribution with (n − 2) degrees of freedom. (A numerical check of these standardized statistics is sketched below.)
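The R sketch below (simulated data, my own illustration) recomputes the standardized statistic of (ii) by hand, using σ̂2 = σ̂ / sqrt(Σ(xi − x̄)²), and compares it with the t value reported by summary(lm(...)) for the test of β2 = 0.

# t statistic of the slope recomputed by hand vs. summary(lm(...))
set.seed(6)
x <- runif(30, 0, 10)
y <- 2 + 0.8 * x + rnorm(30)
fit <- lm(y ~ x)
sigma_hat <- summary(fit)$sigma                       # hat(sigma)
se_beta2  <- sigma_hat / sqrt(sum((x - mean(x))^2))   # hat(sigma)_2
t_beta2   <- coef(fit)["x"] / se_beta2                # (beta2_hat - 0) / hat(sigma)_2
c(t_beta2, summary(fit)$coefficients["x", "t value"]) # identical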
1.1.2 Prediction
In terms of prediction in the case of Gaussian errors, the results obtained in
section 1.2.4 for the expectation and the variance are still valid. Moreover,
since yn+1 is linear in β1 , β2 and εn+1 , we can specify its law:
\[ y_{n+1} - \hat y_{n+1} \sim \mathcal{N}\!\left( 0,\ \sigma^2 \left( 1 + \frac{1}{n} + \frac{(x_{n+1}-\bar x)^2}{\sum_{i}(x_i-\bar x)^2} \right) \right). \]
Once again we do not know σ 2 and we estimate it by σ̂ 2 . As yn+1 − ŷn+1 and σ̂ 2 are independent, we can state a result giving confidence intervals for yn+1 :
Proposition 3 (Law and confidence interval for prediction) With the previous notations and hypotheses, we have:
\[ \frac{y_{n+1} - \hat y_{n+1}}{\hat\sigma \sqrt{1 + \dfrac{1}{n} + \dfrac{(x_{n+1}-\bar x)^2}{\sum_{i}(x_i-\bar x)^2}}} \sim t_{n-2}, \]
from which we deduce a confidence interval of level 1 − α for yn+1 :
\[ \hat y_{n+1} \pm t_{n-2}(1-\alpha/2)\ \hat\sigma \sqrt{1 + \frac{1}{n} + \frac{(x_{n+1}-\bar x)^2}{\sum_{i}(x_i-\bar x)^2}}. \]
(A numerical illustration with R is sketched below.)
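The sketch below (simulated data and an arbitrary new point, my own choices) computes this interval from the formula and compares it with the one returned by predict(..., interval = "prediction").

# Prediction interval from the formula vs. predict()
set.seed(7)
n <- 25
x <- runif(n, 0, 10)
y <- 4 + 1.2 * x + rnorm(n, sd = 2)
fit <- lm(y ~ x)
x_new  <- 12
y_pred <- coef(fit)[1] + coef(fit)[2] * x_new
s_hat  <- summary(fit)$sigma
half   <- qt(0.975, n - 2) * s_hat *
  sqrt(1 + 1/n + (x_new - mean(x))^2 / sum((x - mean(x))^2))
c(y_pred - half, y_pred + half)                               # interval from the formula
predict(fit, data.frame(x = x_new), interval = "prediction")  # same bounds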
2 Example
We will now deal with the 50 daily observations presented in Annex I. The variable to be explained is the ozone concentration, denoted O3, and the explanatory variable is the temperature at midday, denoted T12. The data are processed with the R statistical software.
Residuals:
Min 1Q Median 3Q Max
-45.256 -15.326 -3.461 17.634 40.072
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 31.4150 13.0584 2.406 0.0200 *
T12 2.7010 0.6266 4.311 8.04e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 20.5 on 48 degrees of freedom
Multiple R-Squared: 0.2791, Adjusted R-squared: 0.2641
F-statistic: 18.58 on 1 and 48 DF, p-value: 8.041e-05
The software output gives the estimated values β̂1 and β̂2 of the parameters, their standard errors, and the statistics (with p-values) of the tests of the hypotheses H0 : βi = 0. Here, H0 is rejected for both estimated parameters.
Your example:
- Generate a vector X with any distribution.
- Generate a normal vector Y with mean a · X + b and variance σ 2 .
- Code the function S(a, b).
- Minimize this function with respect to a and b using the function optim.
- Compare your result with the function lm (a possible sketch is given below).
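One possible way to carry out these steps (a sketch only; the values of n, a, b and σ below are arbitrary choices of mine, not prescribed by the text):

# Simulate data, code S(a, b), minimize it with optim() and compare with lm()
set.seed(123)
n     <- 50
a     <- 2.7     # "true" slope
b     <- 31.4    # "true" intercept
sigma <- 20
x <- runif(n, min = 5, max = 35)                # any distribution works here
y <- a * x + b + rnorm(n, mean = 0, sd = sigma)

# S(a, b) = sum_i (y_i - b - a * x_i)^2, with par = c(a, b)
S <- function(par, x, y) {
  sum((y - par[2] - par[1] * x)^2)
}

fit_optim <- optim(par = c(0, 0), fn = S, x = x, y = y)
fit_optim$par        # estimated (slope, intercept) by numerical minimization

fit_lm <- lm(y ~ x)
coef(fit_lm)         # (intercept, slope) from lm(): very close to optim's result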
3 Exercises
Exercise 1.1 (MCQ) 1. In a simple regression, if the R2 value is 1, are the points
aligned?
A. Yes
B. No
C. It depends;
2. Does the OLS regression line of a simple regression always pass through
the point (x̄, ȳ)?
A. Always;
B. Never;
C. Sometimes;
3. For which value xN of the explanatory variable is the prediction of the corresponding response the most precise?
A. xN = 0;
B. xN = x̄;
C. No relation.
Exercise 1.2 (R2 and empirical correlation) Recall the formula defining
the coefficient of determination R2 and develop it to show that it is equal to the
square of the empirical correlation between x and y, denoted rxy , that is to say
that we have:
\[ R^2 = r_{xy}^2 = \left( \frac{\sum_{i=1}^{n} (x_i - \bar x)(y_i - \bar y)}{\sqrt{\sum_{i=1}^{n} (x_i - \bar x)^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar y)^2}} \right)^2. \]
Exercise 1.3 (Weights of fathers and sons) The statistical study below
deals with the weights of fathers and their firstborn son.
Father 65 63 67 64 68 62 70 66 68 67 69 71
Son 68 66 68 65 69 66 68 65 71 67 68 70
Here are the numerical results we have obtained, where fi denotes the weight of the i-th father and si the weight of his son:
\[ \sum f_i = 800, \quad \sum f_i^2 = 53418, \quad \sum f_i s_i = 54107, \quad \sum s_i = 811, \quad \sum s_i^2 = 54849. \]
1. Calculate the least squares line of the weights of the sons as a function of
the weight of the fathers.
2. Calculate the least squares line of the weight of the fathers based on the
weight of the sons.
3. Show that the product of the slopes of the two lines is equal to the square
of the correlation coefficient between the si and fi (or the coefficient of
determination).
Exercise 1.4 (Height of a tree) We wish to express the height (in feet)
of a tree based on its diameter x in inches using a simple linear model. For
this purpose, we have measured 20 pairs (diameter, height) and carried out the following calculations: x̄ = 4.53, ȳ = 8.65, and
\[ \frac{1}{20} \sum (x_i - \bar x)^2 = 10.97, \quad \frac{1}{20} \sum (y_i - \bar y)^2 = 2.24, \quad \frac{1}{20} \sum (x_i - \bar x)(y_i - \bar y) = 3.77. \]
We denote by β̂0 + β̂1 x the regression line. Calculate β̂0 and β̂1 .
We are given the standard error estimates of β̂0 , β̂1 , and β̂2 , which are 0.162,
0.162, and 0.05, respectively. We assume the errors εi are Gaussian, centered,
with the same variance and independent. Test H0 : β1 = 0 against H1 : β1 ̸= 0.
Why is this test interesting in our context? What do you think of the result?
1. Represent the point cloud. Determine the regression line. Calculate the
coefficient of determination R2 . Comment on it.
2. Two trainees seem to stand out from the others. Remove them and de-
termine the regression line on the remaining ten points. Calculate the
coefficient of determination R2 . Comment on it.
Exercise 1.6 (The Height of Eucalyptus)
We wish to explain the height y (in meters) of a tree as a function of its circum-
ference x (in centimeters) at 1.30m from the ground. We recorded n = 1429
pairs (xi , yi ), the cloud of points being represented in Figure 1.8. We obtained
(x̄, ȳ) = (47.3, 21.2), and:
\[ \sum (x_i - \bar x)^2 = 102924, \quad \sum (y_i - \bar y)^2 = 8857, \quad \sum (x_i - \bar x)(y_i - \bar y) = 26466. \]
1. Calculate the least squares line for the model y = β1 +β2 x+e and represent
it on Figure 1.8.
2. Calculate the coefficient of determination R2 . Comment on the quality of
the fit of the data to the model.
With these estimators, the sum of squared residuals is $\sum_{i=1}^{n} (y_i - \hat y_i)^2 = 2052$. If we assume that the errors εi are Gaussian, centered, independent, and have the same variance σ 2 , deduce an unbiased estimator of σ 2 :
\[ \hat\sigma^2 = \frac{\sum_{i=1}^{n} (y_i - \hat y_i)^2}{n-2}. \]
Provide an estimator of the variance of β̂1 .
using a heart-rate monitor. We seek to know whether age influences the threshold heart rate. We have data for 20 pairs of age (xi ) and threshold frequency (yi ). We obtained (x̄, ȳ) = (35, 170) and:
\[ \sum (x_i - \bar x)^2 = 1991, \quad \sum (y_i - \bar y)^2 = 189.2, \quad \sum (x_i - \bar x)(y_i - \bar y) = -195.4. \]
We consider the model
\[ y_i = \beta x_i + e_i, \qquad i = 1, \dots, n, \]
where we assume that the errors ei are such that E[ei ] = 0 and Cov(ei , ej ) = δij σ 2 .
1. Going back to the definition of least squares, show that the least squares
estimator of β is
\[ \hat\beta = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2}. \]
2. Show that the line passing through the origin and the center of gravity of
the point cloud is
\[ y = \beta^* x, \quad \text{with} \quad \beta^* = \frac{\sum_{i=1}^{n} y_i}{\sum_{i=1}^{n} x_i}. \]
1. Recall the formulas for the least squares estimators α̂ and β̂, as well as their respective variances.
2. In this question, we assume α is known but β is not.
(a) By returning to the definition of least squares, compute the least squares estimator β̃ of β.
(b) Compute the variance of β̃. Show that it is smaller than or equal to the variance of β̂.
3. In this question, we assume α is unknown but β is known.
(a) By returning to the definition of least squares, compute the least squares estimator α̃ of α.
(b) Compute the variance of α̃. Show that it is smaller than or equal to the variance of α̂.
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 134.3450 45.4737 2.954 0.00386
Surface 6.6570 0.6525 10.203 < 2e-16
Figure 5: Rental prices of apartments as a function of their surface area.
1. Propose a model allowing the study of the relationship between the price
of apartments and their surface area. Specify the hypotheses of the model.
2. Based on Table 1.2, does the surface area play a role in the rental price of
type 3 apartments? Consider the simple linear model.
3. What is the estimate of the coefficient β (coefficient of the surface area in
the model)? Comment on the precision of this estimate.
4. The average surface area of the 108 apartments is 68.74 m2 and the average
price of the apartments is 509.95 euros. What is the estimated average
rental price per square meter? How does this average price differ from the
estimation of β?
5. In the pool of data you have, how do you know which apartments are
”good deals” from the point of view of surface area?