STAT1400 2022 1st Week4-Lecture 8

https://www.student.uwa.edu.au/learning/resources/ace/respect-intellectual-property/copyright-and-uwa-unit-content
Adriano Polpo (UWA) STAT 1400

STAT 1400 - Statistics for Science
stat1400-ems@uwa.edu.au
Contributors to lecture material: Adrian Baddeley, Adriano Polpo, John Bamberg, Ed Cripps, Julie Marsh, Kevin Murray,
Gordon Royle, and Berwin Turlach.

father/son heights
Which line is best?
Equation of a line
The equation of a line is
y = mx + c
where
m is the slope or gradient
c is the y-intercept.
Plotting lines
[Figure: plot of two example lines, labelled y = 1x + 8 and y = 0.5x + 3, for x from 1 to 10]
Residuals
[Figure: scatterplot with a candidate line; the vertical distance from a data point to the line is labelled “residual”]
Method of Least Squares
In 1805, the French mathematician Adrien-Marie Legendre proposed the method of least squares.
He suggested that the best line is the one for which the sum of the squares of the residuals is as small as possible.
Positive and negative deviations count equally, because each residual is squared.
Large deviations count disproportionately more than small ones.
Notation
We are given n data points
{(x1 , y1 ), (x2 , y2 ), . . . , (xn , yn )},
and want to find the line y = b0 + b1 x so that
$$\sum_{i=1}^{n} \bigl(y_i - (b_0 + b_1 x_i)\bigr)^2$$
is as small as possible.
(Statisticians use b1 and b0 rather than m and c for the gradient and y-intercept of the line.)
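The minimisation described above can be sketched in a few lines of Python. This is a minimal illustration, not part of the lecture material: it uses the standard closed-form solution (b1 = Sxy / Sxx and b0 = ȳ − b1 x̄, which the slides do not state), and the five data points and the function names are made up for the example.

```python
def least_squares(xs, ys):
    """Return (b0, b1) for the line y = b0 + b1*x that minimises
    the sum of squared residuals (standard closed-form solution)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


def sum_squared_residuals(xs, ys, b0, b1):
    """Sum over all points of (y_i - (b0 + b1*x_i))^2."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data, chosen only to illustrate the method.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

b0, b1 = least_squares(xs, ys)
best = sum_squared_residuals(xs, ys, b0, b1)

# Any other line, for example y = 2 + 0.5x, gives a larger sum of squares.
other = sum_squared_residuals(xs, ys, 2.0, 0.5)

print(f"fitted line: y = {b0:.3f} + {b1:.3f}x")
print(f"sum of squares: fitted {best:.3f}, alternative {other:.3f}")
```

Comparing the two sums of squares shows what "best" means here: the least squares line is the one no other choice of b0 and b1 can beat on this criterion.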
Line of best fit
So the best linear equation fitting the father/son data is
y = 86.088 + 0.514x
So the best guess for the height of the son of a 180 cm father would be
86.088 + 0.514 × 180 ≈ 178.61 cm.
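As a quick check of this arithmetic, the prediction can be computed directly. The sketch below uses the coefficients exactly as printed on the slide; they are rounded, so a value computed from unrounded coefficients may differ slightly. The function name is an illustrative choice.

```python
# Coefficients of the fitted father/son line, as printed on the slide (rounded).
B0 = 86.088  # intercept, in cm
B1 = 0.514   # slope


def predict_son_height(father_cm):
    """Predicted son's height (cm) from the fitted line y = B0 + B1*x."""
    return B0 + B1 * father_cm


prediction = predict_son_height(180)
print(round(prediction, 2))  # about 178.61
```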
The best line
The Anscombe Quartet
The Anscombe Quartet is a famous collection of four datasets, each containing eleven (x, y)-pairs.
Each of the datasets has (almost) the same summary statistics.
Each of the datasets has the same linear model.

Anscombe Quartet Scatterplots
Anscombe Regression Lines
Equation of line is y = 3 + 0.5x with r = 0.82 in each case
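The claim that all four datasets share the same fitted line and correlation can be checked numerically. The sketch below uses the published Anscombe (1973) values and a hand-written helper (the name `fit_and_correlate` is an illustrative choice); it fits a least squares line to each dataset and reports the intercept, slope and correlation coefficient r.

```python
import math


def fit_and_correlate(xs, ys):
    """Return (b0, b1, r): least squares intercept, slope, and correlation."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    r = sxy / math.sqrt(sxx * syy)
    return b0, b1, r


# Anscombe's quartet: datasets 1 to 3 share the same x-values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
quartet = [
    (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
]

results = [fit_and_correlate(xs, ys) for xs, ys in quartet]
for k, (b0, b1, r) in enumerate(results, start=1):
    print(f"dataset {k}: y = {b0:.2f} + {b1:.2f}x, r = {r:.2f}")
```

The numbers agree to two decimal places even though the scatterplots look completely different, which is why a fitted line should always be checked against a plot of the data.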

Terminology
In discussing linear models, the following terms are standard:
The x-variable is called the predictor, or explanatory, variable.
The y-variable is called the response variable.
The line is called the fitted line and is expressed in the form
ŷ = b0 + b1 x.
The “hat” notation, such as ŷ, is used throughout statistics to denote an estimated or predicted value derived from a model.
Prediction
[Figure: scatterplot with the fitted line, used to read off a predicted value]
Are residuals errors?
For each data point (xi , yi ) we have
ŷi = b0 + b1 xi
which is the “predicted value” for yi .
So put
ei = yi − ŷi
and call this the residual or error. It is the difference between the observed value and the fitted value.
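The definition above translates directly into code. In this sketch the line is the father/son fit quoted earlier (with its rounded coefficients), while the three (father, son) height pairs are hypothetical values invented for the illustration; they are not taken from the real dataset.

```python
# Fitted father/son line from the lecture (rounded coefficients).
b0, b1 = 86.088, 0.514

# Hypothetical (father, son) heights in cm, for illustration only.
data = [(170, 175), (180, 177), (190, 186)]

fitted = [b0 + b1 * x for x, _ in data]                  # fitted values
residuals = [y - y_hat for (_, y), y_hat in zip(data, fitted)]  # observed minus fitted

for (x, y), y_hat, e in zip(data, fitted, residuals):
    print(f"x = {x}, observed y = {y}, fitted = {y_hat:.3f}, residual = {e:+.3f}")
```

A positive residual means the observed point lies above the fitted line, and a negative residual means it lies below.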
Residual Plot
A residual plot is a plot whose vertical axis is the residuals, and
whose horizontal axis is the predictor variable or the fitted value.
Residual Analysis
Is there a pattern?
Are they centred around zero?
Are there any outliers?
Is the variance of the residuals roughly constant?
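The first question, whether the residuals show a pattern, can be illustrated with a small made-up example in which a straight line is fitted to data that actually follow a curve. The data (y = x² on seven points) and the helper name are assumptions chosen for the sketch.

```python
def least_squares(xs, ys):
    """Return (b0, b1) for the least squares line y = b0 + b1*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


# Hypothetical curved data: y = x^2, so a straight line is the wrong model.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

b0, b1 = least_squares(xs, ys)
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

print(f"fitted line: y = {b0:.1f} + {b1:.1f}x")
print("residuals:", [round(e, 6) for e in residuals])
print("sum of residuals:", round(sum(residuals), 6))
```

The residuals are positive at both ends and negative in the middle, a clear U-shaped pattern that signals the linear model is unsatisfactory. They still sum to zero, which is always the case for a least squares line with an intercept, so being centred around zero does not by itself show that the fit is good.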
Satisfactory Residual Analysis
[Figure: two examples. Top row: Response plotted against Explanatory variable. Bottom row: the corresponding Residuals plotted against Fitted Values.]
Unsatisfactory Residual Analysis
[Figure: two examples. Top row: Response plotted against Explanatory variable. Bottom row: the corresponding Residuals plotted against Fitted Values.]
More residual analysis
[Figure: two further examples. Top row: Response plotted against Explanatory variable. Bottom row: the corresponding Residuals plotted against Fitted Values.]