Lecture 2
1. Introduction
2. Simultaneous distributions
3. Covariance and correlation
4. Conditional distributions
5. Prediction
Basic ideas
We will often consider two (or more) variables
simultaneously.
Examples (B & S, page 15)
There are two typical ways this can be done:

(1) The data $(x_1, y_1), \ldots, (x_n, y_n)$ are considered as independent replications of a pair of random variables $(X, Y)$.

(2) The data are described by a linear regression model
$$y_i = a + b x_i + \epsilon_i, \qquad i = 1, \ldots, n$$
Here $y_1, \ldots, y_n$ are the responses, which are considered to be realizations of random variables, while $x_1, \ldots, x_n$ are considered to be fixed (i.e. non-random) and the $\epsilon_i$'s are random errors (noise).

Situation (1) occurs for observational studies, while situation (2) occurs for planned experiments (where the values of the $x_i$'s are under the control of the experimenter).

In situation (1) we will often condition on the observed values of the $x_i$'s, and analyse the data as if they were from situation (2).

In this lecture we focus on situation (1). (A small simulated illustration of both situations is sketched below.)
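As a side illustration (not part of the original slides), the following Python sketch simulates data of both kinds; the sample size, coefficients and error standard deviation are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100

# Situation (1): (x_i, y_i) are i.i.d. replications of a pair (X, Y),
# here drawn from a bivariate normal with correlation 0.6.
xy = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.6], [0.6, 1.0]], size=n)
x_obs, y_obs = xy[:, 0], xy[:, 1]

# Situation (2): the x_i are fixed design points chosen by the experimenter,
# and only y_i = a + b*x_i + eps_i is random.
a, b, sigma = 1.0, 0.5, 0.3
x_fixed = np.linspace(0.0, 1.0, n)                     # non-random design
y_resp = a + b * x_fixed + rng.normal(0.0, sigma, n)   # responses with random errors
```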
Joint or simultaneous distributions
The most common way to describe the simultaneous distribution of a pair of random variables $(X, Y)$ is through their simultaneous probability density, $f(x, y)$.

This is defined so that
$$P((X, Y) \in A) = \int\!\!\int_A f(x, y)\,dx\,dy$$

The marginal density of X is obtained by integrating over all possible values of Y:
$$f_1(x) = \int f(x, y)\,dy$$
and similarly for the marginal density $f_2(y)$ of Y.

If $f(x, y) = f_1(x) f_2(y)$, the random variables X and Y are independent.

Otherwise, they are dependent, which means that there is a relationship between X and Y, so that certain realizations of X tend to occur more often together with certain realizations of Y than others.
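To make the definitions concrete, here is a small numerical sketch (my own, using an illustrative bivariate normal density from scipy) that recovers a marginal density by integrating the joint density over y, and checks the independence criterion.

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

# Illustrative joint density: bivariate normal with correlation 0.7.
rho = 0.7
joint = multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, rho], [rho, 1.0]])

x0 = 0.5
ys = np.linspace(-8.0, 8.0, 4001)
dy = ys[1] - ys[0]

# Marginal density of X at x0: integrate f(x0, y) over y (simple Riemann sum).
f_at_x0 = joint.pdf(np.column_stack([np.full_like(ys, x0), ys]))
f1_numeric = np.sum(f_at_x0) * dy

print(f1_numeric, norm.pdf(x0))   # both approx 0.352: the marginal of X is N(0, 1)

# Since rho != 0, f(x, y) != f1(x) * f2(y), so X and Y are dependent:
print(joint.pdf([x0, 0.5]), norm.pdf(x0) * norm.pdf(0.5))
```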
Covariance and correlation
The dependence between X and Y is often summarized by the covariance:
$$\sigma_{12} = \mathrm{Cov}(X, Y) = E[(X - \mu_1)(Y - \mu_2)]$$
and the correlation coefficient:
$$\rho = \mathrm{corr}(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\mathrm{sd}(X)\,\mathrm{sd}(Y)}$$

The following are important properties of the correlation coefficient:

corr(X, Y) takes values in the interval [-1, 1].

corr(X, Y) describes the linear relationship between Y and X.

If X and Y are independent, then corr(X, Y) = 0, but not (necessarily) the other way around (see the numerical sketch below).
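The last point can be illustrated numerically; the following sketch (my own example, not from the slides) uses Y = X^2 with a symmetric X, which is completely determined by X yet has correlation close to zero with it.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)

# Nonlinear dependence: Y = X^2 is a function of X ...
y = x**2

# ... but Cov(X, Y) = E[X^3] - E[X] E[X^2] = 0 for symmetric X,
# so the correlation is essentially zero.
print(np.corrcoef(x, y)[0, 1])   # approx 0.0

# A (noisy) linear dependence, by contrast, gives correlation near 1:
print(np.corrcoef(x, 2.0 * x + rng.normal(scale=0.1, size=x.size))[0, 1])
```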
Correlation: correlated data
[Figure: four scatter plots of simulated (x, y) data, with sample correlations of approximately 0.9, 0.5, -0.5 and -0.9 in the panel titles]
Correlation: uncorrelated data
[Figure: scatter plot of (x, y) data with correlation 0.0]
Correlation: uncorrelated data
[Figure: a second scatter plot of (x, y) data with correlation 0.0]
Transformations
Sometimes a transformation of one or both variables may improve the linear relationship.
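The slide gives no specific example; as an illustration of my own (with made-up data), taking logarithms of both variables turns a power-law relationship with multiplicative noise into an approximately linear one and increases the correlation.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(1.0, 10.0, size=500)

# Power-law relationship with multiplicative noise: y = 2 * x^1.5 * exp(noise)
y = 2.0 * x**1.5 * np.exp(rng.normal(scale=0.3, size=x.size))

print(np.corrcoef(x, y)[0, 1])                  # roughly 0.86 on the original scale
print(np.corrcoef(np.log(x), np.log(y))[0, 1])  # roughly 0.95 after the log-log transform
```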
Sample versions of covariance and correlation
Data $(x_1, y_1), \ldots, (x_n, y_n)$ are independent replicates of $(X, Y)$.

Empirical analogues to the population concepts and basic results:

Empirical covariance:
$$s_{12,n} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x}_n)(y_i - \bar{y}_n)$$

Empirical correlation coefficient:
$$r_n = \frac{s_{12,n}}{s_{1n}\, s_{2n}}$$

When n increases, these empirical quantities approach the corresponding population quantities ($\sigma_{12}$ and $\rho$).
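A small sketch (my own) that computes the empirical covariance and correlation directly from these formulas and checks them against numpy's built-in routines:

```python
import numpy as np

rng = np.random.default_rng(3)
x, y = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.6], [0.6, 2.0]], size=200).T

n = x.size
s12 = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)  # empirical covariance
s1, s2 = x.std(ddof=1), y.std(ddof=1)                    # empirical standard deviations
r = s12 / (s1 * s2)                                      # empirical correlation

# Agrees with the built-in routines:
print(s12, np.cov(x, y)[0, 1])
print(r, np.corrcoef(x, y)[0, 1])
```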
Conditional distributions
The conditional density of Y given X = x is given by
$$f_2(y \mid x) = \frac{f(x, y)}{f_1(x)}$$

If X and Y are independent, so that $f(x, y) = f_1(x) f_2(y)$, we see that $f_2(y \mid x) = f_2(y)$. This is reasonable, and corresponds to the fact that a realization of X carries no information about the distribution of Y.

Using the conditional density, one may find the conditional mean and the conditional variance:

Conditional mean: $\mu_{2|x} = E(Y \mid x)$

Conditional variance: $\sigma^2_{2|x} = \mathrm{Var}(Y \mid x)$

When (X, Y) is bivariate normally distributed, $\mu_{2|x}$ is linear in x, and is known as the regression of Y on X = x (cf. below).
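As an informal check (my own simulation, for standard normal marginals and correlation 0.8), one can estimate the conditional mean and variance of Y given X near x0 by averaging over pairs whose x-value falls close to x0; the estimates show the linear conditional mean and constant conditional variance given by the bivariate normal formulas quoted later in the lecture.

```python
import numpy as np

rng = np.random.default_rng(4)
rho, n = 0.8, 200_000
x, y = rng.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]], size=n).T

# Estimate E(Y | x0) and Var(Y | x0) from pairs with X close to x0.
for x0 in (-1.0, 0.0, 1.0):
    near = np.abs(x - x0) < 0.05
    print(x0, y[near].mean(), y[near].var())

# The conditional means are close to rho * x0 (linear in x0), and the
# conditional variances are close to 1 - rho**2 = 0.36.
```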
Prediction
When X and Y are dependent, it is reasonable
that knowledge of the value of X can be used
to improve the prediction for the correspond-
ing realization of Y .
Let $\hat{Y}(x)$ be such a predictor. Then:

$\hat{Y}(x) - Y$ is the prediction error.

$\hat{Y}_{opt}(x) = E(Y \mid x)$ minimizes $E[(\hat{Y}(x) - Y)^2]$, the mean squared prediction error (illustrated numerically below).

$E(Y \mid x)$ will often depend on unknown parameters, and it may be complicated to compute.
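A brief simulation sketch (my own example, in which the conditional mean happens to be E(Y | x) = x^2) illustrating that the conditional mean gives a smaller mean squared prediction error than the unconditional mean or a fitted straight line:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200_000
x = rng.normal(size=n)
y = x**2 + rng.normal(scale=0.5, size=n)   # here E(Y | x) = x**2

mse = lambda pred: np.mean((pred - y) ** 2)

print(mse(x**2))                   # conditional mean: approx 0.25 (the noise variance)
print(mse(np.full(n, y.mean())))   # unconditional mean: approx 2.25
slope, intercept = np.polyfit(x, y, 1)
print(mse(intercept + slope * x))  # fitted straight line: also approx 2.25 here
```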
Linear prediction
It is convenient to consider linear predictors, i.e. predictors of the form:
$$\hat{Y}_{lin}(x) = a + b x$$

Minimizing $E[(a + bX - Y)^2]$ w.r.t. a and b yields:
$$b = \rho\,\frac{\sigma_2}{\sigma_1} \qquad\text{and}\qquad a = \mu_2 - b\,\mu_1$$

The minimum is $E[(\hat{Y}_{lin}(x) - Y)^2] = \sigma_2^2 (1 - \rho^2)$.

Note that if $\rho^2$ increases, the mean squared error decreases.
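A quick numerical check (my own, with arbitrary parameter values) that these coefficients attain the stated minimum, and that perturbing them only increases the mean squared error:

```python
import numpy as np

rng = np.random.default_rng(6)
mu1, mu2, s1, s2, rho = 1.0, -2.0, 1.5, 3.0, 0.6
cov = [[s1**2, rho * s1 * s2], [rho * s1 * s2, s2**2]]
x, y = rng.multivariate_normal([mu1, mu2], cov, size=500_000).T

b = rho * s2 / s1
a = mu2 - b * mu1

print(np.mean((a + b * x - y) ** 2))   # approx s2**2 * (1 - rho**2) = 5.76
print(s2**2 * (1 - rho**2))

# Perturbed coefficients give a larger mean squared error:
print(np.mean((a + 0.2 + b * x - y) ** 2))
print(np.mean((a + (b + 0.2) * x - y) ** 2))
```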
Linear prediction, contd.
Without knowledge of the value of X, the best predictor is the unconditional mean of Y, i.e.
$$\hat{Y}_0 = \mu_2.$$
This has mean squared error $E[(\hat{Y}_0 - Y)^2] = \sigma_2^2$.

Hence, a sensible measure of the quality of a prediction is the ratio
$$\frac{E[(\hat{Y}_{lin}(x) - Y)^2]}{E[(\hat{Y}_0 - Y)^2]} = 1 - \rho^2.$$

For judging a prediction, the squared correlation coefficient is the appropriate measure.
When a and b are unknown, we plug in the empirical counterparts:
$$\hat{b} = r_n\,\frac{s_{2n}}{s_{1n}} \qquad\text{and}\qquad \hat{a} = \bar{y} - \hat{b}\,\bar{x}$$
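These plug-in estimates coincide with the ordinary least-squares fit of a straight line, which the following sketch (my own) confirms numerically:

```python
import numpy as np

rng = np.random.default_rng(7)
x, y = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.8], [0.8, 4.0]], size=300).T

r = np.corrcoef(x, y)[0, 1]
b_hat = r * y.std(ddof=1) / x.std(ddof=1)   # empirical slope
a_hat = y.mean() - b_hat * x.mean()         # empirical intercept

# The same values come out of an ordinary least-squares fit:
slope, intercept = np.polyfit(x, y, 1)
print(b_hat, slope)
print(a_hat, intercept)
```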
The bivariate normal distribution
When (X, Y ) is bivariate normal:
The distribution is described by the five parameters $\mu_1$, $\mu_2$, $\sigma_1^2$, $\sigma_2^2$ and $\rho$.

The marginal distributions of X and Y are normal: $X \sim N(\mu_1, \sigma_1^2)$, $Y \sim N(\mu_2, \sigma_2^2)$.

$\mathrm{corr}(X, Y) = \rho$ and $\mathrm{Cov}(X, Y) = \rho\,\sigma_1 \sigma_2$.

The conditional distributions are normal, with
$$E(Y \mid x) = \mu_2 + \rho\,\frac{\sigma_2}{\sigma_1}(x - \mu_1)$$
$$\mathrm{Var}(Y \mid x) = \sigma_2^2 (1 - \rho^2)$$
so the regression of Y on X = x is linear, with the coefficients of the best linear predictor:
$$b = \rho\,\frac{\sigma_2}{\sigma_1} = \frac{\sigma_{12}}{\sigma_1^2} \qquad\text{and}\qquad a = \mu_2 - b\,\mu_1$$
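A short sketch (my own, with arbitrary parameter values) that uses these conditional formulas to generate a bivariate normal pair by first drawing X from its marginal and then Y from its conditional distribution; the resulting sample reproduces the stated moments.

```python
import numpy as np

rng = np.random.default_rng(8)
mu1, mu2, s1, s2, rho = 0.0, 10.0, 2.0, 5.0, -0.7
n = 500_000

# Draw X from its marginal, then Y from its conditional distribution:
x = rng.normal(mu1, s1, size=n)
cond_mean = mu2 + rho * (s2 / s1) * (x - mu1)   # E(Y | x), linear in x
cond_sd = np.sqrt(s2**2 * (1 - rho**2))         # sqrt of Var(Y | x)
y = rng.normal(cond_mean, cond_sd)

# The pair (X, Y) has the stated bivariate normal moments:
print(np.corrcoef(x, y)[0, 1])   # approx rho = -0.7
print(y.mean(), y.std())         # approx mu2 = 10 and sigma_2 = 5
```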