Basics of Multivariate Normal
Introduction
Scope of multivariate analysis
Variance-covariance matrices
Multivariate distributions
Nonsingular Multivariate Normal Distribution
Characterization of multivariate normal via linear functions
Summary
References
Solutions to Exercises
2.1 INTRODUCTION
We begin the unit with an overview of the scope of multivariate analysis. We identify the
extensions of problems of univariate analysis to higher dimensions and also outline the
problems special to multivariate analysis which have no univariate equivalents. We
do this in section 2.2. In section 2.3 we study the properties of variance-covariance
matrices in detail. We also identify the class of variance-covariance matrices with the
class of nnd (non-negative definite) matrices. In section 2.4 we give several examples of discrete and continuous
multivariate distributions. We also define the concepts of marginal and conditional
distributions and the important concept of independence. We illustrate these
concepts with the help of examples. We also give a method of obtaining the density of a
transformed random vector. In section 2.5 we introduce the multivariate normal distribution
using its density function. We obtain the marginal and conditional distributions. In
section 2.6 we study the multivariate normal distribution defined via linear functions
and obtain several properties in an elegant manner. We also show that the two definitions
of multivariate normal coincide if the variance-covariance matrix is positive definite.
Objectives :
After completing this unit, you should be able to
Understand and apply the concepts of marginal and conditional distributions and
independence.
Understand and apply the properties of multivariate normal distribution.
Appreciate the beauty of the density-free approach to multivariate normal
distribution.
2.2 SCOPE OF MULTIVARIATE ANALYSIS

Consider a recruitment exercise in which each candidate appears for an aptitude test, a test on software technology and an interview. Let x1, x2 and x3 denote the scores of a candidate on the aptitude test, the software technology test and the interview respectively. The scores of all the candidates form a data matrix X, one row per candidate and one column per test / interview.

[Data matrix X of the candidates' scores.]
The vector x = (x1, x2, x3)^t is called a random vector and the matrix X is called a data matrix
related to x. The (i, j)th element xij of X denotes the score of the ith candidate in the
aptitude test or the test on software technology or the interview according as j = 1, 2, 3
respectively. Thus x51 = 49 is the score of the fifth candidate in the aptitude test. Each
row of X corresponds to the scores of a candidate and each column of X corresponds to
the scores in a particular test / interview. Univariate analysis deals with the data on a
single variable, say, those in the interview. Multivariate analysis deals with data on more
than one variable (possibly correlated) collected on the same subjects. One of the major
aims of statistics is to deal with variability in the data. By dealing with variability we mean
(i) determining the extent of variability, (ii) identifying the sources of variability and
(iii) either controlling the variability by taking suitable measures, or taking advantage of it to
select certain subjects in an optimal manner, or classifying the subjects or variables into
different groups depending upon the variability. When we deal with a single variable, the
variability is often quantified by the variance. When we deal with more than one variable,
the variability is often quantified by the matrix of variances and covariances. For example,
in the case of the data mentioned above, the variability is quantified by the matrix
S = ((s_ij)) (3 × 3) =
  s11       68.0579    66.0474
  68.0579   80.6816    65.0816
  66.0474   65.0816   169.7342
Let us turn our attention to the recruitment problem. If the recruitment is based on just the
interview scores, then candidates 6, 10 and 2 get selected. Again, if the recruitment is
based on the aptitude test, candidates 19, 6 and 8 get selected, whereas under the
criterion of software technology scores, candidates 19, 2 and 8 get selected. Notice
that it is not realistic or optimal to base the judgment on the scores of just one of the three,
as we would be ignoring useful information in the others. One possible way of using the
scores on all three is to take the average of the scores on the three (two tests and the
interview) for each candidate and select the three candidates with the top three average
scores. What should be the objective in choosing the criterion? We should choose a
criterion which distinguishes among the candidates in the best possible manner. When
we took the average, we took the linear combination l^t x where l^t = (1/3, 1/3, 1/3).
Why not look for the linear combination p^t x which, among all linear combinations,
distinguishes the candidates best, i.e., which has the largest variability (variance), and use
it as an index for the selection criterion? This is precisely what is done in obtaining the
first principal component. The first principal component p^t x is a normalized linear
combination of x which has the largest variance among all normalized linear combinations
of x, as the sketch below illustrates.
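As an illustration, the following minimal sketch (assuming a hypothetical 3 × 3 variance-covariance matrix S, since the numerical data matrix is not reproduced above) finds the normalized linear combination of largest variance as the eigenvector of S corresponding to its largest eigenvalue.

```python
import numpy as np

# Hypothetical variance-covariance matrix of the three scores
# (aptitude test, software technology test, interview).
S = np.array([
    [68.00, 68.06, 66.05],
    [68.06, 80.68, 65.08],
    [66.05, 65.08, 169.73],
])

# The first principal component direction p maximizes p' S p subject to p'p = 1;
# it is the eigenvector of S belonging to the largest eigenvalue.
eigenvalues, eigenvectors = np.linalg.eigh(S)   # eigh: S is symmetric
p = eigenvectors[:, np.argmax(eigenvalues)]     # direction of largest variance

print("first principal component direction:", p)
print("variance of p'x:", eigenvalues.max())
```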
Getting information on any variable is expensive in terms of time and/or money. We may
like to ask whether it is worthwhile conducting the test in software technology given that
the aptitude test and interview are being conducted. Put in other words, does the test in
software technology provide significant additional information in the presence of the
aptitude test and the interview? This is called the assessment of additional information.
Based on the data X on x, can we group the candidates into some well-defined classes?
This may be useful information if the company has jobs of different types: (a)
requiring high skills and (b) requiring medium skills but intensive hard work. There may
be a third group which is not of any use to the company. This is called the problem of
discrimination.
How well can we predict the interview score based on the two test scores? This problem
is called the problem of multiple regression and correlation. Suppose the interview
mentioned above is a technical interview. Assume that there is another, HR, interview and
let the score on the HR interview be denoted by x4. We may be interested in the association
between the tests and the interviews, i.e., between (x1, x2) and (x3, x4). Such a problem is
called the problem of canonical correlations. The problems mentioned above are some
problems specific to multivariate analysis which do not occur in univariate analysis. In
univariate analysis, we talk about inferences (estimation / testing) on the mean /
proportion / variance of a variable. These problems can be extended to the inferences on
mean vector / variance covariance matrix of a random vector. Univariate analysis of
variance has an analogue in Multivariate analysis of variance.
In the blocks 5, 6, 7 and 8 we study these topics in detail.
2.3 VARIANCE COVARIANCE MATRICES
As we saw in the previous section, the variance covariance matrix of a random vector is a
quantification of the joint variability of the components of the random vector. The
variance covariance matrices play a very important role in multivariate analysis. In this
section we formally define a random vector, its mean vector and variance-covariance
matrix. We shall obtain formulae for the mean vector and variance-covariance matrix of
linear compounds of a given random vector. We shall give a method of transforming
correlated random variables to uncorrelated random variables. We shall show that every
variance covariance matrix is nnd and that every nnd matrix is the variance-covariance
matrix of a random vector.
Let x = (x1, x2, ..., xp)^t be a random vector whose components are the random
variables x1, x2, ..., xp. The vector E(x) = (E(x1), E(x2), ..., E(xp))^t is called the mean vector of x, where E(xi) denotes
the mean of xi. Let σij denote the covariance between xi and xj. Then the matrix Σ =
((σij)) of order p × p is called the variance-covariance matrix of x, denoted by D(x). Let
y = (y1, y2, ..., yq)^t be another random vector and write δij = Cov(xi, yj), i =
1, ..., p, j = 1, ..., q. Then the p × q matrix ((δij)) is called the covariance matrix between x and y
and is denoted by Cov(x, y).
Clearly D(x) = Cov(x,x)
Example 1. Let x1, x2 and x3 be random variables with means 2.3, −4.1 and 1.5
respectively and variances 4, 9 and 16 respectively. Let ρij denote the correlation
coefficient between xi and xj, i, j = 1, 2, 3. Let ρ12 = 0.5, ρ13 = 0.3 and ρ23 = −0.4.
Write down the mean vector and the variance-covariance matrix of
x = (x1, x2, x3)^t.
Solution. The mean vector of x is

E(x) = (E(x1), E(x2), E(x3))^t = (2.3, −4.1, 1.5)^t.

Let Σ = ((σij)) denote the variance-covariance matrix of x. Let V(·) and Cov(·, ·) denote
the variance and covariance respectively. Then

σ12 = Cov(x1, x2) = ρ12 √(V(x1) V(x2)) = 0.5 × 2 × 3 = 3.0
σ13 = Cov(x1, x3) = ρ13 √(V(x1) V(x3)) = 0.3 × 2 × 4 = 2.4
σ23 = Cov(x2, x3) = ρ23 √(V(x2) V(x3)) = −0.4 × 3 × 4 = −4.8

Thus

Σ =
  4.0    3.0    2.4
  3.0    9.0   −4.8
  2.4   −4.8   16.0
Notice that Σ is symmetric in the above example. In fact, this is true for every variance-covariance matrix, because σij = Cov(xi, xj) = Cov(xj, xi) = σji for all i and j. Since the
leading principal minors of Σ are 4.0, 27.0 and 218.88 respectively, it follows from
Theorem 5 of Unit 1 that Σ is positive definite. We shall later show that every variance-covariance matrix is non-negative definite.
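A quick numerical check of Example 1 (a minimal sketch using NumPy; the numbers are exactly those stated in the example):

```python
import numpy as np

sd = np.array([2.0, 3.0, 4.0])            # standard deviations (variances 4, 9, 16)
R = np.array([[1.0, 0.5, 0.3],
              [0.5, 1.0, -0.4],
              [0.3, -0.4, 1.0]])          # correlation matrix

Sigma = np.outer(sd, sd) * R              # sigma_ij = rho_ij * sd_i * sd_j
print(Sigma)

# Leading principal minors; all positive <=> Sigma is positive definite.
for k in range(1, 4):
    print(k, np.linalg.det(Sigma[:k, :k]))
```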
Let x be a random vector and consider l^t x, where l is a fixed vector (i.e., the components of l
are not random variables). We shall now find the mean and variance of l^t x.

Theorem 1. Let x (p × 1) be a random vector with E(x) = μ and variance-covariance matrix
Σ. Let l be a fixed vector and let l^t x = l1 x1 + l2 x2 + ... + lp xp be a linear
combination of the components of x. Then E(l^t x) = l^t E(x) = l^t μ, V(l^t x) = l^t Σ l and
Cov(l^t x, m^t x) = l^t Σ m, where m is a fixed vector.

Proof: E(l^t x) = E(l1 x1 + l2 x2 + ... + lp xp)
= l1 E(x1) + l2 E(x2) + ... + lp E(xp)
= l^t E(x) = l^t μ.

V(l^t x) = V(l1 x1 + l2 x2 + ... + lp xp)
= Cov(l1 x1 + ... + lp xp, l1 x1 + ... + lp xp)
= ∑_{i=1}^{p} ∑_{j=1}^{p} l_i l_j Cov(x_i, x_j) = ∑_{i=1}^{p} ∑_{j=1}^{p} l_i l_j σ_ij = l^t Σ l.

Similarly,

Cov(l^t x, m^t x) = ∑_{i=1}^{p} ∑_{j=1}^{p} l_i m_j σ_ij = l^t Σ m.
Example 2. Let x be a random vector with the variance-covariance matrix Σ of Example 1
and let l^t = (1/3)(1, 1, 1), the averaging vector used above. Since the row sums of Σ are
9.4, 7.2 and 13.6, we get

V(l^t x) = l^t Σ l = (1/9)(1  1  1) Σ (1  1  1)^t = (1/9)(9.4 + 7.2 + 13.6) = 30.2/9 ≈ 3.36.

Similarly, for any fixed vector m, Cov(l^t x, m^t x) = l^t Σ m.

E1. Let x = (x1, x2, x3, x4)^t be a random vector with mean vector μ and variance-covariance matrix

Σ =
  1.0  0.2  0.2  0.2
  0.2  1.0  0.2  0.2
  0.2  0.2  1.0  0.2
  0.2  0.2  0.2  1.0

Find the mean and variance of l^t x and Cov(l^t x, m^t x) where l^t = (1  1  1  1) and
m^t = (1  1  1  1).
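A sketch of the E1 computation in NumPy (the mean vector below is a placeholder assumption, since the one given in the exercise is not reproduced above):

```python
import numpy as np

mu = np.array([1.0, 2.0, 3.0, 4.0])             # placeholder mean vector (assumed)
Sigma = np.full((4, 4), 0.2) + 0.8 * np.eye(4)  # 1 on the diagonal, 0.2 elsewhere

l = np.array([1.0, 1.0, 1.0, 1.0])
m = np.array([1.0, 1.0, 1.0, 1.0])

mean_lx = l @ mu            # E(l'x) = l' mu
var_lx = l @ Sigma @ l      # V(l'x) = l' Sigma l
cov_lx_mx = l @ Sigma @ m   # Cov(l'x, m'x) = l' Sigma m

print(mean_lx, var_lx, cov_lx_mx)
```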
We know that, in the univariate case, V(x) = E(x − E(x))² and Cov(x, y) = E((x − E(x))(y − E(y))). Is there a
multivariate analogue of the above? Notice that
D(x) = Σ = ((σij)),
where σij = Cov(xi, xj) = E((xi − E(xi))(xj − E(xj))).
Thus

D(x) = E[(x − E(x))(x − E(x))^t].   (2.3.1)

Similarly,

Cov(x, y) = E[(x − E(x))(y − E(y))^t].   (2.3.2)

Theorem 2. Let x (p × 1) and y (q × 1) be random vectors with E(x) = μ, D(x) = Σ and
Cov(x, y) = Δ. Let B (r × p) and C (s × q) be fixed matrices. Then
(a) E(Bx) = Bμ
(b) D(Bx) = BΣB^t
(c) Cov(Bx, Cy) = BΔC^t
Proof: (a) Write B = (b1 : b2 : ... : br)^t, where bi^t is the ith row of B, i = 1, ..., r.
Now

E(Bx) = E((b1^t x, b2^t x, ..., br^t x)^t) = (b1^t E(x), ..., br^t E(x))^t = B E(x) = Bμ.

(b) The ith component of Bx is bi^t x. Thus the (i, j)th element of D(Bx) is
Cov(bi^t x, bj^t x) = bi^t Σ bj for i, j = 1, ..., r, from Theorem 1. Hence D(Bx) = BΣB^t.

(c) Write C = (c1 : c2 : ... : cs)^t, where cj^t is the jth row of C, j = 1, ..., s. The (i, j)th
element of Cov(Bx, Cy) is Cov(bi^t x, cj^t y) = bi^t Δ cj. Hence Cov(Bx, Cy) = BΔC^t.
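The identities of Theorem 2 are easy to check numerically. The sketch below (with an illustrative, assumed B) compares the sample variance-covariance matrix of Bx with BΣB^t:

```python
import numpy as np

rng = np.random.default_rng(0)

mu = np.array([2.3, -4.1, 1.5])
Sigma = np.array([[4.0, 3.0, 2.4],
                  [3.0, 9.0, -4.8],
                  [2.4, -4.8, 16.0]])
B = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, -1.0]])          # a fixed 2 x 3 matrix (assumed for illustration)

x = rng.multivariate_normal(mu, Sigma, size=200_000)   # draws of x (one row each)
Bx = x @ B.T

print("sample D(Bx):\n", np.cov(Bx, rowvar=False))
print("B Sigma B^t:\n", B @ Sigma @ B.T)
```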
Example 3. Let Σ be the variance-covariance matrix of a random vector x of order p × 1.
Let rij denote the correlation coefficient between xi and xj, i, j = 1, ..., p.
Write R = ((rij)); R is called the correlation matrix of x. Write

T = diag(√σ11, √σ22, ..., √σpp).

Thus T is a diagonal matrix whose ith diagonal element is the standard deviation of xi,
i = 1, ..., p. Assume that σ11, ..., σpp are strictly positive. Show that R = T^{-1} Σ T^{-1}.

Solution. We have

rij = Cov(xi, xj) / √(V(xi) V(xj)) = σij / √(σii σjj).

Also T^{-1} = diag(σ11^{-1/2}, ..., σpp^{-1/2}). Hence the (i, j)th element of T^{-1} Σ T^{-1} is

σii^{-1/2} σij σjj^{-1/2} = σij / √(σii σjj) = rij.

Thus R = T^{-1} Σ T^{-1}.
This establishes a relationship between the variance-covariance matrix and the correlation
matrix of a random vector.
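A sketch of this relationship in NumPy, using the Σ of Example 1:

```python
import numpy as np

Sigma = np.array([[4.0, 3.0, 2.4],
                  [3.0, 9.0, -4.8],
                  [2.4, -4.8, 16.0]])

T_inv = np.diag(1.0 / np.sqrt(np.diag(Sigma)))  # T^{-1} = diag(1/sqrt(sigma_ii))
R = T_inv @ Sigma @ T_inv                       # correlation matrix

print(R)   # off-diagonal entries are 0.5, 0.3 and -0.4
```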
E2. For the random vector x of Example 1, with variance-covariance matrix

Σ =
  4.0    3.0    2.4
  3.0    9.0   −4.8
  2.4   −4.8   16.0

obtain the correlation matrix R of x.
We are ready now to show the equality of the class of all nnd matrices with the class of
all variance-covariance matrices as promised in the beginning of this section.
Theorem 3. (a) Every variance-covariance matrix is nnd. (b) Every nnd matrix is the
variance-covariance matrix of some random vector.

Proof: (a) Let Σ be the variance-covariance matrix of a random vector x. Then for each
fixed l, l^t Σ l = V(l^t x) ≥ 0 (since the variance of a random variable is nonnegative). Hence
Σ is nnd.
(b) Let Σ (p × p) be an nnd matrix. Then by Theorem 4(b) of Unit 1, there exists a matrix C of
order p × r for some positive integer r such that Σ = CC^t. Let x1, x2, ..., xr be independent
random variables each with variance 1. Write x^t = (x1, ..., xr). Then D(x) = I (the
identity matrix of order r × r). Write y = Cx. Then by Theorem 2, D(y) = C I C^t = CC^t
= Σ.
Corollary: The variance-covariance matrix Σ of a random vector x is positive semi-definite (i.e., nnd but not positive definite) if and only if there exists a fixed non-null vector l such that l^t x is a constant with
probability 1.

Proof:
Σ is positive semi-definite
⟺ there exists a fixed l ≠ 0 such that l^t Σ l = 0
⟺ there exists a fixed l ≠ 0 such that V(l^t x) = 0
⟺ there exists a fixed vector l ≠ 0 such that l^t x is a constant with
probability 1.
Before proceeding further, let us recall that if Σ is pd, then there exists a nonsingular
matrix B such that Σ = BB^t. Write y = B^{-1} x. Then D(y) = B^{-1} Σ (B^{-1})^t = B^{-1} B B^t (B^{-1})^t = I.
Hence the components of y are uncorrelated, each with variance 1. Notice that in Method
1, P (together with the square roots of the corresponding eigenvalues) provides a choice for the matrix B.
In unit 1, we gave a method of computing a lower triangular square root of a pd matrix.
Before giving the method 2, we shall give another algorithm of obtaining a lower
triangular square root of a pd matrix. This algorithm also helps us in getting the inverse
of the triangular square root as a bonus whereby we can write down y immediately.
Let Σ be a pd matrix of order p × p.

Algorithm (Steps 1–5): a sweep-out of the augmented array (Σ : I) is carried out as in
Gaussian elimination, each pivotal row being divided by the square root of its pivot; this
produces a lower triangular matrix B with Σ = BB^t together with B^{-1}. The computation
is illustrated in Example 4 and Example 10 below.

Observe that, with B^{-1} = ((bij)) lower triangular, the components of y = B^{-1} x are

y1 = b11 x1
y2 = b21 x1 + b22 x2
...
yi = bi1 x1 + ... + bii xi
...
yp = bp1 x1 + ... + bpp xp

Thus the first component of y is a scalar multiple of the first component of x. The second
component of y is a linear combination of the first two components of x, and so on.
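A sketch of this decorrelation step using NumPy's Cholesky factorization (which returns exactly such a lower triangular B with Σ = BB^t):

```python
import numpy as np

Sigma = np.array([[2.0, 1.0],
                  [1.0, 2.0]])

B = np.linalg.cholesky(Sigma)      # lower triangular, Sigma = B @ B.T
B_inv = np.linalg.inv(B)           # also lower triangular

# D(B^{-1} x) = B^{-1} Sigma (B^{-1})^t should be the identity.
print(B)
print(B_inv)
print(B_inv @ Sigma @ B_inv.T)     # ~ identity matrix
```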
We now illustrate the above methods with an example.

Example 4. Let x be a random vector with

D(x) = Σ =
  2  1
  1  2

(a) Make an orthogonal transformation η = P^t x (where P is an orthogonal matrix) such
that the components of η are uncorrelated. Obtain the variances of η1 and η2.
(b) Obtain a lower triangular matrix B and its inverse such that the components of
y = B^{-1} x are uncorrelated, each with variance 1.

Solution. (a) The eigenvalues of Σ are 3 and 1, with corresponding normalized eigenvectors

p1 = (1/√2)(1, 1)^t and p2 = (1/√2)(1, −1)^t.

Thus

Σ = 3 p1 p1^t + 1·p2 p2^t = (p1 : p2) [ 3, 0 ; 0, 1 ] (p1 : p2)^t = P [ 3, 0 ; 0, 1 ] P^t,

where P = (p1 : p2) = (1/√2) [ 1, 1 ; 1, −1 ] and P^t P = I. The transformation is η = P^t x, so that

η1 = (1/√2)(x1 + x2) and η2 = (1/√2)(x1 − x2),

and D(η) = P^t Σ P = [ 3, 0 ; 0, 1 ]. Hence V(η1) = 3 and V(η2) = 1.

(b) We apply the algorithm to the augmented array (Σ : I):

  2   1       1       0        .. (1)
  1   2       0       1        .. (2)
  --------------------------------------
  √2  1/√2    1/√2    0        .. (3) = (1)/√2
  0   3/2    −1/2     1        .. (4) = (2) − (1)/2
  --------------------------------------
  √2  1/√2    1/√2    0        .. (5) = (3)
  0   √(3/2) −1/√6    √(2/3)   .. (6) = (4)/√(3/2)

Thus

B =
  √2      0
  1/√2    √(3/2)

and

B^{-1} =
  1/√2     0
  −1/√6    √(2/3)

so that Σ = BB^t and, with y = B^{-1} x,

y1 = (1/√2) x1 and y2 = −(1/√6) x1 + √(2/3) x2

are uncorrelated, each with variance 1.
E3. Consider a random vector x = (x1, x2, x3, x4, x5)^t with the given mean vector E(x) and
variance-covariance matrix D(x). Write y = (x1, x2, x3)^t and z = (x4, x5)^t, so that
x = (y^t : z^t)^t, and partition E(x) and D(x) conformably. Let B and C be the given fixed
matrices and write u = By and v = Cz.
Compute the following:
(a) E(u), D(u)
(b) E(v), D(v) and
(c) Cov(u, v).
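The computations asked for in E3 all follow from Theorem 2. A sketch with placeholder values (the actual μ, Σ, B and C of the exercise are assumed here, since they are not reproduced above):

```python
import numpy as np

mu = np.array([1.0, 2.0, 0.0, -1.0, 3.0])     # assumed E(x)
Sigma = np.eye(5) + 0.3 * np.ones((5, 5))     # assumed D(x), positive definite

mu_y, mu_z = mu[:3], mu[3:]
S_yy, S_yz = Sigma[:3, :3], Sigma[:3, 3:]
S_zz = Sigma[3:, 3:]

B = np.array([[1.0, -1.0, 0.0],
              [0.0, 1.0, 1.0]])               # assumed fixed 2 x 3 matrix
C = np.array([[1.0, 0.0],
              [1.0, 1.0]])                    # assumed fixed 2 x 2 matrix

E_u, D_u = B @ mu_y, B @ S_yy @ B.T           # E(u) = B mu_y, D(u) = B S_yy B^t
E_v, D_v = C @ mu_z, C @ S_zz @ C.T
Cov_uv = B @ S_yz @ C.T                       # Cov(u, v) = B Cov(y, z) C^t

print(E_u, D_u, E_v, D_v, Cov_uv, sep="\n")
```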
E4. Let x be a random vector with the given mean vector E(x) and variance-covariance
matrix D(x). Notice that each row sum of D(x) is 0. Obtain a linear combination l^t x which is
constant with probability 1. What is the value of this constant?

E5. For a random vector x with the given variance-covariance matrix,
obtain a lower triangular matrix B and its inverse such that the components of y =
B^{-1} x are uncorrelated, each with variance 1.
2.4 MULTIVARIATE DISTRIBUTIONS

In this section we study the joint behaviour of several random variables x1, x2, ..., xp. We define the
concept of conditional distribution. We briefly study the concept of independence of random
variables and its relation to uncorrelatedness. First, let us consider a few examples.
Example 6. A college has 2 specialists in long distance running, 4 specialists in tennis
and 6 top level cricketers among its students. The college plans to send 3 sportsmen from
the above for participating in the university sports and games. The three sportsmen are
selected randomly from among the above 12. Let x1 and x2 denote respectively the
number of long distance specialists and the number of tennis specialists chosen. The joint
probability mass function of x1 and x2 is given by p(i, j) = P{x1 = i, x2 = j} for i = 0, 1, 2 and
j = 0, 1, 2, 3. Obtain the joint probability mass function of x1 and x2.
Solution: The number of ways of choosing 3 sportsmen out of 12 is C(12, 3) = 220. Hence

p(0, 0) = P{x1 = 0, x2 = 0} = C(6, 3)/C(12, 3) = 20/220
p(0, 1) = P{x1 = 0, x2 = 1} = C(4, 1) C(6, 2)/C(12, 3) = 60/220

Similarly,

p(0, 2) = C(4, 2) C(6, 1)/C(12, 3) = 36/220
p(0, 3) = C(4, 3)/C(12, 3) = 4/220
p(1, 0) = C(2, 1) C(6, 2)/C(12, 3) = 30/220
p(1, 1) = C(2, 1) C(4, 1) C(6, 1)/C(12, 3) = 48/220
p(1, 2) = C(2, 1) C(4, 2)/C(12, 3) = 12/220
p(1, 3) = 0, since the number of persons chosen is only 3
p(2, 0) = C(2, 2) C(6, 1)/C(12, 3) = 6/220
p(2, 1) = C(2, 2) C(4, 1)/C(12, 3) = 4/220
p(2, 2) = 0
p(2, 3) = 0
The values taken by x1 and x2 and the corresponding probabilities constitute the joint
distribution of x1 and x2 and can be expressed in tabular form as follows:

Table 1
Joint distribution of x1 and x2

Value of x1 \ x2 :      0         1         2        3     |  Row sum
0                    20/220    60/220    36/220    4/220   |  120/220
1                    30/220    48/220    12/220      0     |   90/220
2                     6/220     4/220       0        0     |   10/220
Column sum           56/220   112/220    48/220    4/220   |     1
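The whole table can be generated directly from the counting argument above; a short sketch:

```python
from math import comb
from fractions import Fraction

total = comb(12, 3)                       # 220 ways of choosing 3 out of 12

def p(i, j):
    # i long-distance runners (out of 2), j tennis players (out of 4),
    # the remaining 3 - i - j cricketers (out of 6).
    if i + j > 3:
        return Fraction(0)
    return Fraction(comb(2, i) * comb(4, j) * comb(6, 3 - i - j), total)

for i in range(3):
    print([p(i, j) for j in range(4)])    # rows of Table 1
```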
From the joint distribution table, it is easy to write down the distributions of x1 and x2, which
we call the marginal distributions of x1 and x2 respectively. For instance,

P{x1 = 0} = P{x1 = 0, x2 = 0} + P{x1 = 0, x2 = 1} + P{x1 = 0, x2 = 2} + P{x1 = 0, x2 = 3}
          = 20/220 + 60/220 + 36/220 + 4/220 = 120/220.

Thus the marginal distribution of x1 is given by the row sums in Table 1:

Table 2
Marginal distribution of x1

Value            0          1         2
Probability   120/220    90/220    10/220

Similarly, the marginal distribution of x2 is obtained using the column sums in Table 1.
Thus the marginal distribution of x2 is

Table 3
Marginal distribution of x2

Value            0          1          2         3
Probability    56/220    112/220    48/220     4/220
Suppose we are given the additional information that no long distance running specialist is
selected, or in other words, we know that x1 = 0. Then what are the probabilities for x2 =
0, 1, 2, 3 given this additional information? Notice that we are looking for the
conditional probabilities

P{x2 = j | x1 = 0} = P{x1 = 0, x2 = j} / P{x1 = 0},  j = 0, 1, 2, 3.

Thus

P{x2 = 0 | x1 = 0} = (20/220)/(120/220) = 20/120 = 1/6
P{x2 = 1 | x1 = 0} = (60/220)/(120/220) = 60/120 = 1/2
P{x2 = 2 | x1 = 0} = (36/220)/(120/220) = 36/120 = 3/10
P{x2 = 3 | x1 = 0} = (4/220)/(120/220) = 4/120 = 1/30

These probabilities constitute the conditional distribution of x2 given x1 = 0:

Table 4
Conditional distribution of x2 given x1 = 0

Value            0       1       2       3
Probability     1/6     1/2     3/10    1/30
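Continuing the sketch given earlier, the conditional distribution of x2 given x1 = 0 is obtained by normalizing the corresponding row of the joint table:

```python
from fractions import Fraction
from math import comb

total = comb(12, 3)
row0 = [Fraction(comb(4, j) * comb(6, 3 - j), total) for j in range(4)]  # P{x1=0, x2=j}
p_x1_0 = sum(row0)                                                       # P{x1=0} = 120/220
print([q / p_x1_0 for q in row0])                                        # 1/6, 1/2, 3/10, 1/30
```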
E6.
E7.
In Example 6, let x3 denote the number of cricketers chosen and let pijk denote P{x1 = i, x2 = j, x3 = k}. Obtain pijk for i =
0, 1, 2; j = 0, 1, 2, 3 and k = 0, 1, 2, ..., 6. The values of x1, x2 and x3 and the corresponding pijk
constitute the joint distribution of x1, x2 and x3.
E8.

The random vector x = (x1, x2, x3)^t considered above, whose components take only finitely
(or countably) many values, is called a discrete random vector, and the distribution of x (the joint
distribution of x1, x2 and x3) in such a case is called a discrete multivariate distribution.
On the other hand, we say that x1, ..., xp are jointly continuous (x = (x1, ..., xp)^t is
continuous) if there exists a function f(u1, ..., up), defined for all u1, ..., up, having the
property that for every set A of p-tuples

P{x ∈ A} = ∫ ... ∫_A f(u1, ..., up) du1 ... dup.

The function f is called the (joint) density of x. In particular,

P{x1 ≤ a1, ..., xp ≤ ap} = ∫_{−∞}^{a1} ... ∫_{−∞}^{ap} f(u1, ..., up) du1 ... dup,

and

P{a1 < x1 ≤ a1 + Δa1, ..., ap < xp ≤ ap + Δap} ≈ f(a1, ..., ap) Δa1 ... Δap

when the Δai, i = 1, ..., p, are small and f is continuous. Thus f(a1, ..., ap) is a measure of the
chance that the random vector x is in a small neighborhood of (a1, ..., ap).

Let fx(u1, u2, ..., up) be the density function of x.
Then the marginal density of xi, denoted by f_{xi}(ui), is defined as

f_{xi}(ui) = ∫_{−∞}^{∞} ... ∫_{−∞}^{∞} fx(u1, ..., up) du1 ... du_{i−1} du_{i+1} ... dup,

the integration being over all the variables other than ui.
More generally, partition x as x = (x1^t : x2^t)^t, where x1 = (x1, ..., xr)^t and x2 = (xr+1, ..., xp)^t.
The conditional density of x1 given x2 = u2 is defined as

f_{x1 | x2 = u2}(u1 | u2) = fx(u1, ..., up) / f_{x2}(u2),

where f_{x2} is the (joint) marginal density of x2.

Why is the conditional density defined thus? To see this, multiply both sides by du1 ... dur and
write the right hand side as

[fx(u1, ..., up) du1 ... dup] / [f_{x2}(ur+1, ..., up) dur+1 ... dup].

Then

f_{x1 | x2 = u2}(u1 | u2) du1 ... dur ≈ P{u1 < x1 ≤ u1 + du1, ..., up < xp ≤ up + dup} / P{ur+1 < xr+1 ≤ ur+1 + dur+1, ..., up < xp ≤ up + dup},

which is approximately the conditional probability that x1, ..., xr lie in a small neighborhood of
(u1, ..., ur), given that xr+1, ..., xp lie in a small neighborhood of (ur+1, ..., up).
Example 7. Let x = (x1, x2)^t have density

fx(u1, u2) = c (2 − u1 − u2) for 0 < u1 < 1, 0 < u2 < 1, and 0 otherwise,

where c is a constant.
(a) Determine c.
(b) Obtain the marginal density of x1.
(c) Obtain the marginal density of x2 and the conditional density of x1 given x2 = u2.
(d) Compute P{x1 > 1/4 | x2 = u2}.

Solution. (a) Since fx is a density,

1 = ∫∫ fx(u1, u2) du1 du2 = ∫_0^1 ∫_0^1 c (2 − u1 − u2) du1 du2 = c (2 − 1/2 − 1/2) = c,

so c = 1.

(b) For 0 ≤ u1 ≤ 1, the marginal density of x1 is

f_{x1}(u1) = ∫_0^1 (2 − u1 − u2) du2 = 2 − u1 − 1/2 = 3/2 − u1.

Thus f_{x1}(u1) = 3/2 − u1 for 0 ≤ u1 ≤ 1, and 0 elsewhere.

(c) By symmetry, the marginal density of x2 is f_{x2}(u2) = 3/2 − u2 for 0 ≤ u2 ≤ 1, and 0 elsewhere. So
the conditional density of x1 given x2 = u2 is

f_{x1 | x2 = u2}(u1 | u2) = fx(u1, u2)/f_{x2}(u2) = (2 − u1 − u2)/(3/2 − u2), where 0 < u1 < 1, 0 < u2 < 1.

Since fx(u1, u2) = 0 whenever u1 ∉ (0, 1) or u2 ∉ (0, 1), we have

f_{x1 | x2 = u2}(u1 | u2) = (2 − u1 − u2)/(3/2 − u2) whenever 0 < u1 < 1, 0 < u2 < 1, and 0 otherwise.

(d) For 0 < u2 < 1,

P{x1 > 1/4 | x2 = u2} = ∫_{1/4}^{1} f_{x1 | x2 = u2}(u1 | u2) du1
= 1/(3/2 − u2) ∫_{1/4}^{1} (2 − u1 − u2) du1
= 1/(3/2 − u2) [ (2 − u2) u1 − u1²/2 ]_{1/4}^{1}
= 1/(3/2 − u2) [ (3/4)(2 − u2) − 15/32 ]
= (33/32 − (3/4) u2) / ((3 − 2u2)/2)
= (33 − 24 u2) / (16 (3 − 2u2)).
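A quick numerical check of parts (a), (b) and (d) of Example 7 (a sketch using SciPy's numerical integration):

```python
from scipy import integrate

f = lambda u1, u2: 2.0 - u1 - u2          # joint density on the unit square (c = 1)

# (a) total mass should be 1
mass, _ = integrate.dblquad(f, 0, 1, 0, 1)
print(mass)

# (b) marginal density of x1 at u1 = 0.3 should be 3/2 - 0.3 = 1.2
m1, _ = integrate.quad(lambda u2: f(0.3, u2), 0, 1)
print(m1)

# (d) P{x1 > 1/4 | x2 = 0.5} should be (33 - 24*0.5) / (16*(3 - 2*0.5)) = 21/32
u2 = 0.5
num, _ = integrate.quad(lambda u1: f(u1, u2), 0.25, 1)
print(num / (1.5 - u2))
```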
Example 8. Let x = (x1, x2)^t have density

fx(u1, u2) = 2 e^{−u1} e^{−2u2} for 0 < u1 < ∞, 0 < u2 < ∞, and 0 otherwise.

(a) Obtain the marginal density of x2.
(b) Obtain the conditional density of x1 given x2 = u2.

Solution. (a) For 0 < u2 < ∞,

f_{x2}(u2) = ∫_0^∞ 2 e^{−u1} e^{−2u2} du1 = 2 e^{−2u2} [−e^{−u1}]_0^∞ = 2 e^{−2u2},

which is an exponential distribution with parameter 2.

(b) f_{x1 | x2 = u2}(u1 | u2) = fx(u1, u2)/f_{x2}(u2) = 2 e^{−u1} e^{−2u2} / (2 e^{−2u2}) = e^{−u1} whenever 0 < u1 < ∞, 0 < u2 < ∞.

Thus

f_{x1 | x2 = u2}(u1 | u2) = e^{−u1} if u1 > 0, u2 > 0, and 0 otherwise.

It can be easily checked that this is the same as the marginal distribution of x1. Thus in
this example the joint density of x1 and x2 is the product of the marginal densities of x1
and x2.
Let x = (x1^t : x2^t)^t, where x1 is of order r × 1 and x2 is of order (p − r) × 1, have density fx(u),
where u = (u1^t : u2^t)^t is partitioned according to the partition of x. We say that x1 and x2 are
independent if

P{x1 ∈ A1, x2 ∈ A2} = P{x1 ∈ A1} P{x2 ∈ A2} for all sets A1 and A2.

It can be shown (the proof is beyond the scope of the present notes) that x1 and x2 are
independent if and only if the joint density of x1 and x2 (i.e., the density of x) is equal to
the product of the marginal densities of x1 and x2, or in other words

fx(u) = f_{x1}(u1) · f_{x2}(u2).

In fact, if we can factorize fx(u) = g1(u1) · g2(u2), where gi(ui) involves only ui, i = 1, 2, then
x1 and x2 are independent. Further, c1 g1(u1) and c2 g2(u2) are the marginal densities of x1
and x2 respectively, where c1 and c2 are suitable constants.
Thus x1 and x2 of example 8 are independent.
However, x1 and x2 of example 7 are not independent.
We give below a relationship between uncorrelatedness and independence.
Theorem 4. Let x1 and x2 be independent. Then Cov(x1, x2) = 0.
Proof: Let f_{x1}(u1) and f_{x2}(u2) be the densities of x1 and x2. Then the joint density of
x1 and x2 (i.e., the density of x = (x1^t : x2^t)^t) is fx(u) = f_{x1}(u1) · f_{x2}(u2), where u = (u1^t : u2^t)^t.
Let u1 = (u1, ..., ur)^t and u2 = (ur+1, ..., up)^t. Then

Cov(x1, x2) = E[(x1 − E(x1))(x2 − E(x2))^t]
= ∫ ... ∫ (u1 − E(x1))(u2 − E(x2))^t f_{x1}(u1) f_{x2}(u2) du1 ... dup
= [ ∫ ... ∫ (u1 − E(x1)) f_{x1}(u1) du1 ... dur ] [ ∫ ... ∫ (u2 − E(x2))^t f_{x2}(u2) dur+1 ... dup ] = 0,

since the first integral in the previous expression splits into the product of the two later
integrals, each of which is zero.
However the converse is not true as shown through the following exercise.
E9. Consider a random vector x = (x1, x2)^t with density

fx(u1, u2) = c u1 (2 − u1 − u2) for 0 < u1 < 1, 0 < u2 < 1, and 0 otherwise,

where c is a constant.
(a)
(b)
(c)
(d)
Let x = (x1, ..., xp)^t be a random vector with density fx(u), where u = (u1, ..., up)^t. Earlier we
saw that even if x1, ..., xp are correlated, with a positive definite variance-covariance
matrix, we can find a linear transformation y = Bx with B nonsingular such that the
components of y are uncorrelated. It is often of interest to find the distribution of
y = g(x) [i.e., yi = gi(x1, ..., xp), a function of x1, ..., xp, for i = 1, ..., p] given the distribution
of x. How can we find this? We need the following assumptions:

(i) The equations vi = gi(u1, ..., up), i = 1, ..., p, can be uniquely solved for u1, ..., up in terms
of v1, ..., vp, say ui = hi(v1, ..., vp), i = 1, ..., p.

(ii) The functions gi have continuous partial derivatives and the Jacobian

J(u1, ..., up) = det ((∂gi/∂uj)) ≠ 0 at all points (u1, ..., up).

Under these two conditions it can be shown (the proof is beyond the scope of these notes) that
the density of y is given by

f_y(v) = fx(u) |J(u1, ..., up)|^{−1},

where |J(u1, ..., up)| is the absolute value of J(u1, ..., up) and u = h(v), i.e., ui = hi(v), i = 1, ..., p.
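A small numerical sanity check of this formula for a linear map y = Bx (a sketch; the density and B below are chosen purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

B = np.array([[2.0, 1.0],
              [0.0, 3.0]])                 # nonsingular 2 x 2 map, y = Bx
jac = abs(np.linalg.det(B))                # |J| for a linear transformation

# x uniform on the unit square, so f_x = 1 there; hence f_y(v) = 1/|J| on B([0,1]^2).
x = rng.uniform(size=(500_000, 2))
y = x @ B.T

# Empirical check: the fraction of y-points in a small box around v0, divided by the
# box area, should be close to f_y(v0) = 1/|J| when v0 = B u0 lies inside the image.
u0 = np.array([0.4, 0.6])
v0 = B @ u0
h = 0.05
inside = np.all(np.abs(y - v0) < h / 2, axis=1)
print(inside.mean() / h**2, 1.0 / jac)
```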
2.5 NONSINGULAR MULTIVARIATE NORMAL DISTRIBUTION

Let us consider first the bivariate normal distribution. This is a special case of the multivariate
normal distribution which we shall study in detail in what follows.
Example 9 (Bivariate normal distribution). Let x = (x1, x2)^t have density

fx(u1, u2) = 1/(2π σ_{x1} σ_{x2} √(1 − ρ²)) exp{ −1/(2(1 − ρ²)) [ (u1 − μ_{x1})²/σ_{x1}² − 2ρ (u1 − μ_{x1})(u2 − μ_{x2})/(σ_{x1} σ_{x2}) + (u2 − μ_{x2})²/σ_{x2}² ] },

−∞ < u1 < ∞, −∞ < u2 < ∞, where μ_{x1}, μ_{x2} are the means, σ_{x1}², σ_{x2}² the variances of x1 and x2, and ρ = ρ_{x1 x2} is the correlation coefficient between x1 and x2.

Write σ11 = σ_{x1}², σ22 = σ_{x2}², σ12 = ρ σ_{x1} σ_{x2}, and

Σ = [ σ11, σ12 ; σ12, σ22 ],  μ_x = (μ_{x1}, μ_{x2})^t,  u = (u1, u2)^t.

Then

|Σ| = σ11 σ22 − σ12² = σ_{x1}² σ_{x2}² (1 − ρ²),

and

Σ^{-1} = (1/|Σ|) [ σ22, −σ12 ; −σ12, σ11 ].

A direct computation shows that the expression in the exponent of fx is

(u1 − μ_{x1} : u2 − μ_{x2}) Σ^{-1} (u1 − μ_{x1} : u2 − μ_{x2})^t = (u − μ_x)^t Σ^{-1} (u − μ_x),

so that the density can be written as

fx(u) = 1/(2π |Σ|^{1/2}) exp{ −(1/2) (u − μ_x)^t Σ^{-1} (u − μ_x) }.

Thus x ~ N2(μ_x, Σ), the bivariate normal distribution with mean vector μ_x and variance-covariance matrix Σ.
Writing Σ = BB^t with B = [ b11, 0 ; b21, b22 ] lower triangular, we have

b11 = √σ11,  b21 = σ12/√σ11  and  b22 = √(σ22 − σ12²/σ11).

Now

B^{-1} = [ 1/b11, 0 ; −b21/(b11 b22), 1/b22 ].

Consider the transformation y = B^{-1} x, i.e.,

y1 = (1/b11) x1,  y2 = (1/b22)(x2 − b21 b11^{-1} x1),

and correspondingly v = B^{-1} u, i.e., v1 = (1/b11) u1, v2 = (1/b22)(u2 − b21 b11^{-1} u1). Write
μ_y = B^{-1} μ_x. The Jacobian of the transformation is

J(u1, u2) = | ∂v1/∂u1, ∂v1/∂u2 ; ∂v2/∂u1, ∂v2/∂u2 | = det(B^{-1}) = 1/(b11 b22),

and b11 b22 = √(σ11 σ22 − σ12²) = |Σ|^{1/2}. Also Σ^{-1} = (BB^t)^{-1} = (B^t)^{-1} B^{-1} = (B^{-1})^t B^{-1}. Hence

(u − μ_x)^t Σ^{-1} (u − μ_x) = (u − μ_x)^t (B^{-1})^t B^{-1} (u − μ_x) = (v − μ_y)^t (v − μ_y).

Therefore the density of y is

f_y(v1, v2) = fx(u1, u2) |J(u1, u2)|^{-1} = (1/2π) e^{−(1/2)(v − μ_y)^t (v − μ_y)}
= [ (1/√(2π)) e^{−(1/2)(v1 − μ_{y1})²} ] [ (1/√(2π)) e^{−(1/2)(v2 − μ_{y2})²} ].

The range of values of y1 and y2 is clearly the same as that of x1 and x2, namely
−∞ < y1 < ∞, −∞ < y2 < ∞.
Hence y ~ N2(μ_y, I), i.e., y1 and y2 are independent, each normal with variance 1.
The joint density of y1 and y2 is

f_y(v1, v2) = (1/2π) e^{−(1/2)[(v1 − μ_{y1})² + (v2 − μ_{y2})²]}.

Hence the marginal density of y1 is

∫ (1/2π) e^{−(1/2)(v1 − μ_{y1})²} e^{−(1/2)(v2 − μ_{y2})²} dv2 = (1/√(2π)) e^{−(1/2)(v1 − μ_{y1})²},

since the remaining integrand is (up to the factor 1/√(2π)) the density of a normal
distribution. Thus the marginal density of y1 = (1/b11) x1 = x1/√σ11 is

(1/√(2π)) e^{−(1/2)(v1 − μ_{y1})²},

i.e., y1 ~ N(μ_{y1}, 1).
We now obtain the conditional distribution of x2 given x1 = u1. Using the partitioned form of Σ, verify the factorization

Σ = [ 1, 0 ; σ12/σ11, 1 ] [ σ11, 0 ; 0, σ22 − σ12²/σ11 ] [ 1, σ12/σ11 ; 0, 1 ],

so that

Σ^{-1} = [ 1, −σ12/σ11 ; 0, 1 ] [ 1/σ11, 0 ; 0, 1/(σ22 − σ12²/σ11) ] [ 1, 0 ; −σ12/σ11, 1 ]

and |Σ| = σ11 (σ22 − σ12²/σ11). Substituting this in the density of x, the exponent splits as

(u − μ_x)^t Σ^{-1} (u − μ_x) = (u1 − μ_{x1})²/σ11 + [u2 − μ_{x2} − (σ12/σ11)(u1 − μ_{x1})]² / (σ22 − σ12²/σ11),

and hence the density of x factorizes as

fx(u1, u2) = [ 1/√(2π σ11) e^{−(u1 − μ_{x1})²/(2σ11)} ] · [ 1/√(2π(σ22 − σ12²/σ11)) e^{−[u2 − μ_{x2} − (σ12/σ11)(u1 − μ_{x1})]²/(2(σ22 − σ12²/σ11))} ].

The first factor is the marginal density of x1, namely that of N(μ_{x1}, σ11). Dividing fx(u1, u2)
by it, the conditional density of x2 given x1 = u1 is seen to be that of

N( μ_{x2} + (σ12/σ11)(u1 − μ_{x1}), σ22 − σ12²/σ11 ),

or, in terms of the original parameters,

N( μ_{x2} + ρ (σ_{x2}/σ_{x1})(u1 − μ_{x1}), σ_{x2}²(1 − ρ²) ).
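A sketch checking the conditional-distribution formula by simulation (the parameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)

mu = np.array([1.0, 2.0])
sd1, sd2, rho = 2.0, 3.0, 0.5
Sigma = np.array([[sd1**2, rho * sd1 * sd2],
                  [rho * sd1 * sd2, sd2**2]])

x = rng.multivariate_normal(mu, Sigma, size=1_000_000)

# Conditional distribution of x2 given x1 = u1 (theory):
u1 = 2.5
cond_mean = mu[1] + rho * (sd2 / sd1) * (u1 - mu[0])
cond_var = sd2**2 * (1 - rho**2)

# Empirical: look at draws whose x1 falls in a narrow band around u1.
band = x[np.abs(x[:, 0] - u1) < 0.05, 1]
print(cond_mean, band.mean())
print(cond_var, band.var())
```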
1 21
f x u1 , ,u p
e
2
1 22
1 2p
e
e
2
2
ut u
2
v1
b11
u p
= Absolute value of
v p
u p
b p1
b12
b1 p
bp 2
b pp
.e v
B 1 B 1 v
1
absolute value of B
Let us write = BBt. Then || = |B||Bt| = |B|2. So the absolute value of |B| is ||, the
positive square root of ||.
Thus the density of y can be rewritten as f y v
1
2
1
v t 1 v
2
Notice that the density of y depends on the parameters μ and Σ. This density is called
the density of the p-variate normal distribution with parameters μ and Σ, and we denote the
distribution as

y ~ Np(μ, Σ).

In the same notation, the x that we considered above has the distribution x ~ Np(0, I).
Let us now identify the parameters μ and Σ.
We know that E(x) = 0 and the variance-covariance matrix D(x) = I, since x1, ..., xp are
independent standard normal variates.
Since y = Bx + μ, we have

E(y) = E(Bx + μ) = B E(x) + μ = μ
D(y) = D(Bx + μ) = D(Bx) = B I B^t = BB^t = Σ

Thus μ and Σ are the mean vector and the variance-covariance matrix of y ~ Np(μ, Σ).
Recall that we started with B nonsingular and hence Σ is nonsingular (in fact, positive
definite). The fact that B is nonsingular was crucial in obtaining the density of y as above
(notice that the Jacobian not being 0 was an assumption in obtaining the density of
the transformed random vector in the form mentioned above). Thus the distribution
Np(μ, Σ), i.e., the p-variate normal distribution with mean vector μ and variance-covariance matrix Σ, is called a nonsingular p-variate normal distribution if Σ is positive
definite. Later on, we shall also study the case where Σ need not necessarily be positive
definite. Let us now summarize this discussion.
Definition: A random vector y of order p × 1 is said to have a nonsingular p-variate
normal distribution with parameters μ and Σ if it has the density

f_y(v) = 1/((2π)^{p/2} |Σ|^{1/2}) exp{ −(1/2)(v − μ)^t Σ^{-1} (v − μ) }.
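The density in the definition can be evaluated directly; a sketch comparing the explicit formula with SciPy's built-in implementation (μ and Σ below are illustrative):

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([1.0, -1.0, 0.5])
Sigma = np.array([[4.0, 1.0, 0.5],
                  [1.0, 3.0, 0.2],
                  [0.5, 0.2, 2.0]])
v = np.array([0.0, 0.0, 0.0])

p = len(mu)
d = v - mu
quad = d @ np.linalg.solve(Sigma, d)                     # (v-mu)' Sigma^{-1} (v-mu)
f_formula = np.exp(-0.5 * quad) / np.sqrt((2 * np.pi) ** p * np.linalg.det(Sigma))

print(f_formula)
print(multivariate_normal(mean=mu, cov=Sigma).pdf(v))    # should agree
```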
Partition

y = (y1^t : y2^t)^t,  v = (v1^t : v2^t)^t,  μ = (μ1^t : μ2^t)^t  and  Σ = [ Σ11, Σ12 ; Σ12^t, Σ22 ],   (2.5.1)

where y1, v1 and μ1 are of order r × 1 (1 ≤ r < p), and Σ11 and Σ22 are of order r × r and (p − r)
× (p − r) respectively. Such partitions of v, μ and Σ are called partitions conformable to
that of y.

We first show that if Σ12 = 0, then y1 and y2 are independent. Notice that if Σ12 = 0, then
the covariance between yi and yj, for i = 1, ..., r and j = r + 1, ..., p, is 0, and hence each of y1, ..., yr is
uncorrelated with each of yr+1, ..., yp.
Theorem 6. Let y ~ Np(μ, Σ) where Σ is positive definite. Let y, v, μ and Σ be partitioned
as in (2.5.1). Then y1 and y2 are independent if and only if Σ12 = 0.

Proof: We need to prove only the 'if' part, as the 'only if' part has already been proved in the
previous section (see Theorem 4).

The density of y is

f_y(v) = 1/((2π)^{p/2} |Σ|^{1/2}) exp{ −(1/2)(v − μ)^t Σ^{-1} (v − μ) }.

Since Σ12 = O,

Σ = [ Σ11, O ; O, Σ22 ],  |Σ| = |Σ11| |Σ22|,  and  Σ^{-1} = [ Σ11^{-1}, O ; O, Σ22^{-1} ].

Hence

(v − μ)^t Σ^{-1} (v − μ) = (v1^t − μ1^t : v2^t − μ2^t) [ Σ11^{-1}, O ; O, Σ22^{-1} ] ((v1 − μ1)^t : (v2 − μ2)^t)^t
= (v1 − μ1)^t Σ11^{-1} (v1 − μ1) + (v2 − μ2)^t Σ22^{-1} (v2 − μ2).

Thus f_y(v) can be rewritten as

f_y(v) = [ 1/((2π)^{r/2} |Σ11|^{1/2}) e^{−(1/2)(v1 − μ1)^t Σ11^{-1}(v1 − μ1)} ] · [ 1/((2π)^{(p−r)/2} |Σ22|^{1/2}) e^{−(1/2)(v2 − μ2)^t Σ22^{-1}(v2 − μ2)} ]
= f_{y1}(v1) · f_{y2}(v2).

Hence y1 and y2 are independent.
E12. Show that the marginal distributions of y1 and y2 when Σ12 = 0 are Nr(μ1, Σ11) and
Np−r(μ2, Σ22) respectively.
Theorem 7. Let y ~ Np(μ, Σ) where Σ is positive definite, and let y, v, μ and Σ be partitioned
as in (2.5.1). Then the marginal distribution of y1 is Nr(μ1, Σ11). Moreover, y1 and
y2 − Σ21 Σ11^{-1} y1 are independently distributed, and the distribution of y2 − Σ21 Σ11^{-1} y1 is
Np−r(μ2 − Σ21 Σ11^{-1} μ1, Σ22 − Σ21 Σ11^{-1} Σ12).

Proof: Write Σ21 = Σ12^t and verify by direct multiplication the factorization

Σ = [ I, O ; Σ21 Σ11^{-1}, I ] [ Σ11, O ; O, Σ22 − Σ21 Σ11^{-1} Σ12 ] [ I, Σ11^{-1} Σ12 ; O, I ].

Taking inverses,

Σ^{-1} = [ I, −Σ11^{-1} Σ12 ; O, I ] [ Σ11^{-1}, O ; O, (Σ22 − Σ21 Σ11^{-1} Σ12)^{-1} ] [ I, O ; −Σ21 Σ11^{-1}, I ],

and, since

| A, O ; B, C | = |A| · |C| where A and C are square and |I| = 1,

we also have |Σ| = |Σ11| · |Σ22 − Σ21 Σ11^{-1} Σ12|.

Write w1 = v1 − μ1 and w2 = (v2 − μ2) − Σ21 Σ11^{-1}(v1 − μ1). Substituting the above factorization of
Σ^{-1} in the exponent of f_y(v),

(v − μ)^t Σ^{-1} (v − μ) = w1^t Σ11^{-1} w1 + w2^t (Σ22 − Σ21 Σ11^{-1} Σ12)^{-1} w2,

and hence

f_y(v) = [ 1/((2π)^{r/2} |Σ11|^{1/2}) e^{−(1/2) w1^t Σ11^{-1} w1} ] · [ 1/((2π)^{(p−r)/2} |Σ22 − Σ21 Σ11^{-1} Σ12|^{1/2}) e^{−(1/2) w2^t (Σ22 − Σ21 Σ11^{-1} Σ12)^{-1} w2} ].

The transformation (v1, v2) → (v1, v2 − Σ21 Σ11^{-1} v1) has Jacobian 1, so the above factorization
shows that y1 and y2 − Σ21 Σ11^{-1} y1 are independently distributed, the first factor being the
density of y1 and the second that of y2 − Σ21 Σ11^{-1} y1.

Hence the marginal distribution of y1 from the above factorization is Nr(μ1, Σ11). Also the
marginal distribution of y2 − Σ21 Σ11^{-1} y1 is Np−r(μ2 − Σ21 Σ11^{-1} μ1, Σ22 − Σ21 Σ11^{-1} Σ12). (Notice
that Σ21 = Σ12^t.)
Corollary: If y ~ Np(μ, Σ) where Σ is positive definite, then yi ~ N(μi, σii) for i = 1, ..., p.
We shall now obtain the conditional distribution of y2 given y1 = v1, where y
(~ Np(μ, Σ)) is partitioned as in (2.5.1). Once again we assume that Σ is p.d.

Theorem 8. Let y ~ Np(μ, Σ) where Σ is positive definite. Let y, v, μ and Σ be
partitioned as in (2.5.1). Then the conditional distribution of y2 given y1 = v1 is

Np−r( μ2 + Σ21 Σ11^{-1}(v1 − μ1), Σ22 − Σ21 Σ11^{-1} Σ12 ).

Proof: In the proof of Theorem 7 we showed that y2 − Σ21 Σ11^{-1} y1 and y1 are independently
distributed. Hence the conditional distribution of y2 − Σ21 Σ11^{-1} y1 given y1 = v1 is the same as
the unconditional distribution of y2 − Σ21 Σ11^{-1} y1, which is
Np−r(μ2 − Σ21 Σ11^{-1} μ1, Σ22 − Σ21 Σ11^{-1} Σ12). Given y1 = v1, we have
y2 = (y2 − Σ21 Σ11^{-1} y1) + Σ21 Σ11^{-1} v1, so the conditional mean of y2 is

μ2 − Σ21 Σ11^{-1} μ1 + Σ21 Σ11^{-1} v1 = μ2 + Σ21 Σ11^{-1}(v1 − μ1).

Also the conditional variance-covariance matrix is the same as that of y2 − Σ21 Σ11^{-1} y1,
which is Σ22 − Σ21 Σ11^{-1} Σ12. Thus the conditional distribution of y2 given y1 = v1 is

Np−r( μ2 + Σ21 Σ11^{-1}(v1 − μ1), Σ22 − Σ21 Σ11^{-1} Σ12 ).
Let r = p − 1. Then y2 is the univariate random variable yp. Also Σ21 is a row vector of order
1 × (p − 1) and Σ11 is of order (p − 1) × (p − 1). From the above theorem, the conditional
expectation of yp given y1, ..., yp−1 is

E(yp | y1, ..., yp−1) = μp + Σ21 Σ11^{-1}(v1 − μ1)
= β0 + β1 v1 + ··· + βp−1 vp−1,

where v1 = (v1, ..., vp−1)^t denotes the observed values of y1, ..., yp−1,
β0 = μp − Σ21 Σ11^{-1} μ1 and β = (β1, ..., βp−1)^t = Σ11^{-1} Σ12.

So it is clear that if y1, ..., yp have a joint p-variate normal distribution, then the regression
of yp on y1, ..., yp−1 is linear in y1, ..., yp−1. β1, ..., βp−1 are called the regression coefficients.
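A sketch of these partitioned-matrix formulas in NumPy (μ and Σ below are illustrative):

```python
import numpy as np

mu = np.array([1.0, 0.0, -1.0])
Sigma = np.array([[2.0, 0.8, 0.5],
                  [0.8, 1.5, 0.3],
                  [0.5, 0.3, 1.0]])

r = 2                                   # condition on the first r components
S11, S12 = Sigma[:r, :r], Sigma[:r, r:]
S21, S22 = Sigma[r:, :r], Sigma[r:, r:]

v1 = np.array([2.0, 0.5])               # observed values of y1, ..., yr

cond_mean = mu[r:] + S21 @ np.linalg.solve(S11, v1 - mu[:r])
cond_cov = S22 - S21 @ np.linalg.solve(S11, S12)
beta = np.linalg.solve(S11, S12)        # regression coefficients of y_p on y1, ..., yr

print(cond_mean, cond_cov, beta.ravel(), sep="\n")
```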
Example 10. Let y = (y1, y2, y3)^t have the N3(μ, Σ) distribution where

μ = (1, 1, 0)^t  and  Σ =
  4  1  2
  1  4  2
  2  2  4

[By ξ ~ N(μ, σ²) we mean that ξ has a normal distribution with mean μ and variance σ².]

(b) It is easy to see that the marginal distribution of (y1, y3)^t is bivariate normal. From
the given information on μ and Σ, we have

E(y1) = 1, E(y3) = 0, V(y1) = V(y3) = 4, Cov(y1, y3) = 2.

Hence

(y1, y3)^t ~ N2( (1, 0)^t, [ 4, 2 ; 2, 4 ] ).

(c) The conditional distribution of y1 given y2 = −0.5 and y3 = 0.2 is normal by Theorem 8.
Partition

Σ = [ σ11, Σ12 ; Σ12^t, Σ22 ]  with  σ11 = 4,  Σ12 = (1  2),  Σ22 = [ 4, 2 ; 2, 4 ].

Then the conditional mean is

μ1 + Σ12 Σ22^{-1} ( (−0.5, 0.2)^t − (1, 0)^t )
= 1 + (1  2) (1/12) [ 4, −2 ; −2, 4 ] (−1.5, 0.2)^t
= 1 + (0  0.5) (−1.5, 0.2)^t
= 1 + 0.1 = 1.1,

and the conditional variance is

σ11 − Σ12 Σ22^{-1} Σ12^t = 4 − (0  0.5)(1, 2)^t = 4 − 1 = 3.
We shall use the algorithm of the previous section to get B and B^{-1} as required. Thus we
form the array (Σ : I) and sweep out:

  4    1      2         1       0       0        .. (1)
  1    4      2         0       1       0        .. (2)
  2    2      4         0       0       1        .. (3)
  ------------------------------------------------------
  2    1/2    1         1/2     0       0        .. (4) = (1)/√4
  0    15/4   3/2      −1/4     1       0        .. (5) = (2) − (1)/4
  0    3/2    3        −1/2     0       1        .. (6) = (3) − (1)/2
  ------------------------------------------------------
  2    1/2     1         1/2      0       0      .. (7) = (4)
  0    √15/2   3/√15    −1/√60    2/√15   0      .. (8) = (5)/√(15/4)
  0    0       12/5     −2/5     −2/5     1      .. (9) = (6) − (3/√15)·(8)
  ------------------------------------------------------
  2    1/2     1          1/2      0        0        .. (10) = (7)
  0    √15/2   3/√15     −1/√60    2/√15    0        .. (11) = (8)
  0    0       √(12/5)   −1/√15   −1/√15    √15/6    .. (12) = (9)/√(12/5)

Hence

B =
  2       0        0
  1/2     √15/2    0
  1       3/√15    √(12/5)

and

B^{-1} =
  1/2        0        0
  −1/√60     2/√15    0
  −1/√15    −1/√15    √15/6

with Σ = BB^t. Hence, if z = B^{-1}(y − μ), then z is a vector of independent
N(0, 1) variables.
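This factorization can be checked with NumPy (np.linalg.cholesky returns the same lower triangular B, up to rounding):

```python
import numpy as np

Sigma = np.array([[4.0, 1.0, 2.0],
                  [1.0, 4.0, 2.0],
                  [2.0, 2.0, 4.0]])

B = np.linalg.cholesky(Sigma)
B_inv = np.linalg.inv(B)

print(B)                         # [[2, 0, 0], [0.5, 1.936.., 0], [1, 0.7745.., 1.549..]]
print(B_inv)
print(B_inv @ Sigma @ B_inv.T)   # identity, so z = B^{-1}(y - mu) has D(z) = I
```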
E13. Let y = (y1, y2, y3, y4)^t ~ N4(μ, Σ) with the given mean vector μ and variance-covariance
matrix Σ.
(a) Obtain the marginal distribution of (y1, y3)^t.
(b) Show that (y1, y4)^t and (y2, y3)^t are independent.
(c) Obtain the conditional distribution of (y1, y2)^t given (y3, y4)^t.
(d) Write down the correlation coefficient between y1 and y3.
2.6 CHARACTERIZATION OF MULTIVARIATE NORMAL VIA LINEAR FUNCTIONS

In this section we characterize the multivariate normal distribution via linear functions: a
random vector y has a multivariate normal
distribution if and only if every fixed linear combination of the components of y has a
univariate normal distribution. Using this characterization, we derive several properties
of multivariate normal distribution, some of which we studied in the previous section.
Here we do not need the density function, and the results are obtained easily and 'without
tears', as S. K. Mitra, one of the experts in the field, says. This approach is due to
D. Basu (1956).
Theorem 9. Let y ~ Np(μ, Σ) where Σ is positive definite. Then the following hold.
(a) Let z = By where B is a fixed nonsingular matrix. Then z ~ Np(Bμ, BΣB^t).
(b) Let x = Cy where C is a fixed r × p matrix of rank r (1 ≤ r ≤ p). Then x ~ Nr(Cμ,
CΣC^t).
(c) Let w = l^t y where l is a fixed nonnull vector. Then w ~ N(l^t μ, l^t Σ l).
Further, the distributions of z and x are nonsingular multivariate normal.
Proof: (a) The density of y is

f_y(v) = 1/((2π)^{p/2} |Σ|^{1/2}) exp{ −(1/2)(v − μ)^t Σ^{-1}(v − μ) }.

Since B is nonsingular, the transformation z = By has inverse y = B^{-1} z, and the density of z at u is

f_z(u) = f_y(B^{-1} u) / |det B|
= 1/((2π)^{p/2} |Σ|^{1/2} |det B|) exp{ −(1/2)(B^{-1} u − μ)^t Σ^{-1}(B^{-1} u − μ) }
= 1/((2π)^{p/2} |BΣB^t|^{1/2}) exp{ −(1/2)(u − Bμ)^t (BΣB^t)^{-1}(u − Bμ) },

since (B^{-1} u − μ)^t Σ^{-1}(B^{-1} u − μ) = (u − Bμ)^t (B^{-1})^t Σ^{-1} B^{-1}(u − Bμ) = (u − Bμ)^t (BΣB^t)^{-1}(u − Bμ)
and |BΣB^t| = |Σ| (det B)². Hence z ~ Np(Bμ, BΣB^t).
(b) Since C is an r × p matrix of rank r, the rows of C are linearly independent. Hence
there exists a matrix T of order (p − r) × p of rank (p − r) such that B = (C^t : T^t)^t (C stacked over T) is nonsingular.
[We can extend the rows of C to a basis of R^p. The rows of T are the additional vectors in
the extended basis of R^p; see the sketch after this proof.]

By part (a), By = ((Cy)^t : (Ty)^t)^t follows Np(Bμ, BΣB^t).
Observe that the components of Cy are the first r components of By. By Theorem 7, the
marginal distribution of the first r components of By, i.e., the distribution of Cy, is r-variate
normal. Since E(Cy) = Cμ and D(Cy) = CΣC^t, it follows that

Cy ~ Nr(Cμ, CΣC^t).

(c) This is a special case of (b) where r = 1.

Thus we have proved that if y ~ Np(μ, Σ) where Σ is positive definite, then every nonnull
fixed linear combination l^t y of y has a univariate normal distribution. If l = 0, then l^t y = 0
with probability 1; it has mean 0 and variance 0 and can be thought of as N(0, 0).
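A sketch of the construction used in part (b): extending the rows of a rank-r matrix C to a nonsingular B (here via a QR-based completion; the matrix C is an arbitrary illustrative choice):

```python
import numpy as np

C = np.array([[1.0, 1.0, 1.0, 1.0],
              [1.0, -1.0, 0.0, 0.0]])        # r x p matrix of rank r (r = 2, p = 4)
r, p = C.shape

# Complete the row space of C to all of R^p: take an orthonormal basis of the
# orthogonal complement of the row space and use it as the rows of T.
Q, _ = np.linalg.qr(C.T, mode="complete")    # columns of Q span R^p
T = Q[:, r:].T                               # (p - r) rows orthogonal to the rows of C

B = np.vstack([C, T])
print(np.linalg.matrix_rank(B), np.linalg.det(B))   # rank p, nonzero determinant
```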
For instance, for y ~ N5(μ, Σ) with the given μ and Σ and the given fixed 2 × 5 matrix C, part (b)
of Theorem 9 gives Cy ~ N2(Cμ, CΣC^t), and Cμ and CΣC^t are obtained by direct matrix
multiplication (in the numerical illustration, CΣC^t works out to [ 25, 63 ; 63, 269 ]).

(c) Let m = (1, 1, 1, 1, 1)^t. Writing l = (1/5) m, we have l^t y = (1/5)(y1 + ··· + y5), and by (c) of
Theorem 9 and (a) above, l^t y ~ N(l^t μ, l^t Σ l).
E14. (a) Let y ~ Np(μ, Σ). Let ȳ = (1/p)(y1 + ··· + yp). Obtain the distribution of ȳ.
(b) For y ~ N4(μ, Σ) with the given μ and Σ, obtain x = Cy and w = Ty, where x and w are of
order 2 × 1, such that x and w have independent nonsingular bivariate normal distributions.
Compute the parameters of these distributions.
We now embark on the issue of characterizing the multivariate normal distribution via
linear functions.
Definition (Characteristic function). For a univariate random variable x, φ_x(t) = E(e^{itx}) is
called the characteristic function of x. For a multivariate random variable y of order p ×
1, φ_y(t) = E(e^{i t^t y}), t ∈ R^p, is called the characteristic function of y.

Theorem 11. (a) The distribution of a random vector y is completely specified by its
characteristic function φ_y(t).
(b) Let y = (y1^t : ... : yk^t)^t be a partition of y and partition t accordingly. If

φ_y(t) = φ_{y1}(t1) ··· φ_{yk}(tk),

then y1, ..., yk are mutually independent.

(For the proof and other properties of characteristic functions, refer to Chapter 15 of Feller
(Volume 2, 1966).)
We now prove a theorem due to Cramér and Wold that connects the distribution of a p-variate random vector with the distributions of its linear combinations.

Theorem 12. Let y be a p-variate random vector. Then the distribution of y is completely
determined by the class of univariate distributions of all linear functions l^t y, l ∈ R^p (l
fixed).

Proof: Let the characteristic function of l^t y be φ(t, l) = E(e^{i t l^t y}).
Now φ(1, l) = E(e^{i l^t y}) = φ_y(l) is the characteristic function of y as a function of l = (l1, ...,
lp)^t. Thus the class of distributions of all l^t y determines φ_y, and by Theorem 11(a) above,
the distribution of y is completely specified by its characteristic function.
Motivated by theorem 12 together with the fact that every linear function of a random
vector with (nonsingular) multivariate normal has a univariate normal distribution, we
define multivariate normal distribution as follows.
Definition. A p-dimensional random vector y is said to have a p-variate normal
distribution if every linear function l^t y (with l fixed) of y has a univariate normal
distribution.
From now on, in this section, we use the above definition and obtain several important
properties of multivariate normal distribution. We shall also show that this coincides
with the earlier definition through density whenever the density exists. This approach is
called a density-free approach.
Theorem 13. Let y be a p-dimensional random vector having a p-variate normal
distribution. Then E(y) and D(y) exist.

Proof: Since yi is a linear function of y, by definition yi has a univariate normal distribution,
so E(yi) = μi and V(yi) = σii exist and
are finite for i = 1, ..., p. Since yi + yj is a linear function of y, V(yi + yj) = V(yi) + 2
Cov(yi, yj) + V(yj) exists and is finite. Hence Cov(yi, yj) = σij exists and is finite. Hence
E(y) = (μ1, ..., μp)^t and D(y) = Σ = ((σij)) exist and are finite.
Theorem 14. Let y have a p-variate normal distribution with E(y) = μ and D(y) = Σ. Then the
characteristic function of y is

φ_y(t) = exp{ i t^t μ − (1/2) t^t Σ t }.

Proof: Recall that the characteristic function of a univariate random variable having a
normal distribution with mean μ and variance σ² is φ(s) = e^{isμ − s²σ²/2}. Now t^t y is a
linear function of y, and hence has a univariate normal distribution with mean t^t μ and
variance t^t Σ t. Hence the characteristic function of t^t y is

φ(s, t) = E(e^{i s t^t y}) = e^{i s t^t μ − (1/2) s² t^t Σ t}, for each t.

Now φ_y(t) = E(e^{i t^t y}) = φ(1, t) = e^{i t^t μ − (1/2) t^t Σ t}.

Notice that the characteristic function of y depends only on μ and Σ, its mean vector and
variance-covariance matrix respectively. Also, by Theorem 11, the distribution of y is
completely specified by its characteristic function. Hence the distribution of y is
completely specified by its mean vector and variance-covariance matrix. Let E(y) = μ
and D(y) = Σ. Henceforth, we shall write y ~ Np(μ, Σ) to denote that y has a p-variate
normal distribution with parameters μ and Σ. Clearly μ = E(y) and Σ = D(y). We do not
insist that Σ is positive definite. However, in view of Theorem 3 of Section 2.3, Σ is nnd.
Theorem 15. Let y ~ Np(μ, Σ). Then every linear function l^t y ~ N(l^t μ, l^t Σ l).

Proof: By definition, l^t y has a univariate normal distribution. Further, E(l^t y) = l^t μ and
V(l^t y) = l^t Σ l.

E15. Let y ~ Np(μ, Σ). Let B be a fixed r × p matrix. Show that By ~ Nr(Bμ, BΣB^t).
Theorem 16. Let y ~ Np(μ, Σ). If Σ is a diagonal matrix, then the components of y are
independent.

Proof: The characteristic function of y is

φ_y(t) = E(e^{i t^t y}) = e^{i t^t μ − (1/2) t^t Σ t} = e^{i t1 μ1 − (1/2) t1² σ11} ··· e^{i tp μp − (1/2) tp² σpp}

(since t^t Σ t = t1² σ11 + ··· + tp² σpp when Σ is diagonal). Each factor is the characteristic
function of N(μi, σii), so by Theorem 11(b), y1, ..., yp are mutually independent.

If Σ is a diagonal matrix, then y1, ..., yp are uncorrelated. Thus we have shown in Theorem
16 that uncorrelatedness implies independence if y1, ..., yp have a joint normal
distribution.
Theorem 17. Let y ~ Np(μ, Σ). Write y = (y1^t : y2^t)^t, where y1 has r components. Partition
μ and Σ conformably as μ = (μ1^t : μ2^t)^t and Σ = [ Σ11, Σ12 ; Σ21, Σ22 ]. Then y1 and y2 are
independently distributed if and only if Σ12 = 0.

Proof: If Σ12 = 0, the characteristic function of y factorizes as φ_{y1}(t1) φ_{y2}(t2), exactly as in the
proof of Theorem 16, so y1 and y2 are independent by Theorem 11(b). Conversely, if y1 and y2
are independent, then Σ12 = Cov(y1, y2) = 0 by Theorem 4.

Example. Let y ~ Np(μ, Σ) and write y = (y1^t : y2^t : ... : yk^t)^t. Show that if y1, ..., yk are
independent pair-wise, then they are mutually independent. (In general, pair-wise
independence does not imply mutual independence, but it holds if y1, ..., yk have a
joint normal distribution.)

Solution:
Partition Σ = ((Σij)), i, j = 1, ..., k, and μ = (μ1^t : ... : μk^t)^t conformably with the partition of y.
Let i ≠ j. By E15, (yi^t : yj^t)^t has a multivariate normal distribution with mean vector (μi^t : μj^t)^t
and variance-covariance matrix [ Σii, Σij ; Σji, Σjj ] (why?). Since yi and yj are independent, Σij =
0.

Thus

Σ = [ Σ11, O, ..., O ; O, Σ22, ..., O ; ... ; O, O, ..., Σkk ],

and the characteristic function of y is

φ_y(t) = E(e^{i t^t y}) = e^{i t^t μ − (1/2) t^t Σ t} = e^{i t1^t μ1 − (1/2) t1^t Σ11 t1} ··· e^{i tk^t μk − (1/2) tk^t Σkk tk}
= φ_{y1}(t1) ··· φ_{yk}(tk),

where t = (t1^t : ... : tk^t)^t is partitioned conformably. By Theorem 11(b), y1, ..., yk are
mutually independent.
(A discussion of the singular case, based on writing an orthogonal matrix P = (P1 : P2) and considering P1 P1^t (y − μ), followed here.)
2.8 REFERENCES
1. Basu, D. (1956). A note on the multivariate extension of some theorems related to
the univariate normal distribution. Sankhya, Vol. 17, pages 221-224.
2. Feller, W. (1966). An Introduction to Probability Theory and its Applications, Vol. 2,
Wiley, New York.
3. Rao, A. R. and Bhimasankaram, P. (2000). Linear Algebra, Hindustan Book Agency,
Delhi.
4. Ross, S. (1976). A First Course in Probability, Second Edition, Macmillan
Publishing Company, New York.