
Chapter One

Introduction to Multivariate Analysis or Methods


1.1. Introduction
What is multivariate analysis?
Multivariate statistical analysis is the simultaneous statistical analysis of a collection of
random variables. That is, it is a collection of methods that deal with the simultaneous
analysis of multiple outcomes or response variables. It is partly a straightforward extension of
the analysis of a single variable, where we would calculate, for example, measures of location
and variation, check violations of a particular distributional assumption, and detect possible
outliers in the data. Multivariate analysis improves upon separate univariate analyses of each
variable in a study because it incorporates information into the statistical analysis about the
relationships between all the variables. In applications, multivariate analysis typically deals with fairly large data sets involving many subjects and, in particular, numerous interrelated variables of interest.
Why Learn Multivariate Analysis?
An explanation of a social or physical phenomenon must be tested by gathering and analyzing data.
The complexities of most phenomena require an investigator to collect observations on many different variables.
Objectives of Multivariate Analysis
• To gain a thorough understanding of the details of various multivariate techniques, their purposes, their assumptions, their limitations, and so on.
• To be able to select one or more appropriate techniques for a given multivariate data set.
• To be able to interpret the results of a computer analysis of a multivariate data set.
Major Uses of Multivariate Analysis
1. Data reduction or structural simplification
The phenomenon being studied is represented as simply as possible without losing
valuable information.
The reduction of the complexity of available data to several meaningful indices,
quantities (parameters), or dimensions.
2. Sorting and grouping
Groups of similar objects or variables are created based on measured characteristics, or rules for classifying objects into well-defined groups may be required.
3. Investigation of the dependence among variables
The nature of the relationships among variables is of interest. Are all the variables
mutually independent or are one or more variables dependent on the other? If so,
how?
4. Prediction

Relationships between variables must be determined for the purpose of predicting the
values of one or more variables on the basis of observations on the other variables.
5. Hypothesis construction and testing
Specific statistical hypotheses, formulated in terms of the parameters of multivariate populations, are tested. This may be done to validate assumptions.
1.2. Areas of application
The applications of multivariate techniques have traditionally been in the behavioral and biological sciences. However, interest in multivariate methods has now spread to numerous other fields of investigation. Many organizations today are faced with the same challenge: too much data. These include:
• Business - customer transactions
• Communications - website use
• Government - intelligence/news
• Industry - process data, etc.
1.3. Organizing multivariate data

We are going to be concerned with analyzing measurements made on several variables or characteristics. These measurements (commonly called data) must frequently be arranged and displayed in various ways; graphs and tabular arrangements are important aids in data analysis. We now introduce the preliminary concepts underlying these first steps of data organization.
Arrays (data usually in the form of a matrix)
The values of the variables are all recorded for each distinct item, individual, or experimental unit. We will use the notation x_{jk} to indicate the particular value of the k-th variable that is observed on the j-th item. That is,
x_{jk} = measurement of the k-th variable on the j-th item.
Consequently, n measurements on p variables can be displayed as a rectangular array X of n rows and p columns:
X = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1k} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2k} & \cdots & x_{2p} \\ \vdots & \vdots & & \vdots & & \vdots \\ x_{j1} & x_{j2} & \cdots & x_{jk} & \cdots & x_{jp} \\ \vdots & \vdots & & \vdots & & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nk} & \cdots & x_{np} \end{bmatrix}
The array X contains the data consisting of all of the observations on all of the variables.
1.4. Descriptive Statistics
Summary numbers are used to assess the information contained in data. The basic descriptive statistics are the sample mean, sample variance, sample standard deviation, sample covariance, and sample correlation coefficient.
Let x_{11}, x_{21}, \ldots, x_{n1} be n measurements on the first variable. Then the arithmetic average of these measurements is
\bar{x}_1 = \frac{1}{n} \sum_{j=1}^{n} x_{j1}    (1.1)
The sample mean can be computed from the n measurements on each of the p variables, so that, in general, there will be p sample means:
\bar{x}_k = \frac{1}{n} \sum_{j=1}^{n} x_{jk}, \quad k = 1, 2, \ldots, p    (1.2)
A measure of spread is provided by the sample variance, defined for n measurements on the first variable as
s_1^2 = \frac{1}{n-1} \sum_{j=1}^{n} (x_{j1} - \bar{x}_1)^2    (1.3)
where \bar{x}_1 is the sample mean of the x_{j1}'s. In general, for p variables, we have
s_k^2 = \frac{1}{n-1} \sum_{j=1}^{n} (x_{jk} - \bar{x}_k)^2, \quad k = 1, 2, \ldots, p    (1.4)
Note:
1. Some authors define the sample variance with a divisor of n rather than n-1. Later, we shall see that there are theoretical reasons for preferring the divisor n-1, and this is particularly important if the number of measurements, n, is small. The two versions of the sample variance will always be differentiated by displaying the appropriate expression.

2. The sample variance s^2 is generally never equal to the population variance \sigma^2 (the probability of such an occurrence is zero), but it is an unbiased estimator for \sigma^2; that is, E(s^2) = \sigma^2. Here the notation E(s^2) indicates the mean of all possible sample variances. The square root of either the population variance or the sample variance is called the standard deviation.
In this situation, it is convenient to use double subscripts on the variances in order to indicate their positions in the array. Therefore, we introduce the notation s_{kk} to denote the same variance computed from measurements on the k-th variable:
s_k^2 = s_{kk} = \frac{1}{n-1} \sum_{j=1}^{n} (x_{jk} - \bar{x}_k)^2, \quad k = 1, 2, \ldots, p    (1.5)
The square root of the sample variance, \sqrt{s_{kk}}, is known as the sample standard deviation. This measure of variation uses the same units as the observations.
1.5. Measures of Linear Association
The sample covariance is used to measure the linear association between two variables. It is denoted by s_{ik} and defined as
s_{ik} = \frac{1}{n-1} \sum_{j=1}^{n} (x_{ji} - \bar{x}_i)(x_{jk} - \bar{x}_k), \quad i = 1, 2, \ldots, p, \quad k = 1, 2, \ldots, p    (1.6)
This measures the association between the i-th and k-th variables.
Note: The covariance reduces to the sample variance when i = k. Moreover, s_{ik} = s_{ki} for all i and k. The sample covariance measures only linear relationships. Because the covariance depends on the scale of measurement of the two variables, it is difficult to compare covariances between different pairs of variables. For example, if we change a measurement from inches to centimeters, the covariance will change.
Note: If data were available on all members of a given finite population, then using Equations 1.5 and 1.6, but with divisor (denominator) n, would allow one to determine all elements of the population covariance matrix \Sigma.

The sample correlation coefficient (r_{ik}) is a measure of the linear association between two variables that does not depend on the units of measurement. To find a measure of linear relationship that is invariant to changes of scale, we can standardize the covariance by dividing by the standard deviations of the two variables. This standardized covariance is called a correlation. The sample correlation coefficient for the i-th and k-th variables is defined as
r_{ik} = \frac{s_{ik}}{\sqrt{s_{ii}}\sqrt{s_{kk}}} = \frac{\sum_{j=1}^{n} (x_{ji} - \bar{x}_i)(x_{jk} - \bar{x}_k)}{\sqrt{\sum_{j=1}^{n} (x_{ji} - \bar{x}_i)^2}\, \sqrt{\sum_{j=1}^{n} (x_{jk} - \bar{x}_k)^2}} \quad \text{for } i = 1, 2, \ldots, p \text{ and } k = 1, 2, \ldots, p    (1.7)
Note: r_{ik} = r_{ki} for all i and k. The sample correlation coefficient is a standardized version of the sample covariance, where the product of the square roots of the sample variances provides the standardization.
Notice that r_{ik} has the same value whether n or n-1 is chosen as the common divisor for s_{ii}, s_{kk}, and s_{ik}.
Properties of the Sample Correlation Coefficient
• Its value is between -1 and 1.
• Its magnitude measures the strength of the linear association. If r = 0, there is a lack of linear association between the components. r < 0 implies a tendency for one value in the pair to be larger than its average when the other is smaller than its average, and r > 0 implies a tendency for one value in the pair to be larger when the other is large, and also for both values to be small together.
• Its sign indicates the direction of the association.
• Its value remains unchanged if all x_{ji}'s and x_{jk}'s are changed to y_{ji} = a x_{ji} + b and y_{jk} = c x_{jk} + d, respectively, provided that the constants a and c have the same sign.

The descriptive statistics computed from n measurements on p variables can also be organized
into arrays.
Arrays of Basic Descriptive Statistics
Sample means:
\bar{x} = \begin{bmatrix} \bar{x}_1 \\ \bar{x}_2 \\ \vdots \\ \bar{x}_p \end{bmatrix}
Sample variances and covariances:
S = \begin{bmatrix} s_{11} & s_{12} & \cdots & s_{1p} \\ s_{21} & s_{22} & \cdots & s_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ s_{p1} & s_{p2} & \cdots & s_{pp} \end{bmatrix}
Sample correlations:
R = \begin{bmatrix} 1 & r_{12} & \cdots & r_{1p} \\ r_{21} & 1 & \cdots & r_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ r_{p1} & r_{p2} & \cdots & 1 \end{bmatrix}
Example: The following are five measurements on the variables x_1, x_2, and x_3:
X1 9 2 6 5 8
X2 12 8 6 4 10
X3 3 4 0 2 1
Find the arrays of X̄ , S, and R
Solution:
We have a total of five measurements (observations) on each variable.
The sample means are
\bar{x}_1 = \frac{1}{n}\sum_{j=1}^{5} x_{j1} = \frac{9 + 2 + 6 + 5 + 8}{5} = 6
\bar{x}_2 = \frac{1}{n}\sum_{j=1}^{5} x_{j2} = \frac{12 + 8 + 6 + 4 + 10}{5} = 8
\bar{x}_3 = \frac{1}{n}\sum_{j=1}^{5} x_{j3} = \frac{3 + 4 + 0 + 2 + 1}{5} = 2
Therefore, \bar{X} = \begin{bmatrix} 6 \\ 8 \\ 2 \end{bmatrix}
The sample variances and covariances are
s_{11} = \frac{1}{4}\sum_{j=1}^{5} (x_{j1} - \bar{x}_1)^2 = \frac{(9-6)^2 + (2-6)^2 + (6-6)^2 + (5-6)^2 + (8-6)^2}{4} = 7.5
s_{22} = \frac{1}{4}\sum_{j=1}^{5} (x_{j2} - \bar{x}_2)^2 = \frac{(12-8)^2 + (8-8)^2 + (6-8)^2 + (4-8)^2 + (10-8)^2}{4} = 10
s_{33} = \frac{1}{4}\sum_{j=1}^{5} (x_{j3} - \bar{x}_3)^2 = \frac{(3-2)^2 + (4-2)^2 + (0-2)^2 + (2-2)^2 + (1-2)^2}{4} = 2.5
s_{12} = \frac{1}{4}\sum_{j=1}^{5} (x_{j1} - \bar{x}_1)(x_{j2} - \bar{x}_2) = \frac{(9-6)(12-8) + \cdots + (8-6)(10-8)}{4} = 5
s_{13} = \frac{1}{4}\sum_{j=1}^{5} (x_{j1} - \bar{x}_1)(x_{j3} - \bar{x}_3) = \frac{(9-6)(3-2) + \cdots + (8-6)(1-2)}{4} = -1.75
s_{23} = \frac{1}{4}\sum_{j=1}^{5} (x_{j2} - \bar{x}_2)(x_{j3} - \bar{x}_3) = \frac{(12-8)(3-2) + \cdots + (10-8)(1-2)}{4} = 1.5
Therefore, S = \begin{bmatrix} 7.5 & 5 & -1.75 \\ 5 & 10 & 1.5 \\ -1.75 & 1.5 & 2.5 \end{bmatrix}
The sample correlations are
r_{12} = \frac{s_{12}}{\sqrt{s_{11}}\sqrt{s_{22}}} = \frac{5}{\sqrt{7.5}\sqrt{10}} = 0.58, \qquad r_{21} = r_{12}
r_{13} = \frac{s_{13}}{\sqrt{s_{11}}\sqrt{s_{33}}} = \frac{-1.75}{\sqrt{7.5}\sqrt{2.5}} = -0.40, \qquad r_{31} = r_{13}
r_{23} = \frac{s_{23}}{\sqrt{s_{22}}\sqrt{s_{33}}} = \frac{1.5}{\sqrt{10}\sqrt{2.5}} = 0.30, \qquad r_{32} = r_{23}
Therefore, R = \begin{bmatrix} 1 & 0.58 & -0.40 \\ 0.58 & 1 & 0.30 \\ -0.40 & 0.30 & 1 \end{bmatrix}
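The arrays above can be reproduced numerically. Below is a small NumPy sketch (NumPy is assumed to be available; the code is illustrative and not part of the original notes):

```python
import numpy as np

# Five measurements on each of the p = 3 variables (rows are observations).
X = np.array([
    [9, 12, 3],
    [2,  8, 4],
    [6,  6, 0],
    [5,  4, 2],
    [8, 10, 1],
], dtype=float)

x_bar = X.mean(axis=0)                 # sample mean vector
S = np.cov(X, rowvar=False, ddof=1)    # sample covariance matrix (divisor n-1)
D_inv = np.diag(1.0 / np.sqrt(np.diag(S)))
R = D_inv @ S @ D_inv                  # sample correlation matrix

print(x_bar)              # [6. 8. 2.]
print(S)                  # [[7.5, 5, -1.75], [5, 10, 1.5], [-1.75, 1.5, 2.5]]
print(np.round(R, 2))     # correlations 0.58, -0.40, 0.30 off the diagonal
```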

Chapter Two
Review of Matrix Algebra and Random Vectors
2.1. Basic concepts
Multivariate data can easily be displayed as an array of numbers. In general, a rectangular array of numbers with n rows and p columns is called a matrix of dimension n × p. The study of multivariate methods is greatly facilitated by the use of matrix algebra. The matrix algebra results presented in this chapter will enable us to state statistical models concisely.
2.2. Vector and matrix
2.2.1. Vectors
A vector is a matrix with a single column or row. An array x of n real numbers x_1, x_2, \ldots, x_n is called a vector, and it is written as
x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}
which is a vector of length n; that is, x is n \times 1.
x' = [x_1, x_2, \cdots, x_n] is the transpose of x; that is, x' is 1 \times n.

A set of vectors x_1, x_2, \ldots, x_k is said to be linearly dependent if there exist k constants (c_1, c_2, \ldots, c_k), not all zero, such that
c_1 x_1 + c_2 x_2 + \ldots + c_k x_k = 0
Otherwise, the set of vectors is said to be linearly independent. That is,
\sum_{j=1}^{k} c_j x_j = 0 \;\text{ implies }\; c_j = 0 \text{ for all } j,
where c_1, c_2, \ldots, c_k are scalars.
Note: Linear dependence implies that at least one vector in the set can be written as a linear combination of the other vectors.

Example 2.1: Consider the following set of vectors, and identify whether they are linearly independent.
x_1 = \begin{bmatrix} 1 \\ 2 \\ 1 \end{bmatrix}, \quad x_2 = \begin{bmatrix} 1 \\ 0 \\ -1 \end{bmatrix}, \quad x_3 = \begin{bmatrix} 1 \\ -2 \\ 1 \end{bmatrix}
Setting c_1 x_1 + c_2 x_2 + c_3 x_3 = 0 implies that
c_1 + c_2 + c_3 = 0
2c_1 - 2c_3 = 0
c_1 - c_2 + c_3 = 0
with the unique solution c_1 = c_2 = c_3 = 0. As we cannot find three constants c_1, c_2, and c_3, not all zero, such that c_1 x_1 + c_2 x_2 + c_3 x_3 = 0, the vectors x_1, x_2, and x_3 are linearly independent.
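A quick numerical way to check linear independence is to verify that the matrix whose columns are x_1, x_2, x_3 has full rank. A minimal NumPy sketch (assuming NumPy is available):

```python
import numpy as np

# Columns are the vectors x1, x2, x3 from Example 2.1.
A = np.array([[1,  1,  1],
              [2,  0, -2],
              [1, -1,  1]], dtype=float)

print(np.linalg.matrix_rank(A))   # 3 -> the columns are linearly independent
print(np.linalg.det(A))           # nonzero determinant also confirms independence
```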
2.2.2. Matrix

A matrix is a rectangular or square array of numbers or variables arranged in rows and columns. We use uppercase boldface letters to represent matrices. All entries in matrices will be real numbers or variables representing real numbers. In general, if a matrix A has n rows and p columns, it is said to be an n × p matrix, or we say the size of A is n × p:
A_{n \times p} = \begin{bmatrix} a_{11} & a_{12} & \ldots & a_{1p} \\ a_{21} & a_{22} & \ldots & a_{2p} \\ \vdots & \vdots & & \vdots \\ a_{n1} & a_{n2} & \ldots & a_{np} \end{bmatrix}
We use lowercase letters with subscripts to denote the elements of a matrix. For example, a_{21} is the element in the 2nd row and 1st column of A.
• The transpose of the matrix A_{n \times p}, denoted by A' or A^T, is the p × n matrix whose rows are the columns of A and whose columns are the rows of A.
Example 2.2: Find the transpose of the following matrix
A = \begin{bmatrix} 2 & -1 \\ 4 & 8 \\ 5 & \cdot \end{bmatrix} \quad \text{and its transpose is} \quad A' = \begin{bmatrix} 2 & 4 & 5 \\ -1 & 8 & \cdot \end{bmatrix}
• The identity matrix I_p is defined by
I_p = \begin{bmatrix} 1 & 0 & \ldots & 0 \\ 0 & 1 & \ldots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \ldots & 1 \end{bmatrix}
It is a square matrix with ones on the diagonal and zeros elsewhere.
Determinant of a square matrix
The determinant is an important concept of matrix algebra. The determinant of a matrix A is a scalar denoted by |A| or by det(A). Explicit formulas are practical only for 2 × 2 or 3 × 3 matrices; for larger matrices, other methods are available for manual computation, but determinants are typically evaluated by computer.
• The determinant of any 2 × 2 matrix
A = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix}
is given by \det(A) = a_{11} a_{22} - a_{12} a_{21}.
• The determinant of any 3 × 3 matrix
A = \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{bmatrix}
is given by
|A| = a_{11} a_{22} a_{33} + a_{12} a_{23} a_{31} + a_{13} a_{32} a_{21} - a_{31} a_{22} a_{13} - a_{32} a_{23} a_{11} - a_{33} a_{12} a_{21}
• If the square matrix A is singular, its determinant is 0; i.e., det(A) = 0 if A is singular.
• If A is near singular, then there exists a linear combination of the columns that is close to 0, and det(A) is also close to 0.
• If A is nonsingular, then its determinant is nonzero; i.e., det(A) ≠ 0 if A is nonsingular.
• If A is positive definite, then its determinant is positive; i.e., det(A) > 0 if A is positive definite.
Inverse of a Square Matrix
If the determinant of a matrix A is nonzero (det(A) ≠ 0), then A is said to be an invertible matrix.
If the determinant of A is zero (det(A) = 0), then A has no regular inverse.
The inverse of any 2 × 2 matrix
A = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix}
is given by
A^{-1} = \frac{1}{|A|} \begin{bmatrix} a_{22} & -a_{12} \\ -a_{21} & a_{11} \end{bmatrix}
In general, A^{-1} has (j, i)-th entry (-1)^{i+j} \frac{|A_{ij}|}{|A|}, where A_{ij} is the matrix obtained from A by deleting the i-th row and j-th column.
The technical condition that an inverse exists is that the k columns x_1, x_2, \ldots, x_k of A are linearly independent. That is, the existence of A^{-1} is equivalent to
c_1 x_1 + c_2 x_2 + \ldots + c_k x_k = 0 only if c_1 = c_2 = \ldots = c_k = 0.
Example 2.3: Find the inverse of the following matrix
A = \begin{bmatrix} 3 & 2 \\ 4 & 1 \end{bmatrix}
Since |A| = 3(1) - 2(4) = -5,
A^{-1} = \frac{1}{-5} \begin{bmatrix} 1 & -2 \\ -4 & 3 \end{bmatrix} = \begin{bmatrix} -0.2 & 0.4 \\ 0.8 & -0.6 \end{bmatrix}
You may verify that A^{-1} A = I.
We note that
c_1 \begin{bmatrix} 3 \\ 4 \end{bmatrix} + c_2 \begin{bmatrix} 2 \\ 1 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}
implies that c_1 = c_2 = 0, so the columns of A are linearly independent.
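The inverse and the check A^{-1}A = I can be verified with a short NumPy sketch (illustrative only):

```python
import numpy as np

A = np.array([[3.0, 2.0],
              [4.0, 1.0]])

A_inv = np.linalg.inv(A)
print(A_inv)             # [[-0.2  0.4] [ 0.8 -0.6]]
print(A_inv @ A)         # identity matrix (up to rounding error)
print(np.linalg.det(A))  # -5.0, nonzero, so A is invertible
```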
Eigenvalues and eigenvectors of a matrix
• Let A be a k × k matrix and I_k be the k × k identity matrix. Then the scalars λ satisfying the characteristic equation det(A - λ I_k) = 0 are called the eigenvalues of the matrix A.
Example 2.4: Consider the matrices
A = \begin{bmatrix} 1 & 2 \\ 2 & 1 \end{bmatrix} \quad \text{and} \quad I_2 = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}
Find the eigenvalues and associated eigenvectors.
Solution:
\det(A - \lambda I_2) = \det \begin{bmatrix} 1-\lambda & 2 \\ 2 & 1-\lambda \end{bmatrix} = (1-\lambda)^2 - 4 = 0
⇒ 1 - λ = -2 or 1 - λ = 2
⇒ λ_1 = 3 and λ_2 = -1 (we order the eigenvalues from largest to smallest).
The eigenvalues of the matrix A are 3 and -1.
• Associated with each eigenvalue λ_i of a square matrix A is an eigenvector x_i, which satisfies the system of equations
A x_i = λ_i x_i, or equivalently (A - λ_i I) x_i = 0.
For λ_1 = 3:
A x = λ_1 x ⇒ \begin{bmatrix} 1 & 2 \\ 2 & 1 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = 3 \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}
⇒ x_1 + 2x_2 = 3x_1 and 2x_1 + x_2 = 3x_2 ⇒ x_1 = x_2
Let x_1 = 1, so x_2 = 1; thus an eigenvector corresponding to λ_1 = 3 is
X_1 = \begin{bmatrix} 1 \\ 1 \end{bmatrix}
Similarly, the eigenvector corresponding to λ_2 = -1 satisfies A x = (-1)x:
\begin{bmatrix} 1 & 2 \\ 2 & 1 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = -\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}
x_1 + 2x_2 = -x_1 ⇒ x_1 = -x_2
2x_1 + x_2 = -x_2 ⇒ x_1 = -x_2
Thus, an eigenvector corresponding to λ_2 = -1 is
X_2 = \begin{bmatrix} 1 \\ -1 \end{bmatrix}
• Note that the eigenvectors are not unique, so we often normalize them; that is, we standardize them so that they have unit length.
The norm of X_1 is \|X_1\| = \sqrt{x_1^2 + x_2^2} = \sqrt{(1)^2 + (1)^2} = \sqrt{2}.
Thus, the normalized eigenvector corresponding to λ_1 = 3 is
e_1 = \begin{bmatrix} 1/\sqrt{2} \\ 1/\sqrt{2} \end{bmatrix}
The norm of e_1 is \|e_1\| = \sqrt{(1/\sqrt{2})^2 + (1/\sqrt{2})^2} = 1.
Similarly, the normalized eigenvector corresponding to λ_2 = -1 is
e_2 = \begin{bmatrix} 1/\sqrt{2} \\ -1/\sqrt{2} \end{bmatrix}
The norm of e_2 is \|e_2\| = \sqrt{(1/\sqrt{2})^2 + (-1/\sqrt{2})^2} = 1.
Let A be a k × k symmetric matrix having k eigenvalues λ_1, λ_2, \ldots, λ_k with corresponding normalized eigenvectors e_1, e_2, \ldots, e_k. Then the spectral decomposition of the matrix A is given by
A = λ_1 e_1 e_1' + λ_2 e_2 e_2' + \ldots + λ_k e_k e_k' = \sum_{i=1}^{k} λ_i e_i e_i'
We can express this in matrix form as A = P Λ P', where P = [e_1, e_2, \ldots, e_k] and Λ = diag(λ_1, λ_2, \ldots, λ_k).
Example 2.5: Consider the following matrix
A = \begin{bmatrix} 1 & 2 \\ 2 & 1 \end{bmatrix}
The eigenvalues of the matrix A are λ_1 = 3 and λ_2 = -1, so
Λ = diag(λ_1, λ_2) = \begin{bmatrix} 3 & 0 \\ 0 & -1 \end{bmatrix}
The normalized eigenvectors are
e_1 = \begin{bmatrix} 1/\sqrt{2} \\ 1/\sqrt{2} \end{bmatrix} \quad \text{and} \quad e_2 = \begin{bmatrix} 1/\sqrt{2} \\ -1/\sqrt{2} \end{bmatrix}
so P = (e_1, e_2) = \begin{bmatrix} 1/\sqrt{2} & 1/\sqrt{2} \\ 1/\sqrt{2} & -1/\sqrt{2} \end{bmatrix}
and the spectral decomposition is
A = P Λ P' = [e_1, e_2] \begin{bmatrix} λ_1 & 0 \\ 0 & λ_2 \end{bmatrix} \begin{bmatrix} e_1' \\ e_2' \end{bmatrix} = λ_1 e_1 e_1' + λ_2 e_2 e_2' = \begin{bmatrix} 1 & 2 \\ 2 & 1 \end{bmatrix}
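A short NumPy sketch of the spectral decomposition of this matrix (np.linalg.eigh returns normalized eigenvectors for symmetric matrices; illustrative only):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [2.0, 1.0]])

# eigh returns eigenvalues in ascending order with the normalized
# eigenvectors as the columns of P.
eigvals, P = np.linalg.eigh(A)
order = np.argsort(eigvals)[::-1]        # reorder from largest to smallest
eigvals, P = eigvals[order], P[:, order]

Lam = np.diag(eigvals)
print(eigvals)               # [ 3. -1.]
print(P @ Lam @ P.T)         # reconstructs A (spectral decomposition)
```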
2.3. Positive Definite Matrix
A symmetric matrix A is said to be positive definite if x'Ax > 0 for all vectors x ≠ 0 (the scalar x'Ax is called a quadratic form). Similarly, A is positive semi-definite if x'Ax ≥ 0 for all x ≠ 0. The diagonal elements a_{ii} of a positive definite matrix are positive.
The eigenvalues and eigenvectors of positive definite and positive semidefinite matrices have the following properties:
1. The eigenvalues of a positive definite matrix are all positive.
2. The eigenvalues of a positive semidefinite matrix are positive or zero, with the number of positive eigenvalues equal to the rank of the matrix.
It is customary to list the eigenvalues of a positive definite matrix in descending order: λ_1 > λ_2 > \ldots > λ_k. The eigenvectors x_1, x_2, \ldots, x_k are listed in the same order; x_1 corresponds to λ_1, x_2 corresponds to λ_2, and so on. If all elements of the positive definite matrix A are positive, then all elements of the first eigenvector are positive. (The first eigenvector is the one associated with the first eigenvalue, λ_1.)
Example 2.6: Consider the following matrix
A = \begin{bmatrix} 9 & -2 \\ -2 & 6 \end{bmatrix}
a. Is A symmetric?
b. Show that A is positive definite.
Solution
a. Since A = A', A is symmetric.
b. The quadratic form is
x'Ax = [x_1, x_2] \begin{bmatrix} 9 & -2 \\ -2 & 6 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = 9x_1^2 - 4x_1 x_2 + 6x_2^2 = (2x_1 - x_2)^2 + 5(x_1^2 + x_2^2) > 0 \quad \text{for } [x_1, x_2] \neq [0, 0]
Alternatively, the leading principal minors are positive: 9 > 0 and det(A) = 9(6) - (-2)(-2) = 54 - 4 = 50 > 0.
Therefore, we conclude that A is a positive definite matrix.
2.4. Square Root Matrix
If A is positive definite, the spectral decomposition of A, A = \sum_{i=1}^{k} λ_i e_i e_i', can be modified by taking the square roots of the eigenvalues to produce a square root matrix,
A^{1/2} = P Λ^{1/2} P' = \sum_{i=1}^{k} \sqrt{λ_i}\, e_i e_i'
where
Λ^{1/2} = \begin{bmatrix} \sqrt{λ_1} & 0 & \ldots & 0 \\ 0 & \sqrt{λ_2} & \ldots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \ldots & \sqrt{λ_k} \end{bmatrix}
The square root matrix A^{1/2} is symmetric and serves as the square root of A:
A^{1/2} A^{1/2} = (A^{1/2})^2 = A
Also, A^{-1} = P Λ^{-1} P' = \sum_{i=1}^{k} \frac{1}{λ_i} e_i e_i'.
The square root matrix of a positive definite matrix A has the following properties:
1. (A^{1/2})' = A^{1/2} (that is, A^{1/2} is symmetric).
2. A^{1/2} A^{1/2} = A.
3. (A^{1/2})^{-1} = \sum_{i=1}^{k} \frac{1}{\sqrt{λ_i}} e_i e_i' = P Λ^{-1/2} P', where Λ^{-1/2} is a diagonal matrix with 1/\sqrt{λ_i} as the i-th diagonal element.
4. A^{1/2} A^{-1/2} = A^{-1/2} A^{1/2} = I and A^{-1/2} A^{-1/2} = A^{-1}, where A^{-1/2} = (A^{1/2})^{-1}.
Example 2.7: Consider the following matrix
A = \begin{bmatrix} 9 & -2 \\ -2 & 6 \end{bmatrix}
a. Determine the eigenvalues and eigenvectors of A.
b. Write the spectral decomposition of A.
c. Find the square root matrix of A.
d. Find A^{-1}.
e. Find the eigenvalues and eigenvectors of A^{-1}.
Solution
a. Eigenvalues: |A - λ I_2| = 0
(9 - λ)(6 - λ) - 4 = λ^2 - 15λ + 50 = (λ - 10)(λ - 5) = 0
Therefore, λ_1 = 10 and λ_2 = 5.
Eigenvectors: using A x_i = λ_i x_i, we can find the eigenvectors, and the normalized eigenvectors are
e_1 = \begin{bmatrix} 2/\sqrt{5} \\ -1/\sqrt{5} \end{bmatrix} \quad \text{and} \quad e_2 = \begin{bmatrix} 1/\sqrt{5} \\ 2/\sqrt{5} \end{bmatrix}
b. The spectral decomposition of A is
A = \begin{bmatrix} 9 & -2 \\ -2 & 6 \end{bmatrix} = λ_1 e_1 e_1' + λ_2 e_2 e_2' = 10 \begin{bmatrix} 2/\sqrt{5} \\ -1/\sqrt{5} \end{bmatrix} [2/\sqrt{5}, -1/\sqrt{5}] + 5 \begin{bmatrix} 1/\sqrt{5} \\ 2/\sqrt{5} \end{bmatrix} [1/\sqrt{5}, 2/\sqrt{5}]
c. The square root matrix of A is
A^{1/2} = \sum_{i=1}^{2} \sqrt{λ_i}\, e_i e_i' = \sqrt{10} \begin{bmatrix} 2/\sqrt{5} \\ -1/\sqrt{5} \end{bmatrix} [2/\sqrt{5}, -1/\sqrt{5}] + \sqrt{5} \begin{bmatrix} 1/\sqrt{5} \\ 2/\sqrt{5} \end{bmatrix} [1/\sqrt{5}, 2/\sqrt{5}]
d. The inverse of matrix A is
A^{-1} = \frac{\text{adjoint of } A}{|A|} = \frac{1}{9(6) - (-2)(-2)} \begin{bmatrix} 6 & 2 \\ 2 & 9 \end{bmatrix} = \frac{1}{50} \begin{bmatrix} 6 & 2 \\ 2 & 9 \end{bmatrix}
e. The eigenvalues of A^{-1} are obtained by taking the reciprocals of the eigenvalues of A and then ordering them: λ_1 = 1/5 = 0.2 and λ_2 = 1/10 = 0.1.
The corresponding normalized eigenvectors are
e_1 = \begin{bmatrix} 1/\sqrt{5} \\ 2/\sqrt{5} \end{bmatrix} \quad \text{and} \quad e_2 = \begin{bmatrix} 2/\sqrt{5} \\ -1/\sqrt{5} \end{bmatrix}
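The square root matrix and inverse of Example 2.7 can be checked numerically through the eigendecomposition; a minimal NumPy sketch (illustrative only):

```python
import numpy as np

A = np.array([[9.0, -2.0],
              [-2.0, 6.0]])

# Spectral decomposition of the symmetric matrix A.
lam, P = np.linalg.eigh(A)                 # eigenvalues ascending: [5., 10.]

A_half = P @ np.diag(np.sqrt(lam)) @ P.T   # square root matrix A^(1/2)
A_inv  = P @ np.diag(1.0 / lam) @ P.T      # inverse via reciprocal eigenvalues

print(A_half @ A_half)    # recovers A
print(A_inv @ A)          # identity matrix
```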
2.5. Random Vectors and Matrices
A random vector is a vector whose elements are random variables (a random variable is a function that associates a real number with each element in the sample space). Similarly, a random matrix is a matrix whose elements are random variables. The expected value of a random matrix (or vector) is the matrix (vector) consisting of the expected values of each of its elements. The expected value of X, denoted by E(X), is the n × p matrix of numbers
E(X) = \begin{bmatrix} E(X_{11}) & E(X_{12}) & \ldots & E(X_{1p}) \\ E(X_{21}) & E(X_{22}) & \ldots & E(X_{2p}) \\ \vdots & \vdots & \ddots & \vdots \\ E(X_{n1}) & E(X_{n2}) & \ldots & E(X_{np}) \end{bmatrix}
where, for each element of the matrix:
• If X_{ij} is a discrete random variable with probability mass function p_{ij}(x_{ij}), then its marginal mean is given by
E(X_{ij}) = \sum_{\text{all } x_{ij}} x_{ij}\, p_{ij}(x_{ij})
• If X_{ij} is a continuous random variable with probability density function f_{ij}(x_{ij}), then its marginal mean is given by
E(X_{ij}) = \int_{-\infty}^{\infty} x_{ij}\, f_{ij}(x_{ij})\, dx_{ij}
2.6. Mean Vectors and Covariance Matrices
Suppose X = (X_1, X_2, \ldots, X_p)' is a p × 1 random vector. Then each element of X is a random variable with its own marginal probability distribution. The marginal means μ_i and variances σ_i^2 are defined as μ_i = E(X_i) and σ_i^2 = E(X_i - μ_i)^2, i = 1, 2, \ldots, p, respectively. Specifically,
• If X_i is a discrete random variable with probability mass function p_i(x_i), then its marginal mean is given by
μ_i = \sum_{\text{all } x_i} x_i\, p_i(x_i)
• If X_i is a continuous random variable with probability density function f_i(x_i), then its marginal mean is given by
μ_i = \int_{-\infty}^{\infty} x_i\, f_i(x_i)\, dx_i
• If X_i is a discrete random variable with probability mass function p_i(x_i), then its marginal variance is given by
σ_i^2 = \sum_{\text{all } x_i} (x_i - μ_i)^2\, p_i(x_i)
• If X_i is a continuous random variable with probability density function f_i(x_i), then its marginal variance is given by
σ_i^2 = \int_{-\infty}^{\infty} (x_i - μ_i)^2\, f_i(x_i)\, dx_i
It will be convenient in later sections to denote the marginal variances by σ_{ii} rather than the more traditional σ_i^2, and consequently we shall adopt this notation.
The behavior of any pair of random variables X_i and X_k is described by their joint probability function, and a measure of the linear association between them is provided by their covariance.
• If X_i and X_k are discrete random variables with joint probability function p_{ik}(x_i, x_k), then their covariance is given by
σ_{ik} = E(X_i - μ_i)(X_k - μ_k) = \sum_{\text{all } x_i} \sum_{\text{all } x_k} (x_i - μ_i)(x_k - μ_k)\, p_{ik}(x_i, x_k)
• If X_i and X_k are continuous random variables with joint density function f_{ik}(x_i, x_k), then their covariance is given by
σ_{ik} = E(X_i - μ_i)(X_k - μ_k) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} (x_i - μ_i)(x_k - μ_k)\, f_{ik}(x_i, x_k)\, dx_i\, dx_k
where μ_i and μ_k, i, k = 1, 2, \ldots, p, are the marginal means.
Note: When i = k, the covariance becomes the marginal variance.

More generally, the collective behavior of the p random variables X_1, X_2, \ldots, X_p, or equivalently the random vector X = (X_1, X_2, \ldots, X_p)', is described by a joint probability density function f(x_1, x_2, \ldots, x_p) = f(x).
• If the joint probability P[X_i ≤ x_i and X_k ≤ x_k] can be written as the product of the corresponding marginal probabilities, so that
P[X_i ≤ x_i and X_k ≤ x_k] = P[X_i ≤ x_i]\, P[X_k ≤ x_k]
for all pairs of values x_i, x_k, then X_i and X_k are said to be statistically independent.
• When X_i and X_k are continuous random variables with joint density f_{ik}(x_i, x_k) and marginal densities f_i(x_i) and f_k(x_k), the independence condition becomes
f_{ik}(x_i, x_k) = f_i(x_i)\, f_k(x_k) for all pairs (x_i, x_k).
• The p continuous random variables X_1, X_2, \ldots, X_p are mutually statistically independent if their joint density can be factored as
f_{1,2,\ldots,p}(x_1, x_2, \ldots, x_p) = f_1(x_1)\, f_2(x_2) \cdots f_p(x_p)
for all p-tuples (x_1, x_2, \ldots, x_p).
Statistical independence has an important implication for covariance: statistical independence implies that Cov(X_i, X_k) = 0. Thus, Cov(X_i, X_k) = 0 if X_i and X_k are independent.
Note: There are situations where Cov(X_i, X_k) = 0, but X_i and X_k are not independent.

The means and covariances of the p × 1 random vector X can be set out as matrices. The expected value of each element is contained in the vector of means μ = E(X), and the p variances σ_{ii} and the p(p-1)/2 distinct covariances σ_{ik} (i < k) are contained in the symmetric variance-covariance matrix Σ = E(X - μ)(X - μ)'. Specifically,
E(X) = \begin{bmatrix} E(X_1) \\ E(X_2) \\ \vdots \\ E(X_p) \end{bmatrix} = \begin{bmatrix} μ_1 \\ μ_2 \\ \vdots \\ μ_p \end{bmatrix} = μ
and
Σ = E(X - μ)(X - μ)'
  = E \begin{bmatrix} (X_1-μ_1)^2 & (X_1-μ_1)(X_2-μ_2) & \cdots & (X_1-μ_1)(X_p-μ_p) \\ (X_2-μ_2)(X_1-μ_1) & (X_2-μ_2)^2 & \cdots & (X_2-μ_2)(X_p-μ_p) \\ \vdots & \vdots & \ddots & \vdots \\ (X_p-μ_p)(X_1-μ_1) & (X_p-μ_p)(X_2-μ_2) & \cdots & (X_p-μ_p)^2 \end{bmatrix}
  = \begin{bmatrix} E(X_1-μ_1)^2 & E(X_1-μ_1)(X_2-μ_2) & \cdots & E(X_1-μ_1)(X_p-μ_p) \\ E(X_2-μ_2)(X_1-μ_1) & E(X_2-μ_2)^2 & \cdots & E(X_2-μ_2)(X_p-μ_p) \\ \vdots & \vdots & \ddots & \vdots \\ E(X_p-μ_p)(X_1-μ_1) & E(X_p-μ_p)(X_2-μ_2) & \cdots & E(X_p-μ_p)^2 \end{bmatrix}
or
Σ = Cov(X) = \begin{bmatrix} σ_{11} & σ_{12} & \cdots & σ_{1p} \\ σ_{21} & σ_{22} & \cdots & σ_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ σ_{p1} & σ_{p2} & \cdots & σ_{pp} \end{bmatrix}

Example 2.8: Find the covariance matrix for the two random variables X_1 and X_2 when their joint probability function p_{12}(x_1, x_2) is given below.
                    X_2
X_1          0       1       p_1(x_1)
-1          0.24    0.06     0.3
0           0.16    0.14     0.3
1           0.40    0.00     0.4
p_2(x_2)    0.8     0.2      1
From the marginal distributions, μ_1 = E(X_1) = 0.1 and μ_2 = E(X_2) = 0.2. In addition,
σ_{11} = E(X_1 - μ_1)^2 = \sum_{\text{all } x_1} (x_1 - 0.1)^2 p_1(x_1) = (-1-0.1)^2 (0.3) + (0-0.1)^2 (0.3) + (1-0.1)^2 (0.4) = 0.69
σ_{22} = E(X_2 - μ_2)^2 = \sum_{\text{all } x_2} (x_2 - 0.2)^2 p_2(x_2) = (0-0.2)^2 (0.8) + (1-0.2)^2 (0.2) = 0.16
σ_{12} = E(X_1 - μ_1)(X_2 - μ_2) = \sum_{\text{all pairs } (x_1, x_2)} (x_1 - 0.1)(x_2 - 0.2)\, p_{12}(x_1, x_2) = -0.08
σ_{21} = E(X_2 - μ_2)(X_1 - μ_1) = E(X_1 - μ_1)(X_2 - μ_2) = -0.08
Consequently, with X' = [X_1, X_2],
μ = E(X) = \begin{bmatrix} E(X_1) \\ E(X_2) \end{bmatrix} = \begin{bmatrix} 0.1 \\ 0.2 \end{bmatrix}
and
Σ = E(X - μ)(X - μ)' = E \begin{bmatrix} (X_1-μ_1)^2 & (X_1-μ_1)(X_2-μ_2) \\ (X_2-μ_2)(X_1-μ_1) & (X_2-μ_2)^2 \end{bmatrix} = \begin{bmatrix} 0.69 & -0.08 \\ -0.08 & 0.16 \end{bmatrix}
We note that the computation of means, variances, and covariances for discrete random
variables involves summation, while analogous computations for continuous random
variables involve integration.
Because σ_{ik} = E(X_i - μ_i)(X_k - μ_k) = σ_{ki}, it is convenient to write the matrix Σ as
Σ = E(X - μ)(X - μ)' = \begin{bmatrix} σ_{11} & σ_{12} & \cdots & σ_{1p} \\ σ_{12} & σ_{22} & \cdots & σ_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ σ_{1p} & σ_{2p} & \cdots & σ_{pp} \end{bmatrix}
We shall refer to μ and Σ as the population mean (vector) and the population variance-covariance matrix, respectively.

The population correlation coefficient ρ_{ik} is defined in terms of the covariance σ_{ik} and the variances σ_{ii} and σ_{kk} as
ρ_{ik} = \frac{σ_{ik}}{\sqrt{σ_{ii}}\sqrt{σ_{kk}}}
The correlation coefficient measures the amount of linear association between the random variables X_i and X_k. Let the population correlation matrix ρ be the p × p symmetric matrix defined as
ρ = \begin{bmatrix} \frac{σ_{11}}{\sqrt{σ_{11}}\sqrt{σ_{11}}} & \frac{σ_{12}}{\sqrt{σ_{11}}\sqrt{σ_{22}}} & \cdots & \frac{σ_{1p}}{\sqrt{σ_{11}}\sqrt{σ_{pp}}} \\ \frac{σ_{21}}{\sqrt{σ_{22}}\sqrt{σ_{11}}} & \frac{σ_{22}}{\sqrt{σ_{22}}\sqrt{σ_{22}}} & \cdots & \frac{σ_{2p}}{\sqrt{σ_{22}}\sqrt{σ_{pp}}} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{σ_{p1}}{\sqrt{σ_{pp}}\sqrt{σ_{11}}} & \frac{σ_{p2}}{\sqrt{σ_{pp}}\sqrt{σ_{22}}} & \cdots & \frac{σ_{pp}}{\sqrt{σ_{pp}}\sqrt{σ_{pp}}} \end{bmatrix}
and let the p × p standard deviation matrix be
V^{1/2} = \begin{bmatrix} \sqrt{σ_{11}} & 0 & \cdots & 0 \\ 0 & \sqrt{σ_{22}} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sqrt{σ_{pp}} \end{bmatrix}
The '1/2' reminds us that this is a diagonal matrix with the square roots of the variances on the diagonal.
Then it is easily verified that
V^{1/2}\, ρ\, V^{1/2} = Σ \quad \text{and} \quad ρ = (V^{1/2})^{-1}\, Σ\, (V^{1/2})^{-1}
That is, Σ can be obtained from V^{1/2} and ρ, whereas ρ can be obtained from Σ.
Example 2.9: Using the following covariance matrix
Σ = \begin{bmatrix} 4 & 1 & 2 \\ 1 & 9 & -3 \\ 2 & -3 & 25 \end{bmatrix}
obtain V^{1/2} and ρ.
Solution
V^{1/2} = \begin{bmatrix} \sqrt{σ_{11}} & 0 & 0 \\ 0 & \sqrt{σ_{22}} & 0 \\ 0 & 0 & \sqrt{σ_{33}} \end{bmatrix} = \begin{bmatrix} 2 & 0 & 0 \\ 0 & 3 & 0 \\ 0 & 0 & 5 \end{bmatrix} \quad \text{and} \quad (V^{1/2})^{-1} = \begin{bmatrix} 1/2 & 0 & 0 \\ 0 & 1/3 & 0 \\ 0 & 0 & 1/5 \end{bmatrix}
Consequently, the correlation matrix ρ is given by
ρ = (V^{1/2})^{-1}\, Σ\, (V^{1/2})^{-1} = \begin{bmatrix} 1/2 & 0 & 0 \\ 0 & 1/3 & 0 \\ 0 & 0 & 1/5 \end{bmatrix} \begin{bmatrix} 4 & 1 & 2 \\ 1 & 9 & -3 \\ 2 & -3 & 25 \end{bmatrix} \begin{bmatrix} 1/2 & 0 & 0 \\ 0 & 1/3 & 0 \\ 0 & 0 & 1/5 \end{bmatrix} = \begin{bmatrix} 1 & 1/6 & 1/5 \\ 1/6 & 1 & -1/5 \\ 1/5 & -1/5 & 1 \end{bmatrix}
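A small NumPy sketch of the relation ρ = (V^{1/2})^{-1} Σ (V^{1/2})^{-1} applied to Example 2.9 (illustrative only):

```python
import numpy as np

Sigma = np.array([[4.0,  1.0,  2.0],
                  [1.0,  9.0, -3.0],
                  [2.0, -3.0, 25.0]])

V_half_inv = np.diag(1.0 / np.sqrt(np.diag(Sigma)))   # (V^(1/2))^(-1)
rho = V_half_inv @ Sigma @ V_half_inv                  # correlation matrix

print(np.round(rho, 3))
# [[ 1.     0.167  0.2  ]
#  [ 0.167  1.    -0.2  ]
#  [ 0.2   -0.2    1.   ]]
```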

Chapter Three
3. Multivariate Normal Distribution
3.1. Multivariate normal density
A probability distribution that plays a pivotal role in much of multivariate analysis is the multivariate normal distribution. Since the multivariate normal density is an extension of the univariate normal density and shares many of its features, we first review the univariate normal density function.
Review of the Univariate Normal Density
A random variable x is said to follow a univariate normal distribution with mean μ and variance σ^2 if x has the density
f(x) = \frac{1}{\sqrt{2π}\, σ} e^{-(x-μ)^2 / 2σ^2}, \quad -\infty < x < \infty    (3.1)
We write x \sim N(μ, σ^2), or simply say x is distributed as N(μ, σ^2).
Multivariate Normal Density
A random vector X = (X_1, X_2, \ldots, X_p)' is said to follow a multivariate normal distribution with mean vector μ and positive definite covariance matrix Σ if X has the density
f(x) = \frac{1}{(2π)^{p/2} |Σ|^{1/2}} \exp\left( -\frac{1}{2} (x-μ)' Σ^{-1} (x-μ) \right)    (3.2)
where p is the number of variables. We say that x is distributed as N_p(μ, Σ), or simply x \sim N_p(μ, Σ). The term (x-μ)^2 / σ^2 = (x-μ)(σ^2)^{-1}(x-μ) in the exponent of the univariate normal density measures the squared distance from x to μ in standard deviation units. Similarly, the term (x-μ)' Σ^{-1} (x-μ) in the exponent of the multivariate normal density is the squared generalized distance from x to μ. In the coefficient of the exponential function in the multivariate normal density, |Σ|^{1/2} appears as the analogue of \sqrt{σ^2} in the univariate normal density.
This p-dimensional normal density function is denoted by N_p(μ, Σ), where
μ = \begin{bmatrix} μ_1 \\ μ_2 \\ \vdots \\ μ_p \end{bmatrix} \quad \text{and} \quad Σ = \begin{bmatrix} σ_{11} & σ_{12} & \cdots & σ_{1p} \\ σ_{21} & σ_{22} & \cdots & σ_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ σ_{p1} & σ_{p2} & \cdots & σ_{pp} \end{bmatrix}
The simplest multivariate normal distribution is the bivariate (2-dimensional) normal distribution, which has the density function
f(x) = \frac{1}{2π |Σ|^{1/2}} \exp\left( -\frac{1}{2} (x-μ)' Σ^{-1} (x-μ) \right), \quad -\infty < x_i < \infty, \; i = 1, 2
This 2-dimensional normal density function is denoted by N_2(μ, Σ), where
X = \begin{bmatrix} X_1 \\ X_2 \end{bmatrix}, \quad μ = \begin{bmatrix} μ_1 \\ μ_2 \end{bmatrix}, \quad Σ = \begin{bmatrix} σ_{11} & σ_{12} \\ σ_{21} & σ_{22} \end{bmatrix}
We can easily find the inverse of the covariance matrix:
Σ^{-1} = \frac{1}{σ_{11} σ_{22} - σ_{12}^2} \begin{bmatrix} σ_{22} & -σ_{12} \\ -σ_{12} & σ_{11} \end{bmatrix}
If we replace σ_{12} by ρ_{12} \sqrt{σ_{11}} \sqrt{σ_{22}}, then we get
Σ^{-1} = \frac{1}{σ_{11} σ_{22} (1 - ρ_{12}^2)} \begin{bmatrix} σ_{22} & -ρ_{12} \sqrt{σ_{11}} \sqrt{σ_{22}} \\ -ρ_{12} \sqrt{σ_{11}} \sqrt{σ_{22}} & σ_{11} \end{bmatrix}
By substitution, we can now write the squared distance as
(x-μ)' Σ^{-1} (x-μ) = [x_1-μ_1, x_2-μ_2] \frac{1}{σ_{11} σ_{22} (1-ρ_{12}^2)} \begin{bmatrix} σ_{22} & -ρ_{12}\sqrt{σ_{11}}\sqrt{σ_{22}} \\ -ρ_{12}\sqrt{σ_{11}}\sqrt{σ_{22}} & σ_{11} \end{bmatrix} \begin{bmatrix} x_1-μ_1 \\ x_2-μ_2 \end{bmatrix}
= \frac{σ_{22}(x_1-μ_1)^2 + σ_{11}(x_2-μ_2)^2 - 2ρ_{12}\sqrt{σ_{11}}\sqrt{σ_{22}}(x_1-μ_1)(x_2-μ_2)}{σ_{11} σ_{22}(1-ρ_{12}^2)}
= \frac{1}{1-ρ_{12}^2} \left[ \left(\frac{x_1-μ_1}{\sqrt{σ_{11}}}\right)^2 + \left(\frac{x_2-μ_2}{\sqrt{σ_{22}}}\right)^2 - 2ρ_{12}\left(\frac{x_1-μ_1}{\sqrt{σ_{11}}}\right)\left(\frac{x_2-μ_2}{\sqrt{σ_{22}}}\right) \right]
= \frac{1}{1-ρ_{12}^2} \left[ z_1^2 + z_2^2 - 2ρ_{12} z_1 z_2 \right]
Then we can rewrite the bivariate normal probability density function as
f(x_1, x_2) = \frac{1}{2π \sqrt{σ_{11} σ_{22}(1-ρ_{12}^2)}} \exp\left\{ -\frac{1}{2(1-ρ_{12}^2)} \left[ \left(\frac{x_1-μ_1}{\sqrt{σ_{11}}}\right)^2 + \left(\frac{x_2-μ_2}{\sqrt{σ_{22}}}\right)^2 - 2ρ_{12}\left(\frac{x_1-μ_1}{\sqrt{σ_{11}}}\right)\left(\frac{x_2-μ_2}{\sqrt{σ_{22}}}\right) \right] \right\}
If σ_{12} = 0, or equivalently ρ_{12} = 0, then X_1 and X_2 are uncorrelated. For the bivariate normal, σ_{12} = 0 implies that X_1 and X_2 are statistically independent, and then the joint density can be written as the product of two univariate normal densities:
f(x) = \frac{1}{2π \sqrt{σ_{11} σ_{22}}} \exp\left\{ -\frac{1}{2} \left[ \left(\frac{x_1-μ_1}{\sqrt{σ_{11}}}\right)^2 + \left(\frac{x_2-μ_2}{\sqrt{σ_{22}}}\right)^2 \right] \right\}
= \frac{1}{\sqrt{2π σ_{11}}} \exp\left[ -\frac{1}{2}\left(\frac{x_1-μ_1}{\sqrt{σ_{11}}}\right)^2 \right] \cdot \frac{1}{\sqrt{2π σ_{22}}} \exp\left[ -\frac{1}{2}\left(\frac{x_2-μ_2}{\sqrt{σ_{22}}}\right)^2 \right]
= f(x_1) \cdot f(x_2)
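A small numerical illustration of this factorization when σ_{12} = 0, using SciPy (the parameter values below are arbitrary and chosen only for illustration):

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

# Illustrative values (not from the notes): a mean vector and a diagonal
# covariance matrix, so sigma_12 = 0.
mu = np.array([1.0, -2.0])
Sigma = np.array([[4.0, 0.0],
                  [0.0, 9.0]])

x = np.array([0.5, -1.0])

joint = multivariate_normal(mean=mu, cov=Sigma).pdf(x)
product = (norm(mu[0], np.sqrt(Sigma[0, 0])).pdf(x[0])
           * norm(mu[1], np.sqrt(Sigma[1, 1])).pdf(x[1]))

print(joint, product)   # the two values agree: f(x1, x2) = f(x1) * f(x2)
```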
Slices of the multivariate normal density
All points of equal density form a contour. The multivariate normal density is constant on surfaces where the squared distance (x-μ)' Σ^{-1} (x-μ) is constant.
Contours of constant density for the p-dimensional normal distribution are ellipsoids defined by the x such that
(x-μ)' Σ^{-1} (x-μ) = c^2
These ellipsoids are centered at μ and have axes ±c\sqrt{λ_i}\, e_i, where Σ e_i = λ_i e_i for i = 1, 2, \ldots, p.
For the bivariate normal, you get an ellipse whose equation is
(x-μ)' Σ^{-1} (x-μ) = c^2, which gives all (x_1, x_2) pairs with constant density.
The ellipses are called contours, and all are centered at μ.
A constant density contour equals
\{ \text{all } x \text{ such that } (x-μ)' Σ^{-1} (x-μ) = c^2 \} = \{ \text{surface of an ellipsoid centered at } μ \}
Probability Contours: Axes of the ellipsoid
Important points:
If X \sim N_p(μ, Σ), then (x-μ)' Σ^{-1} (x-μ) \sim χ^2_p. The solid ellipsoid of values x that satisfy
(x-μ)' Σ^{-1} (x-μ) ≤ c^2 = χ^2_p(α)
has probability (1-α). That is,
P\left[ (x-μ)' Σ^{-1} (x-μ) ≤ χ^2_p(α) \right] = 1-α
where χ^2_p(α) is the (1-α)100% point of the chi-square distribution with p degrees of freedom.
Example: For the following bivariate normal distribution (x ~ N_2(μ, Σ)), find the major and minor axes of the constant density ellipses, where
μ = \begin{pmatrix} 5 \\ 10 \end{pmatrix} \quad \text{and} \quad Σ = \begin{bmatrix} 9 & 16 \\ 16 & 64 \end{bmatrix}
and we want the 95% probability contour. The upper 5% point of the chi-square distribution with 2 degrees of freedom is χ^2_2(0.05) = 5.9915, so c = \sqrt{5.9915} = 2.4478.
Axes: μ ± c\sqrt{λ_i}\, e_i, where (λ_i, e_i) is the i-th (i = 1, 2) eigenvalue/eigenvector pair of Σ:
λ_1 = 68.316, \quad e_1' = (0.2604, 0.9655)
λ_2 = 4.684, \quad e_2' = (0.9655, -0.2604)
Major axis
Using the largest eigenvalue and the corresponding eigenvector:
\begin{pmatrix} 5 \\ 10 \end{pmatrix} ± 2.4478 \sqrt{68.316} \begin{pmatrix} 0.2604 \\ 0.9655 \end{pmatrix} = \begin{pmatrix} 5 \\ 10 \end{pmatrix} ± \begin{pmatrix} 5.27 \\ 19.53 \end{pmatrix}
Minor axis
The same process, but now using λ_2 and e_2, the smallest eigenvalue and the corresponding eigenvector:
\begin{pmatrix} 5 \\ 10 \end{pmatrix} ± 2.4478 \sqrt{4.684} \begin{pmatrix} 0.9655 \\ -0.2604 \end{pmatrix} = \begin{pmatrix} 5 \\ 10 \end{pmatrix} ± \begin{pmatrix} 5.12 \\ -1.38 \end{pmatrix}
Graph of the 95% probability contour
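The axis computation above can be sketched in NumPy/SciPy as follows (illustrative only; chi2.ppf(0.95, 2) gives the 5.9915 cutoff used above):

```python
import numpy as np
from scipy.stats import chi2

mu = np.array([5.0, 10.0])
Sigma = np.array([[9.0, 16.0],
                  [16.0, 64.0]])

c = np.sqrt(chi2.ppf(0.95, df=2))          # c = sqrt(chi^2_2(0.05)) = 2.4478
lam, e = np.linalg.eigh(Sigma)              # eigenvalues in ascending order
order = np.argsort(lam)[::-1]               # largest eigenvalue first
lam, e = lam[order], e[:, order]

for i in range(2):
    half_axis = c * np.sqrt(lam[i]) * e[:, i]
    print(mu + half_axis, mu - half_axis)   # endpoints of the i-th axis
```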
Equation for the contour
(x-μ)' Σ^{-1} (x-μ) ≤ 5.99
((x_1-5), (x_2-10)) \begin{bmatrix} 9 & 16 \\ 16 & 64 \end{bmatrix}^{-1} \begin{pmatrix} x_1-5 \\ x_2-10 \end{pmatrix} = 0.2(x_1-5)^2 + 0.028(x_2-10)^2 - 0.1(x_1-5)(x_2-10) ≤ 5.99
(x-μ)' Σ^{-1} (x-μ) is a quadratic form, i.e., the equation of a polynomial in x_1 and x_2.
Points inside or outside
Are the following points inside or outside the 95% probability contour?
Is the point (10, 20) inside or outside the 95% probability contour?
(10, 20) → 0.2(10-5)^2 + 0.028(20-10)^2 - 0.1(10-5)(20-10)
= 0.2(25) + 0.028(100) - 0.1(50)
= 2.8, which is less than 5.99, so the point is inside the contour.
Is the point (16, 20) inside or outside the 95% probability contour?
(16, 20) → 0.2(16-5)^2 + 0.028(20-10)^2 - 0.1(16-5)(20-10)
= 0.2(121) + 0.028(100) - 0.1(11)(10)
= 16, which exceeds 5.99, so the point is outside the contour.
Graph of points inside and outside the 95% probability contour
Example: The general form of the contours for a bivariate normal probability distribution where the variables have equal variance (σ_{11} = σ_{22}) is relatively easy to derive.
First we need the eigenvalues of Σ:
0 = |Σ - λI| = \begin{vmatrix} σ_{11} - λ & σ_{12} \\ σ_{12} & σ_{11} - λ \end{vmatrix} = (σ_{11} - λ)^2 - σ_{12}^2
Consequently, the eigenvalues are λ_1 = σ_{11} + σ_{12} and λ_2 = σ_{11} - σ_{12}.
Next we need the eigenvectors of Σ, from Σ e_i = λ_i e_i:
\begin{bmatrix} σ_{11} & σ_{12} \\ σ_{12} & σ_{11} \end{bmatrix} \begin{bmatrix} e_1 \\ e_2 \end{bmatrix} = (σ_{11} + σ_{12}) \begin{bmatrix} e_1 \\ e_2 \end{bmatrix}
σ_{11} e_1 + σ_{12} e_2 = (σ_{11} + σ_{12}) e_1
σ_{12} e_1 + σ_{11} e_2 = (σ_{11} + σ_{12}) e_2
which implies e_1 = e_2, and after normalization the first eigenvector is e_1' = [1/\sqrt{2}, 1/\sqrt{2}].
Similarly, λ_2 = σ_{11} - σ_{12} yields the eigenvector e_2' = [1/\sqrt{2}, -1/\sqrt{2}].
When the covariance (σ_{12}) or correlation (ρ_{12}) is positive, λ_1 = σ_{11} + σ_{12} is the largest eigenvalue, and its associated eigenvector e_1' = [1/\sqrt{2}, 1/\sqrt{2}] lies along the 45° line through the point μ' = [μ_1, μ_2]. This is true for any positive value of the covariance (correlation). Since the axes of the constant density ellipses are given by ±c\sqrt{λ_1}\, e_1 and ±c\sqrt{λ_2}\, e_2, and the eigenvectors each have unit length, the major axis will be associated with the largest eigenvalue. For positively correlated normal random variables, then, the major axis of the constant density ellipses will be along the 45° line through μ.
To summarize, the axes of the ellipses of constant density for a bivariate normal distribution with σ_{11} = σ_{22} are determined by
±c\sqrt{σ_{11} + σ_{12}} \begin{bmatrix} 1/\sqrt{2} \\ 1/\sqrt{2} \end{bmatrix} \quad \text{and} \quad ±c\sqrt{σ_{11} - σ_{12}} \begin{bmatrix} 1/\sqrt{2} \\ -1/\sqrt{2} \end{bmatrix}
Properties of the Multivariate Normal Distribution
For any multivariate normal random vector X:
1. The density
f(x) = \frac{1}{(2π)^{p/2} |Σ|^{1/2}} \exp\left( -\frac{1}{2} (x-μ)' Σ^{-1} (x-μ) \right)
has its maximum value at
μ = \begin{bmatrix} μ_1 \\ μ_2 \\ \vdots \\ μ_p \end{bmatrix}
2. The density is symmetric along its constant density contours and is centered at μ; i.e., the mean is equal to the median.
3. If X ~ N_p(μ, Σ), then linear combinations of the components of X are (multivariate) normally distributed.
4. If X ~ N_p(μ, Σ), then all subsets of the components of X have a (multivariate) normal distribution.
5. If X ~ N_p(μ, Σ), then zero covariance implies that the corresponding components of X are independently distributed.
6. If X ~ N_p(μ, Σ), then the conditional distributions of the components of X are (multivariate) normal.
Some Important Results Regarding the Multivariate Normal Distribution
1. If X ~ N_p(μ, Σ), then any linear combination of the variables a'X = a_1 X_1 + a_2 X_2 + \ldots + a_p X_p is distributed as N(a'μ, a'Σa). Also, if a'X ~ N(a'μ, a'Σa) for every a, then X ~ N_p(μ, Σ).
2. If X ~ N_p(μ, Σ), then any set of q linear combinations
AX = \begin{bmatrix} \sum_{i=1}^{p} a_{1i} X_i \\ \sum_{i=1}^{p} a_{2i} X_i \\ \vdots \\ \sum_{i=1}^{p} a_{qi} X_i \end{bmatrix} \text{ is distributed as } N_q(Aμ, AΣA').
Furthermore, if d is a vector of constants, then X + d ~ N_p(μ + d, Σ).
3. If X ~ N_p(μ, Σ), then all subsets of X are (multivariate) normally distributed; i.e., for any partition
X_{(p \times 1)} = \begin{bmatrix} X_1 \\ X_2 \end{bmatrix}, \; X_1 \text{ of dimension } q \times 1, \; X_2 \text{ of dimension } (p-q) \times 1, \quad μ = \begin{bmatrix} μ_1 \\ μ_2 \end{bmatrix}, \quad Σ = \begin{bmatrix} Σ_{11} & Σ_{12} \\ Σ_{21} & Σ_{22} \end{bmatrix},
then X_1 ~ N_q(μ_1, Σ_{11}) and X_2 ~ N_{p-q}(μ_2, Σ_{22}).
4. If X_1 ~ N_{q_1}(μ_1, Σ_{11}) and X_2 ~ N_{q_2}(μ_2, Σ_{22}) are independent, then Cov(X_1, X_2) = Σ_{12} = 0; and if
\begin{bmatrix} X_1 \\ X_2 \end{bmatrix} ~ N_{q_1+q_2}\left( \begin{bmatrix} μ_1 \\ μ_2 \end{bmatrix}, \begin{bmatrix} Σ_{11} & Σ_{12} \\ Σ_{21} & Σ_{22} \end{bmatrix} \right),
then X_1 and X_2 are independent if and only if Σ_{12} = 0.
Also, if X_1 ~ N_{q_1}(μ_1, Σ_{11}) and X_2 ~ N_{q_2}(μ_2, Σ_{22}) are independent, then
\begin{bmatrix} X_1 \\ X_2 \end{bmatrix} ~ N_{q_1+q_2}\left( \begin{bmatrix} μ_1 \\ μ_2 \end{bmatrix}, \begin{bmatrix} Σ_{11} & 0 \\ 0 & Σ_{22} \end{bmatrix} \right).
5. If X ~ N_p(μ, Σ) and |Σ| > 0, then (x-μ)' Σ^{-1} (x-μ) ~ χ^2_p, and the N_p(μ, Σ) distribution assigns probability 1-α to the solid ellipsoid
\{ x : (x-μ)' Σ^{-1} (x-μ) ≤ χ^2_p(α) \}.
3.2. Sampling from the multivariate normal distribution
Let the observation vectors X_1, X_2, \ldots, X_n denote a random sample (independent observations) from a p-variate normal distribution with mean vector μ and covariance matrix Σ. Then the joint density function of the X_i is
f(x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} \frac{1}{(2π)^{p/2} |Σ|^{1/2}} \exp\left( -\frac{1}{2} (x_i-μ)' Σ^{-1} (x_i-μ) \right) = \frac{1}{(2π)^{np/2} |Σ|^{n/2}} \exp\left( -\frac{1}{2} \sum_{i=1}^{n} (x_i-μ)' Σ^{-1} (x_i-μ) \right)
Trace
Let A be a k × k symmetric matrix and x be a k × 1 vector. Then
a. x'Ax = tr(x'Ax) = tr(A x x')
b. tr(A) = \sum_{i=1}^{k} λ_i, where the λ_i are the eigenvalues of A.
Now the exponent in the joint density can be simplified as
(x_i-μ)' Σ^{-1} (x_i-μ) = tr[(x_i-μ)' Σ^{-1} (x_i-μ)] = tr[Σ^{-1} (x_i-μ)(x_i-μ)']
Next,
\sum_{i=1}^{n} (x_i-μ)' Σ^{-1} (x_i-μ) = \sum_{i=1}^{n} tr[(x_i-μ)' Σ^{-1} (x_i-μ)] = \sum_{i=1}^{n} tr[Σ^{-1} (x_i-μ)(x_i-μ)'] = tr\left[ Σ^{-1} \left( \sum_{i=1}^{n} (x_i-μ)(x_i-μ)' \right) \right]
since the trace of a sum of matrices is equal to the sum of the traces of the matrices.
We can add and subtract \bar{x} = \sum_{i=1}^{n} x_i / n in each term (x_i-μ) in \sum_{i=1}^{n} (x_i-μ)(x_i-μ)' to give
\sum_{i=1}^{n} (x_i - \bar{x} + \bar{x} - μ)(x_i - \bar{x} + \bar{x} - μ)' = \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})' + \sum_{i=1}^{n} (\bar{x} - μ)(\bar{x} - μ)' = \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})' + n(\bar{x} - μ)(\bar{x} - μ)'
The cross-product terms \sum_{i=1}^{n} (x_i - \bar{x})(\bar{x} - μ)' and \sum_{i=1}^{n} (\bar{x} - μ)(x_i - \bar{x})' are both matrices of zeros. We can therefore write the joint density of a random sample from a multivariate normal population as
f(x_1, x_2, \ldots, x_n) = \frac{1}{(2π)^{np/2} |Σ|^{n/2}} \exp\left\{ -\frac{1}{2} tr\left[ Σ^{-1} \left( \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})' + n(\bar{x} - μ)(\bar{x} - μ)' \right) \right] \right\}
3.3. Maximum likelihood estimation
When the numerical values of the observations become available, they may be substituted for
the
x i in the joint density. The resulting expression, now considered as a function of μ and
∑ ¿¿ for the fixed set of observations x 1 , x 2 ,…, x n , is called the likelihood. One meaning of
best is to select the parameter values that maximize the joint density evaluated at the
observations. This technique is called maximum likelihood estimation, and the maximizing
parameter values are called maximum likelihood estimates.

Maximum likelihood estimation of μ and Σ
Let X_1, X_2, \ldots, X_n be a random sample from a normal population with mean μ and covariance Σ. Then
\hat{μ} = \bar{X} \quad \text{and} \quad \hat{Σ} = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})(X_i - \bar{X})' = \frac{n-1}{n} S
are the maximum likelihood estimators of μ and Σ, respectively. Their observed values, \bar{x} and \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})', are the maximum likelihood estimates of μ and Σ.
We can derive the maximum likelihood estimators of the mean vector μ and covariance matrix Σ as follows. The likelihood function is
L(μ, Σ) = \prod_{i=1}^{n} \frac{1}{(2π)^{p/2} |Σ|^{1/2}} \exp\left( -\frac{1}{2} (x_i-μ)' Σ^{-1} (x_i-μ) \right) = \frac{1}{(2π)^{np/2} |Σ|^{n/2}} \exp\left( -\frac{1}{2} \sum_{i=1}^{n} (x_i-μ)' Σ^{-1} (x_i-μ) \right)
and the log-likelihood function is
l(μ, Σ) = -\frac{np}{2} \ln(2π) - \frac{n}{2} \ln|Σ| - \frac{1}{2} \sum_{i=1}^{n} (x_i-μ)' Σ^{-1} (x_i-μ)
To find the maximum likelihood estimators of μ and Σ, we need to find \hat{μ} and \hat{Σ} that maximize L(μ, Σ), or equivalently maximize l(μ, Σ).
Note:
\sum_{i=1}^{n} (X_i-μ)' Σ^{-1} (X_i-μ) = \sum_{i=1}^{n} X_i' Σ^{-1} X_i - 2\left( \sum_{i=1}^{n} X_i \right)' Σ^{-1} μ + n μ' Σ^{-1} μ
Thus,
\frac{\partial l(μ, Σ)}{\partial μ} = -\frac{1}{2} \frac{\partial}{\partial μ} \left( \sum_{i=1}^{n} X_i' Σ^{-1} X_i - 2\left( \sum_{i=1}^{n} X_i \right)' Σ^{-1} μ + n μ' Σ^{-1} μ \right) = 0
⇒ Σ^{-1} \left( \sum_{i=1}^{n} X_i \right) - n Σ^{-1} μ = 0 ⇒ \left( \sum_{i=1}^{n} X_i \right) = n μ
Hence,
\hat{μ} = \frac{1}{n} \left( \sum_{i=1}^{n} X_i \right) = \bar{X}
Now, differentiating the log-likelihood with respect to Σ,
\frac{\partial l(μ, Σ)}{\partial Σ} = -\frac{n}{2} Σ^{-1} + \frac{1}{2} Σ^{-1} \left( \sum_{i=1}^{n} (X_i-μ)(X_i-μ)' \right) Σ^{-1} = 0
⇒ \hat{Σ} = \frac{1}{n} \sum_{i=1}^{n} (X_i - \hat{μ})(X_i - \hat{μ})' = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})(X_i - \bar{X})' = \frac{n-1}{n} S
where S is the sample covariance matrix.
In general, the maximum likelihood estimators of μ and Σ are
\hat{μ} = \frac{1}{n} \left( \sum_{i=1}^{n} X_i \right) = \bar{X} \quad \text{and} \quad \hat{Σ} = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})(X_i - \bar{X})' = \frac{n-1}{n} S
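A minimal simulation sketch contrasting the maximum likelihood estimator (divisor n) with the sample covariance matrix S (divisor n-1); the population values below are arbitrary illustrations:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative population parameters (not from the notes).
mu_true = np.array([0.0, 2.0])
Sigma_true = np.array([[2.0, 0.5],
                       [0.5, 1.0]])
X = rng.multivariate_normal(mu_true, Sigma_true, size=200)
n = X.shape[0]

mu_hat = X.mean(axis=0)                   # MLE of mu
S = np.cov(X, rowvar=False, ddof=1)       # sample covariance (divisor n-1)
Sigma_hat = (n - 1) / n * S               # MLE of Sigma (divisor n)

print(mu_hat)
print(Sigma_hat)
```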
Sufficient Statistics
A sufficient statistic is one that, from a certain perspective, contains all the necessary information for making inferences about the unknown parameters in a given model. By making inferences, we mean the usual conclusions about parameters, such as estimators, significance tests, and confidence intervals. Let X_1, X_2, \ldots, X_n be a random sample from a multivariate normal population with mean μ and covariance Σ. Then
\bar{X} \quad \text{and} \quad S = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})(X_i - \bar{X})'
are sufficient statistics.
• The importance of sufficient statistics for normal populations is that all of the information about μ and Σ in the data matrix X is contained in \bar{X} and S, regardless of the sample size n.
• This generally is not true for non-normal populations.
• Since many multivariate techniques begin with sample means and covariances, it is prudent to check the adequacy of the multivariate normal assumption.
• If the data cannot be regarded as multivariate normal, techniques that depend solely on \bar{X} and S may be ignoring other useful sample information.

3.4. Sampling distribution of \bar{X} and S
The univariate case
Let x_1, x_2, \ldots, x_n be a random sample of size n from a univariate normal distribution with mean μ and variance σ^2. Then
• \bar{x} is distributed as N(μ, σ^2/n).
• (n-1)s^2/σ^2 has a chi-square distribution with n - 1 degrees of freedom.
• \bar{x} and s^2 are independent random variables.
The multivariate case
Let X_1, X_2, \ldots, X_n be a random sample of size n from a p-variate normal distribution with mean vector μ and covariance matrix Σ. Then
• \bar{X} is distributed as N_p(μ, \frac{1}{n}Σ).
• The matrix (n-1)S = \sum_{i=1}^{n} (X_i - \bar{X})(X_i - \bar{X})' is distributed as a Wishart random matrix, W_p(Σ), with n - 1 degrees of freedom.
• \bar{X} and S are independent.
Wishart Distribution
The sampling distribution of the sample covariance matrix is called the Wishart distribution, after its discoverer. It is defined as the sum of independent products of multivariate normal random vectors. Specifically,
W_m(\cdot \mid Σ) = Wishart distribution with m degrees of freedom = distribution of \sum_{i=1}^{m} Z_i Z_i'
where the Z_i are each independently distributed as N_p(0, Σ).
The Wishart distribution is the multivariate analogue of the chi-square distribution.

Properties of the Wishart Distribution
1. If A_1 is distributed as W_{m_1}(A_1 \mid Σ) independently of A_2, which is distributed as W_{m_2}(A_2 \mid Σ), then A_1 + A_2 is distributed as W_{m_1 + m_2}(A_1 + A_2 \mid Σ). That is, the degrees of freedom add.
2. If A is distributed as W_m(A \mid Σ), then CAC' is distributed as W_m(CAC' \mid CΣC').
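A Wishart random matrix can be simulated directly from its definition as a sum of outer products of N_p(0, Σ) vectors; a minimal sketch with arbitrary illustrative parameters:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative parameters (not from the notes).
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
m = 5                                    # degrees of freedom
p = Sigma.shape[0]

# One W_m(Sigma) draw: sum of m outer products of N_p(0, Sigma) vectors.
Z = rng.multivariate_normal(np.zeros(p), Sigma, size=m)
W = Z.T @ Z                              # equals sum_i z_i z_i'

print(W)
```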

Assignment 1
1. Find the major and minor axes for a bivariate normal probability distribution where the variables have equal variance (σ_{11} = σ_{22}), with
μ = \begin{pmatrix} μ_1 \\ μ_2 \end{pmatrix} \quad \text{and} \quad Σ = \begin{bmatrix} σ_{11} & σ_{12} \\ σ_{12} & σ_{11} \end{bmatrix}
2. Let x_1, x_2, \ldots, x_n be a random sample of size n from a univariate normal distribution with mean μ and variance σ^2, which has probability density function
f(x) = \frac{1}{\sqrt{2πσ^2}} e^{-[(x-μ)^2/σ^2]/2}, \quad -\infty < x < \infty
Show that \hat{μ} = \bar{x} = \sum_{i=1}^{n} x_i / n and \hat{σ}^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n} are the maximum likelihood estimators, and comment on the result.
3. Let X_1, X_2, \ldots, X_n be a random sample of size n from a p-variate normal distribution with mean μ and covariance matrix Σ, which has the joint density function
f(x_1, x_2, \ldots, x_n) = \frac{1}{(2π)^{np/2} |Σ|^{n/2}} \exp\left( -\frac{1}{2} \sum_{i=1}^{n} (x_i-μ)' Σ^{-1} (x_i-μ) \right)
Show that \hat{μ} = \frac{1}{n} \left( \sum_{i=1}^{n} X_i \right) = \bar{X} and \hat{Σ} = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})(X_i - \bar{X})' = \frac{n-1}{n} S.
4. Inference about a Multivariate Mean Vector
4.1. Inference about a mean vector
A large part of any analysis is concerned with inference. That is, reaching valid conclusions
concerning a population on the basis of information from a sample.
At this point, we shall concentrate on inferences about a population mean vector and its
component parts. One of the central messages of multivariate analysis is that p correlated
variables must be analyzed jointly.
4.2. Hypothesis testing
A hypothesis is a conjecture about the value of a parameter, in this section a population mean
or means. Hypothesis testing assists in making a decision under uncertainty.
4.2.1. Univariate case
We are interested in the mean of a population, and we have a random sample of n observations from the population, X_1, X_2, \ldots, X_n, where (i.e., the assumptions are):
• Observations are independent (i.e., X_i is independent of X_{i'} for i ≠ i').
• Observations are from the same population; that is, E(X_i) = μ for all i.
• If the sample size is "small", we'll also assume that X_i ~ N(μ, σ^2).
Hypotheses:
H_0: μ = μ_0 and H_1: μ ≠ μ_0
where μ_0 is some specified value. In this case, H_1 is a two-sided alternative hypothesis.
Test statistic:
If X_1, X_2, \ldots, X_n is a random sample from a normal population, the appropriate test statistic is
t = \frac{\bar{X} - μ_0}{s / \sqrt{n}}
where \bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i and s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2.
Sampling distribution: If H_0 and the assumptions are true, then the sampling distribution of t is Student's t distribution with n-1 degrees of freedom.
Decision: Reject H_0 when |t| is "large" (i.e., the p-value is small). That is, we reject the null hypothesis at level α when |t| > t_{α/2}(n-1). If we fail to reject H_0, then we conclude that μ_0 is close to \bar{X}.
Confidence interval: a region or range of plausible μ's (given the observations/data); the set of all μ_0's such that
\left| \frac{\bar{x} - μ_0}{s / \sqrt{n}} \right| ≤ t_{α/2}(n-1)
where t_{α/2}(n-1) is the upper (α/2)100-th percentile of Student's t-distribution with n-1 degrees of freedom. Equivalently, it is the set
\left\{ μ_0 \text{ such that } \bar{X} - t_{α/2}(n-1) \frac{s}{\sqrt{n}} ≤ μ_0 ≤ \bar{X} + t_{α/2}(n-1) \frac{s}{\sqrt{n}} \right\}
A 100(1-α)% confidence interval or region for μ is
\bar{X} - t_{α/2}(n-1) \frac{s}{\sqrt{n}} ≤ μ ≤ \bar{X} + t_{α/2}(n-1) \frac{s}{\sqrt{n}}
Remark: Before the sample is selected, the ends of the interval depend on the random variables \bar{X} and s; this is a random interval. 100(1-α)% of the time, such intervals will contain the "true" mean μ.

Example 4.1: Suppose we had the following fifteen sample observations on some random variable X_1:
5.76 6.68 6.79 7.88 2.46 2.48 2.97 4.47
1.62 1.43 7.46 8.92 6.61 4.03 9.42
At a significance level of α = 0.10, do these data support the assertion that they were drawn from a population with a mean of 4.0?
In other words, test the null hypothesis H_0: μ_1 = μ_0 = 4.0 against H_1: μ_1 ≠ μ_0 = 4.0.
Solution
Let's use the five steps of hypothesis testing to assess the potential validity of this conjecture:
1. State the null and alternative hypotheses:
H_0: μ_1 = 4.0 vs H_1: μ_1 ≠ 4.0
2. State the desired level of significance: α = 0.10.
3. Select the appropriate test statistic: n = 15 < 30, but the data appear normal, so use the t-distribution and calculate the test statistic.
We have \bar{X}_1 = 5.26, μ_0 = 4.0, s_1^2 = s_{11} = 7.12 → s_1 = 2.669 = \sqrt{s_{11}}.
So
t = \frac{\bar{X}_1 - μ_0}{s_{\bar{x}}} = \frac{\bar{X}_1 - μ_0}{s / \sqrt{n}} = \frac{5.26 - 4.0}{2.669 / \sqrt{15}} = 1.84
4. Find the critical value(s) and state the decision rule.
Critical value: we have a two-tailed test, t_{α/2}(n-1) = ±1.761.
Decision rule: do not reject H_0 if -1.761 ≤ t ≤ 1.761; otherwise reject H_0.
Since the observed (calculated) value of t is 1.84, which falls in the rejection region, we reject H_0 at the 10% level.
5. Conclusion: At α = 0.10, the sample evidence does not support the claim that the mean of X_1 is 4.0.
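The univariate test above can be reproduced with SciPy's one-sample t-test (a sketch; SciPy is assumed to be available):

```python
import numpy as np
from scipy import stats

x1 = np.array([5.76, 6.68, 6.79, 7.88, 2.46, 2.48, 2.97, 4.47,
               1.62, 1.43, 7.46, 8.92, 6.61, 4.03, 9.42])

t_stat, p_value = stats.ttest_1samp(x1, popmean=4.0)
t_crit = stats.t.ppf(1 - 0.10 / 2, df=len(x1) - 1)   # two-sided critical value, alpha = 0.10

print(t_stat, p_value)        # t approximately 1.84
print(abs(t_stat) > t_crit)   # True -> reject H0 at the 10% level
```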

Example 4.2: Suppose we had the following fifteen sample observations on some random variable X_2:
-3.97 -3.24 -3.56 -1.87 -1.13 -5.20 -6.39 -7.88
-5.00 -0.69 1.61 -6.60 2.32 2.87 -7.64
At a significance level of α = 0.10, do these data support the assertion that they were drawn from a population with a mean of -1.5?
In other words, test the null hypothesis H_0: μ_2 = μ_0 = -1.5 against H_1: μ_2 ≠ μ_0 = -1.5.
Solution:
Let's use the five steps of hypothesis testing to assess the potential validity of this conjecture:
1. State the null and alternative hypotheses:
H_0: μ_2 = -1.5 vs H_1: μ_2 ≠ -1.5
2. State the desired level of significance: α = 0.10.
3. Select the appropriate test statistic: n = 15 < 30, but the data appear normal, so use the t-distribution and calculate the test statistic.
We have \bar{X}_2 = -3.09, μ_0 = -1.5, s_2^2 = s_{22} = 12.43 → s_2 = 3.526 = \sqrt{s_{22}}.
So
t = \frac{\bar{X}_2 - μ_0}{s_{\bar{x}}} = \frac{\bar{X}_2 - μ_0}{s / \sqrt{n}} = \frac{-3.09 - (-1.5)}{3.526 / \sqrt{15}} = -1.748
4. Find the critical value(s) and state the decision rule.
Critical value: we have a two-tailed test, t_{α/2}(n-1) = t_{0.05}(14) = ±1.761.
Decision rule: do not reject H_0 if -1.761 ≤ t ≤ 1.761; otherwise reject H_0.
Since the observed (calculated) value of t is -1.748, which does not fall in the rejection region, we do not reject H_0 at the 10% level. That is, -1.761 ≤ -1.748 ≤ 1.761.
5. Conclusion: At α = 0.10, the sample evidence does not refute the claim that the mean of X_2 is -1.5.
Square the test statistic t:
t^2 = \frac{(\bar{X} - μ_0)^2}{s^2 / n} = n(\bar{X} - μ_0)(s^2)^{-1}(\bar{X} - μ_0)
So t^2 is a squared statistical distance between the sample mean \bar{x} and the hypothesized value μ_0.
Remember that t^2_{df} = F_{1, df}. That is, the sampling distribution of
t^2 = \frac{(\bar{X} - μ_0)^2}{s^2 / n} = n(\bar{X} - μ_0)(s^2)^{-1}(\bar{X} - μ_0) \text{ is } F(1, n-1).

4.3. Multivariate Case: Hotelling's T² test statistic (inference about a mean vector)
A natural generalization of the squared univariate distance t^2 is the multivariate analog, Hotelling's T^2:
T^2 = (\bar{X} - μ_0)' \left( \frac{1}{n} S \right)^{-1} (\bar{X} - μ_0) = n(\bar{X} - μ_0)' S^{-1} (\bar{X} - μ_0)
where
\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \begin{bmatrix} \bar{x}_1 \\ \bar{x}_2 \\ \vdots \\ \bar{x}_p \end{bmatrix} \quad \text{and} \quad μ_0 = (μ_{10}, \ldots, μ_{p0})'
This gives us a framework for testing hypotheses about a mean vector, where the null and alternative hypotheses are
H_0: μ = μ_0 vs H_1: μ ≠ μ_0
T^2 is "Hotelling's T^2".
The sampling distribution of T^2 is
T^2 ~ \frac{(n-1)p}{n-p} F_{p, (n-p)}
We can use this to test H_0: μ = μ_0, assuming that the observations are a random sample from N_p(μ, Σ).
We can compute T^2 and compare it to the critical value \frac{(n-1)p}{n-p} F_{p, (n-p)}(α), or use the fact that
\frac{n-p}{(n-1)p} T^2 ~ F_{p, (n-p)}
Compute T^2 as
T^2 = n(\bar{X} - μ_0)' S^{-1} (\bar{X} - μ_0)
and the p-value = P\left\{ F_{p, (n-p)} ≥ \frac{n-p}{(n-1)p} T^2 \right\}.
Reject H_0 when the p-value is small (i.e., when T^2 is large).

Recall that for the univariate case,
t^2 = \frac{(\bar{X} - μ_0)^2}{s^2 / n} = n(\bar{X} - μ_0)(s^2)^{-1}(\bar{X} - μ_0)
Since \bar{X} ~ N(μ, \frac{1}{n}σ^2),
\sqrt{n}(\bar{X} - μ_0) ~ N(\sqrt{n}(μ - μ_0), σ^2)
This is a linear function of \bar{X}, which is a random variable.
We also know that
(n-1)s^2 = \sum_{i=1}^{n} (X_i - \bar{X})^2 ~ σ^2 χ^2_{(n-1)} \quad \text{because} \quad \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{σ^2} = \sum_{i=1}^{n} Z_i^2 ~ χ^2_{(n-1)}
So,
s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1} = \frac{\text{chi-square random variable}}{\text{degrees of freedom}}
Putting this all together, we find
t^2 = (\text{univariate normal random variable}) \left( \frac{\text{chi-square random variable}}{\text{degrees of freedom}} \right)^{-1} (\text{univariate normal random variable})
Now we will go through the same thing, but for the multivariate case:
T^2 = \sqrt{n}(\bar{X} - μ_0)' S^{-1} \sqrt{n}(\bar{X} - μ_0)
Since \bar{X} ~ N_p(μ, \frac{1}{n}Σ) and \sqrt{n}(\bar{X} - μ_0) is a linear combination of \bar{X},
\sqrt{n}(\bar{X} - μ_0) ~ N_p(\sqrt{n}(μ - μ_0), Σ)
Also,
S = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(X_i - \bar{X})'}{n-1} = \frac{\sum_{i=1}^{n-1} Z_i Z_i'}{n-1} = \frac{\text{Wishart random matrix with } df = n-1}{n-1}
where Z_i ~ N_p(0, Σ) if H_0 is true.
Recall that a Wishart distribution is a matrix generalization of the chi-square distribution. The sampling distribution of (n-1)S is Wishart, W_{n-1}(Σ).
So,
T^2 = (\text{multivariate normal random vector})' \left( \frac{\text{Wishart random matrix}}{\text{degrees of freedom}} \right)^{-1} (\text{multivariate normal random vector})
Example 4.3: Suppose we had the following fifteen paired sample observations on the random variables X_1 and X_2:
x_{j1}: 1.43 1.62 2.46 2.48 2.97 4.03 4.47 5.76 6.61 6.68 6.79 7.46 7.88 8.88 8.92
x_{j2}: -0.69 -5.0 -1.13 -5.2 -6.39 2.87 -7.88 -3.56 2.32 -3.24 -3.56 1.61 -1.87 -6.6 -7.64
At a significance level of α = 0.10, do these data support the assertion that they were drawn from a population with centroid (4.0, -1.5)?
In other words, test the null hypothesis
H_0: μ = \begin{bmatrix} μ_1 \\ μ_2 \end{bmatrix} = \begin{bmatrix} 4.0 \\ -1.5 \end{bmatrix} \quad \text{against} \quad H_1: μ ≠ \begin{bmatrix} 4.0 \\ -1.5 \end{bmatrix}
Let's go through the five steps of hypothesis testing to assess the potential validity of our assertion.
1. State the null and alternative hypotheses:
H_0: μ = \begin{bmatrix} 4.0 \\ -1.5 \end{bmatrix} \quad vs \quad H_1: μ ≠ \begin{bmatrix} 4.0 \\ -1.5 \end{bmatrix}
2. State the level of significance: α = 0.10.
3. Select the appropriate test statistic: n - p = 15 - 2 = 13 is not very large, but the data appear relatively bivariate normal, so use
T^2 = n(\bar{X} - μ_0)' S^{-1} (\bar{X} - μ_0)
We have
\bar{X} = \begin{bmatrix} 5.26 \\ -3.09 \end{bmatrix}, \quad μ_0 = \begin{bmatrix} 4.0 \\ -1.5 \end{bmatrix},
and S is the sample covariance matrix computed from the data (with s_{11} = 7.12 and s_{22} = 12.43).
Calculate the test statistic:
T^2 = n(\bar{X} - μ_0)' S^{-1} (\bar{X} - μ_0) = 15 \begin{bmatrix} 5.26 - 4.0 \\ -3.09 - (-1.5) \end{bmatrix}' S^{-1} \begin{bmatrix} 5.26 - 4.0 \\ -3.09 - (-1.5) \end{bmatrix} = 5.970
4. Find the critical value(s) and state the decision rule.
Critical value: Since α = 0.10, p = 2, and n - p = 15 - 2 = 13 degrees of freedom, we have
F_{p, (n-p)}(α) = F_{2, 13}(0.10) = 2.76
Thus, our critical value is
\frac{(n-1)p}{n-p} F_{p, (n-p)}(α) = \frac{(15-1)2}{15-2} F_{2, 13}(0.10) = 5.95
Decision rule: Do not reject H_0 if T^2 ≤ 5.95; otherwise reject H_0.
Since the observed value of T^2 is 5.970, which falls in the rejection region, we reject H_0 at the 10% level. That is, T^2 = 5.970 ≥ 5.95.
5. Conclusion: At α = 0.10, the sample evidence supports the claim that the mean vector differs from
μ_0 = \begin{bmatrix} 4.0 \\ -1.5 \end{bmatrix}.
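The T² computation follows the formula above; a sketch of it in NumPy/SciPy (values computed from the data table as printed, so small rounding differences from the figures above are possible):

```python
import numpy as np
from scipy import stats

x1 = np.array([1.43, 1.62, 2.46, 2.48, 2.97, 4.03, 4.47, 5.76,
               6.61, 6.68, 6.79, 7.46, 7.88, 8.88, 8.92])
x2 = np.array([-0.69, -5.0, -1.13, -5.2, -6.39, 2.87, -7.88, -3.56,
               2.32, -3.24, -3.56, 1.61, -1.87, -6.6, -7.64])
X = np.column_stack([x1, x2])
mu0 = np.array([4.0, -1.5])

n, p = X.shape
x_bar = X.mean(axis=0)
S = np.cov(X, rowvar=False, ddof=1)

T2 = n * (x_bar - mu0) @ np.linalg.inv(S) @ (x_bar - mu0)        # Hotelling's T^2
crit = (n - 1) * p / (n - p) * stats.f.ppf(1 - 0.10, p, n - p)   # scaled F critical value
p_value = 1 - stats.f.cdf((n - p) / ((n - 1) * p) * T2, p, n - p)

print(T2, crit, p_value)    # reject H0 when T2 exceeds the critical value
```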
Likelihood Ratio Test and Hotelling's T^2
Compare the maximum value of the multivariate normal likelihood function under no restrictions against the restricted maximized value with the mean vector held at μ_0. The hypothesized value μ_0 will be plausible if it produces a likelihood value almost as large as the unrestricted maximum.
To test H_0: μ = μ_0 against H_1: μ ≠ μ_0, we construct the ratio
\text{Likelihood ratio} = Λ = \frac{\max_{Σ} L(μ_0, Σ)}{\max_{μ, Σ} L(μ, Σ)}
where the numerator in the ratio is the likelihood at the MLE of Σ given that μ = μ_0, and the denominator is the likelihood at the unrestricted MLEs for both μ and Σ.
Since
\hat{Σ}_0 = \frac{1}{n} \sum_{i=1}^{n} (x_i - μ_0)(x_i - μ_0)', \quad \hat{μ} = \frac{1}{n} \sum_{i=1}^{n} x_i = \bar{x}, \quad \hat{Σ} = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})',
then, under the assumption of multivariate normality,
Λ = \frac{\max_{Σ} L(μ_0, Σ)}{\max_{μ, Σ} L(μ, Σ)} = \left( \frac{|\hat{Σ}|}{|\hat{Σ}_0|} \right)^{n/2}
Derivation of the likelihood ratio test: substituting the maximum likelihood estimates into the likelihood function gives
\max_{μ, Σ} L(μ, Σ) = \frac{1}{(2π)^{np/2} |\hat{Σ}|^{n/2}} e^{-np/2} \quad \text{and} \quad \max_{Σ} L(μ_0, Σ) = \frac{1}{(2π)^{np/2} |\hat{Σ}_0|^{n/2}} e^{-np/2},
so the ratio of the two maxima is Λ = (|\hat{Σ}| / |\hat{Σ}_0|)^{n/2}.
μ_0 is a plausible value for μ if Λ is close to one.
Relationship between Λ and T^2:
Λ^{2/n} = \frac{|\hat{Σ}|}{|\hat{Σ}_0|} = \left( 1 + \frac{T^2}{n-1} \right)^{-1}
For large T^2, the likelihood ratio Λ is small, and both lead to rejection of H_0.
From the previous equation,
T^2 = (n-1)\left( Λ^{-2/n} - 1 \right) = (n-1)\left( \frac{|\hat{Σ}_0|}{|\hat{Σ}|} - 1 \right)
which provides another way to compute T^2 that does not require inverting a covariance matrix.
When H_0: μ = μ_0 is true, the exact distribution of the likelihood ratio test statistic is obtained from the relation
\frac{n-p}{p} \left( \frac{|\hat{Σ}_0|}{|\hat{Σ}|} - 1 \right) = \frac{n-p}{(n-1)p} T^2 ~ F_{p, (n-p)}
4.4. Confidence regions and simultaneous comparison of components
4.5. Large sample inference about the mean vector
