The Hilbert Space of Random Variables

Electrical Engineering 126 (UC Berkeley)


Spring 2018

1 Outline
Fix a probability space and consider the set

H := {X : X is a real-valued random variable with E[X^2] < ∞}.

There are natural notions of length and orthogonality for objects in H, which allow
us to work with random variables geometrically, as if they were vectors in Euclidean
space. Such a space is known as a Hilbert space.
Geometric reasoning leads to an insightful view of mean square error estimation
as a projection onto an appropriate space.
First, we will review linear algebra and explain what it means for H to be a
Hilbert space. Then, we will study projections in detail and solve the constrained
optimization problem of finding the closest point on a linear space to a given point.
Using the ideas of projection and orthogonality, we will derive the linear least squares
estimator (LLSE). We will then extend the ideas to the non-linear case, arriving at
the minimum mean square error (MMSE) estimator.

2 Vector Spaces
A (real) vector space V is a collection of objects, including a zero vector 0 ∈ V ,
equipped with two operations, vector addition (which allows us to add two vectors
u, v ∈ V to obtain another vector u + v ∈ V ) and scalar multiplication (which
allows us to “scale” a vector v ∈ V by a real number c to obtain a vector cv), satisfying
the following axioms: for all u, v, w ∈ V and all a, b ∈ R,
• vector addition is associative, commutative, and 0 is the identity element, that
is, u + (v + w) = (u + v) + w, u + v = v + u, and u + 0 = u;

• scalar multiplication is compatible with vector operations: 1v = v, a(bv) = (ab)v,
a(u + v) = au + av, and (a + b)u = au + bu.

It is not important to memorize all of the axioms; however, it is important to recognize that they capture a lot of useful mathematical structure, including the space of random variables, which is why linear algebra plays a key role in many disciplines.
To gain intuition for these axioms, the most natural example of a vector space is $\mathbb{R}^n$, for any positive integer n. The space $\mathbb{R}^n$ consists of n-tuples of real numbers. Vector addition and scalar multiplication are defined componentwise:

$$(x_1, \dots, x_n) + (y_1, \dots, y_n) = (x_1 + y_1, \dots, x_n + y_n),$$
$$c(x_1, \dots, x_n) = (cx_1, \dots, cx_n).$$

Given a set S ⊆ V , the span of S is the set of vectors we can reach from vectors
in S using a finite number of vector addition and scalar multiplication operations:

$$\operatorname{span} S := \{c_1 v_1 + \cdots + c_m v_m : m \in \mathbb{N},\ v_1, \dots, v_m \in S,\ c_1, \dots, c_m \in \mathbb{R}\}.$$

We say that an element of span S is a linear combination of the vectors in S. Also, span S is itself a vector space. Whenever we have a subset U ⊆ V such that U is also a vector space in its own right, then we call U a subspace of V.
Notice that if any vector v ∈ S can be written as a linear combination of the
other vectors in S, then we can safely remove v from S without decreasing the span
of the vectors, i.e., span S = span(S \ {v}). This is because any linear combination
using v can be rewritten as a linear combination using only vectors in S \ {v}. From
the perspective of figuring out which vectors lie in span S, v is redundant. Thus, we
say that S is linearly independent if it contains no redundant vectors, i.e., if no
vector in S can be written as a linear combination of the other vectors in S.
A set S of vectors which is both linearly independent and has span S = V is called
a basis of V . The significance of a basis S is that any element of V can be written as
a unique linear combination of elements of S. One of the most fundamental results in
linear algebra says that for any vector space V , a basis always exists, and moreover,
the cardinality of any basis is the same. The size of any basis of V is called the
dimension of V , denoted dim V .
Here is a useful fact: any n-dimensional real vector space is essentially identical (isomorphic) to $\mathbb{R}^n$, which means that $\mathbb{R}^n$ is truly a model vector space. However, in this note, we will have need for infinite-dimensional vector spaces too.

2.1 Inner Product Spaces & Hilbert Spaces
We have promised to talk geometrically, but so far, the definition of a vector space does not have any geometry built into it. For this, we need another definition.
For a real vector space V, a map $\langle \cdot, \cdot \rangle : V \times V \to \mathbb{R}$ satisfying, for all u, v, w ∈ V and c ∈ R,

• (symmetry) $\langle u, v \rangle = \langle v, u \rangle$,

• (linearity) $\langle u + cv, w \rangle = \langle u, w \rangle + c \langle v, w \rangle$, and

• (positive definiteness) $\langle u, u \rangle > 0$ if $u \neq 0$

is called a (real) inner product on V. Then, V along with the map $\langle \cdot, \cdot \rangle$ is called a (real) inner product space. Note that combining symmetry and linearity gives us linearity in the second argument too: $\langle u, v + cw \rangle = \langle u, v \rangle + c \langle u, w \rangle$.
The familiar inner product on Euclidean space $\mathbb{R}^n$ is $\langle x, y \rangle := \sum_{i=1}^n x_i y_i$, also sometimes called the dot product.
The first bit of geometry that the inner product gives us is a norm map $\|\cdot\| : V \to [0, \infty)$, given by $\|v\| := \sqrt{\langle v, v \rangle}$.
By analogy to Euclidean space, we can consider the norm to be the length of a vector.
The second bit of geometry is the notion of an angle θ between vectors u and v, which we can define via the formula $\langle u, v \rangle = \|u\| \|v\| \cos\theta$. We are only interested in the case when $\cos\theta = 0$, which tells us when u and v are orthogonal. Precisely, we say that u and v are orthogonal if $\langle u, v \rangle = 0$.
Now, it is your turn! Do the following exercise.

Exercise 1. Prove that $\langle X, Y \rangle := E[XY]$ makes H into a real inner product space. (Hint: You must first show that H is a real vector space, which requires H to be closed under vector addition, i.e., if X, Y ∈ H, then X + Y ∈ H. For this, use the Cauchy-Schwarz inequality, which says that for random variables X and Y, $|E[XY]| \le \sqrt{E[X^2]}\,\sqrt{E[Y^2]}$.)
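As a quick sanity check on the hint (a sketch, not a full solution), closure of H under addition follows by expanding the square and applying the Cauchy-Schwarz inequality:

$$E[(X+Y)^2] = E[X^2] + 2E[XY] + E[Y^2] \le E[X^2] + 2\sqrt{E[X^2]}\sqrt{E[Y^2]} + E[Y^2] = \Bigl(\sqrt{E[X^2]} + \sqrt{E[Y^2]}\Bigr)^2 < \infty.$$

Closure under scalar multiplication is immediate, since $E[(cX)^2] = c^2 E[X^2] < \infty$.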

To motivate the definition of the inner product given above, first consider the case when the probability space Ω is finite. Then, $E[XY] = \sum_{\omega \in \Omega} X(\omega) Y(\omega) P(\omega)$, which bears resemblance to the Euclidean inner product $\langle x, y \rangle = \sum_{i=1}^n x_i y_i$. However, $E[XY]$ is a sum in which we weight each sample point ω ∈ Ω by its probability, which makes sense in a probabilistic context. In the case when X and Y have joint density f, then $E[XY] = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} x y f(x, y) \, dx \, dy$, which is again similar but with an integral replacing the summation and $f(x, y) \, dx \, dy$ standing in as the “probability” of the point (x, y).
Finally, we are not quite at the definition of a Hilbert space yet. A (real) Hilbert
space is a real inner product space which satisfies an additional analytic property
called completeness, which we will not describe (for this, you will have to take a
course in functional analysis).
If Ω is finite, then H is finite-dimensional. Indeed, a basis is given by the indicators $\{1_\omega\}_{\omega \in \Omega}$. However, in general H is infinite-dimensional, and we will soon run into
analytical issues which obscure the core ideas. Therefore, from this point forward, we
will behave as if H were finite-dimensional, when in reality it is not.

3 Projection
Now that we know that H is a Hilbert space, we would like to apply our knowledge
to the problem of estimating a random variable Y ∈ H. Clearly, if we could directly
observe Y , then estimation would not be a difficult problem. However, we are often
not given direct access to Y , and we are only allowed to observe some other random
variable X which is correlated with Y . The problem is then to find the best estimator
of Y , if we restrict ourselves to using only functions of our observation X. Even still,
finding the best function of X might be computationally prohibitive, and it may be
desired to only use linear functions, i.e., functions of the form a + bX for a, b ∈ R.
Notice that “linear functions of the form a + bX” can also be written as span{1, X}, a subspace of H, so we may formulate our problem more generally as the following:

Given y ∈ V and a subspace U ⊆ V , find the closest point x ∈ U to y.

The answer will turn out to be the projection of y onto the subspace U . We will
explain this concept now.
Given a set S ⊆ V, we define the orthogonal complement of S, denoted $S^\perp$:

$$S^\perp := \{v \in V : \langle u, v \rangle = 0 \text{ for all } u \in S\}.$$

That is, $S^\perp$ is the set of vectors which are orthogonal to everything in S. Check for yourself that $S^\perp$ is a subspace.
Given a subspace U ⊆ V, what does it mean to “project” y onto U? To get a feel for the idea, imagine a slanted pole in broad daylight. One might say that the shadow it casts on the ground is a “projection” of the pole onto the ground. From this visualization, you might also realize that there are different types of projections, depending on the location of the sun in the sky. The projection we are interested in is the shadow cast when the sun is directly overhead, because this projection minimizes the distance from the tip of the pole to the tip of the shadow; this is known as an orthogonal projection.
Formally, the orthogonal projection onto a subspace U is the map P : V → U such that $Py := \arg\min_{x \in U} \|y - x\|$. In words, given an input y, Py is the closest point in U to y. We claim that Py satisfies the following two conditions (see Figure 1):

$$Py \in U \quad \text{and} \quad y - Py \in U^\perp. \tag{1}$$

Why? Suppose that (1) holds. Then, for any x ∈ U, since $Py - x \in U$,

$$\|y - x\|^2 = \|y - Py + Py - x\|^2 = \|y - Py\|^2 + 2\langle y - Py, Py - x \rangle + \|Py - x\|^2 = \|y - Py\|^2 + \|Py - x\|^2 \ge \|y - Py\|^2,$$

where the cross term vanishes because $y - Py \in U^\perp$ is orthogonal to $Py - x \in U$. Equality holds if and only if x = Py, i.e., Py is the minimizer of $\|y - x\|^2$ over x ∈ U.

y − Py
0
U
Py

Figure 1: y − P y is orthogonal to U .

We now invite you to further explore the properties of P .

Exercise 2. (a) A map T : V → V is called a linear transformation if for all u, v ∈ V and all c ∈ R, T(u + cv) = Tu + cTv. Prove that P is a linear transformation. (Hint: Apply the same method of proof used above.)

(b) Suppose that U is finite-dimensional, n := dim U, with basis $\{v_i\}_{i=1}^n$. Suppose that the basis is orthonormal, that is, the vectors are pairwise orthogonal and $\|v_i\| = 1$ for each i = 1, ..., n. Show that $Py = \sum_{i=1}^n \langle y, v_i \rangle v_i$. (Note: If $V = \mathbb{R}^d$ with the standard inner product, then P can be represented as the matrix $P = \sum_{i=1}^n v_i v_i^T$.)
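As a quick numerical check of part (b) (an illustration, not part of the exercise), the following NumPy sketch builds an orthonormal basis of an arbitrary 2-dimensional subspace U of $\mathbb{R}^5$, forms $P = \sum_i v_i v_i^T$, and verifies the two conditions in (1):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 2))                  # columns span a 2-dimensional subspace U of R^5
Q, _ = np.linalg.qr(A)                       # columns of Q: an orthonormal basis of U
P = sum(np.outer(v, v) for v in Q.T)         # projection matrix P = sum_i v_i v_i^T

y = rng.normal(size=5)
Py = P @ y

# y - Py is orthogonal to every basis vector of U ...
print(Q.T @ (y - Py))                        # ~[0, 0]
# ... and Py is at least as close to y as any other point of U.
for _ in range(5):
    x = Q @ rng.normal(size=2)               # a random point of U
    print(np.linalg.norm(y - Py) <= np.linalg.norm(y - x))   # True
```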

3.1 Gram-Schmidt Process
Let us see what (1) tells us in the case when we have a finite basis $\{v_i\}_{i=1}^n$ for U. The condition $Py \in U$ says that Py is a linear combination $\sum_{i=1}^n c_i v_i$ of the basis $\{v_i\}_{i=1}^n$. The condition $y - Py \in U^\perp$ is equivalent to saying that y − Py is orthogonal to $v_i$ for i = 1, ..., n. These two conditions give us a system of equations which we can, in principle, solve:

$$\Bigl\langle y - \sum_{i=1}^n c_i v_i, \, v_j \Bigr\rangle = 0, \qquad j = 1, \dots, n. \tag{2}$$
However, what if the basis $\{v_i\}_{i=1}^n$ is orthonormal, as in Exercise 2(b)? Then, the computation of Py is simple: for each i = 1, ..., n, $\langle y, v_i \rangle$ gives the component of the projection in the direction $v_i$.
Fortunately, there is a simple procedure for taking a basis and converting it into an orthonormal basis. It is known as the Gram-Schmidt process. The algorithm is iterative: at step j ∈ {1, ..., n}, we will have an orthonormal set of vectors $\{u_i\}_{i=1}^j$ so that $\operatorname{span}\{u_i\}_{i=1}^j = \operatorname{span}\{v_i\}_{i=1}^j$. To start, take $u_1 := v_1/\|v_1\|$. Now, at step j ∈ {1, ..., n − 1}, consider $P_{u_1, \dots, u_j} v_{j+1}$, where $P_{u_1, \dots, u_j}$ is the orthogonal projection onto $\operatorname{span}\{u_1, \dots, u_j\}$. Because of (1), we know that $w_{j+1} := v_{j+1} - P_{u_1, \dots, u_j} v_{j+1}$ lies in $(\operatorname{span}\{u_i\}_{i=1}^j)^\perp$, so that $w_{j+1}$ is orthogonal to $u_1, \dots, u_j$. Also, we know that $w_{j+1} \neq 0$, because if $v_{j+1} = P_{u_1, \dots, u_j} v_{j+1}$, then $v_{j+1} \in \operatorname{span}\{u_1, \dots, u_j\}$, but this is impossible since we started with a basis. Thus, we may take $u_{j+1} := w_{j+1}/\|w_{j+1}\|$ and add it to our orthonormal set.
Computation of the Gram-Schmidt process is not too intensive because subtracting the projection onto $\operatorname{span}\{u_i\}_{i=1}^j$ only requires computing the projection onto an orthonormal basis. From Exercise 2(b), we can explicitly describe the procedure:

1. Let $u_1 := v_1/\|v_1\|$.

2. For j = 1, ..., n − 1:

   (a) Set $w_{j+1} := v_{j+1} - \sum_{i=1}^j \langle v_{j+1}, u_i \rangle u_i$.

   (b) Set $u_{j+1} := w_{j+1}/\|w_{j+1}\|$.
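Here is a minimal NumPy sketch of the procedure above (an illustration using the Euclidean inner product; the two vectors at the end are an arbitrary example):

```python
import numpy as np

def gram_schmidt(basis):
    """Orthonormalize a list of linearly independent vectors (Euclidean inner product)."""
    ortho = []
    for v in basis:
        # Subtract the projection onto the span of the previous orthonormal vectors.
        w = v - sum(np.dot(v, u) * u for u in ortho)
        ortho.append(w / np.linalg.norm(w))
    return ortho

# Example: orthonormalize a basis of a 2-dimensional subspace of R^3.
v1, v2 = np.array([1.0, 1.0, 0.0]), np.array([1.0, 0.0, 1.0])
u1, u2 = gram_schmidt([v1, v2])
print(np.dot(u1, u2))                             # ~0: orthogonal
print(np.linalg.norm(u1), np.linalg.norm(u2))     # ~1, ~1: unit length
```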

4 Linear Least Squares Estimation (LLSE)


Finally, we are ready to solve the problem of linear least squares estimation. Formally,
the problem is:

Given X, Y ∈ H, minimize $E[(Y - a - bX)^2]$ over a, b ∈ R. The solution to this problem is called the linear least squares estimator (LLSE).

Using our previous notation, $E[(Y - a - bX)^2] = \|Y - a - bX\|^2$, and thus the solution is $\arg\min_{\hat Y \in U} \|Y - \hat Y\|^2$, where U is the subspace U = span{1, X}. We have already solved this problem! If we apply (2) directly, then we obtain the equations:

$$E[Y - a - bX] = 0,$$
$$E[(Y - a - bX)X] = 0.$$

We can solve for a and b. Alternatively, we can apply the Gram-Schmidt process to {1, X} (assuming X is not constant) to convert it to the orthonormal basis $\{1, (X - E[X])/\sqrt{\operatorname{var} X}\}$. Now, applying Exercise 2(b),

$$L[Y \mid X] := E[Y] + E\Bigl[Y \cdot \frac{X - E[X]}{\sqrt{\operatorname{var} X}}\Bigr] \frac{X - E[X]}{\sqrt{\operatorname{var} X}} = E[Y] + \frac{\operatorname{cov}(X, Y)}{\operatorname{var} X}\,(X - E[X]).$$
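For completeness, here is the direct solve of the two equations above (a short sketch, equivalent to the Gram-Schmidt route). The first equation gives $a = E[Y] - b\,E[X]$; substituting into the second,

$$E\bigl[(Y - E[Y] - b(X - E[X]))X\bigr] = \operatorname{cov}(X, Y) - b \operatorname{var} X = 0 \quad\Longrightarrow\quad b = \frac{\operatorname{cov}(X, Y)}{\operatorname{var} X},$$

which recovers the same expression for $L[Y \mid X] = a + bX$.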
This is a nice result, so let us package it up as a theorem:

Theorem 1 (LLSE). For X, Y ∈ H, where X is not a constant, the LLSE of Y given X is

$$L[Y \mid X] = E[Y] + \frac{\operatorname{cov}(X, Y)}{\operatorname{var} X}\,(X - E[X]).$$

Furthermore, the squared error of the LLSE is

$$E[(Y - L[Y \mid X])^2] = \operatorname{var} Y - \frac{\operatorname{cov}(X, Y)^2}{\operatorname{var} X}.$$
Proof. Since we have already proven the formula for the LLSE, let us prove the second assertion, geometrically. Notice that both sides of the equation for the squared error are unaffected if we replace X and Y with X − E[X] and Y − E[Y], so we may assume that X and Y are zero mean. Now consider Figure 2. The angle θ at 0 satisfies

$$\cos\theta = \frac{\langle X, Y \rangle}{\|X\| \|Y\|}.$$

Thus, from geometry,

$$E[(Y - L[Y \mid X])^2] = \|Y - L[Y \mid X]\|^2 = \|Y\|^2 (\sin\theta)^2 = \|Y\|^2 \Bigl(1 - \frac{\langle X, Y \rangle^2}{\|X\|^2 \|Y\|^2}\Bigr) = \operatorname{var} Y - \frac{\operatorname{cov}(X, Y)^2}{\operatorname{var} X}.$$
[Figure 2: The variance of the error of the LLSE can be found geometrically.]
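To make Theorem 1 concrete, here is a small simulation (the particular distribution of (X, Y) below is an arbitrary choice for illustration). It compares the LLSE coefficients with a brute-force least squares line fit and checks the squared-error formula empirically:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
X = rng.normal(2.0, 1.5, size=n)                 # arbitrary illustrative distribution
Y = 3.0 * X + rng.normal(0.0, 2.0, size=n)       # Y correlated with X, plus noise

# LLSE from Theorem 1: L[Y|X] = E[Y] + cov(X, Y)/var(X) * (X - E[X]).
cov_XY = np.mean((X - X.mean()) * (Y - Y.mean()))
b = cov_XY / np.var(X)
a = Y.mean() - b * X.mean()
L = a + b * X

# The slope and intercept agree with a direct least squares line fit ...
print(np.polyfit(X, Y, 1), (b, a))
# ... and the empirical squared error matches var(Y) - cov(X, Y)^2 / var(X).
print(np.mean((Y - L) ** 2), np.var(Y) - cov_XY ** 2 / np.var(X))
```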

4.1 Orthogonal Updates


Given X, Y, Z ∈ H, how do we calculate L[Y | X, Z]? Here is where the Gram-
Schmidt process truly shines. The key idea in the Gram-Schmidt algorithm is to
take a new basis vector and subtract the orthogonal projection of the vector onto the
previous basis vectors (the part where we normalize the vector is just icing on the
cake). The point is that we can compute the projection onto an orthogonal basis,
one component at a time (by Exercise 2(b)).
Now, the “basis vectors” are interpreted as new observations in the context
of estimation. The projection of Z onto X is L[Z | X], so the new orthogonal
observation is Z − L[Z | X]. This is called the innovation, because it represents the
new information that was not already predictable from the previous observations.
Now, since X and Z − L[Z | X] are orthogonal, we have the following:

Theorem 2 (Orthogonal LLSE Update). Let X, Y, Z ∈ H, where X and Z are not constant. Then, L[Y | X, Z] = L[Y | X] + L[Y | Z̃], where Z̃ := Z − L[Z | X].

This observation is crucial for the design of online algorithms, because it tells us
how to recursively update our LLSE after collecting new observations. We will revisit
this topic when we discuss tracking and the Kalman filter.
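As a quick numerical illustration of Theorem 2 (again with an arbitrary simulated distribution), the sketch below works with centered random variables, so that the constant term is not counted twice, and checks the update against a direct two-observation fit:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
X = rng.normal(size=n)
Z = 0.5 * X + rng.normal(size=n)                     # Z correlated with X
Y = X + 2.0 * Z + rng.normal(size=n)                 # Y depends on both
X, Z, Y = X - X.mean(), Z - Z.mean(), Y - Y.mean()   # center everything

def llse(y, x):
    """Empirical L[y | x] for centered scalar random variables."""
    return (np.mean(x * y) / np.mean(x * x)) * x

# Two-observation LLSE, computed directly by projecting Y onto span{X, Z}.
A = np.column_stack([X, Z])
coef, *_ = np.linalg.lstsq(A, Y, rcond=None)
L_joint = A @ coef

# Orthogonal update: L[Y | X, Z] = L[Y | X] + L[Y | Z - L[Z | X]].
Z_tilde = Z - llse(Z, X)                             # the innovation
L_update = llse(Y, X) + llse(Y, Z_tilde)

print(np.max(np.abs(L_joint - L_update)))            # ~0, up to floating-point error
```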

4.2 Non-Linear Estimation


Since we have placed such an emphasis on the techniques of linear algebra, you may perhaps think that quadratic estimation might be far harder, but this is not true. If we are looking for the best a, b, c ∈ R to minimize $E[(Y - a - bX - cX^2)^2]$, then this is again the projection of Y onto the subspace span{1, X, X^2}, so the methods we developed apply to this situation as well. In general, we can easily handle polynomial regression of degree d, for any positive integer d, by projecting Y onto span{1, X, ..., X^d}. The equations become more difficult to solve and require knowledge of higher moments of X and Y (in fact, we must now have $X^d \in H$, i.e., $E[X^{2d}] < \infty$), but nonetheless it is possible.
The same is true whenever we project onto the span of a linearly independent set of random variables; for example, the same techniques can be used to compute the best linear combination of 1, sin X, and cos X to estimate Y.
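As an illustration of degree-2 estimation by projection (with an arbitrary simulated relationship between X and Y), the following sketch solves the normal equations for span{1, X, X^2} from empirical moments:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
X = rng.uniform(-1.0, 1.0, size=n)
Y = np.cos(2.0 * X) + 0.1 * rng.normal(size=n)   # arbitrary nonlinear relationship

# Quadratic estimation: project Y onto span{1, X, X^2} by solving the
# normal equations E[(Y - a - bX - cX^2) X^k] = 0 for k = 0, 1, 2.
B = np.column_stack([np.ones(n), X, X**2])       # the "basis" evaluated on the samples
G = (B.T @ B) / n                                # Gram matrix of inner products E[X^j X^k]
rhs = (B.T @ Y) / n                              # inner products E[Y X^k]
a, b, c = np.linalg.solve(G, rhs)
quadratic = a + b * X + c * X**2

# The degree-2 projection captures curvature that the best straight line misses.
line = np.polyval(np.polyfit(X, Y, 1), X)
print(np.mean((Y - line) ** 2), np.mean((Y - quadratic) ** 2))
```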

4.3 Vector Case


More generally, what if we have n observations, where n is a positive integer, and we would like to calculate $L[Y \mid X_1, \dots, X_n]$? We could try to apply the orthogonal update method, but if we are not interested in an online algorithm, then we may as well try to solve the problem once and for all if we can.
First, assume that all random variables are centered. Let $\Sigma_X$ denote the covariance matrix of $(X_1, \dots, X_n)$: $\Sigma_X := E[XX^T]$. Since $\Sigma_X$ is symmetric, it has a decomposition $\Sigma_X = U \Lambda U^T$ by the spectral theorem, where U is an orthogonal matrix of eigenvectors ($U^T U = I$) and Λ is a diagonal matrix of real eigenvalues. Furthermore, $\Sigma_X$ is positive semi-definite: for any $v \in \mathbb{R}^n$,

$$v^T \Sigma_X v = E[v^T (X - E[X])(X - E[X])^T v] = E\bigl[\bigl((X - E[X])^T v\bigr)^2\bigr] \ge 0.$$

Thus, $\Sigma_X$ has only non-negative eigenvalues; we will assume that $\Sigma_X$ is invertible, so that it has strictly positive eigenvalues (it is positive definite). Therefore, Λ has a real square root $\Lambda^{1/2}$ defined by $(\Lambda^{1/2})_{i,i} := \sqrt{\Lambda_{i,i}}$ for each i = 1, ..., n. Now, notice that the covariance of the random vector $Z := \Lambda^{-1/2} U^T X$ is $\Lambda^{-1/2} U^T (U \Lambda U^T) U \Lambda^{-1/2} = I$, so the components of Z are orthonormal. Moreover, $L[Y \mid X] = L[Y \mid Z]$, because left multiplication by $\Lambda^{-1/2} U^T$ is an invertible linear map (with inverse $y \mapsto U \Lambda^{1/2} y$).
After performing this transformation,

$$L[Y \mid Z] = \sum_{i=1}^n \langle Y, Z_i \rangle Z_i = \sum_{i=1}^n \sum_{j=1}^n \sum_{k=1}^n (\Lambda^{-1/2} U^T)_{i,j} (\Lambda^{-1/2} U^T)_{i,k} \langle Y, X_j \rangle X_k = \sum_{j=1}^n \sum_{k=1}^n (\Sigma_X^{-1})_{j,k} \langle Y, X_j \rangle X_k = \Sigma_{Y,X} \Sigma_X^{-1} X,$$

where $\Sigma_{Y,X} := E[YX^T]$.

(Footnote: This section requires more advanced linear algebra concepts. Do not worry if you do not yet have the required linear algebra background to understand all of the arguments. They are presented here to offer a different perspective on the material; see also the treatment in Walrand's textbook, which is more direct.)
What if we wish to predict more than one variable, i.e., $Y = (Y_1, \dots, Y_m)$ where m is a positive integer? Then, the LLSE of the vector Y is the vector whose components are the estimates of the corresponding components of Y, so

$$L[(Y_1, \dots, Y_m) \mid X] = (L[Y_1 \mid X], \dots, L[Y_m \mid X]);$$

the resulting formula is still $L[Y \mid X] = \Sigma_{Y,X} \Sigma_X^{-1} X$.


We can also calculate the squared error of the LLSE. To prepare for the derivation, recall that for two matrices A and B of compatible dimensions (A is m × n and B is n × m for positive integers m and n), we have tr(AB) = tr(BA). Also, since the trace is a linear function of the entries of the matrix, by linearity of expectation, the trace commutes with expectation.

$$E[\|Y - L[Y \mid X]\|_2^2] = E[(Y - \Sigma_{Y,X} \Sigma_X^{-1} X)^T (Y - \Sigma_{Y,X} \Sigma_X^{-1} X)]$$
$$= E[Y^T Y - (\Sigma_{Y,X} \Sigma_X^{-1} X)^T Y] = \operatorname{tr}(\Sigma_Y - \Sigma_{Y,X} \Sigma_X^{-1} \Sigma_{X,Y}),$$

where $\Sigma_Y := E[YY^T]$ and $\Sigma_{X,Y} = \Sigma_{Y,X}^T = E[XY^T]$. In the first line, we use the fact that Y − L[Y | X] is orthogonal to L[Y | X]. In the second line, we use

$$E[Y^T Y - (\Sigma_{Y,X} \Sigma_X^{-1} X)^T Y] = E\bigl[\operatorname{tr}\bigl(Y^T Y - (\Sigma_{Y,X} \Sigma_X^{-1} X)^T Y\bigr)\bigr]$$
$$= E[\operatorname{tr}(YY^T - YX^T \Sigma_X^{-1} \Sigma_{X,Y})]$$
$$= \operatorname{tr} E[YY^T - YX^T \Sigma_X^{-1} \Sigma_{X,Y}]$$
$$= \operatorname{tr}(\Sigma_Y - \Sigma_{Y,X} \Sigma_X^{-1} \Sigma_{X,Y}).$$
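To tie the vector case together, here is a small simulation (with an arbitrary synthetic distribution) that evaluates $L[Y \mid X] = \Sigma_{Y,X} \Sigma_X^{-1} X$ from empirical moments, compares it with a direct least squares fit, and checks the trace formula for the squared error:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, m = 200_000, 3, 2
X = rng.normal(size=(n, d)) @ rng.normal(size=(d, d))       # correlated features
Y = X @ rng.normal(size=(d, m)) + rng.normal(size=(n, m))   # two responses
X, Y = X - X.mean(axis=0), Y - Y.mean(axis=0)               # center, as assumed in the text

# Empirical second-moment matrices (rows are samples).
Sigma_X = X.T @ X / n             # d x d
Sigma_YX = Y.T @ X / n            # m x d
Sigma_Y = Y.T @ Y / n             # m x m

# L[Y | X] = Sigma_{Y,X} Sigma_X^{-1} X, evaluated at every sample.
L = X @ np.linalg.solve(Sigma_X, Sigma_YX.T)                # n x m

# It agrees with a componentwise least squares fit of Y on X ...
coef, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(np.max(np.abs(L - X @ coef)))                         # ~0
# ... and the squared error matches tr(Sigma_Y - Sigma_{Y,X} Sigma_X^{-1} Sigma_{X,Y}).
print(np.mean(np.sum((Y - L) ** 2, axis=1)))
print(np.trace(Sigma_Y - Sigma_YX @ np.linalg.solve(Sigma_X, Sigma_YX.T)))
```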

4.4 Non-Bayesian Perspective: Linear Regression


So far, the method of estimation that we have developed is a Bayesian method
because it assumes that we have knowledge of the distributions of X and Y . Here,
we will explain how one can study the non-Bayesian perspective of regression, which
can be formulated without any mention of probabilities at all, using the results we
have already obtained.
Let n and d be positive integers, where n represents the number of samples collected and d represents the number of features. Typically the data is organized into an n × d matrix $\mathbf{X}$ called the design matrix, where the entry in the ith row and jth column (for i ∈ {1, ..., n}, j ∈ {1, ..., d}) is the value of the jth feature for the ith data point. Thus, each row of the design matrix contains the feature values for a single data point, and we typically denote the rows by $x_i^T$, for i = 1, ..., n.

We are also given an n × 1 observation vector y. We may assume for simplicity that the design matrix and observation vector have been centered, i.e., subtract appropriate matrices from $\mathbf{X}$ and y such that for each j ∈ {1, ..., d}, $\sum_{i=1}^n X_{i,j} = 0$, and $\sum_{i=1}^n y_i = 0$. The problem is the following:

Find the d × 1 weight vector β to minimize the sum of squares $\|y - \mathbf{X}\beta\|_2^2$.

In other words, we would like to estimate y using a linear combination of the features, where β represents the weight we place on each feature.
To study the problem of regression from the Bayesian perspective, let X be a d × 1 random vector and Y be a scalar random variable with joint distribution

$$(X, Y) \sim \text{Uniform}\{(x_i, y_i)\}_{i=1}^n.$$

In other words, (X, Y) represents a uniformly randomly chosen row of the design matrix and observation vector. Now, for $\beta \in \mathbb{R}^d$, observe:

$$\|y - \mathbf{X}\beta\|_2^2 = n \cdot n^{-1} \sum_{i=1}^n (y_i - x_i^T \beta)^2 = n\, E[(Y - X^T \beta)^2].$$

Hence, finding the weight vector β that minimizes the sum of squared residuals in the non-Bayesian formulation is the same as finding the weight vector β such that $L[Y \mid X] = \beta^T X$. However, we already know that $L[Y \mid X] = \Sigma_{Y,X} \Sigma_X^{-1} X$, so the solution is given by $\beta = \Sigma_X^{-1} \Sigma_{X,Y}$. Moreover,

$$\Sigma_X = E[XX^T] = \frac{1}{n} \sum_{i=1}^n x_i x_i^T = \frac{1}{n} \mathbf{X}^T \mathbf{X},$$
$$\Sigma_{Y,X} = E[YX^T] = \frac{1}{n} \sum_{i=1}^n y_i x_i^T = \frac{1}{n} y^T \mathbf{X}.$$

Therefore, $\beta = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T y$ and the optimal estimate is $\hat y := \mathbf{X}\beta = \mathbf{X}(\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T y$.
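As an illustration with synthetic data (the design matrix and true weights below are arbitrary), the following sketch computes β from the normal equations and cross-checks it against NumPy's least squares solver:

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 500, 3
X = rng.normal(size=(n, d))                      # design matrix, rows = data points
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)

# Center the design matrix and the observation vector, as assumed in the text.
X = X - X.mean(axis=0)
y = y - y.mean()

# beta = (X^T X)^{-1} X^T y, computed by solving the normal equations.
beta = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against the library least squares solver, then form the fitted values.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta, beta_lstsq))             # True
y_hat = X @ beta                                 # the optimal estimate of y
```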


This is not the last word on regression models. If one assumes that for each i = 1, ..., n, $y_i = x_i^T \beta + \varepsilon_i$ for a true parameter vector β and $(\varepsilon_i)_{i=1}^n \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma^2)$, then there is a well-developed theory that describes how to find the best estimator $\hat\beta$, the distribution and expected mean squared error of $\hat\beta$, confidence intervals for $\hat\beta$, etc.

5 Minimum Mean Square Estimation (MMSE)
We will now drop the restriction to linear functions of X or linear functions of
observations X1 , . . . , Xn , and we will instead find the best arbitrary function of X to
estimate Y .

Given X, Y ∈ H, find the best function φ to minimize $E[(Y - \varphi(X))^2]$. The solution to this problem is called the minimum mean square error (MMSE) estimator.

As before, we can write down the orthogonality condition:

Y − φ(X) should be orthogonal to all other functions of X.

Unlike the case of linear estimation, where we looked at the span of a finite number
of random variables, we are now looking at the projection onto the subspace of all
functions of X, which is quite difficult to visualize or even imagine. In fact, you might
wonder whether such a function φ is always guaranteed to exist. The answer is yes,
such a function exists and is essentially unique, although the details are technical.
The conditional expectation of Y given X is formally defined as the function of X, denoted E(Y | X), such that for all bounded continuous functions φ,

$$E\bigl[\bigl(Y - E(Y \mid X)\bigr)\,\varphi(X)\bigr] = 0. \tag{3}$$

It is the solution to our estimation problem.
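As a small illustration (a simulation with an arbitrary discrete distribution, not part of the definition), the sketch below approximates E(Y | X) by conditional averaging, checks the orthogonality condition (3) against a few bounded test functions, and compares the resulting error with that of the LLSE:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 500_000
X = rng.integers(0, 4, size=n)                   # discrete X taking values 0, 1, 2, 3
Y = X ** 2 + rng.normal(size=n)                  # nonlinear relationship plus noise

# For discrete X, E(Y | X = x) is approximated by averaging Y over samples with X = x.
cond_mean = {x: Y[X == x].mean() for x in range(4)}
mmse = np.array([cond_mean[x] for x in X])       # E(Y | X), evaluated sample by sample

# Orthogonality condition (3): E[(Y - E(Y|X)) phi(X)] = 0 for bounded test functions phi.
for phi in (np.cos, np.tanh, lambda x: (x >= 2).astype(float)):
    print(np.mean((Y - mmse) * phi(X)))          # all ~0

# The MMSE estimator beats the best linear estimator when the relationship is nonlinear.
b = np.mean((X - X.mean()) * (Y - Y.mean())) / np.var(X)
llse = Y.mean() + b * (X - X.mean())
print(np.mean((Y - llse) ** 2), np.mean((Y - mmse) ** 2))
```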


Interestingly, although we started out by looking only at random variables in H,
the definition (3) does not require X and Y to be in H. Even if they have infinite
second moments, as long as they have a well-defined first moment we can still define
the conditional expectation. A full exploration of the definition, properties, and
applications of conditional expectation would take up another note by itself, so we
will stop here.
