Projection
1 Outline
Fix a probability space and consider the set of square-integrable random variables on it, H := {X : E[X²] < ∞}.
There are natural notions of length and orthogonality for objects in H, which allow
us to work with random variables geometrically, as if they were vectors in Euclidean
space. Such a space is known as a Hilbert space.
Geometric reasoning leads to an insightful view of mean square error estimation
as a projection onto an appropriate space.
First, we will review linear algebra and explain what it means for H to be a
Hilbert space. Then, we will study projections in detail and solve the constrained
optimization problem of finding the closest point on a linear space to a given point.
Using the ideas of projection and orthogonality, we will derive the linear least squares
estimator (LLSE). We will then extend the ideas to the non-linear case, arriving at
the minimum mean square error (MMSE) estimator.
2 Vector Spaces
A (real) vector space V is a collection of objects, including a zero vector 0 ∈ V ,
equipped with two operations, vector addition (which allows us to add two vectors
u, v ∈ V to obtain another vector u + v ∈ V ) and scalar multiplication (which
allows us to “scale” a vector v ∈ V by a real number c to obtain a vector cv), satisfying
the following axioms: for all u, v, w ∈ V and all a, b ∈ R,
• vector addition is associative, commutative, and 0 is the identity element, that
is, u + (v + w) = (u + v) + w, u + v = v + u, and u + 0 = u;
• scalar multiplication is compatible with vector operations: 1v = v, a(bv) = (ab)v,
a(u + v) = au + av, and (a + b)u = au + bu.
Given a set S ⊆ V , the span of S is the set of vectors we can reach from vectors
in S using a finite number of vector addition and scalar multiplication operations: span S := {c_1 v_1 + · · · + c_k v_k : k ∈ N, v_1, . . . , v_k ∈ S, c_1, . . . , c_k ∈ R}.
2.1 Inner Product Spaces & Hilbert Spaces
We have promised to talk geometrically, but so far, the definition of a vector space does not have any geometry built into it. For this, we need another definition.
For a real vector space V, a map ⟨·, ·⟩ : V × V → R satisfying, for all u, v, w ∈ V and c ∈ R,
• symmetry: ⟨u, v⟩ = ⟨v, u⟩;
• linearity: ⟨u + cv, w⟩ = ⟨u, w⟩ + c⟨v, w⟩;
• positive definiteness: ⟨v, v⟩ ≥ 0, with equality if and only if v = 0;
is called a (real) inner product on V. Then, V along with the map ⟨·, ·⟩ is called a (real) inner product space. Note that combining symmetry and linearity gives us linearity in the second argument too: ⟨u, v + cw⟩ = ⟨u, v⟩ + c⟨u, w⟩.
The familiar inner product on Euclidean space R^n is ⟨x, y⟩ := ∑_{i=1}^n x_i y_i, also sometimes called the dot product.
The first bit of geometry that the inner product gives us is a norm map ‖·‖ : V → [0, ∞), given by ‖v‖ := √⟨v, v⟩.
By analogy to Euclidean space, we can consider the norm to be the length of a vector.
The second bit of geometry is the notion of an angle θ between vectors u and v, which we can define via the formula ⟨u, v⟩ = ‖u‖‖v‖ cos θ. We are only interested in the case when cos θ = 0, which tells us when u and v are orthogonal. Precisely, we say that u and v are orthogonal if ⟨u, v⟩ = 0.
Now, it is your turn! Do the following exercise.
Exercise 1. Prove that ⟨X, Y⟩ := E[XY] makes H into a real inner product space. (Hint: You must first show that H is a real vector space, which requires H to be closed under vector addition, i.e., if X, Y ∈ H, then X + Y ∈ H. For this, use the Cauchy-Schwarz inequality, which says that for random variables X and Y, |E[XY]| ≤ √(E[X²]) √(E[Y²]).)
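As a quick numerical sanity check of the Cauchy-Schwarz inequality (a sketch, not part of the formal development; the sampled distributions are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
# Two square-integrable random variables, simulated by sampling.
X = rng.normal(size=100_000)
Y = 2 * X + rng.normal(size=100_000)

lhs = abs(np.mean(X * Y))                              # |E[XY]|, empirically
rhs = np.sqrt(np.mean(X**2)) * np.sqrt(np.mean(Y**2))  # sqrt(E[X^2]) * sqrt(E[Y^2])
assert lhs <= rhs  # Cauchy-Schwarz holds on the sample
```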
In the continuous case, the analogous formula holds, with an integral replacing the summation and f(x, y) dx dy standing in as the "probability" of the point (x, y).
Finally, we are not quite at the definition of a Hilbert space yet. A (real) Hilbert
space is a real inner product space which satisfies an additional analytic property
called completeness, which we will not describe (for this, you will have to take a
course in functional analysis).
If Ω is finite, then H is finite-dimensional. Indeed, a basis is given by the indicators {1_ω}_{ω∈Ω}. However, in general H is infinite-dimensional, and we will soon run into
analytical issues which obscure the core ideas. Therefore, from this point forward, we
will behave as if H were finite-dimensional, when in reality it is not.
3 Projection
Now that we know that H is a Hilbert space, we would like to apply our knowledge
to the problem of estimating a random variable Y ∈ H. Clearly, if we could directly
observe Y , then estimation would not be a difficult problem. However, we are often
not given direct access to Y , and we are only allowed to observe some other random
variable X which is correlated with Y . The problem is then to find the best estimator
of Y , if we restrict ourselves to using only functions of our observation X. Even still,
finding the best function of X might be computationally prohibitive, and it may be
desired to only use linear functions, i.e., functions of the form a + bX for a, b ∈ R.
Notice that “linear functions of the form a + bX” can also be written as span{1, X},
a subspace of H, so we may formulate our problem more generally as the following: given a vector y ∈ V and a subspace U ⊆ V, find the closest point x ∈ U to y, i.e., minimize ‖y − x‖ over x ∈ U. The answer will turn out to be the projection of y onto the subspace U. We will explain this concept now.
Given a set S ⊆ V, we define the orthogonal complement of S, denoted S⊥, by S⊥ := {v ∈ V : ⟨v, s⟩ = 0 for all s ∈ S}. That is, S⊥ is the set of vectors which are orthogonal to everything in S. Check for yourself that S⊥ is a subspace.
Given a subspace U ⊆ V , what does it mean to “project” y onto U ? To get a feel
for the idea, imagine a slanted pole in broad daylight. One might say that the shadow
it casts on the ground is a "projection" of the pole onto the ground. From
this visualization, you might also realize that there are different types of projections,
depending on the location of the sun in the sky. The projection we are interested in is
the shadow cast when the sun is directly overhead, because this projection minimizes
the distance from the tip of the pole to the tip of the shadow; this is known as an
orthogonal projection.
Formally, the orthogonal projection onto a subspace U is the map P : V → U
such that P y := arg minx∈U ky − xk. In words, given an input y, P y is the closest
point in U to y. We claim that P y satisfies the following two conditions (see Figure 1):
P y ∈ U and y − P y ∈ U⊥. (1)
Figure 1: y − P y is orthogonal to U .
(b) Suppose that U is finite-dimensional, n := dim U, with basis {v_i}_{i=1}^n. Suppose that the basis is orthonormal, that is, the vectors are pairwise orthogonal and ‖v_i‖ = 1 for each i = 1, . . . , n. Show that P y = ∑_{i=1}^n ⟨y, v_i⟩ v_i. (Note: If we take U = R^n with the standard inner product, then P can be represented as a matrix in the form P = ∑_{i=1}^n v_i v_iᵀ.)
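To make the matrix form concrete, here is a small numerical illustration (a sketch; the subspace and test vector are arbitrary choices) of P = ∑_{i=1}^n v_i v_iᵀ and of conditions (1):

```python
import numpy as np

# Orthonormal basis for a 2-dimensional subspace U of R^3.
v1 = np.array([1.0, 1.0, 0.0]) / np.sqrt(2)
v2 = np.array([0.0, 0.0, 1.0])

# Projection matrix P = v1 v1^T + v2 v2^T.
P = np.outer(v1, v1) + np.outer(v2, v2)

y = np.array([1.0, 2.0, 3.0])
Py = P @ y

# Conditions (1): P y lies in U, and y - P y is orthogonal to U.
assert np.allclose(P @ Py, Py)         # projecting twice changes nothing
assert np.isclose((y - Py) @ v1, 0.0)  # residual orthogonal to v1
assert np.isclose((y - Py) @ v2, 0.0)  # residual orthogonal to v2
```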
3.1 Gram-Schmidt Process
Let us see what (1) tells us in the case when we have a finite basis {v_i}_{i=1}^n for U. The condition P y ∈ U says that P y is a linear combination ∑_{i=1}^n c_i v_i of the basis {v_i}_{i=1}^n. The condition y − P y ∈ U⊥ is equivalent to saying that y − P y is orthogonal to v_i for i = 1, . . . , n. These two conditions give us a system of equations which we can, in principle, solve:
⟨y − ∑_{i=1}^n c_i v_i, v_j⟩ = 0, j = 1, . . . , n. (2)
However, what if the basis {v_i}_{i=1}^n is orthonormal, as in Exercise 2(b)? Then, the computation of P y is simple: for each i = 1, . . . , n, ⟨y, v_i⟩ gives the component of the projection in the direction v_i.
Fortunately, there is a simple procedure for taking a basis and converting it into
an orthonormal basis. It is known as the Gram-Schmidt process. The algorithm
is iterative: at step j ∈ {1, . . . , n}, we will have an orthonormal set of vectors {u_i}_{i=1}^j so that span{u_i}_{i=1}^j = span{v_i}_{i=1}^j. To start, take u_1 := v_1/‖v_1‖. Now, at step j ∈ {1, . . . , n − 1}, consider P_{u_1,...,u_j} v_{j+1}, where P_{u_1,...,u_j} is the orthogonal projection onto span{u_1, . . . , u_j}. Because of (1), we know that w_{j+1} := v_{j+1} − P_{u_1,...,u_j} v_{j+1} lies in (span{u_i}_{i=1}^j)⊥, so that w_{j+1} is orthogonal to u_1, . . . , u_j. Also, we know that w_{j+1} ≠ 0, because if v_{j+1} = P_{u_1,...,u_j} v_{j+1}, then v_{j+1} ∈ span{u_1, . . . , u_j}, but this is impossible since we started with a basis. Thus, we may take u_{j+1} := w_{j+1}/‖w_{j+1}‖ and add it to our orthonormal set.
Computation of the Gram-Schmidt process is not too intensive because subtracting the projection onto span{u_i}_{i=1}^j only requires computing the projection onto an orthonormal basis. From Exercise 2(b), we can explicitly describe the procedure:
1. Let u_1 := v_1/‖v_1‖.
2. For j = 1, . . . , n − 1:
   (a) Set w_{j+1} := v_{j+1} − ∑_{i=1}^j ⟨v_{j+1}, u_i⟩ u_i.
   (b) Set u_{j+1} := w_{j+1}/‖w_{j+1}‖.
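The steps above translate directly into code; the following is a minimal sketch (the function name and test vectors are my own choices):

```python
import numpy as np

def gram_schmidt(vs):
    """Orthonormalize a list of linearly independent vectors (steps 1-2 above)."""
    us = []
    for v in vs:
        # (a) Subtract the projection onto the span of the previous u's.
        w = v.astype(float) - sum(np.dot(v, u) * u for u in us)
        # (b) Normalize.
        us.append(w / np.linalg.norm(w))
    return us

basis = [np.array([1.0, 1.0, 0.0]), np.array([1.0, 0.0, 1.0])]
u1, u2 = gram_schmidt(basis)
# u1, u2 are orthonormal and span the same subspace as the input.
```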
Given X, Y ∈ H, minimize E[(Y − a − bX)²] over a, b ∈ R. The solution to this problem is called the linear least squares estimator (LLSE).
Using our previous notation, E[(Y − a − bX)²] = ‖Y − a − bX‖², and thus the solution is arg min_{Ŷ∈U} ‖Y − Ŷ‖², where U is the subspace U = span{1, X}. We have already solved this problem! If we apply (2) directly, then we obtain the equations:
E[Y − a − bX] = 0,
E[(Y − a − bX)X] = 0.
We can solve for a and b. Alternatively, we can apply the Gram-Schmidt process to {1, X} (assuming X is not constant) to convert it to the orthonormal basis {1, (X − E[X])/√(var X)}. Now, applying Exercise 2(b),
L[Y | X] := E[Y] + E[Y (X − E[X])/√(var X)] (X − E[X])/√(var X) = E[Y] + (cov(X, Y)/var X)(X − E[X]).
This is a nice result, so let us package it as a theorem:
Theorem 1 (LLSE). For X, Y ∈ H, where X is not a constant, the LLSE of Y given X is
L[Y | X] = E[Y] + (cov(X, Y)/var X)(X − E[X]).
Furthermore, the squared error of the LLSE is
E[(Y − L[Y | X])²] = var Y − cov(X, Y)²/var X.
Proof. Since we have already proven the formula for the LLSE, let us prove the
second assertion, geometrically. Notice that both sides of the equation for the squared
error are unaffected if we replace X and Y with X − E[X] and Y − E[Y ], so we may
assume that X and Y are zero mean. Now consider Figure 2. The angle θ at 0 satisfies
cos θ = ⟨X, Y⟩/(‖X‖‖Y‖).
Thus, from geometry,
E[(Y − L[Y | X])²] = ‖Y − L[Y | X]‖² = ‖Y‖² (sin θ)² = ‖Y‖² (1 − ⟨X, Y⟩²/(‖X‖²‖Y‖²)) = var Y − cov(X, Y)²/var X.
Figure 2: The variance of the error of the LLSE can be found geometrically.
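To see Theorem 1 in action, one can estimate the relevant moments from simulated data (a sketch; the particular distributions are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, size=200_000)
Y = 1.0 + 3.0 * X + rng.normal(size=200_000)  # Y is (noisily) linear in X

cov_XY = np.cov(X, Y, bias=True)[0, 1]
b = cov_XY / np.var(X)
L = Y.mean() + b * (X - X.mean())             # L[Y | X] from Theorem 1

# The empirical squared error matches var Y - cov(X, Y)^2 / var X.
mse = np.mean((Y - L)**2)
assert np.isclose(mse, np.var(Y) - cov_XY**2 / np.var(X))
```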
This observation is crucial for the design of online algorithms, because it tells us
how to recursively update our LLSE after collecting new observations. We will revisit
this topic when we discuss tracking and the Kalman filter.
onto span{1, X, . . . , X^d}. The equations become more difficult to solve and require knowledge of higher moments of X and Y (in fact, we must now have X^d ∈ H, i.e., E[X^{2d}] < ∞), but nonetheless it is possible.
The same is true as long as we are projecting onto a linearly independent set
of random variables. The same techniques can be used to compute the best linear
combination of 1, sin X, and cos X to estimate Y .
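As an illustration of this flexibility, the following sketch solves the system (2) for the basis {1, sin X, cos X} using empirical moments (the simulated data and true coefficients are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.uniform(-np.pi, np.pi, size=100_000)
Y = 1.0 * np.sin(X) + 0.5 * np.cos(X) + rng.normal(scale=0.1, size=100_000)

# Columns are the basis random variables 1, sin X, cos X.
B = np.column_stack([np.ones_like(X), np.sin(X), np.cos(X)])

# System (2): Gram matrix of inner products <v_i, v_j> = E[v_i v_j] on the left,
# <Y, v_j> = E[Y v_j] on the right.
G = B.T @ B / len(X)
rhs = B.T @ Y / len(X)
c = np.linalg.solve(G, rhs)  # coefficients of the best linear combination
# c should be close to (0, 1, 0.5), the coefficients used to generate Y.
```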
¹ This section requires more advanced linear algebra concepts. Do not worry if you do not yet
have the required linear algebra background to understand all of the arguments. They are presented
here to offer a different perspective on the material; see also the treatment in Walrand’s textbook,
which is more direct.
where Σ_{Y,X} := E[Y Xᵀ].
What if we wish to predict more than one observation, i.e., Y = (Y1 , . . . , Ym )
where m is a positive integer? Then, the LLSE of the vector Y is the vector whose
components are the estimates of the corresponding components of Y , so
where Σ_Y := E[Y Yᵀ] and Σ_{X,Y} = Σ_{Y,X}ᵀ = E[XYᵀ]. In the first line, we use the fact that Y − L[Y | X] is orthogonal to L[Y | X]. In the second line, we use
We are also given an n × 1 observation vector y. We may assume for simplicity that the design matrix and observation vector have been centered, i.e., subtract appropriate matrices from X and y such that for each j ∈ {1, . . . , d}, ∑_{i=1}^n X_{i,j} = 0, and ∑_{i=1}^n y_i = 0. The problem is the following:
In other words, (X, Y ) represents a uniformly randomly chosen row of the design
matrix and observation vector. Now, for β ∈ Rd , observe:
‖y − Xβ‖₂² = n · (1/n) ∑_{i=1}^n (y_i − x_iᵀβ)² = n E[(Y − Xᵀβ)²].
Hence, finding the weight vector β that minimizes the sum of squared residuals in
the non-Bayesian formulation is the same as finding the weight vector β such that
L[Y | X] = βᵀX. However, we already know that L[Y | X] = Σ_{Y,X} Σ_X^{-1} X, so the solution is given by β = Σ_X^{-1} Σ_{X,Y}. Moreover,
Σ_X = E[XXᵀ] = (1/n) ∑_{i=1}^n x_i x_iᵀ = (1/n) XᵀX,
Σ_{Y,X} = E[Y Xᵀ] = (1/n) ∑_{i=1}^n y_i x_iᵀ = (1/n) yᵀX.
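These identities say the regression coefficients can be computed either from the empirical second moments or by solving least squares directly; the following sketch checks that the two agree (the data are simulated for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 1000, 3
X = rng.normal(size=(n, d))                    # design matrix
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)

# Center the design matrix and observations, as assumed in the text.
X = X - X.mean(axis=0)
y = y - y.mean()

Sigma_X = X.T @ X / n          # Sigma_X = (1/n) X^T X
Sigma_XY = X.T @ y / n         # Sigma_{X,Y} = (1/n) X^T y
beta = np.linalg.solve(Sigma_X, Sigma_XY)      # beta = Sigma_X^{-1} Sigma_{X,Y}

# Same answer as ordinary least squares on the centered data.
beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(beta, beta_ls)
```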
5 Minimum Mean Square Estimation (MMSE)
We will now drop the restriction to linear functions of X or linear functions of
observations X1 , . . . , Xn , and we will instead find the best arbitrary function of X to
estimate Y .
Unlike the case of linear estimation, where we looked at the span of a finite number
of random variables, we are now looking at the projection onto the subspace of all
functions of X, which is quite difficult to visualize or even imagine. In fact, you might wonder whether a best function φ, i.e., a minimizer of E[(Y − φ(X))²] over all functions of X, is always guaranteed to exist. The answer is yes, such a function exists and is essentially unique, although the details are technical.
The conditional expectation of Y given X is formally defined as the function
of X, denoted E(Y | X), such that for all bounded continuous functions φ,
E[(Y − E(Y | X)) φ(X)] = 0. (3)
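Property (3) can be checked empirically in a simple discrete example (a sketch; the joint distribution is an arbitrary choice) where E(Y | X) is known in closed form:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.integers(0, 3, size=500_000).astype(float)  # X uniform on {0, 1, 2}
Y = X**2 + rng.normal(size=X.size)                  # so E(Y | X) = X^2

residual = Y - X**2
# Per (3), the residual is (approximately) orthogonal to every function of X.
for phi in (np.sin, np.cos, lambda x: x**3):
    assert abs(np.mean(residual * phi(X))) < 0.05
```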