Linear Algebra Review: CSC2515 - Machine Learning - Fall 2002
$$f(\alpha x + \beta y) = \alpha f(x) + \beta f(y) \tag{1}$$

for all scalars $\alpha$, $\beta$ and all vectors x, y. In other words,
scaling the input scales the output and summing inputs
sums their outputs. Now here is the amazing thing. All
functions which are linear, in the sense defined above, can
be written in the form of a matrix F which left multiplies
the input argument x:
$$y = Fx \tag{2}$$

and, conversely, multiplication by any matrix F is itself a linear function of its input:

$$F(a + b) = Fa + Fb \tag{3}$$

$$F(\alpha a) = \alpha Fa \tag{4}$$
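To make this concrete, here is a quick numerical check of the linearity properties above (a minimal sketch; the particular matrix, vectors, and scalars are arbitrary choices, not from the notes):

```matlab
% Check that multiplication by a matrix F is a linear function:
% F*(alpha*a + beta*b) should equal alpha*(F*a) + beta*(F*b).
F = [1 2; 3 4; 5 6];       % an arbitrary 3x2 matrix
a = [1; -1]; b = [2; 0.5]; % arbitrary input vectors
alpha = 3; beta = -0.7;    % arbitrary scalars

lhs = F * (alpha*a + beta*b);
rhs = alpha*(F*a) + beta*(F*b);
disp(max(abs(lhs - rhs)))  % 0, up to floating point roundoff
```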
IV. INVERSES AND DETERMINANTS

A matrix A is invertible if there is another matrix, written $A^{-1}$ and called its inverse, such that:

$$AA^{-1} = A^{-1}A = I \tag{7}$$
The two most important matrix equations are the system of linear equations:

$$Ax = b \tag{8}$$

and the eigenvector equation:

$$Ax = \lambda x \tag{9}$$

which between them cover a large number of optimization and constraint satisfaction problems. As we've written them above, x is a vector, but these equations also have natural extensions to the case where there are many vectors simultaneously satisfying the equation: $AX = B$ or $AX = X\Lambda$.
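As a preview, both equations are one-liners in MATLAB (a sketch; the example matrix is an arbitrary choice, and `\` and `eig` are the standard built-ins):

```matlab
A = [2 1; 1 3];            % arbitrary example matrix
b = [1; 5];

x = A \ b;                 % solve the linear system A*x = b
[V, L] = eig(A);           % columns of V are eigenvectors, diag(L) eigenvalues

disp(norm(A*x - b))        % ~0: x satisfies A*x = b
disp(norm(A*V - V*L))      % ~0: A*V = V*Lambda, the many-vector form
```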
V. SYSTEMS OF LINEAR EQUATIONS
A system of linear equations looks something like this:

$$\begin{aligned} x + 4y + 2z &= 1 \\ x + y + z &= 2 \end{aligned}$$
Typically, we express this system as a single matrix equation something like this: Ax = b, where A is an m by
n matrix, x is an n column vector and b is an m column
vector. The number of unknowns is n and the number of
equations or constraints is m. Here is another simple example:
$$\begin{bmatrix} 2 & 1 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = 5 \tag{10}$$
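In MATLAB the setup is direct (a minimal sketch encoding the examples above; the coefficient values follow the reconstructed systems as written):

```matlab
% The examples above in matrix form: A is m-by-n, x holds the n unknowns,
% and b holds the m right hand sides.
A1 = [1 4 2;
      1 1 1];              % first example: m = 2 equations, n = 3 unknowns
b1 = [1; 2];

A2 = [2 1];                % equation (10): m = 1 equation, n = 2 unknowns
b2 = 5;

[m, n] = size(A1);
fprintf('first system: m = %d equations, n = %d unknowns\n', m, n);
```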
How do we go about solving this system of equations? Well, if A is known, then we are trying to find an x corresponding to the b on the right hand side. (Why? Finding b given A and x is pretty easy: just multiply. And for a single x there are usually a great many matrices A which satisfy the equation; one example, assuming the elements of x do not sum to zero, is $A = b\mathbf{1}^\top / \sum_i x_i$. The only interesting problem left, then, is to find x.) This kind of equation is really a problem statement. It says "hey, we applied the function A and got the output b; what was the input x?" The matrix A is dictated to us by our problem, and represents our model of how the system we are studying converts inputs to outputs. The vector b is the output that we observe (or desire); we know it. The vector x is the set of inputs; it is what we are trying to find.
Remember that there are two ways of thinking about this kind of equation. One is rowwise, as a set of m equations or constraints that correspond geometrically to m intersecting constraint surfaces, for example:

$$Ax = \begin{bmatrix} x_1 - x_2 \\ x_1 + x_2 \end{bmatrix} = \begin{bmatrix} 2 \\ 1 \end{bmatrix} \tag{11}$$

The other is columnwise, in which we think of b as a linear combination of the columns of A, weighted by the elements of x:

$$x_1 \begin{bmatrix} 1 \\ 1 \end{bmatrix} + x_2 \begin{bmatrix} -1 \\ 1 \end{bmatrix} = \begin{bmatrix} 2 \\ 1 \end{bmatrix} \tag{12}$$

To solve for x (assuming $A^\top A$ is invertible), we can premultiply both sides of $Ax = b$ first by $A^\top$ and then by $(A^\top A)^{-1}$:

$$A^\top A x = A^\top b \tag{13}$$

$$Ix = (A^\top A)^{-1} A^\top b \tag{14}$$

$$x = (A^\top A)^{-1} A^\top b \tag{15}$$

When no exact solution exists, this same expression is also the least-squares solution: it minimizes the squared error $e = \|Ax - b\|^2 = x^\top A^\top A x - 2x^\top A^\top b + b^\top b$.
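In MATLAB, equation (15) can be checked directly against the built-in solvers (a sketch; the tall matrix below is an arbitrary choice, picked so that $A^\top A$ is invertible):

```matlab
A = [1 0; 1 1; 1 2];       % 3 equations, 2 unknowns: overdetermined
b = [1; 2; 2];             % no exact solution in general

x1 = (A'*A) \ (A'*b);      % equation (15): x = (A'A)^{-1} A'b
x2 = pinv(A) * b;          % Moore-Penrose pseudoinverse
x3 = A \ b;                % backslash also returns the least-squares x here

disp([x1 x2 x3])           % all three columns agree
```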
Even this solution may not be unique, however. For example, consider the equation:

$$\begin{bmatrix} 1 & -1 \end{bmatrix} x = 4 \tag{16}$$

in which $A = \begin{bmatrix} 1 & -1 \end{bmatrix}$. This equation constrains the difference between the two elements of x to be 4, but the sum
can be as large or small as we want. As you can read in
the appendix, this happens because the matrix A has a null
space and we can add any amount of any vector in the null
space to x without affecting Ax.
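We can see this null space explicitly (a small sketch; `null` is the built-in that returns an orthonormal basis for the null space):

```matlab
A = [1 -1];                % the matrix from equation (16)
x = [5; 1];                % one particular solution: 5 - 1 = 4
v = null(A);               % null space basis: proportional to [1; 1]

disp(A * x)                % 4
disp(A * (x + 10*v))       % still 4: adding null space vectors changes nothing
```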
We can take things one step further to get around this
problem also. The answer is to ask for the minimum norm
vector x that still minimizes the above error. This breaks
the degeneracies in both the exact and inexact cases and
leaves us with solution vectors that have no projection into
the null space of A. In terms of our cost function, this
corresponds to adding an infinitesimal penalty on $x^\top x$:

$$e = \lim_{\epsilon \to 0} \left[ x^\top A^\top A x - 2 x^\top A^\top b + b^\top b + \epsilon\, x^\top x \right] \tag{17}$$

and the solution becomes:

$$x = \lim_{\epsilon \to 0} \left[ (A^\top A + \epsilon I)^{-1} A^\top b \right] \tag{18}$$
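Numerically, the $\epsilon \to 0$ limit in (18) agrees with MATLAB's `pinv`, which returns exactly this minimum norm least-squares solution (a sketch; the tiny fixed epsilon below stands in for the limit):

```matlab
A = [1 -1];  b = 4;        % the underdetermined system from (16)

eps_ = 1e-8;               % small epsilon standing in for the limit
x_ridge = (A'*A + eps_*eye(2)) \ (A'*b);   % equation (18)
x_pinv  = pinv(A) * b;     % minimum norm solution: [2; -2]

disp([x_ridge x_pinv])     % the two columns agree
```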
Now, of course, actually computing these solutions efficiently and in a numerically stable way is the topic of much study in numerical methods. However, in MATLAB you don't have to worry about any of this: you can just type x = A\b and let someone else worry about it.
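For instance (a minimal sketch with an arbitrary square system; the backslash operator chooses a suitable solver for the structure of A):

```matlab
A = [2 1; 1 3];            % arbitrary square, full rank system
b = [3; 5];

x = A \ b;                 % solve A*x = b; no explicit inverse is formed
disp(norm(A*x - b))        % ~0: x satisfies the system
```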
CSC251502
Now suppose we have many pairs of vectors $x_n$ and $b_n$, all of which we want to satisfy the same equation $Ax_n = b_n$. If we stack the vectors $x_n$ beside each other as the columns of a large matrix X and do the same for $b_n$ to form B, we can write the problem as a large matrix equation:
$$AX = B \tag{19}$$
There are two things we could do here. If, as before, A is known, we could find X given B. (Once again, finding B given X is trivial.) To do this we would just need to apply the techniques above to solve the system $Ax_n = b_n$ independently for each column n.
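Backslash handles the stacked version directly: solving with a matrix right hand side is the same as solving each column separately (a sketch; the matrices are arbitrary examples):

```matlab
A = [2 0; 1 1];            % known square matrix
B = [2 4; 3 1];            % each column is one b_n

X = A \ B;                 % solves A*X = B, i.e. A*x_n = b_n for every column
X2 = zeros(size(B));       % equivalent column-by-column loop:
for n = 1:size(B,2)
    X2(:,n) = A \ B(:,n);
end
disp(norm(X - X2))         % ~0: the two approaches agree
```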
But there is something else we could do. If we were given both X and B, we could try to find a single A which satisfied the equations. In essence we are fitting a linear function given its inputs X and corresponding outputs B. This problem is called linear regression. (Don't forget to append a constant 1 to each $x_n$, i.e. add a row of ones to the bottom of X, if you want to fit an affine function, i.e. one with an offset.)
Once again, there are only a very few cases in which there exists an A which exactly satisfies the equations. (If there is, X will be square and invertible.)
But we can set things up the same way as before and
ask for the least-squares A which minimizes:
$$e = \sum_n \| A x_n - b_n \|^2 \tag{20}$$

The minimizing A (assuming $XX^\top$ is invertible) is:

$$A = B X^\top (X X^\top)^{-1} \tag{21}$$
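Here is (21) in action: we generate data from a known map and recover it (a minimal sketch; the true A and the inputs are arbitrary, and noise is omitted so the fit is exact):

```matlab
A_true = [1 2; 0 -1];      % the linear map we pretend not to know
X = randn(2, 50);          % 50 input vectors x_n as columns
B = A_true * X;            % corresponding outputs b_n

A_hat = B * X' / (X * X'); % equation (21): A = B X' (X X')^{-1}
disp(norm(A_hat - A_true)) % ~0: the fit recovers A exactly
```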
Once again, we can regularize the solution by penalizing the norm of A, minimizing:

$$e = \sum_n \| b_n - A x_n \|^2 + \lambda \|A\|^2 \tag{22}$$

which gives:

$$A = B X^\top (X X^\top + \lambda I)^{-1} \tag{23}$$
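And the regularized fit (23), assuming the penalty takes the simple $\lambda \|A\|^2$ form written above (the noise level and $\lambda$ are arbitrary choices):

```matlab
A_true = [1 2; 0 -1];
X = randn(2, 50);
B = A_true * X + 0.1*randn(2, 50);             % noisy outputs

lambda = 0.5;                                  % regularization strength
A_ridge = B * X' / (X * X' + lambda*eye(2));   % equation (23)
disp(A_ridge)                                  % close to A_true, slightly shrunk
```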
then mapped into the column space. This means that all of its null space components disappear and all of its row space components remain. In other words, A "cleans it up" by first removing any of its unfortunate attributes until it looks just like one of the lucky vectors. Then A maps this cleaned up version of the vector into the column space in $\mathbb{R}^m$.
The number of linearly independent rows (or columns)
of A is called the rank (denoted r above) and it is the
dimension of the column space and also of the row space.
The rank is of course no bigger than the smaller dimension
of A. It is the dimension of the bottleneck through which
vectors processed by A must pass.
The column space (or range) of A is the space spanned by its column vectors, or in other words, all the vectors that could ever be created as linear combinations of its columns. It is a subspace of the entire b space $\mathbb{R}^m$. So when we form a product like Ax, no matter what we pick for x we can only end up in a limited subspace of $\mathbb{R}^m$ called the column space. The row space is a similar thing, except that it is the space spanned by the rows of A. It has the same dimension as the column space but is not necessarily the same space. When we form a product Ax, no matter what we pick for x, only the part of x that lives in the row space determines what the answer is; the part of x that lives outside the row space (the null space component) is irrelevant because it gets projected out by the matrix.
It is clear that the zero vector is in every column space, since we can combine any columns to get it by simply setting the coefficient of every column to zero, namely x = z. The smallest possible column space is produced by the zero matrix: its column space consists of only the zero vector. The largest possible column space is produced by a square matrix A with linearly independent columns; its column space is all of $\mathbb{R}^n$ (where n is the size of A).

However, it may be possible to combine the columns of a matrix using some nonzero coefficients and still have them all cancel each other out to give zero; any such solutions for x are said to lie in the null space of the matrix A. That is, all solutions to Ax = z except x = z form the null space. The null space is the part of the input space that is orthogonal to the row space. Intuitively, any vectors that lie purely in the null space are "killed" (projected out) by A, since they map to the zero vector.
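MATLAB's `rank`, `orth` and `null` built-ins make these subspaces concrete (a sketch; the rank-deficient matrix below is an arbitrary example):

```matlab
A = [1 2 3;
     2 4 6];               % rank 1: the second row is twice the first

r = rank(A);               % dimension of both the row space and column space
C = orth(A);               % orthonormal basis for the column space (in R^m)
N = null(A);               % orthonormal basis for the null space (in R^n)

fprintf('rank r = %d, null space dim = %d (n - r)\n', r, size(N,2));
disp(norm(A * N))          % ~0: null space vectors are killed by A
```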
A completely complementary picture exists when we talk about the space $\mathbb{R}^m$ and the matrix $A^\top$. In particular, $A^\top$ has a row space (which is the column space of A) and a column space (which is the row space of A) and also a null space (which is curiously called the left null space of A).

[Figure: the four fundamental subspaces of a matrix A. The input space $\mathbb{R}^n$ is the mixture of the row space (RS, dimension r) and the null space (NS, dimension n - r); the output space $\mathbb{R}^m$ is the mixture of the column space (CS, dimension r) and the left null space (LNS, dimension m - r).]
Invertibility
We saw above that any matrix maps its row space invertibly into its column space. Some special matrices map
their entire input space invertibly into their entire output
space. These are known as invertible or full rank or nonsingular matrices. It is clear upon some reflection that
such matrices have no null space since if they did then
some non-zero input vectors would get mapped onto the
zero vector and it would be impossible to recover them
(making the mapping non-invertible). In other words, for
such matrices, the row space fills the whole input space.
Formally, we say that a matrix A is invertible if there exists a matrix $A^{-1}$ such that $AA^{-1} = I$. The matrix $A^{-1}$ is called the inverse of A and is unique if it exists. The most common case is square, full rank matrices, for which the inverse can be found explicitly using many methods, for example Gauss-Jordan elimination. It is one of the astounding facts of computational algebra that such methods run in only $O(n^3)$ time, which is the same as matrix multiplication.
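A quick check of the definition (a sketch with arbitrary matrices; `inv` is fine for illustration, though backslash is preferred when actually solving systems):

```matlab
A = [2 1; 1 3];            % square, full rank, hence invertible
Ainv = inv(A);

disp(norm(A*Ainv - eye(2)))    % ~0: A * A^{-1} = I
disp(norm(Ainv*A - eye(2)))    % ~0: A^{-1} * A = I
disp(rank([1 2; 2 4]))         % 1 < 2: this matrix is singular, no inverse
```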