The LMS Objective Function. Global Solution. Pseudoinverse of a Matrix. Optimization (Learning) by Gradient Descent. LMS (Widrow-Hoff) Algorithms. Convergence of the Batch LMS Rule

CS 295 (UVM), Fall 2013
[Figure: a linear threshold unit (LTU) with inputs $x_1, \dots, x_n$, weights $w_1, \dots, w_n$, and bias weight $w_0$ feeding a sign activation.]

$$y = \mathrm{sgn}\left(w^T x + w_0\right) = \begin{cases} 1, & \text{if } w^T x + w_0 > 0, \\ -1, & \text{if } w^T x + w_0 \le 0. \end{cases}$$
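As a quick companion to the definition above (not from the original slides), here is a minimal LTU in Python/NumPy; the function name `ltu` and the sample values are illustrative, and sgn is taken to map nonpositive activations to $-1$, matching the case split above.

```python
import numpy as np

def ltu(w, w0, x):
    """Linear threshold unit: y = sgn(w^T x + w0), with sgn(a) = -1 for a <= 0."""
    return 1 if w @ x + w0 > 0 else -1

# Example: with w = (1, -1) and w0 = 0.5, the point x = (2, 1) lies on the positive side.
print(ltu(np.array([1.0, -1.0]), 0.5, np.array([2.0, 1.0])))  # prints 1
```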
We seek weights such that

$$\mathrm{sgn}\left(w^T x_i + w_0\right) = \ell_i, \quad \text{for } i = 1, 2, \dots, m.$$

Equivalently, using homogeneous coordinates (or augmented feature vectors),

$$\mathrm{sgn}\left(\hat{w}^T \hat{x}_i\right) = \ell_i, \quad \text{for } i = 1, 2, \dots, m, \text{ where } \hat{x}_i = \left(1, x_i^T\right)^T,$$

or,

$$\hat{w}^T \hat{x}'_i > 0, \quad \text{for } i = 1, 2, \dots, m, \text{ where } \hat{x}'_i = \ell_i \hat{x}_i.$$
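As a sketch (assuming a NumPy array `xs` of shape $(m, n)$ holding the $x_i$ as rows and a label vector `ls` with entries $\pm 1$; both names are mine), the normalized, augmented vectors $\hat{x}'_i = \ell_i \left(1, x_i^T\right)^T$ can be built in two lines:

```python
import numpy as np

def normalized_augmented(xs, ls):
    """Return a matrix whose i-th row is xhat'_i = l_i * (1, x_i^T)."""
    m = xs.shape[0]
    xhat = np.hstack([np.ones((m, 1)), xs])  # augment: prepend the constant component 1
    return ls[:, None] * xhat                # normalize: scale row i by its label l_i

xs = np.array([[1.0, 2.0], [3.0, 1.0]])
ls = np.array([1.0, -1.0])
print(normalized_augmented(xs, ls))          # [[ 1.  1.  2.]
                                             #  [-1. -3. -1.]]
```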
We seek a $\hat{w}$ for which

$$E(\hat{w}) = \sum_{i=1}^{m} \left(b_i - \hat{w}^T \hat{x}'_i\right)^2$$

equals zero. We call the above expression the LMS objective function, where LMS stands for least mean square. (N.B. we could normalize the above by dividing the right side by $m$.)
$$E(\hat{w}) = \sum_{i=1}^{m} \left(b_i - \hat{w}^T \hat{x}'_i\right)^2.$$

Other objective functions could be built from the residuals $b_i - \hat{w}^T \hat{x}'_i$, e.g.,

$$\sum_{i=1}^{m} \left(b_i - \mathrm{sgn}\left(\hat{w}^T \hat{x}'_i\right)\right), \qquad \sum_{i=1}^{m} \left(b_i - \mathrm{sgn}\left(\hat{w}^T \hat{x}'_i\right)\right)^2, \qquad \text{etc.}$$
The LMS Algorithm

If we define

$$X = \begin{pmatrix} \hat{x}'^T_1 \\ \hat{x}'^T_2 \\ \vdots \\ \hat{x}'^T_m \end{pmatrix} = \begin{pmatrix} \ell_1 & \hat{x}'_{1,1} & \hat{x}'_{1,2} & \cdots & \hat{x}'_{1,n} \\ \ell_2 & \hat{x}'_{2,1} & \hat{x}'_{2,2} & \cdots & \hat{x}'_{2,n} \\ \vdots & \vdots & \vdots & & \vdots \\ \ell_m & \hat{x}'_{m,1} & \hat{x}'_{m,2} & \cdots & \hat{x}'_{m,n} \end{pmatrix} \in \mathbb{R}^{m \times (n+1)},$$

then,

$$E(\hat{w}) = \sum_{i=1}^{m} \left(b_i - \hat{x}'^T_i \hat{w}\right)^2 = \left\| \begin{pmatrix} b_1 - \hat{x}'^T_1 \hat{w} \\ b_2 - \hat{x}'^T_2 \hat{w} \\ \vdots \\ b_m - \hat{x}'^T_m \hat{w} \end{pmatrix} \right\|^2 = \left\| b - X \hat{w} \right\|^2.$$
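The sum form and the matrix form of $E(\hat{w})$ are the same number, which is easy to confirm numerically; the random data below is purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))   # rows play the role of the xhat'_i
b = np.ones(5)                # target margins b_i
w = rng.normal(size=3)        # an arbitrary candidate weight vector

E_sum    = sum((b[i] - X[i] @ w) ** 2 for i in range(5))
E_matrix = np.linalg.norm(b - X @ w) ** 2
assert np.isclose(E_sum, E_matrix)   # identical up to rounding
```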
$$\begin{aligned} \left\| b - X \hat{w} \right\|^2 &= (b - X \hat{w})^T (b - X \hat{w}) \\ &= b^T b - \hat{w}^T X^T b - b^T X \hat{w} + \hat{w}^T X^T X \hat{w} \\ &= \hat{w}^T X^T X \hat{w} - 2\, b^T X \hat{w} + \|b\|^2, \end{aligned}$$

where

$$X^T X = \left(\hat{x}'_1, \dots, \hat{x}'_m\right) \begin{pmatrix} \hat{x}'^T_1 \\ \vdots \\ \hat{x}'^T_m \end{pmatrix} = \sum_{i=1}^{m} \hat{x}'_i \hat{x}'^T_i \in \mathbb{R}^{(n+1) \times (n+1)},$$

and

$$b^T X = (b_1, \dots, b_m) \begin{pmatrix} \hat{x}'^T_1 \\ \vdots \\ \hat{x}'^T_m \end{pmatrix} = \sum_{i=1}^{m} b_i \hat{x}'^T_i \in \mathbb{R}^{n+1}.$$
$$E(\hat{w}) = \hat{w}^T X^T X \hat{w} - 2\, b^T X \hat{w} + \|b\|^2.$$

Using calculus (e.g., Math 121), we can compute the gradient of $E(\hat{w})$ and algebraically determine a value $\hat{w}^*$ which makes each component vanish. That is, solve

$$\nabla E(\hat{w}) = \begin{pmatrix} \dfrac{\partial E}{\partial \hat{w}_0} \\ \dfrac{\partial E}{\partial \hat{w}_1} \\ \vdots \\ \dfrac{\partial E}{\partial \hat{w}_n} \end{pmatrix} = 0.$$

Differentiating the quadratic form gives

$$\nabla E(\hat{w}) = 2\, X^T X \hat{w} - 2\, X^T b.$$
$$\nabla E(\hat{w}) = 2\, X^T X \hat{w} - 2\, X^T b = 0$$

if

$$\hat{w}^* = \left(X^T X\right)^{-1} X^T b = X^\dagger b,$$

where

$$X^\dagger = \left(X^T X\right)^{-1} X^T \in \mathbb{R}^{(n+1) \times m}$$

is the pseudoinverse of $X$. Note that $X^\dagger X = \left(X^T X\right)^{-1} X^T X = I$.
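In NumPy, the global solution can be computed directly; `np.linalg.pinv` implements the pseudoinverse (and also covers the case where $X^T X$ is singular, which the closed form above assumes away), while `np.linalg.lstsq` solves the same least-squares problem more stably:

```python
import numpy as np

def lms_global_solution(X, b):
    """Closed-form minimizer of ||b - X w||^2: w* = pinv(X) @ b."""
    return np.linalg.pinv(X) @ b

# Numerically preferred alternative, same minimizer:
# w_star, *rest = np.linalg.lstsq(X, b, rcond=None)
```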
Example

The following example appears in R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, Second Edition, Wiley, NY, 2001, p. 241.

Given the dichotomy,

$$\mathcal{X}_4 = \left\{ \left((1,2)^T, 1\right), \left((2,0)^T, 1\right), \left((3,1)^T, -1\right), \left((2,3)^T, -1\right) \right\},$$

we obtain,

$$X = \begin{pmatrix} \hat{x}'^T_1 \\ \hat{x}'^T_2 \\ \hat{x}'^T_3 \\ \hat{x}'^T_4 \end{pmatrix} = \begin{pmatrix} 1 & 1 & 2 \\ 1 & 2 & 0 \\ -1 & -3 & -1 \\ -1 & -2 & -3 \end{pmatrix}.$$

Whence,

$$X^T X = \begin{pmatrix} 4 & 8 & 6 \\ 8 & 18 & 11 \\ 6 & 11 & 14 \end{pmatrix}, \quad \text{and} \quad X^\dagger = \left(X^T X\right)^{-1} X^T = \begin{pmatrix} \frac{5}{4} & \frac{13}{12} & \frac{3}{4} & \frac{7}{12} \\ -\frac{1}{2} & -\frac{1}{6} & -\frac{1}{2} & -\frac{1}{6} \\ 0 & -\frac{1}{3} & 0 & -\frac{1}{3} \end{pmatrix}.$$
Example (cont.)

Letting $b = (1, 1, 1, 1)^T$, then

$$\hat{w}^* = X^\dagger b = \begin{pmatrix} \frac{5}{4} & \frac{13}{12} & \frac{3}{4} & \frac{7}{12} \\ -\frac{1}{2} & -\frac{1}{6} & -\frac{1}{2} & -\frac{1}{6} \\ 0 & -\frac{1}{3} & 0 & -\frac{1}{3} \end{pmatrix} \begin{pmatrix} 1 \\ 1 \\ 1 \\ 1 \end{pmatrix} = \begin{pmatrix} \frac{11}{3} \\ -\frac{4}{3} \\ -\frac{2}{3} \end{pmatrix}.$$

Whence,

$$w_0 = \frac{11}{3} \quad \text{and} \quad w = \left(-\frac{4}{3}, -\frac{2}{3}\right)^T.$$

[Figure: the four training points and the resulting separating line in the $(x_1, x_2)$ plane.]
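The worked example is easy to check numerically; the matrix and targets below are exactly those of the slides:

```python
import numpy as np

X = np.array([[ 1.,  1.,  2.],
              [ 1.,  2.,  0.],
              [-1., -3., -1.],
              [-1., -2., -3.]])
b = np.ones(4)

w_star = np.linalg.pinv(X) @ b
print(w_star)       # [ 3.6667 -1.3333 -0.6667], i.e. (11/3, -4/3, -2/3)
print(X @ w_star)   # [1. 1. 1. 1.]: X w* = b exactly, so E(w*) = 0 and all margins are positive
```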
Given $f : \mathbb{R}^n \to \mathbb{R}$, a point $x$, and a displacement $\Delta x$, define $F(s) = f(x + s\,\Delta x)$, so that $F(0) = f(x)$, and,

$$F(1) = f(x + \Delta x).$$
By Taylor's theorem,

$$F(s) = F(0) + \frac{1}{1!} F'(0)\,s + \frac{1}{2!} F''(0)\,s^2 + \frac{1}{3!} F'''(0)\,s^3 + \cdots.$$

Our plan is to set $s = 1$ and replace $F(1)$ by $f(x + \Delta x)$, $F(0)$ by $f(x)$, etc.

To evaluate $F'(0)$ we will invoke the multivariate chain rule, e.g.,

$$\frac{d}{ds} f\big(u(s), v(s)\big) = \frac{\partial f}{\partial u}(u, v)\,u'(s) + \frac{\partial f}{\partial v}(u, v)\,v'(s).$$

Thus,

$$\begin{aligned} F'(s) = \frac{dF}{ds}(s) &= \frac{d}{ds} f\left(x_1 + s\,\Delta x_1, \dots, x_n + s\,\Delta x_n\right) \\ &= \frac{\partial f}{\partial x_1}(x + s\,\Delta x)\,\frac{d}{ds}\left(x_1 + s\,\Delta x_1\right) + \cdots + \frac{\partial f}{\partial x_n}(x + s\,\Delta x)\,\frac{d}{ds}\left(x_n + s\,\Delta x_n\right) \\ &= \frac{\partial f}{\partial x_1}(x + s\,\Delta x)\,\Delta x_1 + \cdots + \frac{\partial f}{\partial x_n}(x + s\,\Delta x)\,\Delta x_n. \end{aligned}$$
Thus,

$$F'(0) = \frac{\partial f}{\partial x_1}(x)\,\Delta x_1 + \cdots + \frac{\partial f}{\partial x_n}(x)\,\Delta x_n = \nabla f(x)^T \Delta x.$$
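The identity $F'(0) = \nabla f(x)^T \Delta x$ can be sanity-checked with a one-sided finite difference; the test function $f$ below is my own arbitrary choice:

```python
import numpy as np

f     = lambda x: x[0]**2 + 3.0 * x[0] * x[1]            # a smooth test function
gradf = lambda x: np.array([2*x[0] + 3*x[1], 3*x[0]])    # its gradient, by hand

x  = np.array([1.0, 2.0])
dx = np.array([0.3, -0.1])
F  = lambda s: f(x + s * dx)

h  = 1e-6
fd = (F(h) - F(0)) / h          # finite-difference estimate of F'(0)
print(fd, gradf(x) @ dx)        # both approximately 2.1
```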
Setting $s = 1$ and discarding terms beyond first order,

$$f(x + \Delta x) = f(x) + \nabla f(x)^T \Delta x + O\left(\|\Delta x\|^2\right) = f(x) + \|\nabla f(x)\|\,\|\Delta x\| \cos\theta + O\left(\|\Delta x\|^2\right),$$

where $\theta$ defines the angle between $\nabla f(x)$ and $\Delta x$. If $\|\Delta x\| \ll 1$, then

$$\Delta f = f(x + \Delta x) - f(x) \approx \|\nabla f(x)\|\,\|\Delta x\| \cos\theta,$$

which is most negative when $\cos\theta = -1$, i.e., when $\Delta x$ points in the direction of $-\nabla f(x)$. This motivates the gradient descent update rule,

$$\hat{w}(t + 1) = \hat{w}(t) - \eta\,\nabla E\big(\hat{w}(t)\big),$$

where $\eta > 0$ is the learning rate (step size).
Given the dichotomy

$$\mathcal{X}_m = \{(x_1, \ell_1), \dots, (x_m, \ell_m)\}$$

of $m$ feature vectors $x_i \in \mathbb{R}^n$ with $\ell_i \in \{-1, 1\}$ for $i = 1, \dots, m$, we construct the set of normalized, augmented feature vectors $\hat{x}'_i = \ell_i \left(1, x_i^T\right)^T$, and minimize

$$E(\hat{w}) = \frac{1}{2} \sum_{i=1}^{m} \left(\hat{w}^T \hat{x}'_i - b_i\right)^2 = \frac{1}{2} \left\| X \hat{w} - b \right\|^2$$

(the factor of $\frac{1}{2}$ simplifies the gradient), whence

$$\nabla E(\hat{w}) = \sum_{i=1}^{m} \left(\hat{w}^T \hat{x}'_i - b_i\right) \hat{x}'_i = X^T (X \hat{w} - b).$$
The batch LMS rule is gradient descent on $E$,

$$\hat{w}(t + 1) = \hat{w}(t) - \eta\,\nabla E\big(\hat{w}(t)\big),$$

i.e.,

$$\hat{w}(t + 1) = \hat{w}(t) + \eta \sum_{i=1}^{m} \left(b_i - \hat{w}(t)^T \hat{x}'_i\right) \hat{x}'_i.$$

Alternatively, one can abstract the sequential LMS, or Widrow-Hoff, rule from the above:

$$\hat{w}(t + 1) = \hat{w}(t) + \eta \left(b - \hat{w}(t)^T \hat{x}'(t)\right) \hat{x}'(t).$$
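A minimal batch implementation of the rule above; the step size `eta` and iteration count are illustrative defaults, not values from the slides (the condition `eta` must satisfy for convergence is derived at the end of this section):

```python
import numpy as np

def batch_lms(X, b, eta=0.01, iters=1000):
    """Gradient descent on E(w) = 0.5 * ||X w - b||^2 over the full batch."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        w -= eta * X.T @ (X @ w - b)   # w(t+1) = w(t) - eta * grad E(w(t))
    return w
```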
Two sequential update rules:

$$\hat{w}(t + 1) = \hat{w}(t) + \eta \left[b - \hat{w}(t)^T \hat{x}'(t)\right] \hat{x}'(t),$$

and, replacing the linear response by its sign,

$$\hat{w}(t + 1) = \hat{w}(t) + \frac{\eta}{2} \left[b - \mathrm{sgn}\big(\hat{w}(t)^T \hat{x}'(t)\big)\right] \hat{x}'(t).$$

Sequential rules are well suited to real-time implementations, as only the current values of the weights, i.e., the configuration of the LTU itself, need to be stored. They also work with dichotomies of infinite sets.
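A sketch of the sequential rule, cycling through the rows of the matrix of normalized, augmented vectors; the constant margin `b`, the fixed `eta`, and the sweep count are all illustrative choices:

```python
import numpy as np

def widrow_hoff(Xp, b=1.0, eta=0.05, sweeps=100):
    """Sequential LMS: w += eta * (b - w^T x'(t)) x'(t), one sample per update."""
    w = np.zeros(Xp.shape[1])
    for _ in range(sweeps):
        for xp in Xp:                     # x'(t) runs through the samples in order
            w += eta * (b - w @ xp) * xp
    return w
```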
Recall,

$$E(\hat{w}) = \frac{1}{2} \left\| X \hat{w} - b \right\|^2, \qquad \nabla E(\hat{w}) = X^T X \hat{w} - X^T b = X^T (X \hat{w} - b).$$

The update rule,

$$\hat{w}(t + 1) = \hat{w}(t) - \eta\,\nabla E\big(\hat{w}(t)\big),$$

yields,

$$\hat{w}(t + 1) = \hat{w}(t) - \eta\,X^T \big(X \hat{w}(t) - b\big).$$
The algorithm is said to converge to a fixed point $\hat{w}^*$ if, for every finite initial value $\|\hat{w}(0)\| < \infty$,

$$\lim_{t \to \infty} \hat{w}(t) = \hat{w}^*.$$

At the fixed point the gradient vanishes, so

$$X^T X \hat{w}^* = X^T b.$$

Let $\Delta\hat{w}(t) \stackrel{\text{def}}{=} \hat{w}(t) - \hat{w}^*$. The update rule becomes,

$$\begin{aligned} \hat{w}(t + 1) &= \hat{w}(t) - \eta\,X^T \big(X \hat{w}(t) - b\big) \\ &= \hat{w}(t) - \eta\,X^T X \big(\hat{w}(t) - \hat{w}^*\big), \end{aligned}$$

whence,

$$\Delta\hat{w}(t + 1) = \Delta\hat{w}(t) - \eta\,X^T X\,\Delta\hat{w}(t) = \left(I - \eta\,X^T X\right) \Delta\hat{w}(t).$$
Convergence occurs if $\hat{w}(t) \to \hat{w}^*$, i.e., if $\Delta\hat{w}(t) = \hat{w}(t) - \hat{w}^* \to 0$. Thus we require that $\|\Delta\hat{w}(t + 1)\| < \|\Delta\hat{w}(t)\|$. Inspecting the update rule,

$$\Delta\hat{w}(t + 1) = \left(I - \eta\,X^T X\right) \Delta\hat{w}(t).$$

Since $X^T X$ is real and symmetric, it can be diagonalized as $X^T X = S^T \Lambda S$, where $S$ is orthogonal ($S^T S = I$) and $\Lambda = \mathrm{diag}(\lambda_0, \dots, \lambda_n)$ holds the eigenvalues of $X^T X$. Then,

$$\Delta\hat{w}(t + 1) = \left(I - \eta\,X^T X\right) S^T S\,\Delta\hat{w}(t) = S^T \left(I - \eta\,\Lambda\right) S\,\Delta\hat{w}(t),$$

so

$$S\,\Delta\hat{w}(t + 1) = \left(I - \eta\,\Lambda\right) S\,\Delta\hat{w}(t).$$

Since $S$ is orthogonal, $\|\Delta\hat{w}(t)\| = \|S\,\Delta\hat{w}(t)\|$, and it suffices that

$$\|S\,\Delta\hat{w}(t + 1)\| < \|S\,\Delta\hat{w}(t)\|,$$

which occurs if the eigenvalues of $I - \eta\,\Lambda$ all have absolute value less than one, i.e.,
$$\left|1 - \eta\,\lambda_i\right| < 1, \quad \text{or} \quad 0 < \eta < \frac{2}{\lambda_i},$$

for all $i$. Let $\lambda_{\max} = \max_{0 \le i \le n} \lambda_i$ denote the largest eigenvalue of $X^T X$; then convergence requires that

$$0 < \eta < \frac{2}{\lambda_{\max}}.$$
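In practice one computes $\lambda_{\max}$ directly and chooses $\eta$ below the bound; for the Duda-Hart-Stork example earlier in this section, $\lambda_{\max} \approx 30.9$, so any $\eta$ below roughly $0.065$ converges. A sketch:

```python
import numpy as np

X = np.array([[ 1.,  1.,  2.],
              [ 1.,  2.,  0.],
              [-1., -3., -1.],
              [-1., -2., -3.]])
b = np.ones(4)

lam_max = np.linalg.eigvalsh(X.T @ X).max()  # eigvalsh: X^T X is symmetric
eta = 1.0 / lam_max                          # safely inside (0, 2 / lam_max)

w = np.zeros(3)
for _ in range(2000):
    w -= eta * X.T @ (X @ w - b)
print(lam_max, w)   # lam_max ~ 30.9; w approaches (11/3, -4/3, -2/3)
```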