The LMS Objective Function. Global Solution. Pseudoinverse of a Matrix. Optimization (Learning) by Gradient Descent. LMS (Widrow-Hoff) Algorithms. Convergence of the Batch LMS Rule

CS 295 (UVM), Fall 2013
[Figure: a linear threshold unit (LTU) with inputs $x_1, \dots, x_n$, weights $w_1, \dots, w_n$, and bias weight $w_0$ feeding a sign activation.]

$$y = \mathrm{sgn}\left(w^T x + w_0\right) = \begin{cases} 1, & \text{if } w^T x + w_0 > 0, \\ -1, & \text{if } w^T x + w_0 \le 0. \end{cases}$$
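As a quick companion to the definition above (not from the original slides), here is a minimal LTU in Python/NumPy; the function name `ltu` and the sample values are illustrative, and sgn is taken to map nonpositive activations to $-1$, matching the case split above.

```python
import numpy as np

def ltu(w, w0, x):
    """Linear threshold unit: y = sgn(w^T x + w0), with sgn(a) = -1 for a <= 0."""
    return 1 if w @ x + w0 > 0 else -1

# Example: with w = (1, -1) and w0 = 0.5, the point x = (2, 1) lies on the positive side.
print(ltu(np.array([1.0, -1.0]), 0.5, np.array([2.0, 1.0])))  # prints 1
```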
We seek weights such that

$$\mathrm{sgn}\left(w^T x_i + w_0\right) = \ell_i, \quad \text{for } i = 1, 2, \dots, m.$$

Equivalently, using homogeneous coordinates (or augmented feature vectors),

$$\mathrm{sgn}\left(\hat{w}^T \hat{x}_i\right) = \ell_i, \quad \text{for } i = 1, 2, \dots, m, \text{ where } \hat{x}_i = \left(1, x_i^T\right)^T,$$

or,

$$\hat{w}^T \hat{x}'_i > 0, \quad \text{for } i = 1, 2, \dots, m, \text{ where } \hat{x}'_i = \ell_i \hat{x}_i.$$
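As a sketch (assuming a NumPy array `xs` of shape $(m, n)$ holding the $x_i$ as rows and a label vector `ls` with entries $\pm 1$; both names are mine), the normalized, augmented vectors $\hat{x}'_i = \ell_i \left(1, x_i^T\right)^T$ can be built in two lines:

```python
import numpy as np

def normalized_augmented(xs, ls):
    """Return a matrix whose i-th row is xhat'_i = l_i * (1, x_i^T)."""
    m = xs.shape[0]
    xhat = np.hstack([np.ones((m, 1)), xs])  # augment: prepend the constant component 1
    return ls[:, None] * xhat                # normalize: scale row i by its label l_i

xs = np.array([[1.0, 2.0], [3.0, 1.0]])
ls = np.array([1.0, -1.0])
print(normalized_augmented(xs, ls))          # [[ 1.  1.  2.]
                                             #  [-1. -3. -1.]]
```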
We seek a $\hat{w}$ for which

$$E(\hat{w}) = \sum_{i=1}^{m} \left(b_i - \hat{w}^T \hat{x}'_i\right)^2$$

equals zero. We call the above expression the LMS objective function, where LMS stands for least mean square. (N.B. we could normalize the above by dividing the right side by $m$.)
$$E(\hat{w}) = \sum_{i=1}^{m} \left(b_i - \hat{w}^T \hat{x}'_i\right)^2.$$

Other objective functions could be built from the residuals $b_i - \hat{w}^T \hat{x}'_i$, e.g.,

$$\sum_{i=1}^{m} \left(b_i - \mathrm{sgn}\left(\hat{w}^T \hat{x}'_i\right)\right), \qquad \sum_{i=1}^{m} \left(b_i - \mathrm{sgn}\left(\hat{w}^T \hat{x}'_i\right)\right)^2, \qquad \text{etc.}$$
The LMS Algorithm

If we define

$$X = \begin{pmatrix} \hat{x}'^T_1 \\ \hat{x}'^T_2 \\ \vdots \\ \hat{x}'^T_m \end{pmatrix} = \begin{pmatrix} \ell_1 & \hat{x}'_{1,1} & \hat{x}'_{1,2} & \cdots & \hat{x}'_{1,n} \\ \ell_2 & \hat{x}'_{2,1} & \hat{x}'_{2,2} & \cdots & \hat{x}'_{2,n} \\ \vdots & \vdots & \vdots & & \vdots \\ \ell_m & \hat{x}'_{m,1} & \hat{x}'_{m,2} & \cdots & \hat{x}'_{m,n} \end{pmatrix} \in \mathbb{R}^{m \times (n+1)},$$

then,

$$E(\hat{w}) = \sum_{i=1}^{m} \left(b_i - \hat{x}'^T_i \hat{w}\right)^2 = \left\| \begin{pmatrix} b_1 - \hat{x}'^T_1 \hat{w} \\ b_2 - \hat{x}'^T_2 \hat{w} \\ \vdots \\ b_m - \hat{x}'^T_m \hat{w} \end{pmatrix} \right\|^2 = \left\| b - X \hat{w} \right\|^2.$$
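The sum form and the matrix form of $E(\hat{w})$ are the same number, which is easy to confirm numerically; the random data below is purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))   # rows play the role of the xhat'_i
b = np.ones(5)                # target margins b_i
w = rng.normal(size=3)        # an arbitrary candidate weight vector

E_sum    = sum((b[i] - X[i] @ w) ** 2 for i in range(5))
E_matrix = np.linalg.norm(b - X @ w) ** 2
assert np.isclose(E_sum, E_matrix)   # identical up to rounding
```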
$$\begin{aligned} \left\| b - X \hat{w} \right\|^2 &= (b - X \hat{w})^T (b - X \hat{w}) \\ &= b^T b - \hat{w}^T X^T b - b^T X \hat{w} + \hat{w}^T X^T X \hat{w} \\ &= \hat{w}^T X^T X \hat{w} - 2\, b^T X \hat{w} + \|b\|^2, \end{aligned}$$

where

$$X^T X = \left(\hat{x}'_1, \dots, \hat{x}'_m\right) \begin{pmatrix} \hat{x}'^T_1 \\ \vdots \\ \hat{x}'^T_m \end{pmatrix} = \sum_{i=1}^{m} \hat{x}'_i \hat{x}'^T_i \in \mathbb{R}^{(n+1) \times (n+1)},$$

and

$$b^T X = (b_1, \dots, b_m) \begin{pmatrix} \hat{x}'^T_1 \\ \vdots \\ \hat{x}'^T_m \end{pmatrix} = \sum_{i=1}^{m} b_i \hat{x}'^T_i \in \mathbb{R}^{n+1}.$$
$$E(\hat{w}) = \hat{w}^T X^T X \hat{w} - 2\, b^T X \hat{w} + \|b\|^2.$$

Using calculus (e.g., Math 121), we can compute the gradient of $E(\hat{w})$ and algebraically determine a value $\hat{w}^*$ which makes each component vanish. That is, solve

$$\nabla E(\hat{w}) = \begin{pmatrix} \dfrac{\partial E}{\partial \hat{w}_0} \\ \dfrac{\partial E}{\partial \hat{w}_1} \\ \vdots \\ \dfrac{\partial E}{\partial \hat{w}_n} \end{pmatrix} = 0.$$

Differentiating the quadratic form gives

$$\nabla E(\hat{w}) = 2\, X^T X \hat{w} - 2\, X^T b.$$
$$\nabla E(\hat{w}) = 2\, X^T X \hat{w} - 2\, X^T b = 0$$

if

$$\hat{w}^* = \left(X^T X\right)^{-1} X^T b = X^\dagger b,$$

where

$$X^\dagger = \left(X^T X\right)^{-1} X^T \in \mathbb{R}^{(n+1) \times m}$$

is the pseudoinverse of $X$. Note that $X^\dagger X = \left(X^T X\right)^{-1} X^T X = I$.
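In NumPy, the global solution can be computed directly; `np.linalg.pinv` implements the pseudoinverse (and also covers the case where $X^T X$ is singular, which the closed form above assumes away), while `np.linalg.lstsq` solves the same least-squares problem more stably:

```python
import numpy as np

def lms_global_solution(X, b):
    """Closed-form minimizer of ||b - X w||^2: w* = pinv(X) @ b."""
    return np.linalg.pinv(X) @ b

# Numerically preferred alternative, same minimizer:
# w_star, *rest = np.linalg.lstsq(X, b, rcond=None)
```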
Example

The following example appears in R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, Second Edition, Wiley, NY, 2001, p. 241.

Given the dichotomy,

$$\mathcal{X}_4 = \left\{ \left((1,2)^T, 1\right), \left((2,0)^T, 1\right), \left((3,1)^T, -1\right), \left((2,3)^T, -1\right) \right\},$$

we obtain,

$$X = \begin{pmatrix} \hat{x}'^T_1 \\ \hat{x}'^T_2 \\ \hat{x}'^T_3 \\ \hat{x}'^T_4 \end{pmatrix} = \begin{pmatrix} 1 & 1 & 2 \\ 1 & 2 & 0 \\ -1 & -3 & -1 \\ -1 & -2 & -3 \end{pmatrix}.$$

Whence,

$$X^T X = \begin{pmatrix} 4 & 8 & 6 \\ 8 & 18 & 11 \\ 6 & 11 & 14 \end{pmatrix}, \quad \text{and} \quad X^\dagger = \left(X^T X\right)^{-1} X^T = \begin{pmatrix} \frac{5}{4} & \frac{13}{12} & \frac{3}{4} & \frac{7}{12} \\ -\frac{1}{2} & -\frac{1}{6} & -\frac{1}{2} & -\frac{1}{6} \\ 0 & -\frac{1}{3} & 0 & -\frac{1}{3} \end{pmatrix}.$$
Example (cont.)

Letting $b = (1, 1, 1, 1)^T$, then

$$\hat{w}^* = X^\dagger b = \begin{pmatrix} \frac{5}{4} & \frac{13}{12} & \frac{3}{4} & \frac{7}{12} \\ -\frac{1}{2} & -\frac{1}{6} & -\frac{1}{2} & -\frac{1}{6} \\ 0 & -\frac{1}{3} & 0 & -\frac{1}{3} \end{pmatrix} \begin{pmatrix} 1 \\ 1 \\ 1 \\ 1 \end{pmatrix} = \begin{pmatrix} \frac{11}{3} \\ -\frac{4}{3} \\ -\frac{2}{3} \end{pmatrix}.$$

Whence,

$$w_0 = \frac{11}{3} \quad \text{and} \quad w = \left(-\frac{4}{3}, -\frac{2}{3}\right)^T.$$

[Figure: the four training points and the resulting separating line in the $(x_1, x_2)$ plane.]
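The worked example is easy to check numerically; the matrix and targets below are exactly those of the slides:

```python
import numpy as np

X = np.array([[ 1.,  1.,  2.],
              [ 1.,  2.,  0.],
              [-1., -3., -1.],
              [-1., -2., -3.]])
b = np.ones(4)

w_star = np.linalg.pinv(X) @ b
print(w_star)       # [ 3.6667 -1.3333 -0.6667], i.e. (11/3, -4/3, -2/3)
print(X @ w_star)   # [1. 1. 1. 1.]: X w* = b exactly, so E(w*) = 0 and all margins are positive
```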
Given $f : \mathbb{R}^n \to \mathbb{R}$, a point $x$, and a displacement $\Delta x$, define $F(s) = f(x + s\,\Delta x)$, so that $F(0) = f(x)$, and,

$$F(1) = f(x + \Delta x).$$
By Taylor's theorem,

$$F(s) = F(0) + \frac{1}{1!} F'(0)\,s + \frac{1}{2!} F''(0)\,s^2 + \frac{1}{3!} F'''(0)\,s^3 + \cdots.$$

Our plan is to set $s = 1$ and replace $F(1)$ by $f(x + \Delta x)$, $F(0)$ by $f(x)$, etc.

To evaluate $F'(0)$ we will invoke the multivariate chain rule, e.g.,

$$\frac{d}{ds} f\big(u(s), v(s)\big) = \frac{\partial f}{\partial u}(u, v)\,u'(s) + \frac{\partial f}{\partial v}(u, v)\,v'(s).$$

Thus,

$$\begin{aligned} F'(s) = \frac{dF}{ds}(s) &= \frac{d}{ds} f\left(x_1 + s\,\Delta x_1, \dots, x_n + s\,\Delta x_n\right) \\ &= \frac{\partial f}{\partial x_1}(x + s\,\Delta x)\,\frac{d}{ds}\left(x_1 + s\,\Delta x_1\right) + \cdots + \frac{\partial f}{\partial x_n}(x + s\,\Delta x)\,\frac{d}{ds}\left(x_n + s\,\Delta x_n\right) \\ &= \frac{\partial f}{\partial x_1}(x + s\,\Delta x)\,\Delta x_1 + \cdots + \frac{\partial f}{\partial x_n}(x + s\,\Delta x)\,\Delta x_n. \end{aligned}$$
Thus,

$$F'(0) = \frac{\partial f}{\partial x_1}(x)\,\Delta x_1 + \cdots + \frac{\partial f}{\partial x_n}(x)\,\Delta x_n = \nabla f(x)^T \Delta x.$$
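The identity $F'(0) = \nabla f(x)^T \Delta x$ can be sanity-checked with a one-sided finite difference; the test function $f$ below is my own arbitrary choice:

```python
import numpy as np

f     = lambda x: x[0]**2 + 3.0 * x[0] * x[1]            # a smooth test function
gradf = lambda x: np.array([2*x[0] + 3*x[1], 3*x[0]])    # its gradient, by hand

x  = np.array([1.0, 2.0])
dx = np.array([0.3, -0.1])
F  = lambda s: f(x + s * dx)

h  = 1e-6
fd = (F(h) - F(0)) / h          # finite-difference estimate of F'(0)
print(fd, gradf(x) @ dx)        # both approximately 2.1
```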
Setting $s = 1$ and discarding terms beyond first order,

$$f(x + \Delta x) = f(x) + \nabla f(x)^T \Delta x + O\left(\|\Delta x\|^2\right) = f(x) + \|\nabla f(x)\|\,\|\Delta x\| \cos\theta + O\left(\|\Delta x\|^2\right),$$

where $\theta$ defines the angle between $\nabla f(x)$ and $\Delta x$. If $\|\Delta x\| \ll 1$, then

$$\Delta f = f(x + \Delta x) - f(x) \approx \|\nabla f(x)\|\,\|\Delta x\| \cos\theta,$$

which is most negative when $\cos\theta = -1$, i.e., when $\Delta x$ points in the direction of $-\nabla f(x)$. This motivates the gradient descent update rule,

$$\hat{w}(t + 1) = \hat{w}(t) - \eta\,\nabla E\big(\hat{w}(t)\big),$$

where $\eta > 0$ is the learning rate (step size).
Given the dichotomy

$$\mathcal{X}_m = \{(x_1, \ell_1), \dots, (x_m, \ell_m)\}$$

of $m$ feature vectors $x_i \in \mathbb{R}^n$ with $\ell_i \in \{-1, 1\}$ for $i = 1, \dots, m$, we construct the set of normalized, augmented feature vectors $\hat{x}'_i = \ell_i \left(1, x_i^T\right)^T$, and minimize

$$E(\hat{w}) = \frac{1}{2} \sum_{i=1}^{m} \left(\hat{w}^T \hat{x}'_i - b_i\right)^2 = \frac{1}{2} \left\| X \hat{w} - b \right\|^2$$

(the factor of $\frac{1}{2}$ simplifies the gradient), whence

$$\nabla E(\hat{w}) = \sum_{i=1}^{m} \left(\hat{w}^T \hat{x}'_i - b_i\right) \hat{x}'_i = X^T (X \hat{w} - b).$$
The batch LMS rule is gradient descent on $E$,

$$\hat{w}(t + 1) = \hat{w}(t) - \eta\,\nabla E\big(\hat{w}(t)\big),$$

i.e.,

$$\hat{w}(t + 1) = \hat{w}(t) + \eta \sum_{i=1}^{m} \left(b_i - \hat{w}(t)^T \hat{x}'_i\right) \hat{x}'_i.$$

Alternatively, one can abstract the sequential LMS, or Widrow-Hoff, rule from the above:

$$\hat{w}(t + 1) = \hat{w}(t) + \eta \left(b - \hat{w}(t)^T \hat{x}'(t)\right) \hat{x}'(t).$$
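A minimal batch implementation of the rule above; the step size `eta` and iteration count are illustrative defaults, not values from the slides (the condition `eta` must satisfy for convergence is derived at the end of this section):

```python
import numpy as np

def batch_lms(X, b, eta=0.01, iters=1000):
    """Gradient descent on E(w) = 0.5 * ||X w - b||^2 over the full batch."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        w -= eta * X.T @ (X @ w - b)   # w(t+1) = w(t) - eta * grad E(w(t))
    return w
```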
Two sequential update rules:

$$\hat{w}(t + 1) = \hat{w}(t) + \eta \left[b - \hat{w}(t)^T \hat{x}'(t)\right] \hat{x}'(t),$$

and, replacing the linear response by its sign,

$$\hat{w}(t + 1) = \hat{w}(t) + \frac{\eta}{2} \left[b - \mathrm{sgn}\big(\hat{w}(t)^T \hat{x}'(t)\big)\right] \hat{x}'(t).$$

Sequential rules are well suited to real-time implementations, as only the current values of the weights, i.e., the configuration of the LTU itself, need to be stored. They also work with dichotomies of infinite sets.
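A sketch of the sequential rule, cycling through the rows of the matrix of normalized, augmented vectors; the constant margin `b`, the fixed `eta`, and the sweep count are all illustrative choices:

```python
import numpy as np

def widrow_hoff(Xp, b=1.0, eta=0.05, sweeps=100):
    """Sequential LMS: w += eta * (b - w^T x'(t)) x'(t), one sample per update."""
    w = np.zeros(Xp.shape[1])
    for _ in range(sweeps):
        for xp in Xp:                     # x'(t) runs through the samples in order
            w += eta * (b - w @ xp) * xp
    return w
```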
Recall,

$$E(\hat{w}) = \frac{1}{2} \left\| X \hat{w} - b \right\|^2, \qquad \nabla E(\hat{w}) = X^T X \hat{w} - X^T b = X^T (X \hat{w} - b).$$

The update rule,

$$\hat{w}(t + 1) = \hat{w}(t) - \eta\,\nabla E\big(\hat{w}(t)\big),$$

yields,

$$\hat{w}(t + 1) = \hat{w}(t) - \eta\,X^T \big(X \hat{w}(t) - b\big).$$
The algorithm is said to converge to a fixed point $\hat{w}^*$ if, for every finite initial value $\|\hat{w}(0)\| < \infty$,

$$\lim_{t \to \infty} \hat{w}(t) = \hat{w}^*.$$

At the fixed point the gradient vanishes, so

$$X^T X \hat{w}^* = X^T b.$$

Let $\Delta\hat{w}(t) \stackrel{\text{def}}{=} \hat{w}(t) - \hat{w}^*$. The update rule becomes,

$$\begin{aligned} \hat{w}(t + 1) &= \hat{w}(t) - \eta\,X^T \big(X \hat{w}(t) - b\big) \\ &= \hat{w}(t) - \eta\,X^T X \big(\hat{w}(t) - \hat{w}^*\big), \end{aligned}$$

whence,

$$\Delta\hat{w}(t + 1) = \Delta\hat{w}(t) - \eta\,X^T X\,\Delta\hat{w}(t) = \left(I - \eta\,X^T X\right) \Delta\hat{w}(t).$$
Convergence occurs if $\hat{w}(t) \to \hat{w}^*$, i.e., if $\Delta\hat{w}(t) = \hat{w}(t) - \hat{w}^* \to 0$. Thus we require that $\|\Delta\hat{w}(t + 1)\| < \|\Delta\hat{w}(t)\|$. Inspecting the update rule,

$$\Delta\hat{w}(t + 1) = \left(I - \eta\,X^T X\right) \Delta\hat{w}(t).$$

Since $X^T X$ is real and symmetric, it can be diagonalized as $X^T X = S^T \Lambda S$, where $S$ is orthogonal ($S^T S = I$) and $\Lambda = \mathrm{diag}(\lambda_0, \dots, \lambda_n)$ holds the eigenvalues of $X^T X$. Then,

$$\Delta\hat{w}(t + 1) = \left(I - \eta\,X^T X\right) S^T S\,\Delta\hat{w}(t) = S^T \left(I - \eta\,\Lambda\right) S\,\Delta\hat{w}(t),$$

so

$$S\,\Delta\hat{w}(t + 1) = \left(I - \eta\,\Lambda\right) S\,\Delta\hat{w}(t).$$

Since $S$ is orthogonal, $\|\Delta\hat{w}(t)\| = \|S\,\Delta\hat{w}(t)\|$, and it suffices that

$$\|S\,\Delta\hat{w}(t + 1)\| < \|S\,\Delta\hat{w}(t)\|,$$

which occurs if the eigenvalues of $I - \eta\,\Lambda$ all have absolute value less than one, i.e.,
$$\left|1 - \eta\,\lambda_i\right| < 1, \quad \text{or} \quad 0 < \eta < \frac{2}{\lambda_i},$$

for all $i$. Let $\lambda_{\max} = \max_{0 \le i \le n} \lambda_i$ denote the largest eigenvalue of $X^T X$; then convergence requires that

$$0 < \eta < \frac{2}{\lambda_{\max}}.$$
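In practice one computes $\lambda_{\max}$ directly and chooses $\eta$ below the bound; for the Duda-Hart-Stork example earlier in this section, $\lambda_{\max} \approx 30.9$, so any $\eta$ below roughly $0.065$ converges. A sketch:

```python
import numpy as np

X = np.array([[ 1.,  1.,  2.],
              [ 1.,  2.,  0.],
              [-1., -3., -1.],
              [-1., -2., -3.]])
b = np.ones(4)

lam_max = np.linalg.eigvalsh(X.T @ X).max()  # eigvalsh: X^T X is symmetric
eta = 1.0 / lam_max                          # safely inside (0, 2 / lam_max)

w = np.zeros(3)
for _ in range(2000):
    w -= eta * X.T @ (X @ w - b)
print(lam_max, w)   # lam_max ~ 30.9; w approaches (11/3, -4/3, -2/3)
```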