Andrew Rosenberg - Lecture 14: Neural Networks
Lecture 14: Neural Networks
Machine Learning
March 18, 2010
Last Time
Perceptrons
Perceptron Loss vs. Logistic Regression Loss
Training Perceptrons and Logistic Regression Models using Gradient Descent
Today
Multilayer Neural Networks
Feed Forward
Error Back-Propagation
Types of Neurons
Multiply inputs by weights along edges.
Apply some function to the set of inputs at each node.
Linear Neuron
Logistic Neuron
Perceptron
Potentially more.
Require a convex loss function for gradient descent training.

[Figure: each neuron combines inputs x_1, ..., x_D with weights θ_0, ..., θ_D to produce an output f(x, θ)]
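As a concrete illustration of these neuron types, here is a minimal Python sketch (assuming the standard definitions from last lecture: a linear neuron outputs the weighted sum, a logistic neuron squashes that sum through a sigmoid, and a perceptron thresholds it; the function names and example values are illustrative):

```python
import numpy as np

def linear_neuron(x, theta):
    # Linear neuron: output is the weighted sum of the inputs.
    return theta @ x

def logistic_neuron(x, theta):
    # Logistic neuron: squash the weighted sum through a sigmoid into (0, 1).
    return 1.0 / (1.0 + np.exp(-(theta @ x)))

def perceptron_neuron(x, theta):
    # Perceptron: threshold the weighted sum to a hard 0/1 decision.
    return 1.0 if theta @ x > 0 else 0.0

# x is augmented with a leading 1 so that theta[0] acts as the bias weight.
x = np.array([1.0, 0.5, -1.2])
theta = np.array([0.1, 0.8, -0.3])
print(linear_neuron(x, theta), logistic_neuron(x, theta), perceptron_neuron(x, theta))
```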
Multilayer Networks
Cascade Neurons together.
The output from one layer is the input to the next.
Each layer has its own sets of weights.

f(x, \theta) = \sum_{i=0}^{D} \theta_{1,i} \left[ \theta_{0,i}^{T} x \right]

[Figure: two-layer network from inputs x_0, ..., x_P to output f(x, θ)]
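A small numeric sketch of this cascading (the weight matrices W0 and W1 are hypothetical). Note that with purely linear nodes the cascade collapses into a single linear map, which motivates the non-linearities on the next slide:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)        # input vector (x_1 ... x_D)
W0 = rng.normal(size=(4, 3))  # layer-0 weights: one row per first-layer node
W1 = rng.normal(size=(1, 4))  # layer-1 weights: the output node

# Cascade: the output of the first layer is the input to the second.
hidden = W0 @ x
output = W1 @ hidden

# With purely linear nodes, the two layers collapse into one linear map.
print(np.allclose(output, (W1 @ W0) @ x))  # True
```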
Neural Networks
We want to introduce non-linearities to the network.
Non-linearities allow a network to identify complex regions in space.

[Figure: network with weights θ_{0,0}, ..., θ_{0,D} and θ_{1,0}, ..., θ_{1,D} producing f(x, θ)]
Linear Separability
1-layer networks cannot handle XOR.
More layers can handle more complicated spaces, but require more parameters.
Each node splits the feature space with a hyperplane.
If the second layer is AND, a 2-layer network can represent any convex hull.
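For example, XOR becomes representable once a hidden layer is added. A minimal sketch with hand-picked (hypothetical) weights, where the two hidden units compute OR and NAND and the output unit ANDs them:

```python
import numpy as np

def step(a):
    # Hard-threshold (perceptron-style) activation.
    return (a > 0).astype(float)

# Inputs are augmented with a leading 1 for the bias weight.
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)

# Hidden layer: first unit computes OR(x1, x2), second computes NAND(x1, x2).
W_hidden = np.array([[-0.5, 1.0, 1.0],
                     [ 1.5, -1.0, -1.0]])
# Output layer: AND of the two hidden units.
w_out = np.array([-1.5, 1.0, 1.0])

for x in X:
    h = step(W_hidden @ x)            # hidden activations
    y = step(w_out @ np.r_[1.0, h])   # output = AND(h1, h2)
    print(x[1:], "->", y)             # reproduces XOR: 0, 1, 1, 0
```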
Feed-Forward Networks
Predictions are fed forward through the network to classify.

[Figure: activations propagate layer by layer from inputs x_0, ..., x_P through successive layers to the output nodes]
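A sketch of this feed-forward computation for a fully connected network with logistic units (the layer sizes, weights, and function names here are hypothetical; each layer's output becomes the next layer's input):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def feed_forward(x, weights):
    """Propagate an input through the network, layer by layer.

    weights is a list of (W, b) pairs, one per layer.
    """
    z = x
    for W, b in weights:
        a = W @ z + b      # weighted sum of the previous layer's outputs
        z = sigmoid(a)     # non-linear activation
    return z

rng = np.random.default_rng(1)
layer_sizes = [3, 4, 4, 1]  # input, two hidden layers, output
weights = [(rng.normal(size=(m, n)), np.zeros(m))
           for n, m in zip(layer_sizes[:-1], layer_sizes[1:])]
print(feed_forward(rng.normal(size=3), weights))
```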
Error Backpropagation
We will do gradient descent on the whole network.
Training will proceed from the last layer to the first.

[Figure: network from inputs x_0, ..., x_P to output f(x, θ)]
Error Backpropagation
Introduce variables over the neural network:
\theta = \{ w_{ij}, w_{jk}, w_{kl} \}

[Figure: w_ij, w_jk, w_kl label the weights on the edges between successive layers; the network output is f(x, θ)]
Error Backpropagation
a_j = \sum_i w_{ij} z_i, \quad z_j = g(a_j)
a_k = \sum_j w_{jk} z_j, \quad z_k = g(a_k)
a_l = \sum_k w_{kl} z_k, \quad z_l = g(a_l) = f(x, \theta)

Training: take the gradient of the last component and iterate backwards.
Error Backpropagation
Empirical Risk Function:
R(\theta) = \frac{1}{N} \sum_n L_n = \frac{1}{N} \sum_n \frac{1}{2} \left( y_n - f(x_n) \right)^2
Error Backpropagation
Optimize last layer weights w_kl:
L_n = \frac{1}{2} \left( y_n - f(x_n) \right)^2

Calculus chain rule:
\frac{\partial R}{\partial w_{kl}} = \frac{1}{N} \sum_n \frac{\partial L_n}{\partial a_{l,n}} \frac{\partial a_{l,n}}{\partial w_{kl}}
= \frac{1}{N} \sum_n \frac{\partial}{\partial a_{l,n}} \left[ \frac{1}{2} \left( y_n - g(a_{l,n}) \right)^2 \right] z_{k,n}
= \frac{1}{N} \sum_n \left[ -(y_n - z_{l,n}) \, g'(a_{l,n}) \right] z_{k,n}
= \frac{1}{N} \sum_n \delta_{l,n} \, z_{k,n}
Error Backpropagation
Optimize last hidden weights w_jk.
From the last layer: \frac{\partial R}{\partial w_{kl}} = \frac{1}{N} \sum_n \delta_{l,n} \, z_{k,n}

Multivariate chain rule:
\frac{\partial R}{\partial w_{jk}} = \frac{1}{N} \sum_n \sum_l \frac{\partial L_n}{\partial a_{l,n}} \frac{\partial a_{l,n}}{\partial a_{k,n}} \frac{\partial a_{k,n}}{\partial w_{jk}}
= \frac{1}{N} \sum_n \left[ \sum_l \delta_{l,n} \frac{\partial a_{l,n}}{\partial a_{k,n}} \right] z_{j,n}
= \frac{1}{N} \sum_n \left[ \sum_l \delta_{l,n} \, w_{kl} \, g'(a_{k,n}) \right] z_{j,n} \quad \text{(since } a_l = \sum_k w_{kl} \, g(a_k) \text{)}
= \frac{1}{N} \sum_n \delta_{k,n} \, z_{j,n}
Error Backpropagation
Repeat for all previous layers:
\frac{\partial R}{\partial w_{kl}} = \frac{1}{N} \sum_n \frac{\partial L_n}{\partial a_{l,n}} \frac{\partial a_{l,n}}{\partial w_{kl}} = \frac{1}{N} \sum_n \left[ -(y_n - z_{l,n}) \, g'(a_{l,n}) \right] z_{k,n} = \frac{1}{N} \sum_n \delta_{l,n} \, z_{k,n}
\frac{\partial R}{\partial w_{jk}} = \frac{1}{N} \sum_n \frac{\partial L_n}{\partial a_{k,n}} \frac{\partial a_{k,n}}{\partial w_{jk}} = \frac{1}{N} \sum_n \left[ \sum_l \delta_{l,n} \, w_{kl} \, g'(a_{k,n}) \right] z_{j,n} = \frac{1}{N} \sum_n \delta_{k,n} \, z_{j,n}
\frac{\partial R}{\partial w_{ij}} = \frac{1}{N} \sum_n \frac{\partial L_n}{\partial a_{j,n}} \frac{\partial a_{j,n}}{\partial w_{ij}} = \frac{1}{N} \sum_n \left[ \sum_k \delta_{k,n} \, w_{jk} \, g'(a_{j,n}) \right] z_{i,n} = \frac{1}{N} \sum_n \delta_{j,n} \, z_{i,n}
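A sketch of these backprop equations for a single (x, y) pair, using logistic activations for g and the squared-error loss above (the layer sizes, weights, and function names are illustrative, and biases are omitted for brevity):

```python
import numpy as np

def g(a):
    return 1.0 / (1.0 + np.exp(-a))

def g_prime(a):
    s = g(a)
    return s * (1.0 - s)

def backprop(x, y, W):
    """Gradients of L = 1/2 (y - f(x))^2 for a network with logistic units.

    W is a list of weight matrices, ordered from the input layer onward.
    """
    # Forward pass: store pre-activations a and activations z for each layer.
    zs, activations = [x], []
    for Wm in W:
        a = Wm @ zs[-1]
        activations.append(a)
        zs.append(g(a))

    # Output-layer delta: dL/da_l = -(y - z_l) * g'(a_l).
    delta = -(y - zs[-1]) * g_prime(activations[-1])
    grads = [None] * len(W)
    grads[-1] = np.outer(delta, zs[-2])

    # Propagate deltas backwards: delta_k = g'(a_k) * sum_l w_kl * delta_l.
    for m in range(len(W) - 2, -1, -1):
        delta = (W[m + 1].T @ delta) * g_prime(activations[m])
        grads[m] = np.outer(delta, zs[m])
    return grads

rng = np.random.default_rng(2)
sizes = [3, 4, 4, 1]
W = [rng.normal(size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
grads = backprop(rng.normal(size=3), np.array([1.0]), W)
print([grad.shape for grad in grads])  # gradient shapes match the weight shapes
```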
Error Backpropagation
Now that we have well-defined gradients for each parameter, update using Gradient Descent (with learning rate η):
w_{ij}^{t+1} = w_{ij}^{t} - \eta \frac{\partial R}{\partial w_{ij}}, \quad
w_{jk}^{t+1} = w_{jk}^{t} - \eta \frac{\partial R}{\partial w_{jk}}, \quad
w_{kl}^{t+1} = w_{kl}^{t} - \eta \frac{\partial R}{\partial w_{kl}}
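Putting the pieces together, a hypothetical gradient descent step (reusing the backprop function and weight list W from the sketch above; eta is an assumed learning rate):

```python
import numpy as np

def gradient_descent_step(W, batch, eta=0.1):
    # One update: average the per-example gradients of the empirical risk R,
    # then step every weight matrix against its gradient.
    grads = [np.zeros_like(Wm) for Wm in W]
    for x, y in batch:
        for gsum, gn in zip(grads, backprop(x, y, W)):
            gsum += gn
    return [Wm - eta * gm / len(batch) for Wm, gm in zip(W, grads)]
```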
Error Back-Propagation
Error backprop unravels the multivariate chain rule and solves the gradient for each partial component separately.
The target values for each layer come from the next layer. This feeds the errors back along the network.

[Figure: errors are passed backwards from the output layer toward the inputs]
We can do the same here.
Error Backprop then becomes Maximum A Posteriori (MAP) rather than Maximum Likelihood (ML) training:
R(\theta) = \frac{1}{N} \sum_{n=0}^{N} L\left( y_n, f(x_n) \right) + \lambda \|\theta\|^2
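In the gradient, the regularizer simply adds a term proportional to each weight (weight decay). A sketch under that assumption, reusing the backprop function from the earlier sketch (the function name and the value of lam are illustrative):

```python
import numpy as np

def map_gradient_descent_step(W, batch, eta=0.1, lam=1e-3):
    # MAP training: the penalty lambda * ||theta||^2 contributes an extra
    # 2 * lambda * w to each weight's gradient.
    grads = [np.zeros_like(Wm) for Wm in W]
    for x, y in batch:
        for gsum, gn in zip(grads, backprop(x, y, W)):
            gsum += gn
    return [Wm - eta * (gm / len(batch) + 2.0 * lam * Wm)
            for Wm, gm in zip(W, grads)]
```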
Handwriting Recognition
Demo: http://yann.lecun.com/exdb/lenet/index.html
Convolutional Network
The network is not fully connected.
Different nodes are responsible for different regions of the image.
This allows for robustness to transformations.
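As an illustration of this local connectivity, a minimal sketch of a convolutional layer where each output node sees only a small patch of the input and the patch weights are shared (the 3x3 filter and 8x8 image size are hypothetical):

```python
import numpy as np

def conv2d_valid(image, kernel):
    # Each output node is connected only to a small patch of the image,
    # and all output nodes share the same kernel weights.
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

rng = np.random.default_rng(3)
image = rng.normal(size=(8, 8))   # a small hypothetical image
kernel = rng.normal(size=(3, 3))  # shared local weights
print(conv2d_valid(image, kernel).shape)  # (6, 6)
```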
Multiple Outputs
Used for N-way classification.
Each node in the output layer corresponds to a different class.
No guarantee that the sum of the output vector will equal 1.

[Figure: network with one output node per class]
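A sketch of an output layer with one logistic node per class (the class count, hidden size, and weights are hypothetical); as noted above, the outputs need not sum to 1:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(4)
hidden = rng.normal(size=5)        # activations from the last hidden layer
W_out = rng.normal(size=(3, 5))    # one row of weights per class

scores = sigmoid(W_out @ hidden)   # one output per class
print(scores, scores.sum())        # the sum is generally not 1
predicted_class = int(np.argmax(scores))
```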
[Figures: networks with an input layer, a context layer, and a hidden layer]
Maximum Margin
The perceptron can lead to many equally valid choices for the decision boundary.
How can we pick which is best?
Maximize the size of the margin.

[Figure: a small-margin boundary vs. a large-margin (max margin) boundary]
Next Time
Maximum Margin Classifiers
Support Vector Machines
Kernel Methods