
Andrew Rosenberg - Lecture 14: Neural Networks


Lecture 14: Neural Networks
Machine Learning, March 18, 2010

Last Time
Perceptrons
Perceptron Loss vs. Logistic Regression Loss
Training Perceptrons and Logistic Regression Models using Gradient Descent

Today
Multilayer Neural Networks
Feed Forward
Error Back-Propagation

Recall: The Neuron Metaphor


Neurons accept information from multiple inputs and transmit information to other neurons.

Multiply inputs by weights along edges.
Apply some function to the set of inputs at each node.

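Types of Neurons

Linear Neuron
Logistic Neuron
Perceptron

[Figure: three neuron diagrams (Linear Neuron, Logistic Neuron, Perceptron), each combining its inputs through weights $\theta_1, \theta_2, \ldots, \theta_D$ to produce $f(\vec{x}, \vec{\theta})$.]

Potentially more. Require a convex loss function for gradient descent training.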
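The three neuron types differ only in the function applied to the weighted sum of the inputs. The following is a minimal sketch, not taken from the slides: the function names, the example weights, and the perceptron's +1/-1 output convention are assumptions made for illustration.

```python
import math

def weighted_sum(theta, x):
    """Multiply inputs by weights along edges and add them up."""
    return sum(t * xi for t, xi in zip(theta, x))

def linear_neuron(theta, x):
    # Identity function applied to the weighted sum.
    return weighted_sum(theta, x)

def logistic_neuron(theta, x):
    # Logistic (sigmoid) function applied to the weighted sum.
    return 1.0 / (1.0 + math.exp(-weighted_sum(theta, x)))

def perceptron(theta, x):
    # Assumed convention: +1 on or above the hyperplane, -1 below it.
    return 1 if weighted_sum(theta, x) >= 0 else -1

theta = [-1.0, 2.0, 0.5]  # hypothetical weights; theta[0] acts as a bias
x = [1.0, 1.0, 2.0]       # x[0] is fixed at 1 so that theta[0] is a bias

print(linear_neuron(theta, x))    # weighted sum: -1 + 2 + 1
print(logistic_neuron(theta, x))  # squashed into (0, 1)
print(perceptron(theta, x))       # hard threshold
```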

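Multilayer Networks
Cascade Neurons together.
The output from one layer is the input to the next.
Each Layer has its own sets of weights.

[Figure: a multilayer network with inputs $x_0, x_1, x_2, \ldots, x_P$, weights $\theta_{0,\cdot}$, $\theta_{1,\cdot}$, $\theta_{2,\cdot}$ on successive layers, and output $f(x, \theta)$.]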

Linear Regression Neural Networks


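What happens when we arrange linear neurons in a multilayer network?

[Figure: a two-layer network of linear neurons with inputs $x_0, x_1, x_2$, weights $\theta_{0,0}, \ldots, \theta_{0,D}$ and $\theta_{1,0}, \ldots, \theta_{1,D}$, and output $f(x, \theta)$.]

Nothing special happens.
The product of two linear transformations is itself a linear transformation.

$$f(x, \theta) = \sum_{i=0}^{D} \theta_{1,i} \sum_{n=0}^{N-1} \theta_{0,i,n} x_n$$

$$f(x, \theta) = \sum_{i=0}^{D} \theta_{1,i} \left[\theta_{0,i}^T x\right]$$

$$f(x, \theta) = \sum_{i=0}^{D} \left[\hat{\theta}_i^T x\right]$$

where $\hat{\theta}_i = \theta_{1,i}\,\theta_{0,i}$.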
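The collapse of two linear layers into one can be checked numerically. This is a small sketch under assumed, made-up weights (the names `theta0`, `theta1`, and `collapse` are illustrative, not from the lecture): it evaluates the two-layer linear network and the single merged weight vector on the same input.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def two_layer_linear(theta0, theta1, x):
    """f(x) = sum_i theta1[i] * (theta0[i] . x), with linear hidden nodes."""
    hidden = [dot(row, x) for row in theta0]
    return dot(theta1, hidden)

def collapse(theta0, theta1):
    """Single weight vector equivalent to the two stacked linear layers."""
    n_inputs = len(theta0[0])
    return [sum(theta1[i] * theta0[i][n] for i in range(len(theta0)))
            for n in range(n_inputs)]

theta0 = [[1.0, 2.0, 0.0], [0.5, -1.0, 3.0]]  # hypothetical first-layer weights
theta1 = [2.0, -1.0]                          # hypothetical second-layer weights
x = [1.0, 4.0, -2.0]

merged = collapse(theta0, theta1)
print(two_layer_linear(theta0, theta1, x))  # two-layer network output
print(dot(merged, x))                       # same value from one linear neuron
```

Both calls give the same number, which is why stacking linear neurons adds parameters without adding expressive power.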

Neural Networks
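We want to introduce non-linearities to the network.
Non-linearities allow a network to identify complex regions in space.

[Figure: a two-layer network with inputs $x_0, x_1, x_2$, weights $\theta_{0,0}, \ldots, \theta_{0,D}$ and $\theta_{1,0}, \ldots, \theta_{1,D}$, and output $f(x, \theta)$.]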

Linear Separability
A 1-layer network cannot handle XOR.
More layers can handle more complicated spaces, but require more parameters.
Each node splits the feature space with a hyperplane.
If the second layer is AND, a 2-layer network can represent any convex hull.
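As an illustration of the points above, XOR can be built from two hyperplanes combined by an AND node. The sketch below uses hand-picked weights and a 0/1 step function; these choices are assumptions for the example and are not taken from the slides.

```python
def step(a):
    # Threshold unit with assumed 0/1 outputs.
    return 1 if a >= 0 else 0

def unit(weights, bias, inputs):
    """One node: a hyperplane test on its inputs."""
    return step(bias + sum(w * v for w, v in zip(weights, inputs)))

def xor_network(x1, x2):
    # First layer: two hyperplanes splitting the input space.
    h_or = unit([1.0, 1.0], -0.5, [x1, x2])     # off only when both inputs are 0
    h_nand = unit([-1.0, -1.0], 1.5, [x1, x2])  # off only when both inputs are 1
    # Second layer: AND of the two half-spaces.
    return unit([1.0, 1.0], -1.5, [h_or, h_nand])

for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor_network(a, b))
```

The region where both first-layer nodes fire is the intersection of two half-spaces, which is the convex region the AND node selects.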

Feed-Forward Networks
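Predictions are fed forward through the network to classify.

[Figure: a network with inputs $x_0, x_1, x_2, \ldots, x_P$ and weights $\theta_{0,\cdot}$, $\theta_{1,\cdot}$, $\theta_{2,\cdot}$ on successive layers.]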
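Feeding a prediction forward amounts to repeating "weighted sum, then node function" once per layer. A minimal sketch follows, assuming logistic nodes and a list-of-lists weight layout (`layers[m][j]` is the weight vector of node `j` in layer `m`); both are illustrative choices, not the lecture's notation.

```python
import math

def logistic(a):
    return 1.0 / (1.0 + math.exp(-a))

def feed_forward(x, layers, g=logistic):
    """Propagate x through each layer in turn and return the final outputs."""
    z = list(x)
    for layer in layers:
        # Each node takes the previous layer's outputs as its inputs.
        z = [g(sum(w * v for w, v in zip(node, z))) for node in layer]
    return z

layers = [
    [[0.5, -0.5], [1.0, 1.0]],  # hypothetical hidden layer: 2 nodes, 2 inputs each
    [[1.0, -1.0]],              # hypothetical output layer: 1 node
]
out = feed_forward([1.0, 1.0], layers)
print(out)
```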


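Error Backpropagation
We will do gradient descent on the whole network.
Training will proceed from the last layer to the first.

[Figure: a network with inputs $x_0, x_1, x_2, \ldots, x_P$, weights $\theta_{0,\cdot}$, $\theta_{1,\cdot}$, $\theta_{2,\cdot}$ on successive layers, and output $f(x, \theta)$.]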

Error Backpropagation
Introduce variables over the neural network: $\theta = \{w_{ij}, w_{jk}, w_{kl}\}$
Distinguish the input and output of each node.

[Figure: inputs $x_0, x_1, x_2, \ldots, x_P$ (outputs $z_i$) connect through weights $w_{ij}$ to nodes with input $a_j$ and output $z_j$, then through $w_{jk}$ to nodes with input $a_k$ and output $z_k$, then through $w_{kl}$ to the node with input $a_l$ and output $z_l$, giving $f(x, \theta)$.]

Error Backpropagation
$\theta = \{w_{ij}, w_{jk}, w_{kl}\}$

$$a_j = \sum_i w_{ij} z_i \qquad z_j = g(a_j)$$

$$a_k = \sum_j w_{jk} z_j \qquad z_k = g(a_k)$$

$$a_l = \sum_k w_{kl} z_k \qquad z_l = g(a_l)$$

Training: Take the gradient of the last component and iterate backwards.

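Error Backpropagation
Empirical Risk Function

$$\begin{aligned}
R(\theta) &= \frac{1}{N}\sum_{n=0}^{N-1} L(y_n - f(x_n)) \\
&= \frac{1}{N}\sum_{n=0}^{N-1} \frac{1}{2}\left(y_n - f(x_n)\right)^2 \\
&= \frac{1}{N}\sum_{n=0}^{N-1} \frac{1}{2}\left(y_n - g\left(\sum_k w_{kl}\, g\left(\sum_j w_{jk}\, g\left(\sum_i w_{ij} x_{n,i}\right)\right)\right)\right)^2
\end{aligned}$$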

Error Backpropagation
Optimize last layer weights $w_{kl}$

$$L_n = \frac{1}{2}\left(y_n - f(x_n)\right)^2$$

Calculus chain rule:

$$\begin{aligned}
\frac{\partial R}{\partial w_{kl}} &= \frac{1}{N}\sum_n \left[\frac{\partial L_n}{\partial a_{l,n}}\right]\left[\frac{\partial a_{l,n}}{\partial w_{kl}}\right] \\
&= \frac{1}{N}\sum_n \left[\frac{\partial\, \frac{1}{2}\left(y_n - g(a_{l,n})\right)^2}{\partial a_{l,n}}\right]\left[\frac{\partial\, z_{k,n} w_{kl}}{\partial w_{kl}}\right] \\
&= \frac{1}{N}\sum_n \left[-(y_n - z_{l,n})\, g'(a_{l,n})\right] z_{k,n} \\
&= \frac{1}{N}\sum_n \delta_{l,n}\, z_{k,n}
\end{aligned}$$

Error Backpropagation
Optimize last hidden weights $w_{jk}$

Recall: $\frac{\partial R}{\partial w_{kl}} = \frac{1}{N}\sum_n \delta_{l,n}\, z_{k,n}$

Multivariate chain rule, using $a_l = \sum_k w_{kl}\, g(a_k)$:

$$\begin{aligned}
\frac{\partial R}{\partial w_{jk}} &= \frac{1}{N}\sum_n \left[\frac{\partial L_n}{\partial a_{k,n}}\right]\left[\frac{\partial a_{k,n}}{\partial w_{jk}}\right] \\
&= \frac{1}{N}\sum_n \left[\sum_l \frac{\partial L_n}{\partial a_{l,n}} \frac{\partial a_{l,n}}{\partial a_{k,n}}\right]\left[\frac{\partial a_{k,n}}{\partial w_{jk}}\right] \\
&= \frac{1}{N}\sum_n \left[\sum_l \delta_{l,n} \frac{\partial a_{l,n}}{\partial a_{k,n}}\right]\left[z_{j,n}\right] \\
&= \frac{1}{N}\sum_n \left[\sum_l \delta_{l,n}\, w_{kl}\, g'(a_{k,n})\right]\left[z_{j,n}\right] \\
&= \frac{1}{N}\sum_n \left[\delta_{k,n}\right]\left[z_{j,n}\right]
\end{aligned}$$

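Error Backpropagation
Repeat for all previous layers

$$\frac{\partial R}{\partial w_{kl}} = \frac{1}{N}\sum_n \left[\frac{\partial L_n}{\partial a_{l,n}}\right]\left[\frac{\partial a_{l,n}}{\partial w_{kl}}\right] = \frac{1}{N}\sum_n \left[-(y_n - z_{l,n})\, g'(a_{l,n})\right] z_{k,n} = \frac{1}{N}\sum_n \delta_{l,n}\, z_{k,n}$$

$$\frac{\partial R}{\partial w_{jk}} = \frac{1}{N}\sum_n \left[\frac{\partial L_n}{\partial a_{k,n}}\right]\left[\frac{\partial a_{k,n}}{\partial w_{jk}}\right] = \frac{1}{N}\sum_n \left[\sum_l \delta_{l,n}\, w_{kl}\, g'(a_{k,n})\right] z_{j,n} = \frac{1}{N}\sum_n \delta_{k,n}\, z_{j,n}$$

$$\frac{\partial R}{\partial w_{ij}} = \frac{1}{N}\sum_n \left[\frac{\partial L_n}{\partial a_{j,n}}\right]\left[\frac{\partial a_{j,n}}{\partial w_{ij}}\right] = \frac{1}{N}\sum_n \left[\sum_k \delta_{k,n}\, w_{jk}\, g'(a_{j,n})\right] z_{i,n} = \frac{1}{N}\sum_n \delta_{j,n}\, z_{i,n}$$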
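The three gradient expressions can be sketched in code for a single training example (N = 1) and compared against a finite-difference estimate. Everything below is an illustrative assumption rather than the lecture's implementation: logistic $g$, one output node, small made-up weights, and the helper names `forward`, `backprop`, `loss`, and `numeric_gradient`.

```python
import math

def g(a):
    return 1.0 / (1.0 + math.exp(-a))

def g_prime(a):
    s = g(a)
    return s * (1.0 - s)

def layer_forward(w, z_in):
    """a[q] = sum_p w[p][q] * z_in[p]; z[q] = g(a[q])."""
    a = [sum(w[p][q] * z_in[p] for p in range(len(z_in))) for q in range(len(w[0]))]
    return a, [g(v) for v in a]

def forward(w_ij, w_jk, w_kl, x):
    a_j, z_j = layer_forward(w_ij, x)
    a_k, z_k = layer_forward(w_jk, z_j)
    a_l, z_l = layer_forward(w_kl, z_k)
    return a_j, z_j, a_k, z_k, a_l, z_l

def loss(w_ij, w_jk, w_kl, x, y):
    z_l = forward(w_ij, w_jk, w_kl, x)[5]
    return 0.5 * (y - z_l[0]) ** 2

def outer(z, delta):
    """Gradient matrix with entries z[p] * delta[q]."""
    return [[zp * dq for dq in delta] for zp in z]

def backprop(w_ij, w_jk, w_kl, x, y):
    """Gradients of L = 1/2 (y - z_l)^2 for one example, last layer first."""
    a_j, z_j, a_k, z_k, a_l, z_l = forward(w_ij, w_jk, w_kl, x)
    d_l = [-(y - z_l[0]) * g_prime(a_l[0])]
    d_k = [g_prime(a_k[k]) * sum(d_l[l] * w_kl[k][l] for l in range(len(d_l)))
           for k in range(len(a_k))]
    d_j = [g_prime(a_j[j]) * sum(d_k[k] * w_jk[j][k] for k in range(len(d_k)))
           for j in range(len(a_j))]
    return outer(x, d_j), outer(z_j, d_k), outer(z_k, d_l)

# Hypothetical network: 3 inputs, 2 nodes, 2 nodes, 1 output.
w_ij = [[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]]
w_jk = [[0.3, -0.1], [0.2, 0.5]]
w_kl = [[0.7], [-0.4]]
x, y = [1.0, 0.5, -1.5], 1.0

def numeric_gradient(w, p, q, eps=1e-6):
    """Central finite difference of the loss with respect to w[p][q]."""
    w[p][q] += eps
    plus = loss(w_ij, w_jk, w_kl, x, y)
    w[p][q] -= 2 * eps
    minus = loss(w_ij, w_jk, w_kl, x, y)
    w[p][q] += eps  # restore the original weight
    return (plus - minus) / (2 * eps)

grad_ij, grad_jk, grad_kl = backprop(w_ij, w_jk, w_kl, x, y)
print(grad_jk[0][1], numeric_gradient(w_jk, 0, 1))
```

For a full training set, the per-example gradients would be averaged over the N examples, matching the $\frac{1}{N}\sum_n$ in the formulas.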

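Error Backpropagation
Now that we have well-defined gradients for each parameter, update using Gradient Descent, with step size $\eta$.

$$w_{ij}^{t+1} = w_{ij}^{t} - \eta \frac{\partial R}{\partial w_{ij}}$$

$$w_{jk}^{t+1} = w_{jk}^{t} - \eta \frac{\partial R}{\partial w_{jk}}$$

$$w_{kl}^{t+1} = w_{kl}^{t} - \eta \frac{\partial R}{\partial w_{kl}}$$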
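The update rule itself is the same for every weight. To keep the example short, the sketch below applies it to a single logistic neuron trained with squared loss on the AND function; the data, the step size `eta=1.0`, and the number of steps are assumptions chosen for illustration.

```python
import math

def g(a):
    return 1.0 / (1.0 + math.exp(-a))

def risk_and_gradient(w, data):
    """Empirical risk (1/N) sum 1/2 (y - g(w.x))^2 and its gradient in w."""
    risk, grad = 0.0, [0.0] * len(w)
    for x, y in data:
        z = g(sum(wi * xi for wi, xi in zip(w, x)))
        risk += 0.5 * (y - z) ** 2
        delta = -(y - z) * z * (1.0 - z)  # logistic: g'(a) = g(a) * (1 - g(a))
        for i, xi in enumerate(x):
            grad[i] += delta * xi
    n = len(data)
    return risk / n, [gi / n for gi in grad]

def gradient_descent(w, data, eta, steps):
    for _ in range(steps):
        _, grad = risk_and_gradient(w, data)
        # w^{t+1} = w^t - eta * dR/dw, applied to every weight
        w = [wi - eta * gi for wi, gi in zip(w, grad)]
    return w

# AND function; the leading 1.0 in each input is a bias feature.
data = [([1.0, 0.0, 0.0], 0.0), ([1.0, 0.0, 1.0], 0.0),
        ([1.0, 1.0, 0.0], 0.0), ([1.0, 1.0, 1.0], 1.0)]
w0 = [0.0, 0.0, 0.0]
w = gradient_descent(w0, data, eta=1.0, steps=2000)

print(risk_and_gradient(w0, data)[0])  # risk before training
print(risk_and_gradient(w, data)[0])   # risk after training
```

In a multilayer network the only change is that `grad` comes from backpropagation, one gradient matrix per layer.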

Error Back-propagation
Error backprop unravels the multivariate chain rule and solves the gradient for each partial component separately.
The target values for each layer come from the next layer.
This feeds the errors back along the network.

Problems with Neural Networks


Interpretation of Hidden Layers
Overfitting

Interpretation of Hidden Layers

What are the hidden layers doing?! Feature Extraction.
The non-linearities in the feature extraction can make interpretation of the hidden layers very difficult.
This leads to Neural Networks being treated as black boxes.

Overfitting in Neural Networks

Neural Networks are especially prone to overfitting.
Recall Perceptron Error: zero error is possible, but so is more extreme overfitting.

[Figure: two panels labeled Perceptron and Logistic Regression.]

Bayesian Neural Networks


Bayesian Logistic Regression is obtained by inserting a prior on the weights.
Equivalent to L2 Regularization.

We can do the same here. Error Backprop then becomes Maximum A Posteriori (MAP) rather than Maximum Likelihood (ML) training.

$$R(\theta) = \frac{1}{N}\sum_{n=0}^{N-1} L(y_n - f(x_n)) + \lambda \|\theta\|^2$$
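The penalty changes the risk by a term in $\|\theta\|^2$ and each gradient component by a term proportional to the weight itself. A minimal sketch, assuming a regularization strength named `lam` and made-up loss and gradient values:

```python
def regularized_risk(losses, weights, lam):
    """(1/N) * sum of per-example losses + lam * ||weights||^2."""
    data_term = sum(losses) / len(losses)
    penalty = lam * sum(w * w for w in weights)
    return data_term + penalty

def regularized_gradient(data_gradient, weights, lam):
    """Add the penalty gradient 2 * lam * w to each data-term gradient component."""
    return [gw + 2.0 * lam * w for gw, w in zip(data_gradient, weights)]

losses = [0.5, 0.0, 0.125]   # hypothetical per-example losses L_n
weights = [1.0, -2.0]        # hypothetical flattened weight vector theta
data_gradient = [0.3, -0.1]  # hypothetical gradient from backpropagation

print(regularized_risk(losses, weights, lam=0.1))
print(regularized_gradient(data_gradient, weights, lam=0.1))
```

The extra gradient term pulls every weight toward zero on each update, which is how the prior limits overfitting.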

Handwriting Recognition
Demo: http://yann.lecun.com/exdb/lenet/index.html

Convolutional Network
The network is not fully connected.
Different nodes are responsible for different regions of the image.
This allows for robustness to transformations.

Other Neural Networks


Multiple Outputs
Skip Layer Network
Recurrent Neural Networks

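Multiple Outputs

[Figure: a network with inputs $x_0, x_1, x_2, \ldots, x_P$, weights $\theta_{0,\cdot}$ and $\theta_{1,\cdot}$, and several output nodes.]

Used for N-way classification.
Each node in the output layer corresponds to a different class.
No guarantee that the sum of the output vector will equal 1.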
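A short sketch of an output layer with one logistic node per class shows why the outputs need not sum to 1. The weights, the hidden activations, and the "pick the largest output" decision rule are assumptions made for illustration.

```python
import math

def logistic(a):
    return 1.0 / (1.0 + math.exp(-a))

def output_layer(z_hidden, w_out):
    """w_out[c] is the weight vector of the output node for class c."""
    return [logistic(sum(w * z for w, z in zip(node, z_hidden))) for node in w_out]

def predict_class(outputs):
    # Assumed decision rule: choose the class whose node has the largest output.
    return max(range(len(outputs)), key=lambda c: outputs[c])

z_hidden = [1.0, 0.2, 0.9]     # hypothetical hidden-layer outputs
w_out = [[0.5, 1.0, -1.0],     # class 0
         [-0.5, 2.0, 1.5],     # class 1
         [0.0, -1.0, 0.5]]     # class 2

outputs = output_layer(z_hidden, w_out)
print(outputs)                 # each value lies in (0, 1)
print(sum(outputs))            # not constrained to equal 1
print(predict_class(outputs))
```

Each node is squashed independently of the others, so the output vector is not a probability distribution over classes.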

Skip Layer Network


Input nodes are also sent directly to the output layer.

[Figure: a network in which inputs $x_0, x_1, x_2$ also connect directly to the output $f(x, \theta)$.]

Recurrent Neural Networks


Output or hidden layer information is stored in a context or memory layer.

[Figure: stacked layers labeled Output Layer, Hidden Layer, Context Layer, and Input Layer.]
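One way to picture the context layer is as a copy of the previous hidden state that is fed back into the hidden layer alongside the new input. The sketch below assumes that particular arrangement, logistic nodes, an all-zero initial context, and made-up weights; the slides do not specify these details.

```python
import math

def logistic(a):
    return 1.0 / (1.0 + math.exp(-a))

def step(x, context, w_in, w_ctx, w_out):
    """One time step: hidden nodes see the current input and the context layer."""
    hidden = [logistic(sum(w * v for w, v in zip(w_in[h], x)) +
                       sum(w * c for w, c in zip(w_ctx[h], context)))
              for h in range(len(w_in))]
    output = logistic(sum(w * h for w, h in zip(w_out, hidden)))
    return output, hidden  # the hidden state is stored as the next context

def run(sequence, w_in, w_ctx, w_out):
    context = [0.0] * len(w_in)  # assumed: empty memory at the start
    outputs = []
    for x in sequence:
        y, context = step(x, context, w_in, w_ctx, w_out)
        outputs.append(y)
    return outputs

w_in = [[1.0], [-1.0]]            # hypothetical input-to-hidden weights
w_ctx = [[0.5, 0.0], [0.0, 0.5]]  # hypothetical context-to-hidden weights
w_out = [1.0, 1.0]                # hypothetical hidden-to-output weights

outputs = run([[1.0], [1.0]], w_in, w_ctx, w_out)
print(outputs)  # same input twice, different outputs because the context changed
```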


Time Delayed Recurrent Neural Networks (TDRNN)


The output layer from time t is used as input to the hidden layer at time t+1, with an optional decay.

[Figure: stacked layers labeled Output Layer, Hidden Layer, and Input Layer.]

Maximum Margin
Perceptron can lead to many equally valid choices for the decision boundary

Are these really equally valid?



Max Margin
How can we pick which is best? Maximize the size of the margin.

[Figure: two decision boundaries, labeled Small Margin and Large Margin.]

Are these really equally valid?

Next Time
Maximum Margin Classifiers
Support Vector Machines
Kernel Methods
