Neural Networks For Machine Learning: Lecture 4a Learning To Predict The Next Word
Lecture 4a
Learning to predict the next word
Geoffrey Hinton
with
Nitish Srivastava
Kevin Swersky
[Figure: the family trees example. Two isomorphic family trees, one English (Christopher = Penelope, Andrew = Christine, Margaret = Arthur, Victoria = James, Jennifer = Charles, Colin, Charlotte) and one Italian (Roberto = Maria, Pierro = Francesca, Gina = Emilio, Lucia = Marco, Angela = Tomaso, Alfonso, Sophia), together with the diagram of the network whose inputs encode a person and a relationship and whose output is the related person.]
A large-scale example
Suppose we have a database of millions of relational facts of the form (A R B).
We could train a net to discover feature vector representations of the terms that allow the third term to be predicted from the first two.
Then we could use the trained net to find very unlikely triples. These are good candidates for errors in the database.
Instead of predicting the third term, we could use all three terms as input and predict the probability that the fact is correct.
To train such a net we need a good source of false facts.
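A minimal NumPy sketch of such a net, assuming each term and relation gets a learned feature vector and a logistic output unit gives the probability that the fact is correct (the sizes, tanh hidden layer, and names are my assumptions, not the slide's):

```python
import numpy as np

rng = np.random.default_rng(0)
N_TERMS, N_RELS, D, H = 10000, 50, 30, 100      # illustrative sizes
term_vecs = rng.normal(0, 0.01, (N_TERMS, D))   # feature vectors for terms A and B
rel_vecs  = rng.normal(0, 0.01, (N_RELS, D))    # feature vectors for relations R
W1 = rng.normal(0, 0.01, (3 * D, H))
w2 = rng.normal(0, 0.01, H)

def fact_probability(a, r, b):
    # All three terms of the triple (A R B) go in; out comes the
    # estimated probability that the fact is correct.
    x = np.concatenate([term_vecs[a], rel_vecs[r], term_vecs[b]])
    h = np.tanh(x @ W1)
    return 1.0 / (1.0 + np.exp(-(h @ w2)))      # logistic output unit

# One possible source of false facts (an assumption, not stated in the slide):
# corrupt a true triple by replacing its third term with a random term.
true_fact = (12, 3, 845)
false_fact = (12, 3, int(rng.integers(N_TERMS)))
print(fact_probability(*true_fact), fact_probability(*false_fact))
```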
Softmax
The output units in a softmax group use a non-local non-linearity:

    y_i = \frac{e^{z_i}}{\sum_{j \in \text{group}} e^{z_j}}, \qquad \frac{\partial y_i}{\partial z_i} = y_i (1 - y_i)

The cost function is the cross-entropy, where t_j is the target value:

    C = -\sum_j t_j \log y_j, \qquad \frac{\partial C}{\partial z_i} = \sum_j \frac{\partial C}{\partial y_j} \frac{\partial y_j}{\partial z_i} = y_i - t_i
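A minimal NumPy sketch of these formulas (the variable names and the small example are mine, not the lecture's); it checks the analytic gradient y - t against a numerical derivative:

```python
import numpy as np

def softmax(z):
    # Subtracting the max does not change the result (softmax is invariant
    # to adding a constant to every logit) but avoids overflow.
    e = np.exp(z - np.max(z))
    return e / e.sum()

def cross_entropy(y, t):
    # C = -sum_j t_j log y_j
    return -np.sum(t * np.log(y))

z = np.array([1.0, 2.0, 0.5])     # logits z_i
t = np.array([0.0, 1.0, 0.0])     # one-hot target values t_j
y = softmax(z)

grad_z = y - t                    # analytic gradient dC/dz_i = y_i - t_i

# Numerical check of the gradient.
eps = 1e-6
numeric = np.array([
    (cross_entropy(softmax(z + eps * np.eye(3)[i]), t)
     - cross_entropy(softmax(z - eps * np.eye(3)[i]), t)) / (2 * eps)
    for i in range(3)
])
assert np.allclose(grad_z, numeric, atol=1e-5)
print(y, grad_z)
```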
Take a huge amount of text and count the frequencies of all triples of words.
Use these frequencies to make bets on the relative probabilities of words given the previous two words:

    \frac{p(w_3 = c \mid w_2 = b, w_1 = a)}{p(w_3 = d \mid w_2 = b, w_1 = a)} = \frac{\text{count}(abc)}{\text{count}(abd)}
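A minimal sketch of this counting scheme in Python (the toy corpus and function names are illustrative):

```python
from collections import Counter

def trigram_counts(words):
    # Count the frequencies of all consecutive triples of words in the corpus.
    return Counter(zip(words, words[1:], words[2:]))

def relative_odds(counts, a, b, c, d):
    # count(abc) / count(abd): how much more likely c is than d
    # as the next word after the pair (a, b).
    return counts[(a, b, c)] / max(counts[(a, b, d)], 1)   # guard against zero counts

corpus = "the cat sat on the mat and the cat sat on the hat".split()
counts = trigram_counts(corpus)
print(relative_odds(counts, "cat", "sat", "on", "by"))     # 2.0: "on" seen twice, "by" never
```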
[Figure: the neural net for predicting the next word. The words at t-2 and t-1 are each turned, by table look-up, into a learned distributed encoding; these encodings feed units that learn to predict the output word from features of the input words.]
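A minimal NumPy sketch of this kind of architecture, assuming illustrative sizes and a tanh hidden layer (those details are my assumptions, not the slide's):

```python
import numpy as np

V, D, H = 1000, 50, 200                  # vocabulary, encoding, hidden sizes (illustrative)
rng = np.random.default_rng(0)
E  = rng.normal(0, 0.01, (V, D))         # table of learned distributed encodings, one row per word
W1 = rng.normal(0, 0.01, (2 * D, H))     # encodings of the two input words -> hidden units
W2 = rng.normal(0, 0.01, (H, V))         # hidden units -> one logit per possible output word

def predict_next(word_t2, word_t1):
    # Table look-up of the distributed encodings of the two context words.
    x = np.concatenate([E[word_t2], E[word_t1]])
    h = np.tanh(x @ W1)                  # features of the input words
    z = h @ W2
    z -= z.max()
    return np.exp(z) / np.exp(z).sum()   # softmax over the whole vocabulary

p = predict_next(3, 17)
print(p.shape, p.sum())                  # (1000,) 1.0
```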
A serial architecture
Try all candidate next words one at a time.
[Figure: the serial architecture. The learned distributed encoding of word t-1 (obtained by table look-up) and the learned distributed encoding of a candidate (obtained by table look-up on the index of the candidate) are fed to the net, which is run once for each candidate next word.]
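A hedged sketch of how such scoring might look, assuming the net produces a single logit per candidate that is then normalised over the candidates that were tried (the layer sizes, tanh hidden layer, and function names are my own):

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 50, 100                                 # encoding and hidden sizes (illustrative)
W_hidden = rng.normal(0, 0.01, (2 * D, H))
w_score  = rng.normal(0, 0.01, H)

def candidate_logit(context_vec, candidate_vec):
    # One forward pass: the hidden units see the encoding of the context word
    # and of one candidate, and produce a single score (logit) for that candidate.
    h = np.tanh(np.concatenate([context_vec, candidate_vec]) @ W_hidden)
    return h @ w_score

def serial_softmax(context_vec, candidate_vecs):
    # Try all candidate next words one at a time, then normalise the scores.
    logits = np.array([candidate_logit(context_vec, c) for c in candidate_vecs])
    logits -= logits.max()
    return np.exp(logits) / np.exp(logits).sum()

context = rng.normal(size=D)                   # encoding of word t-1
candidates = rng.normal(size=(10, D))          # encodings of 10 candidate words
print(serial_softmax(context, candidates))
```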
[Figure: predicting the next word with a binary tree over all the words. The indices of the words at t-2 and t-1 are turned, by table look-up, into learned distributed encodings, which determine a prediction vector v. Each node i of the tree has a learned vector u_i (the figure shows nodes u_i, u_j, u_k, u_l, u_m, u_n); the probability of taking one branch at node i is \sigma(v^T u_i) and of taking the other branch is 1 - \sigma(v^T u_i). Following a path of such decisions from the root picks out the word w(t).]
A convenient decomposition
Maximizing the log probability of picking the target word is equivalent to maximizing the sum of the log probabilities of taking all the branches on the path that leads to the target word.
So during learning, we only need to consider the nodes on the correct path. This is an exponential win: log(N) instead of N.
For each of these nodes, we know the correct branch and we know the current probability of taking it, so we can get derivatives for learning both the prediction vector v and that node vector u.
Unfortunately, it is still slow at test time.
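A minimal NumPy sketch of this decomposition, assuming a logistic branch probability \sigma(v^T u_i) as in the figure above (the path representation and variable names are mine):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def log_prob_of_target(v, path_nodes, path_directions):
    """Sum of log branch probabilities along the path to the target word.

    v               -- prediction vector computed from the context words
    path_nodes      -- node vectors u_i on the path from the root to the word
    path_directions -- +1 where the correct branch has probability sigma(v.u_i),
                       -1 where it has probability 1 - sigma(v.u_i)
    """
    logp = 0.0
    for u, d in zip(path_nodes, path_directions):
        logp += np.log(sigmoid(d * (v @ u)))   # sigma(-x) = 1 - sigma(x)
    return logp

rng = np.random.default_rng(0)
v = rng.normal(size=50)
path_nodes = [rng.normal(size=50) for _ in range(12)]      # ~log2(N) nodes, not N
path_dirs  = [1, -1, 1, 1, -1, 1, -1, -1, 1, 1, -1, 1]
print(log_prob_of_target(v, path_nodes, path_dirs))
```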
[Figure: learning word codes by deciding "right or random?". Each of the five words in a window (the word at t-2, the word at t-1, the word at t or a random word, the word at t+1, the word at t+2) is converted into its word code; these codes feed units that learn to predict the output from features of the input words: is the middle word right or random?]
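A hedged sketch of this kind of training signal, assuming a single logistic output over the concatenated codes of the five-word window and corruption by replacing the middle word with a random word (sizes and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
V, D, H = 5000, 50, 100                       # vocabulary, code, hidden sizes (illustrative)
E  = rng.normal(0, 0.01, (V, D))              # word codes, one row per word
W1 = rng.normal(0, 0.01, (5 * D, H))
w2 = rng.normal(0, 0.01, H)

def p_right(window):
    # window: indices of the five words (t-2, t-1, t, t+1, t+2).
    x = np.concatenate([E[w] for w in window])
    h = np.tanh(x @ W1)
    return 1.0 / (1.0 + np.exp(-(h @ w2)))    # probability the middle word is "right"

real = [11, 42, 7, 99, 3]                     # a window taken from real text
corrupted = list(real)
corrupted[2] = int(rng.integers(V))           # replace the middle word with a random word
# Training (not shown) would push p_right(real) towards 1 and p_right(corrupted) towards 0.
print(p_right(real), p_right(corrupted))
```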