Unit 4: LSTM
Pushpak Bhattacharyya
CSE Dept,
IIT Patna and Bombay
LSTM
Δw_ji = η (∑_{k ∈ next layer} w_kj δ_k) o_j (1 − o_j) o_i   for hidden layers
[Figure: attention-based decoding for sentiment. Each context vector c_1..c_5 is a weighted combination (weights a_i1..a_i4) of the encoder outputs o_1..o_4 and feeds the decoder states h_0..h_5; the final prediction is the label "Positive sentiment".]
tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x))
f(x) = max(0, x)        (ReLU)
g(x) = ln(1 + e^x)      (softplus)
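A minimal NumPy sketch of these three activations, written out exactly as in the formulas above (the function names are mine):

```python
import numpy as np

def tanh(x):
    # tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

def relu(x):
    # f(x) = max(0, x)
    return np.maximum(0.0, x)

def softplus(x):
    # g(x) = ln(1 + e^x)
    return np.log1p(np.exp(x))

x = np.array([-2.0, 0.0, 2.0])
print(tanh(x), relu(x), softplus(x))
```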
Notation: output
● o_t is the output at step t
● o_t = softmax(V · s_t)
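A small sketch of this output computation; the dimensions and random weights below are arbitrary stand-ins, not values from the slides:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())      # shift for numerical stability
    return e / e.sum()

# Hypothetical sizes: hidden state of dimension 4, output vocabulary of size 6.
V = np.random.randn(6, 4)        # output projection matrix V
s_t = np.random.randn(4)         # hidden state s_t at step t
o_t = softmax(V @ s_t)           # o_t = softmax(V . s_t)
print(o_t, o_t.sum())            # a probability distribution summing to 1
```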
Vanilla RNN
● and I am regretting/appreciating my decision to have bought the refrigerator from this company (appreciating → 'to'; regretting → 'by')
● 'Appreciating'/'Regretting': transparent; available on the surface
tanh produces a candidate cell-state vector; it is multiplied by the input gate, whose 0-1 values again control what and how much of the input goes forward
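A minimal sketch of this gating arithmetic for one LSTM step; the parameter names, stacking, and toy sizes are my own assumptions, not the slides' notation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step; W, U, b stack the parameters of the input (i),
    forget (f), output (o) gates and the tanh candidate (g)."""
    z = W @ x_t + U @ h_prev + b
    i, f, o, g = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)   # gates squashed to 0..1
    g = np.tanh(g)                                 # candidate cell-state vector
    c_t = f * c_prev + i * g                       # input gate controls what goes forward
    h_t = o * np.tanh(c_t)
    return h_t, c_t

d, dx = 3, 2                                       # toy hidden and input sizes
W = np.random.randn(4 * d, dx)
U = np.random.randn(4 * d, d)
b = np.zeros(4 * d)
h_t, c_t = lstm_step(np.random.randn(dx), np.zeros(d), np.zeros(d), W, U, b)
```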
How does a sequence-to-sequence model work? Let's see two paradigms
[Figure: encoder-decoder. The input "I read the book" is encoded (Encoding); (2) a representation of the sentence is generated; the decoder then unrolls states s_0..s_4 to produce the output (Decoding).]
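A toy sketch of this encode-then-decode flow; the random embeddings and shared toy weights below are stand-ins for learned parameters:

```python
import numpy as np

def rnn_step(x, h, Wx, Wh, b):
    return np.tanh(Wx @ x + Wh @ h + b)

d = 4                                            # toy hidden/embedding size
Wx, Wh, b = np.eye(d) * 0.5, np.eye(d) * 0.5, np.zeros(d)

# Encoding: "I read the book" -> a single sentence representation.
source = [np.random.randn(d) for _ in range(4)]  # stand-in word embeddings
h = np.zeros(d)
for x in source:
    h = rnn_step(x, h, Wx, Wh, b)
sentence_repr = h                                # (2) representation of the sentence

# Decoding: unroll decoder states s_0..s_4 starting from that representation.
s = sentence_repr
states = [s]
for t in range(4):
    s = rnn_step(np.zeros(d), s, Wx, Wh, b)      # a real decoder feeds back its own output
    states.append(s)
```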
E.g., to generate the 3rd POS tag, the 3rd annotation vector (and hence the 3rd word) is most important
[Figure: attention weights a_ij. Each context vector c_1..c_5 is a weighted combination (weights a_i1..a_i4) of the annotation vectors o_1..o_4 and feeds the decoder states h_0..h_5.]
a_ij = A(o_j, h_i; θ)
Training the attention network jointly with the rest of the network then ensures that the attention weights are learned so as to minimize the translation loss
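A sketch of one attention step under these definitions. The additive scorer and the parameter names Wa, Ua, va are my assumptions about the form of A; only a_ij = A(o_j, h_i; θ) and the weighted combination come from the slides:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attention_weights(o, h_i, Wa, Ua, va):
    """a_ij = A(o_j, h_i; theta): score each annotation vector o_j
    against the current decoder state h_i, then normalise."""
    scores = np.array([va @ np.tanh(Wa @ o_j + Ua @ h_i) for o_j in o])
    return softmax(scores)                        # weights a_i1..a_i4 sum to 1

d = 4
o = [np.random.randn(d) for _ in range(4)]        # annotation vectors o_1..o_4
h_i = np.random.randn(d)                          # current decoder state
Wa, Ua, va = np.random.randn(d, d), np.random.randn(d, d), np.random.randn(d)

a_i = attention_weights(o, h_i, Wa, Ua, va)
c_i = sum(w * o_j for w, o_j in zip(a_i, o))      # context vector: weighted sum of o_j
```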
OK, but do the attention weights actually show focus on
certain parts?
The RNN units model only the sequence seen so far; they cannot see the sequence ahead
● Can use a bidirectional RNN/LSTM
● This is just two LSTM encoders run from opposite ends of the sequence, with the resulting output vectors composed
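A minimal sketch of the "two encoders from opposite ends" idea; composing by concatenation is one common choice (the slide does not fix the composition), and the toy cell stands in for an LSTM:

```python
import numpy as np

def run_rnn(xs, h0, step):
    hs, h = [], h0
    for x in xs:
        h = step(x, h)
        hs.append(h)
    return hs

d = 4
W = np.random.randn(d, d) * 0.1
step = lambda x, h: np.tanh(x + W @ h)             # toy cell; a real model uses an LSTM here

xs = [np.random.randn(d) for _ in range(5)]
fwd = run_rnn(xs, np.zeros(d), step)               # left-to-right encoder
bwd = run_rnn(xs[::-1], np.zeros(d), step)[::-1]   # right-to-left encoder, re-aligned
composed = [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]  # compose per position
```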
Both types of RNN units process the sequence sequentially, hence parallelism is
limited
● NLP happens in layers
[Figure: the NLP layer hierarchy: Morphology, POS tagging, Chunking, Parsing, Semantics, Discourse and Coreference, with increasing complexity of analysis and processing up the stack; algorithms (HMM, MEMM, CRF) and languages (Hindi, English, Marathi, French) shown alongside.]
● Role in NLP?