Transformers
Mausam
IIT Delhi
(some figures taken from Jay Alammar’s blog)
Attention
Sentence Representation
• Encoding the whole sentence as a single vector is too restrictive.
• Instead of producing a single vector for the sentence, produce one vector for each word.
• But we eventually need a single vector.
[Figure: multiple vectors → single vector]
• Sum/Avg operators give equal importance to each input.
• Instead, we dynamically decide which input is more/less important for a task.
• Create a weighted sum to reflect this variation: Attention

"You can't cram the meaning of the whole *%#@ing sentence in a single *%#@ing vector." – Ray Mooney

• query ($q$): decides the importance of each input
• attention weights ($\alpha_i$): normalized importance of input $i$
• unnormalized attention weights ($\bar{\alpha}_i$): intermediate step used to compute $\alpha_i$
• attended summary: weighted average of the inputs with the $\alpha$ weights (see the tiny worked example below)
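A tiny worked NumPy example of these quantities, assuming three made-up inputs and made-up unnormalized weights (the comparison against a query $q$ that would produce them is left abstract here):

```python
import numpy as np

# Three hypothetical 2-d inputs and made-up unnormalized attention weights
# (in practice these come from comparing each input against a query q).
inputs = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [1.0, 1.0]])
raw = np.array([2.0, 1.0, 0.1])          # unnormalized weights (alpha-bar)

alpha = np.exp(raw) / np.exp(raw).sum()  # normalized weights, roughly [0.66, 0.24, 0.10]
summary = alpha @ inputs                 # attended summary: weighted average of the inputs
print(alpha, summary)
```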
LSTM Encoder
[Figure: a bi-LSTM encoder turns multiple encoded vectors $h_1, \dots, h_T$ into a single summary $c$]
$h_{1:T} = \text{biLSTM}(x_{1:T})$
$c = \text{attend}(h_{1:T}, q)$
Need to convert the $h_i$'s to $c$.
$c = \sum_{i=1}^{T} \alpha_i \cdot h_i$
[Figure: weighted combination of $h_1, \dots, h_5$ into $c$]
Multiple Encoded Vectors → Single Summary
$c = \sum_{i=1}^{T} \alpha_i \cdot h_i$
$\alpha_{1:T} = \text{softmax}(\bar{\alpha}_1, \bar{\alpha}_2, \dots, \bar{\alpha}_T)$
[Figure: the pairs $(\alpha_i, h_i)$, $i = 1 \dots 5$, combined into $c$]
Multiple Encoded Vectors → Single Summary
$c = \sum_{i=1}^{T} \alpha_i \cdot h_i$
$\alpha_{1:T} = \text{softmax}(\bar{\alpha}_1, \bar{\alpha}_2, \dots, \bar{\alpha}_T)$
$\bar{\alpha}_i = \phi^{att}(q, h_i)$
[Figure: the pairs $(\alpha_i, h_i)$, $i = 1 \dots 5$, combined into $c$]
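A minimal NumPy sketch of this attend operation, assuming a plain dot product for $\phi^{att}$ (the scoring variants are listed on the next slide); the function and variable names are illustrative:

```python
import numpy as np

def softmax(scores):
    # Subtract the max for numerical stability before exponentiating.
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()

def attend(h, q):
    """Attend over encoder states h (T x d) with query q (d,).

    Returns the attended summary c = sum_i alpha_i * h_i and the weights.
    """
    raw = h @ q              # unnormalized weights: phi_att(q, h_i) = q . h_i
    alpha = softmax(raw)     # normalized attention weights alpha_{1:T}
    c = alpha @ h            # weighted sum of the h_i's
    return c, alpha

# Toy example: T = 5 encoder states of dimension d = 4.
h = np.random.randn(5, 4)
q = np.random.randn(4)
c, alpha = attend(h, q)
print(alpha, c.shape)        # alpha sums to 1; c has shape (4,)
```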
Attention: Encoding
$c = \sum_{i=1}^{T} \alpha_i \cdot h_i$
$\alpha = \text{softmax}(\bar{\alpha}_1, \dots, \bar{\alpha}_T)$
$\bar{\alpha}_i = \phi^{att}(q, h_i)$
• Bahdanau Attention: $\phi^{att}(q, h) = u \cdot g(Wq + W'h + b)$
• Luong Attention: $\phi^{att}(q, h) = q \cdot h$
• Scaled Dot-Product Attention: $\phi^{att}(q, h) = \frac{q \cdot h}{\sqrt{d}}$
• Bilinear Attention: $\phi^{att}(q, h) = hWq$
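Minimal NumPy sketches of the four scoring functions, assuming $g$ is tanh in the additive form; all parameter shapes are illustrative and the matrices are random stand-ins for trained weights:

```python
import numpy as np

d, d_att = 8, 16                      # illustrative dimensions
W  = np.random.randn(d_att, d)        # projects the query (W)
Wp = np.random.randn(d_att, d)        # projects the encoder state (W')
b  = np.random.randn(d_att)
u  = np.random.randn(d_att)
Wb = np.random.randn(d, d)            # bilinear form

def bahdanau(q, h):
    # Additive attention: u . g(Wq + W'h + b), with g = tanh here.
    return u @ np.tanh(W @ q + Wp @ h + b)

def luong(q, h):
    # Multiplicative (dot-product) attention.
    return q @ h

def scaled_dot_product(q, h):
    # Dot product scaled by sqrt(d) to keep the scores in a stable range.
    return (q @ h) / np.sqrt(d)

def bilinear(q, h):
    # General / bilinear attention: h W q.
    return h @ Wb @ q

q, h = np.random.randn(d), np.random.randn(d)
for phi in (bahdanau, luong, scaled_dot_product, bilinear):
    print(phi.__name__, phi(q, h))
```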
Additive vs Multiplicative
Paper's Justification: dot-product (multiplicative) attention is much faster and more space-efficient in practice, since it can be implemented with highly optimized matrix-multiplication code; for large $d$ the dot products grow large and push the softmax into regions with very small gradients, which the $1/\sqrt{d}$ scaling counteracts.
$c = \sum_{i=1}^{T} \alpha_i \cdot h_i$
$\alpha = \text{softmax}(\bar{\alpha}_1, \dots, \bar{\alpha}_T)$
$\bar{\alpha}_i = \phi^{att}(q, h_i)$
What is $q$?
Attention and/vs Interpretation
Multi-head Key-Value Self Attention
Self-attention (single-head, high-level)
"The animal didn't cross the street because it was too tired"
$c = \sum_{i=1}^{T} \alpha_i \cdot x_i$
$\alpha = \text{softmax}(\bar{\alpha}_1, \dots, \bar{\alpha}_T)$
$\bar{\alpha}_i = \phi^{att}(q, x_i)$
Attention: Encoding
$c = \sum_{i=1}^{T} \alpha_i \cdot x_i$
$\alpha = \text{softmax}(\bar{\alpha}_1, \dots, \bar{\alpha}_T)$
$\bar{\alpha}_i = \phi^{att}(q, x_i)$
Each vector $x_i$ plays two roles: (1) computing the importance weights, and (2) forming the weighted sum.
Key-Value Attention
• Project each input vector $x_i$ into two vectors:
  k: key vector: $k_i = W^K x_i$ (used to compute importance)
  v: value vector: $v_i = W^V x_i$ (used in the weighted sum)
$\bar{\alpha}_i = \phi^{att}(q, k_i)$
$c = \sum_{i=1}^{T} \alpha_i \cdot v_i$
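A minimal NumPy sketch of key-value attention with an external query $q$, assuming scaled dot-product scoring against the keys; the projection matrices are randomly initialized stand-ins for trained parameters:

```python
import numpy as np

d_in, d = 6, 4
W_K = np.random.randn(d, d_in)        # key projection W^K
W_V = np.random.randn(d, d_in)        # value projection W^V

def key_value_attention(X, q):
    """X: (T, d_in) input vectors; q: (d,) external query."""
    K = X @ W_K.T                     # keys   k_i = W^K x_i, shape (T, d)
    V = X @ W_V.T                     # values v_i = W^V x_i, shape (T, d)
    raw = (K @ q) / np.sqrt(d)        # unnormalized weights from the keys only
    alpha = np.exp(raw - raw.max())
    alpha /= alpha.sum()              # softmax over positions
    return alpha @ V                  # summary built from the values only

X = np.random.randn(5, d_in)
q = np.random.randn(d)
c = key_value_attention(X, q)         # shape (d,)
```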
Key-Value Single-Head Self Attention
• Project each input vector $x_i$ into three vectors:
  k: key vector: $k_i = W^K x_i$
  v: value vector: $v_i = W^V x_i$
  q: query vector: $q_i = W^Q x_i$
• Use the key and query vectors to compute the attention of the $i$-th word at word $j$ (scaled multiplicative attention):
$\phi^{att}(x_j, x_i) = \frac{k_i \cdot q_j}{\sqrt{d}}$
• Use the value vectors to compute the attended summary:
$c^j = \sum_{i=1}^{T} \alpha_i^j \cdot v_i$
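A minimal NumPy sketch of single-head key-value self attention following the per-word formulas above; again the projection matrices are random stand-ins for trained weights:

```python
import numpy as np

d_model, d = 8, 4
W_Q = np.random.randn(d, d_model)
W_K = np.random.randn(d, d_model)
W_V = np.random.randn(d, d_model)

def self_attention(X):
    """X: (T, d_model) word vectors. Returns (T, d) attended summaries."""
    Q = X @ W_Q.T                      # q_i = W^Q x_i
    K = X @ W_K.T                      # k_i = W^K x_i
    V = X @ W_V.T                      # v_i = W^V x_i
    out = np.zeros((X.shape[0], d))
    for j in range(X.shape[0]):        # one summary c^j per word j
        raw = (K @ Q[j]) / np.sqrt(d)  # phi_att(x_j, x_i) = k_i . q_j / sqrt(d)
        alpha = np.exp(raw - raw.max())
        alpha /= alpha.sum()           # softmax over positions i
        out[j] = alpha @ V             # c^j = sum_i alpha_i^j v_i
    return out

X = np.random.randn(5, d_model)
C = self_attention(X)                  # shape (5, 4)
```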
Key-Value Single-Head Self Attention
Creation of the query, key and value vectors by multiplying the inputs by trained weight matrices.
Matrix multiplications are quite efficient and can be done in an aggregated (batched) manner, as sketched below.
Images from https://jalammar.github.io/illustrated-transformer/
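A minimal NumPy sketch of that aggregated matrix form, in which the whole sequence is handled by a few matrix multiplications (the dimensions are illustrative and the weight matrices random):

```python
import numpy as np

T, d_model, d = 5, 8, 4
X = np.random.randn(T, d_model)            # all word vectors stacked as rows
W_Q = np.random.randn(d_model, d)
W_K = np.random.randn(d_model, d)
W_V = np.random.randn(d_model, d)

Q, K, V = X @ W_Q, X @ W_K, X @ W_V        # (T, d) each, one matmul per projection
scores = Q @ K.T / np.sqrt(d)              # (T, T): all pairwise scaled dot products
scores -= scores.max(axis=-1, keepdims=True)
alpha = np.exp(scores)
alpha /= alpha.sum(axis=-1, keepdims=True) # row-wise softmax over positions
out = alpha @ V                            # (T, d): one attended summary per word
```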
Key-Value Single-Head Self Attention
Learned positional embeddings can cover any sentence length seen in training, but if a test input is longer than all training inputs, we will face issues.
Solution: use a functional form, as in the Transformer paper (sinusoidal encoding); a sketch of this encoding follows.
Images from https://jalammar.github.io/illustrated-transformer/
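A minimal NumPy sketch of the sinusoidal encoding from the Transformer paper, where $PE(pos, 2i) = \sin\!\big(pos / 10000^{2i/d_{model}}\big)$ and $PE(pos, 2i+1) = \cos\!\big(pos / 10000^{2i/d_{model}}\big)$; the function name and dimensions are illustrative:

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """Return a (max_len, d_model) matrix of sinusoidal position encodings."""
    pos = np.arange(max_len)[:, None]             # positions 0 .. max_len-1
    i = np.arange(0, d_model, 2)[None, :]         # even dimension indices (2i)
    angle = pos / np.power(10000.0, i / d_model)  # pos / 10000^(2i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)                   # even dims: sine
    pe[:, 1::2] = np.cos(angle)                   # odd dims: cosine
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=16)
# Added to the word embeddings before the first self-attention layer.
```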
Adding residual
connections...
MLP+Softmax
Transformer
Pros
● Current state-of-the-art.