
Attention & Transformers

Mausam
IIT Delhi
(some figures taken from Jay Alammar’s blog)
Attention
Sentence Representation
• Encoding the sentence as a single vector is too restrictive.
Instead of producing a single vector for the sentence,
produce one vector for each word.
• But we eventually need a single vector.
Multiple vectors → Single vector
Sum/Avg operators give equal importance to each input
• We dynamically decide which input is more/less important for a task.
• Create a weighted sum to reflect this variation: Attention

"You can't cram the meaning of the whole *%#@ing sentence in a single *%#@ing vector."

• query (q): decides the importance of each input
• attention weights (αᵢ): normalized importance of each input
• unnormalized attention weights (ᾱᵢ): intermediate step to compute αᵢ
• attended summary: weighted average of the inputs with the α weights
LSTM Encoder

[Figure: bi-LSTM encoder producing a single context vector c]
Multiple Encoded Vectors → Single Summary

h₁:T = biLSTM(x₁:T)
c = attend(h₁:T, q)

Need to convert the hᵢ's to c:

c = Σᵢ₌₁ᵀ αᵢ · hᵢ

[Figure: h₁ … h₅ combined into c]
Multiple Encoded Vectors → Single Summary

c = Σᵢ₌₁ᵀ αᵢ · hᵢ
α₁:T = softmax(ᾱ₁, ᾱ₂, …, ᾱ_T)

[Figure: weights α₁…α₅ applied to h₁…h₅ to form c]
Multiple Encoded Vectors → Single Summary

c = Σᵢ₌₁ᵀ αᵢ · hᵢ
α₁:T = softmax(ᾱ₁, ᾱ₂, …, ᾱ_T)
ᾱᵢ = φ^att(q, hᵢ)

[Figure: weights α₁…α₅ applied to h₁…h₅ to form c]
Attention: Encoding

c = Σᵢ₌₁ᵀ αᵢ · hᵢ
h₁:T = biLSTM_enc(x₁:T)
α = softmax(ᾱ₁, …, ᾱ_T)
ᾱᵢ = φ^att(q, hᵢ)

what is φ^att? what is q?
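The attend operation above can be sketched in a few lines of NumPy. This is a minimal illustration, assuming H holds precomputed biLSTM states and leaving the scoring function φ as a parameter (the choices for φ^att come on the next slide); all names are illustrative.

```python
import numpy as np

def attend(H, q, phi):
    """Attended summary c = sum_i alpha_i * h_i.

    H   : (T, d) matrix of encoder states h_1..h_T
    q   : (d,) query vector
    phi : scoring function phi(q, h_i) -> scalar (unnormalized weight)
    """
    raw = np.array([phi(q, h) for h in H])   # unnormalized weights, one per h_i
    alpha = np.exp(raw - raw.max())          # softmax (numerically stabilized)
    alpha = alpha / alpha.sum()
    c = alpha @ H                            # weighted sum of the h_i
    return c, alpha

# usage with a placeholder dot-product score (the options for phi^att are discussed next)
T, d = 5, 8
H = np.random.randn(T, d)                    # stand-in for biLSTM outputs h_1:T
q = np.random.randn(d)
c, alpha = attend(H, q, lambda q, h: q @ h)
```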


Attention Functions φ^att

• Bahdanau Attention: φ^att(q, h) = u · g(Wq + W′h + b)

• Luong Attention: φ^att(q, h) = q · h

• Scaled Dot Product Attention: φ^att(q, h) = (q · h) / √d

• Bilinear Attention: φ^att(q, h) = hWq
Additive vs Multiplicative

d is the dimensionality of q and h.

Paper's justification (for scaled dot-product attention):
To illustrate why the dot products get large, assume that the components of q and h are independent random variables with mean 0 and variance 1. Then their dot product q · h has mean 0 and variance d.
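A NumPy sketch of the four scoring functions above, with illustrative parameters W, W′, u, b, and a bilinear matrix (tanh stands in for the nonlinearity g); the last lines check the paper's variance argument numerically.

```python
import numpy as np

d = 64
rng = np.random.default_rng(0)
W, Wp = rng.normal(size=(d, d)), rng.normal(size=(d, d))   # illustrative parameters
u, b = rng.normal(size=d), rng.normal(size=d)
Wb = rng.normal(size=(d, d))

def bahdanau(q, h):            # additive: u . g(Wq + W'h + b), with g = tanh here
    return u @ np.tanh(W @ q + Wp @ h + b)

def luong(q, h):               # multiplicative (dot product)
    return q @ h

def scaled_dot(q, h):          # dot product scaled by sqrt(d)
    return (q @ h) / np.sqrt(d)

def bilinear(q, h):            # h W q
    return h @ Wb @ q

# numerical check: q . h has variance ~ d when components are independent N(0, 1)
qs, hs = rng.normal(size=(10000, d)), rng.normal(size=(10000, d))
dots = (qs * hs).sum(axis=1)
print(dots.var(), (dots / np.sqrt(d)).var())   # ~ d and ~ 1 respectively
```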
Attention: Encoding

c = Σᵢ₌₁ᵀ αᵢ · hᵢ
h₁:T = biLSTM_enc(x₁:T)
α = softmax(ᾱ₁, …, ᾱ_T)
ᾱᵢ = φ^att(q, hᵢ)

what is q?
Attention and/vs Interpretation
Multi-head Key-Value Self Attention
Self-attention (single-head, high-level)
"The animal didn't cross the street because it was too tired"

• There is no external query q. The input is also the query.

• Many approaches: https://ruder.io/deep-learning-nlp-best-practices/

• Transformers: the query q is another xⱼ: φ^att(xⱼ, xᵢ)
Attention: Encoding (h → x)

c = Σᵢ₌₁ᵀ αᵢ · xᵢ
α = softmax(ᾱ₁, …, ᾱ_T)
ᾱᵢ = φ^att(q, xᵢ)
Attention: Encoding

c = Σᵢ₌₁ᵀ αᵢ · xᵢ
α = softmax(ᾱ₁, …, ᾱ_T)
ᾱᵢ = φ^att(q, xᵢ)

Each vector x plays two roles: (1) computing importance, (2) the weighted sum.
Key-Value Attention
• Project an input vector xᵢ into two vectors
  k: key vector kᵢ = W^K xᵢ
  v: value vector vᵢ = W^V xᵢ

• Use the key vector for computing attention
  φ^att(q, xᵢ) = φ^att(q, kᵢ) = (kᵢ · q) / √d   // scaled multiplicative

• Use the value vector for computing the attended summary
  c = Σᵢ₌₁ᵀ αᵢ · vᵢ
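A sketch of key-value attention with an external query q, following the slide: keys score the inputs, values form the summary. The projection matrices W_K and W_V are randomly initialized here purely for illustration.

```python
import numpy as np

def kv_attention(X, q, WK, WV):
    """X: (T, d_x) input vectors x_1..x_T; q: (d_k,) external query."""
    K = X @ WK.T                         # keys   k_i = W_K x_i
    V = X @ WV.T                         # values v_i = W_V x_i
    d = K.shape[-1]
    raw = K @ q / np.sqrt(d)             # scaled multiplicative scores k_i . q / sqrt(d)
    alpha = np.exp(raw - raw.max()); alpha /= alpha.sum()
    return alpha @ V                     # c = sum_i alpha_i v_i

T, d_x, d_k = 5, 16, 8
rng = np.random.default_rng(1)
X = rng.normal(size=(T, d_x))
q = rng.normal(size=d_k)
WK, WV = rng.normal(size=(d_k, d_x)), rng.normal(size=(d_k, d_x))
c = kv_attention(X, q, WK, WV)           # attended summary, shape (d_k,)
```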
Key-Value Single-Head Self Attention
• Project an input vector xᵢ into three vectors
  k: key vector: kᵢ = W^K xᵢ
  v: value vector: vᵢ = W^V xᵢ
  q: query vector: qᵢ = W^Q xᵢ

• Use the key and query vectors for computing the attention of the iᵗʰ word at word j
  φ^att(xⱼ, xᵢ) = (kᵢ · qⱼ) / √d   // scaled multiplicative

• Use the value vectors for computing the attended summary
  cⱼ = Σᵢ₌₁ᵀ αᵢ · vᵢ
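For self-attention this is usually computed for all positions at once in matrix form. The sketch below is an illustrative NumPy version of the per-position formula above; row j of the output is the summary cⱼ.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, WQ, WK, WV):
    """Single-head key-value self-attention.
    X: (T, d_x).  Row j of the result is c_j = sum_i alpha_{ji} v_i.
    """
    Q, K, V = X @ WQ.T, X @ WK.T, X @ WV.T      # q_j, k_i, v_i for every position
    d = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d)               # scores[j, i] = q_j . k_i / sqrt(d)
    alpha = softmax(scores, axis=-1)            # normalize over i for each query j
    return alpha @ V                            # (T, d) attended summaries

T, d_x, d = 6, 16, 8
rng = np.random.default_rng(2)
X = rng.normal(size=(T, d_x))
WQ, WK, WV = (rng.normal(size=(d, d_x)) for _ in range(3))
C = self_attention(X, WQ, WK, WV)               # one summary vector per word
```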
Key-Value Single-Head Self Attention
• Creation of query, key and value vectors by multiplying by trained weight matrices

• Separation of value from key and query

• Matrix multiplications are quite efficient and can be done in an aggregated manner
Images from https://jalammar.github.io/illustrated-transformer/
Key-Value Single-Head Self Attention

Images from https://jalammar.github.io/illustrated-transformer/


Key-Value Single-Head Self Attention

Images from https://jalammar.github.io/illustrated-transformer/


Key-Value Multi-Head Self Attention

Images from https://jalammar.github.io/illustrated-transformer/


Multi-Head Attention

Images from https://jalammar.github.io/illustrated-transformer/


Multi-Head Attended Vector → Output

[Figure: one attended vector for each word, one for each attention head]

Images from https://jalammar.github.io/illustrated-transformer/


Key-Value Multi-Head Self Attention (summary)

Images from https://jalammar.github.io/illustrated-transformer/
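A NumPy sketch of the multi-head computation summarized above: each head applies the single-head formula with its own projections, the per-head outputs are concatenated, and a final projection (W^O in the Transformer paper) maps back to the model dimension. Shapes and initialization are illustrative.

```python
import numpy as np

def multi_head_self_attention(X, heads, WO):
    """heads: list of (WQ, WK, WV) triples, one per attention head."""
    outputs = []
    for WQ, WK, WV in heads:
        Q, K, V = X @ WQ.T, X @ WK.T, X @ WV.T
        d = K.shape[-1]
        scores = Q @ K.T / np.sqrt(d)
        alpha = np.exp(scores - scores.max(axis=-1, keepdims=True))
        alpha /= alpha.sum(axis=-1, keepdims=True)
        outputs.append(alpha @ V)                   # (T, d_head) per head
    Z = np.concatenate(outputs, axis=-1)            # concatenate the heads: (T, n_heads*d_head)
    return Z @ WO.T                                 # project back to the model dimension

T, d_model, n_heads = 6, 32, 4
d_head = d_model // n_heads
rng = np.random.default_rng(3)
X = rng.normal(size=(T, d_model))
heads = [tuple(rng.normal(size=(d_head, d_model)) for _ in range(3)) for _ in range(n_heads)]
WO = rng.normal(size=(d_model, n_heads * d_head))
out = multi_head_self_attention(X, heads, WO)       # (T, d_model), one vector per word
```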


Multi-head Self-attention visualisation (Interpretable?!)

Images from https://jalammar.github.io/illustrated-transformer/


Transformer Encoders
Motivation
• Recurrence is powerful, but:
  • Issues with learnability: vanishing gradients
  • Issues with remembering long sentences
  • Issues with scalability:
    • backpropagation time is high due to sequentiality in sentence length
    • can't be parallelized even at test time – O(sentence length)

• Remove recurrence: only use attention


“Attention is All You Need”
We focus only on encoder for now… (decoder is an extension of sequence decoders)
Images from https://jalammar.github.io/illustrated-transformer/
Zooming in...
Images from https://jalammar.github.io/illustrated-transformer/
Can you see a fundamental limitation?

Encoders have same architecture but different weights… Zooming in further...


Images from https://jalammar.github.io/illustrated-transformer/
A note on Positional embeddings

Positional embeddings can be extended to any sentence length, but if a test input is longer than all training inputs, we will face issues.
Solution: use a functional form (as in the Transformer paper – sinusoidal encoding).
Images from https://jalammar.github.io/illustrated-transformer/
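A sketch of the sinusoidal functional form used in the Transformer paper; since it is a fixed function of position, it extends to inputs longer than anything seen during training.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(...)."""
    pos = np.arange(max_len)[:, None]                 # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]             # even dimensions
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# added to the word embeddings before the first encoder layer
pe = sinusoidal_positional_encoding(max_len=128, d_model=64)
```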
Adding residual connections...

Images from https://jalammar.github.io/illustrated-transformer/


The residual connections help the network train by allowing gradients to flow through the network directly.

The layer normalizations stabilize the network, substantially reducing the necessary training time.

z = LayerNorm(x + z) = γ · (x + z − μ) / σ + β

The pointwise feedforward layer is used to project the attention outputs, potentially giving them a richer representation.

Images from https://jalammar.github.io/illustrated-transformer/
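A minimal NumPy sketch of the sublayer pattern described above: residual connection, LayerNorm with learned γ and β, and the pointwise feed-forward layer applied independently at every position. All weights here are random placeholders.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-6):
    """Normalize each position's vector: gamma * (x - mu) / sigma + beta."""
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return gamma * (x - mu) / (sigma + eps) + beta

def feed_forward(x, W1, b1, W2, b2):
    """Pointwise FFN: the same two-layer MLP applied to every position independently."""
    return np.maximum(0, x @ W1.T + b1) @ W2.T + b2

def encoder_sublayers(x, z_attn, params):
    """x: layer input (T, d_model); z_attn: self-attention output for the same positions."""
    g1, be1, g2, be2, W1, bf1, W2, bf2 = params
    z = layer_norm(x + z_attn, g1, be1)                               # residual + LayerNorm
    return layer_norm(z + feed_forward(z, W1, bf1, W2, bf2), g2, be2) # FFN + residual + LayerNorm

T, d_model, d_ff = 6, 32, 64
rng = np.random.default_rng(4)
x, z_attn = rng.normal(size=(2, T, d_model))
params = (np.ones(d_model), np.zeros(d_model), np.ones(d_model), np.zeros(d_model),
          rng.normal(size=(d_ff, d_model)), np.zeros(d_ff),
          rng.normal(size=(d_model, d_ff)), np.zeros(d_model))
out = encoder_sublayers(x, z_attn, params)
```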


Regularization
Residual dropout: dropout added to the output of each sublayer, before it is added to the input of the sublayer and normalized.
Label smoothing: label smoothing was employed during training. This hurts perplexity, as the model learns to be more unsure, but improves accuracy and BLEU score. (skip for now)

Images from https://jalammar.github.io/illustrated-transformer/
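An illustrative sketch of both regularizers, with assumed rates (p = 0.1, ε = 0.1): dropout applied to the sublayer output before the residual addition and normalization, and a label-smoothed cross-entropy loss.

```python
import numpy as np

rng = np.random.default_rng(5)

def residual_dropout(x, sublayer_out, gamma, beta, p=0.1, train=True):
    """Dropout on the sublayer output, then add to the input and layer-normalize."""
    if train:
        mask = rng.random(sublayer_out.shape) > p
        sublayer_out = sublayer_out * mask / (1.0 - p)     # inverted dropout
    z = x + sublayer_out
    mu, sigma = z.mean(-1, keepdims=True), z.std(-1, keepdims=True)
    return gamma * (z - mu) / (sigma + 1e-6) + beta

def label_smoothed_nll(log_probs, target, eps=0.1):
    """Cross-entropy against a smoothed target: 1-eps on the gold class, eps spread elsewhere."""
    V = log_probs.shape[-1]
    smooth = np.full(V, eps / (V - 1))
    smooth[target] = 1.0 - eps
    return -(smooth * log_probs).sum()
```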


Images from https://jalammar.github.io/illustrated-transformer/
Zooming in...
Images from https://jalammar.github.io/illustrated-transformer/
Zooming in...
Images from https://jalammar.github.io/illustrated-transformer/
Use of [CLS] for Text Classification

MLP+Softmax
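A sketch of this classification setup, assuming the [CLS] token sits at position 0 of the encoder output; the MLP weights and dimensions are placeholders.

```python
import numpy as np

def classify_from_cls(encoder_out, W1, b1, W2, b2):
    """encoder_out: (T, d_model); position 0 is assumed to hold [CLS]."""
    cls = encoder_out[0]                               # representation of the [CLS] token
    hidden = np.tanh(cls @ W1.T + b1)                  # MLP
    logits = hidden @ W2.T + b2
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()                         # softmax over the classes

T, d_model, n_classes = 8, 32, 3
rng = np.random.default_rng(6)
enc = rng.normal(size=(T, d_model))
W1, b1 = rng.normal(size=(d_model, d_model)), np.zeros(d_model)
W2, b2 = rng.normal(size=(n_classes, d_model)), np.zeros(n_classes)
p = classify_from_cls(enc, W1, b1, W2, b2)             # class probabilities
```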

Transformer
Pros
● Current state-of-the-art.

● Enables deep architectures

● Easier learning of long-range dependencies

● Can be efficiently parallelized

● Doesn't suffer from vanishing gradients


Cons
Huge number of parameters, so:
● Very data hungry
● Takes a long time to train
● Memory inefficient
Other issues
● Keeping sentence length limited
● How to ensure that the multiple attention heads have diverse perspectives?
