
Attention & Transformers

Mausam
IIT Delhi
(some figures taken from Jay Alammar’s blog)
Attention
Sentence Representation
• Encoding the sentence as a single vector is too restrictive.
Instead of producing a single vector for the sentence,
produce one vector for each word.
• But we eventually need a single vector.
Multiple vectors → Single vector
Sum/Avg operators give equal importance to each input
• We dynamically decide which input is more/less important for a task.
• Create a weighted sum to reflect this variation: Attention

"You can't cram the meaning of the whole *%#@ing sentence in a single *%#@ing vector."

• query (q): decides the importance of each input
• attention weights (αᵢ): normalized importance of each input
• unnormalized attention weights (ᾱᵢ): intermediate step to compute αᵢ
• attended summary: weighted average of the inputs with the α weights
LSTM Encoder

[Figure: bi-LSTM encoder producing a single context vector c]
Multiple Encoded Vectors → Single Summary

h₁:T = biLSTM(x₁:T)
c = attend(h₁:T, q)

Need to convert the hᵢ's to c:

c = Σᵢ₌₁ᵀ αᵢ · hᵢ

[Figure: h₁ … h₅ combined into c]
Multiple Encoded Vectors → Single Summary

c = Σᵢ₌₁ᵀ αᵢ · hᵢ
α₁:T = softmax(ᾱ₁, ᾱ₂, …, ᾱ_T)

[Figure: weights α₁…α₅ applied to h₁…h₅ to form c]
Multiple Encoded Vectors → Single Summary

c = Σᵢ₌₁ᵀ αᵢ · hᵢ
α₁:T = softmax(ᾱ₁, ᾱ₂, …, ᾱ_T)
ᾱᵢ = φ^att(q, hᵢ)

[Figure: weights α₁…α₅ applied to h₁…h₅ to form c]
Attention: Encoding

c = Σᵢ₌₁ᵀ αᵢ · hᵢ
h₁:T = biLSTM_enc(x₁:T)
α = softmax(ᾱ₁, …, ᾱ_T)
ᾱᵢ = φ^att(q, hᵢ)

what is φ^att? what is q?
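The attend operation above can be sketched in a few lines of NumPy. This is a minimal illustration, assuming H holds precomputed biLSTM states and leaving the scoring function φ as a parameter (the choices for φ^att come on the next slide); all names are illustrative.

```python
import numpy as np

def attend(H, q, phi):
    """Attended summary c = sum_i alpha_i * h_i.

    H   : (T, d) matrix of encoder states h_1..h_T
    q   : (d,) query vector
    phi : scoring function phi(q, h_i) -> scalar (unnormalized weight)
    """
    raw = np.array([phi(q, h) for h in H])   # unnormalized weights, one per h_i
    alpha = np.exp(raw - raw.max())          # softmax (numerically stabilized)
    alpha = alpha / alpha.sum()
    c = alpha @ H                            # weighted sum of the h_i
    return c, alpha

# usage with a placeholder dot-product score (the options for phi^att are discussed next)
T, d = 5, 8
H = np.random.randn(T, d)                    # stand-in for biLSTM outputs h_1:T
q = np.random.randn(d)
c, alpha = attend(H, q, lambda q, h: q @ h)
```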


Attention Functions φ^att

• Bahdanau Attention: φ^att(q, h) = u · g(Wq + W′h + b)

• Luong Attention: φ^att(q, h) = q · h

• Scaled Dot Product Attention: φ^att(q, h) = (q · h) / √d

• Bilinear Attention: φ^att(q, h) = hWq
Additive vs Multiplicative

d is the dimensionality of q and h.

Paper's justification (for scaled dot-product attention):
To illustrate why the dot products get large, assume that the components of q and h are independent random variables with mean 0 and variance 1. Then their dot product q · h has mean 0 and variance d.
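A NumPy sketch of the four scoring functions above, with illustrative parameters W, W′, u, b, and a bilinear matrix (tanh stands in for the nonlinearity g); the last lines check the paper's variance argument numerically.

```python
import numpy as np

d = 64
rng = np.random.default_rng(0)
W, Wp = rng.normal(size=(d, d)), rng.normal(size=(d, d))   # illustrative parameters
u, b = rng.normal(size=d), rng.normal(size=d)
Wb = rng.normal(size=(d, d))

def bahdanau(q, h):            # additive: u . g(Wq + W'h + b), with g = tanh here
    return u @ np.tanh(W @ q + Wp @ h + b)

def luong(q, h):               # multiplicative (dot product)
    return q @ h

def scaled_dot(q, h):          # dot product scaled by sqrt(d)
    return (q @ h) / np.sqrt(d)

def bilinear(q, h):            # h W q
    return h @ Wb @ q

# numerical check: q . h has variance ~ d when components are independent N(0, 1)
qs, hs = rng.normal(size=(10000, d)), rng.normal(size=(10000, d))
dots = (qs * hs).sum(axis=1)
print(dots.var(), (dots / np.sqrt(d)).var())   # ~ d and ~ 1 respectively
```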
Attention: Encoding

c = Σᵢ₌₁ᵀ αᵢ · hᵢ
h₁:T = biLSTM_enc(x₁:T)
α = softmax(ᾱ₁, …, ᾱ_T)
ᾱᵢ = φ^att(q, hᵢ)

what is q?
Attention and/vs Interpretation
Multi-head Key-Value Self Attention
Self-attention (single-head, high-level)
"The animal didn't cross the street because it was too tired"

• There is no external query q. The input is also the query.

• Many approaches: https://ruder.io/deep-learning-nlp-best-practices/

• Transformers: the query q is another xⱼ: φ^att(xⱼ, xᵢ)
Attention: Encoding (h → x)

c = Σᵢ₌₁ᵀ αᵢ · xᵢ
α = softmax(ᾱ₁, …, ᾱ_T)
ᾱᵢ = φ^att(q, xᵢ)
Attention: Encoding

c = Σᵢ₌₁ᵀ αᵢ · xᵢ
α = softmax(ᾱ₁, …, ᾱ_T)
ᾱᵢ = φ^att(q, xᵢ)

Each vector x plays two roles: (1) computing importance, (2) the weighted sum.
Key-Value Attention
• Project an input vector xᵢ into two vectors
  k: key vector kᵢ = W^K xᵢ
  v: value vector vᵢ = W^V xᵢ

• Use the key vector for computing attention
  φ^att(q, xᵢ) = φ^att(q, kᵢ) = (kᵢ · q) / √d   // scaled multiplicative

• Use the value vector for computing the attended summary
  c = Σᵢ₌₁ᵀ αᵢ · vᵢ
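A sketch of key-value attention with an external query q, following the slide: keys score the inputs, values form the summary. The projection matrices W_K and W_V are randomly initialized here purely for illustration.

```python
import numpy as np

def kv_attention(X, q, WK, WV):
    """X: (T, d_x) input vectors x_1..x_T; q: (d_k,) external query."""
    K = X @ WK.T                         # keys   k_i = W_K x_i
    V = X @ WV.T                         # values v_i = W_V x_i
    d = K.shape[-1]
    raw = K @ q / np.sqrt(d)             # scaled multiplicative scores k_i . q / sqrt(d)
    alpha = np.exp(raw - raw.max()); alpha /= alpha.sum()
    return alpha @ V                     # c = sum_i alpha_i v_i

T, d_x, d_k = 5, 16, 8
rng = np.random.default_rng(1)
X = rng.normal(size=(T, d_x))
q = rng.normal(size=d_k)
WK, WV = rng.normal(size=(d_k, d_x)), rng.normal(size=(d_k, d_x))
c = kv_attention(X, q, WK, WV)           # attended summary, shape (d_k,)
```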
Key-Value Single-Head Self Attention
• Project an input vector xᵢ into three vectors
  k: key vector: kᵢ = W^K xᵢ
  v: value vector: vᵢ = W^V xᵢ
  q: query vector: qᵢ = W^Q xᵢ

• Use the key and query vectors for computing the attention of the iᵗʰ word at word j
  φ^att(xⱼ, xᵢ) = (kᵢ · qⱼ) / √d   // scaled multiplicative

• Use the value vectors for computing the attended summary
  cⱼ = Σᵢ₌₁ᵀ αᵢ · vᵢ
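For self-attention this is usually computed for all positions at once in matrix form. The sketch below is an illustrative NumPy version of the per-position formula above; row j of the output is the summary cⱼ.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, WQ, WK, WV):
    """Single-head key-value self-attention.
    X: (T, d_x).  Row j of the result is c_j = sum_i alpha_{ji} v_i.
    """
    Q, K, V = X @ WQ.T, X @ WK.T, X @ WV.T      # q_j, k_i, v_i for every position
    d = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d)               # scores[j, i] = q_j . k_i / sqrt(d)
    alpha = softmax(scores, axis=-1)            # normalize over i for each query j
    return alpha @ V                            # (T, d) attended summaries

T, d_x, d = 6, 16, 8
rng = np.random.default_rng(2)
X = rng.normal(size=(T, d_x))
WQ, WK, WV = (rng.normal(size=(d, d_x)) for _ in range(3))
C = self_attention(X, WQ, WK, WV)               # one summary vector per word
```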
Key-Value Single-Head Self Attention
• Creation of query, key and value vectors by multiplying by trained weight matrices

• Separation of value from key and query

• Matrix multiplications are quite efficient and can be done in an aggregated manner
Images from https://jalammar.github.io/illustrated-transformer/
Key-Value Single-Head Self Attention

Images from https://jalammar.github.io/illustrated-transformer/


Key-Value Single-Head Self Attention

Images from https://jalammar.github.io/illustrated-transformer/


Key-Value Multi-Head Self Attention

Images from https://jalammar.github.io/illustrated-transformer/


Multi-Head Attention

Images from https://jalammar.github.io/illustrated-transformer/


Multi-Head Attended Vector → Output

[Figure: one attended vector for each word, one for each attention head]

Images from https://jalammar.github.io/illustrated-transformer/


Key-Value Multi-Head Self Attention (summary)

Images from https://jalammar.github.io/illustrated-transformer/
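A NumPy sketch of the multi-head computation summarized above: each head applies the single-head formula with its own projections, the per-head outputs are concatenated, and a final projection (W^O in the Transformer paper) maps back to the model dimension. Shapes and initialization are illustrative.

```python
import numpy as np

def multi_head_self_attention(X, heads, WO):
    """heads: list of (WQ, WK, WV) triples, one per attention head."""
    outputs = []
    for WQ, WK, WV in heads:
        Q, K, V = X @ WQ.T, X @ WK.T, X @ WV.T
        d = K.shape[-1]
        scores = Q @ K.T / np.sqrt(d)
        alpha = np.exp(scores - scores.max(axis=-1, keepdims=True))
        alpha /= alpha.sum(axis=-1, keepdims=True)
        outputs.append(alpha @ V)                   # (T, d_head) per head
    Z = np.concatenate(outputs, axis=-1)            # concatenate the heads: (T, n_heads*d_head)
    return Z @ WO.T                                 # project back to the model dimension

T, d_model, n_heads = 6, 32, 4
d_head = d_model // n_heads
rng = np.random.default_rng(3)
X = rng.normal(size=(T, d_model))
heads = [tuple(rng.normal(size=(d_head, d_model)) for _ in range(3)) for _ in range(n_heads)]
WO = rng.normal(size=(d_model, n_heads * d_head))
out = multi_head_self_attention(X, heads, WO)       # (T, d_model), one vector per word
```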


Multi-head Self-attention visualisation (Interpretable?!)

Images from https://jalammar.github.io/illustrated-transformer/


Transformer Encoders
Motivation
• Recurrence is powerful, but:
  • Issues with learnability: vanishing gradients
  • Issues with remembering long sentences
  • Issues with scalability:
    • backpropagation time is high due to sequentiality in sentence length
    • can't be parallelized even at test time – O(sentence length)

• Remove recurrence: only use attention


“Attention is All You Need”
We focus only on encoder for now… (decoder is an extension of sequence decoders)
Images from https://jalammar.github.io/illustrated-transformer/
Zooming in...
Images from https://jalammar.github.io/illustrated-transformer/
Can you see a fundamental limitation?

Encoders have same architecture but different weights… Zooming in further...


Images from https://jalammar.github.io/illustrated-transformer/
A note on Positional embeddings

Positional embeddings can be extended to any sentence length, but if a test input is longer than all training inputs, we will face issues.
Solution: use a functional form (as in the Transformer paper – sinusoidal encoding).
Images from https://jalammar.github.io/illustrated-transformer/
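A sketch of the sinusoidal functional form used in the Transformer paper; since it is a fixed function of position, it extends to inputs longer than anything seen during training.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(...)."""
    pos = np.arange(max_len)[:, None]                 # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]             # even dimensions
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# added to the word embeddings before the first encoder layer
pe = sinusoidal_positional_encoding(max_len=128, d_model=64)
```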
Adding residual connections...

Images from https://jalammar.github.io/illustrated-transformer/


The residual connections help the network train by allowing gradients to flow through the network directly.

The layer normalizations stabilize the network, substantially reducing the necessary training time.

z = LayerNorm(x + z) = γ · (x + z − μ) / σ + β

The pointwise feedforward layer is used to project the attention outputs, potentially giving them a richer representation.

Images from https://jalammar.github.io/illustrated-transformer/
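A minimal NumPy sketch of the sublayer pattern described above: residual connection, LayerNorm with learned γ and β, and the pointwise feed-forward layer applied independently at every position. All weights here are random placeholders.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-6):
    """Normalize each position's vector: gamma * (x - mu) / sigma + beta."""
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return gamma * (x - mu) / (sigma + eps) + beta

def feed_forward(x, W1, b1, W2, b2):
    """Pointwise FFN: the same two-layer MLP applied to every position independently."""
    return np.maximum(0, x @ W1.T + b1) @ W2.T + b2

def encoder_sublayers(x, z_attn, params):
    """x: layer input (T, d_model); z_attn: self-attention output for the same positions."""
    g1, be1, g2, be2, W1, bf1, W2, bf2 = params
    z = layer_norm(x + z_attn, g1, be1)                               # residual + LayerNorm
    return layer_norm(z + feed_forward(z, W1, bf1, W2, bf2), g2, be2) # FFN + residual + LayerNorm

T, d_model, d_ff = 6, 32, 64
rng = np.random.default_rng(4)
x, z_attn = rng.normal(size=(2, T, d_model))
params = (np.ones(d_model), np.zeros(d_model), np.ones(d_model), np.zeros(d_model),
          rng.normal(size=(d_ff, d_model)), np.zeros(d_ff),
          rng.normal(size=(d_model, d_ff)), np.zeros(d_model))
out = encoder_sublayers(x, z_attn, params)
```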


Regularization
Residual dropout: dropout added to the output of each sublayer, before it is added to the input of the sublayer and normalized.
Label smoothing: label smoothing was employed during training. This hurts perplexity, as the model learns to be more unsure, but improves accuracy and BLEU score. (skip for now)

Images from https://jalammar.github.io/illustrated-transformer/
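An illustrative sketch of both regularizers, with assumed rates (p = 0.1, ε = 0.1): dropout applied to the sublayer output before the residual addition and normalization, and a label-smoothed cross-entropy loss.

```python
import numpy as np

rng = np.random.default_rng(5)

def residual_dropout(x, sublayer_out, gamma, beta, p=0.1, train=True):
    """Dropout on the sublayer output, then add to the input and layer-normalize."""
    if train:
        mask = rng.random(sublayer_out.shape) > p
        sublayer_out = sublayer_out * mask / (1.0 - p)     # inverted dropout
    z = x + sublayer_out
    mu, sigma = z.mean(-1, keepdims=True), z.std(-1, keepdims=True)
    return gamma * (z - mu) / (sigma + 1e-6) + beta

def label_smoothed_nll(log_probs, target, eps=0.1):
    """Cross-entropy against a smoothed target: 1-eps on the gold class, eps spread elsewhere."""
    V = log_probs.shape[-1]
    smooth = np.full(V, eps / (V - 1))
    smooth[target] = 1.0 - eps
    return -(smooth * log_probs).sum()
```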


Images from https://jalammar.github.io/illustrated-transformer/
Zooming in...
Images from https://jalammar.github.io/illustrated-transformer/
Zooming in...
Images from https://jalammar.github.io/illustrated-transformer/
Use of [CLS] for Text Classification

MLP+Softmax
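A sketch of this classification setup, assuming the [CLS] token sits at position 0 of the encoder output; the MLP weights and dimensions are placeholders.

```python
import numpy as np

def classify_from_cls(encoder_out, W1, b1, W2, b2):
    """encoder_out: (T, d_model); position 0 is assumed to hold [CLS]."""
    cls = encoder_out[0]                               # representation of the [CLS] token
    hidden = np.tanh(cls @ W1.T + b1)                  # MLP
    logits = hidden @ W2.T + b2
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()                         # softmax over the classes

T, d_model, n_classes = 8, 32, 3
rng = np.random.default_rng(6)
enc = rng.normal(size=(T, d_model))
W1, b1 = rng.normal(size=(d_model, d_model)), np.zeros(d_model)
W2, b2 = rng.normal(size=(n_classes, d_model)), np.zeros(n_classes)
p = classify_from_cls(enc, W1, b1, W2, b2)             # class probabilities
```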

Transformer
Pros
● Current state-of-the-art.

● Enables deep architectures

● Easier learning of long-range dependencies

● Can be efficiently parallelized

● Doesn't suffer from vanishing gradients


Cons
Huge number of parameters, so:
● Very data hungry
● Takes a long time to train
● Memory inefficient
Other issues
● Keeping sentence length limited
● How to ensure that the multiple attention heads have diverse perspectives?
