Chapter 10: Advanced Deep Learning Techniques for Text
• Transformer Networks
• Transformers as Language Models
• Bidirectional Transformer Encoders
• Transfer Learning through Fine-Tuning
10.1. The Transformer Blocks
Single self-attention layer
• Extract and use information from arbitrarily large contexts without the need to pass it through intermediate recurrent connections
• A single causal self-attention layer:
• Input sequence: (x_1, ..., x_n)
• Output sequence: (y_1, ..., y_n)
• Self-attention: each output y_i is the result of a straightforward computation over the inputs x_1, ..., x_i (the context up to and including position i)
• The computations at each time step are independent of all the other steps and can therefore be performed in parallel
Single self-attention layer
• Transformer self-attention:
• Query: q_i = W^Q x_i
• Key: k_i = W^K x_i
• Value: v_i = W^V x_i
• Attention weights: α_ij = softmax(score(x_i, x_j)) for j ≤ i
• Scaled dot-product: score(x_i, x_j) = (q_i · k_j) / √d_k
• Output: y_i = Σ_{j ≤ i} α_ij v_j
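As an illustration of the equations above, a minimal sketch of a single causal self-attention layer in PyTorch; the function and weight names (W_q, W_k, W_v) are illustrative, not from the slides. All positions are computed in parallel as matrix products, and an upper-triangular mask restricts each y_i to attend only to positions j ≤ i:

```python
import math
import torch

def causal_self_attention(X, W_q, W_k, W_v):
    """Single causal self-attention layer over an input X of shape (n, d_model).

    Each output y_i is a weighted sum of value vectors v_1..v_i, with weights
    given by a softmax over the scaled dot-product scores q_i . k_j / sqrt(d_k).
    All positions are computed in parallel via matrix products.
    """
    Q = X @ W_q                                 # queries, shape (n, d_k)
    K = X @ W_k                                 # keys,    shape (n, d_k)
    V = X @ W_v                                 # values,  shape (n, d_v)
    d_k = Q.size(-1)

    scores = Q @ K.T / math.sqrt(d_k)           # (n, n) matrix of q_i . k_j
    n = scores.size(0)
    causal_mask = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(causal_mask, float("-inf"))  # block j > i

    alpha = torch.softmax(scores, dim=-1)       # attention weights alpha_ij
    return alpha @ V                            # y_i = sum_j alpha_ij v_j

# toy usage with random weights (illustrative only)
n, d_model, d_k = 5, 16, 8
X = torch.randn(n, d_model)
W_q, W_k, W_v = (torch.randn(d_model, d_k) for _ in range(3))
Y = causal_self_attention(X, W_q, W_k, W_v)     # shape (5, 8)
```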
Single self-attention layer
Multihead self-attention layer
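A minimal sketch of a multi-head causal self-attention layer in PyTorch, assuming d_model is divisible by the number of heads; the class name and dimensions are illustrative, not from the slides. Each head applies its own Q/K/V projections, and the per-head outputs are concatenated and mapped back to d_model by an output projection W^O:

```python
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """Sketch of a multi-head causal self-attention layer."""
    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0
        self.h = num_heads
        self.d_head = d_model // num_heads
        self.W_q = nn.Linear(d_model, d_model, bias=False)
        self.W_k = nn.Linear(d_model, d_model, bias=False)
        self.W_v = nn.Linear(d_model, d_model, bias=False)
        self.W_o = nn.Linear(d_model, d_model, bias=False)   # output projection

    def forward(self, x):                        # x: (n, d_model)
        n, d_model = x.shape

        # split each projection into h heads of size d_head: (h, n, d_head)
        def split(t):
            return t.view(n, self.h, self.d_head).transpose(0, 1)

        Q, K, V = split(self.W_q(x)), split(self.W_k(x)), split(self.W_v(x))

        scores = Q @ K.transpose(-2, -1) / self.d_head ** 0.5   # (h, n, n)
        mask = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(mask, float("-inf"))        # causal mask
        alpha = torch.softmax(scores, dim=-1)

        out = (alpha @ V).transpose(0, 1).reshape(n, d_model)   # concat heads
        return self.W_o(out)

# toy usage
layer = MultiHeadSelfAttention(d_model=16, num_heads=4)
y = layer(torch.randn(5, 16))                                   # (5, 16)
```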
Layer Normalization (layer norm)
• Given an input vector x with mean μ and standard deviation σ over its d_h components:
• Normalized vector: x̂ = (x − μ) / σ
• Output: LayerNorm(x) = γ x̂ + β = γ (x − μ) / σ + β
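A minimal sketch of the layer-norm formula above, checked against PyTorch's built-in implementation; the epsilon value and vector size are illustrative:

```python
import torch
import torch.nn.functional as F

def layer_norm(x, gamma, beta, eps=1e-5):
    """LayerNorm(x) = gamma * (x - mu) / sigma + beta, computed over the
    features of a single vector x of shape (d_h,)."""
    mu = x.mean()
    var = x.var(unbiased=False)                 # population variance
    x_hat = (x - mu) / torch.sqrt(var + eps)    # normalized vector
    return gamma * x_hat + beta

x = torch.randn(8)
gamma, beta = torch.ones(8), torch.zeros(8)     # learnable gain and bias
out = layer_norm(x, gamma, beta)

# agrees with PyTorch's built-in layer norm
ref = F.layer_norm(x, (8,), weight=gamma, bias=beta)
assert torch.allclose(out, ref, atol=1e-5)
```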
Positional embedding
• To model the position of each token in the input sequence (the word order)
• Absolute position (index) representation
• Sinusoidal position representation (see the sketch after this list)
• Learned absolute position representations
• Relative linear position attention [Shaw et al., 2018]
• Dependency syntax-based position
• etc.
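A minimal sketch of the sinusoidal position representation (Vaswani et al., 2017), assuming an even d_model; the resulting (max_len, d_model) matrix is added to the token embeddings so that each position gets a distinct, fixed encoding:

```python
import numpy as np

def sinusoidal_position_embeddings(max_len, d_model):
    """PE[pos, 2i]   = sin(pos / 10000^(2i/d_model))
       PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))
    Assumes d_model is even."""
    positions = np.arange(max_len)[:, None]                 # (max_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]                # (1, d_model/2)
    angles = positions / np.power(10000.0, dims / d_model)  # (max_len, d_model/2)

    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)    # even dimensions
    pe[:, 1::2] = np.cos(angles)    # odd dimensions
    return pe

pe = sinusoidal_position_embeddings(max_len=128, d_model=16)  # (128, 16)
```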
10.1.1. Transformers as Language Models
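A causal transformer can be trained as a language model by predicting the next token at every position. A minimal sketch of the teacher-forcing cross-entropy loss, assuming a hypothetical `model` that maps a batch of token ids to per-position vocabulary logits:

```python
import torch
import torch.nn.functional as F

# Assume `model` is any causal transformer mapping a (batch, seq_len) tensor of
# token ids to (batch, seq_len, vocab_size) logits -- a hypothetical stand-in here.
def next_token_lm_loss(model, token_ids):
    """Teacher-forcing language-model loss: at each position t the model
    predicts token t+1, and the per-position cross-entropies are averaged."""
    inputs = token_ids[:, :-1]                   # x_1 ... x_{n-1}
    targets = token_ids[:, 1:]                   # x_2 ... x_n (shifted by one)
    logits = model(inputs)                       # (batch, n-1, vocab_size)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),     # flatten positions
        targets.reshape(-1),
    )
```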
10.1.2. Bidirectional Transformer Encoders
• Value: v_i = W^V x_i
• Attention weights: α_ij = softmax(score(x_i, x_j)) for j = 1, ..., n
• Scaled dot-product: score(x_i, x_j) = (q_i · k_j) / √d_k
• The matrix Q·Kᵀ contains the complete set of q_i · k_j comparisons; with no masking, each position attends to the whole input => bidirectional context
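Compared with the causal layer, the only change is dropping the mask; a minimal sketch (function and weight names are illustrative):

```python
import math
import torch

def bidirectional_self_attention(X, W_q, W_k, W_v):
    """Self-attention without a causal mask: every position attends to the
    full input, so alpha_ij is defined for all j = 1..n (bidirectional context)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / math.sqrt(Q.size(-1))   # full (n, n) matrix of q_i . k_j
    alpha = torch.softmax(scores, dim=-1)      # no masking of j > i
    return alpha @ V
```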
Bidirectional Transformer Encoders training
• Transfer learning: acquiring knowledge from one task or domain, and then applying it (transferring it) to solve a new task
• Fine-tuning:
• Pretrained language models contain rich representations of word meaning => they can be leveraged in other downstream applications through fine-tuning
• Fine-tuning process (sketched below):
• Add application-specific parameters on top of the pretrained model
• Use labeled data from the application to train these additional application-specific parameters
• Freeze the pretrained language model parameters, or make only minimal adjustments to them
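A hedged PyTorch sketch of this fine-tuning recipe, where `pretrained_encoder` is a hypothetical stand-in for any pretrained language model producing one d_h-dimensional vector per input sequence; only the new head's parameters are trained when the encoder is frozen:

```python
import torch
import torch.nn as nn

# `pretrained_encoder` is a hypothetical stand-in for a pretrained language
# model mapping a batch of inputs to (batch, d_h) sequence representations.
def build_finetuning_setup(pretrained_encoder, d_h, num_classes, freeze=True):
    # 1) add application-specific parameters on top of the pretrained model
    head = nn.Linear(d_h, num_classes)

    # 2) optionally freeze the pretrained parameters (or fine-tune them
    #    with a small learning rate instead)
    if freeze:
        for p in pretrained_encoder.parameters():
            p.requires_grad = False

    model = nn.Sequential(pretrained_encoder, head)

    # 3) train only the parameters that still require gradients
    trainable = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.AdamW(trainable, lr=1e-4)
    return model, optimizer
```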
10.2. Transfer Learning to Downstream Tasks
https://jalammar.github.io/illustrated-bert/
Encoder architecture: BERT Fine-Tuning
• Sentiment classification
• Fine-tuning trains a set of classification weights W_C ∈ ℝ^(|C| × d_h) using supervised training data; the predicted class distribution is y = softmax(W_C z_CLS), where z_CLS is the encoder output for the [CLS] token
• Updates can be limited to the final few layers of the transformer
Encoder architecture: BERT Fine-Tuning
• Span-oriented approach
• Applications: named entity recognition, question answering, syntactic parsing, semantic role labeling, and co-reference resolution
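A hedged sketch of a span-oriented head in the style of BERT question-answering fine-tuning: two new parameter vectors score every token position as a possible span start or span end (class name and sizes are illustrative):

```python
import torch
import torch.nn as nn

class SpanClassificationHead(nn.Module):
    """Span-oriented head: each encoder output vector is scored as a possible
    span start and as a possible span end."""
    def __init__(self, d_h: int):
        super().__init__()
        self.start_scorer = nn.Linear(d_h, 1)   # learned start vector
        self.end_scorer = nn.Linear(d_h, 1)     # learned end vector

    def forward(self, token_vectors):           # (n, d_h) encoder outputs
        start_logits = self.start_scorer(token_vectors).squeeze(-1)  # (n,)
        end_logits = self.end_scorer(token_vectors).squeeze(-1)      # (n,)
        # probability of each position being the span start / end
        return torch.softmax(start_logits, -1), torch.softmax(end_logits, -1)

head = SpanClassificationHead(d_h=16)
p_start, p_end = head(torch.randn(10, 16))      # scores for a 10-token input
```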
Encoder architecture: BERT Fine-Tuning
Encoder-Decoder architecture: pretrained model T5
• Google's T5 model
• Span corruption as the pretraining objective (illustrated below)
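A hedged illustration of span corruption: contiguous spans of the input are replaced by sentinel tokens, and the decoder target reconstructs only the dropped spans. The sentence and sentinel spellings below are illustrative (T5 itself uses special <extra_id_*> sentinel tokens):

```python
# Original training sentence (illustrative).
original = "Thank you for inviting me to your party last week ."

# Two spans ("for inviting" and "last") are dropped and replaced by sentinels.
corrupted_input = "Thank you <X> me to your party <Y> week ."

# The decoder target reproduces only the dropped spans, each preceded by its
# sentinel, with a final sentinel marking the end of the sequence.
target = "<X> for inviting <Y> last <Z>"
```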
Decoder architecture: GPT model
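Decoder-only models such as GPT generate text autoregressively. A hedged sketch of greedy decoding, assuming a hypothetical `model` mapping a (1, seq_len) tensor of token ids to per-position vocabulary logits:

```python
import torch

# `model` is a hypothetical decoder-only (GPT-style) network mapping a
# (1, seq_len) tensor of token ids to (1, seq_len, vocab_size) logits.
@torch.no_grad()
def greedy_generate(model, prompt_ids, max_new_tokens=20):
    """Autoregressive greedy decoding: repeatedly feed the sequence back in
    and append the most probable next token."""
    ids = prompt_ids.clone()                                  # (1, prompt_len)
    for _ in range(max_new_tokens):
        logits = model(ids)                                   # (1, len, vocab)
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # (1, 1)
        ids = torch.cat([ids, next_id], dim=1)                # append token
    return ids
```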