
Natural Language Processing

AC3110E

Chapter 10: Advanced Deep Learning
Techniques for Text

Lecturer: PhD. DO Thi Ngoc Diep


SCHOOL OF ELECTRICAL AND ELECTRONIC ENGINEERING
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY
Outline

• Transformer networks
• Transformers as Language Models
• Bidirectional Transformer Encoders
• Transfer Learning through Fine-Tuning

10.1. The transformer blocks

• The most common architecture for language modeling (as of 2022)
• Non-recurrent networks
• Handle distant information
• More efficient to implement at scale
• Made from stacks of transformer blocks
• Each block: a multilayer network made by combining simple linear layers,
feedforward networks, and self-attention layers

Single self-attention layer

• Extract and use information from arbitrarily large contexts without the need
to pass through intermediate recurrent connections
• A single causal self-attention layer:
• Input sequence: (x_1, ..., x_n)
• Output sequence: (y_1, ..., y_n)
• Self-attention: each output y_i is the result of a straightforward
computation over the inputs
• The computations at each time step are independent of all the other steps and
therefore can be performed in parallel.

• Simple dot-product based self-attention:
• y_i = Σ_{j≤i} α_ij · x_j
• α_ij = softmax(score(x_i, x_j))   for j ≤ i
• score(x_i, x_j) = x_i · x_j
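A minimal NumPy sketch of this simple dot-product causal self-attention (the function name and shapes are illustrative assumptions, not from the slides):

```python
import numpy as np

def simple_causal_self_attention(X):
    """Simple dot-product self-attention over an input matrix X of shape (n, d):
    each output y_i attends only to the inputs x_1 .. x_i."""
    n, d = X.shape
    Y = np.zeros_like(X)
    for i in range(n):
        scores = X[: i + 1] @ X[i]            # score(x_i, x_j) = x_i . x_j for j <= i
        alphas = np.exp(scores - scores.max())
        alphas /= alphas.sum()                # softmax over the positions j <= i
        Y[i] = alphas @ X[: i + 1]            # y_i = sum_j alpha_ij * x_j
    return Y

# Example: 5 tokens with 8-dimensional embeddings
Y = simple_causal_self_attention(np.random.randn(5, 8))
```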
Single self-attention layer

• A single causal self-attention layer:


• Transformers consider 3 different roles for each input embedding x_i:
• Query: as the current focus of attention when being
compared to all of the other preceding inputs
(weight matrix W^Q ∈ ℝ^(d×d_k))
• Key: as a preceding input being compared to the current focus of attention
(weight matrix W^K ∈ ℝ^(d×d_k))
• Value: as a value used to compute the output for the
current focus of attention
(weight matrix W^V ∈ ℝ^(d×d_v))

• Transformer self-attention:
• q_i = x_i W^Q ;  k_i = x_i W^K ;  v_i = x_i W^V
• y_i = Σ_{j≤i} α_ij · v_j
• α_ij = softmax(score(q_i, k_j))   for j ≤ i
• Scaled dot-product:
score(q_i, k_j) = (q_i · k_j) / √d_k

Single self-attention layer

• A single causal self-attention layer:


• Transformer self-attention for an input matrix X of N input tokens:
• X ∈ ℝ^(N×d)
• Q = X W^Q ;  K = X W^K ;  V = X W^V
• Y ∈ ℝ^(N×d) = SelfAttention(Q, K, V) = softmax( Q K^T / √d_k ) V
• “Masked Attention”
• As for language models, we don’t look at the future when predicting a sequence
=> mask out attention to future words
• The N×N matrix Q K^T shows the q_i · k_j values => mask the upper-triangular portion of
the matrix to −∞ (the softmax turns these positions to zero)
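A sketch of this matrix form with random placeholder weights standing in for the learned W^Q, W^K, W^V (here d_v = d, so Y keeps the width of X):

```python
import numpy as np

def softmax(scores):
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

def masked_self_attention(X, W_Q, W_K, W_V):
    """Causal (masked) self-attention: Y = softmax(mask(Q K^T / sqrt(d_k))) V."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                           # N x N matrix of q_i . k_j
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)   # upper triangle: j > i
    scores[future] = -np.inf                                   # masked -> zero after softmax
    return softmax(scores) @ V

N, d, d_k = 6, 16, 8
X = np.random.randn(N, d)
Y = masked_self_attention(X,
                          W_Q=np.random.randn(d, d_k),
                          W_K=np.random.randn(d, d_k),
                          W_V=np.random.randn(d, d))           # Y has shape (N, d)
```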

Multihead self-attention layer

• To capture all of the different kinds of parallel relations among its inputs:
• Multiple sets of self-attention layers, called heads
• Each head learns different aspects of the relationships that exist among inputs
at the same level of abstraction
• Each head i:
• W_i^Q ∈ ℝ^(d×d_k) , W_i^K ∈ ℝ^(d×d_k) , W_i^V ∈ ℝ^(d×d_v)
• head_i = SelfAttention(X W_i^Q, X W_i^K, X W_i^V)
• MultiHeadAttention(X) = (head_1 ⊕ head_2 ⊕ ... ⊕ head_h) W^O ,  with W^O ∈ ℝ^(h·d_v×d)
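Continuing the previous sketch (it reuses masked_self_attention defined above), a hypothetical multi-head layer that concatenates the per-head outputs and projects them back to dimensionality d with W^O:

```python
import numpy as np

def multi_head_attention(X, heads, W_O):
    """heads: a list of (W_Q, W_K, W_V) tuples, one per head.
    Each head output is (N, d_v); concatenation gives (N, h*d_v),
    which W_O in R^(h*d_v x d) projects back to (N, d)."""
    outputs = [masked_self_attention(X, W_Q, W_K, W_V) for W_Q, W_K, W_V in heads]
    return np.concatenate(outputs, axis=-1) @ W_O

h, d, d_k = 4, 16, 4                          # d_v = d_k = d / h is a common choice
heads = [tuple(np.random.randn(d, d_k) for _ in range(3)) for _ in range(h)]
W_O = np.random.randn(h * d_k, d)
Y = multi_head_attention(np.random.randn(6, d), heads, W_O)   # shape (6, d)
```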

Layer Normalization (layer norm)

• To improve training performance


• Hidden values are normalized to zero mean and a standard deviation of one within
each layer.
• Keeping the values of a hidden layer within a fixed range facilitates gradient-based
training
• Input:
• a vector x to normalize, with dimensionality d_h
• Calculate:
• μ = (1/d_h) Σ_{i=1..d_h} x_i ;   σ = √( (1/d_h) Σ_{i=1..d_h} (x_i − μ)² )
• Output:
• LayerNorm(x) = γ x̂ + β = γ (x − μ)/σ + β
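A minimal sketch of this layer normalization; the small eps term is an assumption most implementations add for numerical stability:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize a d_h-dimensional vector x to zero mean and unit standard deviation,
    then rescale with the learned gain gamma and offset beta."""
    mu = x.mean()
    sigma = x.std()
    return gamma * (x - mu) / (sigma + eps) + beta

x = np.random.randn(16)                                   # one hidden vector, d_h = 16
y = layer_norm(x, gamma=np.ones(16), beta=np.zeros(16))   # y.mean() ~ 0, y.std() ~ 1
```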

Positional embedding

• To model the position of each token in the input sequence (the word order)
• Absolute position (index) representation
• Sinusoidal position representation (see the sketch after this list)
• Learned absolute position representations
• Relative linear position attention [Shaw et al., 2018]
• Dependency syntax-based position
• etc.
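A sketch of the sinusoidal scheme from the list above, following the common sin/cos formulation of the original Transformer paper (the 10000 base is that paper's choice, assumed here):

```python
import numpy as np

def sinusoidal_positions(n_positions, d):
    """Sinusoidal position embeddings: even dimensions use sin, odd dimensions use cos,
    with frequencies decreasing geometrically across the embedding dimensions."""
    pos = np.arange(n_positions)[:, None]           # (n_positions, 1)
    dim = np.arange(0, d, 2)[None, :]               # (1, d/2) even dimension indices
    angles = pos / np.power(10000.0, dim / d)
    PE = np.zeros((n_positions, d))
    PE[:, 0::2] = np.sin(angles)
    PE[:, 1::2] = np.cos(angles)
    return PE

PE = sinusoidal_positions(50, 16)                   # added to the token embeddings
```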

10.1.1. Transformers as Language Models

• Given a training corpus of plain text, train the model autoregressively to
predict the next token y_t in a sequence, using the cross-entropy loss
• Each training item can be processed in parallel since the output for each element in
the sequence is computed separately.
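A toy sketch of this loss: given the model's per-position probability distributions over the vocabulary, the cross-entropy loss is the average negative log probability of the actual next tokens (numbers here are made up for illustration):

```python
import numpy as np

def lm_cross_entropy(probs, next_tokens):
    """probs: (T, |V|) softmax outputs, one distribution per position.
    next_tokens: (T,) indices of the gold next token at each position."""
    picked = probs[np.arange(len(next_tokens)), next_tokens]   # p(y_t | y_<t) for each t
    return -np.mean(np.log(picked))

# Toy example: 3 positions, vocabulary of size 4
probs = np.array([[0.7, 0.1, 0.1, 0.1],
                  [0.2, 0.5, 0.2, 0.1],
                  [0.1, 0.1, 0.1, 0.7]])
loss = lm_cross_entropy(probs, np.array([0, 1, 3]))            # ~0.47
```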

10.1.2. Bidirectional Transformer Encoders

• Bidirectional encoders: allow the self-attention mechanism to range over the entire input
=> BERT (Bidirectional Encoder Representations from Transformers)
• In processing each element of the sequence, the model attends to all inputs, both
before and after the current one
• q_i = x_i W^Q ;  k_i = x_i W^K ;  v_i = x_i W^V
• y_i = Σ_{j=1..n} α_ij · v_j
• α_ij = softmax(score(q_i, k_j))   for 1 ≤ j ≤ n
• Scaled dot-product:
score(q_i, k_j) = (q_i · k_j) / √d_k
• The matrix showing the complete set of q_i · k_j comparisons has no
masking => bidirectional context

Bidirectional Transformer Encoders training

• BERT (Bidirectional Encoder Representations from Transformers)


• Masked Language Modeling (MLM)
approach
• Instead of trying to predict the
next word, the model learns to predict
the missing element
• A random sample of tokens from each
training sequence is selected for
learning (15% of the input tokens)
• Once chosen, a token is used in one
of three ways:
• It is replaced with the unique vocabulary token [MASK]. (80%)
• It is replaced with another token from the vocabulary, randomly sampled based on token unigram
probabilities. (10%)
• It is left unchanged. (10%)
• Objective is to predict the original inputs for each of the masked tokens
•  Generate a probability distribution over the vocabulary for each of the missing items
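A sketch of the token-selection recipe described above (15% of tokens selected; of those, 80% become [MASK], 10% a random token, 10% unchanged). The toy vocabulary and the uniform random replacement are simplifications; BERT samples replacements by unigram probability:

```python
import random

def mlm_corrupt(tokens, vocab, select_prob=0.15):
    """Return (corrupted_tokens, target_positions): the model must predict the
    original token at every target position."""
    corrupted, targets = list(tokens), []
    for i in range(len(tokens)):
        if random.random() < select_prob:
            targets.append(i)
            r = random.random()
            if r < 0.8:
                corrupted[i] = "[MASK]"              # 80%: replace with [MASK]
            elif r < 0.9:
                corrupted[i] = random.choice(vocab)  # 10%: replace with a random token
            # remaining 10%: leave the token unchanged
    return corrupted, targets

vocab = ["the", "cat", "sat", "on", "mat", "dog"]
corrupted, targets = mlm_corrupt("the cat sat on the mat".split(), vocab)
```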

• SpanBERT: masking spans of words


• Next Sentence Prediction (NSP)
• RoBERTa: trained for longer on more data, with NSP removed

(Devlin et al., 2019), (Joshi et al., 2020)


10.2. Transfer Learning through Fine-Tuning

• Transfer learning: Acquiring knowledge from one task or domain, and then applying it
(transferring it) to solve a new task
• Fine-tuning :
• Pretrained language models contain rich representations of word meaning => can
be leveraged in other downstream applications through fine-tuning.
• Fine-tuning process:
• Add application-specific parameters on top of pre-trained models
• Use labeled data from the application to train these additional application-specific parameters
• Can freeze or make only minimal adjustments to the pretrained language model parameters

10.2. Transfer Learning to Downstream Tasks

• Neural architecture influences the type of pretraining


• Encoder architecture
• Encoder-Decoder architecture
• Decoder architecture

https://jalammar.github.io/illustrated-bert/
Encoder architecture: BERT Fine-Tuning

• Sentiment classification
• Fine-tune a set of classification weights W_C ∈ ℝ^(K×d) (K = number of classes)
using supervised training data
• Can update only the limited final few layers of the transformer
• Add a new [CLS] token at the start of all input sequences; its output vector is the input to the classifier
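A sketch of the added classification head: a weight matrix W_C applied to the final output vector of the sentence-initial [CLS] token. The encoder outputs here are random placeholders; a real setup would take them from a pretrained BERT:

```python
import numpy as np

def classify_sentence(encoder_outputs, W_C):
    """encoder_outputs: (N, d) final-layer vectors for [CLS], tok_1, ..., tok_{N-1}.
    W_C: (K, d) task-specific weights, the newly trained fine-tuning parameters.
    Returns a softmax distribution over the K classes."""
    logits = W_C @ encoder_outputs[0]        # use the [CLS] output vector only
    logits -= logits.max()
    e = np.exp(logits)
    return e / e.sum()

d, K = 768, 2                                # e.g. BERT-base hidden size, binary sentiment
encoder_outputs = np.random.randn(12, d)     # placeholder for the pretrained encoder outputs
probs = classify_sentence(encoder_outputs, W_C=np.random.randn(K, d))
```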


Encoder architecture: BERT Fine-Tuning

• Part-of-speech tagging, BIO-based named entity recognition


• The final output vector corresponding to each input token is passed to a classifier that
produces a softmax distribution over the possible set of tags
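Analogously, a sketch of per-token tagging: the same kind of classifier is applied to every token's output vector, yielding one softmax distribution over the tag set per position (placeholder encoder outputs and tag-set size again):

```python
import numpy as np

def tag_tokens(encoder_outputs, W_tag):
    """encoder_outputs: (N, d) final-layer vector for each input token.
    W_tag: (T, d) weights for a tag set of size T (e.g. POS tags or BIO labels).
    Returns an (N, T) matrix of per-token tag distributions."""
    logits = encoder_outputs @ W_tag.T                  # (N, T)
    logits -= logits.max(axis=-1, keepdims=True)
    e = np.exp(logits)
    return e / e.sum(axis=-1, keepdims=True)            # row-wise softmax

tags = tag_tokens(np.random.randn(8, 768), W_tag=np.random.randn(17, 768))
predicted = tags.argmax(axis=-1)                        # most likely tag index per token
```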

Encoder architecture: BERT Fine-Tuning

• Span-oriented approach
• Named entity recognition, question answering, syntactic parsing, semantic role
labeling and co-reference resolution.

Encoder architecture: BERT Fine-Tuning

• Finetuning BERT also led to new state-of-the-art results on a broad range of tasks:
• QQP: Quora Question Pairs (detect paraphrase questions)
• QNLI: natural language inference over question answering data
• SST-2: sentiment analysis
• CoLA: Sentence acceptability judgment (detect whether sentences are grammatical)
• STS-B: semantic textual similarity
• MRPC: Paraphrasing/sentence similarity
• RTE: a small natural language inference corpus

Encoder-Decoder architecture: pretrained model T5

• Google model T5
• Span corruption as objective function

• Lots of downstream tasks:

Decoder architecture: GPT model

• Generative Pretrained Transformer (GPT)


• Type of large language model (LLM)
• Based on the transformer architecture, pre-trained on large datasets of unlabelled
text, and able to generate novel human-like content
• Transformer decoder with 12 layers, 117M parameters.
• 768-dimensional hidden states, 3072-dimensional feed-forward hidden layers.
• Byte-pair encoding with 40,000 merges
• Trained on BooksCorpus: over 7000 unique books.
• Contains long spans of contiguous text, for learning long-distance dependencies.
• GPT-2: a larger version (1.5B parameters) of GPT trained on more data
• GPT-3: in-context learning
• 175 billion parameters
• Trained on 300B tokens of text
• GPT-4 (March 2023)
• basis for more task-specific GPT systems, including models fine-tuned for instruction
following (ChatGPT chatbot service)

OpenAI (Radford et al., 2018)


• End of Chapter 10

