
Natural Language Processing

AC3110E

Chapter 10: Advanced Deep Learning
Techniques for Text

Lecturer: PhD. DO Thi Ngoc Diep


SCHOOL OF ELECTRICAL AND ELECTRONIC ENGINEERING
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY
Outline

• Transformer networks
• Transformers as Language Models
• Bidirectional Transformer Encoders
• Transfer Learning through Fine-Tuning

10.1. The transformer blocks

• The most common architecture for language modeling (as of 2022)
• Non-recurrent networks
• Handle distant information
• More efficient to implement at scale
• Made from stacks of transformer blocks
• Each block: a multilayer network made by combining simple linear layers,
feedforward networks, and self-attention layers

Single self-attention layer

• Extract and use information from arbitrarily large contexts without the need
to pass through intermediate recurrent connections
• A single causal self-attention layer:
• Input sequence: (x_1, ..., x_n)
• Output sequence: (y_1, ..., y_n)
• Self-attention: each output y_i is the result of a straightforward
computation over the inputs
• The computations at each time step are independent of all the other steps and
therefore can be performed in parallel.

• Simple dot-product based self-attention:
• y_i = Σ_{j≤i} α_ij · x_j
• α_ij = softmax(score(x_i, x_j))   for j ≤ i
• score(x_i, x_j) = x_i · x_j
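A minimal NumPy sketch of this simple dot-product causal self-attention (the function name and shapes are illustrative assumptions, not from the slides):

```python
import numpy as np

def simple_causal_self_attention(X):
    """Simple dot-product self-attention over an input matrix X of shape (n, d):
    each output y_i attends only to the inputs x_1 .. x_i."""
    n, d = X.shape
    Y = np.zeros_like(X)
    for i in range(n):
        scores = X[: i + 1] @ X[i]            # score(x_i, x_j) = x_i . x_j for j <= i
        alphas = np.exp(scores - scores.max())
        alphas /= alphas.sum()                # softmax over the positions j <= i
        Y[i] = alphas @ X[: i + 1]            # y_i = sum_j alpha_ij * x_j
    return Y

# Example: 5 tokens with 8-dimensional embeddings
Y = simple_causal_self_attention(np.random.randn(5, 8))
```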
Single self-attention layer

• A single causal self-attention layer:


• Transformers consider 3 different roles for each input embedding x_i:
• Query: as the current focus of attention when being
compared to all of the other preceding inputs
(weight matrix W^Q ∈ ℝ^(d×d_k))
• Key: as a preceding input being compared to the current focus of attention
(weight matrix W^K ∈ ℝ^(d×d_k))
• Value: as a value used to compute the output for the
current focus of attention
(weight matrix W^V ∈ ℝ^(d×d_v))

• Transformer self-attention:
• q_i = x_i W^Q ;  k_i = x_i W^K ;  v_i = x_i W^V
• y_i = Σ_{j≤i} α_ij · v_j
• α_ij = softmax(score(q_i, k_j))   for j ≤ i
• Scaled dot-product:
score(q_i, k_j) = (q_i · k_j) / √d_k

Single self-attention layer

• A single causal self-attention layer:


• Transformer self-attention for an input matrix X of N input tokens:
• X ∈ ℝ^(N×d)
• Q = X W^Q ;  K = X W^K ;  V = X W^V
• Y ∈ ℝ^(N×d) = SelfAttention(Q, K, V) = softmax( Q K^T / √d_k ) V
• “Masked Attention”
• As for language models, we don’t look at the future when predicting a sequence
=> mask out attention to future words
• The N×N matrix Q K^T shows the q_i · k_j values => mask the upper-triangular portion of
the matrix to −∞ (the softmax turns these positions to zero)
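A sketch of this matrix form with random placeholder weights standing in for the learned W^Q, W^K, W^V (here d_v = d, so Y keeps the width of X):

```python
import numpy as np

def softmax(scores):
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

def masked_self_attention(X, W_Q, W_K, W_V):
    """Causal (masked) self-attention: Y = softmax(mask(Q K^T / sqrt(d_k))) V."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                           # N x N matrix of q_i . k_j
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)   # upper triangle: j > i
    scores[future] = -np.inf                                   # masked -> zero after softmax
    return softmax(scores) @ V

N, d, d_k = 6, 16, 8
X = np.random.randn(N, d)
Y = masked_self_attention(X,
                          W_Q=np.random.randn(d, d_k),
                          W_K=np.random.randn(d, d_k),
                          W_V=np.random.randn(d, d))           # Y has shape (N, d)
```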

Multihead self-attention layer

• To capture all of the different kinds of parallel relations among its inputs:
• Multiple sets of self-attention layers, called heads
• Each head learns different aspects of the relationships that exist among inputs
at the same level of abstraction
• Each head i:
• W_i^Q ∈ ℝ^(d×d_k) , W_i^K ∈ ℝ^(d×d_k) , W_i^V ∈ ℝ^(d×d_v)
• head_i = SelfAttention(X W_i^Q, X W_i^K, X W_i^V)
• MultiHeadAttention(X) = (head_1 ⊕ head_2 ⊕ ... ⊕ head_h) W^O ,  with W^O ∈ ℝ^(h·d_v×d)
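Continuing the previous sketch (it reuses masked_self_attention defined above), a hypothetical multi-head layer that concatenates the per-head outputs and projects them back to dimensionality d with W^O:

```python
import numpy as np

def multi_head_attention(X, heads, W_O):
    """heads: a list of (W_Q, W_K, W_V) tuples, one per head.
    Each head output is (N, d_v); concatenation gives (N, h*d_v),
    which W_O in R^(h*d_v x d) projects back to (N, d)."""
    outputs = [masked_self_attention(X, W_Q, W_K, W_V) for W_Q, W_K, W_V in heads]
    return np.concatenate(outputs, axis=-1) @ W_O

h, d, d_k = 4, 16, 4                          # d_v = d_k = d / h is a common choice
heads = [tuple(np.random.randn(d, d_k) for _ in range(3)) for _ in range(h)]
W_O = np.random.randn(h * d_k, d)
Y = multi_head_attention(np.random.randn(6, d), heads, W_O)   # shape (6, d)
```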

Layer Normalization (layer norm)

• To improve training performance


• Hidden values are normalized to zero mean and a standard deviation of one within
each layer.
• Keeping the values of a hidden layer within a fixed range facilitates gradient-based
training
• Input:
• a vector x to normalize, with dimensionality d_h
• Calculate:
• μ = (1/d_h) Σ_{i=1..d_h} x_i ;   σ = √( (1/d_h) Σ_{i=1..d_h} (x_i − μ)² )
• Output:
• LayerNorm(x) = γ x̂ + β = γ (x − μ)/σ + β
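A minimal sketch of this layer normalization; the small eps term is an assumption most implementations add for numerical stability:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize a d_h-dimensional vector x to zero mean and unit standard deviation,
    then rescale with the learned gain gamma and offset beta."""
    mu = x.mean()
    sigma = x.std()
    return gamma * (x - mu) / (sigma + eps) + beta

x = np.random.randn(16)                                   # one hidden vector, d_h = 16
y = layer_norm(x, gamma=np.ones(16), beta=np.zeros(16))   # y.mean() ~ 0, y.std() ~ 1
```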

Positional embedding

• To model the position of each token in the input sequence (the word order)
• Absolute position (index) representation
• Sinusoidal position representation (see the sketch after this list)
• Learned absolute position representations
• Relative linear position attention [Shaw et al., 2018]
• Dependency syntax-based position
• etc.
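A sketch of the sinusoidal scheme from the list above, following the common sin/cos formulation of the original Transformer paper (the 10000 base is that paper's choice, assumed here):

```python
import numpy as np

def sinusoidal_positions(n_positions, d):
    """Sinusoidal position embeddings: even dimensions use sin, odd dimensions use cos,
    with frequencies decreasing geometrically across the embedding dimensions."""
    pos = np.arange(n_positions)[:, None]           # (n_positions, 1)
    dim = np.arange(0, d, 2)[None, :]               # (1, d/2) even dimension indices
    angles = pos / np.power(10000.0, dim / d)
    PE = np.zeros((n_positions, d))
    PE[:, 0::2] = np.sin(angles)
    PE[:, 1::2] = np.cos(angles)
    return PE

PE = sinusoidal_positions(50, 16)                   # added to the token embeddings
```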

10.1.1. Transformers as Language Models

• Given a training corpus of plain text, train the model autoregressively to
predict the next token y_t in a sequence, using the cross-entropy loss
• Each training item can be processed in parallel since the output for each element in
the sequence is computed separately.
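A toy sketch of this loss: given the model's per-position probability distributions over the vocabulary, the cross-entropy loss is the average negative log probability of the actual next tokens (numbers here are made up for illustration):

```python
import numpy as np

def lm_cross_entropy(probs, next_tokens):
    """probs: (T, |V|) softmax outputs, one distribution per position.
    next_tokens: (T,) indices of the gold next token at each position."""
    picked = probs[np.arange(len(next_tokens)), next_tokens]   # p(y_t | y_<t) for each t
    return -np.mean(np.log(picked))

# Toy example: 3 positions, vocabulary of size 4
probs = np.array([[0.7, 0.1, 0.1, 0.1],
                  [0.2, 0.5, 0.2, 0.1],
                  [0.1, 0.1, 0.1, 0.7]])
loss = lm_cross_entropy(probs, np.array([0, 1, 3]))            # ~0.47
```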

10.1.2. Bidirectional Transformer Encoders

• Bidirectional encoders: allow the self-attention mechanism to range over the entire input
=> BERT (Bidirectional Encoder Representations from Transformers)
• In processing each element of the sequence, the model attends to all inputs, both
before and after the current one
• q_i = x_i W^Q ;  k_i = x_i W^K ;  v_i = x_i W^V
• y_i = Σ_{j=1..n} α_ij · v_j
• α_ij = softmax(score(q_i, k_j))   for 1 ≤ j ≤ n
• Scaled dot-product:
score(q_i, k_j) = (q_i · k_j) / √d_k
• The matrix showing the complete set of q_i · k_j comparisons has no
masking => bidirectional context

Bidirectional Transformer Encoders training

• BERT (Bidirectional Encoder Representations from Transformers)


• Masked Language Modeling (MLM)
approach
• Instead of trying to predict the
next word, the model learns to predict
the missing element
• A random sample of tokens from each
training sequence is selected for
learning (15% of the input tokens)
• Once chosen, a token is used in one
of three ways:
• It is replaced with the unique vocabulary token [MASK]. (80%)
• It is replaced with another token from the vocabulary, randomly sampled based on token unigram
probabilities. (10%)
• It is left unchanged. (10%)
• Objective is to predict the original inputs for each of the masked tokens
•  Generate a probability distribution over the vocabulary for each of the missing items
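A sketch of the token-selection recipe described above (15% of tokens selected; of those, 80% become [MASK], 10% a random token, 10% unchanged). The toy vocabulary and the uniform random replacement are simplifications; BERT samples replacements by unigram probability:

```python
import random

def mlm_corrupt(tokens, vocab, select_prob=0.15):
    """Return (corrupted_tokens, target_positions): the model must predict the
    original token at every target position."""
    corrupted, targets = list(tokens), []
    for i in range(len(tokens)):
        if random.random() < select_prob:
            targets.append(i)
            r = random.random()
            if r < 0.8:
                corrupted[i] = "[MASK]"              # 80%: replace with [MASK]
            elif r < 0.9:
                corrupted[i] = random.choice(vocab)  # 10%: replace with a random token
            # remaining 10%: leave the token unchanged
    return corrupted, targets

vocab = ["the", "cat", "sat", "on", "mat", "dog"]
corrupted, targets = mlm_corrupt("the cat sat on the mat".split(), vocab)
```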

• SpanBERT: masking spans of words


• Next Sentence Prediction (NSP)
• RoBERTa: trained for longer on more data, with NSP removed

(Devlin et al., 2019), (Joshi et al., 2020)


10.2. Transfer Learning through Fine-Tuning

• Transfer learning: Acquiring knowledge from one task or domain, and then applying it
(transferring it) to solve a new task
• Fine-tuning :
• Pretrained language models contain rich representations of word meaning => can
be leveraged in other downstream applications through fine-tuning.
• Fine-tuning process:
• Add application-specific parameters on top of pre-trained models
• Use labeled data from the application to train these additional application-specific parameters
• Can freeze or make only minimal adjustments to the pretrained language model parameters

10.2. Transfer Learning to Downstream Tasks

• Neural architecture influences the type of pretraining


• Encoder architecture
• Encoder-Decoder architecture
• Decoder architecture

https://jalammar.github.io/illustrated-bert/
Encoder architecture: BERT Fine-Tuning

• Sentiment classification
• Fine-tune a set of classification weights W_C ∈ ℝ^(K×d) (K = number of classes)
using supervised training data
• Can update only the limited final few layers of the transformer
• Add a new [CLS] token at the start of all input sequences; its output vector is the input to the classifier
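A sketch of the added classification head: a weight matrix W_C applied to the final output vector of the sentence-initial [CLS] token. The encoder outputs here are random placeholders; a real setup would take them from a pretrained BERT:

```python
import numpy as np

def classify_sentence(encoder_outputs, W_C):
    """encoder_outputs: (N, d) final-layer vectors for [CLS], tok_1, ..., tok_{N-1}.
    W_C: (K, d) task-specific weights, the newly trained fine-tuning parameters.
    Returns a softmax distribution over the K classes."""
    logits = W_C @ encoder_outputs[0]        # use the [CLS] output vector only
    logits -= logits.max()
    e = np.exp(logits)
    return e / e.sum()

d, K = 768, 2                                # e.g. BERT-base hidden size, binary sentiment
encoder_outputs = np.random.randn(12, d)     # placeholder for the pretrained encoder outputs
probs = classify_sentence(encoder_outputs, W_C=np.random.randn(K, d))
```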


Encoder architecture: BERT Fine-Tuning

• Part-of-speech tagging, BIO-based named entity recognition


• The final output vector corresponding to each input token is passed to a classifier that
produces a softmax distribution over the possible set of tags
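Analogously, a sketch of per-token tagging: the same kind of classifier is applied to every token's output vector, yielding one softmax distribution over the tag set per position (placeholder encoder outputs and tag-set size again):

```python
import numpy as np

def tag_tokens(encoder_outputs, W_tag):
    """encoder_outputs: (N, d) final-layer vector for each input token.
    W_tag: (T, d) weights for a tag set of size T (e.g. POS tags or BIO labels).
    Returns an (N, T) matrix of per-token tag distributions."""
    logits = encoder_outputs @ W_tag.T                  # (N, T)
    logits -= logits.max(axis=-1, keepdims=True)
    e = np.exp(logits)
    return e / e.sum(axis=-1, keepdims=True)            # row-wise softmax

tags = tag_tokens(np.random.randn(8, 768), W_tag=np.random.randn(17, 768))
predicted = tags.argmax(axis=-1)                        # most likely tag index per token
```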

Encoder architecture: BERT Fine-Tuning

• Span-oriented approach
• Named entity recognition, question answering, syntactic parsing, semantic role
labeling and co-reference resolution.

Encoder architecture: BERT Fine-Tuning

• Finetuning BERT also led to new state-of-the-art results on a broad range of tasks:
• QQP: Quora Question Pairs (detect paraphrase questions)
• QNLI: natural language inference over question answering data
• SST-2: sentiment analysis
• CoLA: Sentence acceptability judgment (detect whether sentences are grammatical)
• STS-B: semantic textual similarity
• MRPC: Paraphrasing/sentence similarity
• RTE: a small natural language inference corpus

Encoder-Decoder architecture: pretrained model T5

• Google model T5
• Span corruption as objective function

• Lots of downstream tasks:

Decoder architecture: GPT model

• Generative Pretrained Transformer (GPT)


• Type of large language model (LLM)
• Based on the transformer architecture, pre-trained on large datasets of unlabelled
text, and able to generate novel human-like content
• Transformer decoder with 12 layers, 117M parameters.
• 768-dimensional hidden states, 3072-dimensional feed-forward hidden layers.
• Byte-pair encoding with 40,000 merges
• Trained on BooksCorpus: over 7000 unique books.
• Contains long spans of contiguous text, for learning long-distance dependencies.
• GPT-2: a larger version (1.5B parameters) of GPT trained on more data
• GPT-3: in-context learning
• 175 billion parameters
• Trained on 300B tokens of text
• GPT-4 (March 2023)
• basis for more task-specific GPT systems, including models fine-tuned for instruction
following (ChatGPT chatbot service)

OpenAI (Radford et al., 2018)


• End of Chapter 10

