
Training, Fine Tuning, Inference and Applications of

Language Models

Mrinmaya Sachan, Wangchunshu Zhou, Peng Cui, Shehzaad Dhuliawala,


Yifan Hou, Andreas Opedal, Kumar Shridhar,
Alessandro Stolfo, Vilém Zouhar

May 9, 2023
Contents

1 Introduction
  1.1 Introduction

2 Transfer Learning
  2.1 Transfer Learning
  2.2 ELMo
  2.3 BERT
    2.3.1 BERT: Pre-training Objectives
    2.3.2 BERT Architecture
    2.3.3 Fine-tuning BERT
  2.4 Other Transformer Language Models
    2.4.1 BERT Variants
    2.4.2 GPTs
    2.4.3 Seq2seq TLMs

3 Parameter efficient finetuning
  3.1 Partially Finetuning
  3.2 Adapter Tuning
    3.2.1 [Optional Reading] Example: Cross-Lingual Transfer
    3.2.2 [Optional Reading] LoRA
    3.2.3 Prefix Tuning

4 Prompting and Zero-shot inference
  4.1 What is prompting
    4.1.1 How to prompt?
  4.2 Prompt Engineering
    4.2.1 Manual Prompts
    4.2.2 Automated Prompts
  4.3 Zero- and Few-Shot Inference
    4.3.1 In-Context Learning
    4.3.2 Chain of Thought Prompting

5 Multimodality
  5.1 Vision Language Models
    5.1.1 Vision-and-Language Tasks
    5.1.2 Vision-Language Models: Architectures
    5.1.3 Vision-Language Models: Pre-training Objectives
  5.2 Knowledge-Enhancement
    5.2.1 kNN LM
    5.2.2 Dynamic Gating kNN LM
    5.2.3 KnowBERT
    5.2.4 [Optional] ERNIE

6 Additional Topics
  6.1 Instruction-Based Training Procedures
    6.1.1 Instruction Tuning
    6.1.2 Reinforcement Learning from Human Feedback
  6.2 Scaling laws and Emergent Behavior
    6.2.1 Summary of Scaling Laws
Chapter 1

Introduction

1.1 Introduction
Welcome to the class notes on “Training, Fine Tuning, Inference and Ap-
plications of Language Models” for the Large Language Models class (263-
5354-00L). This part of the class focuses on the practical aspects of im-
plementing large language models, their functionalities and applications.
Many universities are offering similar courses at the moment, e.g., CS324
at Stanford University (https://stanford-cs324.github.io/winter2022/)
and CS 600.471 (https://self-supervised.cs.jhu.edu/sp2023/) at Johns
Hopkins University. Their syllabi may serve as useful references.

Chapter 2

Transfer Learning

2.1 Transfer Learning


Transfer learning is the idea of using knowledge gained from training on one
task in order to solve other tasks. It takes inspiration from the concept of
“transfer of learning” in psychology, which is the phenomenon of a person
being able to apply some skill or piece of knowledge they have already
acquired to a new situation or context. For instance, a person hoping to
learn Swiss German will probably be more efficient in their learning trajectory
if they have already acquired knowledge of other related languages, like say
Dutch or English. Transfer is believed to be an integral part of the human
learning process,¹ thereby begging the question, can machine learning models
benefit from similar transfer-related effects?
The idea of transfer learning in neural networks actually dates all the
way back to the 70s (Bozinovski, 2020). Bozinovski and Fulgosi (1976) first
posed the question of transfer learning as follows: Consider a neural network
f θ that has been trained on a first supervised learning task. Next, it is
provided a second supervised learning task with an accompanying train and
test split of data. The parameters (or memory) θ of the network are then
updated through some learning algorithm on the train set. Now, is it possible
that learning on the first task allows learning on the second task with a
smaller (positive transfer learning) or larger (negative transfer learning) set of
training data, compared to the case of no previous learning? Bozinovski and
¹Transfer of learning benefits is in fact one of the main reasons for the strong focus on mathematics in a typical school curriculum – it strengthens the student in their ability to apply problem solving, reasoning and several other skills more broadly (Hohensee and Lobato, 2021).


Fulgosi (1976) provided a geometrical model, along with empirical evidence,


showing that such was indeed the case.
Years later, when neural networks had again started to gain traction
within the AI community, a seminal work by Pratt (1992) introduced the
discriminability-based transfer (DBT) algorithm. DBT went beyond simply
using the very same network that had been trained on the first task as an
initialized network on the second task, by additionally scaling the parameters
of the network according to how well they fit the training data of the second
task. It was able to achieve the same asymptotic performance as that of
randomly initialized networks with fewer training epochs, across several
different tasks.
The core idea remains more or less the same today. As we will see however,
it is not even necessary to update the parameters of the network in order
to achieve positive effects of transfer learning with large language models.
Indeed, a recent and very effective trend in transfer learning with large
language models is that of demonstrating a new task with a set of examples
in the context window of an already trained language model, which nudges
the language model into generating text according to the specifications of that
new task. This is typically referred to as prompting in the context of NLP
(also, modulo context and some nuances, called “in-context learning” or “few-
shot learning”), and we will discuss it in depth in Chapter 4. Going beyond
Bozinovski and Fulgosi’s (1976) original definition, we want to encapsulate
such forms of transfer learning as well. However, in this course we consider
only the case in which the first learning task is that of language modeling.
We arrive at the following operational definition:

Definition 2.1.1 (Transfer learning for language models). Consider a language model $p_{\text{LM}}(y; \theta)$ over $\Sigma^*$ trained on a corpus $D = \{y^{(n)} \mid y^{(n)} \in \Sigma^*\}_{n=1}^{N}$. Next, consider a target task $T$, posed as learning a function $f\colon \mathcal{L} \mapsto \mathcal{Y}$, for some input space $\mathcal{L} \subseteq \Sigma^*$ and output space $\mathcal{Y}$. We say that transfer learning occurs if parameterizing $f$ as $f_{\hat{\theta}}$, with $\hat{\theta} \subseteq \theta$, allows for more efficient learning of $T$ compared to initializing $f$ as $f_{\theta'}$, with $\theta'$ being some set of parameters that are sampled randomly from some distribution.

Note that T can be any task, as long as it takes a language L as input.


The above definition is intentionally left open in what we mean by “efficient
learning”, and it does not necessarily involve training in the traditional sense
of updating the parameters to minimize some loss function. We typically
measure efficiency in terms of number of training samples and/or number of
training iterations, ceteris paribus.

Now that we have formalized what we mean by transfer learning, we


can introduce some related terminology that will be used in this and later
chapters: The network trained on the source task (in our case language
modeling) is called a pretrained model, and we refer to the learning
process of that model as pretraining. The process of updating the weights
of a pretrained model for a new target task is called fine-tuning.² A
related concept to transfer learning is multi-task learning, which is the
idea of sharing learned information across multiple tasks. In contrast to
transfer learning, the tasks are learned jointly rather than sequentially.³
Language modeling is a particularly suitable task for transfer learning
since it is very easy to scale up language modeling data. We only require
some corpus of text that approximately captures the domain we want to
model. Since the input space of language modeling is the same as the output
space it does not require any expensive labeling — it is a self-supervised
task. In the next sections, we will study a few particular instances and
variations of language modeling-based transfer learning, starting with ELMo.

2.2 ELMo
As a first case study of transfer learning based on language models, we
consider ELMo (Peters et al., 2018). ELMo was one of the first successful
transfer learning models based on language modeling. It leverages the
language modeling task to learn word representations, that is, vectors meant
to represent the meaning of words. Whereas most previous approaches, such
as word2vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014),
trained static word representations, the word representations in ELMo are
context-dependent.⁴ That means that depending on the sentence fed to
ELMo, the representation of a particular word will differ in order to reflect
the meaning of that particular sentence. ELMo considers two separate language models: a (standard) forward language model $p_{\text{LM}}(y_t \mid y_{<t})$ as well as a backward language model $p_{\text{LMB}}(y_t \mid y_{>t})$, which factorizes the sequence probability as $\prod_{t=1}^{T} p_{\text{LMB}}(y_t \mid y_{t+1}, \ldots, y_T)$. Both are implemented by stacking $L$ layers of LSTMs, the parameters of which we refer to as $\overrightarrow{\theta}$ and $\overleftarrow{\theta}$ respectively.

²Note that fine-tuning does not encapsulate all forms of transfer learning, as our definition does not necessitate updating the parameters of the language model.
³It has become common in recent years to evaluate models on multiple tasks in order to test their multi-task learning abilities. See for instance DecaNLP (McCann et al., 2018).
⁴The model of McCann et al. (2017) is another early contextualized word vector model. However, it was pre-trained on machine translation rather than language modeling.

Figure 2.1: Illustration of the ELMo model architecture. Hidden representations from forward and backward language models are concatenated in order to yield representations that consider both left and right context. We can then fine-tune weights for these layer-specific representations for various downstream tasks.

For a given input token $y_t$,


the forward LSTM layers output context-dependent representations $\overrightarrow{h}^{LM}_{t,l}$, with $l \in [0, L]$ (and analogously, $\overleftarrow{h}^{LM}_{t,l}$ for the LSTM layers of the backward language model). The deepest representations $\overrightarrow{h}^{LM}_{t,L}$ and $\overleftarrow{h}^{LM}_{t,L}$ are fed to a softmax layer to predict the forward and backward probabilities of $y_t$. In addition to $\overrightarrow{\theta}$ and $\overleftarrow{\theta}$, the parameters for the token representations and the softmax layer (denoted together as $\theta'$) are tied between the two networks. All parameters are optimized jointly by maximizing the log likelihoods of the forward and backward models:

$$\mathcal{L}_{\text{ELMo}}(\theta) = \sum_{n=1}^{N}\sum_{t=1}^{T}\Big(\log p_{\text{LM}}\big(y_t^{(n)} \mid y_{<t}^{(n)}; \overrightarrow{\theta}, \theta'\big) + \log p_{\text{LMB}}\big(y_t^{(n)} \mid y_{>t}^{(n)}; \overleftarrow{\theta}, \theta'\big)\Big) \qquad (2.1)$$
Now in order to fine-tune for a specific task, we can use the context-specific representations $h^{LM}_{t,l} = [\overrightarrow{h}^{LM}_{t,l}; \overleftarrow{h}^{LM}_{t,l}]$. One could simply take the last-layer representations $h^{LM}_{t,L}$ and use those as input to a separate model that is fine-tuned on another task. The original paper additionally experimented with learning a task-specific representation as a scaled convex combination over the hidden representations, as follows:

$$\mathrm{ELMo}^{task}_t = \gamma^{task} \sum_{l=0}^{L} s^{task}_l\, h^{LM}_{t,l}, \qquad (2.2)$$

where the $s^{task}_l$, with $\sum_{l=0}^{L} s^{task}_l = 1$, are the outputs of a softmax function over the hidden representations, and $\gamma^{task} \in \mathbb{R}$ is meant to scale the representations.

Figure 2.2: Results of ELMo on six benchmark tasks, compared to previous state-of-the-art results. The performance metrics vary across the tasks. SNLI and SST-5 are measured by accuracy. SQuAD, SRL and NER are measured by F1. Coref is measured by average F1.


Now, consider a neural network-based model $f_{\hat{\theta}}$ that is fine-tuned on some downstream task. As input it takes a sequence of representations $(x_1, \ldots, x_T)$. These could be some pre-trained static representations or simple one-hot encodings. To improve the model using contextualized information, we can concatenate these input representations with the ELMo representations, yielding inputs of the form $[x_t; \mathrm{ELMo}^{task}_t]$. This only changes the input dimension of $f_{\hat{\theta}}$; the rest of the network can remain unchanged and be further fine-tuned.
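To make Eq. (2.2) concrete, the following is a minimal PyTorch sketch of the task-specific combination: softmax-normalized scalar weights over the L+1 layer representations plus a global scale. The module name, tensor shapes, and initialization are illustrative assumptions, not the original ELMo implementation.

```python
import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    """Task-specific scaled convex combination of layer representations (Eq. 2.2)."""

    def __init__(self, num_layers: int):
        super().__init__()
        # One unnormalized weight per layer; a softmax turns them into a convex combination.
        self.scalar_weights = nn.Parameter(torch.zeros(num_layers))
        # Global scale gamma^task.
        self.gamma = nn.Parameter(torch.ones(1))

    def forward(self, layer_reps: torch.Tensor) -> torch.Tensor:
        # layer_reps: (num_layers, batch, seq_len, dim), the stacked h^{LM}_{t,l}
        s = torch.softmax(self.scalar_weights, dim=0)        # sum_l s_l = 1
        mixed = torch.einsum("l,lbtd->btd", s, layer_reps)   # sum_l s_l * h_l
        return self.gamma * mixed                            # gamma^task scaling

# Toy usage with random "ELMo" layer outputs (token layer + 2 LSTM layers).
layers = torch.randn(3, 4, 7, 1024)        # (L+1, batch, time, 2*hidden)
elmo_task = ScalarMix(num_layers=3)(layers)
print(elmo_task.shape)                     # torch.Size([4, 7, 1024])
```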
ELMo was indeed somewhat of a breakthrough in NLP transfer learning
research: By leveraging ELMo representations for other downstream tasks
as just described, the original paper was able to beat what was at the time
state-of-the-art performance on six different benchmark datasets. These
benchmarks included the tasks of question answering, natural language infer-
ence, semantic role labeling, coreference resolution, named entity recognition
and sentiment analysis. See Fig. 2.2 for exact numbers.

2.3 BERT
After the success of ELMo, Devlin et al. (2019) pre-trained BERT (Bidirectional Encoder Representations from Transformers), a Transformer-based bidirectional masked language model. BERT is pre-trained with the masked language modeling objective on a large-scale text corpus. After pre-training, BERT can be fine-tuned to perform different NLP tasks without the need to design task-specific architectures. BERT advanced the state of the art on many NLP benchmarks at the time it was proposed.

In this section, we first introduce the pre-training objectives of BERT. We then describe the BERT model architecture and the pre-training and fine-tuning paradigm.

2.3.1 BERT: Pre-training Objectives


BERT is pre-trained with two self-supervised tasks: masked language
modeling and next sentence prediction.

Masked Language Modeling Masked language modeling (MLM) was first proposed by Taylor (1953), who referred to it as a Cloze task. Devlin et al. (2019) adapted this task as a novel pre-training task to overcome the drawback of the standard unidirectional LM. In the masked language modeling setup, the goal is to predict an omitted token from a piece of text such that the prediction constitutes a logical and coherent completion. For example, given the text "The students [MASK] to learn about language models", we should predict want or like with high probability for the [MASK] position. The goal of masked language modeling is thus to approximate the probability distribution over the vocabulary of the original token at a given masked position. Similarly to the standard language modeling objective, we can choose model parameters by optimizing the log-likelihood of a dataset D, albeit in this case the words at a percentage of randomly chosen positions in D are replaced with [MASK] and the model is given the context on both sides of the masked token in order to make its prediction:

$$\mathcal{L}_{\text{MLM}}(\theta) = \sum_{n=1}^{N}\sum_{t=1}^{T} \log p_{\text{MLM}}\big(y_t^{(n)} \mid y_{<t}^{(n)}, y_{>t}^{(n)}; \theta\big)\,\mathbb{1}\{y_t^{(n)} = \texttt{[MASK]}\} \qquad (2.3)$$

However, this pre-training method creates a mismatch between the pre-training phase and the fine-tuning phase because the [MASK] token does not appear during fine-tuning. To mitigate this issue, Devlin et al. (2019) replace a selected token with the special [MASK] token 80% of the time, with a random token 10% of the time, and keep the original token 10% of the time.
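The 80/10/10 corruption rule can be written down in a few lines. The sketch below is a simplified, assumption-laden version of BERT-style masking over a list of token ids (it ignores special tokens and whole-word masking); the mask id, vocabulary size, and 15% selection rate are taken as illustrative values.

```python
import random

MASK_ID = 103          # assumed id of the [MASK] token
VOCAB_SIZE = 30_000    # assumed WordPiece vocabulary size

def mask_tokens(token_ids, mask_prob=0.15, seed=None):
    """Return (corrupted_ids, labels); labels are -100 for positions not predicted."""
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in token_ids:
        if rng.random() < mask_prob:          # position selected for prediction
            labels.append(tok)
            r = rng.random()
            if r < 0.8:                       # 80%: replace with [MASK]
                corrupted.append(MASK_ID)
            elif r < 0.9:                     # 10%: replace with a random token
                corrupted.append(rng.randrange(VOCAB_SIZE))
            else:                             # 10%: keep the original token
                corrupted.append(tok)
        else:
            labels.append(-100)               # ignored by the MLM loss
            corrupted.append(tok)
    return corrupted, labels

ids = [1996, 2493, 2215, 2000, 4553, 2055, 2653, 4275]
print(mask_tokens(ids, seed=0))
```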

Next Sentence Prediction The next sentence prediction objective is


included to enable the model to capture the relationships between two
consecutive sentences. This is important because many NLP tasks require
understanding the relationship between different text inputs (e.g., the question and its context in the question answering task).

Figure 2.3: Illustration of the BERT model architecture during pre-training (left) and fine-tuning (right).

By pre-training the model to
predict whether a given sentence follows another sentence, we can help the
model learn the dependencies and relationships between sentences.
The next sentence prediction objective is implemented as follows. Given
a pair of sentences, denoted as A and B, the model is trained to predict
whether B is the next sentence that follows A. This is done by feeding the
two sentences as input to the model, along with a special [CLS] token that
serves as the input representation of the entire sequence. The model then
generates a binary output that indicates whether B is the next sentence.
BERT is pre-trained with a combination of the masked language modeling objective and the next sentence prediction objective on English Wikipedia and the BookCorpus (Zhu et al., 2015) dataset, a collection of over 11,000 books from various genres. The resulting corpus contains roughly 16 GB of text with over 3.3 billion words (approximately 2.5 billion from Wikipedia and 800 million from BookCorpus). BERT is pre-trained with a batch size of 256 sequences for 1M steps.

2.3.2 BERT Architecture


As illustrated in Figure 2.3, BERT's model architecture is a multi-layer bidirectional Transformer encoder. It takes a sequence of text tokens as input and produces their contextualized representations and (optionally) predictions. We denote the number of layers (i.e., Transformer blocks) as L, the hidden size as H, and the number of self-attention heads as A.⁵ Devlin et al. (2019) pre-trained two model sizes: BERTBASE (L=12, H=768, A=12, Total Parameters=110M) and BERTLARGE (L=24, H=1024, A=16, Total Parameters=340M).

⁵In all cases the feed-forward/filter size is set to 4H, i.e., 3072 for H = 768 and 4096 for H = 1024.

Figure 2.4: BERT input representation. The input embeddings are the sum of the token embeddings, the segment embeddings and the position embeddings.

Input/Output Representations To make BERT handle a variety of downstream tasks, the BERT input representation is able to unambiguously represent both a single sentence and a pair of sentences (e.g., ⟨Question, Answer⟩) in one token sequence. BERT uses WordPiece embeddings (Wu et al., 2016) with a 30,000-token vocabulary. The first token of every sequence is always a special classification token ([CLS]). The final hidden state corresponding to this token is used as the aggregate sequence representation for classification tasks. Sentence pairs are packed together into a single sequence. We differentiate the sentences in two ways. First, we separate them with a special token ([SEP]). Second, we add a learned embedding to every token indicating whether it belongs to sentence A or sentence B. For a given token, its input representation is constructed by summing the corresponding token, segment, and position embeddings. A visualization of this construction can be seen in Figure 2.4.
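The input construction of Figure 2.4 is just an element-wise sum of three embedding tables. The sketch below assumes BERT-BASE-like sizes (hidden size 768, 30,000-token vocabulary, 512 positions, 2 segments) and omits the layer normalization and dropout that BERT applies on top of the sum.

```python
import torch
import torch.nn as nn

class BertInputEmbeddings(nn.Module):
    def __init__(self, vocab_size=30_000, hidden=768, max_pos=512, n_segments=2):
        super().__init__()
        self.token = nn.Embedding(vocab_size, hidden)
        self.segment = nn.Embedding(n_segments, hidden)
        self.position = nn.Embedding(max_pos, hidden)

    def forward(self, token_ids, segment_ids):
        # token_ids, segment_ids: (batch, seq_len)
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return (self.token(token_ids)
                + self.segment(segment_ids)
                + self.position(positions))   # broadcast over the batch dimension

emb = BertInputEmbeddings()
token_ids = torch.randint(0, 30_000, (2, 11))      # e.g. "[CLS] my dog ... [SEP]"
segment_ids = torch.tensor([[0]*6 + [1]*5] * 2)     # sentence A vs. sentence B
print(emb(token_ids, segment_ids).shape)            # torch.Size([2, 11, 768])
```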

2.3.3 Fine-tuning BERT


To apply pre-trained BERT models to different downstream tasks, we can simply plug the task-specific inputs and outputs into BERT and fine-tune all the parameters end-to-end. This pre-training and fine-tuning paradigm alleviates the need to design or include task-specific architectures, as is required when using ELMo.
As illustrated in Figure 2.5, at the input, sentence A and sentence B from pre-training are analogous to (1) sentence pairs in paraphrasing, (2) hypothesis-premise pairs in entailment, (3) question-passage pairs in question answering, and (4) a degenerate text-∅ pair in text classification or sequence tagging.

Figure 2.5: Illustration of the BERT fine-tuning procedure for different NLP tasks.

At the output, the token representations are fed into an output layer
for token-level tasks, such as sequence tagging or question answering, and
the [CLS] representation is fed into an output layer for classification, such as
entailment or sentiment analysis. As shown in Figure 2.6, BERT substantially
outperforms both ELMo and OpenAI GPT, as well as all previous state-
of-the-art on the GLUE benchmark. This confirms the effectiveness of
pre-training a bi-directional Transformer encoder using the masked language
modeling objective.

Figure 2.6: Experimental results of fine-tuning BERT on the GLUE benchmark.

2.4 Other Transformer Language Models


Inspired by the success of BERT, a number of Transformer Language Models (TLMs) have been pre-trained. We categorize these TLMs into three types
according to their architecture designs: (i) BERT variants, which are Trans-
former encoder models, (ii) GPTs, which are Transformer decoder models,
and (iii) Seq2Seq TLMs, which are Transformer encoder-decoder models.

2.4.1 BERT Variants


After the success of BERT, the community made a lot of effort to improve the performance of BERT-like Transformer encoders. We describe a few representative works as follows:

RoBERTa RoBERTa (Robustly Optimized BERT Approach) is an optimized version of BERT developed by Liu et al. (2019). RoBERTa uses the same model architecture as BERT but improves the pre-training process: it uses a larger pre-training corpus, a larger batch size, longer training, longer input sequences, and a dynamic masking strategy during pre-training.
Specifically, RoBERTa adds CC-News and OpenWebText to the original BERT pre-training data. The resulting corpus contains over 160 GB of text and approximately 2.5 billion word pieces. RoBERTa is pre-trained with a batch size of 8,000 sequences of 512 tokens for 500k steps, resulting in much more computation compared to BERT. The dynamic masking strategy randomizes the masking pattern in each epoch and helps the model learn more robust representations of the input text. These modifications allow RoBERTa to surpass BERT and achieve state-of-the-art performance on several NLP tasks at the time.

Figure 2.7: Illustration of the factorization scheme in the permutation language modeling objective of XLNet, shown for four different factorization orders (e.g., 3 → 2 → 4 → 1).

[Optional]: XLNet While BERT achieves strong empirical performance


on various NLP tasks, it still suffers from a discrepancy between pre-training
and fine-tuning. That is because BERT uses the [MASK] token during pre-
training, but these artificial symbols are absent in the real data at the
fine-tuning stage. To alleviate this discrepancy, Yang et al. (2019) proposed
XLNet, a pre-trained Transformer encoder model based on the same model
architecture as BERT but using a permutation language modeling objective
instead of the masked language modeling objective. The permutation language modeling objective maximizes the expected log-likelihood of a sequence w.r.t. all possible permutations of the factorization order.
Specifically, as illustrated in Figure 2.7, for a text sequence $x$, we sample one factorization order $z$ at a time and decompose the likelihood $p_\theta(x)$ according to that factorization order. Note that the original sequence order is kept and
permutation in factorization order is achieved by changing the attention
mask matrix in the Transformer. Since the same model parameter θ is
shared across all factorization orders during training, in expectation, xt has

seen every possible element $x_i \neq x_t$ in the sequence, hence being able to


capture the bidirectional context. Moreover, as this objective fits into the
autoregressive framework, it naturally avoids the independence assumption
and the pretrain-finetune discrepancy.

$$\mathcal{L} = \sum_{i=1}^{T}\sum_{j=0}^{i-1} \log p_\theta\big(x_{\pi_{i,j}} \mid x_{<\pi_{i,j}}\big) \qquad (2.4)$$

where $x_{\pi_{i,j}}$ is the $j$-th token in the permutation of the first $i$ tokens in the input sequence and $x_{<\pi_{i,j}}$ is the set of tokens that appear before the $j$-th token in the permutation of the first $i$ tokens.
In addition to the permutation language modeling objective, XLNet
also increases the size of pre-training data and pre-training computation
budget. By combining these improvements, XLNet achieved state-of-the-art
performance on several NLP tasks at the time.

[Optional] ALBERT ALBERT (A Lite BERT) is another BERT variant


pre-trained by Lan et al. (2019) focusing on the parameter efficiency of
BERT-like models. ALBERT reduces the parameter count by (i) factorized
embedding parameterization which decomposes the large embedding matrix
into two smaller matrices which first project the word to a lower dimensional
embedding space and then project it to the hidden space, and (ii) cross-layer
parameter sharing, which shares the parameters of every Transformer layer in the model architecture. ALBERT also introduces a new sentence order prediction (SOP) task as an alternative to the NSP task. The SOP task uses
as positive examples the same technique as BERT (two consecutive segments
from the same document), and as negative examples the same two consecutive
segments but with their order swapped. Compared to the NSP task which
focuses on topic prediction, the SOP task focuses primarily on coherence
and forces the model to learn finer-grained distinctions about discourse-level
coherence properties.
With these techniques, ALBERT reduces the parameter count of BERTBASE from 108M to 12M with a moderate performance drop. Lan et al. (2019) also
pre-trained larger models with larger pre-training corpus and computation
budget, resulting in new state-of-the-art performance on several NLP tasks
at the time. However, it is worth noting that while ALBERT significantly
reduces the parameter count of BERT-like models, the computation cost
remains the same and could still be a bottleneck for real-world applications.
Figure 2.8: Illustration of the ELECTRA model and the replaced token detection framework: a small generator MLM produces corrupted inputs, and the discriminator (ELECTRA) predicts for each token whether it is original or replaced.

[Optional] ELECTRA ELECTRA (Efficiently Learning an Encoder that


Classifies Token Replacements Accurately) is another BERT variant pre-
trained by Clark et al. (2020). ELECTRA uses the same model architecture as the aforementioned BERT-like models but is pre-trained with a dis-
criminative self-supervised objective called the Replaced Token Detection
(RTD) task instead of generative ones such as masked language modeling or
permutation language modeling.
Specifically, as illustrated in Figure 2.8, the ELECTRA pre-training
approach involves training a generator and a discriminator model simultane-
ously. The generator model is used to create corrupted versions of the input
data by replacing some of the tokens with random tokens and is trained with
the masked language modeling objective. The discriminator model is a larger
and more complex model compared to the generator model and is trained
to maximize the probability of correctly distinguishing between the original
input data and the corrupted data at token level. After pre-training, the
generator model is discarded and only the discriminator model is fine-tuned
on downstream tasks. This training framework is more computationally
efficient because it provides a meaningful training signal for each input token, instead of only the 15% of tokens that are masked. Experiments show that ELECTRA achieves competitive performance compared to the RoBERTa model with a smaller computational budget.

2.4.2 GPTs
We now describe another important type of pre-trained language model: Transformer decoder models. These kinds of models are true language models according to the definition and can be used for generative tasks. The most representative models of this type are the GPT family pre-trained by OpenAI, which we introduce below.

Figure 2.9: Illustration of how to fine-tune GPT on natural language understanding tasks.

Figure 2.10: Illustration of GPT-2's zero-shot performance on various NLP tasks with respect to model size.

GPT A few months before BERT was released, Radford et al. (2018) pre-trained the first version of the GPT (Generative Pre-training Transformer) family. GPT is a decoder-only Transformer language model pre-trained with the standard language modeling objective on large-scale web text. After pre-
training, GPT can be fine-tuned on various natural language understanding
tasks by transforming the inputs into a single text sequence. The sequence
is fed into the GPT model and the final hidden state is used as the sequence
representation, which is then fed into a fully-connected layer to produce
predictions. The GPT fine-tuning procedure is illustrated in Figure 2.9.

GPT-2 After the success of GPT, OpenAI further scales generative pre-training with GPT-2 (Radford et al., 2019), using larger models and more training data. In addition to better performance with fine-tuning, they also find that generative pre-training on large-scale text data turns the model into an unsupervised multi-task learner that performs reasonably well on various NLP tasks in the zero-shot setting. As shown in Figure 2.10, the zero-shot performance of GPT-2 improves significantly when scaling the model size, and the largest GPT-2 model with 1.5 billion parameters achieves non-trivial performance on some NLP tasks. This demonstrated for the first time the potential of scaling LM pre-training for zero-shot inference.

Figure 2.11: Comparison of few-shot in-context learning with zero-shot learning and fine-tuning.

GPT-3 Motivated by the success of scaling model size and pre-training


corpus in GPT-2, OpenAI further scales the size of decoder-only Transformer
language models to 175 billion parameters, which is over 100 times larger than the largest GPT-2 model. They find that in addition to improved zero-shot performance, GPT-3 also exhibits a strong few-shot "in-context learning" ability. As illustrated in Figure 2.11, in-context learning appends both the task description and a few demonstrations of the task to the original input, feeds the result to the language model, and the model produces the prediction by auto-regressively predicting the next tokens.

Figure 2.12: Illustration of the text-to-text framework of T5.

2.4.3 Seq2seq TLMs


Apart from encoder-only and decoder-only Transformer language models, researchers have also explored pre-training encoder-decoder Transformer language models. Compared to BERT-like encoder-only models and GPT-like decoder-only models, encoder-decoder models suit sequence-to-sequence generation tasks such as machine translation and text summarization better. This is because the encoder helps the model better understand the input text with bidirectional attention, and the decoder helps the model generate more fluent outputs. We will introduce T5 (Raffel et al., 2020) and BART (Lewis et al., 2020), two representative pre-trained encoder-decoder Transformer language models, in this section.

T5 After the success of BERT and GPTs on natural language understanding


and unconditional text generation, it is natural to consider pre-training an
encoder-decoder model to improve seq2seq tasks. To this end, Raffel et al.
(2020) pre-trained T5 (”Text-to-Text Transfer Transformer”), a powerful
pre-trained encoder-decoder language model.
As illustrated in Figure 2.12, T5 is a general-purpose seq2seq model
that can handle both natural language understanding and generation tasks
with a text-to-text framework: the model takes the task description and task input as the encoder input and outputs the answer with the decoder. This paradigm has been very influential for subsequent NLP research because it demonstrates the possibility of building a single general-purpose model that can handle all NLP tasks.

Figure 2.13: Illustrations of (a) the text infilling objective in T5 and (b) other noise functions investigated in BART pre-training (token masking, token deletion, sentence permutation, document rotation).
T5 is pre-trained by mixing self-supervised data with multi-task supervised data. T5 uses a "text infilling" objective, which is shown to outperform other possible variants, as the source of self-supervision. As illustrated in Figure 2.13a, it randomly masks contiguous spans of tokens and trains the model to predict the masked text spans. Raffel et al. (2020) collected the C4 ("Colossal Clean Crawled Corpus") dataset and constructed massive text infilling data from it. For supervised data, T5 converts training data for GLUE, SuperGLUE, machine translation, text summarization, question answering, etc., into the text-to-text format by prepending task-specific prefixes. During pre-training, T5 randomly samples text-to-text training data according to the size of the different tasks.
After pre-training, T5 can achieve competitive performance on supervised
pre-training tasks without fine-tuning by prepending the corresponding prefix
to the input. T5 also achieves state-of-the-art performance on a wide range
of NLP tasks with task-specific fine-tuning.

BART Concurrently with T5, Lewis et al. (2020) pre-trained BART,


another powerful seq2seq pre-trained model. As illustrated in Figure 2.13b,
they investigated a number of noise functions as sources for self-supervision,
and found the combination of text infilling and sentence permutation to
be the best. They then pre-trained BART on the same pre-training data as RoBERTa and obtained state-of-the-art performance on several natural language generation tasks.
Chapter 3

Parameter efficient finetuning

Pretrained language models are used in a wide range of NLP tasks. As models grow larger, it becomes increasingly difficult to tune them with a limited amount of annotated data: overfitting happens easily, and full model tuning is expensive. To avoid these issues, various parameter-efficient tuning methods have been proposed. In the following sections, we first introduce partial finetuning techniques. They are simple but effective, and widely used in Transformer model tuning. Besides selecting a subset of parameters for tuning, another line of work explores how to keep the model frozen and add extra, tuned parameters. These newly added and tuned modules are called adapters.

3.1 Partially Finetuning


This type of tuning method (also called specification-based tuning (Ding et al., 2022)) fine-tunes a few inherent parameters while leaving the majority of parameters unchanged during model adaptation. This approach
does not seek to change the internal structure of a model but to optimize
a small number of internal parameters to solve particular tasks. Generally,
such specifications could be implemented based on heuristics or training
supervision.

Heuristic Specification. Specification-based methods do not introduce


any new parameters in the model, but directly specify part of the parameters
to be optimized. The idea is simple but surprisingly effective: Lee et al. (2019) fine-tune only the final one-fourth of the layers of BERT and RoBERTa and can obtain 90% of the performance of full-parameter fine-tuning.
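In practice, partial fine-tuning amounts to switching off gradients for everything except the chosen parameters. The sketch below freezes all but the last few layers of a generic Transformer encoder; the layer-naming pattern ("encoder.layer.<idx>") is an assumption modeled on common BERT-style implementations, not a fixed API, and a task-specific head would normally also be left trainable.

```python
import torch.nn as nn

def freeze_all_but_top_layers(model: nn.Module, num_layers: int, top_k: int) -> None:
    """Keep only the parameters of the top-k encoder layers trainable."""
    trainable_markers = tuple(
        f"encoder.layer.{i}." for i in range(num_layers - top_k, num_layers)
    )
    for name, param in model.named_parameters():
        # Heuristic specification: train only parameters whose name marks a top layer.
        param.requires_grad = any(marker in name for marker in trainable_markers)

# Usage (assuming a 12-layer BERT-like model bound to `model`):
# freeze_all_but_top_layers(model, num_layers=12, top_k=3)
# optimizer = torch.optim.AdamW(p for p in model.parameters() if p.requires_grad)
```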


BitFit (Zaken et al., 2022) empirically shows that by only optimizing the bias terms inside the model and freezing all other parameters, the model can still reproduce over 95% of the full fine-tuning performance on several benchmarks. Bias terms exist in both the attention mechanism and the MLP feed-forward layer. BitFit can fine-tune only two bias components (the "query" and "middle-of-MLP" bias terms, i.e., $b_q$ in Eq. (3.1) and $b_2$ in Eq. (3.6) below), amounting to half of the bias parameters in the model and only 0.04% of all model parameters.

Attention:
$$Q(x) = W_q x + b_q \qquad (3.1)$$
$$K(x) = W_k x + b_k \qquad (3.2)$$
$$V(x) = W_v x + b_v \qquad (3.3)$$
MLP feed-forward:
$$h_2 = \mathrm{Dropout}(W_1 h_1 + b_1) \qquad (3.4)$$
$$h_3 = g_{LN_1} \odot \frac{(h_2 + x) - \mu}{\sigma} + b_{LN_1} \qquad (3.5)$$
$$h_4 = \mathrm{GELU}(W_2 h_3 + b_2) \qquad (3.6)$$
$$h_5 = \mathrm{Dropout}(W_3 h_4 + b_3) \qquad (3.7)$$
$$\mathrm{out} = g_{LN_2} \odot \frac{(h_5 + h_3) - \mu}{\sigma} + b_{LN_2} \qquad (3.8)$$
Empirical results in BitFit also show that even if we use a small random set
of parameters for tuning (which obviously will degrade the performance), the
model could still yield passable results on the GLUE benchmark. Unfortu-
nately, the work only applies this trick to small-scale models, and there is no
guarantee that randomly choosing some parameters to be tuned would remain
competitive for larger models. Another valuable observation is that different
bias terms may have different functionalities during model adaptation.
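BitFit itself reduces to a simple parameter filter: freeze everything whose name does not mark a bias term. The sketch below is a minimal approximation that trains all bias terms; restricting it to only the query and middle-of-MLP biases would require matching the specific model's parameter names, which are an assumption here.

```python
import torch
import torch.nn as nn

def apply_bitfit(model: nn.Module) -> int:
    """Freeze all parameters except bias terms; return the trainable-parameter count."""
    n_trainable = 0
    for name, param in model.named_parameters():
        param.requires_grad = name.endswith("bias")
        n_trainable += param.numel() if param.requires_grad else 0
    return n_trainable

# Toy example: a small Transformer-style MLP block instead of a full pretrained LM.
toy = nn.Sequential(nn.Linear(768, 3072), nn.GELU(),
                    nn.Linear(3072, 768), nn.LayerNorm(768))
total = sum(p.numel() for p in toy.parameters())
print(f"trainable bias params: {apply_bitfit(toy)} / {total}")
```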

Learn the Specification. Rather than manually or heuristically specify


which parameters to be updated, one alternative is to “learn” such specifica-
tions. Diff pruning (Guo et al., 2021) reparameterizes the fine-tuned model parameters as the sum of the pre-trained parameters and a difference vector, $\theta_{FT} = \theta_{LM} + \delta_{\text{diff}}$. Hence, the key issue is to encourage the difference vector $\delta_{\text{diff}}$ to be as sparse as possible. This work regularizes the vector by a differentiable approximation to the $L_0$-norm penalty $\|\delta_{\text{diff}}\|_0$ to achieve the goal of sparsity. Practically, because new parameters to be
optimized are introduced in the learning phase, Diff pruning takes up more
GPU memory than full parameter fine-tuning, which may establish barriers
in the application on large language models.

Figure 3.1: Adapter layer of Houlsby et al. (2019). During adapter tuning, only the adapters, the layer normalizations, and the final classification layer (not shown in the figure) are trained. (a) Adapters are inserted after the multi-head attention sublayer and the feed-forward sublayer. (b) The adapter consists of a feed-forward down-projection, a nonlinearity, a feed-forward up-projection, and a skip-connection. The intermediate hidden state typically has a smaller size, which is why it is called a "bottleneck".

3.2 Adapter Tuning


Adapter tuning (Rebuffi et al., 2017) inserts small modules called adapters into a model. Adapters can have any network architecture and can be placed anywhere in the model, but many follow Houlsby et al. (2019)'s practice of placing a two-layer feed-forward neural network with a bottleneck after each sublayer (both the multi-head attention sublayer and the feed-forward sublayer) within the Transformer (Fig. 3.1):

$$h \leftarrow h + f(hW_{\text{down}})W_{\text{up}} \qquad (3.9)$$

where f is a nonlinear activation function.


It has been shown that adapter tuning can attain performance comparable
to full fine-tuning by training only a small fraction of parameters (Fig. 3.2).
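A bottleneck adapter as in Eq. (3.9) is just a two-layer MLP with a residual connection. The sketch below is a minimal stand-alone module under assumed sizes (hidden size 768, bottleneck 64); in Houlsby-style adapter tuning one such module would be inserted after each attention and feed-forward sublayer, with the backbone weights frozen.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """h <- h + f(h W_down) W_up   (Eq. 3.9)."""

    def __init__(self, hidden_size: int = 768, bottleneck_size: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck_size)   # W_down
        self.up = nn.Linear(bottleneck_size, hidden_size)     # W_up
        self.act = nn.GELU()                                  # nonlinearity f

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return h + self.up(self.act(self.down(h)))            # skip connection

adapter = BottleneckAdapter()
h = torch.randn(2, 16, 768)      # (batch, seq_len, hidden)
print(adapter(h).shape)          # torch.Size([2, 16, 768])
# Only the roughly 2 * 768 * 64 adapter parameters are trained per insertion point.
```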

3.2.1 [Optional Reading] Example: Cross-Lingual Transfer


In this section, we introduce the application of adapters to cross-lingual
transfer. LLMs pretrained on multiple languages, such as multilingual BERT (Devlin et al., 2019) and XLM-R (Conneau and Lample, 2019), enable

Figure 3.2: Accuracy delta (%) vs. number of trained task-specific parameters per task, for adapter tuning compared to fine-tuning only the top layers.

few-shot or even zero-shot cross-lingual transfer to low-resource languages


(Pires et al., 2019). However, the capacity of a single model is limited: it
cannot cover an unlimited number of languages and better performance on
low-resource languages often comes at the cost of worse performance on
high-resource languages (compared to monolingual models) (Conneau et al.,
2020). To alleviate the issue, we can learn an adapter for each language,
which allows the model to incorporate more languages without having the
languages interfering with each other.
Pfeiffer et al. (2020) proposed a Multiple ADapters for Cross-lingual trans-
fer (MAD-X) framework (Fig. 3.3) that comprises three types of adapters:
language, task, and invertible adapters.

Language Adapter A language adapter is introduced for each distinct


language. The language adapters are trained using masked language model-
ing.

Task Adapter A task adapter is stacked on top of the language adapter


to capture task-specific knowledge. During fine-tuning on downstream tasks,
language adapters are fixed and only task adapters are trained.

Invertible Adapter There often exists a mismatch between the pre-


trained multilingual LLM’s vocabulary and the target language’s vocabulary.

Figure 3.3: The MAD-X framework.

To mitigate this, invertible adapters are stacked on top of the embedding layer to better accommodate the target language. Furthermore, since the input and output embeddings are tied, the inverses of the adapters are placed before the output embedding layer to revert the transformation for inference. An invertible adapter splits an input embedding vector $e$ into two vectors of equal dimensionality, $e_1, e_2$, and applies the following transformations:
$$o_1 = F(e_2) + e_1, \qquad o_2 = G(o_1) + e_2, \qquad o = [o_1, o_2] \qquad (3.10)$$
Its inverse can be computed as follows:
$$e_2 = o_2 - G(o_1), \qquad e_1 = o_1 - F(e_2), \qquad e = [e_1, e_2] \qquad (3.11)$$
$F$ and $G$ can be arbitrary non-linear functions. In Pfeiffer et al. (2020), a function similar to the language and task adapters is used (minus the residual connection):
$$F(x) = \mathrm{ReLU}(xW_{\text{down},F})W_{\text{up},F}, \qquad G(x) = \mathrm{ReLU}(xW_{\text{down},G})W_{\text{up},G} \qquad (3.12)$$

Figure 3.4: Invertible Adapter.

Invertible adapters are trained together with language adapters using MLM
and are also fixed during task-specific fine-tuning. The architecture of an
invertible adapter and its inverse is shown in Fig. 3.4.
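Eqs. (3.10)-(3.12) describe an additive coupling layer, so the inverse is exact by construction. Below is a small sketch under assumed dimensions; F and G follow Eq. (3.12) with a bottleneck, and the check at the end verifies that applying the inverse recovers the original embedding.

```python
import torch
import torch.nn as nn

class Coupling(nn.Module):
    """F(x) = ReLU(x W_down) W_up, as in Eq. (3.12), without a residual connection."""
    def __init__(self, dim: int, bottleneck: int):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck, bias=False)
        self.up = nn.Linear(bottleneck, dim, bias=False)

    def forward(self, x):
        return self.up(torch.relu(self.down(x)))

class InvertibleAdapter(nn.Module):
    def __init__(self, emb_dim: int = 768, bottleneck: int = 48):
        super().__init__()
        half = emb_dim // 2
        self.F = Coupling(half, bottleneck)
        self.G = Coupling(half, bottleneck)

    def forward(self, e):                       # Eq. (3.10)
        e1, e2 = e.chunk(2, dim=-1)
        o1 = self.F(e2) + e1
        o2 = self.G(o1) + e2
        return torch.cat([o1, o2], dim=-1)

    def inverse(self, o):                       # Eq. (3.11)
        o1, o2 = o.chunk(2, dim=-1)
        e2 = o2 - self.G(o1)
        e1 = o1 - self.F(e2)
        return torch.cat([e1, e2], dim=-1)

adapter = InvertibleAdapter()
e = torch.randn(4, 768)
print(torch.allclose(adapter.inverse(adapter(e)), e, atol=1e-5))   # True
```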

3.2.2 [Optional Reading] LoRA

Bottleneck adapters introduce extra compute in the adapter layers, which cannot be bypassed with model parallelism since they have to be processed sequentially. Even though adapter layers are designed to have very few parameters, they can still lead to non-negligible latency, especially during inference. To address this issue, Hu et al. (2022) propose LoRA (Fig. 3.5), which injects trainable low-rank matrices to approximate the weight updates. For a pre-trained weight matrix $W \in \mathbb{R}^{d\times k}$ (e.g., the query and value projection matrices in the multi-head attention), the update is computed with a low-rank decomposition:

$$\Delta W = BA \qquad (3.13)$$

where $B \in \mathbb{R}^{d\times r}$ and $A \in \mathbb{R}^{r\times k}$. During training, $W$ is frozen and only $A$ and $B$ are trained. $A$ is initialized with a random Gaussian distribution, and $B$ is initialized to zero. The update is then scaled by a constant $\frac{\alpha}{r}$, which is roughly equivalent to tuning the learning rate. In summary, given a specific input $x$ to the layer, we have

$$h = Wx + \frac{\alpha}{r}\,\Delta W x = Wx + \frac{\alpha}{r}\,BAx. \qquad (3.14)$$
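The update in Eqs. (3.13)-(3.14) can be wrapped around any frozen linear layer. A minimal sketch, assuming an arbitrary rank r and a dense base layer; real implementations (e.g., the original LoRA code) additionally support merging the update into W for inference.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """y = W x + (alpha / r) * B A x, with W frozen and only A, B trained."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)            # freeze pretrained W
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)   # Gaussian init
        self.B = nn.Parameter(torch.zeros(d_out, r))          # zero init => ΔW = 0 at start
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(768, 768), r=8, alpha=16)
x = torch.randn(2, 10, 768)
print(layer(x).shape)        # torch.Size([2, 10, 768])
```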

Figure 3.5: LoRA: the pretrained weights $W \in \mathbb{R}^{d\times d}$ are frozen, while the low-rank update $BA$ (with $A \sim \mathcal{N}(0, \sigma^2)$ and $B = 0$ at initialization) is trained.

Figure 3.6: Accuracy of PEFT methods applied to T0-3B (Sanh et al., 2022), plotted against the percentage of parameters updated (methods include (IA)³, LoRA, BitFit, prompt tuning, prefix tuning, adapters, FISH Mask, Intrinsic SAID, Compacter, and Compacter++).

3.2.3 Prefix Tuning


Prefix tuning is another PEFT method. But unlike the others, it has its roots in prompting; therefore, we will discuss it in detail in Chapter 4. In Assignment 2, you will be asked to show how prefix tuning as well as LoRA connect to adapter tuning.
The performance of various PEFT methods is summarized in Fig. 3.6 (Liu et al., 2022). We cannot cover all the methods in this course; feel free to read their papers if you are interested.
Chapter 4

Prompting and Zero-shot


inference

4.1 What is prompting


In a traditional supervised learning setting, an output y is generated given the input x and the model parameters θ using the modeling objective P(y|x; θ). However, for many tasks supervised data is unavailable, making the training process difficult or impossible. Prompting circumvents this issue by learning an LM that models the probability P(x; θ) of the text x itself, and using this probability to predict y, reducing or obviating the need for large supervised datasets. In other words, prompting is non-invasive: it does not introduce large amounts of additional parameters or require direct inspection of a model's representations. It can be thought of as a lower bound on what the model "knows" and can be used to extract information from the LM.

4.1.1 How to prompt?


To prompt an LM, it is important to map the input x to a prompt x′ . A
prompting function fprompt (·) is applied to modify the input text x into a
prompt x′ = fprompt (x) (example provided in Table 4.1). Next, a template
is defined that consists of two slots: an input slot [X] for input x and
an answer slot [Z] for any generated answer z that may or may not be
mapped into y depending on the use case. Next, the highest-scoring ẑ that
maximizes the score of the LM is taken (z can take all possible values from
the vocabulary for generation tasks but can also be modified for controlled classification/generation tasks).

$x$ (Input). Example: "I love this movie." One or multiple texts.
$y$ (Output). Example: "++" (very positive). The output label or text.
$f_{\text{prompt}}(x)$ (Prompting function). Example: "[X] Overall, it was a [Z] movie." A function that converts the input into a specific form by inserting the input $x$ and adding a slot [Z] where an answer $z$ may be filled later.
$x'$ (Prompt). Example: "I love this movie. Overall, it was a [Z] movie." A text where [X] is instantiated by the input $x$ but the answer slot [Z] is not.
$f_{\text{fill}}(x', z)$ (Filled prompt). Example: "I love this movie. Overall, it was a bad movie." A prompt where slot [Z] is filled with any answer.
$f_{\text{fill}}(x', z^*)$ (Answered prompt). Example: "I love this movie. Overall, it was a good movie." A prompt where slot [Z] is filled with a true answer.
$z$ (Answer). Example: "good", "fantastic", "boring". A token, phrase, or sentence that fills [Z].

Table 4.1: Terminology and notation of prompting methods. $z^*$ represents answers that correspond to the true output $y^*$. The table is taken from Liu et al. (2023).
A function $f_{\text{fill}}(x', z)$ fills the slot [Z] in the prompt $x'$ with a potential answer $z$. We then search over the set of all potential answers by calculating the probability of their corresponding filled prompts under a pre-trained LM $P(\cdot;\theta)$:
$$\hat{z} = \underset{z \in \mathcal{Z}}{\mathrm{search}}\; P\big(f_{\text{fill}}(x', z); \theta\big). \qquad (4.1)$$

This search function could be an argmax search that searches for the highest-
scoring output or various sampling techniques that randomly generate outputs
following the probability distribution of the LM.
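For a classification-style prompt, the search in Eq. (4.1) can be an exhaustive scoring over a small candidate answer set. The sketch below is library-free and only illustrates the bookkeeping; score_with_lm is a hypothetical stand-in for a real LM call (e.g., the log-probability of the filled prompt under a pretrained model).

```python
def f_prompt(x: str) -> str:
    """Prompting function: wrap the input in a template with an answer slot [Z]."""
    return f"{x} Overall, it was a [Z] movie."

def f_fill(x_prime: str, z: str) -> str:
    """Fill the answer slot with a candidate answer z."""
    return x_prime.replace("[Z]", z)

def score_with_lm(text: str) -> float:
    # Hypothetical placeholder: should return the log-probability of `text` under a
    # pretrained LM P(.; theta). Faked here so the example runs standalone.
    return -len(text) + 10.0 * ("good" in text)

def predict(x: str, answers=("good", "bad")) -> str:
    x_prime = f_prompt(x)
    # argmax search over the candidate answer set Z (Eq. 4.1)
    return max(answers, key=lambda z: score_with_lm(f_fill(x_prime, z)))

print(predict("I love this movie."))   # -> "good" under the fake scorer
```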

4.2 Prompt Engineering


Given a task, multiple prompts can work. However, finding the most effective
prompts is desired in order to unlock the full potential of the LM. Prompt
Engineering is the process of designing a prompting function fprompt (x) that
results in the most effective performance for the given task.

4.2.1 Manual Prompts


The most straightforward yet somewhat effective technique is to design the
prompts for a given task manually. Since the number of required prompts
is usually small (1 to 8, depending on the task and the available input sequence length), designing manual prompts gives more control and flexibility to the
user.

4.2.2 Automated Prompts


While the strategy of manually creating prompts is intuitive and allows for
solving various tasks with some degree of accuracy, there are several problems
with this approach: creating and experimenting with these prompts is an
art that takes time and experience, especially for more complicated tasks
like multi-step reasoning, and even experienced prompt designers may fail to
find optimal prompts manually (Jiang et al., 2020). Searching for automated
prompts can be further divided into discrete prompts or continuous prompts.

Discrete prompts
Discrete prompts (also sometimes known as hard prompts), as the name sug-
gests, automatically search for prompts in a discrete space, i.e. the prompts
are usually text strings corresponding to natural language. Various methods
have been proposed in this line of work and we explore a few approaches
here.
Jiang et al. (2020) introduced a mining-based approach that automatically
identifies templates from a given set of training inputs x and outputs y. This
approach searches for strings in a large text corpus, such as Wikipedia, that
contain both x and y, and identifies the middle words or dependency paths
between them that can be used as templates, e.g. in the form of “[X] middle
words [Z]”.
Paraphrasing methods involve taking an original prompt and creating var-
ious alternative prompts to use as training data. These methods include
translating the prompt into another language and back (Jiang et al., 2020),
using a thesaurus to replace words (Yuan et al., 2021), or using a neural
prompt rewriter designed to improve the accuracy of systems that use the
prompt (Haviv et al., 2021).
Treating prompts as generation tasks is an obvious choice and Gao et al.
(2021) used the pre-trained T5 model for the template search process. Ben-
David et al. (2021) further extended it and proposed a domain adaptation
algorithm that trains T5 to generate unique domain-relevant features.

Other popular approaches involve gradient-based search for sequences that trigger the LM to generate the desired output, which are then used as prompts (Wallace et al., 2019). AutoPrompt (Shin et al., 2020) creates a prompt based on a template that combines the original input x with trigger tokens (found with the gradient-based method of Wallace et al. (2019)); the LM predictions for the prompt are then converted to class probabilities by marginalizing over a set of associated label tokens.

Continuous prompts

Prompt construction aims to enable an LM to perform a specific task ef-


fectively, and it is not imperative for the prompt to be limited to natural
language that can be interpreted by humans. Therefore, some approaches
focus on continuous prompts (also sometimes known as soft prompts) that
prompt the model directly in its embedding space.
One such approach is Prefix Tuning (Li and Liang, 2021) that prepends
a sequence of continuous task-specific vectors to the input, while keeping the
LM parameters frozen. Mathematically, this consists of optimizing over the
following log-likelihood objective given a trainable prefix matrix Mϕ and a
fixed pre-trained LM parameterized by θ.
$$\max_\phi \log P(y \mid x; \theta; \phi) = \max_\phi \sum_{y_i} \log P(y_i \mid h_{<i}; \theta; \phi) \qquad (4.2)$$

In Eq. 4.2, $h_{<i} = [h^{(1)}_{<i}; \cdots; h^{(n)}_{<i}]$ is the concatenation of all neural network layers at time step $i$. It is copied from $M_\phi$ directly if the corresponding time step is within the prefix ($h_i$ is $M_\phi[i]$), otherwise it is computed using the pre-trained LM.
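The sketch below illustrates the spirit of Eq. (4.2) in a deliberately simplified form: a trainable prefix matrix M_phi is prepended to the (frozen) input embeddings only, rather than to the activations of every layer as in full prefix tuning; frozen_lm and embed in the usage comment are hypothetical stand-ins for a frozen Transformer and its embedding layer.

```python
import torch
import torch.nn as nn

class SoftPrefix(nn.Module):
    """Simplified prefix: trainable matrix M_phi prepended at the embedding layer."""

    def __init__(self, prefix_len: int = 10, emb_dim: int = 768):
        super().__init__()
        self.M_phi = nn.Parameter(torch.randn(prefix_len, emb_dim) * 0.02)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        # input_embeds: (batch, seq_len, emb_dim); prepend the prefix to every example.
        batch = input_embeds.size(0)
        prefix = self.M_phi.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prefix, input_embeds], dim=1)

# During training, only SoftPrefix.parameters() receive gradients; the LM stays frozen:
# logits = frozen_lm(inputs_embeds=soft_prefix(embed(x)))   # hypothetical frozen LM call
# loss = cross_entropy(logits, targets); loss.backward(); optimizer.step()
soft_prefix = SoftPrefix()
x_emb = torch.randn(2, 20, 768)
print(soft_prefix(x_emb).shape)    # torch.Size([2, 30, 768])
```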
However, it can be beneficial to combine discrete prompts with continuous approaches, as prefix tuning might be very sensitive to small changes. Zhong et al. (2021) take advantage of a hybrid approach by first defining a template using AutoPrompt (Shin et al., 2020) (a discrete search method), initializing virtual tokens based on this discovered prompt, and then fine-tuning the embeddings to increase performance. Similarly, Liu et al. (2021b) propose "P-tuning", where continuous prompts are learned by inserting trainable variables into the embedded input. Han et al. (2021) further propose prompt tuning with rules (PTR), where logic rules create the templates alongside virtual tokens whose embeddings come from the pre-trained LM.

4.3 Zero- and Few-Shot Inference


In many cases, prompting methods can be used without any explicit training
of the language model (LM) for the downstream task. Instead, one can take
an LM that has been trained to predict the probability of text P (x) and
apply it as-is to fill the cloze or prefix prompts that define the task. This is
traditionally referred to as the zero-shot setting, as there is no training data
available for the task of interest.
However, in the scenario where a limited number of labeled examples are
available or can be easily annotated, it is possible to augment the prompt to
include this additional information. This setting is referred to as few-shot
inference and consists of including a small collection of input-output exemplars. These exemplars are input-output pairs that serve as demonstrations of the behavior that one would like the LM to emulate. By using these pairs as a guide, large LMs can learn to carry out the desired task. For example, we can augment the standard prompt “France’s
capital is [X] .” by prepending a few examples such as “Great Britain’s
capital is London . Germany’s capital is Berlin . France’s capital is [X]”.
Note that few-shot inference does not involve any parameter updates.
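As a concrete illustration, the snippet below builds such a few-shot prompt
programmatically; the exemplars mirror the capital-city example above, and
any autoregressive LM could be asked to continue the resulting string.

    # Assembling a few-shot prompt: exemplars are concatenated in front of
    # the test input and the LM simply continues the text (no training).
    exemplars = [
        ("Great Britain's capital is", "London ."),
        ("Germany's capital is", "Berlin ."),
    ]
    query = "France's capital is"

    prompt = " ".join(f"{x} {y}" for x, y in exemplars) + " " + query
    # prompt == "Great Britain's capital is London . Germany's capital is
    #            Berlin . France's capital is"
    # completion = lm.generate(prompt)   # the LM is expected to produce "Paris"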
The introduction of GPT-3 (Brown et al., 2020) popularized this approach:
the model was shown to benefit from the inclusion of exemplars in the prompt
on a wide range of tasks and across different model sizes. An example of this
behavior is displayed in Figure 4.1.

Figure 4.1: Results of GPT-3 on the TriviaQA dataset, using different
numbers of exemplars in the few-shot prompt. From Brown et al. (2020).

Although the idea of including a small number of examples in the prompt
is simple, there are some aspects that we need to pay attention to: (1)
what examples should be included in the prompt to make the demonstration
effective, and (2) how to order the examples in the prompt. Regarding the
example selection, researchers have shown that different demonstrations can
result in very different performance (Lu et al., 2021). An approach to tackle
this issue consists of using sentence embeddings to sample examples that are
close to the input in the embedding space (Liu et al., 2021a; Drori et al.,
2022). As for the order of the labeled examples, methods were proposed to
score different candidate permutations (Lu et al., 2021).

4.3.1 In-Context Learning

In-context learning is an emergent behavior displayed by large LMs that can
lead to the models performing previously unseen tasks in a few-shot setting,
without any parameter optimization.
Initially pointed out by Brown et al. (2020) in the presentation of the GPT-
3 model, this behavior has been the subject of several studies and surveys.
Notably, the phenomenon occurs suddenly as the model size increases, leading
to the definition of emergent behavior. While including a few exemplars in
the prompt does not lead to a significant difference in performance in small-
scale models, it leads to a significantly improved behavior in large models
(e.g., GPT-3, PaLM). Another aspect that makes this behavior surprising is
the mismatch between the LM’s training objective and the tasks that the
model is able to learn in-context. Typically, large LMs are trained with
a self-supervised language modeling objective, and the mechanisms that
allow these models to learn a previously unseen task (e.g., text classification)
without any parameter optimization are currently an active area of research.
Recent work has hypothesized that in-context learning via few-shot
prompting works as a process of meta-optimization on the few examples
included in the prompt (Dai et al., 2022). The authors show experimental
results supporting the idea that, when performing in-context learning, the
model produces meta-gradients from the demonstration exemplars through
forward computation. These meta-gradients are then applied to the original
LM through attention. In this way, in-context learning can be seen as a
kind of implicit fine-tuning on the few-shot exemplars.
A popular prompting strategy aimed at eliciting in-context learning is
chain-of-thought prompting.
4.3.2 Chain of Thought Prompting

Multiple studies investigating the reasoning capabilities of large LMs
highlighted how eliciting the model to produce a step-by-step solution to a
problem can lead to a more accurate final answer (Nye et al., 2021). Com-
bining this idea with the intuition of including a set of demonstrations in
the prompt, Wei et al. (2022c) introduced the concept of chain of thought
(CoT) prompting. Given a question, a chain of thought is a coherent
sequence of reasoning steps that leads to a final answer. Figure 4.2 shows an
example of a math word problem that the model is able to solve through
CoT prompting, and that it would otherwise answer incorrectly. The
step-by-step reasoning elicited from the LM through this prompting
technique was subsequently shown to emerge in a zero-shot setting as well
(Kojima et al., 2022). This was achieved simply by adding “Let's think
step-by-step” to the prompt before the model's answer.

Figure 4.2: Example of chain of thought prompting, from Wei et al. (2022c).
With standard prompting, the exemplar answer gives only the final result
and the model answers the new question incorrectly; with chain of thought
prompting, the exemplar answer spells out the intermediate reasoning steps
and the model arrives at the correct answer.
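To make the difference concrete, the snippet below sketches how a few-shot
CoT prompt and a zero-shot CoT prompt can be assembled as plain strings;
the exemplar is adapted from Figure 4.2, and the two constructions follow
Wei et al. (2022c) and Kojima et al. (2022), respectively.

    # Building chain-of-thought prompts as plain strings.
    cot_exemplar = (
        "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
        "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
        "A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 "
        "tennis balls. 5 + 6 = 11. The answer is 11.\n"
    )
    question = ("Q: The cafeteria had 23 apples. If they used 20 to make lunch "
                "and bought 6 more, how many apples do they have?\n")

    few_shot_cot = cot_exemplar + question + "A:"               # Wei et al. (2022c)
    zero_shot_cot = question + "A: Let's think step by step."   # Kojima et al. (2022)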

Chain-of-thought prompting has been shown to significantly improve
the ability of large language models to perform complex reasoning. In
particular, through experiments on arithmetic and commonsense reason-
ing, it has been found that the ability of generating chains of thought is
an emergent property of model scale. This means that larger models are
better at generating step-by-step reasoning chains than smaller models.
This phenomenon is noticeable in the scores reported in Figure 4.3. The
results display a significant discrepancy between the solve rate obtained
with chain of thought and with standard prompting on the GSM8k dataset.
However, this difference in performance does not seem to be as large
for SVAMP and MAWPS. This is because GSM8k consists of complex
problems that require multiple reasoning steps, while the other two
datasets are made up of simpler problems that, for the most part,
only require the application of a single mathematical operator to obtain
the final result. This highlights how CoT prompting is most effective in
eliciting in-context learning on multi-step reasoning problems.

Figure 4.3: Performance of three large LMs on three different math word
problem datasets, from Wei et al. (2022c).
Chapter 5

Multimodality

In this chapter we describe how language models deal with multimodality.


By multimodality we mean information which is outside of the main input
text. Prototypically, the extra modality is visual. For this, consider the
case of an image captioning model. It is a language model which also has
a module that is able to process an image and generate text based on it.
However, there are also other types of multimodalities, which we cover in
the subsequent section. These include, for example, knowledge-graphs or
knowledge-bases in general and we discuss how they can be used to enhance
the model with extra knowledge.

5.1 Vision Language Models


While text-only pre-trained models achieve great success on various NLP
tasks, we live in a multimodal world and human brains naturally learn to
process multi-sense signals received from the environment to help us make
sense of the world around us (Gan et al., 2022). Among multiple modalities,
vision and language are two main channels with which humans perceive
and communicate, respectively. Therefore, Vision-and-Language (VL) has
been a popular research area that sits at the nexus of Computer Vision
and Natural Language Processing (NLP), aiming to develop algorithms that
endow computers with an ability to effectively learn from multimodal data.
Inspired by the great success of LLM pre-training in NLP, which is described
in previous chapters, Vision-Language Pre-training (VLP) has attracted
rapidly growing attention from both CV and NLP communities and become
the main paradigm for modern VL research.
In this section, we first introduce important vision and language tasks in


Figure 5.1: Illustration of representative tasks from three categories of VL
problems: image-text tasks, vision tasks as VL problems, and video-text tasks.

section 5.1.1 and then present popular model architectures and pre-training
objectives for vision-language models (VLMs) in section 5.1.2 and section 5.1.3,
respectively.

5.1.1 Vision-and-Language Tasks


Vision-and-language tasks, by definition, should include both vision and
language modalities in their inputs and outputs. According to Gan et al.
(2022), VL tasks can be grouped into three categories: Image-Text Tasks,
CV Tasks as VL Tasks, and Video-Text Tasks.

• Image-Text Tasks. Image-text tasks are tasks that include images and
texts in their inputs and outputs. Image-text tasks are arguably the most
important and well studied tasks in VL research. We introduce the most
representative image-text tasks below:

– VQA and visual reasoning. Visual Question Answering (VQA) (An-


tol et al., 2015) is a task where an AI system is given an image and a
natural language question about the image, and it must provide an an-
swer to the question in natural language. The answers required in these
tasks can be open-ended free-form texts, or selected from multiple
choices. The goal of VQA is to build a system that can understand both

visual and textual information and combine them to answer questions


about the content of images. As illustrated in Figure 5.1, given an image
of a dog playing frisbee, a VQA system might be asked “What is the
dog holding with its paws?” and the answer should be “frisbee”.
As extensions to visual question answering, researchers also developed
datasets for visual reasoning (Hudson and Manning, 2019; Suhr et al.,
2019), visual commonsense reasoning (Zellers et al., 2019), visual di-
alog (Das et al., 2017), knowledge-based VQA (Marino et al., 2019),
scene-text-based VQA (Singh et al., 2019), etc.
– Image captioning. Image captioning (Vinyals et al., 2015) involves
generating a descriptive and semantically meaningful caption for a given
image. The goal of image captioning is to understand the visual content
of an image and translate it into natural language text. This task involves
several sub-tasks, including object recognition, scene understanding,
and natural language generation. For example, as illustrated in Figure
5.1, an image captioning system will produce a caption of “A dog is
lying on the grass next to a frisbee” given the image.
In addition to the traditional setting where short single-sentence gen-
eration is required (Lin et al., 2014), researchers have also developed
datasets for image paragraph captioning (Krause et al., 2017), scene-text-
based image captioning (Sidorov et al., 2020), visual storytelling (Huang
et al., 2016), and so on.
– Image-text retrieval. Image-text retrieval models are required to
retrieve the most relevant texts (or images) from a large corpus, given
an image (or text) query. Popular image-text retrieval datasets are
based on image captioning datasets (Chen et al., 2015; Plummer et al.,
2015).
– Visual grounding. In the visual grounding (Yu et al., 2016; Plummer
et al., 2015) task, the model receives an image, a referring expression,
and bounding boxes of objects and concepts in the image. The model
then needs to predict the bounding box corresponding to the input text
query. This task requires the model to understand the relationship
between language and vision, and accurately locate and identify the
objects and concepts in an image that are referred to in a given language
input.
– Text-to-image generation. It can be considered as the dual task of
image captioning, where the system is required to create a high-fidelity
image based on the text input. Most text-to-image generation models
are trained on image captioning datasets (Chen et al., 2015; Plummer et al., 2015).

• Computer Vision Tasks as VL Problems. Traditionally, core visual


recognition tasks such as image classification, object detection, and seg-
mentation (highlighted with pink in Figure 5.1) are considered as pure
vision problems. With the advent of CLIP (Radford et al., 2021) and
ALIGN (Jia et al., 2021), researchers have realized that language supervi-
sion can play an important role in computer vision tasks. First, the use of
noisy image-text data crawled from web allows large-scale pre-training of
vision encoders from scratch. Second, instead of treating the supervision
signals (e.g., class labels) as one-hot vectors, we take the semantic meaning
behind the labels into consideration and cast these computer vision tasks
as VL problems. This perspective generalizes the traditional closed-set clas-
sification or detection models to recognizing unseen concepts in real-world
applications, such as open-vocabulary object detection.

• Video-Text Tasks. Besides static images, videos are another important


type of visual modality. Naturally, all aforementioned image-text tasks
have their video-text counterparts, such as video captioning, retrieval, and
question answering (highlighted with green in Figure 5.1). The uniqueness
of video inputs, in comparison to images, requires an AI system to not only
capture spatial information within a single video frame, but also capture
the inherent temporal dependencies among video frames.

5.1.2 Vision-Language Models: Architectures


To effectively solve the aforementioned VL problems, researchers in the VL
community followed the success of LLM pre-training in NLP and pre-trained a
number of VLMs that can be effectively fine-tuned on different VL tasks. In
this section we describe the main building blocks of these models.

Overview. A vision-language model typically consists of a text encoder, a vision
encoder, a fusion module, and (optionally) a decoder.
Given an image-text pair, a VL model first extracts text features w =
{w1 , · · · , wN } and visual features v = {v1 , · · · , vM } via a text encoder and a
vision encoder, respectively. Here, N is the number of tokens in a sentence,
and M is the number of visual features for an image, which can be the
number of image regions/grids/patches, depending on the specific vision
encoder being used.
Figure 5.2: Illustration of a general framework for Transformer-based vision-
language models.

The text and visual features are then fed into a multimodal fusion module
to produce cross-modal representations, which are then optionally fed into a
decoder before generating the final outputs. An illustration of this general
framework is shown in Figure 5.2.
In many cases, there are no clear boundaries among the image/text backbones,
the multimodal fusion module, and the decoder. Here, we refer to the part
of the model that only takes image/text features as input as the corresponding
vision/text encoder, and to the part of the model that takes both image and text
features as input as the multimodal fusion module. If there are additional
modules that take the multimodal features as input to generate
the output, we call them the decoder. We next describe different kinds of vision
encoder, text encoder, and fusion module in detail.

Vision Encoder. There are three types of vision encoders: (i) an object
detector (OD), (ii) a plain CNN, and (iii) a vision Transformer.

• OD. The most widely used object detector for VL research is the Faster R-
CNN (Ren et al., 2015) pre-trained on the Visual Genome (VG) dataset (Kr-
ishna et al., 2017) as in BUTD (Anderson et al., 2018). In VinVL (Zhang
et al., 2021), a stronger OD model based on the ResNeXt-152 C4 architec-
ture is pre-trained on multiple public OD datasets (including COCO (Chen
et al., 2015), OpenImages (Kuznetsova et al., 2020), Objects365 (Shao
et al., 2019) and VG), and significant performance boost is observed across
a wide range of VL tasks by using this stronger OD model. Additional
care is taken to encode the location information of image regions, which is
typically represented as a 7-dimensional vector.1 Both visual and location
features are then fed through a fully-connected layer, to be projected
¹ [x1 , y1 , x2 , y2 , w, h, w ∗ h] (normalized top/left/bottom/right coordinates, width, height, and area)

into the same embedding space. The final embedding for each region is
obtained by summing up the two FC outputs and then passing through a
layer normalization layer.

• CNN. In PixelBERT (Huang et al., 2020) and SOHO (Huang et al.,


2021), ResNet-50, ResNet-101 and ResNeXt-152 pre-trained from ImageNet
classification are adopted. In CLIP-ViL (Shen et al., 2022), ResNet-50,
ResNet-101, and ResNet-50x4 pre-trained from CLIP (Radford et al., 2021)
are used. In SimVLM (Wang et al., 2022b), they use the first three blocks
(excluding the Conv stem) of ResNet-101 and ResNet-152 for their base
and large models, respectively, and a larger variant of ResNet-152 with
more channels for the huge model. Typically, it is observed that a stronger
CNN backbone results in stronger downstream performance.

• ViT. Some recent pre-trained VLMs such as ALBEF (Li et al., 2021) and
ViLT (Kim et al., 2021) use Transformer-based vision encoders. Follow-
ing Dosovitskiy et al. (2021), an image is first split into image patches,
which are then flattened into vectors and linearly projected to obtain
patch embeddings. A learnable special token [CLS] embedding is also
prepended to the sequence. These patch embeddings, when summed up
together with learnable 1D position embeddings and a potential image-type
embedding, are sent into a multi-layer Transformer block to obtain the
final output image features. Different ViT variants have been studied for
VLP, such as plain ViT (Dosovitskiy et al., 2021), DeiT (Touvron et al.,
2021), BEiT (Bao et al., 2022a), Swin Transformer (Liu et al., 2021c), and
CLIP-ViT (Radford et al., 2021), to name a few.
In a nutshell, no matter what vision encoder is used, the input image
is represented as a set of feature vectors v = {v1 , · · · , vM }. VLMs can be
categorized into non end-to-end models, which use an OD model to get vision
features for the model, and end-to-end models that directly take raw images
as input.
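As an illustration of how a ViT-style vision encoder produces the feature set
v = {v1, . . . , vM} described above, the following is a minimal PyTorch sketch;
the sizes are illustrative and the Transformer blocks themselves are omitted.

    # Minimal ViT-style patch embedding sketch.
    import torch
    import torch.nn as nn

    patch, dim, img = 16, 768, 224
    n_patches = (img // patch) ** 2                 # M = 196 patches for a 224x224 image

    to_patches = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)  # split + project
    cls_token = nn.Parameter(torch.zeros(1, 1, dim))
    pos_embed = nn.Parameter(torch.zeros(1, n_patches + 1, dim))

    def encode(image):                              # image: (B, 3, 224, 224)
        x = to_patches(image).flatten(2).transpose(1, 2)  # (B, M, dim) patch embeddings
        cls = cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + pos_embed        # prepend [CLS], add positions
        # x would now be fed to a stack of Transformer blocks to obtain v
        return x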

Text Encoder. Following BERT (Devlin et al., 2019) and RoBERTa (Liu
et al., 2019), VLMs first segment the input sentence into a sequence of
subwords and then insert two special tokens at the beginning and the end
of the sentence to generate the input text sequence. After we obtain the
text embeddings, existing works either feed them directly to the multimodal
fusion module (Li et al., 2019; Chen et al., 2020), or to several text-specific
layers (Tan and Bansal, 2019; Lu et al., 2019) before the fusion. For the
former, the fusion module is typically initialized with BERT, and the role of

text encoding and multimodal fusion is therefore entangled and absorbed in


a single BERT model, and in this case, we consider text encoder as the word
embedding layer.
In a nutshell, no matter what text encoder is used, the input text is
represented as a set of feature vectors w = {w1 , · · · , wN }.

Multimodal Fusion. For dual encoders like CLIP (Radford et al., 2021)
and ALIGN (Jia et al., 2021), fusion is essentially computing the similarity
between representation in the two modalities, which is typically performed
via a dot-product between two global image and text feature vectors. For
fusion encoder, it takes both v = {v1 , · · · , vM } and w = {w1 , · · · , wN } as
input, and learns contextualized multimodal representations denoted as
ṽ = {ṽ1 , · · · , ṽM } and w̃ = {w̃1 , · · · , w̃N }. There are mainly two types
of fusion modules, namely, merged attention and co-attention, shown in
Figure 5.3. Specifically,
• In a merged attention module, the text and visual features are simply
concatenated together, and then fed into a single Transformer block. This
design has been used in many previous works, such as VisualBERT (Li
et al., 2019), Unicoder-VL (Li et al., 2020a), VL-BERT (Su et al., 2019),
UNITER (Chen et al., 2020), OSCAR (Li et al., 2020b), VinVL (Zhang
et al., 2021), ViLT (Kim et al., 2021), GIT (Wang et al., 2022a), etc.
• In a co-attention module, on the other hand, the text and visual features
are fed into different Transformer blocks independently, and techniques
such as cross-attention are used to enable cross-modal interaction. This
design has been used in LXMERT (Tan and Bansal, 2019), ViLBERT (Lu
et al., 2019), ERNIE-ViL (Yu et al., 2021), METER (Dou et al., 2022),
etc. Also, in many models, only image-to-text cross-attention modules are
used, such as ALBEF (Li et al., 2021), BLIP (Li et al., 2022), CoCa (Yu
et al., 2022), and Flamingo (Alayrac et al., 2022). Most ViT-based models
adopt a co-attention module, since the image sequence can be very long and
doing merged attention can be very computationally inefficient. A minimal
sketch of the two fusion designs is given below.
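The sketch below illustrates the two fusion designs in PyTorch, assuming text
features w of shape (B, N, d) and visual features v of shape (B, M, d) have
already been computed; feed-forward layers, residual connections, and layer
normalization are omitted for brevity.

    # Merged attention vs. co-attention (single layer, simplified).
    import torch
    import torch.nn as nn

    d, heads = 768, 12
    merged_block = nn.MultiheadAttention(d, heads, batch_first=True)
    text_self    = nn.MultiheadAttention(d, heads, batch_first=True)
    text_cross   = nn.MultiheadAttention(d, heads, batch_first=True)

    def merged_attention(w, v):
        # concatenate both sequences and run ordinary self-attention over the result
        x = torch.cat([w, v], dim=1)                    # (B, N + M, d)
        out, _ = merged_block(x, x, x)
        return out[:, : w.size(1)], out[:, w.size(1):]  # contextualized w~, v~

    def co_attention_text_branch(w, v):
        # each modality keeps its own block; the text branch self-attends and then
        # cross-attends to the visual features (the visual branch is symmetric)
        h, _ = text_self(w, w, w)
        h, _ = text_cross(h, v, v)                      # queries: text, keys/values: image
        return h                                        # w~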

5.1.3 Vision-Language Models: Pre-training Objectives


We then introduce the pre-training tasks used in VLP. We will first introduce
masked language modeling and image-text matching, which have been used
extensively in almost every VLP model. Then, we will also describe image-
text contrastive learning and masked image modeling tasks which are widely
used in recent VLMs.

Figure 5.3: Co-attention and merged attention design for multimodal fusion.

Masked Language Modeling (MLM). The MLM objective is first


introduced in language model pre-training (e.g., Devlin et al., 2019; Liu
et al., 2019). In VLP, MLM with image-text pairs has also proven to be
useful. In MLM, given an image-text pair, we randomly mask out the input
words with probability of 15%, and replace the masked ones w̃m with special
token [MASK].2 The goal is to predict these masked tokens based on their
surrounding words w̃\m and the paired image ṽ, by minimizing the negative
log-likelihood:

LMLM (θ) = −E(w̃,ṽ)∼D log Pθ (w̃m |w̃\m , ṽ) , (5.1)

where θ denotes the trainable parameters. Each pair (w̃, ṽ) is sampled from
the whole training set D. There are several MLM variants used in VLP.
Specifically,

• Seq-MLM: In order to adapt the pre-trained model for image captioning,


it is observed (Zhou et al., 2020; Wang et al., 2021a) that adding a seq2seq
causal mask during pre-training is beneficial. That is, in Seq-MLM, the
model can only use its preceding context to predict the masked token,
which is consistent to the way the model performs image captioning during
inference.

• LM: Direct language modeling is used in BLIP (Li et al., 2022) and
CoCa (Yu et al., 2022) for VLP. The model predicts the caption given an
image token-by-token autoregressively.

• Prefix-LM: Using the encoder-decoder framework as in SimVLM (Wang


et al., 2022b) and DaVinCi (Diao et al., 2023), a PrefixLM pre-training
² Following BERT, this 15% is typically decomposed into 10% random words, 10% unchanged, and 80% [MASK].

objective is proposed, where a sentence is first split into two parts, and the
bi-directional attention is enabled on the prefix sequence and the input
image, while a causal attention mask is adopted on the remaining tokens.
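As a minimal illustration of the MLM objective in Eq. 5.1, the PyTorch sketch
below masks 15% of the text tokens and computes the cross-entropy loss of
predicting them given the remaining words and the image. The fusion model,
vocabulary size, and [MASK] id are assumed to be defined elsewhere, and the
80/10/10 replacement scheme from the footnote is omitted for brevity.

    # Sketch of the VLP masked language modeling loss (Eq. 5.1).
    import torch
    import torch.nn.functional as F

    def mlm_loss(text_ids, image_feats, fusion_model, vocab_size, mask_id):
        labels = text_ids.clone()
        mask = torch.rand_like(text_ids, dtype=torch.float) < 0.15  # pick ~15% of tokens
        labels[~mask] = -100                  # only masked positions contribute to the loss
        corrupted = text_ids.masked_fill(mask, mask_id)
        logits = fusion_model(corrupted, image_feats)       # (B, N, vocab_size)
        return F.cross_entropy(logits.view(-1, vocab_size), labels.view(-1),
                               ignore_index=-100)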

Image-Text Matching (ITM). In ITM, given a batch of matched or


mismatched image-caption pairs, the model needs to identify which images
and captions correspond to each other. Most VLP models treat image-text
matching as a binary classification problem. Specifically, a special token
(i.e., [CLS]) is appended at the beginning of the input sentence to learn a
global cross-modal representation. We then feed the model with either a
matched or mismatched image-caption pair ⟨ṽ, w̃⟩ with equal probability,
and a classifier is added on top of the [CLS] token to predict a binary label y,
indicating whether the sampled image-caption pair is matched. Specifically,
denoting the output score as sθ (w̃, ṽ), we apply the binary cross-entropy loss
for optimization:

LITM (θ) = −E(w̃,ṽ)∼D [y log sθ (w̃, ṽ) + (1 − y) log(1 − sθ (w̃, ṽ))] . (5.2)

Besides randomly sampling a negative image-text pair, harder negative pairs


can also be mined from an image-text contrastive loss introduced below, which
has been shown to be effective in improving the downstream performance,
as reported in ALBEF (Li et al., 2021), VLMo (Wang et al., 2021b), and
X-VLM (Zeng et al., 2022).

Image-Text Contrastive Learning (ITC). Early VLP models, such as


UNITER (Chen et al., 2020) and VinVL (Zhang et al., 2021), do not use
ITC for pre-training. Though the ITC loss is widely studied before VLP, in
the context of end-to-end VLP, it is mostly popularized by CLIP (Radford
et al., 2021) and ALIGN (Jia et al., 2021) to pre-train a dual encoder. Later
on, it is also used to pre-train a fusion encoder as in ALBEF (Li et al.,
2021). Note that this ITC loss is used on top of the outputs of image and
text encoders, before multimodal fusion (i.e., the use of w and v, instead
of w̃ and ṽ). Specifically, given a batch of N image-text pairs, ITC aims to
predict the N matched pairs from all the N 2 possible image-text pairs. With
a little bit abuse of notation, let {vi }N N
i=1 and {wi }i=1 denote respectively the
normalized image vectors and text vectors in a training batch. To compute
50 CHAPTER 5. MULTIMODALITY

image-to-text and text-to-image similarities, we have:


i2t
si,j = vi⊤ wj , st2i ⊤
i,j = wi vj , (5.3)
N i2t /σ)
1 X exp(si,i
Li2t
ITC (θ) = − log PN , (5.4)
N i2t
j=1 exp(si,j /σ)
i=1
N
1 X exp(st2i
i,i /σ)
Lt2i
ITC (θ) = − log N
, (5.5)
N t2i
P
i=1 j=1 exp(si,j /σ)

where σ is a learned temperature hyper-parameter, Li2t t2i


ITC and LITC are
image-to-text and text-to-image contrastive loss, respectively.
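A minimal PyTorch sketch of Eqs. 5.4-5.5 is given below; it assumes that the
image and text vectors of a batch are already L2-normalized, and that the two
directions are averaged into a single loss (a common choice, e.g. in CLIP).

    # Sketch of the image-text contrastive (ITC) loss.
    import torch
    import torch.nn.functional as F

    def itc_loss(v, w, sigma):
        # v, w: (N, d) normalized image and text vectors of one batch
        logits_i2t = v @ w.t() / sigma          # s^{i2t}_{i,j} / sigma
        logits_t2i = w @ v.t() / sigma          # s^{t2i}_{i,j} / sigma
        targets = torch.arange(v.size(0), device=v.device)  # matched pairs on the diagonal
        loss_i2t = F.cross_entropy(logits_i2t, targets)     # Eq. 5.4
        loss_t2i = F.cross_entropy(logits_t2i, targets)     # Eq. 5.5
        return (loss_i2t + loss_t2i) / 2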

Masked Image Modeling (MIM). Similar to the MLM objective, researchers
have studied various masked image modeling (MIM) tasks for
pre-training. Specifically, the model is trained to reconstruct the masked
patches or regions ṽm given the remaining visible patches or regions ṽ\m
and all the words w̃ as

LMIM (θ) = E(w̃,ṽ)∼D Pθ (ṽm |ṽ\m , w̃) . (5.6)

The designs of MIM can be divided into two categories.

• For OD-based VLP models, e.g., LXMERT (Tan and Bansal, 2019)
and UNITER (Chen et al., 2020), some of the input regions are randomly
masked (i.e., the visual features of the masked regions are replaced by
zeros), and the model is trained to regress the original region features via
minimizing the mean squared error loss. Researchers (Tan and Bansal,
2019; Lu et al., 2019; Chen et al., 2020) have also tried to first generate
object labels for each region using a pre-trained object detector, which
can contain high-level semantic information, and the model is trained to
predict the object labels for the masked regions instead of the original
region features.

• For end-to-end VLP models, e.g., ViLT (Kim et al., 2021) and ME-
TER (Dou et al., 2022), researchers have investigated the use of masked
patch regression/classification for masked image modeling. Specifically,

– For MIM with discrete VQ tokens, inspired by BEiT (Bao et al.,


2022a), discrete VQ tokens are first extracted for the input patches, and
the model is then trained to reconstruct the discrete tokens. Specifically,
the VQ-VAE (van den Oord et al., 2017) model in DALL-E (Ramesh

et al., 2021) is first used to tokenize each image into a sequence of


discrete tokens. Each image is resized so that the number of patches is
equal to the number of tokens, and thus each patch corresponds to a
discrete token. Then, we randomly mask 15% of the patches and feed
the masked image patches to the model as before, but now the model is
trained to predict the discrete tokens instead of the masked patches.
– For MIM with in-batch negatives, by imitating MLM which uses
a text vocabulary, the model is trained to reconstruct input patches
by using a dynamical vocabulary constructed with in-batch negatives.
Concretely, at each training step, we sample a batch of image-caption
pairs $\{\langle v^k, w^k \rangle\}_{k=1}^{B}$, where B is the batch size, and treat all the patches
in $\{v^k\}_{k=1}^{B}$ as candidate patches. We mask 15% of the input patches, and for
each masked patch the model needs to select the original patch within
this candidate set. The model is trained to maximize the probability of the
original patch, similar to noise contrastive estimation (Gutmann and Hyvärinen, 2010).
Notably, recent state-of-the-art VLP models (e.g., VinVL (Zhang et al.,
2021), ALBEF (Li et al., 2021), VLMo (Wang et al., 2021b)) do not apply
MIM during pre-training, and in ViLT (Kim et al., 2021) and METER (Dou
et al., 2022), the authors also demonstrate that MIM is not helpful for
downstream performance. However, there are also recent works that adopt
masked vision-language modeling (as in MaskVLM (Kwon et al., 2022) and
VL-BEiT (Bao et al., 2022b)), which try to randomly mask patches/tokens
while keeping the other modality intact.

5.2 Knowledge-Enhancement
Language models are known to contain knowledge on their own (Petroni
et al., 2019; Petroni et al., 2020) and we cover the extraction of this knowledge
in a later chapter. Consider a question answering system based on a large
language model. When asked about the current president, it is able to provide
the correct answer. But on the day of inauguration, the true answer will
be different while the model will keep the old one. Because the information
about the current president is stored in the model’s parameters, the only
reliable way of updating this information in the model is to fine-tune it or
re-train it from scratch. However, changing a small piece of information
should not require these complicated and expensive operations. For this, we
turn to knowledge-enhanced language models.
Models which contain information indirectly in their parameters are
called parametric and their counterparts, models which rely on external

information, are called non-parametric. They do have parameters but most


of the information is not stored there but in some more human-accessible
form and location. The general principle is that during the inference, the
model generates a key by which an external, interpretable and editable system
is queried. Next, the result of this search is fused into the model and the
inference continues. Continuing with the question answering example, the
model may construct e.g. an SQL query for the current president in a given
country. Upon receiving the answer, it is attended to in the inference and
copied to the output. When the true answer changes e.g. on the inauguration
day, it can easily be edited by a human or another automated system in the
database and the large language model, which serves as question answerer,
does not need to be changed at all. To summarize, the advantage of having an
external source of information is ease of editability but also interpretability
because we can later examine which pieces of information were retrieved and
used in the inference process.
We now describe several models that utilize the knowledge retrieval
approach.

5.2.1 kNN LM
The language model proposed by Khandelwal et al. (2019) utilizes memo-
rization of the training data during inference in order to predict otherwise
sparsely occurring words. In this approach, the authors first compute repre-
sentations of all sentence prefixes (last self-attention layer) and store them
as a mapping to the following word:

Encoder: k = LMrep. (X<i ) (5.7)


Datastore : B = {(Encoder(X<i ), Xi )|X ∈ Dtrain , i < |X|} (5.8)

In autoregressive inference, the prefix representation is again computed for the
current sentence and the 1024 nearest neighbours in the vector space are retrieved:

\[ \text{Retriever:} \quad S = \{(r, v) \mid (r, v) \in N^{L_2}_{1024}(k)\} \tag{5.9} \]

Then, their corresponding mapped words are combined into a single
distribution over the vocabulary:

\[ \text{Aggregator:} \quad p_\xi(\hat{X}_i = v) \propto \sum_{(r, v') \in S} \mathbb{1}[v' = v] \, \exp(-\lVert r - k \rVert_2) \tag{5.10} \]

Formally, this alone would create its own kNN-based language model $p_\xi$.
However, its output is interpolated with the main model's own prediction using a
manually set hyperparameter λ ∈ [0, 1]; the higher it is, the more weight is put
on the retrieved prediction as opposed to the current model's prediction:

\[ \text{Output:} \quad \lambda \cdot p_\xi + (1 - \lambda) \cdot \mathrm{LM}(X_{<i}) \tag{5.11} \]

The symbolic working of the model is summarized in the equations above and
in Fig. 5.4, which is adapted from the original paper. Even while preserving the
same number of trainable parameters, the perplexity of a baseline model
(Baevski and Auli, 2018) on WikiText-103 was reduced from 18.7 to 15.8.

Figure 5.4: An illustration of kNN-LM. A datastore is constructed with an


entry for each training set token, and an encoding of its leftward context.
For inference, a test context is encoded, and the k most similar training
contexts are retrieved from the datastore, along with the corresponding
targets. A distribution over targets is computed based on the distance of
the corresponding context from the test context. This distribution is then
interpolated with the original model’s output distribution. (Khandelwal
et al., 2019)
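The following NumPy sketch summarizes the inference procedure of
Eqs. 5.7-5.11; the datastore arrays, the query representation, and the
hyperparameter values are illustrative.

    # Sketch of kNN-LM inference: retrieve neighbours, aggregate, interpolate.
    import numpy as np

    def knn_lm_probs(k_query, p_lm, keys, values, vocab_size, k=1024, lam=0.25):
        # keys: (S, d) stored context representations, values: (S,) next-token ids
        dists = np.linalg.norm(keys - k_query, axis=1)     # L2 distance to every key
        nearest = np.argsort(dists)[:k]                    # retrieve k neighbours (Eq. 5.9)
        p_knn = np.zeros(vocab_size)
        for idx in nearest:
            p_knn[values[idx]] += np.exp(-dists[idx])      # Eq. 5.10 (unnormalized)
        p_knn /= p_knn.sum()
        return lam * p_knn + (1 - lam) * p_lm              # Eq. 5.11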

5.2.2 Dynamic Gating kNN LM


The previously described language model uses the same weighting mechanism
across all words, i.e. both common words such as the, a, one and very low
frequency words, where the retrieval from training data may be particularly
useful (e.g. f (Barack’s wife is called) → Michelle). Yogatama et al. (2021)
propose an approach in which the model itself determines this parameter,
now a vector, dynamically based on the current sentence prefix (e.g. lower
weight to retrieval component for easy words). The datastore is constructed
in the same way from the training data as in kNN LM. Their approach
modifies the final Eq. (5.11) with a dynamic weight based on the current

hidden state:

Dynamic weight: g = σ(wT · LMrep. (X<i )) (5.12)


Output: (1 − g) ⊙ pξ + g ⊙ LM(X<i ) (5.13)

The model experimented with (Transformer-XL) was half the size of the original
one, but again, on WikiText-103, it reduced the perplexity from 19.1 to 17.6.
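A minimal sketch of this gating, following Eqs. 5.12-5.13, is shown below
(with a scalar gate for simplicity; in the paper the gate can be a vector).

    # Dynamic interpolation weight predicted from the current hidden state.
    import numpy as np

    def gated_output(h, w_gate, p_knn, p_lm):
        g = 1.0 / (1.0 + np.exp(-(w_gate @ h)))   # sigmoid(w^T h), Eq. 5.12
        return (1 - g) * p_knn + g * p_lm         # Eq. 5.13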

5.2.3 KnowBERT
KnowBERT (Peters et al., 2019) is a typical knowledge-enhanced language
model that injects factual knowledge into the hidden representations of tokens.
The idea is quite intuitive: if a language model can align mentions in the input
text with entities in the knowledge base, we can say that factual knowledge
is contained in this model. Note that in previous sections, we mentioned
that in Transformers, text is first tokenized and vectorized as embeddings.
Then, the embeddings are processed by Transformer modules. In KnowBERT,
mentions are not aligned with entities before being input into the model.
Instead, both are first vectorized, and then they are aligned in the form of
hidden representations. Next, we illustrate this idea in more detail.

Preparation (knowledge base and base language model). Knowledge


bases contain rich factual knowledge. The knowledge in them is usually
formatted as triples, i.e., (subj, rel, obj). subj and obj here are entities,
which are usually associated with entity labels and descriptions. rel indicates
the relationship between entities, with labels associated as well. In most cases,
the knowledge is vectorized before usage. KnowBERT computes entity
embeddings from WordNet, which are claimed to encode both relations and
synset definitions.
As with most knowledge-enhanced language models, KnowBERT is not trained
from scratch. The backbone pretrained language model is BERT,
and we can regard the knowledge enhancement process as secondary
pretraining or special finetuning of BERT. The architecture is roughly the
same as BERT, with a specially designed knowledge integration component.
Next, we introduce the Knowledge Attention and Recontextualization (KAR)
component (as shown in Figure 5.5) in detail.

(1) Mention-span representations are calculated with the equation below:

\[ H_i^{\mathrm{proj}} = H_i W_1^{\mathrm{proj}} + b_1^{\mathrm{proj}}. \tag{5.14} \]

Figure 5.5: The knowledge integration component of KnowBERT (i.e., Knowledge
Attention and Recontextualization). BERT word piece representations H_i are
first projected to H_i^proj (1), then pooled over candidate mention spans (2) to
compute S, and contextualized into S^e using mention-span self-attention (3).
An integrated entity linker computes weighted average entity embeddings Ẽ (4),
which are used to enhance the span representations with knowledge from the
knowledge base (5), computing S'^e. Finally, the BERT word piece representations
are recontextualized with word-to-entity-span attention (6) and projected back to
the BERT dimension (7), resulting in H'_i.

The KAR starts with the knowledge base entity candidate selector that
provides a list of candidate mentions which it uses to compute C mention-span
representations sm ∈ RE . Hi is first projected to the entity dimension with a
linear projection, then, the KAR computes mention-span representations, one
for each candidate mention, by pooling over all word pieces in a mention-span
using the self-attentive span pooling. The mention-spans are stacked into a
matrix S ∈ RC×E .

(2) Entity linker functions as its name suggests: it is responsible for
performing entity disambiguation for each potential mention from the available
candidates. It first runs mention-span self-attention to compute embeddings as

S^e = TransformerBlock(S). (5.15)

The span self-attention is identical to the typical transformer layer, except
that the attention mechanism here is applied between
mention-span vectors instead of word piece vectors. This allows KnowBERT
to incorporate global information into each linking decision, so that it can
take advantage of entity-entity cooccurrence and resolve which of several
overlapping candidate mentions should be linked.

The objective function is based on the supervision signal of entity linking,


which can be used to optimize parameters of the entity linker and other
weight matrices. Specifically, Se is used to score each of the candidate entities
while incorporating the candidate entity prior from the knowledge base.
Each candidate span m has an associated mention-span vector sem ∈ Se , Mm
candidate entities with embeddings emk (from knowledge base), and prior
probabilities pmk . KnowBERT computes Mm scores using the prior and dot
product between the entity span vectors and entity embeddings as

ψmk = MLP(pmk , sem · emk ). (5.16)

And thus correspondingly, the loss function of the entity linker can be written as

\[ \mathcal{L}_{\mathrm{EntityLinker}} = -\sum_m \log \frac{\exp \psi_{mg}}{\sum_k \exp \psi_{mk}} \tag{5.17} \]

where g indexes the gold entity for mention m.

(3-5) Knowledge enhanced entity-span representations. KnowBERT
injects the knowledge base entity information into the mention-span
representations computed from BERT vectors ($s^e_m$) to form entity-span
representations. The weighted entity embedding can then be written as

\[ \tilde{e}_m = \sum_k \psi_{mk} \, e_{mk}. \tag{5.18} \]

Then, the entity-span representations are updated with the weighted entity
embeddings as

\[ s'^{\,e}_m = s^e_m + \tilde{e}_m, \tag{5.19} \]

which can be packed into a matrix as $S'^e \in \mathbb{R}^{C \times E}$.
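The core knowledge-injection computation of Eqs. 5.16-5.19 can be sketched
as follows in PyTorch; `score_mlp` is an assumed two-feature MLP implementing
Eq. 5.16, and the softmax normalization of the candidate scores is an
assumption of this sketch.

    # Sketch of the entity scoring and knowledge-enhanced span update.
    import torch

    def enhance_spans(s_e, ent_embs, priors, score_mlp):
        # s_e: (C, E) contextualized mention-span vectors
        # ent_embs: (C, M, E) candidate entity embeddings, priors: (C, M)
        dot = torch.einsum("ce,cme->cm", s_e, ent_embs)                  # s^e_m . e_mk
        psi = score_mlp(torch.stack([priors, dot], dim=-1)).squeeze(-1)  # Eq. 5.16
        weights = torch.softmax(psi, dim=-1)                             # normalize scores
        e_tilde = torch.einsum("cm,cme->ce", weights, ent_embs)          # Eq. 5.18
        return s_e + e_tilde                                             # Eq. 5.19 (S'^e)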

(6-7) Recontextualization happens after updating the entity-span
representations with the weighted entity vectors. KnowBERT uses them to
recontextualize the word piece representations. This is accomplished using a
modified transformer layer that substitutes the multi-headed self-attention
with a multi-headed attention between the projected word piece representa-
tions and knowledge-enhanced entity-span vectors. As introduced before, we
have

\[ H_i^{\prime\,\mathrm{proj}} = \mathrm{MLP}\big(\mathrm{MultiHeadAttn}(H_i^{\mathrm{proj}}, S'^e, S'^e)\big). \tag{5.20} \]

Then, $H_i^{\prime\,\mathrm{proj}}$ is projected back to the BERT dimension with a linear transfor-
mation and a residual connection added as:

\[ H'_i = H_i^{\prime\,\mathrm{proj}} W_2^{\mathrm{proj}} + b_2^{\mathrm{proj}} + H_i. \tag{5.21} \]

Figure 5.6: The left part is the architecture of ERNIE. The right part is
the aggregator for the mutual integration of the input of tokens and entities.
Information fusion layer takes two kinds of input: one is the token embedding,
and the other one is the concatenation of the token embedding and entity
embedding. After information fusion, it outputs new token embeddings and
entity embeddings for the next layer

Training process. To avoid catastrophic forgetting, KnowBERT cannot
be trained using only the entity linking supervision signal; the pretraining
objectives of BERT should also be considered. Thus, simply speaking, the
knowledge enhancement objective is defined as:

LKnowBERT = LBERT + LEntityLinker . (5.22)

Note that there are many engineering tricks in the KnowBERT design. These
tricks are commonly used in many knowledge-enhanced language models. In
the following sections, we will leave out these tricks for simplicity.

5.2.4 [Optional] ERNIE


ERNIE (Zhang et al., 2019) was proposed prior to KnowBERT and shares
part of the design described above in the KnowBERT section. Therefore, to
avoid repeating content, we briefly introduce the general method design of
ERNIE, and mainly focus on the differences between ERNIE and KnowBERT.

Architecture. The knowledge enhancement design of ERNIE can be


found in Figure 5.6. Unlike KnowBERT, the whole model architecture of ERNIE
consists of two stacked modules: (1) the underlying textual encoder (T-Encoder),
responsible for capturing basic lexical and syntactic information from the input
tokens, and (2) the upper knowledgeable encoder (K-Encoder), responsible for
integrating extra token-oriented knowledge information into the textual
information from the underlying layer, so that heterogeneous information of
tokens and entities can be represented in a unified feature space.

Objective. In order to inject knowledge into language representation


by informative entities, ERNIE proposes a new pre-training task, which
randomly masks some token-entity alignments and then requires the system
to predict all corresponding entities based on aligned tokens. As the task is
similar to training a denoising auto-encoder, it refers to this procedure as a
denoising entity auto-encoder (dEA). Considering that the number of entities is
quite large for the softmax layer, ERNIE only requires the system to
predict entities from the given entity sequence instead of all entities in
knowledge base. Given the token sequence {w1 , ..., wn } and its corresponding
entity sequence {e1 , ..., em }, the aligned entity distribution for the token wi
is defined as follows,

p(ej |wi ) = softmax(linear(wio ) · ej ). (5.23)

The equation above will be used to compute the cross-entropy loss function
for dEA.

Performance. Now that we know how KnowBERT and ERNIE inject knowledge,
we show that they perform well on knowledge-intensive tasks.
Note that the knowledge enhancement can be regarded as secondary
pretraining. Thus, to compare the performance of BERT, ERNIE, and Know-
BERT, similar to finetuning, we first need to tune them on the training data of
the downstream task and then test them. As mentioned before, the knowledge
can be regarded as triples, which contain entities and relations. Thus, the most
intuitive way to evaluate these knowledge-enhanced language models is to
see if they perform well at predicting knowledge.
We consider two typical knowledge-intensive downstream tasks for evalu-
ation: relation extraction (Choi et al., 2018) and entity typing (Zhang et al.,
2017). Results can be found in Table 5.1.
• Relation extraction: Models are given a sentence with a marked subject
  and object, and asked to predict which of several different relations (or
  no relation) holds.

Table 5.1: Performance of BERT, ERNIE, and KnowBERT on two knowledge-
intensive tasks. P here is precision, R is recall.

             Relation extraction (Choi et al., 2018)   Entity typing (Zhang et al., 2017)
  Model        P      R      F1-Micro                    P      R      F1-Micro
  BERT        67.2   64.8     66.0                      76.4   71.0     73.6
  ERNIE       70.0   66.1     68.0                      78.4   72.9     75.6
  KnowBERT    71.6   71.4     71.5                      78.6   73.7     76.1

• Entity typing: Given an entity mention and its context, entity typing
requires systems to label the entity mention with its respective semantic
types.
Chapter 6

Additional Topics

This chapter includes...

6.1 Instruction-Based Training Procedures


Shortly after the advent of transformer-based language models, the pretraining-
finetuning paradigm was shown to be an effective strategy for the great
majority of NLP tasks. The ability of the pretraining stage to boost the
performance of these models on downstream tasks led to the hypothesis that
language models are unsupervised multitask learners (Radford et al., 2019).
This idea suggests that models pretrained with a language modeling objec-
tive on large corpora are able to perform better because, during pretraining,
these models implicitly learn a latent structure of the language that is useful
to subsequently learn more effectively a task involving text. While classic
pretraining objectives leverage this ability of language models to implicitly
learn information useful for multiple tasks, recent work proposed an explicit
multi-task training paradigm that leverages the use of natural language
instructions. We can identify two main approaches that follow this paradigm:
instruction tuning and reinforcement learning from human feedback.

6.1.1 Instruction Tuning


Instruction tuning consists of finetuning language models on a collection of
datasets described via instructions. The way these instructions are incor-
porated in the training procedure is simply by prepending to the model’s
input a short description of the task that needs to be carried out (e.g., Is
the sentiment of this movie review positive or negative? or Translate the


Figure 6.1: Example of instruction tuning, from Chung et al. (2022).

following sentence into Chinese). This approach combines aspects of both


pretrain-finetune and prompting paradigms to improve language models’
responses to inference-time text interactions. By finetuning a pretrained
language model on a mixture of NLP datasets expressed via natural language
instructions, the models were shown to perform better on previously unseen
tasks (process illustrated in Figure 6.1).
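A minimal sketch of this data format is shown below; the instructions and
examples are illustrative, and in practice templates from many datasets are
mixed into a single finetuning collection.

    # Sketch of how instruction tuning formats training examples: a natural-
    # language task description is prepended to each input.
    def format_example(instruction, x, y):
        return {"input": f"{instruction}\n{x}", "target": y}

    mixture = [
        format_example("Is the sentiment of this movie review positive or negative?",
                       "A thoroughly enjoyable film.", "positive"),
        format_example("Translate the following sentence into Chinese.",
                       "The weather is nice today.", "今天天气很好。"),
    ]
    # `mixture` would then be used for ordinary supervised finetuning of the LM.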

An example of models trained using this procedure is the FLAN (Finetuned
Language Net) family (Wei et al., 2022a). To evaluate the FLAN
models’ ability to perform a specific task T (e.g., natural language infer-
ence), the model is instruction-tuned on a range of other NLP tasks such as
commonsense reasoning, translation, and sentiment analysis. As this setup
ensures that the FLAN model has not seen task T in instruction tuning,
the model is then evaluated on its ability to perform T in a zero-shot setting.
FLAN models were shown to outperform larger LMs prompted with few
shots (Figure 6.2). This approach was subsequently scaled to larger models
(540B PaLM; Chowdhery et al. 2022) and a larger number of tasks (1836)
(Chung et al., 2022). This resulted in models with better generalization
capabilities, improved reasoning abilities, and better behavior in open-ended
zero-shot generation than non-instruction-tuned models.



Figure 6.2: Performance of zero-shot FLAN, compared with zero-shot and
few-shot GPT-3, from Wei et al. (2022a).
compared with zero-shot and few-shot GPT-3, on three unseen task types where instruction tuning
improved6.1.2 Reinforcement
performance substantially outLearning from Human
of ten we evaluate. Feedback
NLI datasets: ANLI R1–R3, CB, RTE.
Reading comprehension datasets: BoolQ, MultiRC, OBQA. Closed-book QA datasets: ARC-easy,
A different
ARC-challenge, training approach that leverages human instructions is reinforce-
NQ, TriviaQA.
ment learning from human feedback (Christiano et al., 2017). Reinforcement
⇤ learning (RL)
Lead contributors. is acontributions
Author branch of machine learning
listed at end of paper.that focuses on how agents can
learn to make decisions through interactions with an environment. In RL, an
agent receives rewards or penalties based on its actions in the environment,
and its objective is to maximize its cumulative reward over time. The reward
function is a critical component of RL, as it specifies the goals and incentives
for the agent. The agent’s behavior is captured by a policy, which maps
observations to actions. The goal of RL is to optimize the policy to maximize
the expected cumulative reward. The intuition behind RL from human
feedback (RLHF) is to fit a reward function to the human’s preferences while
simultaneously training a policy to optimize the current predicted reward
function.
In this setting, the agent that we are considering is a pretrained language
model that needs to be aligned with the user’s intention. To this end, human
preferences are used as a reward signal to train the model (process illustrated
in Fig. 6.3). This procedure incorporates both explicit intentions such as
following instructions and implicit intentions such as staying truthful, and
not being biased, toxic, or otherwise harmful. RLHF consists of the following
steps:

1. Defining a distribution of prompts on which we want the model to


produce aligned outputs.

2. Constructing a dataset of human-written demonstrations and using


it to train a supervised learning baseline model. These demonstrations
represent the desired behavior on the input prompt distribution.

Figure 6.3: RLHF training procedure. Source: https://openai.com/blog/chatgpt.

3. Collecting a dataset D of comparisons between outputs produced by


the baseline model, where labelers indicate which output they prefer
for a given input. A reward model is then trained to predict the
human-preferred output. The loss function for the reward model θRM
is:
\[
L(\theta_{RM}) = -\frac{1}{\binom{K}{2}} \, \mathbb{E}_{(x, y_w, y_l) \sim D}\Big[\log\big(\sigma\big(r_{\theta_{RM}}(x, y_w) - r_{\theta_{RM}}(x, y_l)\big)\big)\Big],
\]

where rθ (x, y) is the scalar output of the reward model for prompt x
and completion y with parameters θ, yw is the preferred completion
out of the pair of yw and yl , and K is the number of ranked responses
to prompt x present in D.

4. Using the output of the reward model as a scalar reward, fine-tune the
supervised policy to optimize this reward using the proximal policy
optimization (PPO) algorithm (Schulman et al., 2017). The objective
function during this phase is:

\[
\mathrm{obj}(\phi) = \mathbb{E}_{(x, y) \sim D_{\pi_\phi^{RL}}}\Big[r_{\theta_{RM}}(x, y) - \beta \log\big(\pi_\phi^{RL}(y \mid x) \,/\, \pi^{SFT}(y \mid x)\big)\Big] + \gamma\, \mathbb{E}_{x \sim D_{\mathrm{pretrain}}}\Big[\log \pi_\phi^{RL}(x)\Big],
\]

where $\pi_\phi^{RL}$ is the learned RL policy, $\pi^{SFT}$ is the supervised trained
model, and $D_{\mathrm{pretrain}}$ is the pretraining distribution. This objective

combines a per-token Kullback-Leibler penalty from the supervised


finetuning (SFT) model to mitigate over-optimization of the reward
model, with a pretraining loss term. These terms are weighted by the
parameters β and γ, respectively; a minimal code sketch of these reward
terms is given after this list.
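The two quantities at the heart of this procedure, the pairwise reward-model
loss and the KL-shaped reward optimized by PPO, can be written in a few
lines. The sketch below uses PyTorch and assumes the reward scores and
per-sequence log-probabilities have already been computed; it illustrates
the formulas above rather than reproducing the InstructGPT implementation
(the names reward_model_loss, shaped_reward, and the value of beta are
assumptions).

```python
import torch
import torch.nn.functional as F

def reward_model_loss(r_preferred: torch.Tensor,
                      r_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise loss for the reward model: -E[log sigmoid(r(x, y_w) - r(x, y_l))].
    Averaging over a batch that contains all C(K, 2) comparisons for each
    prompt implements the 1 / C(K, 2) normalization in the loss above."""
    return -F.logsigmoid(r_preferred - r_rejected).mean()

def shaped_reward(r: torch.Tensor, logp_rl: torch.Tensor,
                  logp_sft: torch.Tensor, beta: float = 0.02) -> torch.Tensor:
    """Reward maximized during the PPO phase: the reward-model score minus a
    KL-style penalty that keeps the RL policy close to the SFT model.
    The pretraining term gamma * E[log pi_RL(x)] is added separately."""
    return r - beta * (logp_rl - logp_sft)

# Toy usage with scores for two comparison pairs.
r_w, r_l = torch.tensor([1.2, 0.3]), torch.tensor([0.4, -0.1])
print(reward_model_loss(r_w, r_l))
```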

This procedure aligns the behavior of the language model to the stated
preferences of a specific group of people.
An example of models trained via RLHF is the InstructGPT series
(Ouyang et al., 2022). For these models, the 175B-parameter GPT-3 is used as
the SFT baseline model, while a smaller 6B model is used to model the reward.
A human evaluation of the outputs produced by InstructGPT showed the
efficacy of the RLHF technique: a set of labelers rated InstructGPT outputs
favorably, compared to the predictions of the non-instruction-trained baseline,
along multiple axes (e.g., compliance to the constraints in the prompt, lack
of hallucinations, use of appropriate language). An aggregation of this result
is displayed in Figure 6.4. The same procedure was used to train the popular
ChatGPT model, though with slight differences in the data collection setup.

Figure 6.4: Multiple models evaluated by how often their outputs were
preferred to those of the 175B SFT model, according to human evaluators.
InstructGPT (PPO-ptx), as well as its variant without pretraining mix
(PPO), significantly outperform GPT-3.

6.2 Scaling laws and Emergent Behavior

With the improved accuracy of pretrained language models, researchers
noticed improvements in performance with increasing model size.

Kaplan et al. (2020) demonstrate that the test-loss of an autoregressive


transformer exhibits a power-law relationship when the model’s performance
is constrained by only one of three factors: the number of non-embedding
parameters (N ), the size of the dataset (D), or the compute budget allocated
optimally (Cmin ). They analyze the relationship between language modeling
loss and these factors, with a focus on the Transformer architecture. The
broad range of performance levels in language tasks enables us to examine
trends across more than seven orders of magnitude in scale.
The authors identify precise power-law scaling patterns for performance
in relation to training time, context length, dataset size, model size, and
computational resources.

6.2.1 Summary of Scaling Laws


The test loss of a Transformer trained to autoregressively model language
can be predicted using a power-law when performance is limited by only
either the number of non-embedding parameters N , the dataset size D, or
the optimally allocated compute budget $C_{\min}$ (see Figure 6.5):

1. For models with a limited number of parameters, trained to convergence
   on sufficiently large datasets: $L(N) = (N_c/N)^{\alpha_N}$, with
   $\alpha_N \sim 0.076$ and $N_c \sim 8.8 \times 10^{13}$ (non-embedding parameters).

2. For large models trained with a limited dataset with early stopping:
   $L(D) = (D_c/D)^{\alpha_D}$, with $\alpha_D \sim 0.095$ and
   $D_c \sim 5.4 \times 10^{13}$ (tokens).

3. When training with a limited amount of compute, a sufficiently large
   dataset, an optimally-sized model, and a sufficiently small batch size:
   $L(C_{\min}) = (C_c^{\min}/C_{\min})^{\alpha_C^{\min}}$, with
   $\alpha_C^{\min} \sim 0.050$ and $C_c^{\min} \sim 3.1 \times 10^{8}$ (PF-days).
These relations hold across eight orders of magnitude in Cmin , six orders
of magnitude in N , and over two orders of magnitude in D. They depend
very weakly on model shape and other Transformer hyperparameters (depth,
width, number of self-attention heads). The power laws αN , αD , αCmin specify

the degree of performance improvement expected as we scale up N , D, or


$C_{\min}$; for example, doubling the number of parameters yields a loss that is
smaller by a factor $2^{-\alpha_N} = 0.95$. The precise numerical values of $N_c$, $C_c^{\min}$,
and $D_c$ depend on the vocabulary size and tokenization and hence do not
have a fundamental meaning.
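Because these fits are simple power laws, they are easy to evaluate directly.
The short sketch below plugs in the constants quoted above (whose absolute
values depend on the tokenizer, so only loss ratios are really meaningful)
and reproduces the doubling example; the function names are our own.

```python
# Evaluating the scaling laws with the fitted constants quoted above.
ALPHA_N, N_C = 0.076, 8.8e13   # non-embedding parameters
ALPHA_D, D_C = 0.095, 5.4e13   # tokens
ALPHA_C, C_C = 0.050, 3.1e8    # optimally allocated compute, in PF-days

def loss_from_params(n):       # L(N) = (N_c / N)^alpha_N
    return (N_C / n) ** ALPHA_N

def loss_from_data(d):         # L(D) = (D_c / D)^alpha_D
    return (D_C / d) ** ALPHA_D

def loss_from_compute(c_min):  # L(C_min) = (C_c^min / C_min)^alpha_C^min
    return (C_C / c_min) ** ALPHA_C

# Doubling the parameter count shrinks the loss by a factor 2^(-alpha_N) ~ 0.95.
n = 1e9
print(loss_from_params(2 * n) / loss_from_params(n))  # ~0.949
```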
Scaling a model can give rise to some emergent abilities. Wei et al.
(2022b) describe an emergent ability as
Figure 6.5: Language modeling performance (test loss) improves smoothly as
we increase the model size (non-embedding parameters), dataset size (tokens),
and amount of compute (PF-days) used for training. For optimal performance
all three factors must be scaled up in tandem.
Figure 6.6: Language model training runs, with models ranging in size from
$10^3$ to $10^9$ parameters (excluding embeddings). Left: larger models
require fewer samples to reach the same performance. Right: the optimal
model size grows smoothly with the loss target and compute budget;
compute-efficient training stops far short of convergence.

An ability is emergent if it is not present in smaller models but


is present in larger models.

Wei et al. (2022b) compare how models of varying sizes perform on various
tasks and show that abilities to perform some tasks emerge in language models
after a certain scale. Figures 6.7 and 6.8 show how language models that are
trained above a certain number of FLOPs exhibit increased performance on
certain tasks.
Figure 6.7: Eight examples of emergence in the few-shot prompting setting,
for LaMDA, GPT-3, Gopher, Chinchilla, and PaLM as a function of model scale
(training FLOPs): (A) modular arithmetic, (B) IPA transliteration, (C) word
unscrambling, (D) Persian QA, (E) TruthfulQA, (F) grounded mappings, (G)
multi-task NLU, and (H) word in context. Each point is a separate model.
The ability to perform a task via few-shot prompting is emergent when a
language model achieves random performance until a certain scale, after
which performance significantly increases to well above random.

Figure 6.8: On some benchmarks ((A) TriviaQA and (B) Physical QA with GPT-3,
(C) GSM8K with PaLM, (D) OKVQA with Flamingo), task-general models (not
explicitly trained to perform a task) prompted with a few shots surpass
prior state-of-the-art performance held by a task-specific (pretrain-finetune)
model once a sufficient model scale (number of parameters) is reached.
Index

A
adapter 27

F
fine-tuning 9
finetuning 25

M
masked language modelling 12

N
next sentence prediction 12
non-parametric 51

P
parametric 51
pretrained model 9
pretraining 9
prompting 8

S
self-supervision 9
Bibliography

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain


Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm
Reynolds, et al. 2022. Flamingo: A visual language model for few-shot
learning. arXiv preprint arXiv:2204.14198.

Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson,
Stephen Gould, and Lei Zhang. 2018. Bottom-up and top-down attention
for image captioning and visual question answering. In CVPR.

Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv


Batra, C Lawrence Zitnick, and Devi Parikh. 2015. VQA: Visual question
answering. In ICCV.

Alexei Baevski and Michael Auli. 2018. Adaptive input representations for
neural language modeling. arXiv preprint arXiv:1809.10853.

Hangbo Bao, Li Dong, and Furu Wei. 2022a. BEiT: BERT pre-training of
image transformers. In ICLR.

Hangbo Bao, Wenhui Wang, Li Dong, and Furu Wei. 2022b. VL-BEiT:
Generative vision-language pretraining. arXiv preprint arXiv:2206.01127.

Eyal Ben-David, Nadav Oved, and Roi Reichart. 2021. PADA: A prompt-
based autoregressive approach for adaptation to unseen domains.

Stevo Bozinovski. 2020. Reminder of the first paper on transfer learning in


neural networks, 1976. Informatica, 44(3).

Stevo Bozinovski and Ante Fulgosi. 1976. The influence of pattern similarity
and transfer learning upon training of a base perceptron b2. In Proceedings
of Symposium Informatica, volume 3, pages 121–126.


Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan,
Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry,
Amanda Askell, et al. 2020. Language models are few-shot learners.
Advances in neural information processing systems, 33:1877–1901.

Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh


Gupta, Piotr Dollár, and C. Lawrence Zitnick. 2015. Microsoft
COCO captions: Data collection and evaluation server. arXiv preprint
arXiv:1504.00325.

Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe
Gan, Yu Cheng, and Jingjing Liu. 2020. UNITER: Universal image-text
representation learning. In ECCV.

Eunsol Choi, Omer Levy, Yejin Choi, and Luke Zettlemoyer. 2018. Ultra-
fine entity typing. In Proceedings of the 56th Annual Meeting of the
Association for Computational Linguistics (Volume 1: Long Papers), pages
87–96, Melbourne, Australia. Association for Computational Linguistics.

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma,


Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles
Sutton, Sebastian Gehrmann, et al. 2022. Palm: Scaling language modeling
with pathways.

Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and
Dario Amodei. 2017. Deep reinforcement learning from human preferences.
Advances in neural information processing systems, 30.

Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William
Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma,
et al. 2022. Scaling instruction-finetuned language models. arXiv preprint
arXiv:2210.11416.

Kevin Clark, Minh-Thang Luong, Quoc V Le, and Christopher D Manning.


2020. Electra: Pre-training text encoders as discriminators rather than
generators. arXiv preprint arXiv:2003.10555.

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary,


Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke
Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised cross-lingual
representation learning at scale. In Proceedings of the 58th Annual Meeting
of the Association for Computational Linguistics, pages 8440–8451, Online.
Association for Computational Linguistics.

Alexis Conneau and Guillaume Lample. 2019. Cross-lingual language


model pretraining. In Advances in Neural Information Processing Systems,
volume 32. Curran Associates, Inc.

Damai Dai, Yutao Sun, Li Dong, Yaru Hao, Zhifang Sui, and Furu Wei. 2022.
Why can gpt learn in-context? language models secretly perform gradient
descent as meta optimizers. arXiv preprint arXiv:2212.10559.

Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav,
José MF Moura, Devi Parikh, and Dhruv Batra. 2017. Visual dialog. In
CVPR.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019.
BERT: Pre-training of deep bidirectional transformers for language under-
standing. In Proceedings of the 2019 Conference of the North American
Chapter of the Association for Computational Linguistics: Human Lan-
guage Technologies, Volume 1 (Long and Short Papers), pages 4171–4186,
Minneapolis, Minnesota. Association for Computational Linguistics.

Shizhe Diao, Wangchunshu Zhou, Xinsong Zhang, and Jiawei Wang. 2023.
Write and paint: Generative vision-language models are unified modal
learners. In The Eleventh International Conference on Learning Represen-
tations.

Ning Ding, Yujia Qin, Guang Yang, Fuchao Wei, Zonghan Yang, Yusheng
Su, Shengding Hu, Yulin Chen, Chi-Min Chan, Weize Chen, Jing Yi,
Weilin Zhao, Xiaozhi Wang, Zhiyuan Liu, Hai-Tao Zheng, Jianfei Chen,
Yang Liu, Jie Tang, Juanzi Li, and Maosong Sun. 2022. Delta tuning:
A comprehensive study of parameter efficient methods for pre-trained
language models. CoRR, abs/2203.06904.

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn,


Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer,
Georg Heigold, Sylvain Gelly, et al. 2021. An image is worth 16x16 words:
Transformers for image recognition at scale. In ICLR.

Zi-Yi Dou, Yichong Xu, Zhe Gan, Jianfeng Wang, Shuohang Wang, Lijuan
Wang, Chenguang Zhu, Zicheng Liu, Michael Zeng, et al. 2022. An
empirical study of training end-to-end vision-and-language transformers.
In CVPR.

Iddo Drori, Sarah Zhang, Reece Shuttleworth, Leonard Tang, Albert Lu,
Elizabeth Ke, Kevin Liu, Linda Chen, Sunny Tran, Newman Cheng, et al.

2022. A neural network solves, explains, and generates university math


problems by program synthesis and few-shot learning at human level.
Proceedings of the National Academy of Sciences, 119(32):e2123433119.

Zhe Gan, Linjie Li, Chunyuan Li, Lijuan Wang, Zicheng Liu, Jianfeng Gao,
et al. 2022. Vision-language pre-training: Basics, recent advances, and
future trends. Foundations and Trends® in Computer Graphics and
Vision, 14(3–4):163–352.

Tianyu Gao, Adam Fisch, and Danqi Chen. 2021. Making pre-trained
language models better few-shot learners. In Association for Computational
Linguistics (ACL).

Demi Guo, Alexander M. Rush, and Yoon Kim. 2021. Parameter-efficient


transfer learning with diff pruning. In Proceedings of the 59th Annual
Meeting of the Association for Computational Linguistics and the 11th
International Joint Conference on Natural Language Processing, ACL/I-
JCNLP 2021, (Volume 1: Long Papers), Virtual Event, August 1-6, 2021,
pages 4884–4896. Association for Computational Linguistics.

Michael Gutmann and Aapo Hyvärinen. 2010. Noise-contrastive estima-


tion: A new estimation principle for unnormalized statistical models. In
AISTATS.

Xu Han, Weilin Zhao, Ning Ding, Zhiyuan Liu, and Maosong Sun. 2021.
PTR: Prompt tuning with rules for text classification.

Adi Haviv, Jonathan Berant, and Amir Globerson. 2021. BERTese: Learning
to speak to BERT. In Proceedings of the 16th Conference of the European
Chapter of the Association for Computational Linguistics: Main Volume,
pages 3618–3623, Online. Association for Computational Linguistics.

C. Hohensee and J. Lobato. 2021. Transfer of Learning: Progressive Per-


spectives for Mathematics Education and Related Fields. Research in
Mathematics Education. Springer International Publishing.

Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin


De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly.
2019. Parameter-efficient transfer learning for NLP. In Proceedings of
the 36th International Conference on Machine Learning, volume 97 of
Proceedings of Machine Learning Research, pages 2790–2799. PMLR.

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi


Li, Shean Wang, Lu Wang, and Weizhu Chen. 2022. LoRA: Low-rank
adaptation of large language models. In International Conference on
Learning Representations.

Ting-Hao Huang, Francis Ferraro, Nasrin Mostafazadeh, Ishan Misra, Aish-


warya Agrawal, Jacob Devlin, Ross Girshick, Xiaodong He, Pushmeet
Kohli, Dhruv Batra, et al. 2016. Visual storytelling. In NAACL.

Zhicheng Huang, Zhaoyang Zeng, Yupan Huang, Bei Liu, Dongmei Fu, and
Jianlong Fu. 2021. Seeing Out of tHe bOx: End-to-End pre-training for
vision-language representation learning. In CVPR.

Zhicheng Huang, Zhaoyang Zeng, Bei Liu, Dongmei Fu, and Jianlong Fu.
2020. Pixel-BERT: Aligning image pixels with text by deep multi-modal
transformers. arXiv preprint arXiv:2004.00849.

Drew A Hudson and Christopher D Manning. 2019. GQA: A new dataset


for real-world visual reasoning and compositional question answering. In
CVPR.

Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham,
Quoc V Le, Yunhsuan Sung, Zhen Li, and Tom Duerig. 2021. Scaling
up visual and vision-language representation learning with noisy text
supervision. In ICML.

Zhengbao Jiang, Frank F. Xu, Jun Araki, and Graham Neubig. 2020. How
can we know what language models know? Transactions of the Association
for Computational Linguistics, 8:423–438.

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin


Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario
Amodei. 2020. Scaling laws for neural language models. arXiv preprint
arXiv:2001.08361.

Urvashi Khandelwal, Omer Levy, Dan Jurafsky, Luke Zettlemoyer, and Mike
Lewis. 2019. Generalization through memorization: Nearest neighbor
language models. arXiv preprint arXiv:1911.00172.

Wonjae Kim, Bokyung Son, and Ildoo Kim. 2021. ViLT: Vision-and-language
transformer without convolution or region supervision. In ICML.

Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and
Yusuke Iwasawa. 2022. Large language models are zero-shot reasoners.
arXiv preprint arXiv:2205.11916.

Jonathan Krause, Justin Johnson, Ranjay Krishna, and Li Fei-Fei. 2017.


A hierarchical approach for generating descriptive image paragraphs. In
CVPR.

Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata,
Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A
Shamma, et al. 2017. Visual genome: Connecting language and vision
using crowdsourced dense image annotations. IJCV.

Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin,
Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Alexan-
der Kolesnikov, et al. 2020. The open images dataset v4. IJCV.

Gukyeong Kwon, Zhaowei Cai, Avinash Ravichandran, Erhan Bas, Rahul


Bhotika, and Stefano Soatto. 2022. Masked vision and language modeling
for multi-modal representation learning. arXiv preprint arXiv:2208.02131.

Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush


Sharma, and Radu Soricut. 2019. Albert: A lite bert for self-supervised
learning of language representations. arXiv preprint arXiv:1909.11942.

Jaejun Lee, Raphael Tang, and Jimmy Lin. 2019. What would elsa do?
freezing layers during transformer fine-tuning. CoRR, abs/1911.03090.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman


Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020.
BART: denoising sequence-to-sequence pre-training for natural language
generation, translation, and comprehension. In ACL, pages 7871–7880.
Association for Computational Linguistics.

Gen Li, Nan Duan, Yuejian Fang, Ming Gong, and Daxin Jiang. 2020a.
Unicoder-VL: A universal encoder for vision and language by cross-modal
pre-training. In AAAI.

Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. 2022. Blip: Bootstrap-
ping language-image pre-training for unified vision-language understanding
and generation. In ICML.

Junnan Li, Ramprasaath R Selvaraju, Akhilesh Deepak Gotmare, Shafiq


Joty, Caiming Xiong, and Steven Hoi. 2021. Align before fuse: Vision and
language representation learning with momentum distillation. In NeurIPS.

Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei
Chang. 2019. VisualBERT: A simple and performant baseline for vision
and language. arXiv preprint arXiv:1908.03557.

Xiang Lisa Li and Percy Liang. 2021. Prefix-tuning: Optimizing continuous


prompts for generation. arXiv preprint arXiv:2101.00190.

Xiujun Li, Xi Yin, Chunyuan Li, Pengchuan Zhang, Xiaowei Hu, Lei Zhang,
Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, Yejin Choi, and Jianfeng
Gao. 2020b. Oscar: Object-semantics aligned pre-training for vision-
language tasks. In ECCV.

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona,
Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. 2014. Microsoft
COCO: Common objects in context. In ECCV.

Haokun Liu, Derek Tam, Muqeeth Mohammed, Jay Mohta, Tenghao Huang,
Mohit Bansal, and Colin Raffel. 2022. Few-shot parameter-efficient fine-
tuning is better and cheaper than in-context learning. In Advances in
Neural Information Processing Systems.

Jiachang Liu, Dinghan Shen, Yizhe Zhang, Bill Dolan, Lawrence Carin, and
Weizhu Chen. 2021a. What makes good in-context examples for gpt-3?
arXiv preprint arXiv:2101.06804.

Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and
Graham Neubig. 2023. Pre-train, prompt, and predict: A systematic survey
of prompting methods in natural language processing. ACM Computing
Surveys, 55(9):1–35.

Xiao Liu, Yanan Zheng, Zhengxiao Du, Ming Ding, Yujie Qian, Zhilin Yang,
and Jie Tang. 2021b. GPT understands, too. CoRR, abs/2103.10385.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen,
Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019.
Roberta: A robustly optimized bert pretraining approach. arXiv preprint
arXiv:1907.11692.

Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen
Lin, and Baining Guo. 2021c. Swin transformer: Hierarchical vision
transformer using shifted windows. In ICCV.

Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019. Vilbert: Pretrain-
ing task-agnostic visiolinguistic representations for vision-and-language
tasks. In NeurIPS.

Yao Lu, Max Bartolo, Alastair Moore, Sebastian Riedel, and Pontus Stene-
torp. 2021. Fantastically ordered prompts and where to find them: Over-
coming few-shot prompt order sensitivity. arXiv preprint arXiv:2104.08786.

Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi.


2019. OK-VQA: A visual question answering benchmark requiring external
knowledge. In CVPR.

Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. 2017.
Learned in translation: Contextualized word vectors. In Advances in
Neural Information Processing Systems 30: Annual Conference on Neural
Information Processing Systems 2017, December 4-9, 2017, Long Beach,
CA, USA, pages 6294–6305.

Bryan McCann, Nitish Shirish Keskar, Caiming Xiong, and Richard Socher.
2018. The natural language decathlon: Multitask learning as question
answering. arXiv preprint arXiv:1806.08730.

Tomas Mikolov, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. 2013. Ef-
ficient estimation of word representations in vector space. In International
Conference on Learning Representations.

Maxwell Nye, Anders Johan Andreassen, Guy Gur-Ari, Henryk Michalewski,


Jacob Austin, David Bieber, David Dohan, Aitor Lewkowycz, Maarten
Bosma, David Luan, et al. 2021. Show your work: Scratchpads for interme-
diate computation with language models. arXiv preprint arXiv:2112.00114.

Aaron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. 2017. Neural
discrete representation learning. In NeurIPS.

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright,


Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex
Ray, et al. 2022. Training language models to follow instructions with
human feedback. Advances in Neural Information Processing Systems,
35:27730–27744.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014.


GloVe: Global vectors for word representation. In Empirical Methods in
Natural Language Processing (EMNLP), pages 1532–1543.

Matthew E. Peters, Mark Neumann, Robert L. Logan IV, Roy Schwartz,


Vidur Joshi, Sameer Singh, and Noah A. Smith. 2019. Knowledge enhanced
contextual word representations. In Proceedings of the 2019 Conference on
Empirical Methods in Natural Language Processing and the 9th Interna-
tional Joint Conference on Natural Language Processing, EMNLP-IJCNLP
2019, Hong Kong, China, November 3-7, 2019, pages 43–54. Association
for Computational Linguistics.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher


Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word
representations. In Proceedings of the 2018 Conference of the North Amer-
ican Chapter of the Association for Computational Linguistics: Human
Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New
Orleans, Louisiana. Association for Computational Linguistics.

Fabio Petroni, Patrick Lewis, Aleksandra Piktus, Tim Rocktäschel, Yuxiang


Wu, Alexander H. Miller, and Sebastian Riedel. 2020. How context affects
language models’ factual predictions. In Automated Knowledge Base
Construction.

Fabio Petroni, Tim Rocktäschel, Alexander H. Miller, Patrick Lewis, Anton
Bakhtin, Yuxiang Wu, and Sebastian Riedel. 2019. Language models as
knowledge bases? In Proceedings of the 2019 Conference on Empirical Methods
in Natural Language Processing (EMNLP).

Jonas Pfeiffer, Ivan Vulić, Iryna Gurevych, and Sebastian Ruder. 2020. MAD-
X: An Adapter-Based Framework for Multi-Task Cross-Lingual Transfer.
In Proceedings of the 2020 Conference on Empirical Methods in Natural
Language Processing (EMNLP), pages 7654–7673, Online. Association for
Computational Linguistics.

Telmo Pires, Eva Schlinger, and Dan Garrette. 2019. How multilingual is
multilingual BERT? In Proceedings of the 57th Annual Meeting of the
Association for Computational Linguistics, pages 4996–5001, Florence,
Italy. Association for Computational Linguistics.

Bryan A Plummer, Liwei Wang, Chris M Cervantes, Juan C Caicedo, Julia


Hockenmaier, and Svetlana Lazebnik. 2015. Flickr30k entities: Collecting

region-to-phrase correspondences for richer image-to-sentence models. In


ICCV.

L. Y. Pratt. 1992. Discriminability-based transfer between neural networks.


In Advances in Neural Information Processing Systems, volume 5. Morgan-
Kaufmann.

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel
Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin,
Jack Clark, et al. 2021. Learning transferable visual models from natural
language supervision. In ICML.

Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al.


2018. Improving language understanding by generative pre-training.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya
Sutskever, et al. 2019. Language models are unsupervised multitask
learners. OpenAI blog, 1(8):9.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang,
Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring
the limits of transfer learning with a unified text-to-text transformer. J.
Mach. Learn. Res., 21:140:1–140:67.

Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec
Radford, Mark Chen, and Ilya Sutskever. 2021. Zero-shot text-to-image
generation. In ICML.

Sylvestre-Alvise Rebuffi, Hakan Bilen, and Andrea Vedaldi. 2017. Learning


multiple visual domains with residual adapters. In Advances in Neural
Information Processing Systems, volume 30. Curran Associates, Inc.

Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster r-cnn:
Towards real-time object detection with region proposal networks. In
NeurIPS.

Victor Sanh, Albert Webson, Colin Raffel, Stephen Bach, Lintang Sutawika,
Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Arun Raja, Manan Dey,
M Saiful Bari, Canwen Xu, Urmish Thakker, Shanya Sharma Sharma,
Eliza Szczechla, Taewoon Kim, Gunjan Chhablani, Nihal Nayak, Debajyoti
Datta, Jonathan Chang, Mike Tian-Jian Jiang, Han Wang, Matteo Manica,
Sheng Shen, Zheng Xin Yong, Harshit Pandey, Rachel Bawden, Thomas
Wang, Trishala Neeraj, Jos Rozen, Abheesht Sharma, Andrea Santilli,

Thibault Fevry, Jason Alan Fries, Ryan Teehan, Teven Le Scao, Stella
Biderman, Leo Gao, Thomas Wolf, and Alexander M Rush. 2022. Multitask
prompted training enables zero-shot task generalization. In International
Conference on Learning Representations.

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg
Klimov. 2017. Proximal policy optimization algorithms. arXiv preprint
arXiv:1707.06347.

Shuai Shao, Zeming Li, Tianyuan Zhang, Chao Peng, Gang Yu, Xiangyu
Zhang, Jing Li, and Jian Sun. 2019. Objects365: A large-scale, high-quality
dataset for object detection. In ICCV.

Sheng Shen, Liunian Harold Li, Hao Tan, Mohit Bansal, Anna Rohrbach,
Kai-Wei Chang, Zhewei Yao, and Kurt Keutzer. 2022. How much can
CLIP benefit vision-and-language tasks? In ICLR.

Taylor Shin, Yasaman Razeghi, Robert L. Logan IV, Eric Wallace, and
Sameer Singh. 2020. AutoPrompt: Eliciting knowledge from language
models with automatically generated prompts. In Empirical Methods in
Natural Language Processing (EMNLP).

Oleksii Sidorov, Ronghang Hu, Marcus Rohrbach, and Amanpreet Singh.


2020. Textcaps: A dataset for image captioning with reading comprehen-
sion. In ECCV.

Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen,


Dhruv Batra, Devi Parikh, and Marcus Rohrbach. 2019. Towards vqa
models that can read. In CVPR.

Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, and Jifeng Dai.
2019. VL-BERT: Pre-training of generic visual-linguistic representations.
In ICLR.

Alane Suhr, Stephanie Zhou, Ally Zhang, Iris Zhang, Huajun Bai, and Yoav
Artzi. 2019. A corpus for reasoning about natural language grounded in
photographs. In ACL.

Hao Tan and Mohit Bansal. 2019. LXMERT: Learning cross-modality encoder
representations from transformers. In EMNLP.

Wilson L. Taylor. 1953. “cloze procedure”: A new tool for measuring


readability. Journalism Quarterly, 30(4):415–433.

Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexan-


dre Sablayrolles, and Hervé Jégou. 2021. Training data-efficient image
transformers & distillation through attention. In ICML.
Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2015.
Show and tell: A neural image caption generator. In CVPR.
Eric Wallace, Shi Feng, Nikhil Kandpal, Matt Gardner, and Sameer Singh.
2019. Universal adversarial triggers for attacking and analyzing NLP.
In Proceedings of the 2019 Conference on Empirical Methods in Natu-
ral Language Processing and the 9th International Joint Conference on
Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China,
November 3-7, 2019, pages 2153–2162. Association for Computational
Linguistics.
Jianfeng Wang, Xiaowei Hu, Zhe Gan, Zhengyuan Yang, Xiyang Dai, Zicheng
Liu, Yumao Lu, and Lijuan Wang. 2021a. UFO: A unified transformer for
vision-language representation learning. arXiv preprint arXiv:2111.10023.
Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan,
Zicheng Liu, Ce Liu, and Lijuan Wang. 2022a. GIT: A generative image-to-
text transformer for vision and language. arXiv preprint arXiv:2205.14100.
Wenhui Wang, Hangbo Bao, Li Dong, and Furu Wei. 2021b. VLMo: Uni-
fied vision-language pre-training with mixture-of-modality-experts. arXiv
preprint arXiv:2111.02358.
Zirui Wang, Jiahui Yu, Adams Wei Yu, Zihang Dai, Yulia Tsvetkov, and
Yuan Cao. 2022b. SimVLM: Simple visual language model pretraining
with weak supervision. In ICLR.
Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu,
Brian Lester, Nan Du, Andrew M. Dai, and Quoc V Le. 2022a. Finetuned
language models are zero-shot learners. In International Conference on
Learning Representations.
Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian
Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler,
et al. 2022b. Emergent abilities of large language models. arXiv preprint
arXiv:2206.07682.
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc
Le, and Denny Zhou. 2022c. Chain of thought prompting elicits reasoning
in large language models. arXiv preprint arXiv:2201.11903.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi,
Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey,
et al. 2016. Google’s neural machine translation system: Bridging the gap
between human and machine translation. arXiv preprint arXiv:1609.08144.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdi-
nov, and Quoc V Le. 2019. Xlnet: Generalized autoregressive pretraining
for language understanding. In Advances in Neural Information Processing
Systems, volume 32. Curran Associates, Inc.

Dani Yogatama, Cyprien de Masson d’Autume, and Lingpeng Kong. 2021.


Adaptive semiparametric language models. Transactions of the Association
for Computational Linguistics, 9:362–373.

Fei Yu, Jiji Tang, Weichong Yin, Yu Sun, Hao Tian, Hua Wu, and Haifeng
Wang. 2021. ERNIE-VIL: Knowledge enhanced vision-language represen-
tations through scene graphs. In AAAI.

Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini,
and Yonghui Wu. 2022. Coca: Contrastive captioners are image-text
foundation models. TMLR.

Licheng Yu, Patrick Poirson, Shan Yang, Alexander C Berg, and Tamara L
Berg. 2016. Modeling context in referring expressions. In ECCV.

Weizhe Yuan, Graham Neubig, and Pengfei Liu. 2021. BARTScore: Evaluat-
ing generated text as text generation.

Elad Ben Zaken, Yoav Goldberg, and Shauli Ravfogel. 2022. Bitfit: Simple
parameter-efficient fine-tuning for transformer-based masked language-
models. In Proceedings of the 60th Annual Meeting of the Association for
Computational Linguistics (Volume 2: Short Papers), ACL 2022, Dublin,
Ireland, May 22-27, 2022, pages 1–9. Association for Computational Lin-
guistics.

Rowan Zellers, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. From
recognition to cognition: Visual commonsense reasoning. In CVPR.

Yan Zeng, Xinsong Zhang, Hang Li, Jiawei Wang, Jipeng Zhang, and
Wangchunshu Zhou. 2022. X2 -VLM: All-In-One pre-trained model for
vision-language tasks. arXiv preprint arXiv:2211.12402.

Pengchuan Zhang, Xiujun Li, Xiaowei Hu, Jianwei Yang, Lei Zhang, Lijuan
Wang, Yejin Choi, and Jianfeng Gao. 2021. VinVL: Revisiting visual
representations in vision-language models. In CVPR.

Yuhao Zhang, Victor Zhong, Danqi Chen, Gabor Angeli, and Christopher D.
Manning. 2017. Position-aware attention and supervised data improve
slot filling. In Proceedings of the 2017 Conference on Empirical Methods
in Natural Language Processing, pages 35–45, Copenhagen, Denmark.
Association for Computational Linguistics.

Zhengyan Zhang, Xu Han, Zhiyuan Liu, Xin Jiang, Maosong Sun, and Qun
Liu. 2019. ERNIE: Enhanced language representation with informative
entities. In Proceedings of the 57th Annual Meeting of the Association for
Computational Linguistics, pages 1441–1451, Florence, Italy. Association
for Computational Linguistics.

Zexuan Zhong, Dan Friedman, and Danqi Chen. 2021. Factual probing is
[MASK]: Learning vs. learning to recall. CoRR, abs/2104.05240.

Luowei Zhou, Hamid Palangi, Lei Zhang, Houdong Hu, Jason Corso, and Jian-
feng Gao. 2020. Unified vision-language pre-training for image captioning
and vqa. In AAAI.

Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun,
Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies:
Towards story-like visual explanations by watching movies and reading
books. In Proceedings of the IEEE international conference on computer
vision, pages 19–27.
