Lecture Notes
Language Models
May 9, 2023
Contents
1 Introduction
  1.1 Introduction

2 Transfer Learning
  2.1 Transfer Learning
  2.2 ELMo
  2.3 BERT
    2.3.1 BERT: Pre-training Objectives
    2.3.2 BERT Architecture
    2.3.3 Fine-tuning BERT
  2.4 Other Transformer Language Models
    2.4.1 BERT Variants
    2.4.2 GPTs
    2.4.3 Seq2seq TLMs

5 Multimodality
  5.1 Vision Language Models
    5.1.1 Vision-and-Language Tasks
    5.1.2 Vision-Language Models: Architectures
    5.1.3 Vision-Language Models: Pre-training Objectives
  5.2 Knowledge-Enhancement
    5.2.1 kNN LM
    5.2.2 Dynamic Gating kNN LM
    5.2.3 KnowBERT
    5.2.4 [Optional] ERNIE

6 Additional Topics
  6.1 Instruction-Based Training Procedures
    6.1.1 Instruction Tuning
    6.1.2 Reinforcement Learning from Human Feedback
  6.2 Scaling Laws and Emergent Behavior
    6.2.1 Summary of Scaling Laws
Chapter 1
Introduction
1.1 Introduction
Welcome to the class notes on “Training, Fine Tuning, Inference and Ap-
plications of Language Models” for the Large Language Models class (263-
5354-00L). This part of the class focuses on the practical aspects of im-
plementing large language models, their functionalities and applications.
Many universities are offering similar courses at the moment, e.g., CS324
at Stanford University (https://stanford-cs324.github.io/winter2022/)
and CS 600.471 (https://self-supervised.cs.jhu.edu/sp2023/) at Johns
Hopkins University. Their syllabi may serve as useful references.
Chapter 2
Transfer Learning
2.2 ELMo
As a first case study of transfer learning based on language models, we
consider ELMo (Peters et al., 2018). ELMo was one of the first successful
transfer learning models based on language modeling. It leverages the
language modeling task to learn word representations, that is, vectors meant
to represent the meaning of words. Whereas most previous approaches, such
as word2vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014),
trained static word representations, the word representations in ELMo are
context-dependent. That means that depending on the sentence fed to
ELMo, the representation of a particular word will differ in order to reflect
the meaning of that particular sentence. ELMo considers two separate
language models: a (standard) forward language model p_LM(y_t | y_<t) as well
as a backward language model p_LMB(y_t | y_>t), which defines
p_LMB(y) = Π_{t=1}^{T} p_LMB(y_t | y_{t+1}, ..., y_T). The two are trained jointly
with the objective

L_ELMo(θ) = Σ_{n=1}^{N} Σ_{t=1}^{T} [ log p_LM(y_t^(n) | y_<t^(n); θ→, θ′) + log p_LMB(y_t^(n) | y_>t^(n); θ←, θ′) ]        (2.1)
Now, in order to fine-tune for a specific task, we can use the context-specific
representations h_{t,l}^LM = [h→_{t,l}^LM ; h←_{t,l}^LM]. One could simply take the
last-layer representations h_{t,L}^LM and use those as input to a separate model
that is fine-tuned on another task. The original paper additionally experimented
with learning a task-specific representation as a scaled convex combination
over the hidden representations, as follows:

ELMo_t^task = γ^task Σ_{l=0}^{L} s_l^task h_{t,l}^LM,        (2.2)

where the s_l^task, with Σ_{l=0}^{L} s_l^task = 1, are the outputs of a softmax
function over the hidden layers.
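A minimal sketch of Eq. 2.2 in code; the layer representations, the layer scores s_l^task, and the scalar γ^task below are placeholder values rather than ELMo's actual parameters:

import numpy as np

def elmo_task_representation(layer_reps, s_task, gamma_task):
    """Scaled convex combination of the L+1 hidden representations of one token.

    layer_reps: array of shape (L+1, dim), one row per layer (incl. the embedding layer).
    s_task:     unnormalized layer scores, shape (L+1,); softmax-normalized below.
    gamma_task: task-specific scalar.
    """
    weights = np.exp(s_task - s_task.max())
    weights = weights / weights.sum()              # softmax: the weights sum to 1
    return gamma_task * (weights[:, None] * layer_reps).sum(axis=0)

# Toy usage with random placeholder representations (L = 2 layers plus embeddings).
reps = np.random.randn(3, 1024)
print(elmo_task_representation(reps, np.zeros(3), gamma_task=1.0).shape)   # (1024,)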
2.3 BERT
After the success of ELMo, Devlin et al. (2019) pre-trained BERT (Bidirectional
Encoder Representations from Transformers), a Transformer-based bidirectional
masked language model. BERT is pre-trained with the masked language modeling
objective on a large-scale text corpus. After pre-training, BERT can be fine-tuned
to perform different NLP tasks without the need to design task-specific
architectures. BERT advanced the state of the art on
many NLP benchmarks at the time it was proposed.
L_MLM(θ) = Σ_{n=1}^{N} Σ_{t=1}^{T} log p_MLM(y_t^(n) | y_<t^(n), y_>t^(n); θ) · 1{y_t^(n) = [MASK]}        (2.3)
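To make this concrete, the following sketch constructs masked-LM training inputs; the 15% masking rate and 80/10/10 replacement scheme follow Devlin et al. (2019), while the tokens and vocabulary are toy placeholders:

import random

def mask_tokens(tokens, vocab, mask_prob=0.15):
    """Return (input_tokens, labels): labels are None except at masked positions."""
    inputs, labels = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            labels.append(tok)                       # the model must recover this token
            r = random.random()
            if r < 0.8:
                inputs.append("[MASK]")              # 80%: replace with [MASK]
            elif r < 0.9:
                inputs.append(random.choice(vocab))  # 10%: random token
            else:
                inputs.append(tok)                   # 10%: keep the original token
        else:
            inputs.append(tok)
            labels.append(None)                      # no loss at unmasked positions
    return inputs, labels

vocab = ["my", "dog", "is", "cute", "he", "likes", "playing"]
print(mask_tokens("my dog is cute".split(), vocab))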
[Figure: the overall BERT pre-training (left) and fine-tuning (right) procedures, both operating on [CLS] / token / [SEP] input sequences.]
and context for the question answering task). By pre-training the model to
predict whether a given sentence follows another sentence, we can help the
model learn the dependencies and relationships between sentences.
The next sentence prediction objective is implemented as follows. Given
a pair of sentences, denoted as A and B, the model is trained to predict
whether B is the next sentence that follows A. This is done by feeding the
two sentences as input to the model, along with a special [CLS] token whose
final hidden state serves as the aggregate representation of the entire sequence. The model then
generates a binary output that indicates whether B is the next sentence.
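A sketch of how next sentence prediction training pairs can be assembled; the 50/50 sampling of true versus random continuations follows the description above, and everything else (documents, sentences) is illustrative:

import random

def make_nsp_example(doc_sentences, all_sentences):
    """Pick a sentence A and, with probability 0.5, its true successor B (label 1)
    or a random sentence from the corpus (label 0)."""
    i = random.randrange(len(doc_sentences) - 1)
    a = doc_sentences[i]
    if random.random() < 0.5:
        b, label = doc_sentences[i + 1], 1           # B really follows A
    else:
        b, label = random.choice(all_sentences), 0   # random negative
    tokens = ["[CLS]"] + a.split() + ["[SEP]"] + b.split() + ["[SEP]"]
    return tokens, label

doc = ["my dog is cute", "he likes playing", "we walk every day"]
corpus = doc + ["the meal was great", "paris is in france"]
print(make_nsp_example(doc, corpus))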
BERT is pre-trained with the combination of the masked language model-
ing objective and the next sentence prediction objective on English Wikipedia
data and the BookCorpus (Zhu et al., 2015) dataset, which is a collection of
over 11,000 books from various genres. The resulting corpus contains over
20 GB of text with over 3.3 billion words and approximately 800 million
sentences. BERT is pre-trained with a batch size of 256 sequences for 1M
steps.
Figure 2.4: BERT input representation. The input embeddings are the sum of the token embeddings, the segment embeddings, and the position embeddings.
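The summation in Figure 2.4 can be written down directly; below is a minimal PyTorch sketch, where the vocabulary size, hidden size, and maximum length are arbitrary placeholders:

import torch
import torch.nn as nn

class BertEmbeddings(nn.Module):
    def __init__(self, vocab_size=30522, hidden=768, max_len=512):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, hidden)    # token embeddings
        self.seg = nn.Embedding(2, hidden)             # segment (A/B) embeddings
        self.pos = nn.Embedding(max_len, hidden)       # learned position embeddings
        self.norm = nn.LayerNorm(hidden)

    def forward(self, token_ids, segment_ids):
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        x = self.tok(token_ids) + self.seg(segment_ids) + self.pos(positions)
        return self.norm(x)                            # BERT additionally applies dropout here

emb = BertEmbeddings()
ids = torch.randint(0, 30522, (1, 11))
segs = torch.tensor([[0] * 6 + [1] * 5])
print(emb(ids, segs).shape)                            # torch.Size([1, 11, 768])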
[Figure: BERT fine-tuning setups — sentence-pair classification (input [CLS] Tok 1 ... Tok N [SEP] Tok 1 ... Tok M) and single-sentence classification, each predicting a class label from the [CLS] representation.]
L = Σ_{i=1}^{T} Σ_{j=0}^{i−1} log p_θ(x_{π_{i,j}} | x_{<π_{i,j}})        (2.4)

where x_{π_{i,j}} is the j-th token in the permutation of the first i tokens in
the input sequence and x_{<π_{i,j}} is the set of tokens that appear before the
j-th token in the permutation of the first i tokens.
In addition to the permutation language modeling objective, XLNet
also increases the size of pre-training data and pre-training computation
budget. By combining these improvements, XLNet achieved state-of-the-art
performance on several NLP tasks at the time.
Figure 2.8: Illustration of the ELECTRA model and the replaced token detection framework: a generator (typically a small MLM) fills in the [MASK]ed positions of the input (e.g., turning "the chef [MASK] the meal" into "the chef ate the meal"), and the discriminator (ELECTRA) predicts for every token whether it is the original or a replacement.
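A sketch of the replaced token detection setup; the generator here is a trivial stand-in that samples random vocabulary items rather than a small MLM, so this only illustrates how the discriminator's targets are formed:

import random

def make_rtd_example(tokens, vocab, mask_prob=0.15):
    """Corrupt some positions with generator samples; the discriminator's target is
    1 where the token was replaced and 0 where it is the original token."""
    corrupted, labels = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            sample = random.choice(vocab)          # a real generator would be a small MLM
            corrupted.append(sample)
            labels.append(0 if sample == tok else 1)
        else:
            corrupted.append(tok)
            labels.append(0)
    return corrupted, labels

vocab = ["the", "chef", "cooked", "ate", "a", "meal"]
print(make_rtd_example("the chef cooked the meal".split(), vocab))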
2.4.2 GPTs
We now describe another important type of pre-trained language model:
Transformer decoder models. These models are true language models according
to the definition and can be used for generative tasks. The most representative
models of this type are the GPT family pre-trained by OpenAI, which we
introduce below.
GPT A few months before BERT was released, Radford et al. (2018) pre-trained
the first version of the GPT (Generative Pre-training Transformers) family.
GPT is a decoder-only Transformer language model pre-trained with the standard
language modeling objective on large-scale web text. After pre-training,
GPT can be fine-tuned on various natural language understanding
tasks by transforming the inputs into a single text sequence. The sequence
is fed into the GPT model and the final hidden state is used as the sequence
representation, which is then fed into a fully-connected layer to produce
predictions. The GPT fine-tuning procedure is illustrated in Figure 2.9.
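A minimal sketch of this fine-tuning setup; the backbone below is a toy stand-in for a pre-trained GPT (an embedding plus a single Transformer layer), and the label count is illustrative:

import torch
import torch.nn as nn

class GPTClassifier(nn.Module):
    """Wrap a decoder-only LM with a linear head on the final hidden state,
    as in the GPT fine-tuning recipe described above."""
    def __init__(self, backbone, hidden_size, num_labels):
        super().__init__()
        self.backbone = backbone                   # any module mapping ids -> (B, T, H)
        self.head = nn.Linear(hidden_size, num_labels)

    def forward(self, token_ids):
        hidden = self.backbone(token_ids)          # (batch, seq_len, hidden)
        last = hidden[:, -1, :]                    # final hidden state of the sequence
        return self.head(last)                     # logits for the downstream task

# Toy backbone standing in for a pre-trained GPT.
backbone = nn.Sequential(
    nn.Embedding(1000, 64),
    nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True),
)
clf = GPTClassifier(backbone, hidden_size=64, num_labels=2)
print(clf(torch.randint(0, 1000, (3, 20))).shape)   # torch.Size([3, 2])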
GPT-2 After the success of GPT, OpenAI further scaled generative pre-training
with GPT-2 (Radford et al., 2019), using larger models and more
training data. In addition to better performance with fine-tuning, they
also find that generative pre-training on large-scale text data makes the
[Figure: T5 casts every task as text-to-text; e.g., the input "stsb sentence1: The rhino grazed on the grass. sentence2: A rhino is grazing in a field." is mapped to the output "3.8", and a CoLA-style input to "not acceptable".]
task description and a few demonstrations of the task to the original input
as the input for the language model, which produces the prediction by
auto-regressively predicting next tokens. Different
Figure 2.13: Illustrations of the text infilling objective in T5 and other noise functions investigated in BART (token masking, sentence permutation, document rotation).
NLP tasks.
T5 is pre-trained by mixing self-supervised data with multi-task supervised
data. T5 uses a "text infilling" objective, which is shown to outperform
other possible variants, as the source of self-supervision. As illustrated in
Figure 2.13a, it randomly masks contiguous spans of tokens and trains the
model to predict the masked text spans. Raffel et al. (2020) collected the C4
("Colossal Clean Crawled Corpus") dataset and constructed massive text infilling
data from it. For supervised data, T5 converts training data for GLUE,
SuperGLUE, machine translation, text summarization, question answering,
etc., into a text-to-text format by prepending task-specific prefixes. During
pre-training, T5 randomly samples text-to-text training data according to
the size of the different tasks.
After pre-training, T5 can achieve competitive performance on supervised
pre-training tasks without fine-tuning by prepending the corresponding prefix
to the input. T5 also achieves state-of-the-art performance on a wide range
of NLP tasks with task-specific fine-tuning.
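The text-to-text conversion can be illustrated with a couple of toy examples; the prefixes "translate English to German:" and "summarize:" appear in Raffel et al. (2020), while the summarization sentence is made up:

def to_text_to_text(task_prefix, source, target):
    """Cast any supervised example as an (input text, output text) pair for T5."""
    return f"{task_prefix} {source}", target

examples = [
    to_text_to_text("translate English to German:", "The house is wonderful.",
                    "Das Haus ist wunderbar."),
    to_text_to_text("summarize:", "state authorities dispatched emergency crews after the storm ...",
                    "authorities dispatched emergency crews"),
]
for inp, out in examples:
    print(f"input : {inp}\noutput: {out}\n")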
Pretrained language models are used in a wide range of NLP tasks. As model
sizes grow larger and larger, it becomes more and more difficult to tune the
model with a limited amount of annotated data: overfitting happens easily,
and tuning the full model is expensive. To avoid these issues, various
parameter-efficient tuning methods have been proposed. In the following
sections, we will first introduce partial fine-tuning techniques. They
are simple but effective and are widely used in Transformer model tuning.
Besides selecting a subset of parameters for tuning, another line of work explores
how to keep the model frozen and add extra tuned parameters. These newly
added and tuned modules are called adapters.
[Figures: accuracy relative to full fine-tuning as a function of the number of trainable parameters per task, comparing adapters with fine-tuning only the top layers; and the placement of (invertible language, task) adapters around the multi-headed attention and feed-forward sublayers of a Transformer layer.]
To mitigate this, invertible adapters are stacked on top of the embedding layer
to better accommodate the target language. Furthermore, since the input
and output embeddings are tied, the inverses of the adapters are placed before
the output embedding layer to revert the transformation for inference. Invertible
adapters split an input embedding vector e into two vectors of equal
dimensionality, e_1 and e_2, and apply the following transformations:

o_1 = F(e_2) + e_1
o_2 = G(o_1) + e_2        (3.10)
o = [o_1, o_2]
Its inverse can be computed as follows:

e_2 = o_2 − G(o_1)
e_1 = o_1 − F(e_2)        (3.11)
e = [e_1, e_2]

F and G can be arbitrary non-linear functions. In Pfeiffer et al. (2020), a
function similar to the language and task adapters is used (minus the residual
connection):
F(x) = ReLU(x W_down,F) W_up,F
G(x) = ReLU(x W_down,G) W_up,G        (3.12)
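Eqs. 3.10–3.12 translate almost line by line into code; here is a minimal PyTorch sketch, with the embedding and bottleneck dimensions chosen arbitrarily:

import torch
import torch.nn as nn

class Coupling(nn.Module):
    """F or G from Eq. 3.12: a down-/up-projection with a ReLU, no residual."""
    def __init__(self, dim, bottleneck):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck, bias=False)
        self.up = nn.Linear(bottleneck, dim, bias=False)

    def forward(self, x):
        return self.up(torch.relu(self.down(x)))

class InvertibleAdapter(nn.Module):
    def __init__(self, emb_dim, bottleneck=64):
        super().__init__()
        half = emb_dim // 2
        self.F = Coupling(half, bottleneck)
        self.G = Coupling(half, bottleneck)

    def forward(self, e):                       # Eq. 3.10
        e1, e2 = e.chunk(2, dim=-1)
        o1 = self.F(e2) + e1
        o2 = self.G(o1) + e2
        return torch.cat([o1, o2], dim=-1)

    def inverse(self, o):                       # Eq. 3.11
        o1, o2 = o.chunk(2, dim=-1)
        e2 = o2 - self.G(o1)
        e1 = o1 - self.F(e2)
        return torch.cat([e1, e2], dim=-1)

adapter = InvertibleAdapter(emb_dim=768)
e = torch.randn(4, 768)
print(torch.allclose(adapter.inverse(adapter(e)), e, atol=1e-5))   # True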
Invertible adapters are trained together with language adapters using MLM
and are also fixed during task-specific fine-tuning. The architecture of an
invertible adapter and its inverse is shown in Fig. 3.4.
3.2.2 AdapterFusion
[Optional Reading] LoRA
Bottleneck adapters introduce extra compute in the adapter layers, which cannot
be bypassed with model parallelism since they have to be processed
sequentially. Even though adapter layers are designed to have very few
parameters, they can still lead to non-negligible latency, especially during
inference. To address this issue, Hu et al. (2022) propose LoRA (Fig. 3.5),
which injects trainable low-rank matrices to approximate the weight updates.
For a pre-trained weight matrix W ∈ R^{d×k} (e.g., the query and value projection
matrices in multi-head attention), the update is computed with a
low-rank decomposition:
ΔW = BA        (3.13)

h = Wx + (α/r) · ΔWx = Wx + (α/r) · BAx.        (3.14)
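A sketch of Eqs. 3.13–3.14 for a single linear projection; the rank, scaling, and dimensions below are illustrative, and Hu et al. (2022) apply this to the query and value projections:

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """y = W x + (alpha / r) * B A x, with W frozen and only A, B trained."""
    def __init__(self, d_in, d_out, r=8, alpha=16):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(d_out, d_in), requires_grad=False)  # frozen W
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)    # A: small random init
        self.B = nn.Parameter(torch.zeros(d_out, r))          # B = 0, so ΔW = BA starts at 0
        self.scaling = alpha / r

    def forward(self, x):
        base = x @ self.weight.T
        update = (x @ self.A.T) @ self.B.T                    # low-rank path, rank r
        return base + self.scaling * update

layer = LoRALinear(d_in=768, d_out=768)
print(layer(torch.randn(2, 768)).shape)                              # torch.Size([2, 768])
print(sum(p.numel() for p in layer.parameters() if p.requires_grad)) # trainable parameters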
Figure 3.5: LoRA. The pre-trained weights W ∈ R^{d×d} are frozen; the update is parameterized by low-rank matrices A = N(0, σ²) and B = 0 of rank r.

Figure 3.6: Accuracy of PEFT methods applied to T0-3B (Sanh et al., 2022), plotted against the percentage of parameters updated per task ((IA)³, LoRA, BitFit, prompt tuning, prefix tuning, adapters, Compacter, FISH Mask, Intrinsic SAID, and tuning all parameters).
classification/generation tasks).
A function f_fill(x′, z) fills in the location [Z] in the prompt x′ with a
potential answer z. We then search over the set of all potential answers by
calculating the probability of their corresponding filled prompts under a
pre-trained LM P(·; θ):

ẑ = search_{z∈Z} P(f_fill(x′, z); θ).        (4.1)
This search function could be an argmax search that searches for the highest-
scoring output or various sampling techniques that randomly generate outputs
following the probability distribution of the LM.
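A sketch of Eq. 4.1 with argmax search; the scoring function below is a placeholder that a real implementation would replace with the log-probability of the filled prompt under a pre-trained LM:

def f_fill(prompt_template, z):
    """Fill the answer slot [Z] of a prompt with a candidate answer z."""
    return prompt_template.replace("[Z]", z)

def argmax_search(prompt_template, candidates, score_fn):
    """Return the candidate whose filled prompt scores highest (Eq. 4.1)."""
    return max(candidates, key=lambda z: score_fn(f_fill(prompt_template, z)))

# Placeholder scorer: a real implementation would compute log P(text; theta) with an LM.
def toy_score(text):
    return -len(text)      # prefers shorter filled prompts, purely for illustration

template = "The capital of France is [Z]."
print(argmax_search(template, ["Paris", "Marseille", "Toulouse"], toy_score))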
Discrete prompts
Discrete prompts (also sometimes known as hard prompts), as the name sug-
gests, automatically search for prompts in a discrete space, i.e. the prompts
are usually text strings corresponding to natural language. Various methods
have been proposed in this line of work and we explore a few approaches
here.
Jiang et al. (2020) introduced a mining-based approach that automatically
identifies templates from a given set of training inputs x and outputs y. This
approach searches for strings in a large text corpus, such as Wikipedia, that
contain both x and y, and identifies the middle words or dependency paths
between them that can be used as templates, e.g. in the form of “[X] middle
words [Z]”.
Paraphrasing methods involve taking an original prompt and creating var-
ious alternative prompts to use as training data. These methods include
translating the prompt into another language and back (Jiang et al., 2020),
using a thesaurus to replace words (Yuan et al., 2021), or using a neural
prompt rewriter designed to improve the accuracy of systems that use the
prompt (Haviv et al., 2021).
Treating prompts as generation tasks is an obvious choice and Gao et al.
(2021) used the pre-trained T5 model for the template search process. Ben-
David et al. (2021) further extended it and proposed a domain adaptation
algorithm that trains T5 to generate unique domain-relevant features.
Continuous prompts
In Eq. 4.2, h_{<i} = [h_{<i}^(1); · · · ; h_{<i}^(n)] is the concatenation of all neural network
layers at time step i. It is copied from M_φ directly if the corresponding time
step is within the prefix (h_i is M_φ[i]); otherwise it is computed using the
pre-trained LM.
However, it can be beneficial to combine discrete prompts with continuous
approaches, as prefix tuning might be very sensitive to changes in the prompt.
Zhong et al. (2021) take advantage of such a hybrid approach by first defining
a template using AutoPrompt (Shin et al., 2020), a discrete search method,
then initializing virtual tokens based on this discovered prompt and fine-tuning
their embeddings to increase performance. Similarly, Liu et al. (2021b)
propose "P-tuning", where continuous prompts are learned by inserting
trainable variables into the embedded input. Han et al. (2021) further propose
prompt tuning with rules (PTR), where logic rules create the templates
alongside the virtual tokens whose embeddings come from the pre-trained LM.
Figure 4.1: Results of GPT-3 on the TriviaQA dataset, using different numbers of exemplars in the few-shot prompt. From Brown et al. (2020).
4.3.2 Chain of Thought Prompting
Multiple studies investigating the reasoning capabilities of large LMs, high-
Figure 4.2: Example of chain of thought prompting, from Wei et al. (2022c). With standard prompting, the exemplar answer gives only the final result ("The answer is 11."), and the model answers the cafeteria question incorrectly ("The answer is 27."). With chain of thought prompting, the exemplar spells out the intermediate reasoning ("Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11. The answer is 11."), and the model reasons through the new question correctly (23 − 20 = 3, 3 + 6 = 9).
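A sketch of how a few-shot chain-of-thought prompt is assembled before being sent to an LM; the exemplar follows the figure above, and the query question is arbitrary:

def build_cot_prompt(exemplars, question):
    """Concatenate (question, worked reasoning + answer) exemplars, then the new question."""
    parts = []
    for q, rationale_and_answer in exemplars:
        parts.append(f"Q: {q}\nA: {rationale_and_answer}\n")
    parts.append(f"Q: {question}\nA:")
    return "\n".join(parts)

exemplars = [(
    "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?",
    "Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. "
    "5 + 6 = 11. The answer is 11.",
)]
print(build_cot_prompt(exemplars,
    "The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, "
    "how many apples do they have?"))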
[Figure 4.3: Performance of three large LMs (LaMDA, GPT-3, PaLM) on three math word problem datasets (GSM8K, SVAMP, MAWPS) with standard and chain of thought prompting, from Wei et al. (2022c). Chain of thought reasoning is an emergent ability of increasing model scale.]
Chapter 5
Multimodality
section 5.1.1 and then present popular model architectures and pre-training
objectives for vision-language models (VLMs) in section 5.1.2 and section 5.1.3,
respectively.
• Image-Text Tasks. Image-text tasks are tasks that include images and
texts in their inputs and outputs. Image-text tasks are arguably the most
important and well studied tasks in VL research. We introduce the most
representative image-text tasks below:
The text and visual features are then fed into a multimodal fusion module
to produce cross-modal representations, which are then optionally fed into a
decoder before generating the final outputs. An illustration of this general
framework is shown in Figure 5.2.
In many cases, there are no clear boundaries among image/text backbones,
multimodal fusion module, and the decoder. In this paper, we refer to the part
of the model that only takes image/text features as input as the corresponding
vision/text encoder, and the part of the model that takes both image and text
features as input as the multimodal fusion module. Besides this, if there are
additional modules that take the multimodal features as input to generate
the output, we call them the decoder. We next describe different kinds of vision
encoders, text encoders, and fusion modules in detail.
Vision Encoder. There are three types of vision encoders: (i) an object
detector (OD), (ii) a plain CNN, and (iii) a vision Transformer.
• OD. The most widely used object detector for VL research is the Faster R-
CNN (Ren et al., 2015) pre-trained on the Visual Genome (VG) dataset (Kr-
ishna et al., 2017) as in BUTD (Anderson et al., 2018). In VinVL (Zhang
et al., 2021), a stronger OD model based on the ResNeXt-152 C4 architec-
ture is pre-trained on multiple public OD datasets (including COCO (Chen
et al., 2015), OpenImages (Kuznetsova et al., 2020), Objects365 (Shao
et al., 2019) and VG), and a significant performance boost is observed across
a wide range of VL tasks by using this stronger OD model. Additional
care is taken to encode the location information of image regions, which is
typically represented as a 7-dimensional vector [x1, y1, x2, y2, w, h, w·h]
(normalized top/left/bottom/right coordinates, width, height, and area). Both
visual and location features are then fed through a fully-connected layer, to be
projected into the same embedding space. The final embedding for each region is
obtained by summing up the two FC outputs and then passing through a
layer normalization layer; a minimal code sketch of this region embedding is
given below, after the list of encoder types.
• ViT. Some recent pre-trained VLMs such as ALBEF (Li et al., 2021) and
ViLT (Kim et al., 2021) use Transformer-based vision encoders. Follow-
ing Dosovitskiy et al. (2021), an image is first split into image patches,
which are then flattened into vectors and linearly projected to obtain
patch embeddings. A learnable special token [CLS] embedding is also
prepended to the sequence. These patch embeddings, when summed up
together with learnable 1D position embeddings and a potential image-type
embedding, are sent into a multi-layer Transformer block to obtain the
final output image features. Different ViT variants have been studied for
VLP, such as plain ViT (Dosovitskiy et al., 2021), DeiT (Touvron et al.,
2021), BEiT (Bao et al., 2022a), Swin Transformer (Liu et al., 2021c), and
CLIP-ViT (Radford et al., 2021), to name a few.
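Returning to the OD-based region features described above, here is a minimal sketch of the region embedding; the 2048-dimensional visual features correspond to the commonly used Faster R-CNN backbone, while the projection size is arbitrary:

import torch
import torch.nn as nn

class RegionEmbedding(nn.Module):
    """Project visual features and 7-d location features, sum them, then LayerNorm."""
    def __init__(self, visual_dim=2048, hidden=768):
        super().__init__()
        self.visual_fc = nn.Linear(visual_dim, hidden)
        self.loc_fc = nn.Linear(7, hidden)          # [x1, y1, x2, y2, w, h, w*h]
        self.norm = nn.LayerNorm(hidden)

    def forward(self, visual_feats, boxes):
        # boxes: normalized (x1, y1, x2, y2) per region
        w = boxes[..., 2] - boxes[..., 0]
        h = boxes[..., 3] - boxes[..., 1]
        loc = torch.cat([boxes, w.unsqueeze(-1), h.unsqueeze(-1), (w * h).unsqueeze(-1)], dim=-1)
        return self.norm(self.visual_fc(visual_feats) + self.loc_fc(loc))

emb = RegionEmbedding()
print(emb(torch.randn(1, 36, 2048), torch.rand(1, 36, 4)).shape)   # torch.Size([1, 36, 768])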
In a nutshell, no matter what vision encoder is used, the input image
is represented as a set of feature vectors v = {v1 , · · · , vM }. VLMs can be
categorized into non end-to-end models, which use an OD model to get vision
features for the model, and end-to-end models that directly take raw images
as input.
Text Encoder. Following BERT (Devlin et al., 2019) and RoBERTa (Liu
et al., 2019), VLMs first segment the input sentence into a sequence of
subwords and then insert two special tokens at the beginning and the end
of the sentence to generate the input text sequence. After we obtain the
text embeddings, existing works either feed them directly to the multimodal
fusion module (Li et al., 2019; Chen et al., 2020), or to several text-specific
layers (Tan and Bansal, 2019; Lu et al., 2019) before the fusion. For the
former, the fusion module is typically initialized with BERT, and the role of
Multimodal Fusion. For dual encoders like CLIP (Radford et al., 2021)
and ALIGN (Jia et al., 2021), fusion essentially amounts to computing the similarity
between the representations of the two modalities, which is typically performed
via a dot-product between two global image and text feature vectors. A
fusion encoder, in contrast, takes both v = {v1 , · · · , vM } and w = {w1 , · · · , wN } as
input, and learns contextualized multimodal representations denoted as
ṽ = {ṽ1 , · · · , ṽM } and w̃ = {w̃1 , · · · , w̃N }. There are mainly two types
of fusion modules, namely, merged attention and co-attention, shown in
Figure 5.3. Specifically,
• In a merged attention module, the text and visual features are simply
concatenated together, and then fed into a single Transformer block. This
design has been used in many previous works, such as VisualBERT (Li
et al., 2019), Unicoder-VL (Li et al., 2020a), VL-BERT (Su et al., 2019),
UNITER (Chen et al., 2020), OSCAR (Li et al., 2020b), VinVL (Zhang
et al., 2021), ViLT (Kim et al., 2021), GIT (Wang et al., 2022a), etc.
• In a co-attention module, on the other hand, the text and visual features
are fed into different Transformer blocks independently, and techniques
such as cross-attention are used to enable cross-modal interaction. This
design has been used in LXMERT (Tan and Bansal, 2019), ViLBERT (Lu
et al., 2019), ERNIE-ViL (Yu et al., 2021), METER (Dou et al., 2022),
etc. Also, in many models, only image-to-text cross-attention modules are
used, such as ALBEF (Li et al., 2021), BLIP (Li et al., 2022), CoCa (Yu
et al., 2022), and Flamingo (Alayrac et al., 2022). Most ViT-based models
adopt the co-attention design, since the image sequence can be very long and
merged attention over the concatenated sequence can be computationally inefficient.
Figure 5.3: Co-attention and merged attention design for multimodal fusion.
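To make the two designs in Figure 5.3 concrete, here is a minimal single-layer sketch of each fusion style; dimensions and head counts are arbitrary, and real models stack M such blocks with feed-forward sublayers, residual connections, and layer norm:

import torch
import torch.nn as nn

d, heads = 256, 4
text = torch.randn(2, 12, d)     # (batch, N text tokens, dim)
image = torch.randn(2, 49, d)    # (batch, M image patches/regions, dim)

# Merged attention: concatenate both modalities and run a single self-attention.
self_attn = nn.MultiheadAttention(d, heads, batch_first=True)
merged = torch.cat([text, image], dim=1)
fused, _ = self_attn(merged, merged, merged)
print(fused.shape)               # torch.Size([2, 61, 256])

# Co-attention: each modality attends to the other with cross-attention.
t2i = nn.MultiheadAttention(d, heads, batch_first=True)
i2t = nn.MultiheadAttention(d, heads, batch_first=True)
text_fused, _ = t2i(text, image, image)     # text queries attend to image keys/values
image_fused, _ = i2t(image, text, text)     # image queries attend to text keys/values
print(text_fused.shape, image_fused.shape)  # [2, 12, 256] and [2, 49, 256]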
where θ denotes the trainable parameters. Each pair (w̃, ṽ) is sampled from
the whole training set D. There are several MLM variants used in VLP.
Specifically,
• LM: Direct language modeling is used in BLIP (Li et al., 2022) and
CoCa (Yu et al., 2022) for VLP. The model predicts the caption given an
image token-by-token autoregressively.
objective is proposed, where a sentence is first split into two parts, and the
bi-directional attention is enabled on the prefix sequence and the input
image, while a causal attention mask is adopted on the remaining tokens.
L_ITM(θ) = −E_{(w̃,ṽ)∼D} [ y log s_θ(w̃, ṽ) + (1 − y) log(1 − s_θ(w̃, ṽ)) ].        (5.2)
• For OD-based VLP models, e.g., LXMERT (Tan and Bansal, 2019)
and UNITER (Chen et al., 2020), some of the input regions are randomly
masked (i.e., the visual features of the masked regions are replaced by
zeros), and the model is trained to regress the original region features via
minimizing the mean squared error loss. Researchers (Tan and Bansal,
2019; Lu et al., 2019; Chen et al., 2020) have also tried to first generate
object labels for each region using a pre-trained object detector, which
can contain high-level semantic information, and the model is trained to
predict the object labels for the masked regions instead of the original
region features.
• For end-to-end VLP models, e.g., ViLT (Kim et al., 2021) and ME-
TER (Dou et al., 2022), researchers have investigated the use of masked
patch regression/classification for masked image modeling. Specifically,
5.2 Knowledge-Enhancement
Language models are known to contain knowledge on their own (Petroni
et al., 2019; Petroni et al., 2020), and we cover the extraction of this knowledge
in a later chapter. Consider a question answering system based on a large
language model. When asked about the current president, it is able to provide
the correct answer. But on the day of an inauguration, the true answer changes,
while the model keeps the old one. Because the information
about the current president is stored in the model's parameters, the only
reliable way of updating this information in the model is to fine-tune it or
re-train it from scratch. However, changing a small piece of information
should not require these complicated and expensive operations. For this, we
turn to knowledge-enhanced language models.
Models which contain information indirectly in their parameters are
called parametric; their counterparts, models which rely on external
knowledge sources, are called non-parametric.
5.2.1 kNN LM
The language model proposed by Khandelwal et al. (2019) utilizes memorization
of the training data during inference in order to predict otherwise
sparsely occurring words. In this approach, the authors first compute
representations of all sentence prefixes (the last self-attention layer) and store
each of them as a mapping to the word that follows. At inference time, the
nearest stored prefixes are retrieved and their mapped words are combined
into a single distribution over the vocabulary.
Formally, this alone would create its own kNN-based language model p_ξ.
However, its output is interpolated with the main model's own prediction using
a manually set hyperparameter λ ∈ [0, 1]: the higher it is, the more weight is put
on the retrieved prediction as opposed to the current model's prediction.
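A sketch of the whole pipeline with a toy datastore; the distance-based softmax over retrieved neighbors follows Khandelwal et al. (2019), while the representations and stored entries below are random placeholders:

import numpy as np

def knn_lm_distribution(query, keys, values, vocab_size, k=4, temperature=1.0):
    """p_kNN(w | context): softmax over negative distances to the k nearest stored
    prefixes, with each neighbor voting for its stored next word."""
    dists = np.linalg.norm(keys - query, axis=1)
    nearest = np.argsort(dists)[:k]
    scores = np.exp(-dists[nearest] / temperature)
    p = np.zeros(vocab_size)
    for idx, s in zip(nearest, scores):
        p[values[idx]] += s
    return p / p.sum()

def interpolate(p_lm, p_knn, lam=0.25):
    """Final prediction: lam * p_kNN + (1 - lam) * p_LM."""
    return lam * p_knn + (1 - lam) * p_lm

vocab_size, dim = 100, 16
keys = np.random.randn(1000, dim)                  # stored prefix representations
values = np.random.randint(0, vocab_size, 1000)    # the word following each stored prefix
p_lm = np.full(vocab_size, 1.0 / vocab_size)       # placeholder LM distribution
p_knn = knn_lm_distribution(np.random.randn(dim), keys, values, vocab_size)
print(interpolate(p_lm, p_knn).sum())              # ~1.0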
[Figure: kNN-LM example from Khandelwal et al. (2019): stored (prefix, next word) pairs such as "Obama was born in → Hawaii" are retrieved by distance to the current hidden state, turned into a distribution over next words, and interpolated with the LM's own prediction.]
hidden state:
The model used in these experiments (Transformer-XL) was half the size of the
original one but, again on WikiText-103, it reduced the perplexity from 19.1 to 17.6.
5.2.3 KnowBERT
KnowBERT (Peters et al., 2019) is a typical knowledge-enhanced language
model that injects factual knowledge into the hidden representations of tokens.
The idea is quite intuitive: if a language model can align mentions in the input
text with entities in a knowledge base, we can say that factual knowledge
is contained in this model. Recall from previous sections that in Transformers,
text is first tokenized and vectorized as embeddings, which are then processed
by Transformer layers. In KnowBERT, mentions are not aligned with entities
before being input to the model. Instead, both are first vectorized, and then
they are aligned in the space of hidden representations. Next, we illustrate
the idea in more detail.
H_i^proj = H_i W_1^proj + b_1^proj.        (5.14)
The KAR starts with the knowledge base entity candidate selector, which
provides a list of candidate mentions that it uses to compute C mention-span
representations s_m ∈ R^E. H_i is first projected to the entity dimension with a
linear projection; then, the KAR computes mention-span representations, one
for each candidate mention, by pooling over all word pieces in a mention-span
using self-attentive span pooling. The mention-spans are stacked into a
matrix S ∈ R^{C×E}.
(2) The entity linker functions as its name suggests: it is responsible for
performing entity disambiguation for each potential mention from the available
candidates. It first runs mention-span self-attention to compute embeddings
as

S^e = TransformerBlock(S).        (5.15)
Correspondingly, the loss function of the entity linker can be written as

L_EntityLinker = − Σ_m log [ exp(ψ_{m,g}) / Σ_k exp(ψ_{m,k}) ]        (5.17)
Then, the entity-span representations are updated with the weighted entity
embeddings as

s′^e_m = s^e_m + ẽ_m,        (5.19)

which can be packed into a matrix S′^e ∈ R^{C×E}.
H’i = H’proj
i Wproj
2 + bproj
2 + Hi . (5.21)
Figure 5.6: The left part is the architecture of ERNIE. The right part is
the aggregator for the mutual integration of the token and entity inputs.
The information fusion layer takes two kinds of input: one is the token embedding,
and the other is the concatenation of the token embedding and the entity
embedding. After information fusion, it outputs new token embeddings and
entity embeddings for the next layer.
Note that there are many engineering tricks in the KnowBERT design. These
tricks are commonly used in many knowledge-enhanced language models. In
the following sections, we will leave out such tricks for simplicity.
The equation above will be used to compute the cross-entropy loss function
for dEA.
Model        Relation extraction (Zhang et al., 2017)    Entity typing (Choi et al., 2018)
             P      R      F1-Micro                      P      R      F1-Micro
BERT         67.2   64.8   66.0                          76.4   71.0   73.6
ERNIE        70.0   66.1   68.0                          78.4   72.9   75.6
KnowBERT     71.6   71.4   71.5                          78.6   73.7   76.1
• Entity typing: Given an entity mention and its context, entity typing
requires systems to label the entity mention with its respective semantic
types.
Chapter 6
Additional Topics
[Figure 6.2: Performance of zero-shot FLAN on unseen task types, compared with zero-shot and few-shot GPT-3 (175B), from Wei et al. (2022a).]

6.1.2 Reinforcement Learning from Human Feedback
A different training approach that leverages human instructions is reinforce-
ment learning from human feedback (Christiano et al., 2017). Reinforcement
learning (RL) is a branch of machine learning that focuses on how agents can
learn to make decisions through interactions with an environment. In RL, an
agent receives rewards or penalties based on its actions in the environment,
and its objective is to maximize its cumulative reward over time. The reward
function is a critical component of RL, as it specifies the goals and incentives
for the agent. The agent’s behavior is captured by a policy, which maps
observations to actions. The goal of RL is to optimize the policy to maximize
the expected cumulative reward. The intuition behind RL from human
feedback (RLHF) is to fit a reward function to the human’s preferences while
simultaneously training a policy to optimize the current predicted reward
function.
In this setting, the agent that we are considering is a pretrained language
model that needs to be aligned with the user’s intention. To this end, human
preferences are used as a reward signal to train the model (process illustrated
in Fig. 6.3). This procedure incorporates both explicit intentions such as
following instructions and implicit intentions such as staying truthful, and
not being biased, toxic, or otherwise harmful. RLHF consists of the following
steps:
where rθ (x, y) is the scalar output of the reward model for prompt x
and completion y with parameters θ, yw is the preferred completion
out of the pair of yw and yl , and K is the number of ranked responses
to prompt x present in D.
4. Using the output of the reward model as a scalar reward, fine-tune the
supervised policy to optimize this reward using the proximal policy
optimization (PPO) algorithm (Schulman et al., 2017). The objective
function during this phase is:
This procedure aligns the behavior of the language model to the stated
preferences of a specific group of people.
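To make step 3 above concrete, the pairwise comparison loss used to train the reward model (following Ouyang et al., 2022) can be sketched as follows; r_w and r_l stand in for the scalar rewards r_θ(x, y_w) and r_θ(x, y_l), and the numbers below are purely illustrative:

import torch
import torch.nn.functional as F

def reward_ranking_loss(r_preferred, r_rejected):
    """-log sigmoid(r(x, y_w) - r(x, y_l)), averaged over the sampled comparison pairs.
    Ouyang et al. (2022) normalize by C(K, 2), the number of pairs per prompt."""
    return (-F.logsigmoid(r_preferred - r_rejected)).mean()

# Toy scalar rewards for a batch of (preferred, rejected) completion pairs.
r_w = torch.tensor([1.3, 0.8, 0.2])    # r_theta(x, y_w)
r_l = torch.tensor([0.4, 0.9, -0.1])   # r_theta(x, y_l)
print(reward_ranking_loss(r_w, r_l))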
An example of models trained via RLHF is the InstructGPT series
(Ouyang et al., 2022). For these models, the 175B-parameter GPT-3 is used
as the SFT baseline model, while a smaller 6B model is used to model the reward.
A human evaluation of the outputs produced by InstructGPT showed the
efficacy of the RLHF technique: a set of labelers rated InstructGPT outputs
favorably, compared to the predictions of the non-instruction-trained baseline,
along multiple axes (e.g., compliance to the constraints in the prompt, lack
of hallucinations, use of appropriate language). An aggregation of this result
is displayed in Figure 6.4. The same procedure was used to train the popular
ChatGPT model, though with slight differences in the data collection setup.
Figure 6.4: Multiple models evaluated by how often their outputs were preferred to those of the 175B SFT model, according to human evaluators. InstructGPT (PPO-ptx), as well as its variant without pretraining mix (PPO), significantly outperforms GPT-3.
6.2 Scaling Laws and Emergent Behavior

With the improved accuracy of pretrained language models, researchers
noticed improvements in performance with increasing model size.
2. For large models trained with a limited dataset with early stopping:
L(D) = (D_c / D)^{α_D};   α_D ≈ 0.095,   D_c ≈ 5.4 × 10^13 (tokens)
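Plugging numbers into this law is straightforward; a small sketch, where the dataset sizes below are arbitrary:

def loss_from_data(tokens, alpha_d=0.095, d_c=5.4e13):
    """L(D) = (D_c / D)^alpha_D for a large model trained to early stopping on D tokens."""
    return (d_c / tokens) ** alpha_d

for d in [1e9, 1e10, 1e11, 1e12]:
    print(f"D = {d:.0e} tokens -> predicted loss {loss_from_data(d):.2f}")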
Figure 6.6: Language model training runs, with models ranging in size from 10^3 to 10^9 parameters (excluding embeddings). Compute-efficient training stops far short of convergence.
Wei et al. (2022b) compare how models of varying sizes perform on various
tasks and show that abilities to perform some tasks emerge in language models
after a certain scale. Figure 6.7 and 6.8 show how language models that are
trained above a certain number of FLOPS exhibit increased performance on
certain tasks.
[Figures 6.7 and 6.8: Emergent abilities from Wei et al. (2022b) — accuracy (or BLEU) versus training FLOPs and model scale (number of parameters) on tasks such as modular arithmetic, IPA transliteration, word unscrambling, Persian QA, TruthfulQA, grounded conceptual mappings, multi-task NLU, and word-in-context.]
Bibliography
Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson,
Stephen Gould, and Lei Zhang. 2018. Bottom-up and top-down attention
for image captioning and visual question answering. In CVPR.
Alexei Baevski and Michael Auli. 2018. Adaptive input representations for
neural language modeling. arXiv preprint arXiv:1809.10853.
Hangbo Bao, Li Dong, and Furu Wei. 2022a. BEiT: BERT pre-training of
image transformers. In ICLR.
Hangbo Bao, Wenhui Wang, Li Dong, and Furu Wei. 2022b. VL-BEiT:
Generative vision-language pretraining. arXiv preprint arXiv:2206.01127.
Eyal Ben-David, Nadav Oved, and Roi Reichart. 2021. PADA: A prompt-
based autoregressive approach for adaptation to unseen domains.
Stevo Bozinovski and Ante Fulgosi. 1976. The influence of pattern similarity
and transfer learning upon training of a base perceptron b2. In Proceedings
of Symposium Informatica, volume 3, pages 121–126.
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan,
Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry,
Amanda Askell, et al. 2020. Language models are few-shot learners.
Advances in neural information processing systems, 33:1877–1901.
Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe
Gan, Yu Cheng, and Jingjing Liu. 2020. UNITER: Universal image-text
representation learning. In ECCV.
Eunsol Choi, Omer Levy, Yejin Choi, and Luke Zettlemoyer. 2018. Ultra-
fine entity typing. In Proceedings of the 56th Annual Meeting of the
Association for Computational Linguistics (Volume 1: Long Papers), pages
87–96, Melbourne, Australia. Association for Computational Linguistics.
Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and
Dario Amodei. 2017. Deep reinforcement learning from human preferences.
Advances in neural information processing systems, 30.
Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William
Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma,
et al. 2022. Scaling instruction-finetuned language models. arXiv preprint
arXiv:2210.11416.
Damai Dai, Yutao Sun, Li Dong, Yaru Hao, Zhifang Sui, and Furu Wei. 2022.
Why can gpt learn in-context? language models secretly perform gradient
descent as meta optimizers. arXiv preprint arXiv:2212.10559.
Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav,
José MF Moura, Devi Parikh, and Dhruv Batra. 2017. Visual dialog. In
CVPR.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019.
BERT: Pre-training of deep bidirectional transformers for language under-
standing. In Proceedings of the 2019 Conference of the North American
Chapter of the Association for Computational Linguistics: Human Lan-
guage Technologies, Volume 1 (Long and Short Papers), pages 4171–4186,
Minneapolis, Minnesota. Association for Computational Linguistics.
Shizhe Diao, Wangchunshu Zhou, Xinsong Zhang, and Jiawei Wang. 2023.
Write and paint: Generative vision-language models are unified modal
learners. In The Eleventh International Conference on Learning Represen-
tations.
Ning Ding, Yujia Qin, Guang Yang, Fuchao Wei, Zonghan Yang, Yusheng
Su, Shengding Hu, Yulin Chen, Chi-Min Chan, Weize Chen, Jing Yi,
Weilin Zhao, Xiaozhi Wang, Zhiyuan Liu, Hai-Tao Zheng, Jianfei Chen,
Yang Liu, Jie Tang, Juanzi Li, and Maosong Sun. 2022. Delta tuning:
A comprehensive study of parameter efficient methods for pre-trained
language models. CoRR, abs/2203.06904.
Zi-Yi Dou, Yichong Xu, Zhe Gan, Jianfeng Wang, Shuohang Wang, Lijuan
Wang, Chenguang Zhu, Zicheng Liu, Michael Zeng, et al. 2022. An
empirical study of training end-to-end vision-and-language transformers.
In CVPR.
Iddo Drori, Sarah Zhang, Reece Shuttleworth, Leonard Tang, Albert Lu,
Elizabeth Ke, Kevin Liu, Linda Chen, Sunny Tran, Newman Cheng, et al.
Zhe Gan, Linjie Li, Chunyuan Li, Lijuan Wang, Zicheng Liu, Jianfeng Gao,
et al. 2022. Vision-language pre-training: Basics, recent advances, and
future trends. Foundations and Trends® in Computer Graphics and
Vision, 14(3–4):163–352.
Tianyu Gao, Adam Fisch, and Danqi Chen. 2021. Making pre-trained
language models better few-shot learners. In Association for Computational
Linguistics (ACL).
Xu Han, Weilin Zhao, Ning Ding, Zhiyuan Liu, and Maosong Sun. 2021.
PTR: Prompt tuning with rules for text classification.
Adi Haviv, Jonathan Berant, and Amir Globerson. 2021. BERTese: Learning
to speak to BERT. In Proceedings of the 16th Conference of the European
Chapter of the Association for Computational Linguistics: Main Volume,
pages 3618–3623, Online. Association for Computational Linguistics.
Zhicheng Huang, Zhaoyang Zeng, Yupan Huang, Bei Liu, Dongmei Fu, and
Jianlong Fu. 2021. Seeing Out of tHe bOx: End-to-End pre-training for
vision-language representation learning. In CVPR.
Zhicheng Huang, Zhaoyang Zeng, Bei Liu, Dongmei Fu, and Jianlong Fu.
2020. Pixel-BERT: Aligning image pixels with text by deep multi-modal
transformers. arXiv preprint arXiv:2004.00849.
Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham,
Quoc V Le, Yunhsuan Sung, Zhen Li, and Tom Duerig. 2021. Scaling
up visual and vision-language representation learning with noisy text
supervision. In ICML.
Zhengbao Jiang, Frank F. Xu, Jun Araki, and Graham Neubig. 2020. How
can we know what language models know? Transactions of the Association
for Computational Linguistics, 8:423–438.
Urvashi Khandelwal, Omer Levy, Dan Jurafsky, Luke Zettlemoyer, and Mike
Lewis. 2019. Generalization through memorization: Nearest neighbor
language models. arXiv preprint arXiv:1911.00172.
Wonjae Kim, Bokyung Son, and Ildoo Kim. 2021. ViLT: Vision-and-language
transformer without convolution or region supervision. In ICML.
Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and
Yusuke Iwasawa. 2022. Large language models are zero-shot reasoners.
arXiv preprint arXiv:2205.11916.
Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata,
Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A
Shamma, et al. 2017. Visual genome: Connecting language and vision
using crowdsourced dense image annotations. IJCV.
Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin,
Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Alexan-
der Kolesnikov, et al. 2020. The open images dataset v4. IJCV.
Jaejun Lee, Raphael Tang, and Jimmy Lin. 2019. What would elsa do?
freezing layers during transformer fine-tuning. CoRR, abs/1911.03090.
Gen Li, Nan Duan, Yuejian Fang, Ming Gong, and Daxin Jiang. 2020a.
Unicoder-VL: A universal encoder for vision and language by cross-modal
pre-training. In AAAI.
Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. 2022. Blip: Bootstrap-
ping language-image pre-training for unified vision-language understanding
and generation. In ICML.
Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei
Chang. 2019. VisualBERT: A simple and performant baseline for vision
and language. arXiv preprint arXiv:1908.03557.
Xiujun Li, Xi Yin, Chunyuan Li, Pengchuan Zhang, Xiaowei Hu, Lei Zhang,
Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, Yejin Choi, and Jianfeng
Gao. 2020b. Oscar: Object-semantics aligned pre-training for vision-
language tasks. In ECCV.
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona,
Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. 2014. Microsoft
COCO: Common objects in context. In ECCV.
Haokun Liu, Derek Tam, Muqeeth Mohammed, Jay Mohta, Tenghao Huang,
Mohit Bansal, and Colin Raffel. 2022. Few-shot parameter-efficient fine-
tuning is better and cheaper than in-context learning. In Advances in
Neural Information Processing Systems.
Jiachang Liu, Dinghan Shen, Yizhe Zhang, Bill Dolan, Lawrence Carin, and
Weizhu Chen. 2021a. What makes good in-context examples for gpt-3?
arXiv preprint arXiv:2101.06804.
Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and
Graham Neubig. 2023. Pre-train, prompt, and predict: A systematic survey
of prompting methods in natural language processing. ACM Computing
Surveys, 55(9):1–35.
Xiao Liu, Yanan Zheng, Zhengxiao Du, Ming Ding, Yujie Qian, Zhilin Yang,
and Jie Tang. 2021b. GPT understands, too. CoRR, abs/2103.10385.
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen,
Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019.
Roberta: A robustly optimized bert pretraining approach. arXiv preprint
arXiv:1907.11692.
Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen
Lin, and Baining Guo. 2021c. Swin transformer: Hierarchical vision
transformer using shifted windows. In ICCV.
Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019. Vilbert: Pretrain-
ing task-agnostic visiolinguistic representations for vision-and-language
tasks. In NeurIPS.
Yao Lu, Max Bartolo, Alastair Moore, Sebastian Riedel, and Pontus Stene-
torp. 2021. Fantastically ordered prompts and where to find them: Over-
coming few-shot prompt order sensitivity. arXiv preprint arXiv:2104.08786.
Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. 2017.
Learned in translation: Contextualized word vectors. In Advances in
Neural Information Processing Systems 30: Annual Conference on Neural
Information Processing Systems 2017, December 4-9, 2017, Long Beach,
CA, USA, pages 6294–6305.
Bryan McCann, Nitish Shirish Keskar, Caiming Xiong, and Richard Socher.
2018. The natural language decathlon: Multitask learning as question
answering. arXiv preprint arXiv:1806.08730.
Tomas Mikolov, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. 2013. Ef-
ficient estimation of word representations in vector space. In International
Conference on Learning Representations.
Aaron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. 2017. Neural
discrete representation learning. In NeurIPS.
Jonas Pfeiffer, Ivan Vulić, Iryna Gurevych, and Sebastian Ruder. 2020. MAD-
X: An Adapter-Based Framework for Multi-Task Cross-Lingual Transfer.
In Proceedings of the 2020 Conference on Empirical Methods in Natural
Language Processing (EMNLP), pages 7654–7673, Online. Association for
Computational Linguistics.
Telmo Pires, Eva Schlinger, and Dan Garrette. 2019. How multilingual is
multilingual BERT? In Proceedings of the 57th Annual Meeting of the
Association for Computational Linguistics, pages 4996–5001, Florence,
Italy. Association for Computational Linguistics.
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel
Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin,
Jack Clark, et al. 2021. Learning transferable visual models from natural
language supervision. In ICML.
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya
Sutskever, et al. 2019. Language models are unsupervised multitask
learners. OpenAI blog, 1(8):9.
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang,
Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring
the limits of transfer learning with a unified text-to-text transformer. J.
Mach. Learn. Res., 21:140:1–140:67.
Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec
Radford, Mark Chen, and Ilya Sutskever. 2021. Zero-shot text-to-image
generation. In ICML.
Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster r-cnn:
Towards real-time object detection with region proposal networks. In
NeurIPS.
Victor Sanh, Albert Webson, Colin Raffel, Stephen Bach, Lintang Sutawika,
Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Arun Raja, Manan Dey,
M Saiful Bari, Canwen Xu, Urmish Thakker, Shanya Sharma Sharma,
Eliza Szczechla, Taewoon Kim, Gunjan Chhablani, Nihal Nayak, Debajyoti
Datta, Jonathan Chang, Mike Tian-Jian Jiang, Han Wang, Matteo Manica,
Sheng Shen, Zheng Xin Yong, Harshit Pandey, Rachel Bawden, Thomas
Wang, Trishala Neeraj, Jos Rozen, Abheesht Sharma, Andrea Santilli,
Thibault Fevry, Jason Alan Fries, Ryan Teehan, Teven Le Scao, Stella
Biderman, Leo Gao, Thomas Wolf, and Alexander M Rush. 2022. Multitask
prompted training enables zero-shot task generalization. In International
Conference on Learning Representations.
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg
Klimov. 2017. Proximal policy optimization algorithms. arXiv preprint
arXiv:1707.06347.
Shuai Shao, Zeming Li, Tianyuan Zhang, Chao Peng, Gang Yu, Xiangyu
Zhang, Jing Li, and Jian Sun. 2019. Objects365: A large-scale, high-quality
dataset for object detection. In ICCV.
Sheng Shen, Liunian Harold Li, Hao Tan, Mohit Bansal, Anna Rohrbach,
Kai-Wei Chang, Zhewei Yao, and Kurt Keutzer. 2022. How much can
CLIP benefit vision-and-language tasks? In ICLR.
Taylor Shin, Yasaman Razeghi, Robert L. Logan IV, Eric Wallace, and
Sameer Singh. 2020. AutoPrompt: Eliciting knowledge from language
models with automatically generated prompts. In Empirical Methods in
Natural Language Processing (EMNLP).
Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, and Jifeng Dai.
2019. VL-BERT: Pre-training of generic visual-linguistic representations.
In ICLR.
Alane Suhr, Stephanie Zhou, Ally Zhang, Iris Zhang, Huajun Bai, and Yoav
Artzi. 2019. A corpus for reasoning about natural language grounded in
photographs. In ACL.
Hao Tan and Mohit Bansal. 2019. LXMERT: Learning cross-modality encoder
representations from transformers. In EMNLP.
Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi,
Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey,
et al. 2016. Google’s neural machine translation system: Bridging the gap
between human and machine translation. arXiv preprint arXiv:1609.08144.
Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdi-
nov, and Quoc V Le. 2019. Xlnet: Generalized autoregressive pretraining
for language understanding. In Advances in Neural Information Processing
Systems, volume 32. Curran Associates, Inc.
Fei Yu, Jiji Tang, Weichong Yin, Yu Sun, Hao Tian, Hua Wu, and Haifeng
Wang. 2021. ERNIE-VIL: Knowledge enhanced vision-language represen-
tations through scene graphs. In AAAI.
Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini,
and Yonghui Wu. 2022. Coca: Contrastive captioners are image-text
foundation models. TMLR.
Licheng Yu, Patrick Poirson, Shan Yang, Alexander C Berg, and Tamara L
Berg. 2016. Modeling context in referring expressions. In ECCV.
Weizhe Yuan, Graham Neubig, and Pengfei Liu. 2021. BARTScore: Evaluat-
ing generated text as text generation.
Elad Ben Zaken, Yoav Goldberg, and Shauli Ravfogel. 2022. Bitfit: Simple
parameter-efficient fine-tuning for transformer-based masked language-
models. In Proceedings of the 60th Annual Meeting of the Association for
Computational Linguistics (Volume 2: Short Papers), ACL 2022, Dublin,
Ireland, May 22-27, 2022, pages 1–9. Association for Computational Lin-
guistics.
Rowan Zellers, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. From
recognition to cognition: Visual commonsense reasoning. In CVPR.
Yan Zeng, Xinsong Zhang, Hang Li, Jiawei Wang, Jipeng Zhang, and
Wangchunshu Zhou. 2022. X2 -VLM: All-In-One pre-trained model for
vision-language tasks. arXiv preprint arXiv:2211.12402.
Pengchuan Zhang, Xiujun Li, Xiaowei Hu, Jianwei Yang, Lei Zhang, Lijuan
Wang, Yejin Choi, and Jianfeng Gao. 2021. VinVL: Revisiting visual
representations in vision-language models. In CVPR.
Yuhao Zhang, Victor Zhong, Danqi Chen, Gabor Angeli, and Christopher D.
Manning. 2017. Position-aware attention and supervised data improve
slot filling. In Proceedings of the 2017 Conference on Empirical Methods
in Natural Language Processing, pages 35–45, Copenhagen, Denmark.
Association for Computational Linguistics.
Zhengyan Zhang, Xu Han, Zhiyuan Liu, Xin Jiang, Maosong Sun, and Qun
Liu. 2019. ERNIE: Enhanced language representation with informative
entities. In Proceedings of the 57th Annual Meeting of the Association for
Computational Linguistics, pages 1441–1451, Florence, Italy. Association
for Computational Linguistics.
Zexuan Zhong, Dan Friedman, and Danqi Chen. 2021. Factual probing is
[MASK]: Learning vs. learning to recall. CoRR, abs/2104.05240.
Luowei Zhou, Hamid Palangi, Lei Zhang, Houdong Hu, Jason Corso, and Jian-
feng Gao. 2020. Unified vision-language pre-training for image captioning
and vqa. In AAAI.
Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun,
Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies:
Towards story-like visual explanations by watching movies and reading
books. In Proceedings of the IEEE international conference on computer
vision, pages 19–27.