
VisualGPT: Data-efficient Adaptation of Pretrained Language Models for Image Captioning

Jun Chen1, Han Guo2, Kai Yi1, Boyang Li3, Mohamed Elhoseiny1
1 King Abdullah University of Science and Technology (KAUST), 2 Carnegie Mellon University, 3 Nanyang Technological University
{jun.chen,kai.yi,mohamed.elhoseiny}@kaust.edu.sa
hanguo@cs.cmu.edu, boyang.li@ntu.edu.sg

Abstract

The limited availability of annotated data often hinders real-world applications of machine learning. To efficiently learn from small quantities of multimodal data, we leverage the linguistic knowledge from a large pre-trained language model (PLM) and quickly adapt it to new domains of image captioning. To effectively utilize a pretrained model, it is critical to balance the visual input and prior linguistic knowledge from pretraining. We propose VisualGPT, which employs a novel self-resurrecting encoder-decoder attention mechanism to quickly adapt the PLM with a small amount of in-domain image-text data. The proposed self-resurrecting activation unit produces sparse activations that prevent accidental overwriting of linguistic knowledge. When trained on 0.1%, 0.5% and 1% of the respective training sets, VisualGPT surpasses the best baseline by up to 10.0% CIDEr on MS COCO [45] and 17.9% CIDEr on Conceptual Captions [69]. Furthermore, VisualGPT achieves the state-of-the-art result on IU X-ray [15], a medical report generation dataset. Our code is available at https://github.com/Vision-CAIR/VisualGPT.

[Figure 1: architecture diagram showing encoder layers 1..K feeding, through a self-resurrecting encoder-decoder attention with gates βvis and βlan, into decoder layers 1..K initialized with pretrained LM weights; example output: "A cop on brown horse on sidewalk next to truck."] Figure 1. Our VisualGPT model transfers the knowledge from a pre-trained language model to the caption decoder. A self-resurrecting encoder-decoder attention is designed to connect the multi-level visual features and caption decoder.

1. Introduction

Recent performance gains in image captioning [13, 25, 29, 33, 81] are achieved on top of large-scale data corpora such as MS COCO [45] or Conceptual Captions [69], each containing hundreds of thousands of captions. Manual annotation of captions requires considerable time and effort. On the other hand, semi-automatic collection of image-caption pairs from the Internet, as used by Conceptual Captions [69], may generate incorrect or undesirable training data even after multiple rounds of cleaning. Data for specialized domains like medical report generation [15, 42] and low-resource language captioning [18, 80] cannot be easily scaled. Improving the data efficiency of image captioning networks would enable quick data curation, description of rare objects, and applications in specialized domains.

In this paper, we investigate the data efficiency problem for image captioning. This problem is distinct from the novel object captioning problem [1, 24], which relies on abundant in-domain data but zero out-of-domain data. Instead, we aim to improve the performance of image captioning systems trained on a small subset of in-domain data.

We propose to improve data efficiency by leveraging pretrained language models (PLMs) [17, 36, 48, 63], such as BERT [16], XLNet [83], and GPT [6, 61, 62]. Via self-supervised learning, these models acquire rich linguistic and semantic knowledge, which has been shown to inform downstream tasks in NLP [7, 21]. However, the adaptation of PLMs pretrained on unimodal textual data to multimodal tasks remains under-investigated.
A key challenge in utilizing PLMs is to bridge the domain gap between multi-modal data and the unimodal textual data the PLMs are pre-trained on. In Figure 2, we compare the part-of-speech distributions of MS COCO and WikiText-2 [54]. MS COCO employs 75% more nouns but 14% fewer verbs, which indicates a bias toward descriptions of static objects rather than actions. This suggests that, in order to effectively utilize PLMs in image captioning, we must balance prior linguistic knowledge acquired from pretraining and visual input information.

[Figure 2: bar chart] Figure 2. Comparison of the part-of-speech distributions of the MS COCO and WikiText-2 datasets [54]. We use the spacy parser and show only the most important categories.

Figure 1 depicts the overall architecture of our proposed model, dubbed VisualGPT. In the commonly used encoder-decoder architecture for image captioning, we initialize the parameters of the decoder from PLMs such as GPT-2 [62], whereas the encoder layers are randomly initialized. In addition, we propose an attention mechanism with self-resurrecting activation units (SRAUs), which balances the input from the visual encoder and the linguistic input from the previous decoder layer. The proposed mechanism can produce sparse activations while not being as vulnerable to the zero-gradient problem as regular gates; the self-resurrecting gates can be "turned on" again after being zeroed out.

Empirical results demonstrate that, when trained on 0.1%, 0.5%, and 1% of the MS COCO and Conceptual Captions data, VisualGPT outperforms several strong baseline models. We achieve the state-of-the-art result on IU X-ray [15], a medical report generation dataset. With several ablation experiments, we verify the effectiveness of PLMs and the proposed self-resurrecting attention mechanism.

Contributions. We make the following contributions:

• We explore the data efficiency problem for image captioning by utilizing pretrained language models (PLMs) as the caption decoder. With only a small amount of in-domain training data, the proposed technique quickly adapts PLMs to the cross-modal task of image captioning. To our knowledge, this is the first work that focuses on efficiently adapting large pre-trained language models for image captioning.

• We propose a novel encoder-decoder attention with self-resurrecting activation units (SRAUs), which can balance features from the visual and textual modalities. SRAU produces sparse activations that reduce accidental overwriting of pretrained weights.

2. Related Work

Image Captioning. Image captioning has been extensively studied in computer vision research. Early methods [19, 33, 39, 71, 85] focus on filling templates with extracted objects, attributes, and relationships. With the advent of deep learning, researchers proposed end-to-end neural networks that encode an image into vector representations and decode a caption word by word [28, 77]. Many improvements to the encoder [11, 40, 52, 81, 82, 86, 87], the decoder [78, 79, 84], and the attention mechanism [8, 13, 25, 35, 38] have since been proposed. Encoding the image using object regions has proven beneficial [2]. Reinforcement learning enables model optimization with non-differentiable evaluation metrics [14, 47, 65, 70]. [9, 12] investigate fine-grained control of caption generation. [14, 70] adopt GAN-like architectures that encourage human-like captions.

A few formulations of the image captioning problem deviate from the traditional supervised learning paradigm. Novel object captioning aims to describe objects that do not exist in the training data [1, 24, 43, 53, 76]. Feng et al. [20] propose unsupervised captioning without using paired image-caption supervision. Kim et al. [30] focus on learning efficiency and improve data efficiency by learning from auxiliary unpaired image-caption data.

Self-supervised NLP Models. Self-supervised training of large neural networks on textual data proves to be an important technique in the creation of high-performance NLP models. Several self-supervision signals have been proposed, such as autoregressive language modeling [5, 55], which includes the GPT series of models [6, 61, 62], and masked language modeling, which includes ELMo [59] and BERT-related methods [16, 34, 49].

In this paper, we propose a quick adaptation technique for network weights obtained using the language modeling (LM) objective. However, the proposed technique can easily be applied to other models, as the masked language modeling objective can be converted to the LM objective by masking only the last word in the textual sequence. Unlike neural networks pretrained on multimodal data (e.g., [41, 51, 60, 72, 73, 88, 89]), our method only requires a small amount of multimodal training data and focuses on adapting linguistic knowledge learned from the textual modality.
[Figure 3: decoder-layer diagrams (a)-(d)] Figure 3. Architectures of the vanilla Transformer [74], Transformer with AoA module [25] (AoA Transformer), M2 Transformer [13], and VisualGPT. We denote I and H as the visual and language features, respectively. Zm−1 is the output from decoder layer m−1. Within the circles, α, B^V and B^L represent different gating units.

3. Preliminaries: Transformer for Captioning

The Transformer [74] has become one of the standard models for image captioning. At its core lies the multi-head dot-product attention mechanism. Taking three input matrices, query Q, key K, and value V, the attention function can be written as

\text{Attn}(Q, K, V) = \text{softmax}\left(\frac{(W^q Q)(W^k K)^{\top}}{\sqrt{D}}\right) W^v V,    (1)

where W^q, W^k, and W^v are trainable parameters and D is a scaling factor. Intuitively, the attention operation can be seen as encoding W^q Q as a convex combination of the row vectors of W^v V. Multi-head attention repeats the process with multiple sets of W^q, W^k, and W^v; the results are concatenated and linearly projected back to the same dimensionality.

In visual captioning tasks, we apply a visual encoder whose output is I ∈ R^{O×S}. O is the length of the input sequence, which in this work is a sequence of objects in the image. S is the hidden dimension size. The decoder network outputs words in the caption sequentially.

When decoding word t+1, the encoder-decoder attention takes as input the visual encoding I and the current state of the decoder H ∈ R^{t×S}. We apply the attention operation with H as the query and I as both the key and the value. The encoder-decoder attention is then

\text{EncDecAttn}(H, I) = \text{Attn}(H, I, I).    (2)

After that, we apply the AddNorm operator, which contains a residual connection and layer normalization [3] and can be written as LayerNorm(EncDecAttn(H, I) + H).
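To make Equations (1) and (2) concrete, the following minimal PyTorch sketch implements single-head dot-product attention and the encoder-decoder attention. The module name, shapes, and unbatched tensors are our own illustrative assumptions, not the released VisualGPT code.

```python
import math
import torch
import torch.nn as nn

class DotProductAttention(nn.Module):
    """Single-head version of Eq. (1); shapes are illustrative assumptions."""
    def __init__(self, dim):
        super().__init__()
        self.wq = nn.Linear(dim, dim, bias=False)  # W^q
        self.wk = nn.Linear(dim, dim, bias=False)  # W^k
        self.wv = nn.Linear(dim, dim, bias=False)  # W^v
        self.scale = math.sqrt(dim)                # sqrt(D)

    def forward(self, query, key, value):
        # query: (t, S); key, value: (O, S)
        scores = self.wq(query) @ self.wk(key).transpose(0, 1) / self.scale
        weights = torch.softmax(scores, dim=-1)    # convex-combination weights
        return weights @ self.wv(value)            # (t, S)

def enc_dec_attn(attn, H, I):
    """Eq. (2): the decoder state H queries the visual encoding I."""
    return attn(H, I, I)

# Usage sketch: O = 10 detected objects, t = 4 decoded tokens, S = 768.
attn = DotProductAttention(dim=768)
I = torch.randn(10, 768)   # visual encoder output
H = torch.randn(4, 768)    # current decoder state
out = enc_dec_attn(attn, H, I)   # (4, 768)
```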
Researchers have proposed other variants of the encoder-decoder attention. In Figure 3, we contrast these decoder architectures with the proposed VisualGPT model. The Attention-on-Attention (AoA) module [25] provides an alternative method for combining the visual encoding I and the linguistic information H from the decoder. As another method for combining visual and linguistic information, M2 Transformer [13] connects all decoder layers to all encoder layers. In Figure 3, this is represented by the box labeled Meshed Connection Sum.

4. VisualGPT

Pretrained language models (PLMs) such as GPT-2 [62] are trained on data from a single modality. We use a PLM as the caption decoder and feed visual information to the PLM via the encoder-decoder attention, which plays a crucial role in quickly adapting the PLM.

With the design of the encoder-decoder attention, we aim to carefully balance visual information from the encoder and linguistic knowledge stored in the PLM. During the generation of visual words, such as "person", "truck", or "dog", the model should attend to visual information. In contrast, the generation of determiners or connectives requires only linguistic knowledge. Ideally, we would like to exploit the massive amount of linguistic knowledge stored in the PLM weights (e.g., [46]), while referring to the visual input only when required. To achieve this goal, we introduce a pair of specialized gating units.

4.1. Self-Resurrecting Activation Unit

The encoder-decoder attention EncDecAttn(H, I) may be seen as encoding the linguistic information H with visual information I. In VisualGPT, we control the balance between these two modalities using two complementary gates B^vis and B^lan. The output of this module is

B^{\text{vis}} \otimes \text{EncDecAttn}(H, I) + B^{\text{lan}} \otimes H,    (3)

where ⊗ denotes element-wise multiplication. Letting B^vis[i, j] and B^lan[i, j] denote the elements of the matrices, they are computed in pairs as

B^{\text{vis}}[i, j] = \sigma(H[i,j]) \, \mathbbm{1}(\sigma(H[i,j]) > \tau),
B^{\text{lan}}[i, j] = (1 - \sigma(H[i,j])) \, \mathbbm{1}(1 - \sigma(H[i,j]) > \tau),    (4)

where τ is a predefined threshold hyperparameter and 1(·) is the indicator function, which returns 1 if the inner statement is true and 0 otherwise.
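As a concrete illustration of Equations (3) and (4), here is a minimal PyTorch sketch of the self-resurrecting gating. The function and variable names are our own assumptions, and the shapes are simplified relative to a full multi-head implementation.

```python
import torch

def srau_gates(H, tau=0.2):
    """Compute B^vis and B^lan from the decoder state H (Eq. 4)."""
    s = torch.sigmoid(H)
    b_vis = s * (s > tau).float()              # zeroed out when below threshold
    b_lan = (1 - s) * ((1 - s) > tau).float()  # complementary gate, same rule
    return b_vis, b_lan

def gated_cross_attention(enc_dec_attn_out, H, tau=0.2):
    """Eq. (3): balance visual context and linguistic state element-wise."""
    b_vis, b_lan = srau_gates(H, tau)
    return b_vis * enc_dec_attn_out + b_lan * H

# Usage sketch with t = 4 decoded tokens and hidden size 768.
H = torch.randn(4, 768)
ctx = torch.randn(4, 768)   # output of EncDecAttn(H, I)
out = gated_cross_attention(ctx, H, tau=0.2)
```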
An alternative to SRAU is ordinary complementary gates (OCG), computed as σ(H[i, j]) and 1 − σ(H[i, j]) (see Figure 4, top left). OCG can output values that are very close to zero. In contrast, through the indicator functions, SRAU directly sets values below the threshold τ to zero, thereby introducing sparsity. When τ is set to 0, SRAU becomes OCG. As the gradient cannot backpropagate through zero gates, SRAU prevents optimization from disrupting pretrained weights that capture linguistic knowledge. This property is crucial for effectively utilizing pretrained models. In contrast, when the OCG gates output near-zero values, some small but non-zero gradients may still overwrite existing linguistic knowledge.

Another advantage of SRAU is its ability to escape from zero outputs. It is possible for one gate to output zero and have zero gradient while the gradient for the other gate remains usable (e.g., when x in Figure 4 is close to 1.3 or −1.3). The asymmetry allows gradient-based optimization to change the zero-outputting gate by changing the other gate. For this reason, we name these gates self-resurrecting activation units.

The asymmetry of SRAU may appear counter-intuitive. We contrast SRAU with a "normalized" version where the two gates B̃^vis[i, j] and B̃^lan[i, j] become symmetric:

\tilde{B}^{\text{vis}}[i, j] = \frac{B^{\text{vis}}[i, j]}{B^{\text{vis}}[i, j] + B^{\text{lan}}[i, j]},
\tilde{B}^{\text{lan}}[i, j] = \frac{B^{\text{lan}}[i, j]}{B^{\text{vis}}[i, j] + B^{\text{lan}}[i, j]}.    (5)

These gates lose the asymmetry that enables the self-resurrecting property.

In Figure 4, we visualize OCG, SRAU, and normalized SRAU. In ablation experiments, we show that SRAU outperforms both OCG and normalized SRAU.

[Figure 4: gate response curves] Figure 4. Top Left: Ordinary complementary sigmoid gates. Top Right: Normalized SRAU with τ = 0.2. Bottom: SRAU with τ = 0.2. The x-axis indicates the function inputs and the y-axis indicates the outputs.

4.2. The Architecture and Training of VisualGPT

For completeness, we introduce the overall architecture of VisualGPT. The image encoder comprises K Transformer layers. Given an image, we extract objects in the image using an off-the-shelf object detection network. After that, we feed the spatial location into the image encoder. As such, the image encoder outputs I of dimension S × O × K. The caption decoder contains M layers and its parameters are initialized from a PLM. We insert the encoder-decoder module, which is randomly initialized. We also apply meshed connections between the encoder and the decoder as in M2 Transformer. The network is trained to maximize the probability of the next token wt conditioned on tokens w1, . . . , wt−1 and the encoder output I. After a predefined number of epochs of supervised learning, we switch to self-critical reinforcement learning [65] with CIDEr as the reward.

5. Experiments

5.1. Datasets and Evaluation Metrics

We evaluate our model on three datasets: MS COCO [45], Conceptual Captions [69], and IU X-ray [15]. MS COCO contains 123,287 images, each annotated with 5 different captions. We follow the Karpathy split [29] for the validation and test sets. The Conceptual Captions dataset [69] contains around 3.3M images for training and 28K for validation, with much higher diversity than COCO. As the test data is not publicly available, we instead use the public validation data as our test set, and randomly sample 5,000 different image-caption pairs from the training set as the validation set. To create the small training data setup for MS COCO and Conceptual Captions, we randomly sample 0.1%, 0.5% and 1% of the image-caption pairs as training data, which corresponds to 567, 2,835 and 5,670 pairs for COCO and 3,300, 16,500 and 33,000 pairs for Conceptual Captions. We repeat the experiments 4 times with different random seeds, and report the average performance. We report metrics for BLEU [57], METEOR [4], ROUGE [44], and CIDEr [75].
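The low-data splits are simple random subsamples of image-caption pairs, repeated over several seeds; a minimal sketch of how such splits could be drawn is shown below. The file name and data layout are illustrative assumptions, not the released data-preparation script.

```python
import json
import random

def sample_pairs(pairs, fraction, seed):
    """Randomly keep `fraction` of the image-caption pairs for one run."""
    rng = random.Random(seed)
    k = max(1, int(len(pairs) * fraction))
    return rng.sample(pairs, k)

# Hypothetical usage: pairs is a list of {"image_id": ..., "caption": ...} dicts.
with open("coco_train_pairs.json") as f:   # assumed file layout
    pairs = json.load(f)

for seed in range(4):                      # the paper averages over 4 seeds
    for fraction in (0.001, 0.005, 0.01):  # 0.1%, 0.5%, 1%
        subset = sample_pairs(pairs, fraction, seed)
        # train a model on `subset` and evaluate on the Karpathy test split
```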
Method PLM | COCO: B1 B4 M R C | Conceptual Captions: B1 B4 M R C
0.1% training data
Transformer [74] None 57.4 13.1 16.7 40.7 40.8 12.4 2.4 4.9 15.2 21.2
M2 Transformer [13] None 56.9 13.1 16.9 40.6 40.9 13.1 2.8 4.8 15.5 23.5
AoA Transformer [25] None 56.6 13.5 15.9 40.7 38.4 11.4 2.4 4.6 14.7 20.9
X-Transformer [56] None 56.7 12.9 16.5 40.6 40.4 12.8 2.7 4.7 15.3 23.1
OSCAR [41] BERT 53.8 11.9 17.1 39.5 41.0 12.2 2.4 4.3 14.8 21.9
Transformer GPT 56.8 15.3 17.0 41.2 42.9 13.2 2.5 5.0 15.1 21.9
M2 Transformer GPT 54.9 14.7 16.6 41.1 41.0 11.9 2.6 4.9 15.4 24.0
AoA Transformer GPT 55.5 14.4 16.2 40.7 40.1 11.8 2.8 4.6 13.9 20.5
VisualGPT (Normalized SRAU) GPT 55.7 15.0 16.8 41.2 42.4 13.3 2.9 5.1 15.8 25.8
VisualGPT (Our SRAU) GPT 58.2 16.4 18.5 41.9 45.1 13.9 3.2 5.6 16.7 27.7
0.5% training data
Transformer None 62.8 18.8 19.4 25.2 59.2 13.2 3.3 5.5 16.3 29.6
M2 Transformer None 63.3 19.4 19.8 45.6 61.3 14.5 3.6 6.0 17.1 32.0
AoA Transformer None 63.5 20.2 19.4 45.8 63.9 13.8 3.3 5.6 17.9 31.8
X-Transformer None 62.9 19.0 19.6 45.7 62.0 14.2 3.5 5.8 17.3 32.1
OSCAR BERT 59.2 18.0 21.0 45.3 60.2 14.4 3.7 6.1 17.2 33.5
Transformer GPT 65.1 21.8 20.6 46.6 69.5 16.2 3.8 6.5 18.3 35.6
M2 Transformer GPT 64.7 21.8 20.7 47.1 68.5 13.9 3.6 6.0 17.2 34.1
AoA Transformer GPT 64.2 21.2 20.5 46.5 67.2 14.8 3.6 6.2 17.6 34.1
VisualGPT (Normalized SRAU) GPT 65.3 21.8 20.9 47.0 69.3 14.9 3.9 6.1 18.0 35.9
VisualGPT (Our SRAU) GPT 66.2 22.1 21.1 47.3 70.3 15.9 4.2 6.7 18.5 37.2
1% training data
Transformer None 66.0 21.9 21.1 47.3 71.9 13.9 3.7 6.3 18.1 37.9
M2 Transformer None 67.1 23.4 21.3 48.3 73.0 16.0 4.1 6.8 18.9 39.8
AoA Transformer None 67.6 23.6 21.5 48.4 75.5 14.9 4.1 6.5 18.6 39.0
X-Transformer None 67.0 23.6 21.2 48.1 47.1 15.6 4.0 6.6 18.7 39.5
OSCAR BERT 67.2 23.3 22.5 49.1 78.4 16.1 4.2 6.7 18.9 40.6
Transformer GPT 68.5 25.1 22.1 49.0 80.5 17.8 4.2 6.7 19.0 40.2
M2 Transformer GPT 68.2 25.0 22.4 49.2 80.4 15.4 3.9 6.5 17.9 39.1
AoA Transformer GPT 68.5 24.6 22.0 48.6 78.4 15.4 3.9 6.5 17.9 38.5
VisualGPT (Normalized SRAU) GPT 68.7 25.2 22.3 49.2 80.6 15.3 4.2 6.7 18.3 40.3
VisualGPT (Our SRAU) GPT 69.5 25.6 22.6 49.6 80.9 16.3 4.3 6.9 19.3 40.9

Table 1. Performance of the compared methods trained on 0.1%, 0.5% and 1% of the MS COCO and Conceptual Captions image-caption pairs. The best performance in each configuration is in bold. Ablated models are marked in gray.

IU X-ray [15] is a radiography dataset containing 7,470 chest X-ray images and 3,955 human-written reports. As the dataset is already small, we follow the original split, which has a training set of 5,226 images and 2,770 reports. Most reports have two images corresponding to the frontal and lateral viewpoints.

5.2. Experimental Settings

Baselines. We compare our model with several state-of-the-art transformer-based models, including:

• Plain Transformer [74].

• AoA Transformer, which inserts an attention-on-attention (AoA) module [25] into every transformer layer, as depicted in Figure 3 (b). Following [13], we slightly update the original AoA network in [25] by replacing the LSTM with Transformers in order to create a fair Transformer-to-Transformer comparison.

• M2 Transformer [13], which proposes a meshed connection between encoder and decoder and is one of the best-performing models on MS COCO.

• X-Transformer [56], which employs bilinear pooling to selectively capitalize on visual information and is one of the best-performing models on MS COCO.
• OSCAR [41], which finetunes a BERT initialization on image-language datasets.

Models            B-1  B-2  B-3  B-4  R    M    C
Att2in            22.4 12.9 8.9  6.8  30.8 -    29.7
CoAtt             45.5 28.8 20.5 15.4 36.9 -    27.7
HRGR              43.8 29.8 20.8 15.1 32.2 -    34.3
CMAS-RL           46.4 30.1 21.0 15.4 37.1 -    27.5
Chen et al.       47.0 30.4 21.9 16.5 37.1 18.7 -
VisualGPT (ours)  48.0 31.3 22.2 15.9 37.4 20.5 49.7

Table 2. Performance on the IU X-ray dataset.

Models                 B-1  B-4  M    R    C
Kim et al. [31]        58.1 13.4 15.9 -    36.0
Kim et al. + unpaired  63.0 18.7 20.7 -    55.2
Gu et al. [22]         46.2 5.4  13.2 -    17.7
Feng et al. [20]       58.9 18.6 17.9 -    54.9
VisualGPT (ours)       67.1 24.3 21.9 48.6 75.8

Table 3. Comparison with unsupervised and semi-supervised learning methods using Kim et al.'s split of MS COCO. Kim et al. employ only 1% of the images for training, in contrast to 1% of the image-caption pairs in Table 1. Note that Kim et al. + unpaired also uses the rest of the training data as unpaired images and texts. The gray shading denotes baselines that use a large amount of unpaired images and texts during training.
Since VisualGPT has GPT as the pretrained decoder, for fair comparisons we also create variants of Transformer, AoA Transformer and M2 Transformer with GPT as the decoder. For VisualGPT, we set τ to 0.2 in all experiments. We also explored the effect of different τ and found that τ in the range [0, 0.2] offers the right level of sparsity. For all other baselines, we tune the hyperparameters on the validation set of MS COCO. We train our model and all the baselines in the reinforcement learning setting following [13]. Please see the supplemental material for more details on hyperparameters and experimental results.

5.3. Quantitative Results

Small In-domain Training Data. Results on MS COCO and Conceptual Captions are presented in Table 1. VisualGPT outperforms the best-performing baseline model by 4.1 CIDEr when trained on 0.1% of the MS COCO data, 6.4 CIDEr when trained on 0.5% of the data, and 2.5 CIDEr with 1% of the training data. On the Conceptual Captions dataset, VisualGPT also outperforms all the baselines. It outperforms the best baseline model by 4.2 CIDEr under 0.1% training data, 3.5 CIDEr under 0.5% data and 0.3 CIDEr under 1% data.

Comparison with BERT-based model. We compared with OSCAR [41], which is a BERT-based [16] model with strong results on many benchmarks. We run their model without pretraining on a large-scale image-language corpus for a fair comparison with our model. The main difference between BERT and GPT is their pretraining objectives: BERT uses masked language modeling, whereas GPT performs autoregressive prediction of the next word. GPT has learning behavior more similar to the image captioning model than BERT, since both are optimized by autoregressively generating the next word. The experimental results in Table 1 show that VisualGPT is better than OSCAR on both datasets, which confirms our choice of GPT as the decoder.

Medical Report Generation. We compared VisualGPT against state-of-the-art medical report generation models including Att2in [65], CoAtt [27], HRGR [37], CMAS-RL [26] and the model from Chen et al. [10]. This dataset only contains around 2,770 medical reports in the training set, which is less than 1% of the COCO data and poses a data-efficiency challenge. We follow the same experimental setting as in [10]. The results show that VisualGPT outperforms the baselines on most evaluation metrics and creates a new state of the art. This shows the value of leveraging GPT knowledge in a highly specific domain where paired data are "expensive" and insufficient. We hope our findings can inspire future work in other domains.

Comparison Against Semi-supervised and Unsupervised Methods. Kim et al. [31] proposed a semi-supervised learning method to improve the data efficiency of image captioning. They used 1% of the images and all their captions as training data, rather than 1% of all image-caption pairs as in Table 1; hence they cover fewer images, since each image is associated with more than one caption. For Kim et al. + unpaired, they also employ the other 99% of MS COCO as unpaired images and captions for training. We replicate their setup by training with only 1% of the images. As shown in Table 3, without using additional unpaired images and captions, the proposed VisualGPT method outperforms Kim et al. [31] by 20.6 CIDEr.

We also compare VisualGPT against the unsupervised methods of Gu et al. [22] and Feng et al. [20], which use tens of millions of unpaired images and captions. Even though these are not fair comparisons, it is encouraging to see VisualGPT surpassing these baselines by utilizing the supervision of only 1,133 training images.

5.4. Ablation Studies

Ablation on cross-attention: To fairly compare our SRAU with other cross-attention mechanisms in the baselines, we also initialize their decoders with the 12-layer GPT and keep the same encoder as VisualGPT. We contrast plain cross-attention, meshed cross-attention, and attention-on-attention (AoA) modules. For the AoA Transformer, we add the AoA module on top of cross-attention. Table 1 shows the results, which demonstrate that SRAU is better than the other cross-attention modules in exploiting the GPT knowledge within the image-caption task.
Ablation on SRAU: We create an ablation called Normalized SRAU, where we replace the SRAU with the normalized SRAU (see Figure 4) and use GPT-2 initialization. We provide the results in Table 1. The normalized SRAU results in substantially lower performance, decreasing CIDEr relative to the full VisualGPT by 2.7, 1.0, and 0.3 respectively on the three setups of MS COCO, and by 2.2, 1.3 and 0.6 respectively on Conceptual Captions. This demonstrates that the self-resurrecting property is beneficial for learning from small data. We also experimented with Leaky ReLU and GELU, which ameliorate zero gradients, but the training crashed due to the lack of upper limits on the function values.

We explored different τ among {0, 0.1, 0.2} and show the CIDEr performance on different percentages of COCO training data in Figure 5. τ = 0 is equivalent to ordinary complementary sigmoid gates. We observe that τ = 0.2 gives the best performance in most cases, indicating the usefulness of incorporating sparsity in our SRAU complementary gates.

[Figure 5: line plot of CIDEr v.s. threshold τ ∈ {0, 0.1, 0.2} for 0.1%, 0.5%, 1% and 5% training data] Figure 5. CIDEr performance v.s. different thresholds τ with 0.1%, 0.5%, 1% and 5% training data.

Method           0.1% data  0.5% data  1% data
Transformer      18.4%      17.2%      16.8%
AoA Transformer  11.5%      20.9%      25.0%
M2 Transformer   30.9%      22.8%      20.8%
VisualGPT        39.2%      39.1%      37.4%

Table 4. The percentage of votes received by VisualGPT and baseline models under different quantities of training data.

Q1. Does the caption miss things shown in the image?
Answer   Ours  M2 Transformer  Transformer  AoA   GT
No       719   624             633          621   973
Yes      367   438             456          447   73
No Rate  0.66  0.59            0.58         0.58  0.93

Q2. Does the caption describe things not in the image?
Answer   Ours  M2 Transformer  Transformer  AoA   GT
No       720   692             633          655   448
Yes      360   418             423          412   43
No Rate  0.67  0.62            0.60         0.61  0.96

Table 5. Human evaluation of object hallucination and omission. GT denotes the ground-truth captions.

[Figure 6: example captions with per-word visual scores, e.g., "a woman sitting on a bench in a park", "a laptop sitting on a desk with a mouse", "a cat is sitting in front of a television", "a couple of people sitting on a snowy surface"] Figure 6. Visual scores of words in generated captions. We show the raw visual scores and highlight them according to normalized visual scores. High visual scores are in blue and low scores in red.
5.5. Human Study

In addition to automatic evaluation metrics, we conduct two human studies to further evaluate the quality of generated captions. In the first study, we asked participants directly for their preference over generated captions. We randomly selected 250 test images from the three setups of 0.1%, 0.5%, and 1% training data. For every image, we generated one caption from VisualGPT and from each of three high-performing baselines from Table 1, Transformer [74], M2 Transformer [13], and AoA Transformer [25], all with three decoder layers. Every image was evaluated by 5 different Turkers, who chose the caption that most accurately described the image content. We received 3,750 (250 images × 5 Turkers × 3 setups) valid responses.
We summarize the results in Table 4. Overall, the captions generated by VisualGPT received the largest share of votes: 39.2% for the 0.1% training data split, 39.1% for the 0.5% split, and 37.4% for the 1% split. For each training setup, we conducted Pearson's chi-square test [58], which shows that the differences are statistically significant with p < 0.05 in all cases.
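The significance test is a standard Pearson chi-square test over the vote counts; a minimal SciPy sketch is shown below. The counts are approximate reconstructions from the Table 4 percentages and the 1,250 votes per setup, not the exact tallies.

```python
from scipy.stats import chisquare

# Approximate vote counts for the 0.1% setup, reconstructed from Table 4
# percentages and 1,250 total votes (250 images x 5 Turkers); illustrative only.
votes = [490, 230, 144, 386]  # VisualGPT, Transformer, AoA Transformer, M2

# Null hypothesis: all four models are equally preferred.
stat, p_value = chisquare(votes)
print(f"chi2 = {stat:.1f}, p = {p_value:.2e}")
```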
In the second study, we evaluate whether using pretrained language models introduces an excessive linguistic prior that could cause the known object hallucination problem [67]. From the models trained using 1% of the COCO data, we randomly sampled 250 images with the generated caption from each model. For each image, we asked 5 different participants whether the caption (1) described non-existent objects or (2) missed objects existing in the image. To catch random clickers, we created 5 images with verified captions, so that we knew the right answers to these questions. Participants who answered these questions wrongly were considered unreliable and removed from the results.

The results are in Table 5. Compared to the baselines, VisualGPT has less hallucination and higher coverage of objects. The study also finds that the ground-truth captions have the least amount of hallucination and the highest coverage of objects in the image. This finding lends positive support to the validity of the experimental protocol.

5.6. Analysis

In this section, we visually examine examples from the VisualGPT model trained on 1% of MS COCO. First, we show example captions generated by VisualGPT in Figure 6 together with the associated B^vis at the last decoder layer. Note that for every word generated, we have a 768-dimensional visual gate vector, which is a slice of B^vis at a given decoding time step. We take the mean of the gate vector as the visual score for that word. After that, we normalize the visual scores across the dataset to the [0, 1] interval and highlight the words accordingly. Blue indicates high visual scores and red indicates low visual scores. We observe that, in agreement with our intuition, VisualGPT assigns high visual scores to words like "desk" and "snowy surface" and low visual scores to determiners and prepositions.
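A minimal sketch of this scoring procedure is shown below; the tensor layout of the gate values is an assumption, and the normalization is the simple min-max rescaling described in the text.

```python
import torch

def word_visual_scores(b_vis):
    """b_vis: (T, 768) slice of B^vis at the last decoder layer,
    one row per generated word. Returns one raw score per word."""
    return b_vis.mean(dim=1)                      # (T,)

def normalize_scores(all_scores):
    """Min-max rescale the raw scores over the whole dataset to [0, 1]."""
    lo, hi = all_scores.min(), all_scores.max()
    return (all_scores - lo) / (hi - lo + 1e-8)

# Usage sketch for one nine-word caption.
b_vis = torch.rand(9, 768)
raw = word_visual_scores(b_vis)
print(normalize_scores(raw))
```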
In Figure 7, we plot the distribution of B^vis and B^lan at every decoder layer as a box-and-whisker diagram. We also show the words with the highest and lowest visual scores, which are again in line with our expectations. Additionally, we observe that, going from layer 0 to layer 9, the decoder makes increasing use of visual information, but the uppermost layers, 10 and 11, make more balanced use of information. We hypothesize that the low layers focus on low-level linguistics like syntax, whereas the middle layers learn to fuse linguistic information with visual information. Finally, the two information sources become balanced in the uppermost layers.

[Figure 7: box-and-whisker plots per decoder layer; words with the highest visual attention include "bench", "wooden", "sitting", "clock" and "toilet", while "to", "of", "on", "the" and "a" receive the lowest] Figure 7. Distributions of linguistic attention (B^lan) and visual attention (B^vis) at every decoding layer. We also show the words generated with the highest and lowest visual attention.

5.7. Limitation

One limitation of our proposal is that, as experiments in the supplementary material show, the gap between baseline models and VisualGPT gradually vanishes as in-domain training data increase. The phenomenon is more pronounced in COCO than in Conceptual Captions, which has a more diverse vocabulary. We hypothesize that linguistic knowledge from pretrained models is most useful when the training data are small and do not provide sufficient coverage of the vocabulary.

6. Conclusions

We present VisualGPT, a data-efficient image captioning model which leverages the linguistic knowledge of a pretrained language model. To bridge the semantic gap between different modalities, we design a novel encoder-decoder attention mechanism with an unsaturated rectified gating function. We evaluate our model on 0.1%, 0.5% and 1.0% of MS COCO and Conceptual Captions, and on IU X-ray, a small medical imaging report dataset. VisualGPT achieves the state-of-the-art result on IU X-ray and outperforms strong baseline models.

VisualGPT may address the realistic need of training captioning models for low-resource languages or highly specialized domains, where it can be challenging to find annotators to collect a large amount of data.

Acknowledgments. This work is funded by KAUST BAS/1/1685-01-0, KAUST-FCC/1/2533-17-01, and the National Research Foundation Fellowship (NRF-NRFF13-2021-0006), Singapore.
References collection of radiology examinations for distribution and re-
trieval. Journal of the American Medical Informatics Asso-
[1] Harsh Agrawal, Karan Desai, Yufei Wang, Xinlei Chen, ciation, 23(2), 2016. 1, 2, 4, 5
Rishabh Jain, Mark Johnson, Dhruv Batra, Devi Parikh, Ste-
[16] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina
fan Lee, and Peter Anderson. nocaps: novel object caption-
Toutanova. Bert: Pre-training of deep bidirectional trans-
ing at scale. In ICCV, 2019. 1, 2
formers for language understanding. In NAACL, 2019. 1, 2,
[2] Peter Anderson, Xiaodong He, Chris Buehler, Damien 6
Teney, Mark Johnson, Stephen Gould, and Lei Zhang. [17] Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu,
Bottom-up and top-down attention for image captioning and Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon.
visual question answering. In CVPR, 2018. 2, 12 Unified language model pre-training for natural language un-
[3] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. derstanding and generation. In NeurIPS, 2019. 1
Layer normalization. arXiv 1607.06450, 2016. 3 [18] Obeida ElJundi., Mohamad Dhaybi., Kotaiba Mokadam.,
[4] Satanjeev Banerjee and Alon Lavie. Meteor: An automatic Hazem Hajj., and Daniel Asmar. Resources and end-to-
metric for mt evaluation with improved correlation with hu- end neural network models for arabic image captioning. In
man judgments. In Proceedings of the acl workshop on in- Proceedings of the 15th International Joint Conference on
trinsic and extrinsic evaluation measures for machine trans- Computer Vision, Imaging and Computer Graphics Theory
lation and/or summarization, 2005. 4 and Applications - Volume 5: VISAPP,, pages 233–241. IN-
[5] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and STICC, SciTePress, 2020. 1
Christian Jauvin. A neural probabilistic language model. [19] Ali Farhadi, Mohsen Hejrati, Mohammad Amin Sadeghi, Pe-
Journal of machine learning research, 3(Feb), 2003. 2 ter Young, Cyrus Rashtchian, Julia Hockenmaier, and David
[6] Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Sub- Forsyth. Every picture tells a story: Generating sentences
biah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakan- from images. In ECCV. Springer, 2010. 2
tan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. [20] Yang Feng, Lin Ma, Wei Liu, and Jiebo Luo. Unsupervised
Language models are few-shot learners. arXiv preprint image captioning. In CVPR, 2019. 2, 6
arXiv:2005.14165, 2020. 1, 2 [21] Sergey Golovanov, Rauf Kurbanov, Sergey Nikolenko,
[7] Pawel Budzianowski and Ivan Vulic. Hello, it’s GPT-2 - Kyryl Truskovskyi, Alexander Tselousov, and Thomas Wolf.
how can I help you? towards the use of pretrained language Large-scale transfer learning for natural language genera-
models for task-oriented dialogue systems. In Alexandra tion. In ACL, 2019. 1
Birch, Andrew M. Finch, Hiroaki Hayashi, Ioannis Konstas, [22] Jiuxiang Gu, Shafiq Joty, Jianfei Cai, and Gang Wang. Un-
Thang Luong, Graham Neubig, Yusuke Oda, and Katsuhito paired image captioning by language pivoting. In ECCV,
Sudoh, editors, EMNLP-IJCNLP. Association for Computa- 2018. 6
tional Linguistics, 2019. 1 [23] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun.
[8] Long Chen, Hanwang Zhang, Jun Xiao, Liqiang Nie, Jian Deep residual learning for image recognition. In CVPR,
Shao, Wei Liu, and Tat-Seng Chua. Sca-cnn: Spatial and 2016. 12
channel-wise attention in convolutional networks for image [24] Lisa Anne Hendricks, Subhashini Venugopalan, Marcus
captioning. In CVPR, 2017. 2 Rohrbach, Raymond Mooney, Kate Saenko, and Trevor Dar-
[9] Shizhe Chen, Qin Jin, Peng Wang, and Qi Wu. Say as you rell. Deep compositional captioning: Describing novel ob-
wish: Fine-grained control of image caption generation with ject categories without paired training data. In CVPR, 2016.
abstract scene graphs. In CVPR, 2020. 2 1, 2
[10] Zhihong Chen, Yan Song, Tsung-Hui Chang, and Xiang [25] Lun Huang, Wenmin Wang, Jie Chen, and Xiao-Yong Wei.
Wan. Generating radiology reports via memory-driven trans- Attention on attention for image captioning. In ICCV, 2019.
former. In EMNLP, 2020. 6 1, 2, 3, 5, 7, 13
[11] Cesc Chunseong Park, Byeongchang Kim, and Gunhee Kim. [26] Baoyu Jing, Zeya Wang, and Eric Xing. Show, describe and
Attend to you: Personalized image captioning with context conclude: On exploiting the structure information of chest
sequence memory networks. In CVPR, 2017. 2 x-ray reports. In ACL, 2019. 6
[12] Marcella Cornia, Lorenzo Baraldi, and Rita Cucchiara. [27] Baoyu Jing, Pengtao Xie, and Eric Xing. On the automatic
Show, control and tell: A framework for generating control- generation of medical imaging reports. In ACL, 2018. 6
lable and grounded captions. In CVPR, 2019. 2 [28] Justin Johnson, Andrej Karpathy, and Li Fei-Fei. Densecap:
[13] Marcella Cornia, Matteo Stefanini, Lorenzo Baraldi, and Fully convolutional localization networks for dense caption-
Rita Cucchiara. Meshed-memory transformer for image cap- ing. In CVPR, 2016. 2
tioning. In CVPR, 2020. 1, 2, 3, 5, 6, 7, 12, 13 [29] Andrej Karpathy and Li Fei-Fei. Deep visual-semantic align-
[14] Bo Dai, Sanja Fidler, Raquel Urtasun, and Dahua Lin. To- ments for generating image descriptions. In CVPR, 2015. 1,
wards diverse and natural image descriptions via a condi- 4
tional gan. In ICCV, 2017. 2 [30] Dong-Jin Kim, Jinsoo Choi, Tae-Hyun Oh, and In So
[15] Dina Demner-Fushman, Marc D Kohli, Marc B Rosen- Kweon. Image captioning with very scarce supervised data:
man, Sonya E Shooshan, Laritza Rodriguez, Sameer Antani, Adversarial semi-supervised learning approach. In Kentaro
George R Thoma, and Clement J McDonald. Preparing a Inui, Jing Jiang, Vincent Ng, and Xiaojun Wan, editors,
EMNLP-IJCNLP. Association for Computational Linguis- [46] Nelson F Liu, Matt Gardner, Yonatan Belinkov, Matthew E
tics, 2019. 2 Peters, and Noah A Smith. Linguistic knowledge and trans-
[31] Dong-Jin Kim, Jinsoo Choi, Tae-Hyun Oh, and In So ferability of contextual representations. In NAACL, 2019. 3
Kweon. Image captioning with very scarce supervised data: [47] Siqi Liu, Zhenhai Zhu, Ning Ye, Sergio Guadarrama, and
Adversarial semi-supervised learning approach. In EMNLP, Kevin Murphy. Improved image captioning via policy gradi-
Hong Kong, China, Nov. 2019. Association for Computa- ent optimization of SPIDEr. In ICCV, 2017. 2
tional Linguistics. 6 [48] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar
[32] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettle-
Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalan- moyer, and Veselin Stoyanov. RoBERTa: A robustly opti-
tidis, Li-Jia Li, David A Shamma, et al. Visual genome: mized BERT pretraining approach. arXiv Preprint, arXiv
Connecting language and vision using crowdsourced dense 1907.11692, 2019. 1
image annotations. IJCV, 123(1), 2017. 12 [49] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar
[33] Girish Kulkarni, Visruth Premraj, Vicente Ordonez, Sag- Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettle-
nik Dhar, Siming Li, Yejin Choi, Alexander C Berg, and moyer, and Veselin Stoyanov. Roberta: A robustly optimized
Tamara L Berg. Babytalk: Understanding and generating bert pretraining approach. 2019. 2
simple image descriptions. TPAMI, 35(12), 2013. 1, 2 [50] Ilya Loshchilov and Frank Hutter. Fixing weight decay reg-
[34] Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin ularization in adam. 2018. 12
Gimpel, Piyush Sharma, and Radu Soricut. Albert: A lite [51] Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. Vil-
bert for self-supervised learning of language representations. bert: Pretraining task-agnostic visiolinguistic representa-
In ICLR, 2019. 2 tions for vision-and-language tasks. In Hanna M. Wallach,
[35] Kuang-Huei Lee, Xi Chen, Gang Hua, Houdong Hu, and Xi- Hugo Larochelle, Alina Beygelzimer, Florence d’Alché-
aodong He. Stacked cross attention for image-text matching. Buc, Emily B. Fox, and Roman Garnett, editors, NeurIPS,
In ECCV, 2018. 2 2019. 2
[36] Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvinine- [52] Jiasen Lu, Caiming Xiong, Devi Parikh, and Richard Socher.
jad, Abdelrahman Mohamed, Omer Levy, Veselin Stoy- Knowing when to look: Adaptive attention via a visual sen-
anov, and Luke Zettlemoyer. BART: Denoising sequence-to- tinel for image captioning. In CVPR, 2017. 2
sequence pre-training for natural language generation, trans- [53] Jiasen Lu, Jianwei Yang, Dhruv Batra, and Devi Parikh.
lation, and comprehension. 2019. 1 Neural baby talk. In CVPR, 2018. 2
[37] Christy Y Li, Xiaodan Liang, Zhiting Hu, and Eric P Xing. [54] Stephen Merity, Caiming Xiong, James Bradbury, and
Hybrid retrieval-generation reinforced agent for medical im- Richard Socher. Pointer sentinel mixture models. 2017. 2
age report generation. In NeurIPS, 2018. 6 [55] Tomáš Mikolov, Stefan Kombrink, Lukáš Burget, Jan
Černockỳ, and Sanjeev Khudanpur. Extensions of recurrent
[38] Guang Li, Linchao Zhu, Ping Liu, and Yi Yang. Entangled
neural network language model. In ICASSP. IEEE, 2011. 2
transformer for image captioning. In ICCV, 2019. 2
[56] Yingwei Pan, Ting Yao, Yehao Li, and Tao Mei. X-linear
[39] Siming Li, Girish Kulkarni, Tamara Berg, Alexander Berg,
attention networks for image captioning. In Proceedings of
and Yejin Choi. Composing simple image descriptions using
the IEEE/CVF Conference on Computer Vision and Pattern
web-scale n-grams. In CoNLL, 2011. 2
Recognition, pages 10971–10980, 2020. 5
[40] Xiangyang Li and Shuqiang Jiang. Know more say less: Im-
[57] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing
age captioning based on scene graphs. IEEE Transactions on
Zhu. Bleu: a method for automatic evaluation of machine
Multimedia, 21(8), 2019. 2
translation. In ACL, 2002. 4
[41] Xiujun Li, Xi Yin, Chunyuan Li, Pengchuan Zhang, Xiaowei [58] Karl Pearson. X. on the criterion that a given system of de-
Hu, Lei Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu viations from the probable in the case of a correlated system
Wei, et al. Oscar: Object-semantics aligned pre-training for of variables is such that it can be reasonably supposed to
vision-language tasks. In ECCV. Springer, 2020. 2, 5, 6 have arisen from random sampling. The London, Edinburgh,
[42] Yuan Li, Xiaodan Liang, Zhiting Hu, and Eric P Xing. Hy- and Dublin Philosophical Magazine and Journal of Science,
brid retrieval-generation reinforced agent for medical image 50(302), 1900. 8
report generation. In NeurIPS. 2018. 1 [59] Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gard-
[43] Yehao Li, Ting Yao, Yingwei Pan, Hongyang Chao, and Tao ner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer.
Mei. Pointing novel objects in image captioning. In CVPR, Deep contextualized word representations. In NAACL-HLT,
2019. 2 2018. 2
[44] Chin-Yew Lin and Eduard Hovy. Manual and automatic eval- [60] Di Qi, Lin Su, Jia Song, Edward Cui, Taroon Bharti, and
uation of summaries. In Proceedings of the ACL-02 Work- Arun Sacheti. Imagebert: Cross-modal pre-training with
shop on Automatic Summarization-Volume 4, 2002. 5 large-scale weak-supervised image-text data. arXiv preprint
[45] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, arXiv:2001.07966, 2020. 2
Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence [61] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya
Zitnick. Microsoft COCO: Common objects in context. In Sutskever. Improving language understanding by generative
ECCV. Springer, 2014. 1, 4 pre-training. 2018. 1, 2
[62] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario [78] Qingzhong Wang and Antoni B. Chan. Cnn+cnn: Convo-
Amodei, and Ilya Sutskever. Language models are unsuper- lutional decoders for image captioning. arXiv 1805.09019,
vised multitask learners. OpenAI blog, 1(8), 2019. 1, 2, 3 2018. 2
[63] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, [79] Yufei Wang, Zhe Lin, Xiaohui Shen, Scott Cohen, and Garri-
Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and son W Cottrell. Skeleton key: Image captioning by skeleton-
Peter J. Liu. Exploring the limits of transfer learning with a attribute decomposition. In CVPR, 2017. 2
unified text-to-text transformer. Journal of Machine Learn- [80] Yike Wu, Shiwan Zhao, Jia Chen, Ying Zhang, Xiaojie Yuan,
ing Research, 21:1–67, 2020. 1 and Zhong Su. Improving captioning for low-resource lan-
[64] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. guages by cycle consistency. In 2019 IEEE International
Faster r-cnn: Towards real-time object detection with region Conference on Multimedia and Expo (ICME), pages 362–
proposal networks. In Advances in neural information pro- 367, 2019. 1
cessing systems, 2015. 12 [81] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron
[65] Steven J Rennie, Etienne Marcheret, Youssef Mroueh, Jerret Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua
Ross, and Vaibhava Goel. Self-critical sequence training for Bengio. Show, attend and tell: Neural image caption gen-
image captioning. In CVPR, 2017. 2, 4, 6 eration with visual attention. In ICML, 2015. 1, 2
[66] Steven J Rennie, Etienne Marcheret, Youssef Mroueh, Jerret [82] Xu Yang, Kaihua Tang, Hanwang Zhang, and Jianfei Cai.
Ross, and Vaibhava Goel. Self-critical sequence training for Auto-encoding scene graphs for image captioning. In CVPR,
image captioning. In CVPR, 2017. 12 2019. 2
[67] Anna Rohrbach, Lisa Anne Hendricks, Kaylee Burns, Trevor [83] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell,
Darrell, and Kate Saenko. Object hallucination in image cap- Russ R Salakhutdinov, and Quoc V Le. Xlnet: General-
tioning. In EMNLP, 2018. 8 ized autoregressive pretraining for language understanding.
[68] Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural In NeurIPS, 2019. 1
machine translation of rare words with subword units. In [84] Zhilin Yang, Ye Yuan, Yuexin Wu, William W Cohen, and
ACL. The Association for Computer Linguistics, 2016. 12 Russ R Salakhutdinov. Review networks for caption genera-
[69] Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu tion. In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R.
Soricut. Conceptual captions: A cleaned, hypernymed, im- Garnett, editors, NeurIPS, volume 29, 2016. 2
age alt-text dataset for automatic image captioning. In ACL, [85] Benjamin Z Yao, Xiong Yang, Liang Lin, Mun Wai Lee, and
2018. 1, 4 Song-Chun Zhu. I2t: Image parsing to text description. Pro-
[70] Rakshith Shetty, Marcus Rohrbach, Lisa Anne Hendricks, ceedings of the IEEE, 98(8), 2010. 2
Mario Fritz, and Bernt Schiele. Speaking the same language: [86] Ting Yao, Yingwei Pan, Yehao Li, and Tao Mei. Exploring
Matching machine to human captions by adversarial training. visual relationship for image captioning. In ECCV, 2018. 2
In ICCV, 2017. 2 [87] Ting Yao, Yingwei Pan, Yehao Li, and Tao Mei. Hierarchy
[71] Richard Socher and Li Fei-Fei. Connecting modalities: parsing for image captioning. In ICCV, 2019. 2
Semi-supervised segmentation and annotation of images us- [88] Pengchuan Zhang, Xiujun Li, Xiaowei Hu, Jianwei Yang,
ing unaligned text corpora. In CVPR. IEEE, 2010. 2 Lei Zhang, Lijuan Wang, Yejin Choi, and Jianfeng Gao.
[72] Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Vinvl: Revisiting visual representations in vision-language
Wei, and Jifeng Dai. Vl-bert: Pre-training of generic visual- models. In Proceedings of the IEEE/CVF Conference on
linguistic representations. In ICLR, 2020. 2 Computer Vision and Pattern Recognition, pages 5579–
[73] Hao Tan and Mohit Bansal. LXMERT: learning cross- 5588, 2021. 2
modality encoder representations from transformers. In Ken- [89] Mingchen Zhuge, Dehong Gao, Deng-Ping Fan, Linbo Jin,
taro Inui, Jing Jiang, Vincent Ng, and Xiaojun Wan, editors, Ben Chen, Haoming Zhou, Minghui Qiu, and Ling Shao.
EMNLP-IJCNLP. ACL, 2019. 2 Kaleido-bert: Vision-language pre-training on fashion do-
[74] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszko- main. In Proceedings of the IEEE/CVF Conference on Com-
reit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia puter Vision and Pattern Recognition, pages 12647–12657,
Polosukhin. Attention is all you need. In NeurIPS, 2017. 3, 2021. 2
5, 7, 13
[75] Ramakrishna Vedantam, C Lawrence Zitnick, and Devi
Parikh. Cider: Consensus-based image description evalua-
tion. In CVPR, 2015. 5
[76] Subhashini Venugopalan, Lisa Anne Hendricks, Marcus
Rohrbach, Raymond Mooney, Trevor Darrell, and Kate
Saenko. Captioning images with diverse objects. In CVPR,
2017. 2
[77] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Du-
mitru Erhan. Show and tell: Lessons learned from the 2015
MSCOCO image captioning challenge. TPAMI, 39(4), 2016.
2
A. Supplementary material

A.1. Additional implementation details

Image and Word Features. Following [2], we use a Faster R-CNN network [64] with a ResNet-101 [23] backbone trained on the Visual Genome dataset [32], and we extract a 2048-dimensional feature vector for each object.

We use Byte Pair Encoding (BPE) [68], which effectively incorporates sub-word information and is beneficial for dealing with out-of-vocabulary words. We employ learnable positional encodings and initialize the token embeddings from the pretrained weights of GPT-2.

Architecture and Hyperparameters. We have 3 layers in the encoder and 12 layers in the decoder, with 12 heads in each layer. The hidden size D in each layer is 768. We load the GPT-2 (small) pretrained weights, which have 117M parameters, into the decoder. We use a learning rate of 1e-4 under the XE loss and 1e-5 during reinforcement learning. We train the models with the AdamW optimizer [50] and a batch size of 25. The beam size is equal to 5. The threshold τ is tuned on the validation set for different training data.
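As an illustration of the decoder initialization described above, the following sketch loads the GPT-2 (small) weights with the Hugging Face transformers library; only the library calls are standard, while the weight-copying step into a custom captioning decoder is shown schematically with a hypothetical attribute name.

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# GPT-2 small: 12 layers, 12 heads, hidden size 768, ~117M parameters.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")  # BPE vocabulary
gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")

num_params = sum(p.numel() for p in gpt2.parameters())
print(f"{num_params / 1e6:.0f}M parameters, vocab size {len(tokenizer)}")

# A caption decoder could copy these weights block by block, e.g.:
# for our_block, gpt_block in zip(decoder.blocks, gpt2.transformer.h):
#     our_block.load_state_dict(gpt_block.state_dict(), strict=False)
# (decoder.blocks is a hypothetical attribute of a custom decoder, shown only
#  to indicate where the pretrained weights would be inserted.)
```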
Training Details. We train all the models in two steps. We first train the models with the cross-entropy (XE) loss and then finetune them using reinforcement learning. The cross-entropy loss L_XE is the traditional autoregressive classification loss

\mathcal{L}_{XE} = - \sum_{t=1}^{T} \log p(w_{t} \mid w_{1:t-1}),    (6)

where w_{1:T} represents the target ground-truth sequence. For reinforcement learning, we employ a variant of Self-Critical Sequence Training [66]. Following [13], we sample L sentences, \hat{w}^1_{1:T}, \ldots, \hat{w}^L_{1:T}, with beam search and use the mean reward of the L sentences as the baseline b. The gradient is

\nabla_{\theta}\mathcal{L}_{RL}(\theta) = - \frac{1}{L} \sum_{i=1}^{L} \Big( (r(\hat{w}^i_{1:T}) - b) \, \nabla_{\theta} \log p(\hat{w}^i_{1:T}) \Big),    (7)

where r(·) represents the CIDEr-D reward.
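To make the two-stage objective concrete, here is a minimal PyTorch sketch of the XE loss (Eq. 6) and the self-critical policy-gradient loss (Eq. 7); the reward values and the summed caption log-probabilities stand in for CIDEr-D scoring and beam-search sampling, which are not reproduced here.

```python
import torch
import torch.nn.functional as F

def xe_loss(logits, targets):
    """Eq. (6): negative log-likelihood of the ground-truth tokens.
    logits: (T, vocab), targets: (T,) token ids."""
    return F.cross_entropy(logits, targets, reduction="sum")

def self_critical_loss(log_probs, rewards):
    """Eq. (7): REINFORCE with the mean reward of the L samples as baseline.
    log_probs: (L,) summed log-probabilities of the sampled captions,
    rewards:   (L,) CIDEr-D scores of the same captions."""
    baseline = rewards.mean()
    # Rewards are treated as constants; gradients flow through log_probs only.
    return -((rewards - baseline).detach() * log_probs).mean()

# Usage sketch with L = 5 sampled captions.
log_probs = torch.randn(5, requires_grad=True)
rewards = torch.tensor([0.9, 1.1, 0.7, 1.3, 1.0])
loss = self_critical_loss(log_probs, rewards)
loss.backward()
```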
A.2. Training VisualGPT with More COCO and Conceptual Captions Data

[Figure 8: CIDEr v.s. percentage of COCO training data] Figure 8. Evaluation on different percentages of COCO data.

[Figure 9: CIDEr v.s. percentage of Conceptual Captions training data] Figure 9. Evaluation on different percentages of Conceptual Captions data.

Figure 8 shows additional results obtained by training the networks on 5%, 10%, 20%, 50% and 100% (82,783 images) of the MS COCO data. Figure 9 shows the performance when scaling the data up to 2.5% (82,958 images) of Conceptual Captions, at which point the dataset scale is similar to the whole of COCO. For MS COCO, VisualGPT outperforms the other baseline models when we sample ≤ 20% of the training data. For Conceptual Captions, VisualGPT consistently outperforms all the baselines when we sample ≤ 2.5% of the training images. These experiments highlight our model's effectiveness in low-data regimes.

On the other hand, we should also note that M2 Transformer surpasses VisualGPT's performance when there is 50% or 100% of the COCO training data. But when we train with the same number of Conceptual Captions images, VisualGPT continuously outperforms all the baselines. This leads us to consider why VisualGPT behaves differently on these two datasets. The difference between the two datasets is that Conceptual Captions contains more diverse vocabulary and image content. In contrast, COCO captions only cover 80 common image objects. Therefore, the appearance frequency of each word in COCO is much higher than in Conceptual Captions, and COCO's vocabulary diversity is also much lower. We hypothesize that when the captions provide small coverage of each word, caption generation benefits greatly from GPT's inherent knowledge, and GPT can help the model quickly adapt to the new domain. But when there is a lot of in-domain data, current image-captioning models can already generalize well, and the data potentially contradicts GPT's original knowledge.
A.3. Attention over Different Types of Words

We use the spaCy parser to detect the part-of-speech of words in captions and calculate the mean value of the visual attention score for each category. The results are presented in Figure 10. We found that parts of speech that tend toward visual content, like nouns (0.71), verbs (0.71) and adjectives (0.72), have high visual attention scores, whereas linguistic parts of speech like pronouns (0.53), punctuation (0.58), and determiners (0.61) receive low attention.
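A minimal sketch of this analysis is shown below; the spaCy model name and the pairing of per-word visual scores with tokens are illustrative assumptions.

```python
import spacy
from collections import defaultdict

nlp = spacy.load("en_core_web_sm")  # assumed English model

def pos_visual_scores(captions_with_scores):
    """Average the per-word visual scores by part-of-speech tag.
    captions_with_scores: list of (caption_text, [score per word]) pairs,
    where each score is the mean of B^vis for one generated word."""
    sums, counts = defaultdict(float), defaultdict(int)
    for text, scores in captions_with_scores:
        doc = nlp(text)
        for token, score in zip(doc, scores):
            sums[token.pos_] += score
            counts[token.pos_] += 1
    return {pos: sums[pos] / counts[pos] for pos in sums}

# Usage sketch on one generated caption and its per-word visual scores.
example = [("a woman sitting on a bench in a park",
            [0.7, 0.78, 0.82, 0.76, 0.8, 0.96, 0.8, 0.69, 0.85])]
print(pos_visual_scores(example))
```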
[Figure 10: bar chart of mean visual attention score per part-of-speech category] Figure 10. Attention scores over different part-of-speech words.
A.4. More Qualitative Examples

In Figure 11, we provide more examples of visual attention. Blue indicates high visual scores and red indicates low visual scores. We can observe that VisualGPT assigns higher scores to words like "steam engine", "elephants", "horse", "lush" and "cabinets", and it assigns low visual scores to determiners and prepositions like "to" and "at".

[Figure 11: generated captions with per-word visual scores, e.g., "a steam engine sitting in a display", "elephants standing next to a stone fence", "a white horse grazing on a lush green field"] Figure 11. More examples of visual attention for each word in generated captions. High visual scores are in blue and low scores in red.

We also show examples of captions generated by our VisualGPT and several strong baseline models, including Transformer (3 layers) [74], M2 Transformer (3 layers) [13] and AoA Transformer [25], in Table 6, Table 7 and Table 8. Overall, we can observe that VisualGPT describes the image content more accurately than the baseline models.
Image 1
Transformer: a woman riding some skis on skis
M2 Transformer: a couple of skiers are standing near the snow
AoA Transformer: a man with skis in the snow
VisualGPT (ours): a group of people walk on a snowy mountain
GT1: the people are walking through snow in a wooded area
GT2: two people wearing skis traveling through the snow
GT3: a man is walking down a path covered in a snow
GT4: a couple is skiing through the snowy woods
GT5: a couple of people that are in a snowy field

Image 2
Transformer: a street that has some street in it
M2 Transformer: a traffic light over a street light under a traffic light
AoA Transformer: a street with people on a city street
VisualGPT (ours): a street with tall signs and traffic signs
GT1: a yellow traffic light above a street next to houses
GT2: a street scene of an intersection with a street light
GT3: a stop light hanging over an intersection in a residential area
GT4: a traffic signal at an intersection is suspended on wire
GT5: a street intersection with a traffic light over it

Image 3
Transformer: some pizza are sitting on a plate
M2 Transformer: a plate with food and a knife on it
AoA Transformer: a plate of pizza on a table
VisualGPT (ours): a plate of bread are served on a table
GT1: a batch of bread slices sitting on a plate
GT2: a plate with some pieces of bread on it
GT3: sliced french bread is on a plat that is lying on a table
GT4: bread that is sitting on a plate that is on a table
GT5: a white plate with lots topped with garlic bread

Image 4
Transformer: two tennis player playing tennis on the ball
M2 Transformer: a tennis player about to hit a ball
AoA Transformer: a baseball players on a game playing a game
VisualGPT (ours): a tennis player hits a ball with a racket
GT1: a man holding a racquet on top of a tennis court
GT2: a man with a tennis racket reaches for a ball
GT3: a man with a tennis racket is running on a court
GT4: a young man is playing a game of tennis
GT5: a tennis player in a blue shirt runs toward a ball

Image 5
Transformer: a group of birds that are standing in the grass
M2 Transformer: a flock of birds perched in a tree branch
AoA Transformer: several giraffe are standing next to each trees
VisualGPT (ours): a bird standing in the middle of a pond
GT1: a bird is perched a top a branch over a river
GT2: a bird sits on a branch above a stream
GT3: a bird on top of a tree branch over water
GT4: a picture of an outside region that appears incredible
GT5: a bird on a fallen branch in a body of water

Table 6. Captions generated by our VisualGPT, Transformer, M2 Transformer and AoA Transformer on the 0.1% MS COCO data split.
Image 1
Transformer: several boats are sitting in the middle of a lake
M2 Transformer: a boat filled with boats floating in the water
AoA Transformer: an empty boat that has water and water
VisualGPT (ours): a canal filled with boats in the water
GT1: a blue boat docked on a green lush shore
GT2: a small marina with boats docked there
GT3: a group of boats sitting together with no one around
GT4: some boats parked in the water at a dock
GT5: boats sitting around the side of a lake by a tree

Image 2
Transformer: pizza slices and pizza in a plate covered pizza
M2 Transformer: people sitting at a table eating pizza and other salad
AoA Transformer: two pizza eating a table with pizza on the table
VisualGPT (ours): a group of pizza on a iron plate with toppings
GT1: a set of five pizzas sitting next to each other each with different toppings
GT2: a handful of prepared pizzas sit next to each other
GT3: five uncooked pizzas with a variety of different toppings
GT4: five unbaked pizzas that include various types of cheeses
GT5: five different pizzas are being prepared over a metal tray

Image 3
Transformer: a dog holding a frisbee in the water
M2 Transformer: a dog holding a frisbee in a body of water
AoA Transformer: a dog walking during a frisbee in a stone day
VisualGPT (ours): a dog walking through the water with a frisbee
GT1: two dogs are playing on the beach catching a frisbee
GT2: of two dogs only one may be the victor
GT3: a dog catching a frisbee by another dog on a beach
GT4: dog jumping up in the air to catch a frisbee in the summer time
GT5: a dog jumping up into the air to catch a frisbee

Image 4
Transformer: a group of people taking a child in a in a building
M2 Transformer: a group of people in an airport with their hands
AoA Transformer: a picture of a young group of people standing for men
VisualGPT (ours): a group of people standing around a tv
GT1: a group of men standing around a room
GT2: some people are waiting in a long room
GT3: people are standing in a room looking at a television screen
GT4: a person sitting on a bench while the rest look somehwere else
GT5: a man in red winter clothes sits on a bench with people behind him gather in front of a tv

Image 5
Transformer: an elephant eating a elephant has a elephant
M2 Transformer: elephant with its trunk with their elephant with its trunk
AoA Transformer: two elephants standing at a lot of trees
VisualGPT (ours): three elephants standing next to some trees
GT1: two adult elephants are surrounding a baby elephant
GT2: a baby elephant kneeling in front of two bigger elephants
GT3: a baby elephant and it 's parents eat fruit
GT4: elephants eat fruit a baby elephant rummaging in the food
GT5: a pair of adult elephants with a baby elephant eat from a pile of fruit

Table 7. Captions generated by our VisualGPT, Transformer, M2 Transformer and AoA Transformer on the 0.5% MS COCO data split.
Image 1
Transformer: a man in a suit and a woman standing in a shop
M2 Transformer: a man is standing in a shop with a people holding people
AoA Transformer: a man is working on a bus in a
VisualGPT (ours): a group of people standing at an airport with their luggage
GT1: several people are purchasing tickets at a bus station
GT2: some people are checking in at the ticket counter somewhere in asia
GT3: people waiting in line with luggage at a ticket counter
GT4: people are standing near an airport ticket kiosk
GT5: customers stand at a kiosk waiting for tickets

Image 2
Transformer: a bus that is parked in front of a building
M2 Transformer: a couple of people walking down the side of a street
AoA Transformer: a bus is parked in a city street
VisualGPT (ours): a while and blue bus is parked on the side of a city street
GT1: people standing outside of a blue and white bus
GT2: an image of a tour bus that is picking people up
GT3: several people standing around buses and most wearing orange vests
GT4: a public transit bus pulling up to pick up passengers
GT5: a city bus at a stop waiting to pick up passengers

Image 3
Transformer: a blue and white airplane flying through a sky
M2 Transformer: an air plane flying in the air
AoA Transformer: a plane airplane flying down in the sky
VisualGPT (ours): a plane is flying in the air over the trees
GT1: there 's and airplane in the sky flying over some trees
GT2: a large plane is flying over a crowd of trees
GT3: a aeroplane soaring high in the sky above the trees
GT4: a passenger plane flies in the sky over a forest
GT5: an airplane is seen flying over several trees

Image 4
Transformer: a white toilet sitting in a white bathroom next to a sink
M2 Transformer: a cat sitting in the toilet
AoA Transformer: a bathroom with a toilet and a sink
VisualGPT (ours): a cat sitting on top of a bathroom sink
GT1: a cat climbing into a bathroom sink looking at someone
GT2: a cat looks up as it stands in the bathroom sink
GT3: a large cat stands inside of a clean bathroom sink
GT4: cat is caught stepping in to the bathroom sink
GT5: a cute kitty cat in the sink of a bathroom near a brush and other items

Image 5
Transformer: a little girl is eating a birthday cake
M2 Transformer: a child and a child are sitting at a table with table with table
AoA Transformer: two children sitting at a table with a laptop computer
VisualGPT (ours): a woman and a girl sitting at a table with a birthday cake
GT1: a woman and child stand next to a table with cake on it
GT2: a lady standing near the table with a baby is posing for the camera
GT3: a woman stands beside a baby in a high chair a table is set with a birthday cake and champagne
GT4: a woman setting up her house for a party
GT5: a person standing next to a child in a booster seat

Table 8. Captions generated by our VisualGPT, Transformer, M2 Transformer and AoA Transformer on the 1% MS COCO data split.
