The limited availability of annotated data often hinders real-world applications of machine learning. To efficiently learn from small quantities of multimodal data, we leverage the linguistic knowledge from a large pre-trained language model (PLM) and quickly adapt it to new domains of image captioning. To effectively utilize a pretrained model, it is critical to balance the visual input and prior linguistic knowledge from pretraining. We propose VisualGPT, which employs a novel self-resurrecting encoder-decoder attention mechanism to balance the visual input against the linguistic knowledge stored in the PLM.

[Figure 1: Overall architecture of VisualGPT. A randomly initialized visual encoder (self-attention, cross-attention, and feed-forward layers) feeds decoder layers initialized from pretrained LM weights; a self-resurrecting encoder-decoder attention module with visual and linguistic gates connects the encoder output to the decoder, which generates the caption.]
Figure 2. Comparison of the part-of-speech distributions of the MS COCO and WikiText-2 datasets [54]. We use the spaCy parser and show only the most important categories.

A key challenge in utilizing PLMs is to bridge the domain gap between multi-modal data and the unimodal textual data the PLMs are pre-trained on. In Figure 2, we compare the part-of-speech distributions of MS COCO and WikiText-2 [54]. MS COCO employs 75% more nouns but 14% fewer verbs, which indicates a bias toward descriptions of static objects rather than actions. This suggests that, in order to effectively utilize PLMs in image captioning, we must balance prior linguistic knowledge acquired from pretraining against visual input information.

Figure 1 depicts the overall architecture of our proposed model, dubbed VisualGPT. In the commonly used encoder-decoder architecture for image captioning, we initialize the parameters of the decoder from PLMs such as GPT-2 [62], whereas the encoder layers are randomly initialized. In addition, we propose an attention mechanism with self-resurrecting activation units (SRAUs), which balances the input from the visual encoder and the linguistic input from the previous decoder layer. The proposed mechanism can produce sparse activations while not being as vulnerable to the zero-gradient problem as regular gates; the self-resurrecting gates can be "turned on" again after being zeroed out.

Empirical results demonstrate that, when trained on 0.1%, 0.5%, and 1% of the MS COCO and Conceptual Captions data, VisualGPT outperforms several strong baseline models. We achieve state-of-the-art results on IU X-ray [15], a medical report generation dataset. With several ablation experiments, we verify the effectiveness of PLMs and the proposed self-resurrecting attention mechanism.

Contributions. We make the following contributions:

• We explore the data efficiency problem for image captioning by utilizing pretrained language models (PLMs) as the caption decoder. With only a small amount of in-domain training data, the proposed technique quickly adapts PLMs to the cross-modal task of image captioning.

2. Related Work

Image Captioning. Image captioning has been extensively studied in computer vision research. Early methods [19, 33, 39, 71, 85] focus on filling templates with extracted objects, attributes, and relationships. With the advent of deep learning, researchers proposed end-to-end neural networks that encode an image into vector representations and decode a caption word by word [28, 77]. Many improvements to the encoder [11, 40, 52, 81, 82, 86, 87], the decoder [78, 79, 84], and the attention mechanism [8, 13, 25, 35, 38] have since been proposed. Encoding the image using object regions has proven beneficial [2]. Reinforcement learning enables model optimization with non-differentiable evaluation metrics [14, 47, 65, 70]. [9, 12] investigate fine-grained control of caption generation. [14, 70] adopt GAN-like architectures that encourage human-like captions.

A few formulations of the image captioning problem deviate from the traditional supervised learning paradigm. Novel object captioning aims to describe objects that do not exist in the training data [1, 24, 43, 53, 76]. Feng et al. [20] propose unsupervised captioning without using paired image-caption supervision. Kim et al. [30] focus on learning efficiency and improve data efficiency by learning from auxiliary unpaired image-caption data.

Self-supervised NLP Models. Self-supervised training of large neural networks on textual data has proven to be an important technique for creating high-performance NLP models. Several self-supervision signals have been proposed, such as autoregressive language modeling [5, 55], which includes the GPT series of models [6, 61, 62], and masked language modeling, which includes ELMo [59] and BERT-related methods [16, 34, 49].

In this paper, we propose a quick adaptation technique for network weights obtained using the language modeling (LM) objective. However, the proposed technique can easily be applied to other models, as the masked language modeling objective can be converted to the LM objective by masking only the last word in the textual sequence. Unlike neural networks pretrained on multimodal data (e.g., [41, 51, 60, 72, 73, 88, 89]), our method requires only a small amount of multimodal training data and focuses on adapting linguistic knowledge learned from the textual modality.
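As a concrete illustration of this conversion (not part of the proposed method itself), the sketch below uses the HuggingFace transformers library to obtain a next-word prediction from a masked language model by masking only the final position of the sequence; the checkpoint and prefix are illustrative assumptions.

```python
import torch
from transformers import BertForMaskedLM, BertTokenizer

# A masked LM acts like a left-to-right LM when only the last word is masked.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

prefix = "a man riding a"                                  # illustrative prefix
inputs = tokenizer(f"{prefix} {tokenizer.mask_token}", return_tensors="pt")
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero().item()

with torch.no_grad():
    logits = model(**inputs).logits                        # (1, seq_len, vocab)
next_id = int(logits[0, mask_pos].argmax())
print(tokenizer.decode([next_id]))                         # next-word prediction
```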
[Figure 3 diagram: decoder layer m of each model, showing masked self-attention, cross-attention with the visual features I, and the layer output; the legend distinguishes components with pretrained LM initialization, random initialization, and gated units.]

Figure 3. Architectures of the vanilla Transformer [74], the Transformer with the AoA module [25] (AoA Transformer), M2 Transformer [13], and VisualGPT. We denote I and H as the visual and language features, respectively. Z_{m-1} is the output from decoder layer m-1. Within the circles, α, B^vis, and B^lan represent different gating units.
3. Preliminaries: Transformer for Captioning

The Transformer [74] has become one of the standard models for image captioning. At its core lies the multi-head dot-product attention mechanism. Taking three input matrices, query Q, key K, and value V, the attention function can be written as

\text{Attn}(Q,K,V) = \text{softmax}\left( \frac{(W^q Q)(W^k K)^{\top}}{\sqrt{D}} \right) W^v V, \qquad (1)

where W^q, W^k, and W^v are trainable parameters and D is a scaling factor. Intuitively, the attention operation can be seen as encoding W^q Q as a convex combination of the row vectors of W^v V. The multi-head attention repeats the process with multiple sets of W^q, W^k, and W^v; the results are concatenated and linearly projected back to the same dimensionality.

In visual captioning tasks, we apply a visual encoder whose output is I ∈ R^{O×S}. O is the length of the input sequence, which in this work is a sequence of objects in the image, and S is the hidden dimension size. The decoder network outputs the words of the caption sequentially.

When decoding word t+1, the encoder-decoder attention takes as input the visual encoding I and the current state of the decoder H ∈ R^{t×S}. We apply the attention operation with H as the query and I as both the key and the value. The encoder-decoder attention is then

\text{EncDecAttn}(H, I) = \text{Attn}(H, I, I). \qquad (2)
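For reference, a minimal single-head PyTorch sketch of Eqs. (1) and (2) is shown below; it uses the common right-multiplication convention (Q W^q) rather than the paper's W^q Q notation, and the toy shapes at the bottom are illustrative assumptions rather than the actual model configuration.

```python
import torch

def attn(Q, K, V, Wq, Wk, Wv):
    # Single-head version of Eq. (1); multi-head attention repeats this with
    # several sets of (Wq, Wk, Wv) and concatenates the results.
    D = Wq.shape[-1]                                     # scaling factor
    scores = (Q @ Wq) @ (K @ Wk).transpose(-1, -2) / D ** 0.5
    return torch.softmax(scores, dim=-1) @ (V @ Wv)

def enc_dec_attn(H, I, Wq, Wk, Wv):
    # Eq. (2): decoder state H (t x S) attends over the visual encoding I (O x S).
    return attn(H, I, I, Wq, Wk, Wv)

# Toy shapes for illustration: t = 4 decoded tokens, O = 10 regions, S = 8 dims.
t, O, S = 4, 10, 8
H, I = torch.randn(t, S), torch.randn(O, S)
Wq, Wk, Wv = (torch.randn(S, S) for _ in range(3))
print(enc_dec_attn(H, I, Wq, Wk, Wv).shape)              # torch.Size([4, 8])
```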
Researchers have proposed other variants of the encoder-decoder attention. In Figure 3, we contrast these decoder architectures with the proposed VisualGPT model. The Attention-on-Attention (AoA) module [25] provides an alternative method for combining the visual encoding I and the linguistic information H from the decoder. As another way of combining visual and linguistic information, M2 Transformer [13] connects all decoder layers to all encoder layers; in Figure 3, this is represented by the box labeled Meshed Connection Sum.

4. VisualGPT

Pretrained language models (PLMs) such as GPT-2 [62] are trained on data from a single modality. We use a PLM as the caption decoder and feed visual information to the PLM via the encoder-decoder attention, which plays a crucial role in quickly adapting the PLM.

With the design of the encoder-decoder attention, we aim to carefully balance visual information from the encoder and linguistic knowledge stored in the PLM. During the generation of visual words, such as "person", "truck", or "dog", the model should attend to the visual information. In contrast, the generation of determiners or connectives requires only linguistic knowledge. Ideally, we would like to exploit the massive amount of linguistic knowledge stored in the PLM weights (e.g., [46]) while referring to the visual input only when required. To achieve this goal, we introduce a pair of specialized gating units.
The proposed self-resurrecting activation unit (SRAU) balances between these two modalities using two complementary gates, B^vis and B^lan. The output of this module is

B^{\text{vis}} \otimes \text{EncDecAttn}(H, I) + B^{\text{lan}} \otimes H, \qquad (3)

where ⊗ denotes element-wise multiplication. Letting B^vis[i, j] and B^lan[i, j] denote the elements of the matrices,

B^{\text{vis}}[i, j] = \sigma(H[i,j]) \, \mathbb{1}\big(\sigma(H[i,j]) > \tau\big),
B^{\text{lan}}[i, j] = \big(1 - \sigma(H[i,j])\big) \, \mathbb{1}\big(1 - \sigma(H[i,j]) > \tau\big), \qquad (4)

where σ(·) is the sigmoid function, 1(·) is the indicator function, and τ is a threshold hyperparameter that controls the sparsity of the gates.

[Figure 4: The gate values B^vis and B^lan plotted as functions of the gate input over the range −10 to 10.]
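A minimal PyTorch sketch of Eqs. (3) and (4) is given below, assuming the attention output EncDecAttn(H, I) has already been computed (for instance, with the enc_dec_attn sketch above); the default threshold follows the τ = 0.2 used in our experiments.

```python
import torch

def srau(H, attn_out, tau=0.2):
    # H:        decoder hidden states (t x S), the linguistic signal.
    # attn_out: EncDecAttn(H, I) from Eq. (2), the visual signal.
    # tau:      sparsity threshold; the experiments use tau = 0.2.
    g = torch.sigmoid(H)
    B_vis = g * (g > tau).float()                # Eq. (4), visual gate
    B_lan = (1 - g) * ((1 - g) > tau).float()    # Eq. (4), linguistic gate
    return B_vis * attn_out + B_lan * H          # Eq. (3), element-wise mixture
```

Because both gates are recomputed from H at every step, an entry that is zeroed out by the threshold can become active again once H changes, which is the self-resurrecting behavior described above.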
Table 1. Performance of the compared methods trained on 0.1%, 0.5%, and 1% of the MS COCO and Conceptual Captions image-caption pairs. The best performance in each configuration is in bold. Ablated models are marked in gray.

We report BLEU, METEOR, ROUGE [44], and CIDEr [75] scores.

IU X-ray [15] is a radiography dataset containing 7,470 chest X-ray images and 3,955 human-written reports. As the dataset is already small, we follow the original split, which has a training set of 5,226 images and 2,770 reports. Most reports have two images corresponding to the frontal and lateral viewpoints.

5.2. Experimental Settings

Baselines. We compare our model with several state-of-the-art transformer-based models, including:

• Plain Transformer [74].

• AoA Transformer, which inserts an attention-on-attention (AoA) module [25] into every transformer layer, as depicted in Figure 3(b). Following [13], we slightly update the original AoA network in [25] by replacing the LSTM with Transformers in order to create a fair Transformer-to-Transformer comparison.

• M2 Transformer [13], which proposes a meshed connection between the encoder and decoder and is one of the best-performing models on MS COCO.

• X-Transformer [56], which employs bilinear pooling to selectively capitalize on visual information and is one of the best-performing models on MS COCO.

• OSCAR [41], which fine-tunes a BERT initialization on image-language datasets.

Since VisualGPT has GPT as the pretrained decoder, for fair comparisons, we also create variants of the Transformer, AoA Transformer, and M2 Transformer with GPT as the decoder. For VisualGPT, we set τ to 0.2 in all experiments. We also explored the effect of different τ and find that τ in the range [0, 0.2] offers the right level of sparsity. For all other baselines, we tune the hyperparameters on the validation set of MS COCO. We train our model and all the baselines in the reinforcement learning setting following [13]. Please see the supplemental material for more details on hyperparameters and experimental results.

5.3. Quantitative Results

Small In-domain Training Data. Results on MS COCO and Conceptual Captions are presented in Table 1. VisualGPT outperforms the best-performing baseline model by 4.1 CIDEr when trained on 0.1% of the MS COCO data, 6.4 CIDEr when trained on 0.5% of the data, and 2.5 CIDEr with 1% of the training data. On the Conceptual Captions dataset, VisualGPT also outperforms all the baselines. It outperforms the best baseline model by 4.2 CIDEr under 0.1% training data, 3.5 CIDEr under 0.5% data, and 0.3 CIDEr under 1% data.

Comparison with a BERT-based model. We compare with OSCAR [41], a BERT-based [16] model that performs well on many benchmarks. For a fair comparison with our model, we run their model without pretraining on a large-scale image-language corpus. The main difference between BERT and GPT lies in their pretraining objectives: BERT uses masked language modeling, whereas GPT autoregressively predicts the next word. GPT therefore has learning behavior more similar to an image captioning model, since both are optimized by autoregressively generating the next word. The experimental results in Table 1 show that VisualGPT outperforms OSCAR on both datasets, which confirms our choice of GPT as the decoder.

Medical Report Generation. We compared VisualGPT against state-of-the-art medical report generation models, including Att2in [65], CoAtt [27], HRGR [37], CMAS-RL [26], and the model from Chen et al. [10]. This dataset contains only around 2,770 medical reports in the training set, which is less than 1% of the COCO data and poses a data-efficiency challenge. We follow the same experimental setting as [10]. The results show that VisualGPT outperforms the baselines on most evaluation metrics and establishes a new state of the art. This demonstrates the value of bringing GPT knowledge into a highly specific domain where paired data are "expensive" and insufficient. We hope our findings can inspire future work in other domains.

Table 2. Performance on the IU X-ray dataset.
Models            B-1   B-2   B-3   B-4   R     M     C
Att2in            22.4  12.9  8.9   6.8   30.8  -     29.7
CoAtt             45.5  28.8  20.5  15.4  36.9  -     27.7
HRGR              43.8  29.8  20.8  15.1  32.2  -     34.3
CMAS-RL           46.4  30.1  21.0  15.4  37.1  -     27.5
Chen et al.       47.0  30.4  21.9  16.5  37.1  18.7  -
VisualGPT (ours)  48.0  31.3  22.2  15.9  37.4  20.5  49.7

Comparison Against Semi-supervised and Unsupervised Methods. Kim et al. [31] proposed a semi-supervised learning method to improve the data efficiency of image captioning. They used 1% of the images and all of their captions as training data, rather than 1% of all image-caption pairs as in Table 1; hence they cover fewer images, since each image is associated with more than one caption. For Kim et al. + unpaired, they also employ the other 99% of MS COCO as unpaired images and captions for training. We replicate their setup by training with only 1% of the images. As shown in Table 3, without using additional unpaired images and captions, the proposed VisualGPT method outperforms Kim et al. [31] by 20.6 CIDEr.

We also compare VisualGPT against the unsupervised methods of Gu et al. [22] and Feng et al. [20], which use tens of millions of unpaired images and captions. Even though these are not fair comparisons, it is encouraging to see VisualGPT surpass these baselines by utilizing the supervision of only 1,133 training images.

Table 3. Comparison with unsupervised and semi-supervised learning methods using Kim et al.'s split of MS COCO. Kim et al. employ only 1% of the images for training, in contrast to the 1% of image-caption pairs in Table 1. Note that Kim et al. + unpaired also uses the rest of the training data as unpaired images and texts. The gray shading denotes baselines that use a large amount of unpaired images and texts during training.
Models                 B-1   B-4   M     R     C
Kim et al. [31]        58.1  13.4  15.9  -     36.0
Kim et al. + unpaired  63.0  18.7  20.7  -     55.2
Gu et al. [22]         46.2  5.4   13.2  -     17.7
Feng et al. [20]       58.9  18.6  17.9  -     54.9
VisualGPT (ours)       67.1  24.3  21.9  48.6  75.8
[Figure 5 plot omitted; x-axis: Threshold (τ) ∈ {0, 0.1, 0.2}; y-axis: CIDEr.]
Figure 5. CIDEr performance vs. different thresholds τ with 0.1%, 0.5%, 1%, and 5% training data.

Table 4. The percentage of votes received by VisualGPT and baseline models under different quantities of training data.
Method           0.1% data  0.5% data  1% data
Transformer      18.4%      17.2%      16.8%
AoA Transformer  11.5%      20.9%      25.0%
M2 Transformer   30.9%      22.8%      20.8%
VisualGPT        39.2%      39.1%      37.4%

Table 5. Human evaluation of object hallucination and omission. GT denotes the ground-truth captions.
Q1. Does the caption miss things shown in the image?
Answer   Ours  M2 Transformer  Transformer  AoA   GT
No       719   624             633          621   973
Yes      367   438             456          447   73
No Rate  0.66  0.59            0.58         0.58  0.93

Q2. Does the caption describe things not in the image?
Answer   Ours  M2 Transformer  Transformer  AoA   GT
No       720   692             633          655   448
Yes      360   418             423          412   43
No Rate  0.67  0.62            0.60         0.61  0.96
5.4. Ablation Studies

Ablation on cross-attention. To fairly compare our SRAU with the other cross-attention mechanisms in the baselines, we also initialize their decoders with the 12-layer GPT and keep the same encoder as VisualGPT. We contrast plain cross-attention, meshed cross-attention, and attention-on-attention (AoA) modules. For the AoA Transformer, we add the AoA module on top of the cross-attention. Table 1 shows the results, which demonstrate that SRAU is better than the other cross-attention modules at exploiting the GPT knowledge for the image-captioning task.

Ablation on SRAU. We create an ablation called Normalized SRAU, where we replace the SRAU with the normalized SRAU (see Figure 4) and use GPT-2 initialization. We provide the results in Table 1. The normalized SRAU results in substantially lower performance, decreasing CIDEr relative to the full VisualGPT by 2.7, 1.0, and 0.3, respectively, on the three MS COCO setups, and by 2.2, 1.3, and 0.6, respectively, on Conceptual Captions. This demonstrates that the self-resurrecting property is beneficial for learning from small amounts of training data.

[Figure: captions generated by VisualGPT with per-word visual attention scores, e.g., "a woman sitting on a bench in a park" (GT: "the lady is sitting on the wood bench"), "a laptop sitting on a desk with a mouse" (GT: "a laptop with a keyboard and mouse are on this desk"), and "a cat is sitting in front of a television" (GT: "a cat is sitting in front of a television").]
where w_{1:T} represents the target ground-truth sequence. For reinforcement learning, we employ a variant of Self-Critical Sequence Training [66]. Following [13], we sample L sentences ŵ^1_{1:T}, ..., ŵ^L_{1:T} with beam search and use the mean reward of the L sentences as the baseline b. The gradient is

\nabla_{\theta}\mathcal{L}_{RL}(\theta) = -\frac{1}{L} \sum_{i=1}^{L} \Big( \big(r(\hat{w}^i_{1:T}) - b\big) \, \nabla_{\theta} \log p(\hat{w}^i_{1:T}) \Big), \qquad (7)

where r(·) denotes the CIDEr-D reward.
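A minimal PyTorch sketch of the corresponding surrogate loss is shown below, assuming the summed log-probabilities and CIDEr-D rewards of the L beam-searched captions for one image have already been computed; differentiating this loss reproduces the gradient in Eq. (7).

```python
import torch

def scst_loss(log_probs, rewards):
    # log_probs: (L,) summed log-probabilities log p(w^i_{1:T}) of the L
    #            beam-searched captions for one image.
    # rewards:   (L,) CIDEr-D rewards r(w^i_{1:T}) for those captions.
    baseline = rewards.mean()                        # mean reward b
    advantage = (rewards - baseline).detach()        # (r - b), no gradient
    return -(advantage * log_probs).mean()           # autograd yields Eq. (7)
```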
A.2. Training VisualGPT with More COCO and Conceptual Captions Data

Figure 8 shows further results obtained by training the networks on 5%, 10%, 20%, 50%, and 100% (82,783 images) of the MS COCO data. Figure 9 shows the performance as the data scales up to 2.5% (82,958 images) of Conceptual Captions, at which point the dataset size is similar to the whole of COCO. For MS COCO, VisualGPT outperforms the other baseline models when we sample ≤ 20% of the training data. For Conceptual Captions, VisualGPT consistently outperforms all the baselines when we sample ≤ 2.5% of the training images.

These experiments highlight our model's effectiveness in low-data regimes. On the other hand, we should also note that M2 Transformer surpasses VisualGPT's performance with 50% and 100% of the COCO training data. But when we train with the same number of Conceptual Captions images, VisualGPT consistently outperforms all the baselines. This leads us to consider why VisualGPT behaves differently on the two datasets. The difference between them is that Conceptual Captions contains more diverse vocabulary and image content, whereas COCO captions cover only 80 common object categories. Therefore, the appearance frequency of each word in COCO is much higher than in Conceptual Captions, and COCO's vocabulary diversity is also much lower. We hypothesize that when the captions cover each word only sparsely, caption generation benefits greatly from GPT's inherent knowledge, which helps the model adapt quickly to the new domain. But when there is a lot of in-domain data, current image-captioning models can already generalize well, and the in-domain data potentially contradicts GPT's original knowledge.
A.3. Attention over Different Types of Words

We use the spaCy parser to detect the part of speech (PoS) of the words in the generated captions and compute the mean visual attention score for each PoS. The results are presented in Fig. 10. We find that PoS categories tied to visual content, such as nouns (0.71), verbs (0.71), and adjectives (0.72), have high visual attention scores, whereas linguistic PoS categories such as pronouns (0.53), punctuation (0.58), and determiners (0.61) receive low attention.
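A minimal sketch of this analysis with spaCy is shown below; the input format for the captions and per-word scores is a hypothetical assumption, and in practice the model's subword tokens would need to be aligned with spaCy's tokens.

```python
import spacy
from collections import defaultdict

nlp = spacy.load("en_core_web_sm")   # assumes this spaCy model is installed

def mean_attention_by_pos(captions, attention_scores):
    # captions:         list of generated caption strings
    # attention_scores: per-word visual attention scores aligned with the
    #                   spaCy tokens of each caption (hypothetical format)
    totals, counts = defaultdict(float), defaultdict(int)
    for caption, scores in zip(captions, attention_scores):
        for token, score in zip(nlp(caption), scores):
            totals[token.pos_] += score
            counts[token.pos_] += 1
    return {pos: totals[pos] / counts[pos] for pos in totals}
```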
[Figure: additional generated captions with per-word visual attention scores, e.g., "a red vase of roses sitting on top of a glass" (GT: "the large red flower is inside of a clear glass vase"), "a man in a store looking at his camera" (GT: "a train sitting under a display inside a building"), and "a young man holding a backpack on a bench" (GT: "a man sitting on a bench next to a few bags").]

A.4. More Qualitative Examples

We present more qualitative examples in Tables 6, 7, and 8. Overall, we can observe that VisualGPT describes the image content more accurately than the baseline models.
Table 6. Captions generated by our VisualGPT, Transformer, M2 Transformer, and AoA Transformer on the 0.1% MS COCO data split (images not shown).

Example 1:
  Transformer: several boats are sitting in the middle of a lake
  M2 Transformer: a boat filled with boats floating in the water
  AoA Transformer: an empty boat that has water and water
  VisualGPT (ours): a canal filled with boats in the water
  Ground truth:
    GT1: a blue boat docked on a green lush shore
    GT2: a small marina with boats docked there
    GT3: a group of boats sitting together with no one around
    GT4: some boats parked in the water at a dock
    GT5: boats sitting around the side of a lake by a tree
Table 7. Captions generated by our VisualGPT, Transformer, M2 Transformer, and AoA Transformer on the 0.5% MS COCO data split (images not shown).

Example 1:
  Transformer: a man in a suit and a woman standing in a shop
  M2 Transformer: a man is standing in a shop with a people holding people
  AoA Transformer: a man is working on a bus in a
  VisualGPT (ours): a group of people standing at an airport with their luggage
  Ground truth:
    GT1: several people are purchasing tickets at a bus station
    GT2: some people are checking in at the ticket counter somewhere in asia
    GT3: people waiting in line with luggage at a ticket counter
    GT4: people are standing near an airport ticket kiosk
    GT5: customers stand at a kiosk waiting for tickets

Example 2:
  Transformer: a bus that is parked in front of a building
  M2 Transformer: a couple of people walking down the side of a street
  AoA Transformer: a bus is parked in a city street
  VisualGPT (ours): a while and blue bus is parked on the side of a city street
  Ground truth:
    GT1: people standing outside of a blue and white bus
    GT2: an image of a tour bus that is picking people up
    GT3: several people standing around buses and most wearing orange vests
    GT4: a public transit bus pulling up to pick up passengers
    GT5: a city bus at a stop waiting to pick up passengers

Example 3:
  Transformer: a blue and white airplane flying through a sky
  M2 Transformer: an air plane flying in the air
  AoA Transformer: a plane airplane flying down in the sky
  VisualGPT (ours): a plane is flying in the air over the trees
  Ground truth:
    GT1: there 's and airplane in the sky flying over some trees
    GT2: a large plane is flying over a crowd of trees
    GT3: a aeroplane soaring high in the sky above the trees
    GT4: a passenger plane flies in the sky over a forest
    GT5: an airplane is seen flying over several trees

Example 4:
  Transformer: a white toilet sitting in a white bathroom next to a sink
  M2 Transformer: a cat sitting in the toilet
  AoA Transformer: a bathroom with a toilet and a sink
  VisualGPT (ours): a cat sitting on top of a bathroom sink
  Ground truth:
    GT1: a cat climbing into a bathroom sink looking at someone
    GT2: a cat looks up as it stands in the bathroom sink
    GT3: a large cat stands inside of a clean bathroom sink
    GT4: cat is caught stepping in to the bathroom sink
    GT5: a cute kitty cat in the sink of a bathroom near a brush and other items

Example 5:
  Transformer: a little girl is eating a birthday cake
  M2 Transformer: a child and a child are sitting at a table with table with table
  AoA Transformer: two children sitting at a table with a laptop computer
  VisualGPT (ours): a woman and a girl sitting at a table with a birthday cake
  Ground truth:
    GT1: a woman and child stand next to a table with cake on it
    GT2: a lady standing near the table with a baby is posing for the camera
    GT3: a woman stands beside a baby in a high chair a table is set with a birthday cake and champagne
    GT4: a woman setting up her house for a party
    GT5: a person standing next to a child in a booster seat

Table 8. Captions generated by our VisualGPT, Transformer, M2 Transformer, and AoA Transformer on the 1% MS COCO data split.