[...] significantly improved by pursuing image-text aligned representation learning.

In this paper, we present VIsual VOcabulary (VIVO) pre-training, which leverages large amounts of vision data without caption annotations to learn a rich visual vocabulary for NOC. As shown in Figure 1, we define a visual vocabulary as a joint embedding space in which the image region features and tags of semantically similar objects are mapped to vectors that are close to each other, e.g., "person" and "man", "accordion" and "instrument". Once the visual vocabulary is pre-trained, we can fine-tune the model on image-caption pairs for caption generation. Note that the dataset used for fine-tuning covers only a small subset of the most commonly occurring objects in the learned visual vocabulary. Nevertheless, our model can generalize to any image that contains a similar scene (e.g., people sitting on a couch in Figure 1) with novel objects unseen in the fine-tuning dataset, such as "accordion", thanks to the pre-trained visual vocabulary.

The VIVO pre-training method is motivated by learning cross-modality semantic alignment, as in conventional Vision-Language Pre-training (VLP) methods. However, unlike existing VLP models, which are pre-trained on image-caption pairs, VIVO is pre-trained on image-tag pairs. To the best of our knowledge, VIVO is the first VLP method that does not rely on caption annotations. It thus opens the possibility of leveraging, for VLP, many existing vision datasets originally developed for image tagging or object detection, such as ImageNet (Deng et al. 2009), Open Images (Kuznetsova et al. 2020), and Objects365 (Shao et al. 2019). Moreover, we can also leverage large amounts of images paired with machine-generated tags as weak supervision signals for VLP.

VIVO pre-training aims to learn a joint representation of visual and text input. We feed a multi-layer Transformer model an input consisting of image region features and a paired image-tag set. We then randomly mask one or more tags and ask the model to predict the masked tags conditioned on the image region features and the other tags. Given that tags are not ordered, we employ the Hungarian matching loss (Stewart, Andriluka, and Ng 2016; Carion et al. 2020) for tag prediction. Extensive experiments show that VIVO pre-training significantly improves captioning performance on NOC. In addition, our model can precisely align the object mentions in a generated caption with the regions in the corresponding image.

In summary, we make the following contributions.
• We propose VIVO, a new pre-training method that leverages large amounts of vision data without caption annotations for vision-language representation learning.
• We develop a Hungarian matching loss with masked tag prediction to conduct pre-training on image-tag pairs.
• With a single model, our method achieves a new state-of-the-art result on the nocaps benchmark and surpasses the human CIDEr score.

Prior Work

Image Captioning. Prior work on image captioning has focused on exploring different model structures and learning methods for different applications. For example, Song et al. (2019); Wang, Chen, and Hu (2019); Gao et al. (2019); Huang et al. (2019); Pan et al. (2020); Guo et al. (2020); Cornia et al. (2020) explore different attention mechanisms for captioning models. Other works improve performance with reinforcement learning (Rennie et al. 2017; Li, Chen, and Liu 2019; Yang et al. 2020) or adversarial learning (Chen et al. 2019; Dognin et al. 2019). Different applications have also been studied, such as dense captioning (Johnson, Karpathy, and Fei-Fei 2016; Yin et al. 2019; Li, Jiang, and Han 2019), grounded captioning (Ma et al. 2020; Zhou et al. 2020b), and image captioning with reading comprehension (Sidorov et al. 2020). However, all of these methods assume that most of the visual objects in the test data are seen in the training data. Thus, they do not work well for NOC, where the objects present in test images are often unseen in the caption-annotated training data.

Novel Object Captioning (NOC). NOC requires a model to generate image captions that describe novel objects that are unseen in the paired image-caption training data. Since this task setting resembles real-world applications, it has drawn growing interest in the research community. Early works, such as the Deep Compositional Captioner (Hendricks et al. 2016) and the Novel Object Captioner (Venugopalan et al. 2017), propose to use unpaired image and sentence data to transfer knowledge among semantically similar visual concepts. Empirical evaluation on the COCO dataset, holding out eight novel object categories, suggests that these methods might be applicable to NOC.

Recent studies propose to explicitly leverage object detection results for NOC. Yao et al. (2017) use LSTM-C with a copying mechanism to assemble the detected novel objects for caption generation. Neural Baby Talk (Lu et al. 2018) and the Decoupled Novel Object Captioner (Wu et al. 2018) generate template sentences that are later filled in with visual concepts recognized by object detectors. Similarly, Constrained Beam Search (Anderson et al. 2017) is exploited to generate captions that contain detected novel objects (Agrawal et al. 2019).

None of the aforementioned methods for NOC fully exploits the relationship between image and text, which we argue is crucial to the quality of generated captions. In this study, we pre-train a Transformer model to learn a visual vocabulary in which object tags are aligned with their corresponding image feature representations in a semantic space.

Vision and Language Pre-training. Motivated by BERT (Devlin et al. 2018), many VLP methods have been proposed to learn vision-language representations by pre-training large-scale Transformer models (Lu et al. 2019; Tan and Bansal 2019; Su et al. 2019; Chen et al. 2020; Zhou et al. 2020a; Li et al. 2020). Most existing VLP methods are developed for understanding tasks such as image-text retrieval and visual question answering. Only a few of them (Zhou et al. 2020a; Li et al. 2020) can be applied to image captioning, but these methods use paired image-caption data for pre-training and are not applicable to NOC. In this study, we break the dependency on image-caption pairs in VLP for the first time. The proposed VIVO pre-training learns vision-language alignment on image-tag pairs, improving the image captioning results on both NOC and the general image captioning task.
[Figure 2 diagram. (a) Pre-training: learn visual vocabulary — Open Images, 6.4K tags, w/o caption; multi-layer Transformer; attention mask; example tag "accordion". (b) Fine-tuning — COCO, 80 objects, w/ caption; tags "person, dog, couch"; masked input "[CLS] a [MASK] holding a [MASK] sitting on a couch. [SEP]" for the caption "A person holding a dog sitting on a couch." (c) Inference: novel object captioning — tags "person, umbrella, accordion"; partial input "[CLS] a person holding a black umbrella and [MASK]", completed as "A person holding a black umbrella and accordion."]
Figure 2: The proposed two-stage training scheme. (a) In VIVO pre-training, we train a Transformer-based model on image-tag
pairs for tag prediction, where it learns cross-modal representations for rich visual concepts. (b) In fine-tuning, we train the
same model on limited image-caption pairs to learn how to generate captions conditional on the image and tags. (c) During
inference, given the image and detected tags, our model is applied iteratively to generate a sequence of words describing novel
objects in an auto-regressive manner.
Proposed Method

Recent image captioning models have achieved impressive results on tasks where large amounts of paired image-caption training data are available. But they generalize poorly to images in the wild, where there is a wide variety of visual objects unseen in the caption corpora used for training. For example, models trained on COCO Captions can faithfully describe images containing objects such as "people", "dogs", or "a couch", but fail to generate a reasonable caption for any image containing "an accordion", since that object is unseen in COCO Captions.

To address this problem, we propose a weakly supervised learning approach that pre-trains image captioning models on image-tag pairs, which, compared to image-caption pairs, are available in larger amounts and contain many more diverse visual objects. Our approach uses a two-stage training scheme that consists of VIVO pre-training and fine-tuning. Figure 2 illustrates our approach with an example. First, in the pre-training stage (Figure 2(a)), an image captioning model learns to label image regions with tags (e.g., "person", "accordion") using image-tag pairs as training data, where the object "accordion" is included. Then, in fine-tuning (Figure 2(b)), given image-caption pairs and their corresponding detected object tags (e.g., "person" and "dog"), the model learns to map an image to a sentence conditioned on the detected objects, e.g., "[A] holding [B] ...", where [A] and [B] can attend to object tags. While the sentences are learned from image-caption pairs, the object tags may refer to novel visual objects that are unseen in the image-caption pairs (but seen in the image-tag data in this example). Thus, our model achieves compositional generalization, allowing zero-shot generalization to novel objects for image captioning. As shown in Figure 2(c), at inference time the model is able to recognize objects (e.g., "person", "accordion") and compose familiar constituents in a novel way to form the caption "a person holding an accordion".

The model architecture is shown in Figure 3. It consists of multiple Transformer layers that encode the input into feature vectors, and a linear layer with softmax that generates the text describing the visual objects in the image. In what follows, we describe in detail how the model is pre-trained and fine-tuned.
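To make the architecture description concrete, the following is a minimal sketch, assuming a BERT-base-sized encoder built from PyTorch's standard Transformer layers; the class name, the token-then-region input ordering, and all dimensions other than those stated in the paper are our own assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class VIVOStyleEncoder(nn.Module):
    """Sketch: Transformer layers over [text tokens; image regions],
    with a linear head over the vocabulary (softmax applied downstream)."""

    def __init__(self, vocab_size=30522, hidden=768, region_dim=2054,
                 num_layers=12, num_heads=12):
        super().__init__()
        self.token_embed = nn.Embedding(vocab_size, hidden)
        self.region_proj = nn.Linear(region_dim, hidden)   # project region features to hidden size
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Linear(hidden, vocab_size)          # token logits; softmax gives probabilities

    def forward(self, region_feats, token_ids, attn_mask=None):
        # region_feats: (B, K, region_dim); token_ids: (B, L); attn_mask: (L+K, L+K) or None
        x = torch.cat([self.token_embed(token_ids),
                       self.region_proj(region_feats)], dim=1)   # (B, L+K, hidden)
        h = self.encoder(x, mask=attn_mask)
        return self.head(h[:, :token_ids.size(1)])               # logits at the text positions
```

A softmax over these logits yields the per-position token probabilities used for tag prediction and caption generation.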
VIVO Pre-training

We pre-train the Transformer model on a large-scale dataset with abundant tags, e.g., the Open Images training set with 6.4K classes of image-level tags. Unlike many existing VLP methods that rely on image-caption pairs, VIVO pre-training is conducted solely on image-tag pairs, which are much easier to collect by either human labeling or automatic tagging.
The training objective is to predict the missing (masked) tags given a bag of image-level tags and image regions. We denote the training set as $\mathcal{D} = \{I_i, G_i\}_{i=1}^{N}$, with $N$ images and their corresponding tags, where $G_i = \{g_{ij}\}_{j=1}^{L_i}$ is the set of $L_i$ image-level tags associated with image $I_i$. These tags are textual labels of the visual objects present in the image, e.g., "person", "cat", "dining table", etc. In the rest of the paper, we omit the subscript $i$ for simplicity.

We use a multi-layer Transformer model to learn a joint representation for both the vision and language domains. The input to the Transformer model consists of image region features $V$ and tag tokens $T$, where $V = \{v_k\}_{k=1}^{K}$ are extracted from image $I$ using a detector trained on the Visual Genome dataset (Anderson et al. 2018), and $T = \{t_j\}_{j=1}^{T}$ are the tokenized tags in $G$. During training, some tokens are randomly masked out for the model to predict.
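To make the masking step concrete, here is a rough sketch of how one pre-training example might be assembled; the [MASK] id, the masking rate, and the field names are illustrative assumptions rather than values taken from the paper.

```python
import random
import torch

MASK_ID = 103  # assumed: BERT's [MASK] token id

def build_pretraining_example(region_feats, tag_token_ids, mask_prob=0.15):
    """Randomly mask some tag tokens; the model must recover them from the
    image regions and the remaining (unordered) tags."""
    input_ids, target_ids, masked_pos = list(tag_token_ids), [], []
    for pos, tok in enumerate(tag_token_ids):
        if random.random() < mask_prob:
            input_ids[pos] = MASK_ID
            target_ids.append(tok)       # unordered set of target tokens
            masked_pos.append(pos)
    return {
        "region_feats": region_feats,                  # (K, 2054) float tensor
        "input_ids": torch.tensor(input_ids),          # tags with some positions masked
        "masked_positions": torch.tensor(masked_pos),
        "target_ids": torch.tensor(target_ids),        # consumed by the set-matching loss below
    }
```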
The main difference between a caption and a set of tags is that the words in a caption are ordered while the tags are not. This unordered nature can cause ambiguity in tag prediction when two tags are masked out simultaneously. For example, if the masked tokens are "dog" and "cat", either token may be predicted at either masked position, without being restricted to its original position in the input. To resolve this issue, we propose to use the Hungarian matching loss (Stewart, Andriluka, and Ng 2016; Carion et al. 2020) to formulate tag prediction as a set-matching problem.

We denote the set of $M$ masked tokens as $\tilde{T} = \{t_m\}_{m=1}^{M}$, where $t_m$ is a token id in the vocabulary, and the prediction probabilities of the corresponding representations in the final Transformer layer as $P = \{p_i\}_{i=1}^{M}$, where $p_i$ is the classification probability distribution for the $i$-th masked position. Since the target tokens in $\tilde{T}$ are unordered, we need a one-to-one mapping from $\tilde{T}$ to $P$ such that the prediction at each masked position is assigned one of the target tokens. Once such an assignment $\alpha$ is known, the loss is defined as:

$$\mathcal{L}(\tilde{T}, P, \alpha) = \sum_{i=1}^{M} -\log\big(p_i(t_{\alpha(i)})\big) \qquad (1)$$
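Equation (1) presumes an assignment $\alpha$; in the cited works this is the permutation that minimizes the total cost, found with the Hungarian algorithm. Below is a minimal sketch of the loss using scipy.optimize.linear_sum_assignment; treating the assignment as a constant during back-propagation is our assumption about the implementation, which this excerpt does not spell out.

```python
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment

def hungarian_tag_loss(masked_logits, target_ids):
    """masked_logits: (M, vocab_size) predictions at the M masked positions.
    target_ids:    (M,) unordered ground-truth token ids (the set T~).
    Returns the Eq. (1) loss under the minimum-cost assignment alpha."""
    log_probs = F.log_softmax(masked_logits, dim=-1)          # log p_i over the vocabulary
    targets = target_ids.to(masked_logits.device)
    # cost[i, j] = -log p_i(t_j): cost of assigning target token j to masked position i
    cost = -log_probs[:, targets]
    _, alpha = linear_sum_assignment(cost.detach().cpu().numpy())
    alpha = torch.as_tensor(alpha, device=masked_logits.device)
    idx = torch.arange(len(alpha), device=masked_logits.device)
    return -log_probs[idx, targets[alpha]].sum()              # Eq. (1)
```

Gradients flow through log_probs only; the assignment itself is recomputed for every batch and treated as a constant.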
As shown in Figure 2(a), we use a bi-directional attention mask in VIVO pre-training. In order to predict a missing tag, the model has to resort to the image region features and the other tags, so it learns a joint representation that contains information from both image regions and textual tags. This facilitates the cross-modality alignment between the representations of image regions and tags.

Fine-tuning and Inference

After pre-training, the Transformer model is fine-tuned on a dataset where both captions and tags are available, e.g., the COCO training set annotated with captions and with tags from 80 object classes. The tags can also be generated automatically using a pre-trained tagging or detection model. Given image regions and tags, the model learns to predict the caption sentence conditioned on them, with some positions randomly masked out. More specifically, the input to the model during fine-tuning is a triplet of image region features $V$, a set of tags $T$, and a caption $C$, where $V$ and $T$ are constructed in the same way as in pre-training, and $C$ is a sequence of tokens. During fine-tuning, we randomly mask out some of the tokens in the caption for prediction and optimize the model parameters with the cross-entropy loss. To make the model generate captions from left to right at inference time, during fine-tuning we apply a uni-directional attention mask to the caption sequence to prevent each position from attending to subsequent positions.
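One possible realization of these attention patterns is a block-structured mask over an input ordered as [caption; tags; regions]: caption positions attend causally among themselves, while tag and region positions attend bi-directionally. The ordering, the boolean convention (True = blocked), and leaving tags and regions free to attend back to the caption are assumptions for illustration; the text above only specifies that caption positions must not attend to subsequent caption positions.

```python
import torch

def build_finetune_attention_mask(num_caption, num_tag, num_region):
    """Boolean attention mask (True = position j is NOT visible to position i).
    Caption tokens are causal; tags and regions are fully bi-directional."""
    n = num_caption + num_tag + num_region
    mask = torch.zeros(n, n, dtype=torch.bool)
    # caption-to-caption: block attention to subsequent caption positions
    causal = torch.triu(torch.ones(num_caption, num_caption, dtype=torch.bool), diagonal=1)
    mask[:num_caption, :num_caption] = causal
    return mask

# In VIVO pre-training there is no caption, so the mask degenerates to all-False,
# i.e. fully bi-directional attention over tags and regions (Figure 2(a)).
pretrain_mask = build_finetune_attention_mask(0, num_tag=15, num_region=50)
```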
During inference, we first extract image region features and detect tags from a given image. The model is then applied to generate a sequence, one token at a time, until it outputs the end-of-sentence token or reaches the maximum length. At each step the model is auto-regressive, consuming the previously generated tokens as additional input when generating the next one.
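A greedy version of this procedure can be sketched as follows; the BERT-style special-token ids, the [MASK]-placeholder decoding scheme (suggested by the example in Figure 2(c)), the omission of the attention mask, and the use of greedy rather than beam or constrained search are assumptions made for brevity.

```python
import torch

CLS_ID, SEP_ID, MASK_ID = 101, 102, 103   # assumed BERT-style special token ids

@torch.no_grad()
def generate_caption(model, region_feats, tag_ids, max_len=20):
    """Greedy auto-regressive decoding: at each step, feed the caption so far
    (plus a [MASK] placeholder), the tags, and the region features, and keep
    the most likely token predicted at the masked position."""
    caption = [CLS_ID]
    for _ in range(max_len):
        token_ids = torch.tensor([caption + [MASK_ID] + tag_ids])    # (1, L)
        logits = model(region_feats.unsqueeze(0), token_ids)         # (1, L, vocab)
        next_id = logits[0, len(caption)].argmax().item()            # prediction at [MASK]
        if next_id == SEP_ID:
            break
        caption.append(next_id)
    return caption[1:]   # drop [CLS]
```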
In the next section, we present extensive experimental results, showing that our model can generate captions that describe novel objects, and that the alignment between image regions and tags learned from VIVO pre-training is crucial to the model's superior performance on NOC.
[Figure 3 diagram (model architecture): a Transformer Encoder (×N; Feed Forward, Add & Norm) followed by Linear and Softmax layers; "Estimate cosine similarity between region feature and tag"; example tag "hamburger".]

[...] features, which are concatenated with scaled bounding boxes to form a 2054-dimension vector (2048D for the visual features and 6D for the bounding box encoding, including the top-left and bottom-right corners as well as the box's width and height). We use an object detector trained on the Open Images dataset to detect object tags for all datasets. For pre-training and fine-tuning, we also add the ground-truth tags from the training sets. No ground-truth tags are used on the nocaps validation and test sets. The Transformer model is initialized using BERT-base (Devlin et al. 2018), where we [...]
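As a concrete reading of the 2054-D region representation described above, the sketch below concatenates a 2048-D detector feature with a 6-D scaled box encoding (corners, width, height); normalizing the coordinates by the image width and height is our assumption about the scaling.

```python
import torch

def encode_region(visual_feat, box, img_w, img_h):
    """visual_feat: (2048,) detector feature for one region.
    box: (x1, y1, x2, y2) in pixels. Returns a (2054,) region feature:
    2048-D appearance + 6-D scaled box (corners, width, height)."""
    x1, y1, x2, y2 = box
    box_enc = torch.tensor([x1 / img_w, y1 / img_h,
                            x2 / img_w, y2 / img_h,
                            (x2 - x1) / img_w, (y2 - y1) / img_h],
                           dtype=visual_feat.dtype)
    return torch.cat([visual_feat, box_enc])   # (2054,)
```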
method                        | in-domain    | near-domain  | out-of-domain | overall
------------------------------|--------------|--------------|---------------|-------------
Validation Set               |              |              |               |
UpDown (Agrawal et al. 2019)  | 78.1 / 11.6  | 57.7 / 10.3  | 31.3 / 8.3    | 55.3 / 10.1
UpDown + CBS                  | 80.0 / 12.0  | 73.6 / 11.3  | 66.4 / 9.7    | 73.1 / 11.1
UpDown + ELMo + CBS           | 79.3 / 12.4  | 73.8 / 11.4  | 71.7 / 9.9    | 74.3 / 11.2
OSCAR (Li et al. 2020)        | 79.6 / 12.3  | 66.1 / 11.5  | 45.3 / 9.7    | 63.8 / 11.2
OSCAR + CBS                   | 80.0 / 12.1  | 80.4 / 12.2  | 75.3 / 10.6   | 79.3 / 11.9
OSCAR + SCST + CBS            | 83.4 / 12.0  | 81.6 / 12.0  | 77.6 / 10.6   | 81.1 / 11.7
VIVO                          | 88.8 / 12.9  | 83.2 / 12.6  | 71.1 / 10.6   | 81.5 / 12.2
VIVO + CBS                    | 90.4 / 13.0  | 84.9 / 12.5  | 83.0 / 10.7   | 85.3 / 12.2
VIVO + SCST + CBS             | 92.2 / 12.9  | 87.8 / 12.6  | 87.5 / 11.5   | 88.3 / 12.4
Human                         | 84.4 / 14.3  | 85.0 / 14.3  | 95.7 / 14.0   | 87.1 / 14.2
Test Set                      |              |              |               |
VIVO + SCST + CBS             | 89.0 / 12.9  | 87.8 / 12.6  | 80.1 / 11.1   | 86.6 / 12.4
Human                         | 80.6 / 15.0  | 84.6 / 14.7  | 91.6 / 14.2   | 85.3 / 14.6

Table 1: Evaluation on the nocaps validation and test sets. Each cell reports CIDEr / SPICE.
[Figure: qualitative caption examples (B vs. V). B: "a large piece of art is displayed on the beach" / V: "a turtle that is laying down on the beach"; B: "a group of four colored light up in the night sky" / V: "a bunch of red lantern lights on a street"; B: "a close up of a fruit with leaves" / V: "a close up of a peach on a tree branch".]
Acknowledgements

We thank Jianfeng Wang, Ehsan Azarnasab, Lin Liang, Pengchuan Zhang, Xiujun Li, Chunyuan Li, Jianwei Yang, Yu Wang, Houdong Hu, Furu Wei, and Dong Li for valuable discussions and comments.

References

Agrawal, H.; Desai, K.; Wang, Y.; Chen, X.; Jain, R.; Johnson, M.; Batra, D.; Parikh, D.; Lee, S.; and Anderson, P. 2019. nocaps: novel object captioning at scale. In ICCV.
Anderson, P.; Fernando, B.; Johnson, M.; and Gould, S. 2017. Guided open vocabulary image captioning with constrained beam search. In EMNLP.
Anderson, P.; He, X.; Buehler, C.; Teney, D.; Johnson, M.; Gould, S.; and Zhang, L. 2018. Bottom-up and top-down attention for image captioning and visual question answering. In CVPR.
Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; and Zagoruyko, S. 2020. End-to-End Object Detection with Transformers. In ECCV.
Chen, C.; Mu, S.; Xiao, W.; Ye, Z.; Wu, L.; and Ju, Q. 2019. Improving image captioning with conditional generative adversarial nets. In AAAI.
Chen, X.; Fang, H.; Lin, T.-Y.; Vedantam, R.; Gupta, S.; Dollár, P.; and Zitnick, C. L. 2015. Microsoft COCO Captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325.
Chen, Y.-C.; Li, L.; Yu, L.; Kholy, A. E.; Ahmed, F.; Gan, Z.; Cheng, Y.; and Liu, J. 2020. UNITER: Learning universal image-text representations. In ECCV.
Cornia, M.; Stefanini, M.; Baraldi, L.; and Cucchiara, R. 2020. Meshed-Memory Transformer for Image Captioning. In CVPR.
Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; and Fei-Fei, L. 2009. ImageNet: A large-scale hierarchical image database. In CVPR.
Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL.
Dognin, P.; Melnyk, I.; Mroueh, Y.; Ross, J.; and Sercu, T. 2019. Adversarial semantic alignment for improved image captions. In CVPR.
Fang, H.; Gupta, S.; Iandola, F.; Srivastava, R. K.; Deng, L.; Dollár, P.; Gao, J.; He, X.; Mitchell, M.; Platt, J. C.; et al. 2015. From captions to visual concepts and back. In CVPR.
Farhadi, A.; Hejrati, M.; Sadeghi, M. A.; Young, P.; Rashtchian, C.; Hockenmaier, J.; and Forsyth, D. 2010. Every picture tells a story: Generating sentences from images. In ECCV.
Gao, L.; Fan, K.; Song, J.; Liu, X.; Xu, X.; and Shen, H. T. 2019. Deliberate attention networks for image captioning. In AAAI.
Guo, L.; Liu, J.; Zhu, X.; Yao, P.; Lu, S.; and Lu, H. 2020. Normalized and Geometry-Aware Self-Attention Network for Image Captioning. In CVPR.
He, K.; Gkioxari, G.; Dollár, P.; and Girshick, R. 2017. Mask R-CNN. In ICCV.
Hendricks, L. A.; Venugopalan, S.; Rohrbach, M.; Mooney, R.; Saenko, K.; and Darrell, T. 2016. Deep compositional captioning: Describing novel object categories without paired training data. In CVPR.
Huang, L.; Wang, W.; Chen, J.; and Wei, X.-Y. 2019. Attention on attention for image captioning. In ICCV.
Johnson, J.; Karpathy, A.; and Fei-Fei, L. 2016. DenseCap: Fully convolutional localization networks for dense captioning. In CVPR.
Karpathy, A.; and Fei-Fei, L. 2015. Deep visual-semantic alignments for generating image descriptions. In CVPR.
Kulkarni, G.; Premraj, V.; Ordonez, V.; Dhar, S.; Li, S.; Choi, Y.; Berg, A. C.; and Berg, T. L. 2013. BabyTalk: Understanding and generating simple image descriptions. IEEE Transactions on Pattern Analysis and Machine Intelligence 35(12): 2891–2903.
Kuznetsova, A.; Rom, H.; Alldrin, N.; Uijlings, J.; Krasin, I.; Pont-Tuset, J.; Kamali, S.; Popov, S.; Malloci, M.; Duerig, T.; et al. 2020. The Open Images Dataset V4: Unified image classification, object detection, and visual relationship detection at scale. IJCV.
Kuznetsova, P.; Ordonez, V.; Berg, A.; Berg, T.; and Choi, Y. 2012. Collective generation of natural image descriptions. In ACL.
Li, N.; Chen, Z.; and Liu, S. 2019. Meta learning for image captioning. In AAAI.
Li, X.; Jiang, S.; and Han, J. 2019. Learning Object Context for Dense Captioning. In AAAI.
Li, X.; Yin, X.; Li, C.; Hu, X.; Zhang, P.; Zhang, L.; Wang, L.; Hu, H.; Dong, L.; Wei, F.; Choi, Y.; and Gao, J. 2020. Oscar: Object-semantics aligned pre-training for vision-language tasks. In ECCV.
Lu, J.; Batra, D.; Parikh, D.; and Lee, S. 2019. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In NeurIPS.
Lu, J.; Yang, J.; Batra, D.; and Parikh, D. 2018. Neural baby talk. In CVPR.
Ma, C.-Y.; Kalantidis, Y.; AlRegib, G.; Vajda, P.; Rohrbach, M.; and Kira, Z. 2020. Learning to Generate Grounded Visual Captions without Localization Supervision. In ECCV.
Mitchell, M.; Dodge, J.; Goyal, A.; Yamaguchi, K.; Stratos, K.; Han, X.; Mensch, A.; Berg, A.; Berg, T.; and Daumé III, H. 2012. Midge: Generating image descriptions from computer vision detections. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics.
Pan, Y.; Yao, T.; Li, Y.; and Mei, T. 2020. X-Linear Attention Networks for Image Captioning. In CVPR.
Radford, A. 2018. Improving Language Understanding by Generative Pre-Training.
Ren, S.; He, K.; Girshick, R.; and Sun, J. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. In NeurIPS.
Rennie, S. J.; Marcheret, E.; Mroueh, Y.; Ross, J.; and Goel, V. 2017. Self-critical sequence training for image captioning. In CVPR.
Shao, S.; Li, Z.; Zhang, T.; Peng, C.; Yu, G.; Zhang, X.; Li, J.; and Sun, J. 2019. Objects365: A large-scale, high-quality dataset for object detection. In ICCV.
Sharma, P.; Ding, N.; Goodman, S.; and Soricut, R. 2018. Conceptual Captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In ACL.
Sidorov, O.; Hu, R.; Rohrbach, M.; and Singh, A. 2020. TextCaps: A Dataset for Image Captioning with Reading Comprehension. In ECCV.
Song, L.; Liu, J.; Qian, B.; and Chen, Y. 2019. Connecting Language to Images: A Progressive Attention-Guided Network for Simultaneous Image Captioning and Language Grounding. In AAAI.
Stewart, R.; Andriluka, M.; and Ng, A. Y. 2016. End-to-end people detection in crowded scenes. In CVPR.
Su, W.; Zhu, X.; Cao, Y.; Li, B.; Lu, L.; Wei, F.; and Dai, J. 2019. VL-BERT: Pre-training of Generic Visual-Linguistic Representations. In ICLR.
Tan, H.; and Bansal, M. 2019. LXMERT: Learning Cross-Modality Encoder Representations from Transformers. In EMNLP.
Tran, K.; He, X.; Zhang, L.; Sun, J.; Carapcea, C.; Thrasher, C.; Buehler, C.; and Sienkiewicz, C. 2016. Rich image captioning in the wild. In CVPR.
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. In NeurIPS.
Venugopalan, S.; Anne Hendricks, L.; Rohrbach, M.; Mooney, R.; Darrell, T.; and Saenko, K. 2017. Captioning images with diverse objects. In CVPR.
Wang, W.; Chen, Z.; and Hu, H. 2019. Hierarchical attention network for image captioning. In AAAI.
Wu, Y.; Zhu, L.; Jiang, L.; and Yang, Y. 2018. Decoupled novel object captioner. In ACM Multimedia.
Yang, X.; Zhang, H.; Jin, D.; Liu, Y.; Wu, C.-H.; Tan, J.; Xie, D.; Wang, J.; and Wang, X. 2020. Fashion Captioning: Towards Generating Accurate Descriptions with Semantic Rewards. In ECCV.
Yang, Y.; Teo, C.; Daumé III, H.; and Aloimonos, Y. 2011. Corpus-guided sentence generation of natural images. In EMNLP.
Yao, T.; Pan, Y.; Li, Y.; and Mei, T. 2017. Incorporating copying mechanism in image captioning for learning novel objects. In CVPR.
Yin, G.; Sheng, L.; Liu, B.; Yu, N.; Wang, X.; and Shao, J. 2019. Context and attribute grounded dense captioning. In CVPR.
Young, P.; Lai, A.; Hodosh, M.; and Hockenmaier, J. 2014. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics 2: 67–78.
Zhou, L.; Palangi, H.; Zhang, L.; Hu, H.; Corso, J. J.; and Gao, J. 2020a. Unified Vision-Language Pre-Training for Image Captioning and VQA. In AAAI.
Zhou, Y.; Wang, M.; Liu, D.; Hu, Z.; and Zhang, H. 2020b. More Grounded Image Captioning by Distilling Image-Text Matching Model. In CVPR.