[...] significantly improved by pursuing image-text aligned representation learning.

In this paper, we present VIsual VOcabulary (VIVO) pre-training, which leverages large amounts of vision data without caption annotations to learn a rich visual vocabulary for NOC. As shown in Figure 1, we define a visual vocabulary as a joint embedding space in which the image region features and tags of semantically similar objects are mapped to vectors that are close to each other, e.g., "person" and "man", "accordion" and "instrument". Once the visual vocabulary is pre-trained, we can fine-tune the model on image-caption pairs for caption generation. Note that the dataset used for fine-tuning covers only a small subset of the most commonly occurring objects in the learned visual vocabulary. Nevertheless, our model can generalize to any image that contains a similar scene (e.g., people sitting on a couch in Figure 1) with novel objects unseen in the fine-tuning dataset, such as "accordion", thanks to the pre-trained visual vocabulary.

The VIVO pre-training method is motivated by learning cross-modality semantic alignment, as in conventional Vision-Language Pre-training (VLP) methods. However, unlike existing VLP models, which are pre-trained on image-caption pairs, VIVO is pre-trained on image-tag pairs. To the best of our knowledge, VIVO is the first VLP method that does not rely on caption annotations. It thus opens the possibility of leveraging, for VLP, many existing vision datasets originally developed for image tagging or object detection, such as ImageNet (Deng et al. 2009), Open Images (Kuznetsova et al. 2020), and Objects365 (Shao et al. 2019). Moreover, we can also leverage large amounts of images paired with machine-generated tags as weak supervision signals for VLP.

VIVO pre-training aims to learn a joint representation of visual and text input. We feed a multi-layer Transformer model an input consisting of image region features and a paired image-tag set. We then randomly mask one or more tags and ask the model to predict the masked tags conditioned on the image region features and the other tags. Given that tags are not ordered, we employ the Hungarian matching loss (Stewart, Andriluka, and Ng 2016; Carion et al. 2020) for tag prediction. Extensive experiments show that VIVO pre-training significantly improves captioning performance on NOC. In addition, our model can precisely align the object mentions in a generated caption with the regions in the corresponding image.

In summary, we make the following contributions.
• We propose VIVO, a new pre-training method that leverages large amounts of vision data without caption annotations for vision-language representation learning.
• We develop a Hungarian matching loss with masked tag prediction to conduct pre-training on image-tag pairs.
• With a single model, our method achieves a new state-of-the-art result on the nocaps benchmark and surpasses the human CIDEr score.

Prior Work

Image Captioning. Prior work on image captioning has focused on exploring different model structures and learning methods for different applications. For example, Song et al. (2019); Wang, Chen, and Hu (2019); Gao et al. (2019); Huang et al. (2019); Pan et al. (2020); Guo et al. (2020); Cornia et al. (2020) explore different attention mechanisms for captioning models. Other works improve performance with reinforcement learning (Rennie et al. 2017; Li, Chen, and Liu 2019; Yang et al. 2020) or adversarial learning (Chen et al. 2019; Dognin et al. 2019). Different applications have also been studied, such as dense captioning (Johnson, Karpathy, and Fei-Fei 2016; Yin et al. 2019; Li, Jiang, and Han 2019), grounded captioning (Ma et al. 2020; Zhou et al. 2020b), and image captioning with reading comprehension (Sidorov et al. 2020). However, all of these methods assume that most of the visual objects in the test data are seen in the training data. Thus, they do not work well for NOC, where the objects present in test images are often unseen in the caption-annotated training data.

Novel Object Captioning (NOC). NOC requires a model to generate image captions that describe novel objects that are unseen in the paired image-caption training data. Since this task setting resembles real-world applications, it has drawn growing interest in the research community. Early works, such as the Deep Compositional Captioner (Hendricks et al. 2016) and the Novel Object Captioner (Venugopalan et al. 2017), propose to use unpaired image and sentence data to transfer knowledge among semantically similar visual concepts. Empirical evaluation on the COCO dataset, holding out eight novel object categories, suggests that these methods might be applicable to NOC.

Recent studies propose to explicitly leverage object detection results for NOC. Yao et al. (2017) use LSTM-C with a copying mechanism to assemble the detected novel objects for caption generation. Neural Baby Talk (Lu et al. 2018) and the Decoupled Novel Object Captioner (Wu et al. 2018) generate template sentences that are later filled in with visual concepts recognized by object detectors. Similarly, Constrained Beam Search (Anderson et al. 2017) is exploited to generate captions that contain detected novel objects (Agrawal et al. 2019).

None of the aforementioned methods for NOC fully exploits the relationship between image and text, which we argue is crucial to the quality of generated captions. In this study, we pre-train a Transformer model to learn a visual vocabulary in which object tags are aligned with their corresponding image feature representations in a semantic space.

Vision and Language Pre-training. Motivated by BERT (Devlin et al. 2018), many VLP methods have been proposed to learn vision-language representations by pre-training large-scale Transformer models (Lu et al. 2019; Tan and Bansal 2019; Su et al. 2019; Chen et al. 2020; Zhou et al. 2020a; Li et al. 2020). Most existing VLP methods are developed for understanding tasks such as image-text retrieval and visual question answering. Only a few of them (Zhou et al. 2020a; Li et al. 2020) can be applied to image captioning, but these methods use paired image-caption data for pre-training and are not applicable to NOC. In this study, we break the dependency on image-caption pairs in VLP for the first time. The proposed VIVO pre-training learns vision-language alignment on image-tag pairs, improving the image captioning results on both NOC and the general image captioning task.
[Figure 2 diagram. (a) Pre-training: learn visual vocabulary — Open Images, 6.4K tags, w/o caption; multi-layer Transformer; attention mask; example tag "accordion". (b) Fine-tuning — COCO, 80 objects, w/ caption; tags "person, dog, couch"; masked input "[CLS] a [MASK] holding a [MASK] sitting on a couch. [SEP]" for the caption "A person holding a dog sitting on a couch." (c) Inference: novel object captioning — tags "person, umbrella, accordion"; partial input "[CLS] a person holding a black umbrella and [MASK]", completed as "A person holding a black umbrella and accordion."]
Figure 2: The proposed two-stage training scheme. (a) In VIVO pre-training, we train a Transformer-based model on image-tag
pairs for tag prediction, where it learns cross-modal representations for rich visual concepts. (b) In fine-tuning, we train the
same model on limited image-caption pairs to learn how to generate captions conditional on the image and tags. (c) During
inference, given the image and detected tags, our model is applied iteratively to generate a sequence of words describing novel
objects in an auto-regressive manner.
Proposed Method

Recent image captioning models have achieved impressive results on tasks where large amounts of paired image-caption training data are available. But they generalize poorly to images in the wild, where there is a wide variety of visual objects unseen in the caption corpora used for training. For example, models trained on COCO Captions can faithfully describe images containing objects such as "people", "dogs", or "a couch", but fail to generate a reasonable caption for any image containing "an accordion", since that object is unseen in COCO Captions.

To address this problem, we propose a weakly supervised learning approach that pre-trains image captioning models on image-tag pairs, which, compared to image-caption pairs, are available in larger amounts and contain many more diverse visual objects. Our approach uses a two-stage training scheme that consists of VIVO pre-training and fine-tuning. Figure 2 illustrates our approach with an example. First, in the pre-training stage (Figure 2(a)), an image captioning model learns to label image regions with tags (e.g., "person", "accordion") using image-tag pairs as training data, where the object "accordion" is included. Then, in fine-tuning (Figure 2(b)), given image-caption pairs and their corresponding detected object tags (e.g., "person" and "dog"), the model learns to map an image to a sentence conditioned on the detected objects, e.g., "[A] holding [B] ...", where [A] and [B] can attend to object tags. While the sentences are learned from image-caption pairs, the object tags may refer to novel visual objects that are unseen in the image-caption pairs (but seen in the image-tag data in this example). Thus, our model achieves compositional generalization, allowing zero-shot generalization to novel objects for image captioning. As shown in Figure 2(c), at inference time the model is able to recognize objects (e.g., "person", "accordion") and compose familiar constituents in a novel way to form the caption "a person holding an accordion".

The model architecture is shown in Figure 3. It consists of multiple Transformer layers that encode the input into feature vectors, and a linear layer with softmax that generates the text describing the visual objects in the image. In what follows, we describe in detail how the model is pre-trained and fine-tuned.
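To make the architecture description concrete, the following is a minimal sketch, assuming a BERT-base-sized encoder built from PyTorch's standard Transformer layers; the class name, the token-then-region input ordering, and all dimensions other than those stated in the paper are our own assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class VIVOStyleEncoder(nn.Module):
    """Sketch: Transformer layers over [text tokens; image regions],
    with a linear head over the vocabulary (softmax applied downstream)."""

    def __init__(self, vocab_size=30522, hidden=768, region_dim=2054,
                 num_layers=12, num_heads=12):
        super().__init__()
        self.token_embed = nn.Embedding(vocab_size, hidden)
        self.region_proj = nn.Linear(region_dim, hidden)   # project region features to hidden size
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Linear(hidden, vocab_size)          # token logits; softmax gives probabilities

    def forward(self, region_feats, token_ids, attn_mask=None):
        # region_feats: (B, K, region_dim); token_ids: (B, L); attn_mask: (L+K, L+K) or None
        x = torch.cat([self.token_embed(token_ids),
                       self.region_proj(region_feats)], dim=1)   # (B, L+K, hidden)
        h = self.encoder(x, mask=attn_mask)
        return self.head(h[:, :token_ids.size(1)])               # logits at the text positions
```

A softmax over these logits yields the per-position token probabilities used for tag prediction and caption generation.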
VIVO Pre-training

We pre-train the Transformer model on a large-scale dataset with abundant tags, e.g., the Open Images training set with 6.4K classes of image-level tags. Unlike many existing VLP methods that rely on image-caption pairs, VIVO pre-training is conducted solely on image-tag pairs, which are much easier to collect by either human labeling or automatic tagging.
The training objective is to predict the missing (masked) tags given a bag of image-level tags and image regions. We denote the training set as $\mathcal{D} = \{I_i, G_i\}_{i=1}^{N}$, with $N$ images and their corresponding tags, where $G_i = \{g_{ij}\}_{j=1}^{L_i}$ is the set of $L_i$ image-level tags associated with image $I_i$. These tags are textual labels of the visual objects present in the image, e.g., "person", "cat", "dining table", etc. In the rest of the paper, we omit the subscript $i$ for simplicity.

We use a multi-layer Transformer model to learn a joint representation for both the vision and language domains. The input to the Transformer model consists of image region features $V$ and tag tokens $T$, where $V = \{v_k\}_{k=1}^{K}$ are extracted from image $I$ using a detector trained on the Visual Genome dataset (Anderson et al. 2018), and $T = \{t_j\}_{j=1}^{T}$ are the tokenized tags in $G$. During training, some tokens are randomly masked out for the model to predict.
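To make the masking step concrete, here is a rough sketch of how one pre-training example might be assembled; the [MASK] id, the masking rate, and the field names are illustrative assumptions rather than values taken from the paper.

```python
import random
import torch

MASK_ID = 103  # assumed: BERT's [MASK] token id

def build_pretraining_example(region_feats, tag_token_ids, mask_prob=0.15):
    """Randomly mask some tag tokens; the model must recover them from the
    image regions and the remaining (unordered) tags."""
    input_ids, target_ids, masked_pos = list(tag_token_ids), [], []
    for pos, tok in enumerate(tag_token_ids):
        if random.random() < mask_prob:
            input_ids[pos] = MASK_ID
            target_ids.append(tok)       # unordered set of target tokens
            masked_pos.append(pos)
    return {
        "region_feats": region_feats,                  # (K, 2054) float tensor
        "input_ids": torch.tensor(input_ids),          # tags with some positions masked
        "masked_positions": torch.tensor(masked_pos),
        "target_ids": torch.tensor(target_ids),        # consumed by the set-matching loss below
    }
```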
The main difference between a caption and a set of tags is that the words in a caption are ordered while the tags are not. This unordered nature can cause ambiguity in tag prediction when two tags are masked out simultaneously. For example, if the masked tokens are "dog" and "cat", either token may be predicted at either masked position, without being restricted to its original position in the input. To resolve this issue, we propose to use the Hungarian matching loss (Stewart, Andriluka, and Ng 2016; Carion et al. 2020) to formulate tag prediction as a set-matching problem.

We denote the set of $M$ masked tokens as $\tilde{T} = \{t_m\}_{m=1}^{M}$, where $t_m$ is a token id in the vocabulary, and the prediction probabilities of the corresponding representations in the final Transformer layer as $P = \{p_i\}_{i=1}^{M}$, where $p_i$ is the classification probability distribution for the $i$-th masked position. Since the target tokens in $\tilde{T}$ are unordered, we need a one-to-one mapping from $\tilde{T}$ to $P$ such that the prediction at each masked position is assigned one of the target tokens. Once such an assignment $\alpha$ is known, the loss is defined as:

$$\mathcal{L}(\tilde{T}, P, \alpha) = \sum_{i=1}^{M} -\log\big(p_i(t_{\alpha(i)})\big) \qquad (1)$$
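Equation (1) presumes an assignment $\alpha$; in the cited works this is the permutation that minimizes the total cost, found with the Hungarian algorithm. Below is a minimal sketch of the loss using scipy.optimize.linear_sum_assignment; treating the assignment as a constant during back-propagation is our assumption about the implementation, which this excerpt does not spell out.

```python
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment

def hungarian_tag_loss(masked_logits, target_ids):
    """masked_logits: (M, vocab_size) predictions at the M masked positions.
    target_ids:    (M,) unordered ground-truth token ids (the set T~).
    Returns the Eq. (1) loss under the minimum-cost assignment alpha."""
    log_probs = F.log_softmax(masked_logits, dim=-1)          # log p_i over the vocabulary
    targets = target_ids.to(masked_logits.device)
    # cost[i, j] = -log p_i(t_j): cost of assigning target token j to masked position i
    cost = -log_probs[:, targets]
    _, alpha = linear_sum_assignment(cost.detach().cpu().numpy())
    alpha = torch.as_tensor(alpha, device=masked_logits.device)
    idx = torch.arange(len(alpha), device=masked_logits.device)
    return -log_probs[idx, targets[alpha]].sum()              # Eq. (1)
```

Gradients flow through log_probs only; the assignment itself is recomputed for every batch and treated as a constant.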
As shown in Figure 2(a), we use a bi-directional attention mask in VIVO pre-training. In order to predict a missing tag, the model has to resort to the image region features and the other tags, so it learns a joint representation that contains information from both image regions and textual tags. This facilitates the cross-modality alignment between the representations of image regions and tags.

Fine-tuning and Inference

After pre-training, the Transformer model is fine-tuned on a dataset where both captions and tags are available, e.g., the COCO training set annotated with captions and with tags from 80 object classes. The tags can also be generated automatically using a pre-trained tagging or detection model. Given image regions and tags, the model learns to predict the caption sentence conditioned on them, with some positions randomly masked out. More specifically, the input to the model during fine-tuning is a triplet of image region features $V$, a set of tags $T$, and a caption $C$, where $V$ and $T$ are constructed in the same way as in pre-training, and $C$ is a sequence of tokens. During fine-tuning, we randomly mask out some of the tokens in the caption for prediction and optimize the model parameters with the cross-entropy loss. To make the model generate captions from left to right at inference time, during fine-tuning we apply a uni-directional attention mask to the caption sequence to prevent each position from attending to subsequent positions.
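One possible realization of these attention patterns is a block-structured mask over an input ordered as [caption; tags; regions]: caption positions attend causally among themselves, while tag and region positions attend bi-directionally. The ordering, the boolean convention (True = blocked), and leaving tags and regions free to attend back to the caption are assumptions for illustration; the text above only specifies that caption positions must not attend to subsequent caption positions.

```python
import torch

def build_finetune_attention_mask(num_caption, num_tag, num_region):
    """Boolean attention mask (True = position j is NOT visible to position i).
    Caption tokens are causal; tags and regions are fully bi-directional."""
    n = num_caption + num_tag + num_region
    mask = torch.zeros(n, n, dtype=torch.bool)
    # caption-to-caption: block attention to subsequent caption positions
    causal = torch.triu(torch.ones(num_caption, num_caption, dtype=torch.bool), diagonal=1)
    mask[:num_caption, :num_caption] = causal
    return mask

# In VIVO pre-training there is no caption, so the mask degenerates to all-False,
# i.e. fully bi-directional attention over tags and regions (Figure 2(a)).
pretrain_mask = build_finetune_attention_mask(0, num_tag=15, num_region=50)
```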
During inference, we first extract image region features and detect tags from a given image. The model is then applied to generate a sequence, one token at a time, until it outputs the end-of-sentence token or reaches the maximum length. At each step the model is auto-regressive, consuming the previously generated tokens as additional input when generating the next one.
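A greedy version of this procedure can be sketched as follows; the BERT-style special-token ids, the [MASK]-placeholder decoding scheme (suggested by the example in Figure 2(c)), the omission of the attention mask, and the use of greedy rather than beam or constrained search are assumptions made for brevity.

```python
import torch

CLS_ID, SEP_ID, MASK_ID = 101, 102, 103   # assumed BERT-style special token ids

@torch.no_grad()
def generate_caption(model, region_feats, tag_ids, max_len=20):
    """Greedy auto-regressive decoding: at each step, feed the caption so far
    (plus a [MASK] placeholder), the tags, and the region features, and keep
    the most likely token predicted at the masked position."""
    caption = [CLS_ID]
    for _ in range(max_len):
        token_ids = torch.tensor([caption + [MASK_ID] + tag_ids])    # (1, L)
        logits = model(region_feats.unsqueeze(0), token_ids)         # (1, L, vocab)
        next_id = logits[0, len(caption)].argmax().item()            # prediction at [MASK]
        if next_id == SEP_ID:
            break
        caption.append(next_id)
    return caption[1:]   # drop [CLS]
```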
In the next section, we present extensive experimental results, showing that our model can generate captions that describe novel objects, and that the alignment between image regions and tags learned from VIVO pre-training is crucial to the model's superior performance on NOC.
[Figure 3 diagram (model architecture): a Transformer Encoder (×N; Feed Forward, Add & Norm) followed by Linear and Softmax layers; "Estimate cosine similarity between region feature and tag"; example tag "hamburger".]

[...] features, which are concatenated with scaled bounding boxes to form a 2054-dimension vector (2048D for the visual features and 6D for the bounding box encoding, including the top-left and bottom-right corners as well as the box's width and height). We use an object detector trained on the Open Images dataset to detect object tags for all datasets. For pre-training and fine-tuning, we also add the ground-truth tags from the training sets. No ground-truth tags are used on the nocaps validation and test sets. The Transformer model is initialized using BERT-base (Devlin et al. 2018), where we [...]
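As a concrete reading of the 2054-D region representation described above, the sketch below concatenates a 2048-D detector feature with a 6-D scaled box encoding (corners, width, height); normalizing the coordinates by the image width and height is our assumption about the scaling.

```python
import torch

def encode_region(visual_feat, box, img_w, img_h):
    """visual_feat: (2048,) detector feature for one region.
    box: (x1, y1, x2, y2) in pixels. Returns a (2054,) region feature:
    2048-D appearance + 6-D scaled box (corners, width, height)."""
    x1, y1, x2, y2 = box
    box_enc = torch.tensor([x1 / img_w, y1 / img_h,
                            x2 / img_w, y2 / img_h,
                            (x2 - x1) / img_w, (y2 - y1) / img_h],
                           dtype=visual_feat.dtype)
    return torch.cat([visual_feat, box_enc])   # (2054,)
```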
method                        | in-domain    | near-domain  | out-of-domain | overall
------------------------------|--------------|--------------|---------------|-------------
Validation Set               |              |              |               |
UpDown (Agrawal et al. 2019)  | 78.1 / 11.6  | 57.7 / 10.3  | 31.3 / 8.3    | 55.3 / 10.1
UpDown + CBS                  | 80.0 / 12.0  | 73.6 / 11.3  | 66.4 / 9.7    | 73.1 / 11.1
UpDown + ELMo + CBS           | 79.3 / 12.4  | 73.8 / 11.4  | 71.7 / 9.9    | 74.3 / 11.2
OSCAR (Li et al. 2020)        | 79.6 / 12.3  | 66.1 / 11.5  | 45.3 / 9.7    | 63.8 / 11.2
OSCAR + CBS                   | 80.0 / 12.1  | 80.4 / 12.2  | 75.3 / 10.6   | 79.3 / 11.9
OSCAR + SCST + CBS            | 83.4 / 12.0  | 81.6 / 12.0  | 77.6 / 10.6   | 81.1 / 11.7
VIVO                          | 88.8 / 12.9  | 83.2 / 12.6  | 71.1 / 10.6   | 81.5 / 12.2
VIVO + CBS                    | 90.4 / 13.0  | 84.9 / 12.5  | 83.0 / 10.7   | 85.3 / 12.2
VIVO + SCST + CBS             | 92.2 / 12.9  | 87.8 / 12.6  | 87.5 / 11.5   | 88.3 / 12.4
Human                         | 84.4 / 14.3  | 85.0 / 14.3  | 95.7 / 14.0   | 87.1 / 14.2
Test Set                      |              |              |               |
VIVO + SCST + CBS             | 89.0 / 12.9  | 87.8 / 12.6  | 80.1 / 11.1   | 86.6 / 12.4
Human                         | 80.6 / 15.0  | 84.6 / 14.7  | 91.6 / 14.2   | 85.3 / 14.6

Table 1: Evaluation on the nocaps validation and test sets. Each cell reports CIDEr / SPICE.
[Figure: qualitative caption examples (B vs. V). B: "a large piece of art is displayed on the beach" / V: "a turtle that is laying down on the beach"; B: "a group of four colored light up in the night sky" / V: "a bunch of red lantern lights on a street"; B: "a close up of a fruit with leaves" / V: "a close up of a peach on a tree branch".]
Acknowledgements

We thank Jianfeng Wang, Ehsan Azarnasab, Lin Liang, Pengchuan Zhang, Xiujun Li, Chunyuan Li, Jianwei Yang, Yu Wang, Houdong Hu, Furu Wei, and Dong Li for valuable discussions and comments.

References

Agrawal, H.; Desai, K.; Wang, Y.; Chen, X.; Jain, R.; Johnson, M.; Batra, D.; Parikh, D.; Lee, S.; and Anderson, P. 2019. nocaps: novel object captioning at scale. In ICCV.
Anderson, P.; Fernando, B.; Johnson, M.; and Gould, S. 2017. Guided open vocabulary image captioning with constrained beam search. In EMNLP.
Anderson, P.; He, X.; Buehler, C.; Teney, D.; Johnson, M.; Gould, S.; and Zhang, L. 2018. Bottom-up and top-down attention for image captioning and visual question answering. In CVPR.
Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; and Zagoruyko, S. 2020. End-to-End Object Detection with Transformers. In ECCV.
Chen, C.; Mu, S.; Xiao, W.; Ye, Z.; Wu, L.; and Ju, Q. 2019. Improving image captioning with conditional generative adversarial nets. In AAAI.
Chen, X.; Fang, H.; Lin, T.-Y.; Vedantam, R.; Gupta, S.; Dollár, P.; and Zitnick, C. L. 2015. Microsoft COCO Captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325.
Chen, Y.-C.; Li, L.; Yu, L.; Kholy, A. E.; Ahmed, F.; Gan, Z.; Cheng, Y.; and Liu, J. 2020. UNITER: Learning universal image-text representations. In ECCV.
Cornia, M.; Stefanini, M.; Baraldi, L.; and Cucchiara, R. 2020. Meshed-Memory Transformer for Image Captioning. In CVPR.
Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; and Fei-Fei, L. 2009. ImageNet: A large-scale hierarchical image database. In CVPR.
Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL.
Dognin, P.; Melnyk, I.; Mroueh, Y.; Ross, J.; and Sercu, T. 2019. Adversarial semantic alignment for improved image captions. In CVPR.
Fang, H.; Gupta, S.; Iandola, F.; Srivastava, R. K.; Deng, L.; Dollár, P.; Gao, J.; He, X.; Mitchell, M.; Platt, J. C.; et al. 2015. From captions to visual concepts and back. In CVPR.
Farhadi, A.; Hejrati, M.; Sadeghi, M. A.; Young, P.; Rashtchian, C.; Hockenmaier, J.; and Forsyth, D. 2010. Every picture tells a story: Generating sentences from images. In ECCV.
Gao, L.; Fan, K.; Song, J.; Liu, X.; Xu, X.; and Shen, H. T. 2019. Deliberate attention networks for image captioning. In AAAI.
Guo, L.; Liu, J.; Zhu, X.; Yao, P.; Lu, S.; and Lu, H. 2020. Normalized and Geometry-Aware Self-Attention Network for Image Captioning. In CVPR.
He, K.; Gkioxari, G.; Dollár, P.; and Girshick, R. 2017. Mask R-CNN. In ICCV.
Hendricks, L. A.; Venugopalan, S.; Rohrbach, M.; Mooney, R.; Saenko, K.; and Darrell, T. 2016. Deep compositional captioning: Describing novel object categories without paired training data. In CVPR.
Huang, L.; Wang, W.; Chen, J.; and Wei, X.-Y. 2019. Attention on attention for image captioning. In ICCV.
Johnson, J.; Karpathy, A.; and Fei-Fei, L. 2016. DenseCap: Fully convolutional localization networks for dense captioning. In CVPR.
Karpathy, A.; and Fei-Fei, L. 2015. Deep visual-semantic alignments for generating image descriptions. In CVPR.
Kulkarni, G.; Premraj, V.; Ordonez, V.; Dhar, S.; Li, S.; Choi, Y.; Berg, A. C.; and Berg, T. L. 2013. BabyTalk: Understanding and generating simple image descriptions. IEEE Transactions on Pattern Analysis and Machine Intelligence 35(12): 2891–2903.
Kuznetsova, A.; Rom, H.; Alldrin, N.; Uijlings, J.; Krasin, I.; Pont-Tuset, J.; Kamali, S.; Popov, S.; Malloci, M.; Duerig, T.; et al. 2020. The Open Images Dataset V4: Unified image classification, object detection, and visual relationship detection at scale. IJCV.
Kuznetsova, P.; Ordonez, V.; Berg, A.; Berg, T.; and Choi, Y. 2012. Collective generation of natural image descriptions. In ACL.
Li, N.; Chen, Z.; and Liu, S. 2019. Meta learning for image captioning. In AAAI.
Li, X.; Jiang, S.; and Han, J. 2019. Learning Object Context for Dense Captioning. In AAAI.
Li, X.; Yin, X.; Li, C.; Hu, X.; Zhang, P.; Zhang, L.; Wang, L.; Hu, H.; Dong, L.; Wei, F.; Choi, Y.; and Gao, J. 2020. Oscar: Object-semantics aligned pre-training for vision-language tasks. In ECCV.
Lu, J.; Batra, D.; Parikh, D.; and Lee, S. 2019. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In NeurIPS.
Lu, J.; Yang, J.; Batra, D.; and Parikh, D. 2018. Neural baby talk. In CVPR.
Ma, C.-Y.; Kalantidis, Y.; AlRegib, G.; Vajda, P.; Rohrbach, M.; and Kira, Z. 2020. Learning to Generate Grounded Visual Captions without Localization Supervision. In ECCV.
Mitchell, M.; Dodge, J.; Goyal, A.; Yamaguchi, K.; Stratos, K.; Han, X.; Mensch, A.; Berg, A.; Berg, T.; and Daumé III, H. 2012. Midge: Generating image descriptions from computer vision detections. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics.
Pan, Y.; Yao, T.; Li, Y.; and Mei, T. 2020. X-Linear Attention Networks for Image Captioning. In CVPR.
Radford, A. 2018. Improving Language Understanding by Generative Pre-Training.
Ren, S.; He, K.; Girshick, R.; and Sun, J. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. In NeurIPS.
Rennie, S. J.; Marcheret, E.; Mroueh, Y.; Ross, J.; and Goel, V. 2017. Self-critical sequence training for image captioning. In CVPR.
Shao, S.; Li, Z.; Zhang, T.; Peng, C.; Yu, G.; Zhang, X.; Li, J.; and Sun, J. 2019. Objects365: A large-scale, high-quality dataset for object detection. In ICCV.
Sharma, P.; Ding, N.; Goodman, S.; and Soricut, R. 2018. Conceptual Captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In ACL.
Sidorov, O.; Hu, R.; Rohrbach, M.; and Singh, A. 2020. TextCaps: A Dataset for Image Captioning with Reading Comprehension. In ECCV.
Song, L.; Liu, J.; Qian, B.; and Chen, Y. 2019. Connecting Language to Images: A Progressive Attention-Guided Network for Simultaneous Image Captioning and Language Grounding. In AAAI.
Stewart, R.; Andriluka, M.; and Ng, A. Y. 2016. End-to-end people detection in crowded scenes. In CVPR.
Su, W.; Zhu, X.; Cao, Y.; Li, B.; Lu, L.; Wei, F.; and Dai, J. 2019. VL-BERT: Pre-training of Generic Visual-Linguistic Representations. In ICLR.
Tan, H.; and Bansal, M. 2019. LXMERT: Learning Cross-Modality Encoder Representations from Transformers. In EMNLP.
Tran, K.; He, X.; Zhang, L.; Sun, J.; Carapcea, C.; Thrasher, C.; Buehler, C.; and Sienkiewicz, C. 2016. Rich image captioning in the wild. In CVPR.
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. In NeurIPS.
Venugopalan, S.; Anne Hendricks, L.; Rohrbach, M.; Mooney, R.; Darrell, T.; and Saenko, K. 2017. Captioning images with diverse objects. In CVPR.
Wang, W.; Chen, Z.; and Hu, H. 2019. Hierarchical attention network for image captioning. In AAAI.
Wu, Y.; Zhu, L.; Jiang, L.; and Yang, Y. 2018. Decoupled novel object captioner. In ACM Multimedia.
Yang, X.; Zhang, H.; Jin, D.; Liu, Y.; Wu, C.-H.; Tan, J.; Xie, D.; Wang, J.; and Wang, X. 2020. Fashion Captioning: Towards Generating Accurate Descriptions with Semantic Rewards. In ECCV.
Yang, Y.; Teo, C.; Daumé III, H.; and Aloimonos, Y. 2011. Corpus-guided sentence generation of natural images. In EMNLP.
Yao, T.; Pan, Y.; Li, Y.; and Mei, T. 2017. Incorporating copying mechanism in image captioning for learning novel objects. In CVPR.
Yin, G.; Sheng, L.; Liu, B.; Yu, N.; Wang, X.; and Shao, J. 2019. Context and attribute grounded dense captioning. In CVPR.
Young, P.; Lai, A.; Hodosh, M.; and Hockenmaier, J. 2014. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics 2: 67–78.
Zhou, L.; Palangi, H.; Zhang, L.; Hu, H.; Corso, J. J.; and Gao, J. 2020a. Unified Vision-Language Pre-Training for Image Captioning and VQA. In AAAI.
Zhou, Y.; Wang, M.; Liu, D.; Hu, Z.; and Zhang, H. 2020b. More Grounded Image Captioning by Distilling Image-Text Matching Model. In CVPR.