
Illustrative Language Understanding: Large-Scale Visual Grounding With Image Search



Jamie Ryan Kiros* William Chan* Geoffrey E. Hinton

Google Brain Toronto


{kiros, williamchan, geoffhinton}@google.com

* Both authors contributed equally to this work.

Abstract

We introduce Picturebook, a large-scale lookup operation to ground language via ‘snapshots’ of our physical world accessed through image search. For each word in a vocabulary, we extract the top-k images from Google image search and feed the images through a convolutional network to extract a word embedding. We introduce a multimodal gating function to fuse our Picturebook embeddings with other word representations. We also introduce Inverse Picturebook, a mechanism to map a Picturebook embedding back into words. We experiment and report results across a wide range of tasks: word similarity, natural language inference, semantic relatedness, sentiment/topic classification, image-sentence ranking and machine translation. We also show that gate activations corresponding to Picturebook embeddings are highly correlated to human judgments of concreteness ratings.

1 Introduction

Constructing grounded representations of natural language is a promising step towards achieving human-like language learning. In recent years, a large amount of research has focused on integrating vision and language to obtain visually grounded word and sentence representations. One source of grounding, which has been utilized in existing work, is image search engines. Search engines allow us to obtain correspondences between language and images that are far less restricted than existing multimodal datasets, which typically have restricted vocabularies. While true natural language understanding may require fully embodied cognition, search engines allow us to get a form of quasi-grounding from high-coverage ‘snapshots’ of our physical world provided by the interaction of millions of users.

One place to incorporate grounding is in the lookup table that maps tokens to vectors. The dominant approach to learning distributed word representations is through indexing a learned matrix. While immensely successful, this lookup operation is typically learned through co-occurrence objectives or a task-dependent reward signal. A very different way to obtain word embeddings is to aggregate features obtained by using the word as a query for an image search engine. This involves retrieving the top-k images from a search engine, running those through a convolutional network and aggregating the results. These word embeddings are grounded via the retrieved images. While several authors have considered this approach, it has been largely limited to a few thousand queries and only a small number of tasks.

In this paper we introduce Picturebook embeddings produced by image search using words as queries. Picturebook embeddings are obtained through a convolutional network trained with a semantic ranking objective on a proprietary image dataset with over 100 million images (Wang et al., 2014). Using Google image search, a Picturebook embedding for a word is obtained by concatenating the k feature vectors of our convolutional network on the top-k retrieved search results. The main contributions of our work are as follows:

• We obtain Picturebook embeddings for the 2.2 million words that occur in the Glove vocabulary (Pennington et al., 2014; Common Crawl, 840B tokens), allowing each word to have a Glove embedding and a parallel grounded word representation. This collection of word representations that we visually

Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Long Papers), pages 922–933
Melbourne, Australia, July 15 - 20, 2018. c 2018 Association for Computational Linguistics
ground via image search is 2-3 orders of magnitude larger than prior work.

• We introduce a multimodal gating mechanism to selectively choose between Glove and Picturebook embeddings in a task-dependent way. We apply our approach to over a dozen datasets and several different tasks: word similarity, sentence relatedness, natural language inference, topic/sentiment classification, image sentence ranking and Machine Translation (MT).

• We introduce Inverse Picturebook to perform the inverse lookup operation. Given a Picturebook embedding, we find the closest words which would generate the embedding. This is useful for generative modelling tasks.

• We perform an extensive analysis of our gating mechanism, showing that the gate activations for Picturebook embeddings are highly correlated with human judgments of concreteness. We also show that Picturebook gate activations are negatively correlated with image dispersion (Kiela et al., 2014), indicating that our model selectively chooses between word embeddings based on their abstraction level.

• We highlight the importance of the convolutional network used to extract embeddings. In particular, networks trained with semantic labels result in better embeddings than those trained with visual labels, even when evaluating similarity on concrete words.

2 Related Work

The use of image search for obtaining word representations is not new. Table 1 illustrates existing methods that utilize image search and the tasks considered in their work. There has also been other work using other image sources such as ImageNet (Kiela and Bottou, 2014; Collell and Moens, 2016) over the WordNet synset vocabulary, and using Flickr photos and captions (Joulin et al., 2016). Our approach differs from the above methods in three main ways: a) we obtain search-grounded representations for over 2 million words as opposed to a few thousand, b) we apply our representations to a higher diversity of tasks than previously considered, and c) we introduce a multimodal gating mechanism that allows for a more flexible integration of features than mere concatenation.

Method                          Tasks
(Bergsma and Durme, 2011)       bilingual lexicons
(Bergsma and Goebel, 2011)      lexical preference
(Kiela et al., 2014)            word similarity
(Kiela et al., 2015a)           lexical entailment detection
(Kiela et al., 2015b)           bilingual lexicons
(Shutova et al., 2016)          metaphor identification
(Bulat et al., 2015)            predicting property norms
(Kiela, 2016)                   toolbox
(Vulic et al., 2016)            bilingual lexicons
(Kiela et al., 2016)            word similarity
(Anderson et al., 2017)         decoding brain activity
(Glavas et al., 2017)           semantic text similarity
(Bhaskar et al., 2017)          abstract vs concrete nouns
(Hartmann and Sogaard, 2017)    bilingual lexicons
(Bulat et al., 2017)            decoding brain activity

Table 1: Existing methods that use image search for grounding and their corresponding tasks.

Our work also relates to existing multimodal models combining different representations of the data (Hill and Korhonen, 2014). Various work has also fused text-based representations with image-based representations (Bruni et al., 2014; Lazaridou et al., 2015; Chrupala et al., 2015; Mao et al., 2016; Silberer et al., 2017; Kiela et al., 2017; Collell et al., 2017; Zablocki et al., 2018) and representations derived from a knowledge-graph (Thoma et al., 2017). More recently, gating-based approaches have been developed for fusing traditional word embeddings with visual representations. Arevalo et al. (2017) introduce a gating mechanism inspired by the LSTM while Kiela et al. (2018) describe an asymmetric gate that allows one modality to ‘attend’ to the other. The work that most closely matches ours is that of Wang et al. (2018) who also consider fusing Glove embeddings with visual features. However, their analysis is restricted to word similarity tasks and they require text-to-image regression to obtain visual embeddings for unseen words, due to the use of ImageNet. The use of image search allows us to obtain visual embeddings for a virtually unlimited vocabulary without needing a mapping function.

3 Picturebook Embeddings

Our Picturebook embeddings ground language using the ‘snapshots’ returned by an image search engine. Given a word (or phrase), we image search for the top-k images and extract the images. We then pass each image through a CNN trained with a semantic ranking objective to extract its embedding. Our Picturebook embeddings reflect the search rankings by concatenating the individual embeddings in the order of the search results. We can perform all of these operations offline to construct a matrix $E_p$ representing the Picturebook

embeddings over a vocabulary.

3.1 Inducing Picturebook Embeddings

The convolutional network used to obtain Picturebook embeddings is based on Wang et al. (2014). Let $p_i$, $p_i^+$, $p_i^-$ denote a triplet of query, positive and negative images, respectively. We define the following hinge loss for a given triplet:

$$l(p_i, p_i^+, p_i^-) = \max\{0,\ g + D(f(p_i), f(p_i^+)) - D(f(p_i), f(p_i^-))\} \qquad (1)$$

where $f(p_i)$ represents the embedding of image $p_i$, $D(\cdot, \cdot)$ is the Euclidean distance and $g$ is a margin (gap) hyperparameter. Suppose we have available pairwise relevance scores $r_{i,j} = r(p_i, p_j)$ indicating the similarity of images $p_i$ and $p_j$. The objective function that is optimized is given by:

$$\min \sum_i \xi_i + \lambda \|W\|_2^2 \quad \text{s.t.} \quad l(p_i, p_i^+, p_i^-) \le \xi_i \quad \forall\, p_i, p_i^+, p_i^- \ \text{such that}\ r(p_i, p_i^+) > r(p_i, p_i^-) \qquad (2)$$

where $\xi_i$ are slack variables and $W$ is a vector of the network’s model parameters. The model is trained end-to-end using a proprietary dataset with over 100 million images. We refer the reader to Wang et al. (2014) for additional details of training, including the specifics of the architecture used.

After the model is trained, we can use the convolutional network as a feature extractor for images by computing an embedding vector $f(p)$ for an image $p$. Suppose we would like to obtain a Picturebook embedding for a given word $w$. We first perform an image search with query $w$ to obtain a ranked list of images $p_1^w, \ldots, p_k^w$. The Picturebook embedding for a word $w$ is then represented as:

$$e_p(w) = [f(p_1^w); f(p_2^w); \ldots; f(p_k^w)] \qquad (3)$$

namely, the concatenation of the feature vectors in ranked order. In our model, each image embedding is a 64-dimensional vector, so the final Picturebook embedding has $64 \times k$ dimensions. Most of our experiments use $k = 10$ images, resulting in a word embedding size of 640. To obtain the full collection of embeddings, we run the full Glove vocabulary (2.2M words) through image search to obtain a corresponding Picturebook embedding for each word in the Glove vocabulary.

3.2 Visual vs Semantic Similarity

The training procedure is heavily influenced by the choice of similarity function $r_{i,j}$. We consider two types of image similarity: visual and semantic. As an example, an image of a blue car would have high visual similarity to other blue cars but would have higher semantic similarity to cars of the same make, independent of color. In our experiments we consider two types of Picturebook embedding: one trained through optimizing for visual similarity and another for semantic similarity. As we will show in our experiments, the semantic Picturebook embeddings result in representations that are more useful for natural language processing tasks than the visual embeddings.

3.3 Multimodal Fusion Gating

Picturebook embeddings on their own are likely to be useful for representing concrete words but it is not clear whether they will be of benefit for abstract words. Consequently, we would like to fuse our Picturebook embeddings with other sources of information, for example Glove embeddings (Pennington et al., 2014) or randomly initialized embeddings that will be trained. Let $e_g = e_g(w)$ be our other embedding (e.g., Glove) for a word $w$ and $e_p = e_p(w)$ be our Picturebook embedding. We fuse our embeddings using a multimodal gating mechanism:

$$g = \sigma(e_g, e_p) \qquad (4)$$

$$e = g \odot \phi(e_g) + (1 - g) \odot \psi(e_p) \qquad (5)$$

where $\sigma$ is a 1 hidden layer DNN with ReLU activations and sigmoid outputs, and $\phi$ and $\psi$ are 1 hidden layer DNNs with ReLU activations and tanh outputs. The gating DNN allows the model to learn how visual a word is as a function of its inputs $e_p$ and $e_g$. Similar gating mechanisms can be found in LSTMs (Hochreiter and Schmidhuber, 1997) and other multimodal models (Arevalo et al., 2017; Wang et al., 2018; Kiela et al., 2018). On some experiments we found it beneficial to include a skip connection from the hidden layer of $\sigma$. We chose this form of fusion over other approaches, such as CCA variants and metric learning methods, to allow for easier interpretability and analysis. We leave comparison of alternative fusion strategies for future work.

3.4 Contextual Gating

The gating described above is non-contextual, in the sense that each embedding computes a gate

value independent of the context the words occur in. In some cases it may be beneficial to use contextual gates that are aware of the sentence that words appear in to decide how to weight Glove and Picturebook embeddings. For contextual gates, we use the same approach as above except we replace the controller $\sigma(e_g, e_p)$ with inputs that have been fed through a bidirectional-LSTM, e.g. $\sigma(\mathrm{BiLSTM}(e_g), \mathrm{BiLSTM}(e_p))$. We experiment with contextual gating for all experiments that use a bidirectional-LSTM encoder.

3.5 Inverse Picturebook

Picturebook embeddings can be seen as a form of implicit image search: given a word (or phrase), image search the word query and concatenate the embeddings of the images produced by a CNN. Up until now, we have only discussed scenarios where we have a word and we want to perform this implicit search operation. In generative modelling problems (e.g., MT), we want to perform the opposite operation. Given a Picturebook embedding, we want to find the closest word or phrase aligned to the representation. For example, given the word ‘bicycle’ in English and its Picturebook embedding, we want to find the closest French word that would generate this representation (e.g., ‘vélo’). We want to perform this inverse image search operation given its Picturebook embedding. We introduce a differentiable mechanism which allows us to align words across source and target languages in the Picturebook embedding domain.

Let $h$ be the internal representation of our model (e.g., the seq2seq decoder state), and $e_i$ be the $i$-th word embedding from our Picturebook embedding matrix $E_p$:

$$p(y_i \mid h) = \frac{\exp(\langle h, e_i \rangle)}{\sum_j \exp(\langle h, e_j \rangle)} \qquad (6)$$

Given a representation $h$, Equation 6 simply finds the most similar word in the embedding space. This can be easily implemented by setting the output softmax matrix as the transpose of the Picturebook embedding matrix $E_p$. In practice, we find adding additional parameters helps with learning:

$$p(y_i \mid h) = \frac{\exp(\langle h, e_i + e'_i \rangle + b_i)}{\sum_j \exp(\langle h, e_j + e'_j \rangle + b_j)} \qquad (7)$$

where $e'_i$ is a trainable weight vector per word and $b_i$ is a trainable bias per word. A similar technique to tie the softmax matrix as the transpose of the embedding matrix can be found in language modelling (Press and Wolf, 2017; Inan et al., 2017).

4 Experiments

To evaluate the effectiveness of our embeddings, we perform both quantitative and qualitative evaluation across a wide range of natural language processing tasks. Hyperparameter details of each experiment are included in the appendix. Since the use of Picturebook embeddings adds extra parameters to our models, we include a baseline for each experiment (either based on Glove or learned embeddings) that we extensively tune. In most experiments, we end up with baselines that are stronger than what has previously been reported.

4.1 Nearest neighbours

In order to get a sense of the representations our model learns, we first compute nearest neighbour results of several words, shown in Table 2. These results can be interpreted as follows: the words that appear as neighbours are those which have semantically similar images to that of the query. Often this captures visual similarity as well. Some words capture multimodality, such as ‘deep’ referring both to deep sea as well as to AI. Searching for cities returns cities which have visually similar characteristics. Words like ‘sun’ also return the corresponding word in different languages, such as ‘Sol’ in Spanish and ‘Soleil’ in French. Finally, it’s worth highlighting that the most frequent association of a word may not be what is represented in image search results. For example, the word ‘is’ returns words related to terrorists and ISIS and ‘it’ returns words related to scary and clowns due to the 2017 film of the same name. We also report nearest neighbour examples across languages in Appendix A.1.

4.2 Word similarity

Our first quantitative experiment aims to determine how well Picturebook embeddings capture word similarity. We use the SimLex-999 dataset (Hill et al., 2015) and report results across 9 categories: all (the whole evaluation), adjectives, nouns, verbs, concreteness quartiles and the hardest 333 pairs. For the concreteness quartiles, the first quartile corresponds to the most abstract words, while the last corresponds to the most concrete words. The hardest pairs are those for which similarity is difficult to distinguish from relatedness. This is an interesting category since image-based word embeddings are perhaps less likely to confuse similarity with relatedness than distributional-based methods. For Glove, scores

language           deep        network           Melbourne   association    sun         life         not
interdisciplinary  deepest     internet          Austin      inclusion      prominence  praising     Nosign
languages          deep-sea    cyberspace        Raleigh     committees     Sol         rejoicing    prohibited
literacy           manta       networks          Cincinnati  social         Soleil      freedom      Forbidden
sociology          depths      blueprints        Yokohama    groupe         Sole        glorifying   no
multilingual       Jarvis      connectivity      Cleveland   members        Venere      worshipping  no-fly
inclusion          cyber       interconnections  Tampa       participation  Marte       healed       forbid
communications     AI          blueprint         Pittsburgh  personnel      eclipses    praise       10
linguistics        hackers     AI                Boston      involvement    Venus       healing      prohibiting
values             restarting  interconnected    Rochester   staffing       eclipse     trust        forbidden
user-generated     diver       tech              Frankfurt   meetings       fireballs   happiness    Stop

Table 2: Nearest neighbours of words. The first row lists the query words; each column lists that query's neighbours. Results are retrieved over the 100K most frequent words.
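
Neighbour lists like those in Table 2 can be produced by ranking the vocabulary by similarity to a query vector. The following is a minimal sketch under stated assumptions: the paper does not say which similarity measure was used for Table 2, so cosine similarity is assumed here, and the toy vectors and the names `cosine` and `nearest_neighbours` are purely illustrative (real Picturebook vectors have 640 dimensions).

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_neighbours(query, embeddings, top_n=3):
    """Return the top_n words most similar to `query`, excluding the query itself."""
    q = embeddings[query]
    scored = [(cosine(q, vec), word)
              for word, vec in embeddings.items() if word != query]
    scored.sort(reverse=True)  # highest similarity first
    return [word for _, word in scored[:top_n]]

# Hypothetical 3-dimensional toy vectors, for illustration only.
toy = {
    "sun": [1.0, 0.1, 0.0],
    "Sol": [0.9, 0.2, 0.0],
    "eclipse": [0.7, 0.6, 0.1],
    "network": [0.0, 0.2, 1.0],
}
print(nearest_neighbours("sun", toy, top_n=2))  # ['Sol', 'eclipse']
```

In practice the candidate set would be restricted to the most frequent words, as the caption notes, before ranking.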

Model                    all   adjs  nouns  verbs  conc-q1  conc-q2  conc-q3  conc-q4  hard
Glove                    40.8  62.2  42.8   19.6   43.3     41.6     42.3     40.2     27.2
Picturebook              37.3  11.7  48.2   17.3   14.4     27.5     46.2     60.7     28.8
Glove + Picturebook      45.5  46.2  52.1   22.8   36.7     41.7     50.4     57.3     32.5
Picturebook (Visual)     31.3  11.1  38.8   20.4   13.9     26.1     38.7     47.7     23.9
Picturebook (Semantic)   37.3  11.7  48.2   17.3   14.4     27.5     46.2     60.7     28.8
Picturebook (1)          24.5  2.6   33.5   12.1   4.7      17.8     32.8     47.8     13.6
Picturebook (2)          28.4  6.5   38.9   9.0    5.0      21.3     34.3     55.1     15.7
Picturebook (3)          30.3  11.9  41.9   3.1    2.6      24.3     37.5     58.3     18.4
Picturebook (5)          34.4  6.8   44.5   18.0   9.0      27.9     42.8     58.3     25.9
Picturebook (10)         37.3  11.7  48.2   17.3   14.4     27.5     46.2     60.7     28.8

Table 3: SimLex-999 results (Spearman’s ρ). Best results overall are bolded. Best results per section are underlined. Bracketed numbers signify the number of images used. Some rows are copied across sections for ease of reading.
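
The Picturebook scores in Table 3 come from the rule described in Section 4.2: the score of a word pair is minus the smallest cosine distance over all pairs of their images, and the Glove + Picturebook row sums the two independent similarity scores. A minimal sketch follows. It assumes cosine distance is defined as one minus cosine similarity, and the function names and toy image vectors are hypothetical rather than taken from the authors' code.

```python
import math

def cosine_distance(u, v):
    """Cosine distance, taken here as 1 - cosine similarity (an assumption)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def picturebook_score(images_1, images_2):
    """s(w1, w2) = -min over all image pairs (i, j) of d(e_i, e_j)."""
    return -min(cosine_distance(a, b) for a in images_1 for b in images_2)

def combined_score(glove_1, glove_2, images_1, images_2):
    """Sum of the Glove cosine similarity and the Picturebook score."""
    glove_similarity = 1.0 - cosine_distance(glove_1, glove_2)
    return glove_similarity + picturebook_score(images_1, images_2)

# Hypothetical per-image feature vectors for two words (k = 2 images each).
word_a = [[1.0, 0.0], [0.0, 1.0]]
word_b = [[0.0, 1.0], [-1.0, 0.0]]
print(picturebook_score(word_a, word_b))
```

With a single image per word the minimum runs over one pair, so the score reduces to the negative cosine distance between the two image embeddings, as noted in the text.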

are computed via cosine similarity. For computing a score for a pair of words with Picturebook, we set $s(w^{(1)}, w^{(2)}) = -\min_{i,j} d(e_i^{(1)}, e_j^{(2)})$ [2]. That is, the score is minus the smallest cosine distance between all pairs of images of the two words. Note that this reduces to negative cosine distance when using only 1 image per word. We also report results combining Glove and Picturebook by summing their two independent similarity scores. By default, we use 10 images for each embedding using the semantic convolutional network.

Table 3 displays our results, from which several observations can be made. First, we observe that combining Glove and Picturebook leads to improved similarity across most categories. For adjectives and the most abstract category, Glove performs significantly better, while for the most concrete category Picturebook is significantly better. This result confirms that Glove and Picturebook capture very different properties of words. Next we observe that the performance of Picturebook gets progressively better across each concreteness quartile rating, with a 20 point improvement over Glove for the most concrete category. For the hardest subset of words, Picturebook performs slightly better than Glove while Glove performs better across all pairs. We also compare to a convolutional network trained with visual similarity. We observe a performance difference between our visual and semantic embeddings: on all categories except verbs, the semantic embeddings outperform visual ones, even on the most concrete categories. This indicates the importance of the type of similarity used for training the model. Finally we note that adding more images nearly consistently improves similarity scores across categories. Kiela et al. (2016) showed that after 10-20 images, performance tends to saturate. All subsequent experiments use 10 images with semantic Picturebook.

4.3 Sentential Inference and Relatedness

We next consider experiments on 3 pairwise prediction datasets: SNLI (Bowman et al., 2015), MultiNLI (Williams et al., 2017) and SICK (Marelli et al., 2014). The first two are natural language inference tasks and the third is a sentence semantic relatedness task. We explore the use of two types of sentential encoders: Bag-of-Words (BoW) and BiLSTM-Max (Conneau et al., 2017a).

[2] We found scoring all pairs of images to outperform scoring only the corresponding equally ranked image.

                                           SNLI          MultiNLI            SICK Relatedness
Model                                      dev    test   dev-mat  dev-mis    test-p  test-s  test-mse
Glove (bow)                                85.2   84.2   70.5     69.9       86.8    79.8    25.2
Picturebook (bow)                          84.0   83.8   67.9     67.1       85.8    79.3    27.0
Glove + Picturebook (bow)                  86.2   85.2   71.3     70.9       87.2    80.9    24.4
BiLSTM-Max (Conneau et al., 2017a)         85.0   84.5   -        -          -       -       -
Glove                                      86.8   86.3   74.1     74.5       -       -       -
Picturebook                                85.2   85.1   70.7     70.3       -       -       -
Glove + Picturebook                        86.7   86.1   73.7     73.7       -       -       -
Glove + Picturebook + Contextual Gating    86.9   86.5   74.2     74.4       -       -       -

Table 4: Classification accuracies are reported for SNLI and MultiNLI. For SICK we report Pearson, Spearman and MSE. Higher is better for all metrics except MSE. Best results overall per column are bolded. Best results per section are underlined.
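
The Glove + Picturebook rows in Table 4 rely on the multimodal gate of Section 3.3, which computes a gate g from both embeddings and mixes two transformed embeddings as e = g ⊙ φ(e_g) + (1 − g) ⊙ ψ(e_p). The sketch below is one possible reading, not the authors' implementation: it assumes the gate network receives the concatenation of the two embeddings, that the gate is a vector applied elementwise, and that weights are drawn uniformly at random. All dimensions and helper names are hypothetical.

```python
import math
import random

def dense(x, weights, bias):
    """Affine map: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def init_mlp(n_in, n_hidden, n_out, rng):
    """Random weights for a 1-hidden-layer network (initialisation is an assumption)."""
    def layer(rows, cols):
        weights = [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]
        return weights, [0.0] * rows
    return layer(n_hidden, n_in), layer(n_out, n_hidden)

def mlp(x, params, out_fn):
    """1 hidden layer with ReLU, followed by an output nonlinearity."""
    (w1, b1), (w2, b2) = params
    hidden = [max(0.0, v) for v in dense(x, w1, b1)]
    return [out_fn(v) for v in dense(hidden, w2, b2)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fuse(e_g, e_p, sigma, phi, psi):
    """Return (gate, fused embedding) for one word."""
    gate = mlp(e_g + e_p, sigma, sigmoid)   # list concatenation of both inputs
    a = mlp(e_g, phi, math.tanh)            # transformed Glove-side embedding
    b = mlp(e_p, psi, math.tanh)            # transformed Picturebook embedding
    fused = [g * x + (1.0 - g) * y for g, x, y in zip(gate, a, b)]
    return gate, fused

rng = random.Random(0)
d_g, d_p, d_hidden, d_out = 3, 4, 5, 2      # toy sizes, not the paper's
sigma = init_mlp(d_g + d_p, d_hidden, d_out, rng)
phi = init_mlp(d_g, d_hidden, d_out, rng)
psi = init_mlp(d_p, d_hidden, d_out, rng)
gate, fused = fuse([0.2, -0.1, 0.4], [0.5, 0.0, -0.3, 0.1], sigma, phi, psi)
print(gate, fused)
```

Because the gate lies strictly between 0 and 1 and both branches end in tanh, every fused coordinate is a convex combination of two values in (-1, 1). The mean gate value per word is the quantity examined in the gate analysis of Section 4.8.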

Three sets of features are used: Glove only, Picturebook only and Glove + Picturebook. For the latter, we use multimodal gating for all encoders and contextual gating in the BiLSTM-Max model. For SICK, we follow previous work and report average results across 5 runs (Tai et al., 2015). Due to the small size of the dataset, we only experiment with BoW on SICK. The full details of hyperparameters are discussed in Appendix B.

Table 4 displays our results. For BoW models, adding Picturebook embeddings to Glove results in significant gains across all three tasks. For BiLSTM-Max, our contextual gating sets a new state-of-the-art on SNLI sentence encoding methods (methods without interaction layers), outperforming the recently proposed methods of Im and Cho (2017) and Shen et al. (2018). It is worth noting the effect that different encoders have when using our embeddings. While non-contextual gating is sufficient to improve bag-of-words methods, with BiLSTM-Max it slightly hurts performance over the Glove baseline. Adding contextual gating was necessary to improve over the Glove baseline on SNLI. Finally we note the strength of our own Glove baseline over the reported results of Conneau et al. (2017a): we improve on their accuracy from 85.0 to 86.8 on the development set. [3]

4.4 Sentiment and Topic Classification

Our next set of experiments aims to determine how well Picturebook embeddings do on tasks that are primarily non-visual, such as topic and sentiment classification. We experiment with 7 datasets provided by Zhang et al. (2015) and compare bag-of-words models against n-gram baselines provided by the authors as well as fastText (Joulin et al., 2017). Hyperparameter details are reported in Appendix B.

Our experimental results are provided in Table 5. Perhaps unsurprisingly, adding Picturebook to Glove matches or only slightly improves on 5 out of 7 tasks and obtains a lower result on AG News and Yahoo. Our results show that Picturebook embeddings, while minimally aiding in performance, can perform reasonably well on their own, outperforming the n-gram baselines of Zhang et al. (2015) on 5 out of 7 tasks and the unigram fastText baseline on all 7 tasks. This result shows that our embeddings are able to work as a general text embedding, though they typically lag behind Glove. We note that the best performing methods on these tasks are based on convolutional neural networks (Conneau et al., 2017b).

4.5 Image-Sentence Ranking

We next consider experiments that map images and sentences into a common vector space for retrieval. Here, we utilize VSE++ (Faghri et al., 2017) as our base model and evaluate on the COCO dataset (Lin et al., 2014). VSE++ improves over the original CNN-LSTM embedding method of Kiros et al. (2015a) by using hard negatives instead of summing over contrastive examples. We re-implement their model with 2 modifications: 1) we replace the unidirectional LSTM encoder with a BiLSTM-Max sentence encoder and 2) we use Inception-V3 (Szegedy et al., 2016) as our CNN instead of ResNet 152 (He et al., 2016). As in previous work, we report the mean Recall@K (R@K) and the median rank over 1000 images and 5000 sentences. Full details of the hyperparameters are in Appendix B.

Table 6 displays our results on this task.

[3] All reported results on SNLI are available at https://nlp.stanford.edu/projects/snli/

Model                                  AG    DBP   Yelp P.  Yelp F.  Yah. A.  Amz. F.  Amz. P.
BoW (Zhang et al., 2015)               88.8  96.6  92.2     58.0     68.9     54.6     90.4
ngrams (Zhang et al., 2015)            92.0  98.6  95.6     56.3     68.5     54.3     92.0
ngrams TFIDF (Zhang et al., 2015)      92.4  98.7  95.4     54.8     68.5     52.4     91.5
fastText (Joulin et al., 2017)         91.5  98.1  93.8     60.4     72.0     55.8     91.2
fastText-bigram (Joulin et al., 2017)  92.5  98.6  95.7     63.9     72.3     60.2     94.6
Glove (bow)                            94.0  98.6  94.4     61.7     74.1     58.5     93.2
Picturebook (bow)                      92.8  98.5  94.4     61.6     73.3     57.8     92.9
Glove + Picturebook (bow)              93.9  98.6  94.5     61.9     73.8     58.7     93.2

Table 5: Test accuracy [%] on topic and sentiment classification datasets. Best results per dataset are bolded, best results per section are underlined. We compare directly against other bag of ngram baselines.

                                           Image Annotation              Image Search
Model                                      R@1   R@5   R@10  Med r      R@1   R@5   R@10  Med r
VSE++ (Faghri et al., 2017)                64.6  -     95.7  1          52.0  -     92.0  1
Glove                                      64.6  88.9  95.5  1          53.7  86.5  94.4  1
Picturebook                                62.4  90.2  95.3  1          54.2  86.4  94.3  1
Glove + Picturebook                        61.8  89.2  95.0  1          54.1  86.7  94.7  1
Glove + Picturebook + Contextual Gating    63.4  90.3  96.5  1          55.2  87.2  94.4  1

Table 6: COCO test-set results for image-sentence retrieval experiments. Our models use VSE++. R@K is Recall@K (high is good). Med r is the median rank (low is good).
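
For readers unfamiliar with the two metrics in Table 6, the sketch below shows how they can be computed once each query has been assigned the 1-based rank of its correct item. It is an illustration only: the `ranks` list is made up, and how ranks are derived from model scores (for example, taking the best-ranked of several ground-truth captions per image) is left out and would depend on the evaluation protocol.

```python
import statistics

def recall_at_k(ranks, k):
    """Percentage of queries whose correct item appears within the top k (1-based ranks)."""
    hits = sum(1 for r in ranks if r <= k)
    return 100.0 * hits / len(ranks)

def median_rank(ranks):
    """Median of the ranks of the correct items (lower is better)."""
    return statistics.median(ranks)

# Hypothetical ranks of the correct item for eight queries.
ranks = [1, 3, 1, 12, 2, 7, 1, 5]
print(recall_at_k(ranks, 1))   # 37.5
print(recall_at_k(ranks, 5))   # 75.0
print(recall_at_k(ranks, 10))  # 87.5
print(median_rank(ranks))      # 2.5
```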

Our Glove baseline was able to match or outperform the reported results in Faghri et al. (2017) with the exception of Recall@10 for image annotation, where it performs slightly worse. Glove+Picturebook improves over the Glove baseline for image search but falls short on image annotation. However, using contextual gating results in improvements over the baseline on all metrics except R@1 for image annotation. Our reported results have been recently outperformed by Gu et al. (2018), Huang et al. (2018b) and Lee et al. (2018), which are more sophisticated methods that incorporate generative modelling, reinforcement learning and attention.

4.6 Machine Translation

We experiment with the Multi30k (Elliott et al., 2016, 2017) dataset for MT. We compare our Picturebook models with other text-only non-ensembled models on the Flickr Test2016, Flickr Test2017 and MSCOCO test sets from Caglayan et al. (2017), the winner of the WMT 17 Multimodal Machine Translation competition (Elliott et al., 2017). We use the standard seq2seq model (Sutskever et al., 2015) with content-based attention (Bahdanau et al., 2015) and we describe our hyperparameters in Appendix B.

Table 7 summarizes our English → German results and Table 8 summarizes our English → French results. Relative to Caglayan et al. (2017), our models perform better in BLEU than in METEOR. We believe this is due to the fact we did not use Byte Pair Encoding (BPE) (Sennrich et al., 2016), and METEOR captures word stemming (Denkowski and Lavie, 2014). This is also highlighted where our French models perform better than our German models relatively, due to the compounding nature of German words. Since seq2seq MT models are typically trained without Glove embeddings, we also did not use Glove embeddings for this task, but rather we combine randomly initialized learnable embeddings with the fixed Picturebook embeddings. We find the gating mechanism not to help much with the MT task since the trainable embeddings are free to change their norm magnitudes. We did not experiment with regularizing the norm of the embeddings. On the English → German tasks, we find our Picturebook model to perform on average 0.8 BLEU or 0.7 METEOR over our baseline. On the German task, compared to the previously best published results (Caglayan et al., 2017) we do better in BLEU but slightly worse in METEOR. We suspect this is due to the fact that we did not use BPE. On the English → French task, the Picturebook models do on average 1.2 BLEU or 1.0 METEOR better than our baseline.

We also report results for the IWSLT 2014 German-English task (Cettolo et al., 2014) in Table 9. Compared to our baseline, we report a gain of 0.3 and 1.1 BLEU for German → English and English → German respectively. We

                                              Test2016        Test2017        MSCOCO
Model                                         BLEU  METEOR    BLEU  METEOR    BLEU  METEOR
BPE (Caglayan et al., 2017)                   38.1  57.3      30.8  51.6      26.4  46.8
Baseline                                      38.9  56.5      32.6  50.7      26.8  45.4
Picturebook                                   39.6  56.9      31.8  50.1      27.7  45.8
Picturebook + Inverse Picturebook             40.2  57.2      32.3  50.7      27.8  46.3
Picturebook + Inverse Picturebook + Gating    40.0  57.3      33.0  51.1      27.9  46.5

Table 7: Machine Translation results on the Multi30k English → German task. We note that our models do not use BPE, and we perform better in BLEU relative to METEOR.

                                              Test2016        Test2017        MSCOCO
Model                                         BLEU  METEOR    BLEU  METEOR    BLEU  METEOR
BPE (Caglayan et al., 2017)                   52.5  69.6      50.4  67.5      41.2  61.3
Baseline                                      60.7  74.1      52.3  67.4      42.8  60.6
Picturebook                                   61.0  74.2      52.4  67.5      43.1  61.0
Picturebook + Inverse Picturebook             61.8  75.0      52.6  67.7      42.8  61.2
Picturebook + Inverse Picturebook + Gating    62.1  75.2      53.6  68.4      43.8  61.6

Table 8: Machine Translation results on the Multi30k English → French task.
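
The "Inverse Picturebook" rows in Tables 7 and 8 use the output layer of Section 3.5, in which the probability of target word i given a decoder state h is a softmax over inner products with the rows of the Picturebook matrix, optionally with a trainable per-word offset and bias. The following is a minimal sketch of that computation. The function name, the toy matrix and the max-subtraction for numerical stability are illustrative choices, and it assumes h has already been projected to the embedding dimension.

```python
import math

def inverse_picturebook(h, E_p, E_prime=None, bias=None):
    """p(y_i | h) = softmax_i(<h, e_i + e'_i> + b_i).

    With E_prime and bias left as None this reduces to the tied-softmax
    form that uses only the fixed Picturebook rows e_i.
    """
    n_words = len(E_p)
    if E_prime is None:
        E_prime = [[0.0] * len(h) for _ in range(n_words)]
    if bias is None:
        bias = [0.0] * n_words
    logits = [
        sum(hk * (ek + dk) for hk, ek, dk in zip(h, e, d)) + b
        for e, d, b in zip(E_p, E_prime, bias)
    ]
    shift = max(logits)                      # subtract the max for stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [x / total for x in exps]

# Hypothetical 3-word vocabulary with 2-dimensional Picturebook rows.
E_p = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
h = [2.0, 0.0]
probs = inverse_picturebook(h, E_p)
print(probs.index(max(probs)))  # 0
```

Only the offsets and biases would be trained in this reading; the Picturebook rows stay fixed, which is what ties the target vocabulary to the visual space shared with the source language.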

report new state-of-the-art results for the English → German task at 25.4 BLEU, while our German → English model achieves 29.6 BLEU which is slightly behind the recently proposed Neural Phrase-based Machine Translation (NPMT) model at 29.9 (Huang et al., 2018a). We note that the NPMT is not a seq2seq model and can be augmented with our Picturebook embeddings. We also note that our models may not be directly comparable to previously published seq2seq models from (Wiseman and Rush, 2016; Bahdanau et al., 2017) since we used a deeper encoder and decoder.

4.7 Limitations

We explored the use of Picturebook for larger machine translation tasks, including the popular WMT14 benchmarks. For these tasks, we found that models that incorporate Picturebook led to faster convergence. However, we were not able to improve upon BLEU scores from equivalent models that do not use Picturebook. This indicates that while our embeddings are useful for smaller MT experiments, further research is needed on how to best incorporate grounded representations in larger translation tasks.

4.8 Gate Analysis

In this section we perform an extensive analysis of the gating mechanism for models trained across datasets used in our experiments. In our first experiment, we aim to determine how well gate activations correlate to a) human judgments of concreteness and b) image dispersion (Kiela et al., 2014). For concreteness ratings, we use the dataset of Brysbaert et al. (2013) which provides ratings for 40,000 English lemmas. Image dispersion is the average distance between all pairs of images returned from a search query. It was shown in Kiela et al. (2014) that abstract words tend to have higher dispersion ratings, due to having much higher variety in the types of images returned from a query. On the other hand, low dispersion ratings were more associated with concrete words. For each word, we compute the mean gate activation value for Picturebook embeddings [4]. For concreteness ratings, we take the intersection of words that have ratings with the dataset vocabulary. We then compute the Spearman correlation of mean gate activations with a) concreteness ratings and b) image dispersion scores.

Table 10 illustrates the result of this analysis. We observe that gates have high correlations with concreteness ratings and strong negative correlations with image dispersion scores. Moreover, this result holds true across all datasets, even those that are not inherently visual. These results provide evidence that our gating mechanism actively prefers Glove embeddings for abstract words and Picturebook embeddings for concrete words. Appendix A contains examples of words that most strongly activate Glove and Picturebook gates.

[4] We only consider non-contextualized gates.
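The two quantities used in this analysis, image dispersion and the Spearman correlation between mean gate activations and per-word scores, can be sketched in a few lines of plain Python. This is an illustrative sketch only, not the authors' code: the text says only "average distance", so cosine distance is an assumption here, and the function names and toy values below are hypothetical.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """Return 1 - cos(u, v) for two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def image_dispersion(image_vectors):
    """Average distance over all unordered pairs of image feature vectors.

    Assumption: cosine distance is used as the pairwise distance.
    """
    pairs = list(combinations(image_vectors, 2))
    return sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)


def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def gate_correlation(mean_gate, scores):
    """Spearman correlation over the words present in both dictionaries."""
    shared = sorted(set(mean_gate) & set(scores))
    return spearman([mean_gate[w] for w in shared], [scores[w] for w in shared])


# Toy, made-up values purely for illustration (not taken from the paper).
toy_mean_gate = {"dog": 0.9, "table": 0.8, "idea": 0.2, "truth": 0.1, "unrated": 0.5}
toy_concreteness = {"dog": 4.8, "table": 4.9, "idea": 1.6, "truth": 1.9}
print(gate_correlation(toy_mean_gate, toy_concreteness))
```

Taking the intersection inside `gate_correlation` mirrors the step described above, where only words that have ratings and also appear in the dataset vocabulary are compared. The same function can be reused with dispersion scores in place of concreteness ratings.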

| Model | DE → EN BLEU | EN → DE BLEU |
|---|---|---|
| MIXER (Ranzato et al., 2016) | 21.8 | |
| Beam Search Optimization (Wiseman and Rush, 2016) | 25.5 | |
| Actor-Critic + Log Likelihood (Bahdanau et al., 2017) | 28.5 | |
| Neural Phrase-based Machine Translation (Huang et al., 2018a) | 29.9 | 25.1 |
| Baseline | 29.3 | 24.3 |
| Picturebook | 29.6 | 25.4 |

Table 9: Machine Translation results on the IWSLT 2014 German-English task.

| Rank | SNLI (ccorr / disp) | MultiNLI (ccorr / disp) | COCO (ccorr / disp) | AG-News (ccorr / disp) | DBpedia (ccorr / disp) | Yelp (ccorr / disp) | Amazon (ccorr / disp) |
|---|---|---|---|---|---|---|---|
| top-1% | 73 / -41 | 39 / -27 | 53 / -22 | 60 / -16 | 56 / -30 | 47 / -28 | 32 / -17 |
| top-10% | 54 / -39 | 48 / -34 | 34 / -23 | 52 / -24 | 54 / -32 | 49 / -26 | 50 / -30 |
| all | 35 / -30 | 30 / -27 | 21 / -16 | 36 / -17 | 39 / -30 | 24 / -20 | 33 / -31 |

Table 10: Correlations (rounded, ×100) of mean Picturebook gate activations to human judgements of concreteness ratings (ccorr) and image dispersion (disp) within the specified most frequent words.

[Figure 1 panels: (a) SNLI, (b) MultiNLI, (c) AG-News]

Figure 1: POS analysis. Top bar for each tag is Glove, bottom is Picturebook. Tags are sorted by Glove frequencies. Results taken over the top 100 mean activation values within the 10K most frequent words.
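The tally behind Figure 1 can be sketched as follows: restrict to the most frequent words, keep the words with the highest mean gate activation, and count their POS tags. This is a minimal sketch under stated assumptions, not the authors' implementation. The dictionaries `mean_gate`, `frequency` and `pos_tag` are hypothetical inputs, and how each word receives a single POS tag is not specified in the text.

```python
from collections import Counter


def top_activated_pos_counts(mean_gate, frequency, pos_tag,
                             vocab_size=10000, top_n=100):
    """Count POS tags among the top_n words by mean gate activation,
    restricted to the vocab_size most frequent words."""
    # Keep only the most frequent words (10K in the figure caption).
    frequent = sorted(frequency, key=frequency.get, reverse=True)[:vocab_size]
    # Skip words that lack a gate value or a POS tag.
    candidates = [w for w in frequent if w in mean_gate and w in pos_tag]
    # Keep the words with the highest mean activation (100 in the caption).
    top_words = sorted(candidates, key=mean_gate.get, reverse=True)[:top_n]
    return Counter(pos_tag[w] for w in top_words)


# Toy, made-up inputs purely for illustration (not taken from the paper).
toy_mean_gate = {"dog": 0.9, "cats": 0.8, "quickly": 0.1, "the": 0.05, "rare": 0.99}
toy_frequency = {"the": 1000, "dog": 50, "cats": 40, "quickly": 30, "rare": 1}
toy_pos_tag = {"dog": "NN", "cats": "NNS", "quickly": "RB", "the": "DT", "rare": "JJ"}

counts = top_activated_pos_counts(toy_mean_gate, toy_frequency, toy_pos_tag,
                                  vocab_size=4, top_n=2)
print(counts)
```

Calling the function once with Picturebook gate activations and once with Glove gate activations would give the two bars per tag described in the caption.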

Finally we analyze the parts-of-speech (POS) of the highest activated words. These results are shown in Figure 1. The highest scoring Picturebook words are almost all singular and plural nouns (NN / NNS). We also observe tags which are exclusively Glove oriented, namely adverbs (RB), prepositions (IN) and determiners (DT).

5 Conclusion

Traditionally, word representations have been built on co-occurrences of neighbouring words, and such representations only make use of the statistics of the text distribution. Picturebook embeddings offer an alternative approach to constructing word representations grounded in image search engines. In this work we demonstrated that Picturebook complements traditional embeddings on a wide variety of tasks. Through the use of multimodal gating, our models lead to interpretable weightings of abstract vs concrete words. In future work, we would like to explore other aspects of search engines for language grounding as well as the effect these embeddings may have on learning generic sentence representations (Kiros et al., 2015b; Hill et al., 2016; Conneau et al., 2017a; Logeswaran and Lee, 2018). Recently, contextualized word representations have shown promising improvements when combined with existing embeddings (Melamud et al., 2016; Peters et al., 2017; McCann et al., 2017; Peters et al., 2018). We expect that integrating Picturebook with these embeddings will lead to further performance improvements as well.

Acknowledgments

The authors would like to thank Chuck Rosenberg, Tom Duerig, Neil Alldrin, Zhen Li, Filipe Gonçalves, Mia Chen, Zhifeng Chen, Samy Bengio, Yu Zhang, Kevin Swersky, Felix Hill and the ACL anonymous reviewers for their valuable advice and feedback.

References

Andrew Anderson, Douwe Kiela, Stephen Clark, and Massimo Poesio. 2017. Visually Grounded and Textual Semantic Models Differentially Decode Brain Activity Associated with Concrete and Abstract Nouns. In ACL.

John Arevalo, Thamar Solorio, Manuel Montes y Gomez, and Fabio A. Gonzalez. 2017. Gated Multimodal Units for Information Fusion. In arXiv:1702.01992.

Jimmy Ba, Jamie Kiros, and Geoffrey Hinton. 2016. Layer Normalization. In arXiv:1607.06450.

Dzmitry Bahdanau, Philemon Brakel, Kelvin Xu, Anirudh Goyal, Ryan Lowe, Joelle Pineau, Aaron Courville, and Yoshua Bengio. 2017. An Actor-Critic Algorithm for Sequence Prediction. In ICLR.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural Machine Translation by Jointly Learning to Align and Translate. In ICLR.

Shane Bergsma and Benjamin Van Durme. 2011. Learning Bilingual Lexicons using the Visual Similarity of Labeled Web Images. In IJCAI.

Shane Bergsma and Randy Goebel. 2011. Using Visual Information to Predict Lexical Preference. In RANLP.

Sai Abishek Bhaskar, Maximilian Koper, Sabine Schulte Im Walde, and Diego Frassinelli. 2017. Exploring Multi-Modal Text+Image Models to Distinguish between Abstract and Concrete Nouns. In IWCS Workshop on Foundations of Situated and Multimodal Communication.

Samuel Bowman, Gabor Angeli, Christopher Potts, and Christopher Manning. 2015. A large annotated corpus for learning natural language inference. In EMNLP.

Elia Bruni, Nam Khanh Tran, and Marco Baroni. 2014. Multimodal Distributional Semantics. Journal of Artificial Intelligence Research 49(1).

Marc Brysbaert, Amy Beth Warriner, and Victor Kuperman. 2013. Concreteness ratings for 40 thousand generally known English word lemmas. Behavior Research Methods 46(3).

Luana Bulat, Stephen Clark, and Ekaterina Shutova. 2017. Speaking, Seeing, Understanding: Correlating semantic models with conceptual representation in the brain. In EMNLP.

Luana Bulat, Douwe Kiela, and Stephen Clark. 2015. Vision and Feature Norms: Improving automatic feature norm learning through cross-modal maps. In NAACL.

Ozan Caglayan, Walid Aransa, Adrien Bardet, Mercedes Garcia-Martinez, Marc Masana, Luis Herranz, and Joost van de Weijer. 2017. LIUM-CVC Submissions for WMT17 Multimodal Translation Task. In Conference on Machine Translation.

Mauro Cettolo, Jan Niehues, Sebastian Stuker, Luisa Bentivogli, and Marcello Federico. 2014. Report on the 11th IWSLT Evaluation Campaign, IWSLT 2014. In IWSLT.

Grzegorz Chrupala, Akos Kadar, and Afra Alishah. 2015. Learning language through pictures. In EMNLP.

Guillem Collell and Marie-Francine Moens. 2016. Is an Image Worth More than a Thousand Words? On the Fine-Grain Semantic Differences between Visual and Linguistic Representations. In COLING.

Guillem Collell, Ted Zhang, and Marie-Francine Moens. 2017. Imagined Visual Representations as Multimodal Embeddings. In AAAI.

Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. 2017a. Supervised Learning of Universal Sentence Representations from Natural Language Inference Data. In EMNLP.

Alexis Conneau, Holger Schwenk, Loïc Barrault, and Yann Lecun. 2017b. Very deep convolutional networks for text classification. In EACL.

Michael Denkowski and Alon Lavie. 2014. Meteor universal: Language specific translation evaluation for any target language. In EACL: Workshop on Statistical Machine Translation.

Desmond Elliott, Stella Frank, Loic Barrault, Fethi Bougares, and Lucia Specia. 2017. Findings of the Second Shared Task on Multimodal Machine Translation and Multilingual Image Description. In Conference on Machine Translation.

Desmond Elliott, Stella Frank, Khalil Sima’an, and Lucia Specia. 2016. Multi30K: Multilingual English-German Image Description. In ACL: Workshop on Vision and Language.

Fartash Faghri, David Fleet, Jamie Kiros, and Sanja Fidler. 2017. VSE++: Improving Visual-Semantic Embeddings with Hard Negatives. In arXiv:1707.05612.

Goran Glavas, Ivan Vulic, and Simone Paolo Ponzetto. 2017. If Sentences Could See: Investigating Visual Information for Semantic Textual Similarity. In IWCS.

Jiuxiang Gu, Jianfei Cai, Shafiq Joty, Li Niu, and Gang Wang. 2018. Look, Imagine and Match: Improving Textual-Visual Cross-Modal Retrieval with Generative Models. In CVPR.

Mareike Hartmann and Anders Sogaard. 2017. Limitations of Cross-Lingual Learning from Image Search. In arXiv:1709.05914.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. In CVPR.

Felix Hill, Kyunghyun Cho, and Anna Korhonen. 2016. Learning distributed representations of sentences from unlabelled data. In NAACL.

Felix Hill and Anna Korhonen. 2014. Learning Abstract Concept Embeddings from Multi-Modal Data: Since You Probably Can’t See What I Mean. In EMNLP.

Felix Hill, Roi Reichart, and Anna Korhonen. 2015. SimLex-999: Evaluating Semantic Models with (Genuine) Similarity Estimation. Computational Linguistics 41(4).

Sepp Hochreiter and Jurgen Schmidhuber. 1997. Long Short-Term Memory. Neural Computation 9(8).

Po-Sen Huang, Chong Wang, Sitao Huang, Dengyong Zhou, and Li Deng. 2018a. Towards Neural Phrase-based Machine Translation. In ICLR.

Yan Huang, Qi Wu, and Liang Wang. 2018b. Learning Semantic Concepts and Order for Image and Sentence Matching. In CVPR.

Jinbae Im and Sungzoon Cho. 2017. Distance-based self-attention network for natural language inference. In arXiv:1712.02047.

Hakan Inan, Khashayar Khosravi, and Richard Socher. 2017. Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling. In ICLR.

Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2017. Bag of Tricks for Efficient Text Classification. In EACL.

Armand Joulin, Laurens van der Maaten, Allan Jabri, and Nicolas Vasilache. 2016. Learning Visual Features from Large Weakly Supervised Data. In ECCV.

Douwe Kiela. 2016. MMFeat: A Toolkit for Extracting Multi-Modal Features. In ACL: System Demonstrations.

Douwe Kiela and Leon Bottou. 2014. Learning Image Embeddings using Convolutional Neural Networks for Improved Multi-Modal Semantics. In EMNLP.

Douwe Kiela, Alexis Conneau, Allan Jabri, and Maximilian Nickel. 2017. Learning Visually Grounded Sentence Representations. In arXiv:1707.06320.

Douwe Kiela, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2018. Efficient Large-Scale Multi-Modal Classification. In AAAI.

Douwe Kiela, Felix Hill, Anna Korhonen, and Stephen Clark. 2014. Improving Multi-Modal Representations Using Image Dispersion: Why Less is Sometimes More. In ACL.

Douwe Kiela, Laura Rimell, Ivan Vulic, and Stephen Clark. 2015a. Exploiting Image Generality for Lexical Entailment Detection. In ACL.

Douwe Kiela, Anita Vero, and Stephen Clark. 2016. Comparing data sources and architectures for deep visual representation learning in semantics. In EMNLP.

Douwe Kiela, Ivan Vulic, and Stephen Clark. 2015b. Visual Bilingual Lexicon Induction with Transferred ConvNet Features. In EMNLP.

Diederik Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In ICLR.

Ryan Kiros, Ruslan Salakhutdinov, and Richard Zemel. 2015a. Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models. In arXiv:1411.2539.

Ryan Kiros, Yukun Zhu, Ruslan R Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015b. Skip-thought vectors. In NIPS.

Angeliki Lazaridou, Nghia The Pham, and Marco Baroni. 2015. Combining Language and Vision with a Multimodal Skip-gram Model. In ACL.

Kuang-Huei Lee, Xi Chen, Gang Hua, Houdong Hu, and Xiaodong He. 2018. Stacked Cross Attention for Image-Text Matching. In arXiv:1803.08024.

Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, Lawrence Zitnick, and Piotr Dollár. 2014. Microsoft COCO: Common Objects in Context. In ECCV.

Lajanugen Logeswaran and Honglak Lee. 2018. An efficient framework for learning sentence representations. In ICLR.

Junhua Mao, Jiajing Xu, Yushi Jing, and Alan Yuille. 2016. Training and Evaluating Multimodal Word Embeddings with Large-scale Web Annotated Images. In NIPS.

Marco Marelli, Stefano Menini, Marco Baroni, Luisa Bentivogli, Raffaella Bernardi, and Roberto Zamparelli. 2014. A SICK cure for the evaluation of compositional distributional semantic models. In LREC.

Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. 2017. Learned in translation: Contextualized word vectors. In NIPS.

Oren Melamud, Jacob Goldberger, and Ido Dagan. 2016. context2vec: Learning Generic Context Embedding with Bidirectional LSTM. In CoNLL.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global Vectors for Word Representation. In EMNLP.

Gabriel Pereyra, George Tucker, Jan Chorowski, Łukasz Kaiser, and Geoffrey Hinton. 2017. Regularizing Neural Networks by Penalizing Confident Output Distributions. In ICLR Workshop.

Matthew Peters, Waleed Ammar, Chandra Bhagavatula, and Russell Power. 2017. Semi-supervised sequence tagging with bidirectional language models. In ACL.

Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In arXiv:1802.05365.

Ofir Press and Lior Wolf. 2017. Using the Output Embedding to Improve Language Models. In EACL.

Marc’Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. 2016. Sequence Level Training with Recurrent Neural Networks. In ICLR.

Stanislau Semeniuta, Aliaksei Severyn, and Erhardt Barth. 2016. Recurrent Dropout without Memory Loss. In COLING.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural Machine Translation of Rare Words with Subword Units. In ACL.

Tao Shen, Tianyi Zhou, Guodong Long, Jing Jiang, Sen Wang, and Chengqi Zhang. 2018. Reinforced Self-Attention Network: a Hybrid of Hard and Soft Attention for Sequence Modeling. In arXiv:1801.10296.

Ekaterina Shutova, Douwe Kiela, and Jean Maillard. 2016. Black Holes and White Rabbits: Metaphor Identification with Visual Features. In NAACL.

Carina Silberer, Vittorio Ferrari, and Mirella Lapata. 2017. Visually Grounded Meaning Representations. PAMI.

Ilya Sutskever, Oriol Vinyals, and Quoc Le. 2015. Sequence to Sequence Learning with Neural Networks. In NIPS.

Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. 2016. Rethinking the Inception Architecture for Computer Vision. In CVPR.

Kai Sheng Tai, Richard Socher, and Christopher D Manning. 2015. Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks. In ACL.

Steffen Thoma, Achim Rettinger, and Fabian Both. 2017. Towards Holistic Concept Representations: Embedding Relational Knowledge, Visual Attributes, and Distributional Word Semantics. In ISWC.

Ivan Vulic, Douwe Kiela, Stephen Clark, and Marie-Francine Moens. 2016. Multi-Modal Representations for Improved Bilingual Lexicon Learning. In ACL.

Jiang Wang, Yang Song, Thomas Leung, Chuck Rosenberg, Jingbin Wang, James Philbin, Bo Chen, and Ying Wu. 2014. Learning Fine-grained Image Similarity with Deep Ranking. In CVPR.

Shaonan Wang, Jiajun Zhang, and Chengqing Zong. 2018. Learning Multimodal Word Representation via Dynamic Fusion Methods. In AAAI.

Adina Williams, Nikita Nangia, and Samuel Bowman. 2017. A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference. In arXiv:1704.05426.

Sam Wiseman and Alexander M. Rush. 2016. Sequence-to-Sequence Learning as Beam-Search Optimization. In EMNLP.

Éloi Zablocki, Benjamin Piwowarski, Laure Soulier, and Patrick Gallinari. 2018. Learning Multi-Modal Word Representation Grounded in Visual Context. In AAAI.

Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level Convolutional Networks for Text Classification. In NIPS.