also introduce several alternative ways of reducing the number of tensor parameters by using matrices. The best performing method uses two matrices, one representing the subject-verb interactions and the other the verb-object interactions. Some interaction between the subject and the object is re-introduced through a softmax layer. A similar method is presented in Paperno et al. (2014). Milajevs et al. (2014) use vectors generated by a neural language model to construct verb matrices and several different composition operators to generate the composed subject-verb-object sentence representation.

In this paper, we use tensor rank decomposition (Kolda and Bader, 2009) to represent each verb’s tensor as a sum of tensor products of vectors. We learn the component vectors and apply the composition without ever constructing the full tensors, and thus we are able to improve on both memory usage and efficiency. This approach follows recent work on using low-rank tensors to parameterize models for dependency parsing (Lei et al., 2014) and semantic role labelling (Lei et al., 2015). Our work applies the same tensor rank decompositions, and similar optimization algorithms, to the task of constructing a syntax-driven model for CDS. Although we focus on the Categorial framework, the low-rank decomposition methods are also applicable to other tensor-based semantic models including Van de Cruys (2010), Smolensky and Legendre (2006), and Blacoe et al. (2013).

2 Model

Tensor Models for Verbs We model each transitive verb as a bilinear function mapping subject and object noun vectors, each of dimensionality $N$, to a single sentence vector of dimensionality $S$ (Coecke et al., 2011; Maillard et al., 2014) representing the composed subject-verb-object (SVO) triple. Each transitive verb has its own third-order tensor, which defines this bilinear function. Consider a verb $V$ with associated tensor $\mathcal{V} \in \mathbb{R}^{S \times N \times N}$, and vectors $s \in \mathbb{R}^{N}$, $o \in \mathbb{R}^{N}$ for subject and object nouns, respectively. Then the compositional representation for the subject, verb, and object is a vector $V(s, o) \in \mathbb{R}^{S}$, produced by applying tensor contraction (the higher-order analogue of matrix multiplication) to the verb tensor and two noun vectors. The $l$th component of the vector for the SVO triple is given by

$$V(s, o)_l = \sum_{j,k} \mathcal{V}_{ljk} \, o_k \, s_j \tag{1}$$

We aim to learn distributional vectors $s$ and $o$ for subjects and objects, and tensors $\mathcal{V}$ for verbs, such that the output vectors $V(s, o)$ are distributional representations of the entire SVO triple. While there are several possible definitions of the sentence space (Clark, 2013; Baroni et al., 2014), we follow previous work (Grefenstette et al., 2013) by using a contextual sentence space consisting of content words that occur within the same sentences as the SVO triple.

Low-Rank Tensor Representations Following Lei et al. (2014), we represent each verb’s tensor using a low-rank canonical polyadic (CP) decomposition to reduce the number of parameters that must be learned during training. As a higher-order analogue to singular value decomposition for matrices, CP decomposition factors a tensor into a sum of $R$ tensor products of vectors.1 Given a third-order tensor $\mathcal{V} \in \mathbb{R}^{S \times N \times N}$, the CP decomposition of $\mathcal{V}$ is:

$$\mathcal{V} = \sum_{r=1}^{R} \mathbf{P}_r \otimes \mathbf{Q}_r \otimes \mathbf{R}_r \tag{2}$$

where $\mathbf{P} \in \mathbb{R}^{R \times S}$, $\mathbf{Q} \in \mathbb{R}^{R \times N}$, $\mathbf{R} \in \mathbb{R}^{R \times N}$ are parameter matrices, $\mathbf{P}_r$ gives the $r$th row of matrix $\mathbf{P}$, and $\otimes$ is the tensor product.

1 However, unlike matrix singular value decomposition, the component vectors in the CP decomposition are not necessarily orthonormal.

The smallest $R$ that allows the tensor to be expressed as this sum of outer products is the rank of the tensor (Kolda and Bader, 2009). By fixing a value for $R$ that is sufficiently small compared to $S$ and $N$ (forcing the verb tensor to have rank of at most $R$), and directly learning the parameters of the low-rank approximation using gradient-based optimization, we learn a low-rank tensor requiring fewer parameters without ever having to store the full tensor.

In addition to reducing the number of parameters, representing tensors in this form allows us to formulate the verb tensor’s action on noun vectors as matrix multiplication. For a tensor in the form of Eq. (2), the output SVO vector is given by

$$V(s, o) = \mathbf{P}^{\top} (\mathbf{Q} s \odot \mathbf{R} o) \tag{3}$$

where $\odot$ is the elementwise vector product.
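To make the relationship between Eqs. (1), (2) and (3) concrete, the following is a minimal NumPy sketch, not the authors' implementation. It builds a full tensor from randomly drawn factor matrices via Eq. (2), contracts it with noun vectors via Eq. (1), and checks that the result matches the low-rank form of Eq. (3). The function names, the toy sizes and the random inputs are illustrative assumptions; the matrix called R in the text is named `Rm` here to keep it apart from the rank.

```python
import numpy as np

def cp_to_full(P, Q, Rm):
    """Eq. (2): sum over r of the outer products P_r (x) Q_r (x) R_r."""
    return np.einsum("rl,rj,rk->ljk", P, Q, Rm)

def full_contraction(V, s, o):
    """Eq. (1): out[l] = sum over j, k of V[l, j, k] * o[k] * s[j]."""
    return np.einsum("ljk,j,k->l", V, s, o)

def low_rank_contraction(P, Q, Rm, s, o):
    """Eq. (3): P^T (Qs * Ro), where * is the elementwise product."""
    return P.T @ ((Q @ s) * (Rm @ o))

def cp_param_count(S, N, rank):
    """Entries stored by the three factor matrices P, Q and R."""
    return rank * (S + 2 * N)

def full_param_count(S, N):
    """Entries stored by the full S x N x N tensor."""
    return S * N * N

# Toy sizes and random values, chosen only for illustration.
rng = np.random.default_rng(0)
S, N, rank = 6, 4, 3
P = rng.normal(size=(rank, S))
Q = rng.normal(size=(rank, N))
Rm = rng.normal(size=(rank, N))
s = rng.normal(size=N)
o = rng.normal(size=N)

via_full = full_contraction(cp_to_full(P, Q, Rm), s, o)
via_low_rank = low_rank_contraction(P, Q, Rm, s, o)
agree = np.allclose(via_full, via_low_rank)
```

The low-rank route never materializes the S x N x N array: it needs only two matrix-vector products, one elementwise product and one further matrix-vector product. With S = N = 100, `cp_param_count(100, 100, 50)` gives 15,000 stored values against 1,000,000 for `full_param_count(100, 100)`.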
3 Training

We train the compositional model for verbs in three steps: extracting transitive verbs and their subject and object nouns from corpus data, producing distributional vectors for the nouns and the SVO triples, and then learning parameters of the verb functions, which map the nouns to the SVO triple vectors.

Corpus Data We extract SVO triples from an October 2013 download of Wikipedia, tokenized using Stanford CoreNLP (Manning et al., 2014), lemmatized with the Morpha lemmatizer (Minnen et al., 2001), and parsed using the C&C parser (Curran et al., 2007). We filter the SVO triples to a set containing 345 distinct verbs: the verbs from our test datasets, along with some additional high-frequency verbs included to produce more representative sentence spaces. For each verb, we selected up to 600 triples which occurred more than once and contained subject and object nouns that occurred at least 100 times (to allow sufficient context to produce a distributional representation for the triple). This resulted in approximately 150,000 SVO triples overall.

Distributional Vectors We produce two types of distributional vectors for nouns and SVO triples using the Wikipedia corpus. Since these methods for producing distributional vectors for the SVO triples require that the triples occur in a corpus of text, the methods are not a replacement for a compositional framework that can produce representations for previously unseen expressions. However, they can be used to generate data to train such a model, as we will describe.

1) Count vectors (SVD): we count the number of times each noun or SVO triple co-occurs with each of the 10,000 most frequent words (excluding stopwords) in the Wikipedia corpus, using sentences as context boundaries. If the verb in the SVO triple is itself a content word, we do not include it as context for the triple. This produces one set of context vectors for nouns and another for SVO triples. We weight entries in these vectors using the t-test weighting scheme (Curran, 2004; Polajnar and Clark, 2014), and then reduce the vectors to 100 dimensions via singular value decomposition (SVD), decomposing the noun vectors and SVO vectors separately.

2) Prediction vectors (PV): we train vector embeddings for nouns and SVO triples by adapting the Paragraph Vector distributed bag of words method of Le and Mikolov (2014), an extension of the skip-gram model of Mikolov et al. (2013). In our experiments, given an SVO triple, the model must predict contextual words sampled from all sentences containing that triple. In the process, the model learns vector embeddings both for the SVO triples and for the words in the sentences, such that SVO vectors have a high dot product with their contextual word vectors. While previous work (Milajevs et al., 2014) has used prediction-based vectors for words in a tensor-based CDS model, ours uses prediction-based vectors for both words and phrases to train a tensor regression model.

We learn 100-dimensional vectors for nouns and SVO triples with a modified version of word2vec,2 using the hierarchical sampling method with the default hyperparameters and 20 iterations through the training data.

2 msg/word2vec-toolkit/Q49FIrNOQRo/J6KG8mUj45sJ

Training Methods We learn the tensor $\mathcal{V}$ of parameters for a given verb $V$ using multi-linear regression, treating the noun vectors $s$ and $o$ as input and the composed SVO triple vector $V(s, o)$ as the regression output. Let $M_V$ be the number of training instances for $V$, where the $i$th instance is a triple of vectors $(s^{(i)}, o^{(i)}, t^{(i)})$, which are the distributional vectors for the subject noun, object noun, and the SVO triple, respectively. We aim to learn a verb tensor $\mathcal{V}$ (either in full or in decomposed, low-rank form) that minimizes the mean of the squared residuals between the predicted SVO vectors $V(s^{(i)}, o^{(i)})$ and those vectors obtained distributionally from the corpus, $t^{(i)}$. Specifically, we attempt to minimize the following loss function:

$$L(\mathcal{V}) = \frac{1}{M_V} \sum_{i=1}^{M_V} \left\| V(s^{(i)}, o^{(i)}) - t^{(i)} \right\|_2^2 \tag{4}$$

$V(s, o)$ is given by Eq. (1) for full tensors, and by Eq. (3) for tensors represented in low-rank form.

In both the low-rank and full-rank tensor learning, we use mini-batch ADADELTA optimization (Zeiler, 2012) up to a maximum of 500 iterations through the training data, which we found to be sufficient for convergence for every verb. Rather than placing a regularization penalty on the tensor parameters, we use early stopping if the loss
increases on a validation set consisting of 10% of the available SVO triples for each verb.

For low-rank tensors, we compare seven different maximal ranks: R=1, 5, 10, 20, 30, 40 and 50. To learn the parameters of the low-rank tensors, we use an alternating optimization method (Kolda and Bader, 2009; Lei et al., 2014): performing gradient descent on one of the parameter matrices (for example P) to minimize the loss function while holding the other two fixed (Q and R), then repeating for the other parameter matrices in turn. The parameter matrices are randomly initialized.3

            GS11            KS14          # tensor
          SVD     PV      SVD     PV      params.
Add.      0.13    0.14    0.55    0.56    –
Mult.     0.13    0.14    0.09    0.27    –
R=1       0.10    0.05    0.18    0.30    300
R=5       0.26    0.30    0.28    0.40    1.5K
R=10      0.29    0.32    0.26    0.45    3K
R=20      0.31    0.34    0.39    0.44    6K
R=30      0.28    0.33    0.32    0.46    9K
R=40      0.32    0.30    0.31    0.52    12K
R=50      0.34    0.32    0.42    0.51    15K
Full      0.29    0.36    0.41    0.52    1M
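The alternating optimization described in the training procedure can be sketched as follows. This is a simplified illustration under stated assumptions, not the procedure used in the experiments: plain full-batch gradient steps with step halving stand in for mini-batch ADADELTA, early stopping on a validation set is omitted, and the training triples are synthetic. All function names, the toy sizes and the 0.1 initialization scale are hypothetical. The gradients are those of the loss in Eq. (4) under the low-rank form of Eq. (3).

```python
import numpy as np

def predict(P, Q, Rm, Sx, Ox):
    """Eq. (3) for a batch: row i is P^T (Q s_i * R o_i)."""
    return ((Sx @ Q.T) * (Ox @ Rm.T)) @ P

def loss(P, Q, Rm, Sx, Ox, T):
    """Eq. (4): mean squared L2 residual over the training triples."""
    return np.mean(np.sum((predict(P, Q, Rm, Sx, Ox) - T) ** 2, axis=1))

def gradients(P, Q, Rm, Sx, Ox, T):
    """Gradients of Eq. (4) with respect to P, Q and R."""
    A, B = Sx @ Q.T, Ox @ Rm.T   # (M, rank) projections of subjects and objects
    E = (A * B) @ P - T          # (M, S) residuals
    G = E @ P.T                  # (M, rank) residuals pulled back through P
    scale = 2.0 / len(T)
    return {"P": scale * (A * B).T @ E,
            "Q": scale * (G * B).T @ Sx,
            "R": scale * (G * A).T @ Ox}

def alternating_descent(params, Sx, Ox, T, sweeps=50):
    """Update one factor matrix at a time, holding the other two fixed."""
    params = dict(params)
    for _ in range(sweeps):
        for name in ("P", "Q", "R"):
            current = loss(params["P"], params["Q"], params["R"], Sx, Ox, T)
            grad = gradients(params["P"], params["Q"], params["R"], Sx, Ox, T)[name]
            step = 1.0
            # Assumption: halve the step until the loss drops (not ADADELTA).
            while step > 1e-10:
                trial = dict(params)
                trial[name] = params[name] - step * grad
                if loss(trial["P"], trial["Q"], trial["R"], Sx, Ox, T) < current:
                    params = trial
                    break
                step *= 0.5
    return params

# Synthetic data: targets come from a hidden rank-2 model (illustration only).
rng = np.random.default_rng(0)
M, S, N, rank = 40, 5, 4, 2
Sx, Ox = rng.normal(size=(M, N)), rng.normal(size=(M, N))
T = predict(rng.normal(size=(rank, S)), rng.normal(size=(rank, N)),
            rng.normal(size=(rank, N)), Sx, Ox)

# Random initialization of the three parameter matrices.
init = {"P": 0.1 * rng.normal(size=(rank, S)),
        "Q": 0.1 * rng.normal(size=(rank, N)),
        "R": 0.1 * rng.normal(size=(rank, N))}
initial_loss = loss(init["P"], init["Q"], init["R"], Sx, Ox, T)
fitted = alternating_descent(init, Sx, Ox, T)
final_loss = loss(fitted["P"], fitted["Q"], fitted["R"], Sx, Ox, T)
```

Because each inner update is accepted only when it lowers the loss, `final_loss` cannot exceed `initial_loss` in this sketch. With two of the three matrices held fixed, the prediction is linear in the remaining one, so each sub-problem is an ordinary least-squares fit.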
