also introduce several alternative ways of reducing the number of tensor parameters by using matrices. The best performing method uses two matrices, one representing the subject-verb interactions and the other the verb-object interactions. Some interaction between the subject and the object is re-introduced through a softmax layer. A similar method is presented in Paperno et al. (2014). Milajevs et al. (2014) use vectors generated by a neural language model to construct verb matrices, and several different composition operators to generate the composed subject-verb-object sentence representation.

In this paper, we use tensor rank decomposition (Kolda and Bader, 2009) to represent each verb's tensor as a sum of tensor products of vectors. We learn the component vectors and apply the composition without ever constructing the full tensors, and thus we are able to improve on both memory usage and efficiency. This approach follows recent work on using low-rank tensors to parameterize models for dependency parsing (Lei et al., 2014) and semantic role labelling (Lei et al., 2015). Our work applies the same tensor rank decompositions, and similar optimization algorithms, to the task of constructing a syntax-driven model for CDS. Although we focus on the Categorial framework, the low-rank decomposition methods are also applicable to other tensor-based semantic models, including Van de Cruys (2010), Smolensky and Legendre (2006), and Blacoe et al. (2013).

2 Model

Tensor Models for Verbs We model each transitive verb as a bilinear function mapping subject and object noun vectors, each of dimensionality N, to a single sentence vector of dimensionality S (Coecke et al., 2011; Maillard et al., 2014) representing the composed subject-verb-object (SVO) triple. Each transitive verb has its own third-order tensor, which defines this bilinear function. Consider a verb V with associated tensor V ∈ R^{S×N×N}, and vectors s ∈ R^N, o ∈ R^N for the subject and object nouns, respectively. Then the compositional representation for the subject, verb, and object is a vector V(s, o) ∈ R^S, produced by applying tensor contraction (the higher-order analogue of matrix multiplication) to the verb tensor and the two noun vectors. The lth component of the vector for the SVO triple is given by

    V(s, o)_l = \sum_{j,k} V_{ljk} o_k s_j        (1)

We aim to learn distributional vectors s and o for subjects and objects, and tensors V for verbs, such that the output vectors V(s, o) are distributional representations of the entire SVO triple. While there are several possible definitions of the sentence space (Clark, 2013; Baroni et al., 2014), we follow previous work (Grefenstette et al., 2013) by using a contextual sentence space consisting of content words that occur within the same sentences as the SVO triple.

Low-Rank Tensor Representations Following Lei et al. (2014), we represent each verb's tensor using a low-rank canonical polyadic (CP) decomposition to reduce the number of parameters that must be learned during training. As a higher-order analogue of singular value decomposition for matrices, CP decomposition factors a tensor into a sum of R tensor products of vectors (although, unlike matrix singular value decomposition, the component vectors in the CP decomposition are not necessarily orthonormal). Given a third-order tensor V ∈ R^{S×N×N}, the CP decomposition of V is:

    V = \sum_{r=1}^{R} P_r \otimes Q_r \otimes R_r        (2)

where P ∈ R^{R×S}, Q ∈ R^{R×N}, R ∈ R^{R×N} are parameter matrices, P_r gives the rth row of matrix P, and ⊗ is the tensor product.

The smallest R that allows the tensor to be expressed as this sum of outer products is the rank of the tensor (Kolda and Bader, 2009). By fixing a value for R that is sufficiently small compared to S and N (forcing the verb tensor to have rank at most R), and directly learning the parameters of the low-rank approximation using gradient-based optimization, we learn a low-rank tensor requiring fewer parameters, without ever having to store the full tensor.

In addition to reducing the number of parameters, representing tensors in this form allows us to formulate the verb tensor's action on noun vectors as matrix multiplication. For a tensor in the form of Eq. (2), the output SVO vector is given by

    V(s, o) = P^\top (Q s \odot R o)        (3)

where ⊙ is the elementwise vector product.
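To make the contraction in Eq. (1) concrete, here is a minimal NumPy sketch; it is not code from the paper, and the dimensionalities and random vectors are placeholder assumptions used only for illustration:

```python
import numpy as np

N, S = 100, 100                     # noun and sentence-space dimensionalities (illustrative)
rng = np.random.default_rng(0)

V_full = rng.standard_normal((S, N, N))   # full verb tensor V in R^{S x N x N}
s = rng.standard_normal(N)                # subject noun vector
o = rng.standard_normal(N)                # object noun vector

# Eq. (1): V(s, o)_l = sum over j, k of V_{ljk} * o_k * s_j  (tensor contraction)
svo = np.einsum('ljk,k,j->l', V_full, o, s)
print(svo.shape)                          # (S,): one vector for the whole SVO triple
```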
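Similarly, a small sketch of the low-rank composition in Eq. (3) (again illustrative, with an arbitrary choice of rank R) confirms that it reproduces the full contraction of Eq. (1) when the full tensor is assembled from the same CP factors of Eq. (2), while only ever storing the three factor matrices:

```python
import numpy as np

N, S, R = 100, 100, 20              # dimensionalities and rank (illustrative)
rng = np.random.default_rng(1)

# CP factor matrices of Eq. (2): P in R^{R x S}; Q and R_ in R^{R x N}
P = rng.standard_normal((R, S))
Q = rng.standard_normal((R, N))
R_ = rng.standard_normal((R, N))

s = rng.standard_normal(N)          # subject noun vector
o = rng.standard_normal(N)          # object noun vector

# Eq. (3): V(s, o) = P^T (Qs elementwise-times Ro), never materialising the full tensor
svo_lowrank = P.T @ ((Q @ s) * (R_ @ o))

# Sanity check: rebuild the full tensor via Eq. (2) and apply the contraction of Eq. (1)
V_full = np.einsum('rl,rj,rk->ljk', P, Q, R_)
svo_full = np.einsum('ljk,k,j->l', V_full, o, s)
assert np.allclose(svo_lowrank, svo_full)
```

The low-rank route stores only R(S + 2N) parameters per verb instead of the SN^2 entries of the full tensor, which is the source of the memory savings discussed above.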
3 Training

We train the compositional model for verbs in three steps: extracting transitive verbs and their subject and object nouns from corpus data, producing distributional vectors for the nouns and the SVO triples, and then learning parameters of the verb functions, which map the nouns to the SVO triple vectors.

Corpus Data We extract SVO triples from an October 2013 download of Wikipedia, tokenized using Stanford CoreNLP (Manning et al., 2014), lemmatized with the Morpha lemmatizer (Minnen et al., 2001), and parsed using the C&C parser (Curran et al., 2007). We filter the SVO triples to a set containing 345 distinct verbs: the verbs from our test datasets, along with some additional high-frequency verbs included to produce more representative sentence spaces. For each verb, we selected up to 600 triples which occurred more than once and contained subject and object nouns that occurred at least 100 times (to allow sufficient context to produce a distributional representation for the triple). This resulted in approximately 150,000 SVO triples overall.

Distributional Vectors We produce two types of distributional vectors for nouns and SVO triples using the Wikipedia corpus. Since these methods for producing distributional vectors for the SVO triples require that the triples occur in a corpus of text, the methods are not a replacement for a compositional framework that can produce representations for previously unseen expressions. However, they can be used to generate data to train such a model, as we will describe.

1) Count vectors (SVD): we count the number of times each noun or SVO triple co-occurs with each of the 10,000 most frequent words (excluding stopwords) in the Wikipedia corpus, using sentences as context boundaries. If the verb in the SVO triple is itself a content word, we do not include it as context for the triple. This produces one set of context vectors for nouns and another for SVO triples. We weight entries in these vectors using the t-test weighting scheme (Curran, 2004; Polajnar and Clark, 2014), and then reduce the vectors to 100 dimensions via singular value decomposition (SVD), decomposing the noun vectors and SVO vectors separately.

2) Prediction vectors (PV): we train vector embeddings for nouns and SVO triples by adapting the Paragraph Vector distributed bag of words method of Le and Mikolov (2014), an extension of the skip-gram model of Mikolov et al. (2013). In our experiments, given an SVO triple, the model must predict contextual words sampled from all sentences containing that triple. In the process, the model learns vector embeddings for both the SVO triples and the words in the sentences such that SVO vectors have a high dot product with their contextual word vectors. While previous work (Milajevs et al., 2014) has used prediction-based vectors for words in a tensor-based CDS model, ours uses prediction-based vectors for both words and phrases to train a tensor regression model.

We learn 100-dimensional vectors for nouns and SVO triples with a modified version of word2vec (https://groups.google.com/d/msg/word2vec-toolkit/Q49FIrNOQRo/J6KG8mUj45sJ), using the hierarchical sampling method with the default hyperparameters and 20 iterations through the training data.

Training Methods We learn the tensor V of parameters for a given verb V using multi-linear regression, treating the noun vectors s and o as input and the composed SVO triple vector V(s, o) as the regression output. Let M_V be the number of training instances for V, where the ith instance is a triple of vectors (s^(i), o^(i), t^(i)), which are the distributional vectors for the subject noun, object noun, and the SVO triple, respectively. We aim to learn a verb tensor V (either in full or in decomposed, low-rank form) that minimizes the mean of the squared residuals between the predicted SVO vectors V(s^(i), o^(i)) and those vectors obtained distributionally from the corpus, t^(i). Specifically, we attempt to minimize the following loss function:

    L(V) = \frac{1}{M_V} \sum_{i=1}^{M_V} \| V(s^{(i)}, o^{(i)}) - t^{(i)} \|_2^2        (4)

V(s, o) is given by Eq. (1) for full tensors, and by Eq. (3) for tensors represented in low-rank form. In both the low-rank and full-rank tensor learning, we use mini-batch ADADELTA optimization (Zeiler, 2012) up to a maximum of 500 iterations through the training data, which we found to be sufficient for convergence for every verb. Rather than placing a regularization penalty on the tensor parameters, we use early stopping if the loss
increases on a validation set consisting of 10% of the available SVO triples for each verb.

For low-rank tensors, we compare seven different maximal ranks: R = 1, 5, 10, 20, 30, 40 and 50. To learn the parameters of the low-rank tensors, we use an alternating optimization method (Kolda and Bader, 2009; Lei et al., 2014): performing gradient descent on one of the parameter matrices (for example P) to minimize the loss function while holding the other two fixed (Q and R), then repeating for the other parameter matrices in turn. The parameter matrices are randomly initialized.³

                 GS11            KS14          # tensor
              SVD    PV       SVD    PV        params.
    Add.      0.13   0.14     0.55   0.56         –
    Mult.     0.13   0.14     0.09   0.27         –
    R=1       0.10   0.05     0.18   0.30        300
    R=5       0.26   0.30     0.28   0.40       1.5K
    R=10      0.29   0.32     0.26   0.45         3K
    R=20      0.31   0.34     0.39   0.44         6K
    R=30      0.28   0.33     0.32   0.46         9K
    R=40      0.32   0.30     0.31   0.52        12K
    R=50      0.34   0.32     0.42   0.51        15K
    Full      0.29   0.36     0.41   0.52         1M
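The following sketch illustrates the training procedure described above. It is not the authors' code: the gradients of Eq. (4) under the low-rank parameterization of Eq. (3) are derived by hand, plain fixed-step gradient descent stands in for mini-batch ADADELTA, the corpus-derived vectors are replaced by synthetic data, and the early-stopping check on a held-out 10% is omitted. The learning rate, rank, and iteration counts are arbitrary assumptions.

```python
import numpy as np

N, S, R, M = 100, 100, 20, 500     # dims, rank, number of training triples (illustrative)
rng = np.random.default_rng(2)

# Synthetic stand-ins for the distributional vectors s^(i), o^(i), t^(i)
Ss = rng.standard_normal((M, N))   # subject vectors, one row per instance
Os = rng.standard_normal((M, N))   # object vectors
Ts = rng.standard_normal((M, S))   # target SVO triple vectors

# Randomly initialised CP factor matrices (Eq. 2)
P = 0.1 * rng.standard_normal((R, S))
Q = 0.1 * rng.standard_normal((R, N))
R_ = 0.1 * rng.standard_normal((R, N))

def predict(P, Q, R_, Ss, Os):
    """Eq. (3) applied row-wise: V(s, o) = P^T (Qs elementwise-times Ro)."""
    A = (Ss @ Q.T) * (Os @ R_.T)   # M x R, row i is (Q s_i) * (R o_i)
    return A @ P, A                # predictions (M x S) and the intermediate A

def loss(P, Q, R_, Ss, Os, Ts):
    """Eq. (4): mean squared residual over the M training instances."""
    preds, _ = predict(P, Q, R_, Ss, Os)
    return np.mean(np.sum((preds - Ts) ** 2, axis=1))

lr = 1e-3
for it in range(200):
    # Alternating optimisation: a gradient step on one factor matrix
    # while the other two are held fixed, cycling P -> Q -> R.
    for which in ('P', 'Q', 'R'):
        preds, A = predict(P, Q, R_, Ss, Os)
        E = preds - Ts                              # M x S residuals
        if which == 'P':
            grad_P = (2.0 / M) * A.T @ E            # R x S
            P -= lr * grad_P
        elif which == 'Q':
            B = (E @ P.T) * (Os @ R_.T)             # M x R, row i is (P e_i) * (R o_i)
            grad_Q = (2.0 / M) * B.T @ Ss           # R x N
            Q -= lr * grad_Q
        else:
            C = (E @ P.T) * (Ss @ Q.T)              # M x R, row i is (P e_i) * (Q s_i)
            grad_R = (2.0 / M) * C.T @ Os           # R x N
            R_ -= lr * grad_R
    if it % 50 == 0:
        print(it, loss(P, Q, R_, Ss, Os, Ts))
```

In this alternating scheme each sub-problem is an ordinary linear least-squares regression in the matrix being updated, which is what makes gradient steps on one factor at a time well behaved.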
on GS11, and R=50 on KS14.

On GS11, the SVD and PV vectors have varying but mostly comparable performance, with PV having higher performance on 5 out of 8 models. However, on KS14, the PV vectors have better performance than the SVD vectors for every model by at least 0.05 points, which is consistent with prior work comparing count and predict vectors on these datasets (Milajevs et al., 2014).

The low-rank tensor models are also at least twice as fast to train as the full tensors: on a single core, training a rank-1 tensor takes about 5 seconds for each verb on average, ranks 5–50 each take between 1 and 2 minutes, and the full tensors each take about 4 minutes. Since a separate tensor is trained for each verb, this allows a substantial amount of time to be saved even when using the constrained vocabulary of 345 verbs.

6 Conclusion

We find that low-rank tensors for verbs achieve comparable or better performance than full-rank tensors on both verb disambiguation and sentence similarity tasks, while reducing the number of parameters that must be learned and stored for each verb by at least two orders of magnitude, and cutting training time in half.

While in our experiments the prediction-based vectors outperform the count-based vectors on both tasks for most models, Levy et al. (2015) indicate that tuning hyperparameters of the count-based vectors may be able to produce comparable performance. Regardless, we show that the low-rank tensors are able to achieve performance comparable to the full rank for both types of vectors. This is important for extending the model to many more grammatical types (including those with corresponding tensors of higher order than investigated here) to build a wide-coverage tensor-based semantic system using, for example, the CCG parser of Curran et al. (2007).

References

Marco Baroni and Roberto Zamparelli. 2010. Nouns are vectors, adjectives are matrices: Representing adjective-noun constructions in semantic space. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing (EMNLP 2010), Cambridge, Massachusetts.

Marco Baroni, Raffaela Bernardi, and Roberto Zamparelli. 2014. Frege in space: A program of compositional distributional semantics. Linguistic Issues in Language Technology, 9.

Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3:1137–1155.

William Blacoe, Elham Kashefi, and Mirella Lapata. 2013. A quantum-theoretic approach to distributional semantics. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2013), Atlanta, Georgia.

Stephen Clark. 2013. Type-driven syntax and semantics for composing meaning vectors. Quantum Physics and Linguistics: A Compositional, Diagrammatic Discourse, pages 359–377.

Bob Coecke, Mehrnoosh Sadrzadeh, and Stephen Clark. 2011. Mathematical foundations for a compositional distributional model of meaning. Linguistic Analysis, 36(1-4):345–384.

R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12:2493–2537.

James R. Curran, Stephen Clark, and Johan Bos. 2007. Linguistically motivated large-scale NLP with C&C and Boxer. In Proceedings of the Demonstration Session of the 45th Annual Meeting of the Association for Computational Linguistics (ACL 2007), Prague, Czech Republic.

James R. Curran. 2004. From distributional to semantic similarity. Ph.D. thesis, University of Edinburgh.
Dimitri Kartsaklis and Mehrnoosh Sadrzadeh. 2014. A study of entanglement in a categorical framework of natural language. In Proceedings of the 11th Workshop on Quantum Physics and Logic (QPL 2014), Kyoto, Japan, June.

Tamara G. Kolda and Brett W. Bader. 2009. Tensor decompositions and applications. SIAM Review, 51(3):455–500.

Quoc V. Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In Proceedings of the 31st International Conference on Machine Learning (ICML 2014), Beijing, China.

Tao Lei, Yu Xin, Yuan Zhang, Regina Barzilay, and Tommi Jaakkola. 2014. Low-rank tensors for scoring dependency structures. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL 2014), Baltimore, Maryland.

Tao Lei, Yuan Zhang, Lluis Marquez, Alessandro Moschitti, and Regina Barzilay. 2015. High-order low-rank tensors for semantic role labeling. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics – Human Language Technologies (NAACL-HLT 2015), Denver, Colorado.

Omer Levy, Yoav Goldberg, and Ido Dagan. 2015. Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics, 3:211–225.

Jean Maillard, Stephen Clark, and Edward Grefenstette. 2014. A type-driven tensor-based semantics for CCG. In Proceedings of the EACL 2014 Type Theory and Natural Language Semantics Workshop (TTNLS), Gothenburg, Sweden.

Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL 2014): System Demonstrations, pages 55–60, Baltimore, Maryland.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. In Neural Information Processing Systems (NIPS 2013).

Dmitrijs Milajevs, Dimitri Kartsaklis, Mehrnoosh Sadrzadeh, and Matthew Purver. 2014. Evaluating neural word representations in tensor-based compositional settings. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP 2014).

Guido Minnen, John Carroll, and Darren Pearce. 2001. Applied morphological processing of English. Natural Language Engineering, 7(3):207–223.

Jeff Mitchell and Mirella Lapata. 2008. Vector-based models of semantic composition. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-08: HLT), Columbus, Ohio.

Denis Paperno, Nghia The Pham, and Marco Baroni. 2014. A practical and linguistically-motivated approach to compositional distributional semantics. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL 2014), Baltimore, Maryland.

Tamara Polajnar and Stephen Clark. 2014. Improving distributional semantic vectors through context selection and normalisation. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2014), Gothenburg, Sweden.

Tamara Polajnar, Luana Fagarasan, and Stephen Clark. 2014. Reducing dimensions of tensors in type-driven distributional semantics. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP 2014), Doha, Qatar.

Hinrich Schütze. 1998. Automatic word sense discrimination. Computational Linguistics, 24(1):97–124.

Paul Smolensky and Geraldine Legendre. 2006. The Harmonic Mind: From Neural Computation to Optimality-Theoretic Grammar. MIT Press, Cambridge, MA.

Richard Socher, Brody Huval, Christopher D. Manning, and Andrew Y. Ng. 2012. Semantic compositionality through recursive matrix-vector spaces. In Proceedings of the 2012 Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL 2012), Jeju Island, Korea.

Peter D. Turney and Patrick Pantel. 2010. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37(1):141–188.

Tim Van de Cruys. 2010. A non-negative tensor factorization model for selectional preference induction. Journal of Natural Language Engineering, 16(4):417–437.

Fabio M. Zanzotto and Lorenzo Dell'Arciprete. 2012. Distributed tree kernels. In Proceedings of the 29th International Conference on Machine Learning (ICML 2012), Edinburgh, Scotland.

Matthew D. Zeiler. 2012. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701.