

Low-Rank Tensors for Verbs in Compositional Distributional Semantics

Daniel Fried, Tamara Polajnar, and Stephen Clark


University of Cambridge
Computer Laboratory
{df345,tp366,sc609}@cam.ac.uk

Abstract

Several compositional distributional semantic methods use tensors to model multi-way interactions between vectors. Unfortunately, the size of the tensors can make their use impractical in large-scale implementations. In this paper, we investigate whether we can match the performance of full tensors with low-rank approximations that use a fraction of the original number of parameters. We investigate the effect of low-rank tensors on the transitive verb construction, where the verb is a third-order tensor. The results show that, while the low-rank tensors require about two orders of magnitude fewer parameters per verb, they achieve performance comparable to, and occasionally surpassing, the unconstrained-rank tensors on sentence similarity and verb disambiguation tasks.

1 Introduction

Distributional semantic methods represent word meanings by their contextual distributions, for example by computing word-context co-occurrence statistics (Schütze, 1998; Turney and Pantel, 2010) or by learning vector representations for words as part of a context prediction model (Bengio et al., 2003; Collobert et al., 2011; Mikolov et al., 2013). Recent research has also focused on compositional distributional semantics (CDS): combining the distributional representations for words, often in a syntax-driven fashion, to produce distributional representations of phrases and sentences (Mitchell and Lapata, 2008; Baroni and Zamparelli, 2010; Socher et al., 2012; Zanzotto and Dell'Arciprete, 2012).

One method for CDS is the Categorial framework (Coecke et al., 2011; Baroni et al., 2014), where each word is represented by a tensor whose order is determined by the Categorial Grammar type of the word. For example, nouns are an atomic type represented by a vector, and adjectives are matrices that act as functions transforming a noun vector into another noun vector (Baroni and Zamparelli, 2010). A transitive verb is a third-order tensor that takes the noun vectors representing the subject and object and returns a vector in the sentence space (Polajnar et al., 2014).

However, a concrete implementation of the Categorial framework requires setting and storing the values, or parameters, defining these matrices and tensors. These parameters can be quite numerous for even low-dimensional sentence spaces. For example, a third-order tensor for a given transitive verb, mapping two 100-dimensional noun spaces to a 100-dimensional sentence space, would have 100^3 parameters in its full form. All of the more complex types have corresponding tensors of higher order, and therefore a barrier to the practical implementation of this framework is the large number of parameters required to represent an extended vocabulary and a variety of grammatical constructions.

We aim to reduce the size of the models by demonstrating that reduced-rank tensors, which can be represented in a form requiring fewer parameters, can capture the semantics of complex types as well as the full-rank tensors do. We base our experiments on the transitive verb construction, for which there are established tasks and datasets (Grefenstette and Sadrzadeh, 2011; Kartsaklis and Sadrzadeh, 2014).

Previous work on the transitive verb construction within the Categorial framework includes a two-step linear-regression method for the construction of the full verb tensors (Grefenstette et al., 2013) and a multi-linear regression method combined with a two-dimensional plausibility space (Polajnar et al., 2014).

Polajnar et al. (2014) also introduce several alternative ways of reducing the number of tensor parameters by using matrices. The best performing method uses two matrices, one representing the subject-verb interactions and the other the verb-object interactions. Some interaction between the subject and the object is re-introduced through a softmax layer. A similar method is presented in Paperno et al. (2014). Milajevs et al. (2014) use vectors generated by a neural language model to construct verb matrices and several different composition operators to generate the composed subject-verb-object sentence representation.

In this paper, we use tensor rank decomposition (Kolda and Bader, 2009) to represent each verb's tensor as a sum of tensor products of vectors. We learn the component vectors and apply the composition without ever constructing the full tensors, and thus we are able to improve on both memory usage and efficiency. This approach follows recent work on using low-rank tensors to parameterize models for dependency parsing (Lei et al., 2014) and semantic role labelling (Lei et al., 2015). Our work applies the same tensor rank decompositions, and similar optimization algorithms, to the task of constructing a syntax-driven model for CDS. Although we focus on the Categorial framework, the low-rank decomposition methods are also applicable to other tensor-based semantic models including Van de Cruys (2010), Smolensky and Legendre (2006), and Blacoe et al. (2013).

2 Model

Tensor Models for Verbs

We model each transitive verb as a bilinear function mapping subject and object noun vectors, each of dimensionality N, to a single sentence vector of dimensionality S (Coecke et al., 2011; Maillard et al., 2014) representing the composed subject-verb-object (SVO) triple. Each transitive verb has its own third-order tensor, which defines this bilinear function. Consider a verb V with associated tensor V ∈ R^{S×N×N}, and vectors s ∈ R^N, o ∈ R^N for subject and object nouns, respectively. Then the compositional representation for the subject, verb, and object is a vector V(s, o) ∈ R^S, produced by applying tensor contraction (the higher-order analogue of matrix multiplication) to the verb tensor and two noun vectors. The lth component of the vector for the SVO triple is given by

    V(s, o)_l = Σ_{j,k} V_{ljk} o_k s_j        (1)

We aim to learn distributional vectors s and o for subjects and objects, and tensors V for verbs, such that the output vectors V(s, o) are distributional representations of the entire SVO triple. While there are several possible definitions of the sentence space (Clark, 2013; Baroni et al., 2014), we follow previous work (Grefenstette et al., 2013) by using a contextual sentence space consisting of content words that occur within the same sentences as the SVO triple.

Low-Rank Tensor Representations

Following Lei et al. (2014), we represent each verb's tensor using a low-rank canonical polyadic (CP) decomposition to reduce the number of parameters that must be learned during training. As a higher-order analogue to singular value decomposition for matrices, CP decomposition factors a tensor into a sum of R tensor products of vectors [1]. Given a third-order tensor V ∈ R^{S×N×N}, the CP decomposition of V is:

    V = Σ_{r=1}^{R} P_r ⊗ Q_r ⊗ R_r        (2)

where P ∈ R^{R×S}, Q ∈ R^{R×N}, R ∈ R^{R×N} are parameter matrices, P_r gives the rth row of matrix P, and ⊗ is the tensor product.

[1] However, unlike matrix singular value decomposition, the component vectors in the CP decomposition are not necessarily orthonormal.

The smallest R that allows the tensor to be expressed as this sum of outer products is the rank of the tensor (Kolda and Bader, 2009). By fixing a value for R that is sufficiently small compared to S and N (forcing the verb tensor to have rank of at most R), and directly learning the parameters of the low-rank approximation using gradient-based optimization, we learn a low-rank tensor requiring fewer parameters without ever having to store the full tensor.

In addition to reducing the number of parameters, representing tensors in this form allows us to formulate the verb tensor's action on noun vectors as matrix multiplication. For a tensor in the form of Eq. (2), the output SVO vector is given by

    V(s, o) = P^T (Qs ⊙ Ro)        (3)

where ⊙ is the elementwise vector product.
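As an illustration of Eqs. (1)-(3), the following NumPy sketch (ours, not the paper's implementation; the dimensions, rank, and random factor matrices are placeholders) composes an SVO triple directly from the CP factors and checks that this matches contracting the reconstructed full tensor, which never needs to be materialized in practice.

```python
import numpy as np

S, N, R = 100, 100, 20        # sentence dim, noun dim, CP rank (placeholders)
rng = np.random.default_rng(0)

# Low-rank CP factors of the verb tensor: V = sum_r P_r (x) Q_r (x) R_r
P = rng.normal(size=(R, S))   # rows map into the sentence space
Q = rng.normal(size=(R, N))   # rows interact with the subject vector
R_ = rng.normal(size=(R, N))  # rows interact with the object vector

s = rng.normal(size=N)        # subject noun vector
o = rng.normal(size=N)        # object noun vector

# Eq. (3): compose without ever building the S x N x N tensor.
# The factors hold R(S + 2N) = 20 * 300 = 6,000 parameters instead of
# the 100^3 = 1,000,000 parameters of the full tensor.
svo_lowrank = P.T @ ((Q @ s) * (R_ @ o))

# Eq. (2): reconstruct the full tensor, then apply Eq. (1) as a check.
V = np.einsum('rl,rj,rk->ljk', P, Q, R_)      # V[l, j, k]
svo_full = np.einsum('ljk,j,k->l', V, s, o)    # sum_{j,k} V_ljk s_j o_k

assert np.allclose(svo_lowrank, svo_full)
```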

3 Training

We train the compositional model for verbs in three steps: extracting transitive verbs and their subject and object nouns from corpus data, producing distributional vectors for the nouns and the SVO triples, and then learning parameters of the verb functions, which map the nouns to the SVO triple vectors.

Corpus Data

We extract SVO triples from an October 2013 download of Wikipedia, tokenized using Stanford CoreNLP (Manning et al., 2014), lemmatized with the Morpha lemmatizer (Minnen et al., 2001), and parsed using the C&C parser (Curran et al., 2007). We filter the SVO triples to a set containing 345 distinct verbs: the verbs from our test datasets, along with some additional high-frequency verbs included to produce more representative sentence spaces. For each verb, we selected up to 600 triples which occurred more than once and contained subject and object nouns that occurred at least 100 times (to allow sufficient context to produce a distributional representation for the triple). This resulted in approximately 150,000 SVO triples overall.

Distributional Vectors

We produce two types of distributional vectors for nouns and SVO triples using the Wikipedia corpus. Since these methods for producing distributional vectors for the SVO triples require that the triples occur in a corpus of text, the methods are not a replacement for a compositional framework that can produce representations for previously unseen expressions. However, they can be used to generate data to train such a model, as we will describe.

1) Count vectors (SVD): we count the number of times each noun or SVO triple co-occurs with each of the 10,000 most frequent words (excluding stopwords) in the Wikipedia corpus, using sentences as context boundaries. If the verb in the SVO triple is itself a content word, we do not include it as context for the triple. This produces one set of context vectors for nouns and another for SVO triples. We weight entries in these vectors using the t-test weighting scheme (Curran, 2004; Polajnar and Clark, 2014), and then reduce the vectors to 100 dimensions via singular value decomposition (SVD), decomposing the noun vectors and SVO vectors separately.

2) Prediction vectors (PV): we train vector embeddings for nouns and SVO triples by adapting the Paragraph Vector distributed bag of words method of Le and Mikolov (2014), an extension of the skip-gram model of Mikolov et al. (2013). In our experiments, given an SVO triple, the model must predict contextual words sampled from all sentences containing that triple. In the process, the model learns vector embeddings for both the SVO triples and for the words in the sentences such that SVO vectors have a high dot product with their contextual word vectors. While previous work (Milajevs et al., 2014) has used prediction-based vectors for words in a tensor-based CDS model, ours uses prediction-based vectors for both words and phrases to train a tensor regression model.

We learn 100-dimensional vectors for nouns and SVO triples with a modified version of word2vec [2], using the hierarchical sampling method with the default hyperparameters and 20 iterations through the training data.

[2] https://groups.google.com/d/msg/word2vec-toolkit/Q49FIrNOQRo/J6KG8mUj45sJ

Training Methods

We learn the tensor V of parameters for a given verb V using multi-linear regression, treating the noun vectors s and o as input and the composed SVO triple vector V(s, o) as the regression output. Let M_V be the number of training instances for V, where the ith instance is a triple of vectors (s^(i), o^(i), t^(i)), which are the distributional vectors for the subject noun, object noun, and the SVO triple, respectively. We aim to learn a verb tensor V (either in full or in decomposed, low-rank form) that minimizes the mean of the squared residuals between the predicted SVO vectors V(s^(i), o^(i)) and those vectors obtained distributionally from the corpus, t^(i). Specifically, we attempt to minimize the following loss function:

    L(V) = (1 / M_V) Σ_{i=1}^{M_V} ||V(s^(i), o^(i)) − t^(i)||_2^2        (4)

V(s, o) is given by Eq. (1) for full tensors, and by Eq. (3) for tensors represented in low-rank form.

In both the low-rank and full-rank tensor learning, we use mini-batch ADADELTA optimization (Zeiler, 2012) up to a maximum of 500 iterations through the training data, which we found to be sufficient for convergence for every verb. Rather than placing a regularization penalty on the tensor parameters, we use early stopping if the loss increases on a validation set consisting of 10% of the available SVO triples for each verb.
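To make the objective in Eq. (4) concrete for the low-rank form, here is a minimal NumPy sketch (ours, not the paper's code) of the loss and of one gradient step on P while Q and R are held fixed, as in the alternating scheme described below. Plain gradient descent stands in for mini-batch ADADELTA, early stopping is omitted, and the training triples are random placeholders.

```python
import numpy as np

S, N, R, M = 100, 100, 20, 500    # dims, CP rank, number of training triples (placeholders)
rng = np.random.default_rng(0)

# Training data for one verb: subject vectors, object vectors, target SVO vectors
subj = rng.normal(size=(M, N))
obj = rng.normal(size=(M, N))
targets = rng.normal(size=(M, S))

# Randomly initialised CP factors of the verb tensor
P = rng.normal(size=(R, S)) * 0.01
Q = rng.normal(size=(R, N)) * 0.01
R_ = rng.normal(size=(R, N)) * 0.01

def predict(P, Q, R_, subj, obj):
    """Eq. (3) applied to a batch: each row is a predicted SVO vector."""
    H = (subj @ Q.T) * (obj @ R_.T)   # M x R
    return H @ P                       # M x S

def loss(P, Q, R_, subj, obj, targets):
    """Eq. (4): mean squared residual over the verb's training triples."""
    resid = predict(P, Q, R_, subj, obj) - targets
    return np.mean(np.sum(resid ** 2, axis=1))

# One alternating-optimization step: a gradient update to P with Q and R_ fixed.
# (The paper cycles through P, Q, R_ in turn and uses ADADELTA with early stopping.)
lr = 0.01
H = (subj @ Q.T) * (obj @ R_.T)        # M x R, constant while updating P
resid = H @ P - targets                 # M x S
grad_P = (2.0 / M) * H.T @ resid        # R x S gradient of Eq. (4) w.r.t. P
P -= lr * grad_P
print(loss(P, Q, R_, subj, obj, targets))
```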

For low-rank tensors, we compare seven different maximal ranks: R=1, 5, 10, 20, 30, 40 and 50. To learn the parameters of the low-rank tensors, we use an alternating optimization method (Kolda and Bader, 2009; Lei et al., 2014): performing gradient descent on one of the parameter matrices (for example P) to minimize the loss function while holding the other two fixed (Q and R), then repeating for the other parameter matrices in turn. The parameter matrices are randomly initialized [3].

[3] Since the low-rank tensor loss is non-convex, we suspect that parameter initialization may produce better results.

4 Evaluation

We compare the performance of the low-rank tensors against full tensors on two tasks. Both tasks require the model to rank pairs of sentences, each consisting of a subject, transitive verb, and object, by the semantic similarity of the sentences in the pair. The gold standard ranking is given by similarity scores provided by human evaluators; the scores are not averaged among the annotators. The model ranking is evaluated against the ranking from the gold standard similarity judgements using Spearman's ρ.

The verb disambiguation task (GS11) (Grefenstette and Sadrzadeh, 2011) involves distinguishing between senses of an ambiguous verb, given subject and object nouns as context. The dataset consists of 200 sentence pairs, where the two sentences in each pair have the same subject and object but differ in the verb. Each of these pairs was ranked by human evaluators on a 1-7 similarity scale, so that properly disambiguated pairs (e.g. author write book – author publish book) have higher similarity scores than improperly disambiguated pairs (e.g. author write book – author spell book).

The transitive sentence similarity dataset (KS14) (Kartsaklis and Sadrzadeh, 2014) consists of 72 subject-verb-object sentences arranged into 108 sentence pairs. As in GS11, each pair has a gold standard semantic similarity score on a 1-7 scale. For example, the pair medication achieve result – drug produce effect has a high similarity rating, while author write book – delegate buy land has a low rating. In this dataset, however, the two sentences in each pair have no lexical overlap: neither subjects, objects, nor verbs are shared.

             GS11            KS14          # tensor
          SVD    PV      SVD    PV         params.
  Add.    0.13   0.14    0.55*  0.56*      –
  Mult.   0.13   0.14    0.09   0.27       –
  R=1     0.10   0.05    0.18   0.30       300
  R=5     0.26   0.30    0.28   0.40       1.5K
  R=10    0.29   0.32    0.26   0.45       3K
  R=20    0.31   0.34    0.39   0.44       6K
  R=30    0.28   0.33    0.32   0.46       9K
  R=40    0.32   0.30    0.31   0.52*      12K
  R=50    0.34*  0.32    0.42*  0.51       15K
  Full    0.29   0.36*   0.41   0.52*      1M

Table 1: Model performance on the verb disambiguation (GS11) and sentence similarity (KS14) tasks, given by Spearman's ρ, and the number of parameters needed to represent each verb's tensor. The highest tensor result for each task and vector set is marked with an asterisk, as is the baseline when it outperforms the tensor methods.

5 Results

Table 1 displays correlations between the systems' scores and human SVO similarity judgements on the verb disambiguation (GS11) and sentence similarity (KS14) tasks, for both the count (SVD) and prediction vectors (PV). We also give results for simple composition of word vectors using elementwise addition and multiplication (Mitchell and Lapata, 2008), using verb vectors produced in the same manner as for nouns. Consistent with prior work, the tensor-based models are surpassed by vector addition on the KS14 dataset (Milajevs et al., 2014), but perform better than both addition and multiplication on the GS11 dataset [4].

[4] The results in this table are not directly comparable with Milajevs et al. (2014), who compare against averaged annotator scores. Comparing against averaged annotator scores, our best result on GS11 is 0.47 for the full-rank tensor with PV vectors, and our best non-addition result on KS14 is 0.68 for the R=40 tensor with PV vectors (the best result is addition with PV vectors, which achieves 0.71). These results exceed the scores reported for tensor-based models by Milajevs et al. (2014).

Unsurprisingly, the rank-1 tensor has the lowest performance for both tasks and vector sets, and performance generally increases as we increase the maximal rank R. The full tensor achieves the best, or tied for the best, performance on both tasks when using the PV vectors. However, for the SVD vectors, low-rank tensors surpass the performance of the full-rank tensor for R=40 and R=50 on GS11, and R=50 on KS14.
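A rough sketch of the evaluation protocol described in Section 4 is given below. It is not the paper's code: it assumes cosine similarity between the composed sentence vectors (the paper does not state the similarity measure explicitly), and the verb factors, noun vectors, and gold scores are random placeholders standing in for the GS11/KS14 items.

```python
import numpy as np
from scipy.stats import spearmanr

def compose(P, Q, R_, s, o):
    """Eq. (3): compose an SVO triple from the verb's CP factors."""
    return P.T @ ((Q @ s) * (R_ @ o))

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def evaluate(pairs, gold_scores):
    """Spearman's rho between model similarities and human judgements."""
    model_scores = []
    for (s1, verb1, o1), (s2, verb2, o2) in pairs:
        v1 = compose(*verb1, s1, o1)
        v2 = compose(*verb2, s2, o2)
        model_scores.append(cosine(v1, v2))
    rho, _ = spearmanr(model_scores, gold_scores)
    return rho

# Toy usage with random vectors standing in for the GS11/KS14 data
S, N, R = 100, 100, 20
rng = np.random.default_rng(0)
verb_a = tuple(rng.normal(size=d) for d in [(R, S), (R, N), (R, N)])
verb_b = tuple(rng.normal(size=d) for d in [(R, S), (R, N), (R, N)])
pairs = [((rng.normal(size=N), verb_a, rng.normal(size=N)),
          (rng.normal(size=N), verb_b, rng.normal(size=N))) for _ in range(10)]
gold = rng.uniform(1, 7, size=10)   # placeholder for 1-7 human ratings
print(evaluate(pairs, gold))
```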

On GS11, the SVD and PV vectors have varying but mostly comparable performance, with PV having higher performance on 5 out of 8 models. However, on KS14, the PV vectors have better performance than the SVD vectors for every model by at least 0.05 points, which is consistent with prior work comparing count and predict vectors on these datasets (Milajevs et al., 2014).

The low-rank tensor models are also at least twice as fast to train as the full tensors: on a single core, training a rank-1 tensor takes about 5 seconds for each verb on average, ranks 5-50 each take between 1 and 2 minutes, and the full tensors each take about 4 minutes. Since a separate tensor is trained for each verb, this allows a substantial amount of time to be saved even when using the constrained vocabulary of 345 verbs.

6 Conclusion

We find that low-rank tensors for verbs achieve comparable or better performance than full-rank tensors on both verb disambiguation and sentence similarity tasks, while reducing the number of parameters that must be learned and stored for each verb by at least two orders of magnitude, and cutting training time in half.

While in our experiments the prediction-based vectors outperform the count-based vectors on both tasks for most models, Levy et al. (2015) indicate that tuning hyperparameters of the count-based vectors may be able to produce comparable performance. Regardless, we show that the low-rank tensors are able to achieve performance comparable to the full rank for both types of vectors. This is important for extending the model to many more grammatical types (including those with corresponding tensors of higher order than investigated here) to build a wide-coverage tensor-based semantic system using, for example, the CCG parser of Curran et al. (2007).

Acknowledgments

Daniel Fried is supported by a Churchill Scholarship. Tamara Polajnar is supported by ERC Starting Grant DisCoTex (306920). Stephen Clark is supported by ERC Starting Grant DisCoTex (306920) and EPSRC grant EP/I037512/1. We would like to thank Laura Rimell and the anonymous reviewers for their comments.

References

Marco Baroni and Roberto Zamparelli. 2010. Nouns are vectors, adjectives are matrices: Representing adjective-noun constructions in semantic space. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing (EMNLP 2010), Cambridge, Massachusetts.

Marco Baroni, Raffaela Bernardi, and Roberto Zamparelli. 2014. Frege in space: A program of compositional distributional semantics. Linguistic Issues in Language Technology, 9.

Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3:1137–1155.

William Blacoe, Elham Kashefi, and Mirella Lapata. 2013. A quantum-theoretic approach to distributional semantics. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2013), Atlanta, Georgia.

Stephen Clark. 2013. Type-driven syntax and semantics for composing meaning vectors. Quantum Physics and Linguistics: A Compositional, Diagrammatic Discourse, pages 359–377.

Bob Coecke, Mehrnoosh Sadrzadeh, and Stephen Clark. 2011. Mathematical foundations for a compositional distributional model of meaning. Linguistic Analysis, 36(1-4):345–384.

R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12:2493–2537.

James R. Curran, Stephen Clark, and Johan Bos. 2007. Linguistically motivated large-scale NLP with C&C and Boxer. In Proceedings of the Demonstration Session of the 45th Annual Meeting of the Association for Computational Linguistics (ACL 2007), Prague, Czech Republic.

James R. Curran. 2004. From distributional to semantic similarity. Ph.D. thesis, University of Edinburgh.

Edward Grefenstette and Mehrnoosh Sadrzadeh. 2011. Experimenting with transitive verbs in a DisCoCat. In Proceedings of the 2011 Workshop on Geometrical Models of Natural Language Semantics (GEMS 2011), Edinburgh, Scotland.

Edward Grefenstette, Georgiana Dinu, Yao-Zhong Zhang, Mehrnoosh Sadrzadeh, and Marco Baroni. 2013. Multi-step regression learning for compositional distributional semantics. In Proceedings of the 10th International Conference on Computational Semantics (IWCS 2013), Potsdam, Germany.

Dimitri Kartsaklis and Mehrnoosh Sadrzadeh. 2014. A study of entanglement in a categorical framework of natural language. In Proceedings of the 11th Workshop on Quantum Physics and Logic (QPL 2014), Kyoto, Japan, June.

Tamara G. Kolda and Brett W. Bader. 2009. Tensor decompositions and applications. SIAM Review, 51(3):455–500.

Quoc V. Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In Proceedings of the 31st International Conference on Machine Learning (ICML 2014), Beijing, China.

Tao Lei, Yu Xin, Yuan Zhang, Regina Barzilay, and Tommi Jaakkola. 2014. Low-rank tensors for scoring dependency structures. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL 2014), Baltimore, Maryland.

Tao Lei, Yuan Zhang, Lluis Marquez, Alessandro Moschitti, and Regina Barzilay. 2015. High-order low-rank tensors for semantic role labeling. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics – Human Language Technologies (NAACL-HLT 2015), Denver, Colorado.

Omer Levy, Yoav Goldberg, and Ido Dagan. 2015. Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics, 3:211–225.

Jean Maillard, Stephen Clark, and Edward Grefenstette. 2014. A type-driven tensor-based semantics for CCG. In Proceedings of the EACL 2014 Type Theory and Natural Language Semantics Workshop (TTNLS), Gothenburg, Sweden.

Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL 2014): System Demonstrations, pages 55–60, Baltimore, Maryland.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. In Neural Information Processing Systems (NIPS 2013).

Dmitrijs Milajevs, Dimitri Kartsaklis, Mehrnoosh Sadrzadeh, and Matthew Purver. 2014. Evaluating neural word representations in tensor-based compositional settings. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP 2014).

Guido Minnen, John Carroll, and Darren Pearce. 2001. Applied morphological processing of English. Natural Language Engineering, 7(3):207–223.

Jeff Mitchell and Mirella Lapata. 2008. Vector-based models of semantic composition. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-08: HLT), Columbus, Ohio.

Denis Paperno, Nghia The Pham, and Marco Baroni. 2014. A practical and linguistically-motivated approach to compositional distributional semantics. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL 2014), Baltimore, Maryland.

Tamara Polajnar and Stephen Clark. 2014. Improving distributional semantic vectors through context selection and normalisation. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2014), Gothenburg, Sweden.

Tamara Polajnar, Luana Fagarasan, and Stephen Clark. 2014. Reducing dimensions of tensors in type-driven distributional semantics. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP 2014), Doha, Qatar.

Hinrich Schütze. 1998. Automatic word sense discrimination. Computational Linguistics, 24(1):97–124.

Paul Smolensky and Geraldine Legendre. 2006. The Harmonic Mind: From neural computation to optimality-theoretic grammar. MIT Press, Cambridge, MA.

Richard Socher, Brody Huval, Christopher D. Manning, and Andrew Y. Ng. 2012. Semantic compositionality through recursive matrix-vector spaces. In Proceedings of the 2012 Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL 2012), Jeju Island, Korea.

Peter D. Turney and Patrick Pantel. 2010. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37(1):141–188.

Tim Van de Cruys. 2010. A non-negative tensor factorization model for selectional preference induction. Journal of Natural Language Engineering, 16(4):417–437.

Fabio M. Zanzotto and Lorenzo Dell'Arciprete. 2012. Distributed tree kernels. In Proceedings of the 29th International Conference on Machine Learning (ICML 2012), Edinburgh, Scotland.

Matthew D. Zeiler. 2012. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701.
