Learning Word Representations With Hierarchical Sparse Coding

Learning Word Representations
with Hierarchical Sparse Coding
Dani Yogatama Manaal Faruqui Chris Dyer Noah A. Smith

Language Technologies Institute
Carnegie Mellon University
Pittsburgh, PA 15213, USA
{dyogatama,mfaruqui,cdyer,nasmith}@cs.cmu.edu
Abstract
We propose a new method for learning word representations using hierarchical

regularization in sparse coding inspired by the linguistic study of word meanings.
We show an efficient learning algorithm based on stochastic proximal methods
that is significantly faster than previous approaches, making it possible to perform
hierarchical sparse coding on a corpus of billions of word tokens. Experiments on
various benchmark tasks—word similarity ranking, analogies, sentence completion,
and sentiment analysis—demonstrate that the method outperforms or is competitive
with state-of-the-art methods. Our word representations are available at http:
//www.ark.cs.cmu.edu/dyogatam/wordvecs/.
1 Introduction
When applying machine learning to text, the classic categorical representation of words as indices
of a vocabulary fails to capture syntactic and semantic similarities that are easily discoverable in
data (e.g., pretty, beautiful, and lovely have similar meanings, opposite to unattractive, ugly, and
repulsive). In contrast, recent approaches to word representation learning apply neural networks to
obtain dense, low-dimensional, continuous embeddings of words (Bengio et al., 2003; Mnih and Teh,
2012; Collobert et al., 2011; Huang et al., 2012; Mikolov et al., 2010, 2013b; Lebret and Collobert,
2014).
In this work, we propose an alternative approach based on decomposition of a high-dimensional
matrix capturing surface statistics of association between a word and its “contexts” with sparse coding.
As in past work, contexts are words that occur nearby in running text (Turney and Pantel, 2010).
Learning is performed by minimizing a reconstruction loss function to find the best factorization of
the input matrix.
The key novelty in our method is to govern the relationships among dimensions of the learned word
vectors, introducing a hierarchical organization imposed through a structured penalty known as the
group lasso (Yuan and Lin, 2006). The idea of regulating the order in which variables enter a model
was first proposed by Zhao et al. (2009), and it has since been shown useful for other applications
(Jenatton et al., 2011). Our approach is motivated by coarse-to-fine organization of words’ meanings
often found in the field of lexical semantics (see §2.2 for a detailed description), which mirrors
evidence for distributed nature of hierarchical concepts in the brain (Raposo et al., 2012). Related
ideas have also been explored in syntax (Petrov and Klein, 2008). It also has a foundation in cognitive
science, where hierarchical structures have been proposed as representations of semantic cognition
(Collins and Quillian, 1969). We show a stochastic proximal algorithm for hierarchical sparse coding
that is suitable for problems where the input matrix is very large and sparse. Our algorithm enables
application of hierarchical sparse coding to learn word representations from a corpus of billions of
word tokens and 400,000 word types.
1
On standard evaluation tasks—word similarity ranking, analogies, sentence completion, and sen-
timent analysis—we find that our method outperforms or is competitive with the best published
representations.
2 Model
2.1 Background and Notation
The observable representation of word v is taken to be a vector xv 2 RC of cooccurrence statistics

with C different contexts. Most commonly, each context is a possible neighboring word within a fixed
window.1 Following many others, we let xv,c be the pointwise mutual information (PMI) between
the occurrence of context word c within a five-word window of an occurrence of word v (Turney and
Pantel, 2010; Murphy et al., 2012; Faruqui and Dyer, 2014).
In sparse coding, the goal is to represent each input vector x 2 RC as a sparse linear combination of
basis vectors. Given a stacked input matrix X 2 RC⇥V , where V is the number of words, we seek to
minimize:
arg min kX DAk22 + ⌦(A), (1)
D2D,A
where D 2 RC⇥M is the dictionary of basis vectors, D is the set of matrices whose columns have
small (e.g., less than or equal to one) `2 norm, A 2 RM ⇥V is the code matrix, is a regularization
hyperparameter, and ⌦ is the regularizer. Here, we use the squared loss for the reconstruction error,
but other loss functions could also be used (Lee et al., 2009). Note that it is not necessary, although
typical, for M to be less than C (when M > C, it is often called an overcomplete representation).
The most common regularizer is the `1 penalty, which results in sparse codes. While structured
regularizers are associated with sparsity as well (e.g., the group lasso encourages group sparsity), our
motivation is to use ⌦ to encourage a coarse-to-fine organization of latent dimensions of the learned
representations of words.
2.2 Structured Regularization for Word Representations
For ⌦(A), we design a forest-structured regularizer that encourages the model to use some dimensions
in the code space before using other dimensions. Consider the trees in Figure 1. In this example,
there are 13 variables in each tree, and 26 variables in total (i.e., M = 26), each corresponding to a
latent dimension for one particular word. These trees describe the order in which variables “enter the
model” (i.e., take nonzero values). In general, a node may take a nonzero value only if its ancestors
also do. For example, nodes 3 and 4 may only be nonzero if nodes 1 and 2 are also nonzero. Our
regularizer for column v of A, denoted by av (in this example, av 2 R26 ), for the trees in Figure 1 is:
26
X
⌦(av ) = khav,i , av,Descendants(i) ik2
i=1
where Descendants(i) returns the (possibly empty) set of descendants of node i. Jenatton et al.
(2011) proposed a related penalty with only one tree for learning image and document representations.
Let us analyze why organizing the code space this way is helpful in learning better word repre-
sentations. Recall that the goal is to have a good dictionary D and code matrix A. We apply the
structured penalty to each column of A. When we use the same structured penalty in these columns,
we encode an additional shared constraint that the dimensions of av that correspond to top level
nodes should focus on “general” contexts that are present in most words. In our case, this corresponds
to contexts with extreme PMI values for most words, since they are the ones that incur the largest
losses. As we go down the trees, more word-specific contexts can then be captured. As a result, we
have better organization across words when learning their representations, which also translates to
a more structured dictionary D. Contrast this with the case when we use unstructured regularizers
1
Others include: global context (Huang et al., 2012), multilingual context (Faruqui and Dyer, 2014),
geographic context (Bamman et al., 2014), brain activation data (Fyshe et al., 2014), and second-order context
(Schutze, 1998).
2
1 14
2 5 8 11 15 18 21 24
3 4 6 7 9 10 12 13 16 17 19 20 22 23 25 26
Figure 1: An example of a regularization forest that governs the order in which variables enter the model. In this
example, 1 needs to be selected (nonzero) for 2, 3, . . . , 13 to be selected. However, 1, 2, . . . , 13 have nothing to
do with the variables in the second tree: 14, 15, . . . , 26. See text for details.
that penalize each dimension of A independently (e.g., lasso). In this case, each dimension of av
has more flexibility to pay attention to any contexts (the only constraint that we encode is that the
cardinality of the model should be small). We hypothesize that this is less appropriate for learning
word representations, since the model has excessive freedom when learning A on noisy PMI values,
which translates to poor D.
The intuitive motivation for our regularizer comes from the field of lexical semantics, which often
seeks to capture the relationships between words’ meanings in hierarchically-organized lexicons. The
best-known example is WordNet (Miller, 1995). Words with the same (or close) meanings are grouped
together (e.g., professor and prof are synonyms), and fine-grained meaning groups (“synsets”) are
nested under coarse-grained ones (e.g., professor is a hyponym of academic). Our hierarchical sparse
coding approach is still several steps away from inducing such a lexicon, but it seeks to employ the
dimensions of a distributed word representation scheme in a similar coarse-to-fine way. In cognitive
science, such hierarchical organization of semantic representations was first proposed by Collins and
Quillian (1969).
2.3 Learning
Learning is accomplished by minimizing the function in Eq. 1, with the group lasso regularization
function described in §2.2. The function is not convex with respect to D and A, but it is convex with
respect to each when the other is fixed. Alternating minimization routines have been shown to work
reasonably well in practice for such problems (Lee et al., 2007), but they are too expensive here due
to:
• The size of X 2 RC⇥V (C and V are each on the order of 105 ).

• The many overlapping groups in the structured regularizer ⌦(A).
One possible solution is based on the online dictionary learning method of Mairal et al. (2010). For
T iterations, we:
• Sample a mini-batch of words and (in parallel) solve for each one’s a using the alternating
directions method of multipliers, shown to work well for overlapping group lasso problems (Qin
and Goldfarb, 2012; Yogatama and Smith, 2014).2
• Update D using the block coordinate descent algorithm of Mairal et al. (2010).
Finally, we parallelize solving for all columns of A, which are separable once D is fixed. In our
experiments, we use this algorithm for a medium-sized corpus.
The main difficulty of learning word representations with hierarchical sparse coding is that the size of
the input matrix can be very large. When we use neighboring words as the contexts, the numbers
of rows and columns are the size of the vocabulary. For a medium-sized corpus with hundreds of
millions of word tokens, we typically have one or two hundred thousand unique words, so the above
2
Since our groups form tree structures, other methods such as FISTA (Jenatton et al., 2011) could also be
used.
3
Algorithm 1 Fast algorithm for learning word representations with the forest regularizer.
Input: matrix X, regularization constant and ⌧ , learning rate sequences ⌘0 , . . . , ⌘T , number of
iterations T
Initialize D0 and A0 randomly
for t = 1, . . . , T {can be parallelized, see text for details} do
Sample xc,v with probability proportional to its (absolute) value
dc = dc + 2⌘t (av (xc,v dc · av ) ⌧ dc )
av = av + 2⌘t (dc (xc,v dc · av ))
for m = 1, . . . , M do
prox⌦m , (av ), where ⌦m = khav,m , av,Descendants(m) ik2
end for
end for
algorithm is still applicable. For a large corpus with billions of word tokens, this number can easily
double or triple, making learning very expensive. We propose an alternative learning algorithm for
such cases.
We rewrite Eq. 1 as:
X X
arg min (xc,v dc · av )2 + ⌦(A) + ⌧ kdm k22
D,A c,v m
where (abusing notation) dc denotes the c-th row vector of D and dm denotes the m-th column
vector of D (recall that D 2 RC⇥M ). Instead of considering all elements of the input matrix, our
algorithm approximates the solution by using only non-zero entries in the input matrix X. At each
iteration, we sample a non-zero entry xc,v and perform gradient updates to the corresponding row dc
and column av .
We directly penalize columns of D by their squared `2 norm as an alternative to constraining columns
of D to have unit `2 norm. The advantage of this transformation is that we have eliminated a
projection step for columns of D. Instead, we can include the gradient of the penalty term in the
stochastic gradient update. We apply the proximal operator associated with ⌦(av ) as a composition
of elementary proximal operators with no group overlaps, similar to Jenatton et al. (2011). This can
be done by recursively visiting each node of a tree and applying the proximal operator for the group
lasso penalty associated with that node (i.e., the group lasso penalty where the node is the topmost
node and the group consists of the node and all of its descendants). The proximal operator associated
with node m, denoted by prox⌦m , , is simply the block-thresholding operator for node m and all its
descendants.
Since each non-zero entry xc,v only depends on dc and av , we can sample multiple non-zero entries
and perform the updates in parallel as long as they do not share c and v. In our case, where C and V
are on the order of hundreds of thousands and we only have tens or hundreds of processors, finding
non-zero elements that do not violate this constraint is easy. There are typically a huge number of
non-zero entries (on the order of billions). Using a sampling procedure that favors entries with higher
(absolute) PMI values can lead to reasonably good word representations faster. We sample a non-zero
entry with probability proportional to its absolute value. This also justifies using only the non-zero
entries, since the probability of sampling zero entries is always zero.3 We summarize our learning
algorithm in Algorithm 1.
3 Experiments
We present a controlled comparison of the forest regularizer against several strong baseline word
representations learned on a fixed dataset, across several tasks. In §3.4 we compare to publicly
available word vectors trained on different data.
3
In practice, we can use a faster approximation of this sampling procedure by uniformly sampling a non-zero
entry and multiplying its gradient by a scaling constant proportional to its absolute PMI value.
4
3.1 Setup and Baselines
We use the WMT-2011 English news corpus as our training data.4 The corpus contains about 15
million sentences and 370 million words. The size of our vocabulary is 180,834.5
In our experiments, we use forests similar to those in Figure 1 to organize the latent word space.
Note that the example has 26 nodes (2 trees). We choose to evaluate performance with M = 52 (4
trees) and M = 520 (40 trees).6 We denote the sparse coding method with regular `1 penalty by SC,
and our method with structured regularization (§2.2) by FOREST. We set = 0.1. In this first set
of experiments with a medium-sized corpus, we use the online learning algorithm of Mairal et al.
(2010).
We compare with the following baseline methods:
• Turney and Pantel (2010): principal component analysis (PCA) by truncated singular value
decomposition on X> . Note that this is also the same as minimizing the squared reconstruction
loss in Eq. 1 without any penalty on A.
• Mikolov et al. (2010): a recursive neural network (RNN) language model. We obtain an imple-
mentation from http://rnnlm.org/.
• Mnih and Teh (2012): a log bilinear model that predicts a word given its context, trained using
noise-contrastive estimation (NCE, Gutmann and Hyvarinen, 2010). We use our own implementa-
tion for this model.
• Mikolov et al. (2013b): a log bilinear model that predicts a word given its context (continuous
bag of words, CBOW), trained using negative sampling (Mikolov et al., 2013a). We obtain an
implementation from https://code.google.com/p/word2vec/.
• Mikolov et al. (2013b): a log bilinear model that predicts context words given a target word (skip
gram, SG), trained using negative sampling (Mikolov et al., 2013a). We obtain an implementation
from https://code.google.com/p/word2vec/.
Our focus here is on comparisons of model architectures. For a fair comparison, we train all competing
methods on the same corpus using a context window of five words (left and right). For the baseline
methods, we use default settings in the provided implementations (or papers, when implementations
are not available and we reimplement the methods). We also trained the last two baseline methods
with hierarchical softmax using a binary Huffman tree instead of negative sampling; consistent with
Mikolov et al. (2013a), we found that negative sampling performs better and relegate hierarchical
softmax results to supplementary materials.
3.2 Evaluation
We evaluate on the following benchmark tasks.

Word similarity The first task evaluates how well the representations capture word similarity. For
example beautiful and lovely should be closer in distance than beautiful and unattractive. We evaluate
on a suite of word similarity datasets, subsets of which have been considered in past work: WordSim
353 (Finkelstein et al., 2002), rare words (Luong et al., 2013), and many others; see supplementary
materials for details. Following standard practice, for each competing model, we compute cosine
distances between word pairs in word similarity datasets, then rank and report Spearman’s rank
correlation coefficient (Spearman, 1904) between the model’s rankings and human rankings.
Syntactic and semantic analogies The second evaluation dataset is two analogy tasks proposed
by Mikolov et al. (2013b). These questions evaluate syntactic and semantic relations between words.
There are 10,675 syntactic questions (e.g., walking : walked :: swimming : swam) and 8,869 semantic
questions (e.g., Athens : Greece :: Oslo :: Norway). In each question, one word is missing, and the
task is to correctly predict the missing word. We use the vector offset method (Mikolov et al., 2013b)
that computes the vector b = aAthens aGreece + aOslo . We only consider a question to be answered
4
http://www.statmt.org/wmt11/
5
We replace words with frequency less than 10 with #rare# and numbers with #number#.
6
In preliminary experiments we explored binary tree structures and found they did not work as well; we leave
a more extensive exploration of tree structures to future work.
5
Table 1: Summary of results. We report Spearman’s correlation coefficient for the word similarity task and
accuracies (%) for other tasks. Higher values are better (higher correlation coefficient or higher accuracy). The
last two methods (columns) are new to this paper, and our proposed method is in the last column.
M Task PCA RNN NCE CBOW SG SC FOREST

Word similarity 0.39 0.26 0.48 0.43 0.49 0.49 0.52
Syntactic analogies 18.88 10.77 24.83 23.80 26.69 11.84 24.38
52 Semantic analogies 8.39 2.84 25.29 8.45 19.49 4.50 9.86
Sentence completion 27.69 21.31 30.18 25.60 26.89 25.10 28.88
Sentiment analysis 74.46 64.85 70.84 68.48 71.99 75.51 75.83
Word similarity 0.50 0.31 0.59 0.53 0.58 0.58 0.66
Syntactic analogies 40.67 22.39 33.49 52.20 54.64 22.02 48.00
520 Semantic analogies 28.82 5.37 62.76 12.58 39.15 15.46 41.33
Sentence completion 30.58 23.11 33.07 26.69 26.00 28.59 35.86
Sentiment analysis 81.70 72.97 78.60 77.38 79.46 78.20 81.90
Table 2: Results on the syntactic and semantic analogies tasks with a bigger corpus (M = 260).
Task CBOW SG FOREST

Syntactic 61.37 63.61 65.11
Semantic 23.13 54.41 52.07
correctly if the returned vector (b) has the highest cosine similarity to the correct answer (in this
example, aNorway ).
Sentence completion The third evaluation task is the Microsoft Research sentence completion
challenge (Zweig and Burges, 2011). In this task, the goal it to choose from a set of five candidate
words which one best completes a sentence. For example: Was she his {client, musings, discomfiture,
choice, opportunity}, his friend, or his mistress? (client is the correct answer). We choose the
candidate with the highest average similarity to every other word in the sentence.7
Sentiment analysis The last evaluation task is sentence-level sentiment analysis. We use the movie
reviews dataset from Socher et al. (2013). The dataset consists of 6,920 sentences for training,
872 sentences for development, and 1,821 sentences for testing. We train `2 -regularized logistic
regression to predict binary sentiment, tuning the regularization strength on development data. We
represent each example (sentence) as an M -dimensional vector constructed by taking the average of
word representations of words appearing in that sentence.
The analogy, sentence completion, and sentiment analysis tasks are evaluated on prediction accuracy.
3.3 Results
Table 1 shows results on all evaluation tasks for M = 52 and M = 520. Runtime will be discussed
in §3.5. In the similarity ranking and sentiment analysis tasks, our method performed the best in both
low and high dimensional embeddings. In the sentence completion challenge, our method performed
best in the high-dimensional case and second-best in the low-dimensional case. Importantly, FOREST
outperforms PCA and unstructured sparse coding (SC) on every task. We take this collection of results
as support for the idea that coarse-to-fine organization of latent dimensions of word representations
captures the relationships between words’ meanings better compare to unstructured organization.
Analogies Unlike others tasks, our results on the syntactic and semantic analogies tasks are below
state-of-the-art performance from previous work (for all models). We hypothesize that this is because
performing well on these tasks requires training on a bigger corpus. We combine our WMT-2011
corpus with other news corpora and Wikipedia to obtain a corpus of 6.8 billion words. The size of the
vocabulary of this corpus is 401,150. We retrain three models that are scalable to a corpus of this
size: CBOW, SG, and FOREST;8 with M = 260 to balance the trade-off between training time and
7
We note that unlike matrix decomposition based approaches, some of the neural network based models can
directly compute the scores of context words given a possible answer (Mikolov et al., 2013b). We choose to use
average similarities for a fair comparison of the representations.
8
Our NCE implementation is not optimized and therefore not scalable.
6
Table 3: Comparison to previously published word representations. The five right-most columns correspond to
the tasks described above; parenthesized values are the number of in-vocabulary items that could be evaluated.
Models M V W. Sim. Syntactic Semantic Sentence Sentiment

CW 130,000 (6,225) 0.51 (10,427) 12.34 (8,656) 9.33 (976) 24.59 69.36
RNN-DC 100,232 (6,137) 0.32 (10,349) 10.94 (7,853) 2.60 (964) 19.81 67.76
50
HLBL 246,122 (6,178) 0.11 (10,477) 8.98 (8,446) 1.74 (990) 19.90 62.33
NNSE 34,107 (3,878) 0.23 (5,114) 1.47 (1,461) 2.46 (833) 0.04 64.80
HPCA 178,080 (6,405) 0.29 (10,553) 10.42 (8,869) 3.36 (993) 20.14 67.49
FOREST 52 180,834 (6,525) 0.52 (10,675) 24.38 (8,733) 9.86 (1,004) 28.88 75.83
performance (M = 52 does not perform as well, and M = 520 is computationally expensive). For
FOREST, we use the fast learning algorithm in §2.3, since the online learning algorithm of Mairal et al.
(2010) does not scale to a problem of this size. We report accuracies on the syntactic and semantic
analogies tasks in Table 2. All models benefit significantly from a bigger corpus, and the performance
levels are now comparable with previous work. On the syntactic analogies task, FOREST is the best
model. On the semantic analogies task, SG outperformed FOREST, and they both are better than
CBOW.
3.4 Other Comparisons
In Table 3, we compare with five other baseline methods for which we do not train on our training
data but pre-trained 50-dimensional word representations are available:
• Collobert et al. (2011): a neural network language model trained on Wikipedia data for 2 months
(CW).9
• Huang et al. (2012): a neural network model that uses additional global document context
(RNN-DC).10
• Mnih and Hinton (2008): a log bilinear model that predicts a word given its context, trained using
hierarchical softmax (HLBL).11
• Murphy et al. (2012): a word representation trained using non-negative sparse embedding (NNSE)
on dependency relations and document cooccurrence counts.12 These vectors were learned using
sparse coding, but using different contexts (dependency and document cooccurrences), a different
training method, and with a nonnegativity constraint. Importantly, there is no hierarchy in the
code space, as in FOREST.13
• Lebret and Collobert (2014): a word representation trained using Hellinger PCA (HPCA).14
These methods were all trained on different corpora, so they have different vocabularies that do not
always include all of the words found in the tasks. We estimate performance on the items for which
prediction is possible, and show the count for each method in Table 3. This comparison should
be interpreted cautiously since many experimental variables are conflated; nonetheless, FOREST
performs strongly.
3.5 Discussion
Our method produces sparse word representations with exact zeros. We observe that the sparse coding
method without a structured regularizer produces sparser representations, but it performs worse on
our evaluation tasks, indicating that it zeroes out meaningful dimensions. For FOREST with M = 52
and M = 520, the average numbers of nonzero entries are 91% and 85% respectively. While our
word representations are not extremely sparse, this makes intuitive sense since we try to represent
about 180,000 contexts in only 52 (520) dimensions. We also did not tune . As we increase M , we
get sparser representations.
9
http://ronan.collobert.com/senna/
10
http://goo.gl/Wujc5G
11
http://metaoptimize.com/projects/wordreprs/ (Turian et al., 2010)
12
Obtained from http://www.cs.cmu.edu/˜bmurphy/NNSE/.
13
We found that NNSE trained using our contexts performed very poorly; see supplementary materials.
14
http://lebret.ch/words/
7
In terms of running time, FOREST is reasonably fast to learn. We use the online dictionary learning
method for M = 52 and M = 520 on a medium-sized corpus. For M = 52, the dictionary learning
step took about 30 minutes (64 cores) and the overall learning procedure took approximately 2 hours
(640 cores). For M = 520, the dictionary learning step took about 1.5 hours (64 cores) and the
overall learning procedure took approximately 20 hours (640 cores). For comparison, the SG model
took about 1.5 hours and 5 hours for M = 52 and M = 520 using a highly optimized implementation
from the author’s website (with no parallelization). On a large corpus with 6.8 billion words and
vocabulary size of about 400,000, FOREST with Algorithm 1 took about 2 hours (16 cores) while SG
took about 6.5 hours (16 cores) for M = 260.
We visualize our M = 52 word representations (FOREST) related to animals (10 words) and countries
(10 words). We show the coefficient patterns for these words in Figure 2. We can see that in both
cases, there are dimensions where the coefficient signs (positive or negative) agree for all 10 words
(they are mostly on the right and left sides of the plots). Note that the dimensions where all the
coefficients agree are not the same in animals and countries. The larger magnitude of the vectors
for more abstract concepts (animal, animals, country, countries) is suggestive of neural imaging
studies that have found evidence of more global activation patterns for processing superordinate terms
(Raposo et al., 2012). In Figure 3, we show tree visualizations of coefficients of word representations
for animal, horse, and elephant. We show one tree for M = 52 (there are four trees in total, but
other trees exhibit similar patterns). Coefficients that differ in sign mostly correspond to leaf nodes,
validating our motivation that top level nodes should focus more on “general” contexts (for which
they should be roughly similar for animal, horse, and elephant) and leaf nodes focus on word-specific
contexts. One of the leaf nodes for animal is driven to zero, suggesting that more abstract concepts
require fewer dimensions to explain.
For FOREST and SG with M = 520, we project the learned word representations into two dimensions
using the t-SNE tool (van der Maaten and Hinton, 2008) from http://homepage.tudelft.
nl/19j49/t-SNE.html. We show projections of words related to the concept “good” vs. “bad”
in Figure 4.15 See supplementary materials for “man” vs. “woman,” as well as 2-dimensional
projections of NCE.
animal
animals
dog
cat
lion
elephant
bird
pig
snake
fish
bull
horse
15
8
25
34
45
35
42
3
27
28
11
23
40
49
2
52
16
44
26
9
38
6
51
41
18
31
19
33
17
21
4
20
46
13
43
32
29
10
7
37
39
14
30
24
50
48
36
22
12
5
47
1
country
countries
spain
france
italy
germany
russia
china
india
iraq
brazil
egypt
3
36
17
52
14
44
10
33
31
11
39
21
46
30
2
22
29
27
43
28
50
35
4
48
41
42
6
12
5
7
16
19
37
26
49
45
9
23
38
15
13
32
40
18
25
34
20
47
51
8
24
1
Figure 2: Heatmaps of word representations for 10 animals (top) and 10 countries (bottom) for M = 52 from
FOREST. Red indicates negative values, blue indicates positive values (darker colors correspond to more extreme
values); white denotes exact zero. The x-axis shows the original dimension index, we show the dimensions from
the most negative (left) to the most positive (right), within each block, for readability.
15
Since t-SNE is a non-convex method, we run it 10 times and choose the plots with the lowest t-SNE error.
8
(a) animal (b) horse (c) elephant
Figure 3: Tree visualizations of word representations for animal (left), horse (center), elephant (right) for
M = 52. We use the same color coding scheme as in Figure 2. Here, we only show one tree (out of four), but
other trees exhibit similar patterns.
Figure 4: Two dimensional projections of the FOREST (left) and SG (right) word representations using the t-SNE
tool (van der Maaten and Hinton, 2008). Words associated with “good” are colored in blue, words associated
with “bad” are colored in red. We can see that in both cases most “good” and “bad” words are clustered together
(in fact, they are linearly separated in the 2D space), except for poor in the SG case. See supplementary materials
for more examples.
4 Conclusion
We introduced a new method for learning word representations based on hierarchical sparse coding.
The regularizer encourages hierarchical organization of the latent dimensions of vector-space word
embeddings. We showed that our method outperforms state-of-the-art methods on word similarity
ranking, syntactic analogy, sentence completion, and sentiment analysis tasks.
Acknowledgements
The authors thank anonymous reviewers, Sam Thomson, Bryan R. Routledge, Jesse Dodge, and Fei
Liu for helpful feedback on an earlier draft of this paper. This work was supported by the National
Science Foundation through grant IIS-1352440, the Defense Advanced Research Projects Agency
through grant FA87501420244, and computing resources provided by Google and the Pittsburgh
Supercomputing Center.
References
Bamman, D., Dyer, C., and Smith, N. A. (2014). Distributed representations of situated language. In
Proc. of ACL.
Bengio, Y., Ducharme, R., Vincent, P., and Jauvin, C. (2003). A neural probabilistic language model.
Journal of Machine Learning Research, 3, 1137–1155.
Collins, A. M. and Quillian, M. R. (1969). Retrieval time from semantic memory. Journal of Verbal
Learning and Verbal Behaviour, 8, 240–247.
9
Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., and Kuska, P. (2011). Natural
language processing (almost) from scratch. Journal of Machine Learning Research, 12, 2461–2505.
Faruqui, M. and Dyer, C. (2014). Improving vector space word representations using multilingual
correlation. In Proc. of EACL.
Finkelstein, L., Gabrilovich, E., Matias, Y., Rivlin, E., Solan, Z., Wolfman, G., and Ruppin, E. (2002).
Placing search in context: The concept revisited. ACM Transactions on Information Systems, 20(1),
116–131.
Fyshe, A., Talukdar, P. P., Murphy, B., and Mitchell, T. M. (2014). Interpretable semantic vectors
from a joint model of brain- and text- based meaning. In Proc. of ACL.
Gutmann, M. and Hyvarinen, A. (2010). Noise-contrastive estimation: A new estimation principle
for unnormalized statistical models. In Proc. of AISTATS.
Huang, E. H., Socher, R., Manning, C. D., and Ng, A. Y. (2012). Improving word representations via
global context and multiple word prototypes. In Proc. of ACL.
Jenatton, R., Mairal, J., Obozinski, G., and Bach, F. (2011). Proximal methods for hierarchical sparse
coding. Journal of Machine Learning Research, 12, 2297–2334.
Lebret, R. and Collobert, R. (2014). Word embeddings through hellinger PCA. In Proc. of EACL.
Lee, H., Battle, A., Raina, R., and Ng, A. Y. (2007). Efficient sparse coding algorithms. In Proc. of
NIPS.
Lee, H., Raina, R., Teichman, A., and Ng, A. Y. (2009). Exponential family sparse coding with
application to self-taught learning. In Proc. of IJCAI.
Luong, M.-T., Socher, R., and Manning, C. D. (2013). Better word representations with recursive
neural networks for morphology. In Proc. of CONLL.
Mairal, J., Bach, F., Ponce, J., and Sapiro, G. (2010). Online learning for matrix factorization and
sparse coding. Journal of Machine Learning Research, 11, 19–60.
Mikolov, T., Martin, K., Burget, L., Cernocky, J., and Khudanpur, S. (2010). Recurrent neural
network based language model. In Proc. of Interspeech.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G., and Dean, J. (2013a). Distributed representations
of words and phrases and their compositionality. In Proc. of NIPS.
Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013b). Efficient estimation of word representations
in vector space. In Proc. of ICLR Workshop.
Miller, G. A. (1995). Wordnet: A lexical database for english. Communications of the ACM, 38(11),
39–41.
Mnih, A. and Hinton, G. (2008). A scalable hierarchical distributed language model. In Proc. of
NIPS.
Mnih, A. and Teh, Y. W. (2012). A fast and simple algorithm for training neural probabilistic language
models. In Proc. of ICML.
Murphy, B., Talukdar, P., and Mitchell, T. (2012). Learning effective and interpretable semantic
models using non-negative sparse embedding. In Proc. of COLING.
Petrov, S. and Klein, D. (2008). Sparse multi-scale grammars for discriminative latent variable
parsing. In Proc. of EMNLP.
Qin, Z. T. and Goldfarb, D. (2012). Structured sparsity via alternating direction methods. Journal of
Machine Learning Research, 13, 1435–1468.
Raposo, A., Mendes, M., and Marques, J. F. (2012). The hierarchical organization of semantic
memory: Executive function in the processing of superordinate concepts. NeuroImage, 59, 1870–
1878.
Schutze, H. (1998). Automatic word sense discrimination. Computational Linguistics - Special issue
on word sense disambiguation, 24(1), 97–123.
Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning, C., Ng, A., and Potts, C. (2013). Recursive
deep models for semantic compositionality over a sentiment treebank. In Proc. of EMNLP.
Spearman, C. (1904). The proof and measurement of association between two things. The American
Journal of Psychology, 15, 72–101.
10
Turian, J., Ratinov, L., and Bengio, Y. (2010). Word representations: A simple and general method
for semi-supervised learning. In Proc. of ACL.
Turney, P. D. and Pantel, P. (2010). From frequency to meaning: Vector space models of semantics.
Journal of Artificial Intelligence Research, 37, 141–188.
van der Maaten, L. and Hinton, G. (2008). Visualizing data using t-sne. Journal of Machine Learning
Research, 9, 2579–2605.
Yogatama, D. and Smith, N. A. (2014). Making the most of bag of words: Sentence regularization
with alternating direction method of multipliers. In Proc. of ICML.
Yuan, M. and Lin, Y. (2006). Model selection and estimation in regression with grouped variables.
Journal of the Royal Statistical Society, Series B, 68(1), 49–67.
Zhao, P., Rocha, G., and Yu, B. (2009). The composite and absolute penalties for grouped and
hierarchical variable selection. The Annals of Statistics, 37(6A), 3468–3497.
Zweig, G. and Burges, C. J. C. (2011). The microsoft research sentence completion challenge.
Technical report, Microsoft Research Technical Report MSR-TR-2011-129.
11
Supplementary Materials for Learning Word
Representations with Hierarchical Sparse Coding
Dani Yogatama Manaal Faruqui Chris Dyer Noah A. Smith

Language Technologies Institute
Carnegie Mellon University
Pittsburgh, PA 15213, USA
{dyogatama,mfaruqui,cdyer,nasmith}@cs.cmu.edu
Table 1: Summary of results for non-negative sparse embedding (NNSE), continuous bag-of-words and skip
gram models trained with hierarchical softmax (CBOW-HS and SG-HS). Higher number is better (higher
correlation coefficient or higher accuracy).
M Task NNSE CBOW-HS SG-HS FOREST

Word similarity 0.04 0.38 0.47 0.52
Syntactic analogies 0.10 19.50 24.87 24.38
52 Semantic analogies 0.01 5.31 14.77 9.86
Sentence completion 0.01 22.51 28.78 28.88
Sentiment analysis 61.12 68.92 71.72 75.83
Word similarity 0.05 0.50 0.57 0.66
Syntactic analogies 0.81 46.00 50.40 48.00
520 Semantic analogies 0.57 8.00 31.05 41.33
Sentence completion 22.81 25.80 27.79 35.86
Sentiment analysis 67.05 78.50 79.57 81.90
A Additional Results
In Table 1, we compare FOREST with three additional baselines:
• Murphy et al. (2012): a word representation trained using non-negative sparse embedding
(NNSE) on our corpus. Similar to the authors, we use an NNSE implementation from
http://spams-devel.gforge.inria.fr/ (Mairal et al., 2010).
• Mikolov et al. (2013): a log bilinear model that predicts a word given its context, trained using
hierarchical softmax with a binary Huffman tree (continuous bag of words, CBOW-HS). We use
an implementation from https://code.google.com/p/word2vec/.
• Mikolov et al. (2013): a log bilinear model that predicts context words given a target word,
trained using hierarchical softmax with a binary Huffman tree (skip gram, SG-HS). We use an
implementation from https://code.google.com/p/word2vec/.
We train these models on our corpus using the same setup as experiments in our paper.
B Additional Two Dimensional Projections

For FOREST, SG, and NCE with M = 520, we project the learned word representations into two
dimensions using the t-SNE tool (van der Maaten and Hinton, 2008) from http://homepage.
tudelft.nl/19j49/t-SNE.html. We show projections of words related to the concept “good”
vs. “bad” and “man” vs. “woman” in Figure 1.
10
Figure 1: Two dimensional projections of the FOREST (top), SG (middle), and NCE (bottom) word representations
using the t-SNE tool (van der Maaten and Hinton, 2008). Words associated with “good” (left) and “man” (right)
are colored in blue, words associated with “bad” (left) and “woman” (right) are colored in red. The two plots on
the top left are the same plots shown in the paper.
C List of Word Similarity Datasets
We use the following word similarity datasets in our experiments:
• Finkelstein et al. (2002): WordSimilarity dataset (353 pairs).

• Agirre et al. (2009): a subset of WordSimilarity dataset for evaluating similarity (203 pairs).
• Agirre et al. (2009): a subset of WordSimilarity dataset for evaluating relatedness (252 pairs).
• Miller and Charles (1991): semantic similarity dataset (30 pairs)
• Rubenstein and Goodenough (1965): contains only nouns (65 pairs)
• Luong et al. (2013): rare words (2,034 pairs)
• Bruni et al. (2012): frequent words (3,000 pairs)
• Radinsky et al. (2011): MTurk-287 dataset (287 pairs)
• Halawi and Dror (2014): MTurk-771 dataset (771 pairs)
• Yang and Powers (2006): contains only verbs (130 pairs)
11
References
Agirre, E., Alfonseca, E., Hall, K., Kravalova, J., Pasca, M., and Soroa, A. (2009). A study on
similarity and relatedness using distributional and wordnet-based approaches. In Proc. of NAACL-
HLT.
Bruni, E., Boleda, G., Baroni, M., and Tran, N.-K. (2012). Distributional semantics in technicolor. In
Proc. of ACL.
Finkelstein, L., Gabrilovich, E., Matias, Y., Rivlin, E., Solan, Z., Wolfman, G., and Ruppin, E. (2002).
Placing search in context: The concept revisited. ACM Transactions on Information Systems, 20(1),
116–131.
Halawi, G. and Dror, G. (2014). The word relatedness mturk-771 test collection.
Luong, M.-T., Socher, R., and Manning, C. D. (2013). Better word representations with recursive
neural networks for morphology. In Proc. of CONLL.
Mairal, J., Bach, F., Ponce, J., and Sapiro, G. (2010). Online learning for matrix factorization and
sparse coding. Journal of Machine Learning Research, 11, 19–60.
Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013). Efficient estimation of word representations
in vector space. In Proc. of Workshop at ICLR.
Miller, G. A. and Charles, W. G. (1991). Contextual correlates of semantic similarity. Language and
Cognitive Processes, 6(1), 1–28.
Murphy, B., Talukdar, P., and Mitchell, T. (2012). Learning effective and interpretable semantic
models using non-negative sparse embedding. In Proc. of COLING.
Radinsky, K., Agichtein, E., Gabrilovich, E., and Markovitch, S. (2011). A word at a time: Computing
word relatedness using temporal semantic analysis. In Proc. of WWW.
Rubenstein, H. and Goodenough, J. B. (1965). Contextual correlates of synonymy. Communications
of the ACM, 8(10), 627–633.
van der Maaten, L. and Hinton, G. (2008). Visualizing data using t-sne. Journal of Machine Learning
Research, 9, 2579–2605.
Yang, D. and Powers, D. M. W. (2006). Verb similarity on the taxonomy of wordnet. In Proc. of
GWC.
12

Learning Word Representations With Hierarchical Sparse Coding

Uploaded by

Copyright:

Available Formats

Learning Word Representations With Hierarchical Sparse Coding

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Learning Word Representations With Hierarchical Sparse Coding

Uploaded by

Copyright:

Available Formats

Learning Word Representations

with Hierarchical Sparse Coding

Dani Yogatama Manaal Faruqui Chris Dyer Noah A. Smith

We propose a new method for learning word representations using hierarchical

2.1 Background and Notation

The observable representation of word v is taken to be a vector xv 2 RC of cooccurrence statistics

2.2 Structured Regularization for Word Representations

• The size of X 2 RC⇥V (C and V are each on the order of 105 ).

We evaluate on the following benchmark tasks.

M Task PCA RNN NCE CBOW SG SC FOREST

Task CBOW SG FOREST

Models M V W. Sim. Syntactic Semantic Sentence Sentiment

3.4 Other Comparisons

Dani Yogatama Manaal Faruqui Chris Dyer Noah A. Smith

M Task NNSE CBOW-HS SG-HS FOREST

B Additional Two Dimensional Projections

C List of Word Similarity Datasets

We use the following word similarity datasets in our experiments:

• Finkelstein et al. (2002): WordSimilarity dataset (353 pairs).

You might also like