Text Similarity in Vector Space Models: A Comparative Study
{omid.shahmirzadi,kenneth.younge}@epfl.ch
2 Patent Research Foundation, Seattle, USA
alugowski@patrf.org
1 Introduction
TFIDF_{t,d} = TF_{t,d} · log( (|D| + 1) / (DF_{t,D} + 1) )    (1)
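To make formula 1 concrete, a minimal Python sketch using raw counts as TF and the smoothed IDF above (the toy documents are illustrative, not from our corpus):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Smoothed TFIDF per formula 1, with raw counts as TF."""
    n_docs = len(docs)
    df = Counter()                     # DF_t: number of documents containing t
    for doc in docs:
        df.update(set(doc))
    return [{t: tf * math.log((n_docs + 1) / (df[t] + 1))
             for t, tf in Counter(doc).items()}
            for doc in docs]

docs = [["patent", "claim", "method"], ["patent", "device"]]
print(tfidf_vectors(docs))
```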
Despite its popularity and wide applicability, TFIDF suffers from the curse of dimensionality in many downstream applications (e.g. computing k nearest neighbors [14]), it ignores n-gram phrases, and all IDF weights might need to be updated upon the addition of new documents. The basic model, however, can be extended in several ways to avoid some of these pitfalls. We consider two recently proposed extensions in this study.
First, we consider adding certain n-grams to the term vocabulary. N-grams allow for the combination of terms into higher-level concepts, which may be particularly important for research in the computational social sciences, including patent research [2]. Adding n-grams blindly, however, would vastly increase the size of the vocabulary, and thus the number of vector dimensions. A more manageable approach, therefore, is to add noun phrases based on syntactic properties of the text. We test the phrase extraction technique from [9], which extracts noun phrases with a pattern-based method. The authors extend the simple noun phrase grammar of formula 2 to better support coordination of noun phrases and the handling of textual tags. A finite state transducer extracts text portions that match the input grammar, including nested and overlapping parts, from input text annotated with part-of-speech (POS) tags. They impose no upper bound on the size of extracted phrases and show that their method extracts high-quality noun phrases efficiently.
Second, we consider an incremental variant of TFIDF in which document frequencies are computed only over the set of documents D_T available at time T, so that vectors of existing documents need not be recomputed as new documents arrive:

TFIDF_{t,d,T} = TF_{t,d} · log( (|D_T| + 1) / (DF_{t,D_T} + 1) )    (3)
Topic Models. Topic models transform a text into a fixed-size vector whose length equals a given number of latent topics. The vector represents the probability distribution that the focal text relates to each of the different topics. In practice, each topic is a weighted average of a subset of terms. Similar to TFIDF, topic models treat the text as a bag of words in which word order is ignored. On the downside, the interpretation of each topic can be subjective, and determining the right number of topics requires tuning of the model.
Latent Semantic Indexing (LSI) [8,18] is a commonly used topic model for finding low-dimension representations of words or documents. Latent Dirichlet allocation (LDA) is another popular topic model that fits a probabilistic model with a special prior to extract topics and document vectors. We choose LSI as the representative of topic models in this study, as LDA models can be hard to reproduce due to their highly probabilistic nature. Given a set of documents d_1, d_2, ..., d_n and a vocabulary of words w_1, w_2, ..., w_m, LSI builds a term-document matrix X of m × n dimensions, where entry x_{i,j} represents the occurrences of w_i in d_j (as a raw count, 0-1 count, or TFIDF weight). To reduce the dimensions of X, truncated Singular Value Decomposition (SVD) is applied in LSI as in formula 4, where k is the number of topics.
X ≈ U_{m,k} Σ_{k,k} V_{n,k}^T    (4)
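One way to realize formula 4 (a minimal sketch, not necessarily the implementation used in this study) is scikit-learn's TruncatedSVD applied to a weighted document-term matrix; the example corpus and the TFIDF weighting choice are illustrative:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["a method for data storage",
        "a storage device for data",
        "training a neural network"]

# Term weights (here TFIDF, one of the options noted above).
X = TfidfVectorizer().fit_transform(docs)

# Truncated SVD with k = 2 topics yields 2-dimensional document vectors.
svd = TruncatedSVD(n_components=2, random_state=0)
doc_vectors = svd.fit_transform(X)  # shape: (number of documents, k)
print(doc_vectors)
```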
Neural Models. Unlike models that simply count terms, neural models capture information from the context of other words that surround a given word, hence taking ordering into account. The most well-known model for predicting word context is W2V (Word to Vector) [16], where the authors propose an algorithm based on a shallow three-layer neural network to learn word vectors. Prior research has shown the W2V model to perform well on analogy and similarity relationships. Given a context window size, the W2V algorithm comes in two forms. In the first form, known as CBOW (Continuous Bag of Words), the model predicts the probability of a target word given its context words. In the second form, known as Skip-Gram, the model predicts the probability of a context word given a target word. We explain the Skip-Gram mechanics in more detail. Consider a corpus with a sequence of words w_1, w_2, ..., w_T and a window of size c, where the c words on the left and right side of a focal word are considered as context. The objective function to be maximized is given in formula 5.
(1/T) Σ_{t=1}^{T} Σ_{i=−c, i≠0}^{c} log p(w_{t+i} | w_t)    (5)
p(w_c | w_t) = exp(v_{w_c}^T u_{w_t}) / Σ_{w=1}^{W} exp(v_w^T u_{w_t})    (6)
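As a sanity check of formula 6, a small NumPy sketch (the vector naming follows the formula; the random vectors stand in for trained parameters):

```python
import numpy as np

def skipgram_softmax(U, V, target, context):
    """p(w_c | w_t) per formula 6: row w of U is the target vector u_w,
    row w of V is the context vector v_w."""
    scores = V @ U[target]              # v_w^T u_{w_t} for every word w
    scores -= scores.max()              # shift for numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[context]

rng = np.random.default_rng(0)
W, d = 1000, 50                         # vocabulary size, vector dimension
U, V = rng.normal(size=(W, d)), rng.normal(size=(W, d))
print(skipgram_softmax(U, V, target=3, context=7))
```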
Due to the computational cost of using Softmax as a loss function (i.e., computing the gradient has a complexity proportional to the vocabulary size W), two efficient alternatives to Softmax have been suggested [17]: hierarchical Softmax and negative sampling. Also note that W2V ignores frequent words with a probability of 1 − √(t / f(w_i)), where t is a hyper-parameter of the model used for the down-sampling of frequent words, and f(w_i) is the word frequency in the corpus.
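For reference, a minimal Gensim sketch that exposes these choices — Skip-Gram versus CBOW, negative sampling, and the down-sampling threshold t — as hyper-parameters; the toy corpus and parameter values are illustrative, not the settings used in this study:

```python
from gensim.models import Word2Vec

# Toy corpus: one tokenized sentence per list.
sentences = [["patent", "claim", "method"],
             ["storage", "device", "method"]]

model = Word2Vec(
    sentences,
    vector_size=100,  # dimension of the learned word vectors
    window=5,         # context size c on each side of the focal word
    sg=1,             # 1 = Skip-Gram, 0 = CBOW
    negative=5,       # negative sampling in place of the full Softmax
    sample=1e-3,      # down-sampling threshold t for frequent words
    min_count=1,
    seed=42,
)
print(model.wv["method"][:5])  # learned vector for one word
```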
To obtain document vectors from word vectors, one can average together all word vectors. Because simple averaging gives the same weight to important and unimportant words alike, one may be able to retain more information by weighting the average by TFIDF scores. Alternatively, an extension of W2V, known as D2V (Document to Vector) or paragraph vectors [12], has been proposed to obtain document vectors directly by treating each document as a special context token added to the training data, such that the model learns token vectors and treats them as document vectors. Building on W2V, the D2V algorithm also comes in two flavors: Distributed Memory (DM) and Distributed Bag of Words (DBOW). D2V can be implemented incrementally and has shown better performance than previous approaches in some similarity detection benchmarks. On the negative side, however, it has numerous hyper-parameters that must be tuned to exploit its full potential. We consider the D2V model as the representative neural model for the experiments in this study.
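A minimal sketch of the TFIDF-weighted averaging variant mentioned above, assuming word vectors and per-term TFIDF weights have already been computed (both input dictionaries are hypothetical):

```python
import numpy as np

def weighted_doc_vector(tokens, word_vectors, tfidf_weights):
    """Average the word vectors of a document, weighting each by TFIDF."""
    pairs = [(word_vectors[t], tfidf_weights[t])
             for t in tokens if t in word_vectors and t in tfidf_weights]
    if not pairs:
        return None
    vecs, weights = zip(*pairs)
    return np.average(np.array(vecs), axis=0, weights=np.array(weights))

rng = np.random.default_rng(0)
wv = {"patent": rng.normal(size=50), "claim": rng.normal(size=50)}
w = {"patent": 0.4, "claim": 1.2}
print(weighted_doc_vector(["patent", "claim"], wv, w)[:5])
```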
Given two vectors, one can measure the similarity between them in many ways: Euclidean distance, angular separation, correlation, and others. Although each measure has its nuances, in this study we adopt cosine similarity as the measure of interest (see formula 7), as it is well-known and frequently used.
sim(A, B) = (A · B) / (|A| |B|)    (7)
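In code, formula 7 is essentially a one-liner:

```python
import numpy as np

def cosine_similarity(a, b):
    """Formula 7: dot product normalized by the vector magnitudes."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(np.array([1.0, 2.0, 0.0]),
                        np.array([2.0, 1.0, 1.0])))
```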
Several recent studies compare different vector space models with respect to their similarity detection power for texts. Very few of them, however, target similarity detection for longer texts (e.g. documents). In [11] the authors compared D2V variants against averaged W2V, as well as a probabilistic n-gram model, on two similarity tasks: the first detects similarity between forum questions; the second detects similarity between pairs of sentences. The authors find that D2V is superior in most cases and that training the model on a large corpus can improve results. In [19] the authors attempt to detect similarity between sentences and compare several neural models against a baseline that simply averages word vectors. They find that more complex neural models work best in in-domain scenarios (where the training and testing data sets are from the same domain), while the baseline of averaging word vectors is hard to beat in out-of-domain cases. In [3] the authors proposed a method for sentence embedding through the weighted averaging of word vectors as transformed by a dimensionality reduction technique. They show that their text vectors outperform well-known methods at detecting sentence-to-sentence similarity. In [20] the authors use an unsupervised method to vectorize sentences and show that it outperforms other state-of-the-art techniques at detecting the similarity of short sequences of words. In each of these studies, however, the objective was to determine the performance of similarity detection algorithms on relatively short sections of text.
There has been much less research on the performance of similarity comparisons for longer text. In [7] the authors compared D2V to a weighted W2V, LDA, and TFIDF for detecting the similarity of documents in Wikipedia and arXiv corpora. They find that D2V can outperform the other models on Wikipedia, but that it could barely beat a simple TFIDF baseline on arXiv. In [1] the authors compared several algorithms for detecting similarity between biomedical papers in PubMed and find that advanced embedding techniques again can hardly beat simpler baselines such as TFIDF. This paper adds to the stream of research comparing text vectorization methods for longer text. In particular, we focus on a real-world problem with an objective standard for determining similarity, whereas prior research has had to rely on broad categorizations from repositories such as Wikipedia and arXiv. To the best of our knowledge, this work is also the first comparative study of semantic similarity methods in the patent space.
3 Data Pipeline
For pre-processing of the data, we use the DataProc service of the Google Cloud Platform. DataProc, a managed Apache Spark [25] service hosted by Google, provides a high-performance infrastructure for the rapid implementation of data-parallel tasks such as data pre-processing. We apply several pre-processing steps to the textual fields of patent data (title, abstract, description), as sketched after the list below:
– remove HTML, non-ASCII characters, terms with digits, terms shorter than 3 characters, internet addresses, and sequences such as DNA
– stem words and change to lower-case
– remove stopwords, including general NLP and patent-specific ones
– remove rare terms with extremely low total document frequency
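A minimal PySpark sketch of these steps (stemming is omitted, as it would typically be applied through a UDF wrapping an external stemmer; the regular expressions, column names, and toy input are illustrative):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.ml.feature import RegexTokenizer, StopWordsRemover

spark = SparkSession.builder.appName("patent-preprocess").getOrCreate()
df = spark.createDataFrame(
    [("A method for <b>data</b> storage in ACGT123 sequences",)], ["abstract"])

# Lower-case, then strip HTML tags, non-ASCII characters, and terms with digits.
clean = F.lower(F.col("abstract"))
clean = F.regexp_replace(clean, r"<[^>]+>", " ")
clean = F.regexp_replace(clean, r"[^\x00-\x7F]", " ")
clean = F.regexp_replace(clean, r"\S*\d\S*", " ")
df = df.withColumn("clean", clean)

# Tokenize, dropping tokens shorter than 3 characters, then remove stopwords.
tokenizer = RegexTokenizer(inputCol="clean", outputCol="tokens",
                           pattern=r"\W+", minTokenLength=3)
remover = StopWordsRemover(inputCol="tokens", outputCol="filtered")
df = remover.transform(tokenizer.transform(df))
df.select("filtered").show(truncate=False)
```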
3.2 Vectorizing
We vectorize titles, abstracts and descriptions for each of the following models:
Simple TFIDF. We use the machine learning library in the Apache Spark framework for our implementation of TFIDF. There are two flavors of TFIDF in Spark to consider. The first is based on CountVectorizer, where a vocabulary is generated and term frequencies are explicitly counted before being multiplied with the inverse document frequency. The CountVectorizer method, however, creates a highly sparse representation of a document over the vocabulary. The second takes advantage of HashingTF, which transforms a set of terms into fixed-length feature vectors. The hashing trick can be used to directly derive dimensional indices, but it can suffer from collisions. In this work, we use the first method as a slower but more robust technique.
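A minimal sketch of the CountVectorizer flavor, assuming a DataFrame df carrying the filtered token column from the pre-processing sketch; Spark's IDF estimator applies the same smoothed log((|D| + 1)/(DF + 1)) form as formula 1:

```python
from pyspark.ml.feature import CountVectorizer, IDF

# Count term frequencies over an explicit vocabulary.
cv = CountVectorizer(inputCol="filtered", outputCol="tf",
                     minDF=1.0)  # raise this to drop rare terms
tf_df = cv.fit(df).transform(df)

# Multiply by the (smoothed) inverse document frequency.
idf = IDF(inputCol="tf", outputCol="tfidf")
tfidf_df = idf.fit(tf_df).transform(tf_df)
tfidf_df.select("tfidf").show(truncate=False)
```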
D2V. We also use the implementation of D2V in Gensim, as the Gensim version is memory efficient, allows for incremental updates, and can be parallelized over multiple cores. We fix the random seed and the Python hash seed for reproducibility of results. The D2V implementation accepts raw pre-processed text in the form of TaggedDocument elements and has several hyper-parameters, a number of which we consider for tuning.
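A minimal training sketch under these constraints (the hyper-parameter values are placeholders to be set by tuning, not the tuned values from our experiments):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Each document becomes a TaggedDocument with a unique tag — the "special
# context token" whose learned vector serves as the document vector.
docs = [TaggedDocument(words=["patent", "claim", "method"], tags=["doc0"]),
        TaggedDocument(words=["storage", "device", "method"], tags=["doc1"])]

model = Doc2Vec(
    docs,
    dm=0,             # 0 = DBOW, 1 = DM
    vector_size=100,
    window=5,
    min_count=1,
    epochs=20,
    seed=42,          # together with a fixed PYTHONHASHSEED and a
    workers=1,        # single worker, this keeps runs reproducible
)
print(model.dv["doc0"][:5])  # learned document vector
```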
3.3 Evaluation
Model Tuning. Models with hyper-parameters (e.g., topic models and neural models) need to be tuned for optimal performance. Studies show that hyper-parameter tuning can be as important as, or even more important than, the choice of the model itself [13]. Nevertheless, the complexity and cost of tuning can escalate quickly as the number of hyper-parameters increases. Classic approaches to tuning, such as grid search or random search, are often a poor choice when the cost of model evaluation is high. For grid search, it is difficult to select grid points for continuous parameters, and every added hyper-parameter increases the tuning cost geometrically. For random search, one can end up evaluating many poor configurations, which is wasteful when model evaluation is expensive.
Bayesian optimization has been shown to be superior to classic hyper-parameter tuning solutions for expensive models [23]. In particular, it provides a global, derivative-free approach that is suitable for tuning black-box models with a high cost of evaluation. The cost of tuning also grows only linearly with the number of hyper-parameters.
Algorithm 1 depicts Bayesian optimization at a high level. Given a function f to optimize and a set of hyper-parameters X, the algorithm creates a set of initial points on the optimization surface (line 2) and saves the obtained values in a set D. For a budget of N evaluations, the algorithm then repeats the following: (i) fit the distribution of possible functions as a Gaussian process GP to the data D (line 4); (ii) suggest the next point to assess, chosen to maximize the expected value of its goodness under the probabilistic model (line 5)9; (iii) assess the costly objective function (line 6); and (iv) add the newly assessed point to the set D (line 7). At the end of the computing budget, the point with the lowest y value in D is reported as the optimum.
We used the scikit-optimize library10 to implement Bayesian optimization due to the library's ability to run in parallel and to integrate with other libraries in Python. We set a tuning budget equal to 10 times the number of hyper-parameters for most experiments, but stopped tuning early for the description field due to the high cost and low expectation of further improvement.
9 This method can be extended to suggest several points instead of a single one [6].
10 https://scikit-optimize.github.io
Algorithm 1 Bayesian optimization loop
input: f, X
D ← InitialSamples(f, X)
for i ← |D| to N do
    p(y|x, D) ← FitModel(GP, D)
    x_i ← argmax_{x ∈ X} EV(x, p(y|x, D))
    y_i ← f(x_i)
    D ← D ∪ {(x_i, y_i)}
end for
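As a rough illustration of Algorithm 1 in practice, a minimal scikit-optimize sketch for a model with two hyper-parameters; run_similarity_benchmark is a hypothetical stand-in for training a model and scoring it on a test set, and we minimize the negative AUC:

```python
from skopt import gp_minimize
from skopt.space import Integer, Real

def run_similarity_benchmark(vector_size, sample):
    """Hypothetical placeholder: train D2V with these settings, return AUC."""
    return 0.7 + 0.05 * (vector_size / 500.0) - abs(sample - 1e-3)

def evaluate_model(params):
    # Costly objective f: return a value to be minimized.
    vector_size, sample = params
    return -run_similarity_benchmark(vector_size, sample)

# Hyper-parameter space X: one dimension per hyper-parameter.
space = [Integer(50, 500, name="vector_size"),
         Real(1e-5, 1e-2, prior="log-uniform", name="sample")]

result = gp_minimize(evaluate_model, space,
                     n_calls=20,           # evaluation budget N
                     n_initial_points=5,   # InitialSamples in Algorithm 1
                     random_state=42)
print(result.x, result.fun)  # best configuration and its (lowest) y value
```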
4 Experimental Results
[Figure: ROC sensitivity curves for each text field and test set. Legend: green = rejections vs. random pairs; blue = rejections vs. same-class pairs; red = rejections vs. same-subclass pairs; solid line = TFIDF baseline; dashed line = comparison model.]
Table 2. Computation wall time comparison of the best VSM to the simple TFIDF.
The minor improvement of D2V over the simple TFIDF baseline was obtained only after very extensive and expensive tuning of the D2V model.
To summarize, Table 1 reports the AUC, and the percentage improvement in AUC over the simple TFIDF baseline, of the best vectorization method in each text field and test set evaluation scenario, while Table 2 reports the rough wall-time estimate required by each method to compute the corresponding vector space representation of the full corpus on our hardware (excluding pre-processing time). In each case, the best method performs better than a simple TFIDF model, but the percentage improvement is negligible in most cases other than similarity comparison based on titles with very easy distinguishability (102 rejection pairs versus random pairs). Moreover, the computation wall time of the best method is at least two orders of magnitude higher than the baseline TFIDF in all scenarios.
[Figure: AUC for D2V across tuning iterations (panel shown: Same Subclass). Legend: solid line = AUC for D2V during tuning; dotted line = AUC for D2V with default parameters; dashed line = AUC for the simple TFIDF baseline.]
5 Conclusion
In this paper, we evaluated the performance of text vectorization methods for the real-world application of automatically measuring patent-to-patent similarity. We compared a simple TFIDF baseline to more complicated methods, including extensions to the basic TFIDF model, the LSI topic model, and the D2V neural model. We tested the models on shorter to longer texts, and on easier to harder problems of similarity detection.
For our application, we find that simple TFIDF, considering its performance and cost, is a sensible choice. The use of more complex embedding methods that can require extensive tuning, such as LSI and D2V, is only justified if the text is very condensed and the similarity detection task is relatively coarse. Moreover, extensions to the baseline TFIDF, such as adding n-grams or incremental IDFs, do not seem to be beneficial. Although our conclusion is based on experiments over a patent corpus, we believe that it can be generalized to other corpora due to the minimal patent-specific interventions in our pipeline.
Our results are compatible with previous studies of embedding methods for semantic text similarity detection. The focus of prior research, however, has typically been on short text and simple similarity detection problems. Few studies have evaluated the performance of different vector space models on long text or on more challenging benchmarks.
In the practical context of this study (patent-to-patent similarity), discriminating between random pairs of patents and rejection pairs of patents is a rather trivial problem that probably does not require a complicated NLP solution. Yet it is only on such problems that D2V and LSI might outperform the TFIDF model considerably. Instead, for many applications, users are looking for the automatic detection of differences between relatively similar patents (e.g. same-subclass pairs versus rejection pairs). For such problems, where the differences in similarity are small, simple TFIDF appears to be a good choice. The difference in cost and simplicity is such that the use of a simple TFIDF model, which might do slightly worse than more complex models under certain conditions, may still be justified. An extension of TFIDF with incremental IDF calculation could provide the additional benefit of avoiding the recalculation of all TFIDF vectors upon the addition of new patents, without sacrificing performance on the similarity detection task.
This study can be extended by future research in several directions, both in theory and in practice. We observed that incorporating noun phrases and incremental timing information did not lead to better detection of similar patents. For the case of noun phrases, perhaps the low weights of locally-filtered phrases miss the signal, and a more global approach to filtering might improve performance. For the case of incremental TFIDF, it appears that adjusting IDF vectors based on time is not strong enough to affect model performance in our context; it remains to be seen, however, how incremental IDFs fare in more rapidly evolving domains. Studying the effect of similarity metrics other than cosine similarity could also introduce another dimension for future work. Last but not least, devising better unsupervised vector space models for similarity detection on longer text, as well as developing a fundamental understanding of the limitations of current embedding methods in this context, are clearly fertile grounds for research.
References
1. Jon Ezeiza Alvarez. A review of word embedding and document similarity algorithms applied to academic text. Bachelor thesis, University of Freiburg, 2017.
2. Linda Andersson, Mihai Lupu, João Palotti, Allan Hanbury, and Andreas Rauber.
When is the time ripe for natural language processing for patent passage retrieval?
In CIKM, 2016.
3. Sanjeev Arora, Yingyu Liang, and Tengyu Ma. A simple but tough-to-beat baseline
for sentence embeddings. In ICLR, 2017.
4. Andrew P. Bradley. The use of the area under the roc curve in the evaluation of
machine learning algorithms. Pattern Recognition, 30(7):1145–1159, 1997.
5. Matthew Brand. Incremental singular value decomposition of uncertain data with
missing values. In ECCV, 2002.
6. Clément Chevalier and David Ginsbourger. Fast Computation of the Multi-points
Expected Improvement with Applications in Batch Selection. working paper or
preprint, October 2012.
7. Andrew M. Dai, Christopher Olah, and Quoc V. Le. Document embedding with
paragraph vectors. CoRR, abs/1507.07998, 2015.
8. Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and
Richard Harshman. Indexing by latent semantic analysis. Journal of the American
Society for Information Science, 41(6):391–407, 1990.
9. Abram Handler, Matthew Denny, Hanna M. Wallach, and Brendan T. O’Connor.
Bag of what? simple noun phrase extraction for text analysis. In EMNLP, 2016.
10. Bryan Kelly, Dimitris Papanikolaou, Amit Seru, and Matt Taddy. Measuring tech-
nological innovation over the long run, working paper, 2017.
11. Jey Lau and Timothy Baldwin. An empirical evaluation of doc2vec with practical
insights into document embedding generation. CoRR, abs/1607.05368, 2016.
12. Quoc V. Le and Tomas Mikolov. Distributed representations of sentences and
documents. CoRR, abs/1405.4053, 2014.
13. Omer Levy, Yoav Goldberg, and Ido Dagan. Improving distributional similarity
with lessons learned from word embeddings. TACL, 3:211–225, 2015.
14. Ting Liu, Andrew W. Moore, Er Gray, and Ke Yang. An investigation of practical
approximate nearest neighbor algorithms. In NIPS, 2004.
15. Qiang Lu, Amanda Myers, and Scott Beliveau. Uspto patent prosecution research
data: Unlocking office action traits. https://ssrn.com/abstract=3024621, 2017.
16. Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation
of word representations in vector space. CoRR, abs/1301.3781, 2013.
17. Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeff Dean. Dis-
tributed representations of words and phrases and their compositionality. In NIPS.
2013.
18. Andreea Moldovan, Radu Boţ, and Gert Wanka. Latent semantic indexing for patent documents. International Journal of Applied Mathematics and Computer Science, 15:551–560, 2005.
19. Marwa Naili, Anja Habacha Chaibi, and Henda Hajjami Ben Ghezala. Compara-
tive study of word embedding methods in topic segmentation. Procedia Computer
Science, 112(C):340–349, 2017.
20. Matteo Pagliardini, Prakhar Gupta, and Martin Jaggi. Unsupervised learning of
sentence embeddings using compositional n-gram features. CoRR, abs/1703.02507,
2017.
21. Juan Ramos. Using tf-idf to determine word relevance in document queries, 1999.
22. Badrul Sarwar, George Karypis, Joseph Konstan, and John Riedl. Incremental
singular value decomposition algorithms for highly scalable recommender systems.
In ICIS, 2002.
23. Jasper Snoek, Hugo Larochelle, and Ryan P. Adams. Practical bayesian optimiza-
tion of machine learning algorithms. In NIPS, 2012.
24. Kenneth Younge and Jeffrey Kuhn. Patent-to-patent similarity: A vector space
model. https://ssrn.com/abstract=2709238, 2016.
25. Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, and Ion
Stoica. Spark: Cluster computing with working sets. In HotCloud, 2010.