Notation
Master Thesis
by
Alexey Grigorev
Thesis Advisors:
Moritz Schubotz
Juan Soto
Thesis Supervisor:
Prof. Dr. Volker Markl
Eidesstattliche Erklärung
Ich erkläre an Eides statt, dass ich die vorliegende Arbeit selbstständig
verfasst, andere als die angegebenen Quellen/Hilfsmittel nicht benutzt, und
die den benutzten Quellen wörtlich und inhaltlich entnommenen Stellen als
solche kenntlich gemacht habe.
Statutory Declaration
I declare that I have authored this thesis independently, that I have not
used other than the declared sources/resources, and that I have explicitly
marked all material which has been quoted either literally or by content
from the used sources.
Alexey Grigorev
Abstract
Acknowledgements
This thesis addresses a topic that has not been studied previously, and it
was challenging, but extremely interesting and I learned a lot while working
on it. I would like to thank everybody who made it possible.
First, I would like to express my gratitude to my thesis advisor, Moritz
Schubotz, who not only introduced me to the topic of namespace discovery,
but also guided me through the thesis with useful comments and enlightening
discussions.
Secondly, I thank the IT4BI committee who selected me among other
candidates and allowed me to pursue this master’s degree. I thank all my
teachers who gave me enough background to successfully complete the thesis.
I would especially like to thank Dr. Verónika Peralta and Dr. Patrick Marcel,
the teachers of the Information Retrieval course at Université François Rabelais,
Prof. Arnaud Giacometti, the teacher of the Data Mining class at Université
François Rabelais, and finally, Prof. Klaus-Robert Müller, the teacher of the
Machine Learning class at Technische Universität Berlin.
I am also grateful to Yusuf Ameri for his suggestions on improving the
language of this work.
Last, but not least, I would like to thank my wife for supporting me for
the duration of the master program.
Table of Contents
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.2 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2 Background and Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.1 Namespaces in Computer Science . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2 Math-aware POS tagging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3 Mathematical Definition Extraction . . . . . . . . . . . . . . . . . . . . . . . 12
2.4 Vector Space Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.5 Similarity Measures and Distances . . . . . . . . . . . . . . . . . . . . . . . . 17
2.6 Document Clustering Techniques . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.7 Latent Semantic Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3 Namespace Discovery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.1 Namespaces in Mathematical Notation . . . . . . . . . . . . . . . . . . . . 27
3.2 Discovery of Identifier Namespaces . . . . . . . . . . . . . . . . . . . . . . . . 28
3.3 Namespace Discovery by Cluster Analysis . . . . . . . . . . . . . . . . . . 30
3.4 Identifier Vector Space Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.1 Data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.2 Definition Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.3 Document Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.4 Building Namespaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
5 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
5.1 Java Language Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
5.2 Parameter Tuning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5.3 Result Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.4 Building Hierarchy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
5.5 Evaluation Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
7 Outlook and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
7.1 Implementation and Other Algorithms . . . . . . . . . . . . . . . . . . . . . 80
7.2 Other Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
7.3 Other Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
7.4 Unsolved Questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
8 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
1 Introduction
1.1 Motivation
In computer science, a namespace refers to a collection of terms that
are grouped because they share functionality or purpose, typically for pro-
viding modularity and resolving name conflicts [1]. For example, XML uses
namespaces to prefix element names to ensure uniqueness and remove ambi-
guity between them [3], and the Java programming language uses packages
to organize identifiers into namespaces for modularity [4].
In this thesis we extend the notion of namespaces to mathematical for-
mulae. In mathematics, there exists a special system of choosing identifiers,
and it is called mathematical notation [5]. Because of the notation, when
people write “E = mc²”, the meaning of this expression is recognized among
scientists. However, the same identifier may be used in different areas but
denote different things: for example, “E” may refer to “energy”, “expected
value” or “elimination matrix”, depending on the domain of the article. We
can compare this problem with the problem of name collision in computer
science and introduce namespaces of identifiers in mathematical notation to
overcome it.
In this work we aim to discover namespaces of identifiers in mathemat-
ical notation. However, the notation only exists in the documents where
it is used; it does not exist in isolation. This means that the identifier
namespaces should be discovered from the documents with mathematical
formulae. Therefore, the goal of this work is to automatically discover a
set of identifier namespaces given a collection of documents.
We expect the namespaces to be meaningful, in the sense that they can
be related to real-world areas of knowledge, such as physics, linear algebra
or statistics.
Once such namespaces are found, they can give a good categorization of
scientific documents based on the formulas and notation used in them. We
believe that this may facilitate a better user experience: when learning a new
area, it will help users familiarize themselves with the notation faster. Additionally,
it may also help to locate the usages of a particular identifier and refer to
other documents where the identifier is used.
Namespaces also give a way to avoid ambiguity. If we refer to an identifier
from a particular namespace, then it is clear what the semantic meaning
of this identifier is. For example, if we say that “E” belongs to a namespace
about physics, it gives additional context and makes it clear that “E” means
“energy” rather than “expected value”.
Chapter 6 – Conclusions
Chapter 6 summarizes the findings.
Chapter 7 – Outlook and Future Work
Finally, in chapter 7 we discuss possible areas of improvement. We
conclude this chapter by identifying the questions that remain unresolved
and present challenges for future research on identifier namespace dis-
covery.
In this work, we are interested in the first two parts: definiendum and
definiens. Thus we define a relation as a pair (definiendum, definiens). For
example, (E, “energy”) is a relation where “E” is the definiendum, and “energy”
is the definiens. We refer to the definiendum as the identifier, and to the definiens as
the definition, so relations are identifier-definition pairs.
There are several ways of extracting the identifier-definition relations.
Here we will review the following:
– Nearest Noun
– Pattern Matching
– Machine-Learning based methods
– Probabilistic methods
The Nearest Noun method [14] [15] is the simplest definition extraction
method. It finds definitions by looking for combinations of adjectives and
nouns (sometimes preceded by determiners) in the text before the identifier.
That is, if a token annotated with ID is preceded by a sequence consisting
only of adjectives (JJ), nouns (NN, NNS) and determiners (DET), then we take
this sequence as the definition of the identifier.
2 https://en.wikipedia.org/wiki/Mass%E2%80%93energy_equivalence
For example, given the sentence “In other words, the bijection σ normal-
izes G in ...” we will extract a relation (σ, "bijection").
– IDE DEF
– DEF IDE
– let|set IDE denote|denotes|be DEF
– DEF is|are denoted|defined|given as|by IDE
– IDE denotes|denote|stand|stands as|by DEF
– IDE is|are DEF
– DEF is|are IDE
– and many others
In this method IDE and DEF are placeholders that are assigned a value
when the pattern is matched against some subsequence of tokens. IDE and
DEF need to satisfy certain criteria in order to be successfully matched: as
in the Nearest Noun method, we assume that IDE is a token annotated
with ID and DEF is a phrase containing adjectives (JJ), nouns (NN) and de-
terminers (DET). Note that the first pattern corresponds to the Nearest Noun
pattern.
The patterns above are combined from two lists: one is extracted from
a guide to writing mathematical papers in English ([17]), and another is
extracted from “Graphs and Combinatorics” papers from Springer [13].
The pattern matching method is often used as the baseline method for
identifier-definition extraction methods [13] [18] [6].
tf-idf(t, d) = (1 + log tf(t, d)) · log(n / df(t)),
where n is the number of documents in the collection and df(t) is the number of documents in which the term t occurs.
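As a small illustration (a sketch, not code from the thesis), the weighting above can be computed directly from the term frequency tf(t, d), the document frequency df(t) and the collection size n:

import math

def tf_idf_weight(tf, df, n):
    # (1 + log tf) * log(n / df); zero if the term does not occur
    if tf == 0 or df == 0:
        return 0.0
    return (1.0 + math.log(tf)) * math.log(n / df)

# a term occurring 3 times in a document and present in 10 of 1000 documents
print(tf_idf_weight(tf=3, df=10, n=1000))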
The inner product between two vectors can be used as a similarity func-
tion: the more similar two vectors are, the larger their inner product.
Geometrically, the inner product between two vectors x and y is defined as
xᵀy = ‖x‖ ‖y‖ cos θ, where θ is the angle between the vectors x and y. In Lin-
ear Algebra, however, the inner product is defined as the sum of element-wise
products of two vectors: given two vectors x and y, the inner product is
xᵀy = Σᵢ₌₁ⁿ xᵢ yᵢ, where xᵢ and yᵢ are the i-th elements of x and y, respectively.
The geometric and algebraic definitions are equivalent [26].
Inner product is sensitive to the length of vectors, and thus it may make
sense to consider only the angle between them: the angle does not depend on
the magnitude, but it is still a very good indicator of vectors being similar
or not.
The angle between two vectors can be calculated from the geometric
definition of the inner product: xᵀy = ‖x‖ ‖y‖ cos θ. By rearranging the terms
we get cos θ = xᵀy / (‖x‖ ‖y‖).
We do not need the angle itself and can use the cosine directly [19]. Thus
we can define the cosine similarity between two documents d1 and d2 as
cosine(d1, d2) = d1ᵀd2 / (‖d1‖ ‖d2‖).
If the documents have unit length, then the cosine similarity is the same as the dot
product: cosine(d1, d2) = d1ᵀd2.
The cosine similarity can be converted to a distance function. The max-
imal possible cosine is 1, attained for two identical documents. Therefore we can de-
fine the cosine distance between two vectors d1 and d2 as dc(d1, d2) = 1 −
cosine(d1, d2). The cosine distance is not a proper metric [27], but it is
nonetheless useful.
The cosine distance and the Euclidean distance are connected [27]. For
two unit-normalized vectors d1 and d2 the Euclidean distance between them
satisfies ‖d1 − d2‖² = 2 − 2 d1ᵀd2 = 2 dc(d1, d2). Thus we can use the Euclidean distance
on unit-normalized vectors and interpret it as the cosine distance.
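A short numerical sketch (illustrative, with made-up vectors) of this relation between the Euclidean distance and the cosine distance on unit-normalized vectors:

import numpy as np

def cosine_distance(d1, d2):
    return 1.0 - d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2))

rng = np.random.default_rng(0)
d1, d2 = rng.random(5), rng.random(5)
d1, d2 = d1 / np.linalg.norm(d1), d2 / np.linalg.norm(d2)  # unit-normalize

# for unit vectors: ||d1 - d2||^2 = 2 - 2 d1.d2 = 2 * cosine_distance(d1, d2)
print(np.linalg.norm(d1 - d2) ** 2, 2 * cosine_distance(d1, d2))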
These algorithms differ only in the way they calculate similarity between
clusters. It can be Single Linkage, when the clusters are merged based on
the closest pair; Complete Linkage, when the clusters are merged based on
the worst-case similarity – the similarity between the most distant objects
on the clusters; Group-Average Linkage, based on the average pair-wise
similarity between all objects in the clusters; and Ward’s Method, where
the clusters to merge are chosen so that the within-cluster error between
each object and its centroid is minimized [21].
Among these algorithms only Single Linkage is computationally feasi-
ble for large data sets, but it doesn’t give good results compared to other
agglomerative clustering algorithms. Additionally, these algorithms are not
always good for document clustering because they tend to make mistakes at
early iterations that are impossible to correct afterwards [29].
2.6.2 K-Means
2.6.3 DBSCAN
belong to the same cluster. If a point is not a core point itself, but it belongs
to the neighborhood of some core point, then it is a border point. But if a
point is not a core point and it is not in the neighborhood of any other core
point, then it does not belong to any cluster and it is considered noise.
DBSCAN works as follows: it selects an arbitrary data point p, and then
finds all other points in the ε-neighborhood of p. If there are at least MinPts
points around p, then p is a core point, and its neighborhood forms a cluster. Then
the process is repeated for all points in the neighborhood, and they are all
assigned to the same cluster as p. If p is not a core point, but it has a core
point in its neighborhood, then it is a border point and it is assigned to the
same cluster as the core point. But if it is a noise point, then it is marked
as noise or discarded (see listing 2).
Algorithm 2 DBSCAN
function DBSCAN(database D, radius ε, MinPts)
    result ← ∅
    for all p ∈ D do
        if p is visited then
            continue
        mark p as visited
        N ← Region-Query(p, ε)            ▷ N is the ε-neighborhood of p
        if |N| < MinPts then
            mark p as NOISE
        else
            C ← Expand-Cluster(p, N, ε, MinPts)
            result ← result ∪ {C}
    return result
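For illustration, a compact Python sketch of the procedure in Algorithm 2 (a naive O(n²) version with Euclidean neighborhoods; the implementation used later in this work differs, see section 4.3):

import numpy as np

def region_query(X, i, eps):
    # indices of all points within distance eps of point i (the eps-neighborhood)
    return list(np.where(np.linalg.norm(X - X[i], axis=1) <= eps)[0])

def dbscan(X, eps, min_pts):
    labels = np.full(len(X), -1)            # -1 marks noise
    visited = np.zeros(len(X), dtype=bool)
    cluster = 0
    for i in range(len(X)):
        if visited[i]:
            continue
        visited[i] = True
        neighbours = region_query(X, i, eps)
        if len(neighbours) < min_pts:
            continue                         # noise (may later become a border point)
        labels[i] = cluster                  # i is a core point: expand a new cluster
        while neighbours:
            j = neighbours.pop()
            if labels[j] == -1:
                labels[j] = cluster          # border or core point joins the cluster
            if not visited[j]:
                visited[j] = True
                n_j = region_query(X, j, eps)
                if len(n_j) >= min_pts:      # j is also a core point: keep expanding
                    neighbours.extend(n_j)
        cluster += 1
    return labels

print(dbscan(np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [9.0]]), eps=0.5, min_pts=2))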
LSA has some drawbacks. Because SVD looks for an orthogonal basis
for the new reduced document space, there could be negative values that
are harder to interpret, and what is more, the cosine similarity can become
negative as well. However, it does not significantly affect the cosine distance:
it will still always give non-negative results.
Apart from SVD there are many other different matrix decomposition
techniques that can be applied for document clustering and for discovering
the latent structure of the term-document matrix [40], and one of them
is Non-Negative Matrix Factorization (NMF) [41]. Using NMF solves the
problem of negative coefficients: when it is applied to non-negative data
such as term-document matrices, NMF produces non-negative rank-reduced
approximations.
The main conceptual difference between SVD and NMF is that SVD
looks for orthogonal directions to represent document space, while NMF
does not require orthogonality [42] (see fig. 2).
Fig. 2: Directions found by SVD (on the left) vs directions by NMF (on the
right)
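As an illustrative sketch (using scikit-learn on a tiny random non-negative matrix, not data from the thesis), the contrast can be seen directly: the directions found by TruncatedSVD are orthonormal, while the NMF directions are non-negative but in general not orthogonal:

import numpy as np
from sklearn.decomposition import NMF, TruncatedSVD

rng = np.random.RandomState(0)
X = rng.poisson(0.5, size=(30, 10)).astype(float)   # toy non-negative "term-document" counts

svd = TruncatedSVD(n_components=3, random_state=0).fit(X)
nmf = NMF(n_components=3, random_state=0, max_iter=500).fit(X)

# Gram matrices of the component directions:
print(np.round(svd.components_ @ svd.components_.T, 3))   # ~ identity (orthonormal)
print(np.round(nmf.components_ @ nmf.components_.T, 3))   # non-negative, not orthogonal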
3 Namespace Discovery
In this chapter we introduce the problem of namespace discovery in math-
ematical notation and suggest how this problem can be approached.
First, we extend the idea of namespaces to mathematics in section 3.1,
and discuss the problem of namespace discovery in section 3.2, and then
argue that it is possible to use document cluster analysis to solve the problem
in section 3.3. Finally, we propose a way of representing identifiers in a vector
space in section 3.4.
(c, “speed of light”)}, “Nphysics.E” refers to “energy” – the definition of “E”
in the namespace “Physics”.
Analogously to namespaces in Computer Science, formally a mathemat-
ical namespace can contain any set of identifier-definition pairs that satisfies
the definition of the namespace, but typically namespaces of mathemati-
cal notation exhibit the same properties as well-designed software packages:
they have low coupling and high cohesion, meaning that all definitions in
a namespace come from the same area of mathematical knowledge and the
definitions from different namespaces do not intersect heavily.
However, mathematical notation does not exist in isolation, and it is
usually observed indirectly by its usage in documents. To account for this
fact, we need to introduce a document-centric view on mathematical names-
paces: suppose we have a collection of n documents D = {d1 , d2 , ... , dn }
and a set of K namespaces {N1 , N2 , ... , NK }. A document dj can use a
namespace Nk by importing identifiers from it. To import an identifier, the
document uses an import statement where the identifier i is referred by its
fully qualified name. For example, a document “Energy-mass equivalence”
would import “Nphysics.E”, “Nphysics.m”, and “Nphysics.c”, and then these
identifiers can be used in the formulae of this document unambiguously.
A namespace exhibits low coupling if it is used only in a small subset of
documents, and high cohesion if all the documents in this subset are related
to the same domain.
But in real-life scientific documents there are no import statements in the
document preamble; they contain only natural language text along with
some mathematical formulae. Yet we may still assume that these imports
exist, but they are implicit, i.e. they are latent and cannot be observed
directly. Additionally, the namespaces themselves are also not observed.
Typically in mathematical texts, when an identifier is first introduced,
its definition is given in the natural language description that surrounds the
formula. This description can be extracted and used to assign the meaning
to the identifiers. Once identifier definitions are extracted, a document can
be represented as a set of identifier-definition pairs, and these pairs can be
used to discover the namespaces.
In the next section we discuss how this problem can be addressed.
and they do not have low coupling: they are not very homogeneous and they
import from several namespaces.
With this intuition we can refer to type 1 document groups as namespace-
defining groups. These groups can be seen as packages: they define
namespaces that are used by the other (“type 2”) document groups. Once the
namespace-defining groups are found, we can learn the namespaces from these
document groups.
Thus we need to find groups of homogeneous documents in a given collection,
and this is exactly what Cluster Analysis methods do.
In the next section we argue why we can use traditional document
clustering techniques and what characteristics texts and identifiers have
in common.
– two words are synonymous if they have the same meaning (for example,
“graph” and “chart” are synonyms),
– a word is polysemous if it can have multiple meanings (for example,
“trunk” can refer to a part of an elephant or a part of a car).
Note that identifiers have the same problems. For example, “E” can stand
both for “Energy” and “Expected value”, so “E” is polysemous.
These problems have been studied in Information Retrieval and Natu-
ral Language Processing literature. One possible solution for the polysemy
problem is Word Sense Disambiguation [9]: either replace a word with its
sense [45] or append the sense to the word. For example, if the polysemous
word is “bank” with meaning “financial institution”, then we replace it with
“bank_finance”. The same idea can be used for identifiers, for example if we
have an identifier “E” which is defined as “energy”, then “E” can be replaced
with “E_energy”.
Thus we see that the text representation of documents and the identifier
representation of documents have many similarities, and therefore we can apply
the set of techniques developed for text representation to clustering docu-
ments based on identifiers.
For document clustering, documents are usually represented using Vector
Space Models [21] [22]. Likewise, we can introduce an “Identifier Vector Space
Model” analogously to Vector Space Models for words, and then apply
clustering algorithms to documents represented in this space.
4 Implementation
Wikipedia is a large online encyclopedia whose content is written and
edited by the community. It contains a large number of articles on a variety of
topics, including articles about Mathematics and Mathematics-related fields
such as Physics. It is multilingual and available in several languages, includ-
ing English, German, French, Russian and others. The content of Wikipedia
pages is authored in a special markup language, and the content of the
entire encyclopedia is freely available for download.
The techniques discussed in this work are mainly applied to the English
version of Wikipedia. At the moment of writing (July 31, 2015) the English
Wikipedia contains about 4.9 million articles3. However, just a small portion
of these articles is math-related: there are only 30 000 pages that contain
at least one <math> tag.
Apart from the text data and formulas Wikipedia articles have informa-
tion about categories, and we can exploit this information as well. The cate-
gory information is encoded directly into each Wikipedia page with a special
markup tag. For example, the article “Linear Regression” 4 belongs to the cat-
egory “Regression analysis” and [[Category:Regression analysis]] tag
encodes this information.
Wikipedia is available in other languages, not only English. While most
of the analysis is performed on the English Wikipedia, we also apply
some of the techniques to the Russian version [46] to compare it with the
results obtained on the English Wikipedia. The Russian Wikipedia is smaller
than the English Wikipedia and contains 1.9 million articles5, among which
only 15 000 pages are math-related (i.e. contain at least one <math> tag).
3 https://en.wikipedia.org/wiki/Wikipedia:Statistics
4 https://en.wikipedia.org/wiki/Linear_regression
5 https://en.wikipedia.org/wiki/Russian_Wikipedia
Once the identifiers are extracted, the rest of the formula is discarded.
As a result, we have a “Bag of Formulae”: analogously to the Bag of Words
approach (see section 2.4), we keep only the counts of occurrences of different
identifiers and do not preserve any other structure.
The content of a Wikipedia document is authored in Wiki markup – a
special markup language for specifying document layout elements such as
headers, lists, text formatting and tables. Thus the next step is to process
the Wiki markup and extract the textual content of an article, and this is
done using a Java library “Mylyn Wikitext” [50]. Almost all annotations are
discarded at this stage, and only inner-wiki links are kept: they can be useful
as candidate definitions. The implementation of this step is taken entirely
from [6] with only a few minor changes.
Once the markup annotations are removed and the text content of an
article is extracted, we then apply Natural Language Processing (NLP) tech-
niques. Thus, the next step is the NLP step, and for NLP we use the Stan-
ford Core NLP library (StanfordNLP) [12]. The first part of this stage is
to tokenize the text and also split it by sentences. Once it is done, we then
apply Math-aware POS tagging (see section 2.2). For documents from the
English Wikipedia we use StanfordNLP’s Maximum Entropy POS
Tagger [51]. Unfortunately, there are no trained models available for POS
tagging the Russian language for the StanfordNLP library and we were not
able to find a suitable implementation of any other POS taggers in Java.
Therefore we implemented a simple rule-based POS tagger ourselves. The
implementation is based on a PHP function from [52]: it is translated into
Java and seamlessly integrated into the StanfordNLP pipeline. The English
tagger uses the Penn Treebank POS Scheme [11], and hence we follow the
same convention for the Russian tagger.
For handling mathematics we introduce two new POS classes: “ID” for
identifiers and “MATH” for formulas. These classes are not a part of the Penn
Treebank POS Scheme, and therefore we need to label all the instances of
these tags ourselves during the additional post-processing step. If a token
starts with “FORMULA_”, then we recognize that it is a placeholder for a math
formula, and therefore we annotate it with the “MATH” tag. Additionally, if
this formula contains only one identifier, this placeholder token is replaced by
the identifier and it is tagged with “ID”. We also keep track of all identifiers
found in the document and then for each token we check if this token is in
the list. If it is, then it is re-annotated with the “ID” tag.
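A simplified sketch of this post-processing step (illustrative only; the function and variable names are ours, not taken from the thesis implementation):

def retag(tokens, formula_identifiers, document_identifiers):
    """tokens: list of (token, pos) pairs from the POS tagger;
    formula_identifiers: dict mapping a FORMULA_ placeholder to its identifiers;
    document_identifiers: set of all identifiers found in the document."""
    result = []
    for token, pos in tokens:
        if token.startswith("FORMULA_"):
            ids = formula_identifiers.get(token, [])
            if len(ids) == 1:
                result.append((ids[0], "ID"))   # single-identifier formula -> the identifier itself
            else:
                result.append((token, "MATH"))  # placeholder for a whole formula
        elif token in document_identifiers:
            result.append((token, "ID"))        # re-annotate known identifiers
        else:
            result.append((token, pos))
    return result

print(retag([("the", "DT"), ("energy", "NN"), ("FORMULA_1", "NN")],
            {"FORMULA_1": ["E"]}, {"E"}))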
Natural language data is famous for being noisy and hard to clean
[53]. The same is true for mathematical identifiers and scientific texts with
formulas. In this section we describe how the data was preprocessed and
cleaned at different stages of Definition Extraction.
Often identifiers contain additional semantic information visually con-
veyed by special diacritical marks or font features. For example, the diacrit-
ics can be hats to denote “estimates” (e.g. “ŵ”), bars to denote the expected
value (e.g. “X̄”), arrows to denote vectors (e.g. “x⃗”) and others. As for the
font features, boldness is often used to denote vectors (e.g. “𝐰”) or matrices
(e.g. “𝐗”), calligraphic fonts are used for sets (e.g. “ℋ”), double-struck fonts
often denote spaces (e.g. “ℝ”), and so on.
Unfortunately there is no common notation established across all fields
of mathematics and there is a lot of variance. For example, a vector can be
denoted by “~x ”, “x” or “x”, and a real line by “R”, “R” or “R”. In natural
languages there are related problems of lexical ambiguity such as synonymy,
when different words refer to the same concept, and it can be solved by re-
placing the ambiguous words with some token, representative of the concept.
English and Russian: for the Russian Wikipedia we only need to handle the
auxiliary words such as “где” (“where”), “иначе” (“else”) and so on. The
names for operators and functions are more or less consistent across both
data sources.
Then, at the next stage, the definitions are extracted. However, many
shortlisted definitions are either not valid definitions or too general. For ex-
ample, some identifiers become associated with “if and only if”, “alpha”,
“beta”, “gamma”, which are not valid definitions.
Other definitions like “element” (“элемент”), “number” (“число”) or
“variable” (“переменная”) are valid, but they are too general and not
descriptive. We maintain a stop list of such false definitions and filter them
out from the result. The elements of the stop list are also consistent across
both data sets, in the sense that the false definition candidates are the
same but expressed in different languages.
The Russian language is highly inflected, and because of this the extracted
definitions appear in many different forms, depending on grammatical gender,
number (singular or plural) and declension. This greatly increases the variabil-
ity of the definitions, and to reduce it we lemmatize the definitions: they are
reduced to a common form (nominative, singular and masculine).
This is done using Pymorphy2, a Python library for Russian and Ukrainian
morphology [55].
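A minimal sketch of this lemmatization step with Pymorphy2 (illustrative; the words are examples, not taken from the extracted data):

import pymorphy2

morph = pymorphy2.MorphAnalyzer()

def lemmatize(definition):
    # reduce every word of a Russian definition to its normal form
    return " ".join(morph.parse(word)[0].normal_form for word in definition.split())

print(lemmatize("случайные величины"))  # -> "случайный величина"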
At the next stage the retrieved identifier-definition pairs are used for
document clustering. Some definitions are used only once; they are not very
useful because they do not have any discriminative power, and therefore all
such definitions are excluded.
At the identifier extraction step, when the data set is cleaned, some iden-
tifiers are discarded, and after that some documents become empty: they
no longer contain any identifiers, which is why these documents are
not considered for further analysis. Additionally, we discard all the docu-
ments that have only one identifier. This leaves only 22 515 documents out
of 30 000, and they contain 12 771 distinct identifiers, which occur about 2
million times.
The most frequent identifiers are “x” (125 500 times), “p” (110 000), “m”
(105 000 times) and “n” (83 000 times), but about 3 700 identifiers occur only
once and 1 950 just twice. Clearly, the distribution of identifiers follows some
power law distribution (see fig. 3).
(Fig. 3: distribution of identifier frequencies, shown on a linear scale and on a log-log scale.)
The distribution of identifier counts inside the documents also ap-
pears to follow a long-tail power law distribution: there are a few articles that
contain many identifiers, while most of the articles do not (see fig. 4a). The
biggest article (“Euclidean algorithm”) has 22 766 identifiers, and the sec-
ond largest (“Lambda lifting”) has only 6 500 identifiers. The mean number
of identifiers per document is 33. The distribution of the number of distinct
identifiers per document is less skewed (see fig. 4b). The largest number of
distinct identifiers is 287 (in the article “Hooke’s law”), followed
by 194 (in “Dimensionless quantity”). The median number of identifiers per
document is 10.
For 12 771 identifiers the algorithm extracted 115 300 definitions, and
the number of found definitions follows a long tail distribution as well (see
fig. 4c), with the median number of definitions per page being 4.
Table 1 shows the list of the most common identifier-definition relations
extracted from the English Wikipedia.
In the Russian Wikipedia only 5 300 articles contain enough identifiers,
and the remaining 9 500 are discarded.
The identifiers and definitions extracted from the Russian version of
Wikipedia exhibit similar properties. The most frequently occurring
identifier is “x” with 13 248 occurrences, but the median frequency of an
identifier is only 3. The article with the largest number of identifiers is
“Уравнения Максвелла” (“Maxwell’s equations”) which contains 1 831 iden-
tifiers, while the median number of identifiers is just 3; the article with the
largest number of distinct identifiers is also “Уравнения Максвелла” with
112 unique identifiers, and the median number of distinct identifiers in the
Fig. 4: (a) Identifier frequencies per document for the 80 largest documents; (b) number of distinct identifiers per document for the 80 largest documents; (c) number of definitions per document for the 80 largest documents.
There are three ways of building the identifier space:
– identifiers alone,
– “weak” identifier-definition association,
– “strong” identifier-definition association.
In the first case we are only interested in identifier information and dis-
card the definitions altogether.
In the second and third cases we keep the definitions and use them to
index the dimensions of the Identifier Space. But there is some variability
in the definitions: for example, the same identifier “σ” in one document
can be assigned to “Cauchy stress tensor” and in another it can be assigned to
“stress tensor”, which are almost the same thing. To reduce this variability we
perform some preprocessing: we tokenize the definitions and use individual
tokens to index dimensions of the space. For example, suppose we have
two pairs (σ, “Cauchy stress tensor”) and (σ, “stress tensor”). In the “weak”
association case we will have dimensions (σ, Cauchy, stress, tensor), while for
the “strong” association case we will have (σ_Cauchy, σ_stress, σ_tensor).
Additionally, the effect of variability can be decreased further by applying
a stemming technique to each definition token. In this work we use the Snowball
stemmer for English [56] implemented in NLTK [57], a Python library for
Natural Language Processing. For Russian we use Pymorphy2 [55].
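The following sketch illustrates the “weak” and “strong” dimension tokens together with stemming (an illustration under our own naming, not the thesis code):

from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")

def dimension_tokens(identifier, definition, strong=False):
    # stem the definition words; prefix them with the identifier for the "strong" association
    words = [stemmer.stem(w) for w in definition.lower().split()]
    if strong:
        return ["%s_%s" % (identifier, w) for w in words]
    return [identifier] + words

print(dimension_tokens("σ", "Cauchy stress tensor"))               # "weak" association
print(dimension_tokens("σ", "Cauchy stress tensor", strong=True))  # "strong" association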
Using TfidfVectorizer from scikit-learn [58] we vectorize each docu-
ment. The experiments are performed with (log TF) × IDF weighting, and
therefore we use use_idf=False, sublinear_tf=True parameters for the
vectorizer. Additionally, we use min_df=2 to discard identifiers that occur
only once.
The output is a document-identifier matrix (analogous to a “document-
term” matrix): documents are rows and identifiers/definitions are columns. The
output of TfidfVectorizer is row-normalized, i.e. all rows have unit length.
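A sketch of this vectorization step with the parameters quoted above (the token strings are made up for illustration):

from sklearn.feature_extraction.text import TfidfVectorizer

# each "document" is the space-separated list of its dimension tokens
docs = ["E m c E", "E X n", "sigma_stress sigma_tensor n"]

vectorizer = TfidfVectorizer(sublinear_tf=True,   # 1 + log tf
                             use_idf=False,       # the parameters quoted above
                             min_df=2,            # drop tokens occurring in fewer than 2 documents
                             lowercase=False,     # identifiers are case-sensitive
                             token_pattern=r"\S+")
X = vectorizer.fit_transform(docs)   # sparse document-identifier matrix, rows L2-normalized
print(vectorizer.vocabulary_, X.shape)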
Once the documents are vectorized, we can apply clustering tech-
niques to them. We use K-Means (see section 2.6.2), implemented as the class
KMeans in scikit-learn, and Mini-Batch K-Means (class MiniBatchKMeans)
[58]. Note that if rows are unit-normalized, then running K-Means with the
Euclidean distance is equivalent to using the cosine distance (see section 2.5.3).
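A sketch of the clustering step on a toy matrix (illustrative values; on unit-length rows, Euclidean K-Means behaves like clustering with the cosine distance):

import numpy as np
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.preprocessing import normalize

X = normalize(np.array([[2.0, 1.0, 0.0],
                        [3.0, 1.0, 0.0],
                        [0.0, 1.0, 2.0],
                        [0.0, 2.0, 3.0]]))   # unit-length rows

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
mbk = MiniBatchKMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_, mbk.labels_)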
The DBSCAN and SNN clustering algorithms (see section 2.6.3) were imple-
mented manually: available DBSCAN implementations usually take a distance
measure rather than a similarity measure. The similarity matrices created by
similarity measures are typically very sparse, because usually only a small
fraction of the documents are similar to a given document. Similarity
measures can be converted to distance measures, but in this case the matrix
would no longer be sparse, and we would like to avoid that. Additionally, avail-
able implementations are usually general-purpose implementations and do
not take advantage of the structure of the data: for text-like data, clustering
algorithms can be sped up significantly by using an inverted index.
Dimensionality reduction techniques are also important: they not only
reduce the dimensionality, but also help reveal the latent structure of data.
In this work we use Latent Semantic Analysis (LSA) (section 2.7), which is
implemented using randomized Singular Value Decomposition (SVD) [59].
The implementation of randomized SVD is taken from scikit-learn [58], the
method randomized_svd. Non-negative Matrix Factorization is an alterna-
tive technique for dimensionality reduction (section 2.7). Its implementation
is also taken from scikit-learn [58], class NMF.
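A sketch of this dimensionality reduction step (random toy data; the rank k and the matrix shape are illustrative):

import numpy as np
from sklearn.decomposition import NMF
from sklearn.utils.extmath import randomized_svd

rng = np.random.RandomState(0)
X = rng.poisson(0.3, size=(100, 40)).astype(float)   # toy document-identifier counts

k = 5
U, S, Vt = randomized_svd(X, n_components=k, random_state=0)
X_lsa = U * S                                        # document coordinates in the k-dimensional LSA space

X_nmf = NMF(n_components=k, random_state=0, max_iter=500).fit_transform(X)
print(X_lsa.shape, X_nmf.shape)                      # (100, 5) (100, 5)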
To assess the quality of the produced clusters we use Wikipedia categories.
It is quite difficult to extract category information from the raw Wikipedia text,
therefore we use DBpedia [60] for that: it provides machine-readable infor-
mation about categories for each Wikipedia article. Additionally, categories
in Wikipedia form a hierarchy, and this hierarchy is available as a SKOS
ontology.
Unfortunately, there is no information about articles from the Russian
Wikipedia on DBpedia. However, the number of documents is not very large,
and therefore this information can be retrieved via the MediaWiki API8 indi-
vidually for each document. The information can be retrieved in chunks for
a group of several documents at once, and therefore it is quite fast.
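A minimal sketch of such a request (assuming the standard MediaWiki action=query / prop=categories endpoint; the titles are just examples, and error handling and batching limits are omitted):

import requests

API = "https://ru.wikipedia.org/w/api.php"

def fetch_categories(titles):
    # one request for a whole batch of page titles
    params = {"action": "query", "prop": "categories", "format": "json",
              "cllimit": "max", "titles": "|".join(titles)}
    pages = requests.get(API, params=params).json()["query"]["pages"]
    return {p.get("title"): [c["title"] for c in p.get("categories", [])]
            for p in pages.values()}

print(fetch_categories(["Уравнения Максвелла", "Линейная регрессия"]))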
– Document A:
• n: (predictions, 0.95), (size, 0.92), (random sample, 0.82), (popula-
tion, 0.82)
• θ: (estimator, 0.98), (unknown parameter, 0.98), (unknown parame-
ter, 0.94)
• µ: (true mean, 0.96), (population, 0.89)
• µ4 : (central moment, 0.83)
• σ: (population variance, 0.86), (square error, 0.83), (estimators, 0.82)
– Document B:
• Pθ : (family, 0.87)
• X: (measurable space, 0.95)
• θ: (sufficient statistic, 0.93)
• µ: (mean, 0.99), (variance, 0.95), (random variables, 0.89), (normal,
0.83)
• σ: (variance, 0.99), (mean, 0.83)
– Document C:
• n: (tickets, 0.96), (maximum-likelihood estimator, 0.89)
• x: (data, 0.99), (observations, 0.93)
• θ: (statistic, 0.95), (estimator, 0.93), (estimator, 0.93), (rise, 0.91),
(statistical model, 0.85), (fixed constant, 0.82)
• µ: (expectation, 0.96), (variance, 0.93), (population, 0.89)
• σ: (variance, 0.94), (population variance, 0.91), (estimator, 0.87)
– Pθ : (family, 0.87)
– X: (measurable space, 0.95), (Poisson, 0.82)
– n: (tickets, 0.96), (predictions, 0.95), (size, 0.92), (maximum-likelihood
estimator, 0.89), (random sample, 0.82), (population, 0.82)
– x: (data, 0.99), (observations, 0.93)
– θ: (estimator, 2.84), (unknown parameter, 1.92), ({statistic, sufficient
statistic}, 1.88), (rise, 0.91), (statistical model, 0.85), (fixed constant,
0.82)
– µ: (random variables, 2.67), ({mean, true mean}, 1.95), (variance, 1.88),
(expectation, 0.96), (normal, 0.83)
– µ4 : (central moment, 0.83)
– σ: ({variance, population variance}, 3.7), ({estimator, estimators}, 1.69),
(square error, 0.83), (mean, 0.83)
Intuitively, the more a relation occurs, the higher the score, and it gives
us more confidence that the definition is indeed correct.
(Figure: the score as a function of x for tanh(x) and tanh(x/2).)
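The merged scores above (e.g. 0.98 + 0.93 + 0.93 = 2.84 for (θ, “estimator”)) suggest that the scores of identical relations are added up; below is a sketch of such an aggregation, with the tanh(x/2) squashing from the figure shown as an option (our reading of the procedure, not code from the thesis):

import math
from collections import defaultdict

def merge_relations(documents, squash=False):
    # sum the scores of identical (identifier, definition) relations across documents,
    # optionally squashing the sum with tanh(x / 2)
    totals = defaultdict(float)
    for relations in documents:
        for identifier, definition, score in relations:
            totals[(identifier, definition)] += score
    if squash:
        return {k: math.tanh(v / 2.0) for k, v in totals.items()}
    return dict(totals)

doc_a = [("θ", "estimator", 0.98)]
doc_c = [("θ", "estimator", 0.93), ("θ", "estimator", 0.93)]
print(merge_relations([doc_a, doc_c]))   # approx. {('θ', 'estimator'): 2.84}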
The category of this namespace is selected as the category that the ma-
jority of the documents in the namespace-defining cluster share.
5 Evaluation
In this chapter we describe the experimental setup and the obtained
results.
First, section 5.1 verifies that the namespace discovery is possible by
applying the proposed technique to Java source code. Next, section 5.2 de-
scribes parameter tuning: there are many possible choices of parameters, and
we find the best. Once the best algorithm and its parameters are selected,
we analyze the obtained results in section 5.3. Next, we describe how the
discovered clusters can be mapped to a hierarchy in section 5.4, and finish
by summarizing our findings in section 5.5.
After discarding declarations from the standard Java API, primitives and types with
generic parameters, only 15 869 declarations were retained.
The following are the top 15 variable/type declarations extracted from the
Mahout source code:
(Figure: number of pure clusters for k = 100, 200 and 300.)
There are many different clustering algorithms, each with its own set of
parameters. In this section we describe how we find the settings that discover the
best namespaces.
The following things can be changed:
Class name
org.apache.mahout.math.neighborhood.UpdatableSearcher
org.apache.mahout.common.distance.CosineDistanceMeasureTest
org.apache.mahout.common.distance.DefaultDistanceMeasureTest
org.apache.mahout.common.distance.DefaultWeightedDistanceMeasureTest
org.apache.mahout.common.distance.TestChebyshevMeasure
org.apache.mahout.common.distance.TestMinkowskiMeasure
(b) A namespace-defining cluster about Distances.
ID Class
chebyshevDistanceMeasure org.apache.mahout.common.distance.DistanceMeasure
distanceMeasure org.apache.mahout.common.distance.DistanceMeasure
euclideanDistanceMeasure org.apache.mahout.common.distance.DistanceMeasure
manhattanDistanceMeasure org.apache.mahout.common.distance.DistanceMeasure
minkowskiDistanceMeasure org.apache.mahout.common.distance.DistanceMeasure
v org.apache.mahout.math.Vector
vector org.apache.mahout.math.Vector
(c) Definitions in the v namespace.
Fig. 7: (a) Number of clusters K vs. overall purity of the clustering: the purity increases linearly with K (R² = 0.99). (b) Number of clusters K vs. the number of pure clusters: it grows initially, but after K ≈ 8 000 starts to decrease.
However, it is not enough just to find the purest cluster assignment,
because as the number of clusters increases the overall purity also grows.
Thus we can also optimize for the number of clusters with purity at least p and size at
least n. When the number of clusters increases, the purity always grows (see
fig. 7a), but at some point the number of pure clusters starts decreasing
(see fig. 7b).
5.2.1 Baseline
(Fig. 8: distribution of the number of pure clusters over the baseline runs.)
To establish the baseline, we repeated this experiment 200 times (see
fig. 8); the maximal achieved value is 39 pure clusters, while the mean
value is 23.85.
The first way of building the identifier space is to use only identifiers and
not use definitions at all. If we do this, the identifier-document matrix
is 6075 × 22512 (we keep only identifiers that occur at least twice), and it
contains 302 541 records, so the density of this matrix is just 0.002.
First, we try to apply agglomerative clustering, then DBSCAN with SNN
similarity based on the Jaccard coefficient and on cosine similarity, then K-
Means, and finally we apply LSA using SVD and NMF and run K-Means
on the reduced space.
Agglomerative clustering algorithms are quite fast for small datasets,
but they become more computationally expensive as the dataset size grows.
(Figure: runtime in minutes of agglomerative clustering for different numbers of documents N, with the observed runtime and a regression fit.)
(Figure: (a) number of clusters when 10 nearest neighbors are considered; (b) performance of selected ε with 10 nearest neighbors; (c) number of clusters when 15 nearest neighbors are considered; (d) performance of selected ε with 15 nearest neighbors.)
the best closest neighbors. For example, the nearest neighbor of “Singular
value decomposition” is “Rule of Sarrus”, and although their cosine score is
0.92, they have only 3 identifiers in common.
With cosine as the base similarity function for SNN DBSCAN we were
able to discover 124 namespace-defining clusters (see fig. 11), which is signifi-
cantly better than the baseline. The best parameters are 10 nearest neighbors
and ε = 4, MinPts = 3 (see fig. 11b).
Next, we apply K-Means. We observe that increasing K leads to a linear
increase in runtime (see fig. 12a), which makes large values of K infeasible
to run: for example, we estimate the runtime of K-Means with K = 10 000
to be about 4.5 hours. As MiniBatch K-Means is expected to be significantly
faster than the usual K-Means, we use it as well. Although we observe that
the runtime of MiniBatch K-Means also increases linearly with K (see
fig. 12b), it indeed runs considerably faster.
(Figure: (a) number of clusters when 10 nearest neighbors are considered; (b) performance of selected ε with 10 nearest neighbors; (c) number of clusters when 15 nearest neighbors are considered; (d) performance of selected ε with 15 nearest neighbors.)
Fig. 12: (a) Runtime in minutes of K-Means for different numbers of clusters K; (b) runtime in seconds of MiniBatch K-Means for different numbers of clusters K.
Randomized SVD is very fast, but the runtime does not grow linearly
with k; it looks quadratic (see fig. 14). However, the typical values of k used for
SVD in latent semantic analysis are 150-250 [22] [39], therefore the run-
time is not prohibitive, and we do not need to run it with very large k.
When the dimensionality is reduced, the performance of K-Means and
MiniBatch K-Means is similar (see fig. 15a), but with MiniBatch K-Means
we were able to discover more interesting pure clusters (see fig. 15b). The
reason for this may be the fact that in the reduced space there is less noise
and both methods find equally good clusters, but because MiniBatch K-
Means works faster, we are able to run it multiple times, thus increasing its
chances of finding a good local optimum where there are many pure document
clusters. Note that the obtained result is below the baseline.
(Figure: performance of the usual and MiniBatch K-Means for different numbers of clusters K.)
(Fig. 14: runtime in seconds of randomized SVD for different values of the reduced rank k.)
Fig. 15: The performance of K-Means and MiniBatch K-Means on the re-
duced document space with k = 150.
The experiment is repeated for different k ∈ {150, 250, 350, 500}. The performance in terms of discovered pure
clusters does not depend much on the rank k of the reduced space (see
fig. 16). In fact, it is very hard to distinguish the different lines because they
largely overlap. The maximum is achieved at K ≈ 10 000 for all k.
Fig. 16: Number of discovered pure clusters in K-Means for different number
of clusters K and rank k.
(Figure: runtime in hours for different values of the rank k.)
Fig. 18: Number of discovered pure clusters in K-Means and NMF for dif-
ferent number of clusters K and rank k.
Fig. 19: The effect of using different weighting systems on K-Means with
SVD.
(Figure: (a) number of clusters when 10 nearest neighbors are considered; (b) performance of selected ε with 10 nearest neighbors; (c) number of clusters when 15 nearest neighbors are considered; (d) performance of selected ε with 15 nearest neighbors.)
Fig. 21: Number of discovered pure clusters in K-Means and SVD for differ-
ent number of clusters K and rank k.
Fig. 23: The effect of using different weighting systems on K-Means with
SVD (k = 150).
Fig. 24: Number of discovered pure clusters in K-Means and SVD for differ-
ent number of clusters K and rank k.
(Fig. 25: number of discovered pure clusters for different numbers of clusters K and rank k ∈ {150, 250, 350}.)
We do not observe significant differences across different values of k (see fig. 25). The
best achieved result is 105 namespace-defining clusters.
In the previous chapter we have established that the best way to incor-
porate definitions into the Identifier Vector Space is by using soft association,
and the best performing clustering method is MiniBatch K-Means.
Table 4:
Article                        | Identifiers
Diagonalizable matrix          | v1, λ1, vk, λ3, λ2, λi, λk, λj, λn, ...
Eigenvalues and eigenvectors   | vi, µA, λi, d, λn, ...
Principal axis theorem         | v1, u1, λ1, λ2, D, S, u, ...
Eigendecomposition of a matrix | λ, λ1, λ, λ2, λk, R, U, T, ...
Min-max theorem                | σ, un, uk, ui, u1, α, λ1, λ, λi, ...
Linear independence            | Λ, vj, u2, v3, un, λ1, λ3, λ2, ...
Symmetric matrix               | Λ, λ1, λ2, D, Q, P, λi, ...
(a) Wiki Articles in the cluster “Linear Algebra”

ID | Definition      | Score
D  | diagonal matrix | 0.72
t  | real argument   | 0.46
u  | eigenvalues     | 0.42
ui | eigenvector     | 0.42
v1 | eigenvectors    | 0.73
Λ  | diagonal matrix | 0.87
λ  | eigenvalue      | 0.40
λ1 | eigenvalues     | 0.95
λ2 | eigenvalues     | 0.71
λ3 | eigenvalues     | 0.39
λi | eigenvalue      | 0.98
(b) Definitions in “Linear Algebra”
The evaluation is performed using parameters K = 9750 and k = 350. The purity of this clustering
is 0.63. The largest namespace-defining clusters discovered by this method
are presented in table 3.
Let us consider the “Linear Algebra” cluster (table 4) with 6 documents
and some of the extracted definitions in the documents of this cluster; all these
articles share the identifiers λ1, m and n. Let us consider all definitions of the iden-
tifier “λ”. In total, there are 93 clusters where “λ” is used (see table 5), and
in many cases it is possible to determine that the assignment is correct (e.g.
“eigenvalue”, “wavelength”, “regularization parameter”). Some cases are not
correct, for example, when we have clusters with the same name where λ
denotes different things (e.g. in two “Quantum Mechanics” clusters), or in
the case of the “Linear Algebra” cluster where it denotes a matrix.
Clustering results with soft association are better than the results obtained
with hard association. One of the reasons for that can be the fact that
definitions may act as keywords that describe the document, and they are
better at capturing the semantic content of the document.
Additionally, we see that clustering on the reduced space works better,
and in our case the best dimensionality reduction method is SVD.
We also note that we do not discover many namespace-defining clusters.
The best result identifies 414 clusters, while the desired number of clusters
K is almost 10 000. It means that the information from about 9 000 clusters
is discarded: in total there are 22 512 documents, but identifiers from only
1 773 of them are used, and the rest of the documents (about 92%) are not utilized
at all.
λ
Size Namespace Name Definition Score
3 Algebra multiplicity 0.43
4 Analysis of variance marquardt 0.69
3 Applied and interdisciplinary physics wavelength 0.98
6 Cartographic projections longitude 1.00
3 Cartography longitude 1.00
3 Category theory natural isomorphisms 0.40
4 Condensed matter physics penetration depth 0.44
5 Continuous distributions affine parameter 0.46
3 Coordinate systems longitude 0.88
3 Differential equations differential operator 0.42
8 Differential geometry vector fields 0.72
7 Electronic amplifiers typical value 0.43
3 Electrostatics unit length 0.43
10 Fluid dynamics wavelength 1.00
6 Fluid dynamics free path 0.43
3 Infinity limit ordinals 0.87
7 Linear algebra eigenvalue 0.4
5 Linear algebra matrix 0.41
3 Linear algebra eigenvalue 0.85
3 Liquids relaxation time 0.41
3 Materials science rate 0.44
3 Mathematical analysis eigenvalue 0.41
3 Mathematical theorems poisson distribution 0.41
4 Measure theory lebesgue measure 0.44
3 Measurement order 0.42
8 Mechanics previous expression 0.44
4 Mechanics power series 0.41
3 Metalogic empty word 0.45
7 Number theory partition 0.74
4 Number theory modular lambda function 0.46
3 Operator theory algebraic multiplicity 0.44
5 Optics wavelength 0.71
5 Partial differential equations constants 0.41
4 Physical optics wavelength 0.95
5 Physics exciton state 0.88
6 Probability distributions references 0.42
4 Quantum field theory coupling constant 0.75
5 Quantum mechanics wavelength 1.00
5 Quantum mechanics state 0.87
3 Radioactivity decay 0.72
4 Representation theory of Lie groups weight 1.00
3 Riemannian geometry contravariant vector field 0.45
4 Rubber properties engineering strain 1.00
3 Statistical data types regularization parameter 0.45
20 Statistics words 0.43
3 Statistics expectation 0.46
3 Stellar astronomy mean free path 0.43
3 Surface chemistry ideal gas 0.39
3 Theoretical physics eigenvalue 0.88
5 Theories of gravitation dicke 0.44
3 Wave mechanics wavelength 0.8
Table 5: Some of the definitions of “λ”.
λ
Size Original name English name Original definition English definition Score
3 Алгебра Algebra поль field 0.74
5 Гидродинамика Fluid dynamics тепловой движение thermal motion 0.42
4 Гравитация Gravitation коэф. затухание damping coefficient 0.46
6 Картография Cartography долгота longitude 0.98
5 Линейная алгебра Linear algebra скаляр scalar 0.46
4 Оптика Optics длина length 0.88
3 Оптика Optics длина волна wavelength 0.44
Релятивистские и Relativity and
5 частота frequency 0.42
гравитационные явления gravitation
3 Статистическая физика Statistical physics итоговый выражение final expression 0.42
Теоремы Theorems of
3 нуль порядок “zero order” 0.45
комплексного анализа complex analysis
3 Теория алгоритмов Algorithms функция переход transition function 0.89
5 Физические науки Physical sciences длина length 0.43
(c) Definitions of “λ” across all namespaces.
14 http://grnti.ru/
Namespace          | E                 | m           | c                        | λ                    | σ                   | µ
Linear algebra     | matrix            | matrix      | scalar                   | eigenvalue           | related permutation | algebraic multiplicity
General relativity | energy            | mass        | speed of light           | length               | shear               | reduced mass
Coding theory      | encoding function | message     | transmitted codeword     | natural isomorphisms |                     |
Optics             | order             | fringe      | speed of light in vacuum | wavelength           | conductivity        | permeability
Probability        | expectation       | sample size |                          | affine parameter     | variance            | mean vector
Table 8: Definitions for selected identifiers and namespaces extracted from the English Wikipedia.
6 Conclusions
However, there are many ways in which the present approach can be
improved further. In the next section we discuss possible directions.
that uses random projections [75] [76], which potentially can give a speed up
while not significantly losing in performance. Another dimensionality reduc-
tion technique useful for discovering semantics is Dynamic Auto-Encoders
[77].
Additionally, we can try different approaches to clustering such as Spec-
tral Clustering [78] or Micro-Clustering [79].
Finally, topic modeling techniques such as Latent Dirichlet Allocation
[80] can be quite useful for modeling namespaces. It can be seen as a “soft
clustering” technique and it can naturally model the fact that a document
may import from several namespaces.
8 Bibliography
References
1. Erik Duval, Wayne Hodgins, Stuart Sutton, and Stuart L Weibel. Metadata principles and
practicalities. D-lib Magazine, 8(4):16, 2002.
2. Kevin McArthur. What’s new in PHP 6. Pro PHP: Patterns, Frameworks, Testing and
More, pages 41–52, 2008.
3. Henry Thompson, Tim Bray, Dave Hollander, Andrew Layman, and Richard Tobin. Names-
paces in XML 1.0 (third edition). W3C recommendation, W3C, December 2009. http://www.w3.org/TR/2009/REC-xml-names-20091208/.
4. James Gosling, Bill Joy, Guy Steele, Gilad Bracha, and Alex Buckley. The Java® Language
Specification, Java SE 8 Edition. Addison-Wesley Professional, 2014.
5. Wikipedia. Mathematical notation — Wikipedia, the free encyclopedia, 2015.
https://en.wikipedia.org/w/index.php?title=Mathematical_notation&oldid=646730035, accessed: 2015-07-01.
6. Robert Pagel and Moritz Schubotz. Mathematical language processing project. arXiv
preprint arXiv:1407.0167, 2014.
7. Anders Møller and Michael I Schwartzbach. An introduction to XML and Web Technologies.
Pearson Education, 2006.
8. Craig Larman. Applying UML and patterns: an introduction to object-oriented analysis and
design and iterative development. Pearson Education India, 2005.
9. Dan Jurafsky and James H Martin. Speech & language processing. Pearson Education India,
2000.
10. Ulf Schöneberg and Wolfram Sperber. POS tagging and its applications for mathematics.
In Intelligent Computer Mathematics, pages 213–223. Springer, 2014.
11. Beatrice Santorini. Part-of-speech tagging guidelines for the Penn Treebank Project (3rd
revision). 1990.
12. Christopher D Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J Bethard,
and David McClosky. The Stanford CoreNLP natural language processing toolkit. In Pro-
ceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System
Demonstrations, pages 55–60, 2014.
13. Giovanni Yoko Kristianto, MQ Ngiem, Yuichiroh Matsubayashi, and Akiko Aizawa. Extract-
ing definitions of mathematical expressions in scientific papers. In Proc. of the 26th Annual
Conference of JSAI, 2012.
14. Mihai Grigore, Magdalena Wolska, and Michael Kohlhase. Towards context-based disam-
biguation of mathematical expressions. In The Joint Conference of ASCM, pages 262–271,
2009.
15. Keisuke Yokoi, Minh-Quoc Nghiem, Yuichiroh Matsubayashi, and Akiko Aizawa. Contex-
tual analysis of mathematical expressions for advanced mathematical search. In Proc. of
the 12th International Conference on Intelligent Text Processing and Computational Linguistics
(CICLing 2011), Tokyo, Japan, February, pages 20–26, 2011.
16. Minh Nghiem Quoc, Keisuke Yokoi, Yuichiroh Matsubayashi, and Akiko Aizawa. Mining
coreference relations between formulas and text using Wikipedia. In 23rd International
Conference on Computational Linguistics, page 69, 2010.
17. Jerzy Trzeciak. Writing mathematical papers in English: a practical guide. European Math-
ematical Society, 1995.
18. Giovanni Yoko Kristianto, Akiko Aizawa, et al. Extracting textual descriptions of mathe-
matical expressions in scientific papers. D-Lib Magazine, 20(11):9, 2014.
19. Christopher D Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to in-
formation retrieval, volume 1. Cambridge University Press, 2008.
20. Fabrizio Sebastiani. Machine learning in automated text categorization. ACM computing
surveys (CSUR), 34(1):1–47, 2002.
21. Nora Oikonomakou and Michalis Vazirgiannis. A review of web document clustering ap-
proaches. In Data mining and knowledge discovery handbook, pages 921–943. Springer, 2005.
22. Charu C Aggarwal and ChengXiang Zhai. A survey of text clustering algorithms. In Mining
Text Data, pages 77–128. Springer, 2012.
23. Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text re-
trieval. Information processing & management, 24(5):513–523, 1988.
24. Levent Ertöz, Michael Steinbach, and Vipin Kumar. Finding clusters of different sizes,
shapes, and densities in noisy, high dimensional data. In SDM, pages 47–58. SIAM, 2003.
25. Kevin Beyer, Jonathan Goldstein, Raghu Ramakrishnan, and Uri Shaft. When is “nearest
neighbor” meaningful? In Database Theory – ICDT’ 99, pages 217–235. Springer, 1999.
26. Deborah Hughes-Hallett, William G. McCallum, Andrew M. Gleason, et al. Calculus: Single
and Multivariable, 6th Edition. Wiley, 2013.
27. Tuomo Korenius, Jorma Laurikkala, and Martti Juhola. On principal component analysis,
cosine and euclidean measures in information retrieval. Information Sciences, 177(22):4893–
4905, 2007.
28. Douglass R Cutting, David R Karger, Jan O Pedersen, and John W Tukey. Scatter/Gather:
A cluster-based approach to browsing large document collections. In Proceedings of the 15th
annual international ACM SIGIR conference on Research and development in information
retrieval, pages 318–329. ACM, 1992.
29. Michael Steinbach, George Karypis, Vipin Kumar, et al. A comparison of document clus-
tering techniques. In KDD workshop on text mining, volume 400, pages 525–526. Boston,
2000.
30. Rui Xu, Donald Wunsch, et al. Survey of clustering algorithms. Neural Networks, IEEE
Transactions on, 16(3):645–678, 2005.
31. Mark Hall, Paul Clough, and Mark Stevenson. Evaluating the use of clustering for auto-
matically organising digital library collections. In Theory and Practice of Digital Libraries,
pages 323–334. Springer, 2012.
32. David Sculley. Web-scale k-means clustering. In Proceedings of the 19th international con-
ference on World wide web, pages 1177–1178. ACM, 2010.
33. Hinrich Schütze and Craig Silverstein. Projections for efficient document clustering. In ACM
SIGIR Forum, volume 31, pages 74–81. ACM, 1997.
34. Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. A density-based algorithm
for discovering clusters in large spatial databases with noise. In KDD, volume 96, pages
226–231, 1996.
35. Levent Ertöz, Michael Steinbach, and Vipin Kumar. Finding topics in collections of docu-
ments: A shared nearest neighbor approach. pages 83–103, 2004.
36. Thomas K Landauer, Peter W Foltz, and Darrell Laham. An introduction to latent semantic
analysis. Discourse processes, 25(2-3):259–284, 1998.
37. Scott C. Deerwester, Susan T Dumais, Thomas K. Landauer, George W. Furnas, and
Richard A. Harshman. Indexing by latent semantic analysis. JASIS, 41(6):391–407, 1990.
38. Stanislaw Osiński, Jerzy Stefanowski, and Dawid Weiss. Lingo: Search results clustering
algorithm based on singular value decomposition. In Intelligent information processing and
web mining, pages 359–368. Springer, 2004.
39. Nicholas Evangelopoulos, Xiaoni Zhang, and Victor R Prybutok. Latent semantic analysis:
five methodological recommendations. European Journal of Information Systems, 21(1):70–
86, 2012.
40. Stanislaw Osiński. Improving quality of search results clustering with approximate matrix
factorisations. In Advances in Information Retrieval, pages 167–178. Springer, 2006.
41. Daniel D Lee and H Sebastian Seung. Learning the parts of objects by non-negative matrix
factorization. Nature, 401(6755):788–791, 1999.
42. Wei Xu, Xin Liu, and Yihong Gong. Document clustering based on non-negative matrix
factorization. In Proceedings of the 26th annual international ACM SIGIR conference on
Research and development in information retrieval, pages 267–273. ACM, 2003.
43. Eric Evans. Domain-driven design: tackling complexity in the heart of software. Addison-
Wesley Professional, 2004.
44. Alfio Gliozzo and Carlo Strapparava. Semantic domains in computational linguistics.
Springer Science & Business Media, 2009.
45. Christopher Stokoe, Michael P Oakes, and John Tait. Word sense disambiguation in in-
formation retrieval revisited. In Proceedings of the 26th annual international ACM SIGIR
conference on Research and development in information retrieval, pages 159–166. ACM, 2003.
46. Wikimedia Foundation. Russian wikipedia XML data dump, 2015. http://dumps.
wikimedia.org/ruwiki/latest/, downloaded from http://math-ru.wmflabs.org/wiki/,
accessed: 2015-07-12.
47. Apache Software Foundation. Apache Flink 0.8.1. http://flink.apache.org/, accessed:
2015-01-01.
48. David Carlisle, Robert R Miner, and Patrick D F Ion. Mathematical markup language
(MathML) version 3.0 2nd edition. W3C recommendation, W3C, April 2014. http://www.
w3.org/TR/2014/REC-MathML3-20140410/.
49. Ronald Rivest. The MD5 message-digest algorithm. 1992.
50. Eclipse Foundation. Mylyn WikiText 1.3.0, 2015. http://projects.eclipse.org/
projects/mylyn.docs, accessed: 2015-01-01.
51. Kristina Toutanova, Dan Klein, Christopher D Manning, and Yoram Singer. Feature-rich
part-of-speech tagging with a cyclic dependency network. In Proceedings of the 2003 Con-
ference of the North American Chapter of the Association for Computational Linguistics
on Human Language Technology-Volume 1, pages 173–180. Association for Computational
Linguistics, 2003.
52. Определение части речи слова на PHP одной функцией (part of speech tagging in PHP
in one function), 2012. http://habrahabr.ru/post/152389/, accessed: 2015-07-13.
53. Daniel Sonntag. Assessing the quality of natural language text data. In GI Jahrestagung
(1), pages 259–263, 2004.
54. Julie D Allen et al. The Unicode Standard. Addison-Wesley, 2007.
55. Mikhail Korobov. Morphological analyzer and generator for Russian and Ukrainian languages.
arXiv preprint arXiv:1503.07283, 2015.
56. Martin F Porter. Snowball: A language for stemming algorithms, 2001.
57. Steven Bird. NLTK: the natural language toolkit. In Proceedings of the COLING/ACL on
Interactive presentation sessions, pages 69–72. Association for Computational Linguistics,
2006.
58. F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, et al. Scikit-
learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830,
2011.
59. A Tropp, N Halko, and PG Martinsson. Finding structure with randomness: Stochastic
algorithms for constructing approximate matrix decompositions. Technical report, 2009.
60. Christian Bizer, Jens Lehmann, Georgi Kobilarov, Sören Auer, et al. DBpedia – a
crystallization point for the web of data. Web Semant., 7(3):154–165, September 2009.
61. Gonzalo Navarro. A guided tour to approximate string matching. ACM computing surveys
(CSUR), 33(1):31–88, 2001.
62. SeatGeek. FuzzyWuzzy 0.6.0. https://pypi.python.org/pypi/fuzzywuzzy/0.6.0, ac-
cessed: 2015-07-01.
63. American Mathematical Society. AMS mathematics subject classification 2010, 2009. http:
//msc2010.org/, accessed: 2015-06-01.
64. American Physical Society. PACS 2010 regular edition, 2009. http://www.aip.org/
publishing/pacs/pacs-2010-regular-edition/, accessed: 2015-06-01.
65. Bernard Rous. Major update to ACM’s computing classification system. Commun. ACM,
55(11):12–12, November 2012.
66. Alistair Miles, Brian Matthews, Michael Wilson, and Dan Brickley. SKOS Core: Simple
knowledge organisation for the web. In Proceedings of the 2005 International Conference on
Dublin Core and Metadata Applications: Vocabularies in Practice, DCMI ’05, pages 1:1–1:9.
Dublin Core Metadata Initiative, 2005.
67. Daniel Krech. RDFLib 4.2.0. https://rdflib.readthedocs.org/en/latest/, accessed:
2015-06-01.
68. V. I. Feodosimov. Государственный рубрикатор научно-технической информации (state
categorizator of scientific and technical information). 2000.
69. Sreenivasa Viswanadha, Danny van Bruggen, and Nicholas Smith. JavaParser 2.1.0, 2015.
http://javaparser.github.io/javaparser/, accessed: 2015-06-15.
70. Apache Software Foundation. Apache Mahout 0.10.1. http://mahout.apache.org/, ac-
cessed: 2015-06-15.
71. Jure Leskovec, Anand Rajaraman, and Jeffrey Ullman. Mining of massive datasets, 2nd
edition. Cambridge University Press, 2014.
72. Deepak Ravichandran, Patrick Pantel, and Eduard Hovy. Randomized algorithms and NLP:
using locality sensitive hash function for high speed noun clustering. In Proceedings of the
43rd Annual Meeting on Association for Computational Linguistics, pages 622–629. Associ-
ation for Computational Linguistics, 2005.
73. Andrea Baraldi and Palma Blonda. A survey of fuzzy clustering algorithms for pattern
recognition. I. Systems, Man, and Cybernetics, Part B: Cybernetics, IEEE Transactions on,
29(6):778–785, 1999.
74. Stan Z Li, Xin Wen Hou, HongJiang Zhang, and QianSheng Cheng. Learning spatially
localized, parts-based representation. In Computer Vision and Pattern Recognition, 2001.
CVPR 2001. Proceedings of the 2001 IEEE Computer Society Conference on, volume 1,
pages I–207. IEEE, 2001.
75. Fei Wang and Ping Li. Efficient nonnegative matrix factorization with random projections.
In SDM, pages 281–292. SIAM, 2010.
76. Anil Damle and Yuekai Sun. Random projections for non-negative matrix factorization.
arXiv preprint arXiv:1405.4275, 2014.
77. Piotr Mirowski, M Ranzato, and Yann LeCun. Dynamic auto-encoders for semantic indexing.
In Proceedings of the NIPS 2010 Workshop on Deep Learning, 2010.
78. Andrew Y Ng, Michael I Jordan, Yair Weiss, et al. On spectral clustering: Analysis and an
algorithm. Advances in neural information processing systems, 2:849–856, 2002.
79. Takeaki Uno, Hiroki Maegawa, Takanobu Nakahara, Yukinobu Hamuro, Ryo Yoshinaka, and
Makoto Tatsuta. Micro-clustering: Finding small clusters in large diversity. arXiv preprint
arXiv:1507.03067, 2015.
80. David M Blei, Andrew Y Ng, and Michael I Jordan. Latent Dirichlet allocation. Journal
of Machine Learning Research, 3:993–1022, 2003.
81. Rodrigo A. Botafogo and Ben Shneiderman. Identifying aggregates in hypertext structures.
In Proceedings of the Third Annual ACM Conference on Hypertext, HYPERTEXT ’91, pages
63–74, New York, NY, USA, 1991. ACM.
82. Andrew Johnson and Farshad Fotouhi. Adaptive clustering of hypermedia documents. In-
formation Systems, 21(6):459–473, 1996.
83. Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word
representations in vector space. Proceedings of Workshop at ICLR, 2013.
84. Jeffrey Pennington, Richard Socher, and Christopher D Manning. GloVe: Global vectors for
word representation. Proceedings of the Empirical Methods in Natural Language Processing
(EMNLP 2014), 12:1532–1543, 2014.
85. Anatoly Anisimov, Oleksandr Marchenko, Volodymyr Taranukha, and Taras Vozniuk. Se-
mantic and syntactic model of natural language based on tensor factorization. In Natural
Language Processing and Information Systems, pages 51–54. Springer, 2014.
86. Alexander Strehl and Joydeep Ghosh. Cluster ensembles – a knowledge reuse framework for
combining multiple partitions. The Journal of Machine Learning Research, 3:583–617, 2003.
87. Yoav Freund, Robert E Schapire, et al. Experiments with a new boosting algorithm. In
ICML, volume 96, pages 148–156, 1996.