1 Introduction

Mathematical expressions are important for communicating information in scientific papers, for instance, to explain or define concepts written in natural language. Despite their importance, current general-purpose search engines, such as Google, cannot effectively locate mathematical expressions contained in a scientific paper (Zanibbi and Blostein 2012). Over the past decade, several researchers in the fields of information retrieval and mathematics have developed mathematical search systems using several approaches. One essential characteristic of math search systems is the data stored by these systems, that is, mathematical expressions. Kohlhase and Sucan (2006) identified three main challenges for the development of a semantic search system for mathematical formulae: (1) mathematical notation is context dependent (i.e., identical mathematical presentations can represent multiple distinct mathematical objects), (2) different mathematical notation may actually have the same meaning, and (3) certain variations in notation are widely considered irrelevant. To summarize, mathematical expressions exhibit a considerable amount of ambiguity.

Grigore et al. (2009) and Wolska et al. (2011) hypothesized that exploiting the extracted textual information associated with math expressions can resolve this ambiguity. For instance, the mathematical expressions \({}_nP_k\), \({}^nP_k\), \(P_{n,k}\), and P(n, k) represent the same mathematical concept, that is, k-permutations of n, even though they have different representations. The availability of textual information for each of these expressions helps the search system determine that these expressions refer to the same concept. In this case, the textual information is expected to contain at least the key term “permutation.” In this paper, we hypothesize that the textual information explaining a math expression can be further complemented by the textual information that explains the meaning of each symbol contained in the expression, which helps to resolve this ambiguity. For instance, Fig. 1a indicates that, given two math expressions \(\frac{Q}{S}\) and \(\frac{Q}{A}\) from two different documents, the expression \(\frac{Q}{S}\) may express the same quantity as \(\frac{Q}{A}\), given that in the examined documents, Q expresses the same math concept in both expressions, and S and A can both be used to express the surface area. Thus, disambiguating S and A should also help to disambiguate \(\frac{Q}{S}\) and \(\frac{Q}{A}\). Conversely, Fig. 1b shows that in different documents, the math expression \(\frac{Q}{S}\) is used either for surface charge density or superficial velocity, depending on whether Q represents total electric charge or volume flow rate. Again, disambiguating the meaning of the subexpression Q with respect to these two meanings helps to determine the meaning of the larger expression.

Fig. 1

The meaning of a math expression depends on the meaning of its constituent symbols. a Two math expressions from different documents that have different representations, that is, \(\frac{Q}{S}\) and \(\frac{Q}{A}\), express the same math concept. b Two math expressions from different documents that have the same representation, that is, \(\frac{Q}{S}\), express different math concepts

Textual information that explains a mathematical expression is typically contained in the text that surrounds the expression. However, this surrounding text does not always contain the explanation for the symbols that occur within the expression. This issue could be resolved naively by giving the surrounding text a very wide window size; however, this risks including many words that do not necessarily explain the expression or its constituent symbols. Thus, instead of naively expanding the window size of the surrounding text, we propose capturing the dependency relationships between each math expression and its constituent subexpressions in a document. For instance, given three math expressions \(\frac{Q}{A}\), Q, and A that occur in a document, we extract two dependency relationships, that is, between \(\frac{Q}{A}\) and Q, and between \(\frac{Q}{A}\) and A, as shown in Fig. 1. Subsequently, we can use the meanings (as captured in the surrounding text) of Q and A to determine the meaning of the larger expression \(\frac{Q}{A}\). In this paper, we use the term “dependency graph” to refer to the set of all constituent expressions together with the set of ordered pairs that encode the individual dependencies. This proposed dependency graph allows us to enrich the textual information of each math expression without the need to expand the window size of the surrounding text.
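This extraction idea can be illustrated with a minimal sketch; here expressions are treated as plain LaTeX strings, and a naive substring test stands in for the structural matching developed in Sect. 3, so the function is illustrative only:

```python
def extract_dependencies(expressions):
    """Return directed edges (parent, child) in which the child expression
    occurs as a subexpression of the parent expression."""
    edges = []
    for parent in expressions:
        for child in expressions:
            # Naive containment test; Sect. 3 replaces this with matching
            # over the structure of Presentation MathML.
            if child != parent and child in parent:
                edges.append((parent, child))
    return edges

# The example from Fig. 1: Q/A depends on its constituent symbols Q and A.
print(extract_dependencies([r"\frac{Q}{A}", "Q", "A"]))
```

Running the sketch on the three expressions above yields the two dependency relationships \(\frac{Q}{A} \rightarrow Q\) and \(\frac{Q}{A} \rightarrow A\).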

The majority of previous studies on math search systems (Hambasan et al. 2014; Kristianto et al. 2014c; Lipani et al. 2014; Pattaniyil and Zanibbi 2014; Pinto et al. 2014; Růžička et al. 2014; Wang et al. 2015) exploited both math expressions and free text within a document collection to address queries that contain both math expressions and textual keywords. These works associated each math expression with text found in the containing document, for example, all words found in the document or words surrounding the expression. By contrast, our method associates each math expression not only with its surrounding words, but also with words that are related to the symbols contained in the expression. We extend our previous work (Kristianto et al. 2014a) by presenting a better heuristic to capture these dependency relationships between math expressions. We also evaluate our heuristic using the NTCIR-11 Math-2 Task (Aizawa et al. 2014) dataset, which contains more topics (50) than the dataset used in our initial work (13).

The contributions of this paper are as follows:

  • We propose the concept of a dependency graph of mathematical expressions.

  • We also propose a heuristic method to construct dependency graphs and evaluate its effectiveness using manually annotated data.

  • We validate the effectiveness of the dependency graph for enriching two types of textual information related to math expressions, that is, contextual words and descriptions, and for improving the retrieval results of math search systems.

The remainder of this paper is organized as follows: In Sect. 2, we introduce several recent works on the extraction of textual information related to math expressions and the development of math search systems, including the available test collections. In Sect. 3, we explain the dependency graph of mathematical expressions, and describe our proposed heuristic construction method and the application of this dependency graph in math search systems. In Sect. 4, we present experimental results that demonstrate the effectiveness of our heuristic method for constructing a dependency graph of math expressions. In Sect. 5, we examine the influence of these dependency graphs on a math search system. Finally, in Sect. 6, we provide our conclusion and ideas for future work.

2 Related work

In this paper, we focus on using dependency graphs to enrich textual information related to math expressions. Then, we investigate their influence on math search systems. Thus, previous works related to this paper can be summarized into three categories: development of math search systems, textual information related to math expressions, and test collections to evaluate math search systems.

2.1 Development of mathematical search systems

Information needs in math information retrieval (MIR) can be classified into several categories, as shown in Table 1. To satisfy these information needs, the majority of current MIR research has focused on the development of search techniques based on query-by-expression (Kohlhase and Kohlhase 2007; Zhao et al. 2008; Zanibbi and Blostein 2012).

Table 1 Categories of information needs in math information retrieval derived by Kohlhase and Kohlhase (2007), Zhao et al. (2008), Zanibbi and Blostein (2012)

Kamali and Tompa (2013a) found that searching for mathematical expressions and retrieving relevant documents based on their mathematical content is not straightforward. The following reasons were identified:

  • Mathematical expressions are objects with a complex structure but few distinct symbols and terms. The symbols and terms alone are usually inadequate for distinguishing between different math expressions.

  • Relevant math expressions may include small variations in presentation.

  • The majority of published math expressions are encoded with respect to their presentation, and most instances do not preserve sufficient semantic information.

As a result, formula indexing and retrieval has become one of the key issues in math expression retrieval (Zanibbi and Blostein 2012). Many math search systems have attempted to solve this issue, with the most popular implementation reducing the search for formulae to a full-text search. Other techniques (Guidi and Sacerdoti Coen 2015) have also been proposed, such as substitution tree indexing, reduction to Structured Query Language (SQL) queries, and reduction to Extensible Markup Language (XML) search; however, fewer systems have been derived from these. We now provide a review of several math search systems.

2.1.1 Reduction to full text searches

The most popular implementation of a math search system is to reduce the search for formulae to a full-text search. Several early studies on math search systems, for example, ActiveMath (Libbrecht and Melis 2006), MathGO! (Adeel et al. 2008), DLMF (Miller and Youssef 2003; Youssef 2005, 2006, 2007a, b), MathFind (Munavalli and Miner 2006), and Mathdex (Miner and Munavalli 2007), implemented full-text search technology. In ActiveMath (Libbrecht and Melis 2006), mathematical data are represented in Open Mathematical Documents (OMDoc) format. These data are tokenized and then indexed using the Lucene search engine. MathGO! (Adeel et al. 2008) uses a template-based approach to identify and search for math expressions. MathFind (Munavalli and Miner 2006) translates each math expression in a document into a sequence of text-encoded math fragments, and then indexes this sequence together with all words found within the document.

The Digital Library of Mathematical Functions (DLMF) search system (Miller and Youssef 2003; Youssef 2005, 2006, 2007a, b) is a fully math-aware fine-grained search system that supports access to fine-grained targets, such as equations, figures, tables, definitions, and named rules and theorems. In its implementation, the system considers mathematical expressions as a collection of mathematical terms. Hashimoto et al. (2007) and Hijikata et al. (2007) proposed a search engine for Mathematical Markup Language (MathML) objects using the structure of mathematical expressions, whereby inverted indices were constructed using the Document Object Model (DOM) structure and represented in XPath.

Several of the more recent math search systems can be classified as follows:

  • Extracting literals (identifiers, constants, and operators) from mathematical expressions

    The Qualibeta system (Pinto et al. 2014) extracts features such as categories, a set of identifiers, a set of constants, operators, and a set of unique identifiers to represent each math expression.

  • Extracting and generalizing the substructures of math expressions

    WikiMirs (Hu et al. 2013) is a tree-based indexing system that indexes all substructures of the formulae in LaTeX, considering the generalization of the substructures. Subsequent versions of WikiMirs (Gao et al. 2014; Lin et al. 2014; Wang et al. 2015) use a semantic enrichment technique to extract useful semantic information from math expressions, and then apply hierarchical generalization to the substructures of the expressions to support substructure matching and fuzzy matching. The MIaS system (Líška 2013; Růžička et al. 2014; Sojka 2012; Sojka and Líška 2011b) focuses on a similarity search based on canonical MathML. It processes each math expression by applying ordering, subexpression extraction, variable unification, and constant unification steps to the math expression. TUW (Lipani et al. 2014) implements a tokenizer that starts from the tree structure of the math expressions, and then extracts and linearizes the tokens. This tokenizer slices a math expression tree by levels, and collapses each subexpression obtained from the slicing step.

  • Extracting the placement of symbols in math expressions

    The Tangent system (Pattaniyil and Zanibbi 2014) captures the structure of math expressions by generating symbol pair tuples from a symbol layout tree representation of the expression. These symbol pair tuples describe the relative placement of symbols in an expression. Our previous work, the MCAT system (Kristianto et al. 2014a, c; Topić et al. 2013), captures the content and structure of math expressions by encoding the paths between nodes within the tree structure of the expressions.
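The symbol-pair tuples described above can be sketched roughly as follows; the tree encoding (a symbol plus labeled child subtrees) and the edge labels "sup" and "next" are illustrative assumptions rather than Tangent's actual representation:

```python
def descendants(node, path):
    """Yield (symbol, path) for the given node and every node below it."""
    symbol, children = node
    yield (symbol, path)
    for label, child in children:
        yield from descendants(child, path + (label,))

def symbol_pairs(node):
    """Emit (ancestor, descendant, relative path) tuples from a symbol
    layout tree given as (symbol, [(edge_label, subtree), ...])."""
    symbol, children = node
    pairs = []
    for label, child in children:
        for desc, path in descendants(child, (label,)):
            pairs.append((symbol, desc, path))
        pairs.extend(symbol_pairs(child))
    return pairs

# Layout tree for x^2 + y: "x" carries a superscript "2" and is followed
# ("next") by "+", which is in turn followed by "y".
tree = ("x", [("sup", ("2", [])), ("next", ("+", [("next", ("y", []))]))])
print(symbol_pairs(tree))
```

Each emitted tuple records the relative placement of two symbols, which is what makes partial structural matches between expressions possible.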

2.1.2 Other approaches

Another type of approach used in math search systems is based on substitution tree indexing (Graf 1996), which was originally developed for automatic theorem proving to store lemmas and quickly retrieve formulae up to instantiation or generalization. MathWebSearch (Hambasan et al. 2014; Kohlhase and Sucan 2006; Kohlhase et al. 2012) is based on this approach. Additionally, Schellenberg (2011) and Schellenberg et al. (2012) introduced a system for the layout-based (LaTeX) indexing and retrieval of math expressions using substitution trees.

Other math search systems aim to approximate math expressions through a series of relations to be stored in a relational database. Asperti and Selmi (2004) generated an SQL query for each given query formula that is represented by a set of relations. The recall of their system can be maximized by relaxing such relations or using normalization. The fourth type of math search system reduces the search for math expressions to an XML search. For instance, the FSE system (Schubotz et al. 2013) formalized formula queries using XQuery/XPath technology. This system, which accessed data in a non-index format, was batch oriented and did not normalize the input in any manner. In addition to the systems described above, there is a lattice-based system developed by Nguyen et al. (2012b) that cannot be classified into any of the aforementioned categories. This system extracts math features from the MathML-formatted math expressions and then constructs a mathematical concept lattice using these features.

In addition to math expression indexing, several studies have focused on determining the ranking function for formula retrieval. Yokoi and Aizawa (2009) proposed a similarity search scheme for mathematical expressions based on a subpath set and reported that it works well on “simplified” Content MathML. Sojka and Líška (2011a) computed a weight for each indexed math expression that describes how different this expression is from its original representation. They then attempted to create a complex and robust weighting function that would be appropriate for documents from many scientific fields. Nguyen et al. (2012a) proposed an online learning-to-rank model to learn a scoring function after extracting math expression features based on the Content MathML format. Their model outperformed other standard information retrieval models; however, there was no comparison with other math-specific similarity functions. Kamali and Tompa (2013a) proposed a tree edit distance-based method to calculate the structural similarity between two expressions, and also introduced a pattern-based search to enable flexible matching of expressions in a controlled manner. Furthermore, Kamali and Tompa (2013b) described optimization techniques to reduce the index size and query processing time required by this tree edit distance-based method. Zhang and Youssef (2014) proposed five math similarity measure factors: taxonomy of functions and operators, data type hierarchical level of the math expressions, depth of matching position, query coverage, and whether math expressions are categorized as formulae or non-formulae. Schubotz et al. (2014) evaluated each of these five factors individually and confirmed that each factor is relevant to math similarity, with a note that the last factor mentioned above is of lower relevance than the other four factors.

2.2 Extraction of textual information for mathematical expressions

Previous work on extracting textual information for math expressions can be classified into two categories: the first assumes that the meaning of a math expression can be found in the text preceding and/or following the expression (Grigore et al. 2009; Pinto and Balke 2015; Wolska et al. 2011), and the second attempts to extract only those terms that precisely describe the math expression (Kristianto et al. 2014b; Nghiem et al. 2010; Yokoi et al. 2011).

2.2.1 Using surrounding text to capture the meaning of math expressions

One of the earliest works on extracting textual information for mathematical expressions actually focused on resolving the semantics of mathematical expressions (Grigore et al. 2009). This approach uses the natural language within which math expressions are embedded to resolve their semantics and enable their disambiguation. It focuses on mathematical expressions that are syntactically part of a nominal group and in an apposition relation with the immediately preceding noun phrase, that is, the target expressions match the linguistic pattern “\(\ldots\) noun_phrase symbolic_math_expression \(\ldots\).” By assuming that a target mathematical expression can be disambiguated using its left context, this method disambiguates mathematical expressions by computing the term similarity between the local lexical context of a given expression, that is, all nouns appearing in the five-word window to the left of the target expression, and a set of terms from term clusters based on the OpenMath Content Dictionaries. The calculated similarity score determines the disambiguation of the target expression. Wolska and Grigore (2010) complemented the work of Grigore et al. (2009) by conducting three corpus-based studies on the declarations of simple symbols in mathematical writing. This work counted how many mathematical symbols were explicitly declared in 50 documents. Subsequently, Wolska et al. (2011) determined that each target mathematical expression has a local and a global lexical context. The local context is a set of domain terms that occur within the immediately preceding and following linguistic context. The global context is a set of domain terms that occur in the declaration statements of the target expression or of other expressions that are structurally similar to the target expression. These two types of lexical context were then used to enhance the disambiguation work.
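A rough sketch of this left-context scheme follows; the sense clusters are invented for illustration (the original clusters were built from the OpenMath Content Dictionaries), a plain set-overlap score stands in for the authors' term-similarity computation, and the noun filtering of the original method is skipped:

```python
def local_context(tokens, expr_index, window=5):
    """Words in the fixed-size window to the left of the target expression."""
    return set(tokens[max(0, expr_index - window):expr_index])

def disambiguate(tokens, expr_index, sense_clusters):
    """Choose the sense cluster with the largest overlap with the context."""
    context = local_context(tokens, expr_index)
    return max(sense_clusters,
               key=lambda sense: len(context & sense_clusters[sense]))

# Illustrative sense clusters, keyed by sense name.
clusters = {"probability": {"posterior", "probability", "random"},
            "linear algebra": {"matrix", "vector", "basis"}}
# The target expression is assumed to follow the last token.
tokens = "the posterior probability of the parameter".split()
print(disambiguate(tokens, len(tokens), clusters))
```

With this context, the overlap with the "probability" cluster is largest, so that sense is selected for the expression.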
Another recent study by Pinto and Balke (2015) extracted the sense of math expressions by applying latent Dirichlet allocation over context words surrounding the expressions. They concluded that the senses obtained from the context words were helpful for classification, such as predicting the mathematics subject classification (MSC) classes from a document collection.

2.2.2 Extracting specific descriptions of math expressions

Nghiem et al. (2010) proposed text matching and pattern matching methods to mine mathematical knowledge from Wikipedia. These methods extracted coreference relations between formulae and the concepts that refer to them. Yokoi et al. (2011) extracted textual descriptions of mathematical expressions from Japanese scientific papers. This work considered only compound nouns as the description candidates. When the descriptions are expressed as complex noun phrases that contain prepositions, adverbs, or other noun phrases, only the final compound noun in the phrase (Japanese is a head-final language) is annotated and extracted. A subsequent study (Kristianto et al. 2012a) proposed a set of annotation guidelines by introducing full and short descriptions, and enabling multiple descriptions to be related. Following this, two applications based on the extracted descriptions of mathematical formulae were introduced: semantic search and semantic browsing (Kristianto et al. 2012b). Finally, Kristianto et al. (2014b) applied a machine learning method to a dataset that was annotated using the latest set of annotation guidelines.

2.2.3 Using textual information in math search systems

Prior to the indexing process, several math search systems extract textual information to associate with mathematical expressions. These systems store both the math expressions and the associated textual information in their indexes to enable mathematical expressions to be searched using queries that contain both math expressions and textual keywords. Nguyen et al. (2012a) presented a math-aware search engine that manages both textual keywords and mathematical expressions, and benchmarked the system using math documents crawled from an online math question answering system, MathOverflow. Several systems (Hambasan et al. 2014; Lipani et al. 2014; Pattaniyil and Zanibbi 2014; Růžička et al. 2014) construct a word vector for each math expression by associating the expression with all words found in the same document or paragraph. WikiMirs 3.0 (Wang et al. 2015) regards the preceding and following paragraphs of a math expression as the context of the expression. Similarly, Qualibeta (Pinto et al. 2014) and MCAT (Kristianto et al. 2014c) associate each expression with its surrounding text. We consider that, when only the surrounding text is associated with a math expression, the proportion of words actually related to that expression is higher than when all words in the document are used.

2.3 Test collections for math search systems

Most of the early works on math search systems used specially generated datasets to evaluate the resulting systems. Few test collections, each consisting of a document set, a list of topics, and relevance assessments of pooled results, are publicly available for the evaluation of math search systems.

Kamali and Tompa (2013a) generated an evaluation dataset that consists of pages from Wikipedia and DLMF, and contains a total of 863,358 math expressions. The 98 queries in this dataset were produced from an interview process (45) and a mathematics forum (53). During the interview process, invited students and researchers were asked to search for math expressions of potential interest to them in a practical situation. To prepare queries from the mathematics forum, Kamali and Tompa (2013a) gathered discussion threads, each of which can be described with a query that consists of a single math expression. For each discussion thread, the authors manually assessed whether a given math expression, together with the page that contained it, could satisfy the information need of the user who started the thread. Queries in the dataset contained math expressions, but no textual keywords.

The NTCIR-10 Math Pilot Task (Aizawa et al. 2013) was the first attempt to develop a common evaluation dataset for math formula searches based on a pooling method. This dataset includes 100,000 scientific papers, 21 topics for formula searches (each query contains only math formulae), and 15 topics for full-text searches (each query contains both a list of formulae and a list of textual keywords). For each topic, 100 retrieval units (math expressions) from the pooled results were assessed. Topics in this task express several user information needs, for example, in Table 1, specific or similar math formulae (category 1), theorems (category 2), examples (category 3), solutions (category 4), and applications (category 6).

The NTCIR-11 Math-2 Task (Aizawa et al. 2014) dataset consists of 105,120 scientific articles (with approximately 60 million math expressions) converted to HyperText Markup Language (HTML) + MathML-based format and 50 full-text search topics, which express the same information need categories as those in the Math Pilot Task. This task focused on full-text searches, and used paragraph-level retrieval units (8,301,578 units in the dataset). Fifty retrieval units (paragraphs) from pooled results (from 20 submitted runs) were assessed for each topic. Each retrieval unit was assessed as being non-relevant, partially relevant, or highly relevant. In addition to this Math-2 Task, the Wikipedia Open Subtask (Aizawa et al. 2014; Schubotz et al. 2015) used the Wikipedia dataset instead of scientific articles.

The aforementioned general-purpose MIR test collections focus on accommodating information needs of varying complexity. A recent work by Stathopoulos and Teufel (2015) focused on the retrieval of research-level mathematical information and the construction of an MIR test collection for these needs. They regarded the questions on the online collaboration website MathOverflow as the information needs, and then attempted to retrieve scientific publications that answered these questions. To construct the topics, they examined each identified MathOverflow discussion thread for conformance to two criteria: (1) useful questions should express an information need that is clear and can be satisfied by describing objects or properties, stating conditions, and/or producing examples or counter-examples, and (2) scientific documents cited in the MathOverflow accepted answers should address all subparts of the question in a manner that requires minimal deduction and does not synthesize mathematical results from multiple resources. Topics in their dataset had the form of coherent text interspersed with math expressions. Most of these topics represented the need for answers to mathematical questions (cf. Table 1). The relevance judgments for the constructed topics were procured from the answers in the corresponding discussion threads in MathOverflow.

In this paper, we evaluate the effectiveness of the dependency graph for improving the precision of math search systems for retrieving math expressions relevant to a given query, which is a combination of keywords and math expressions. For evaluation purposes, we use the NTCIR-11 dataset because it provides more topics for a full-text search than the NTCIR-10 dataset. We consider that the other available datasets are not suitable for this evaluation. The dataset from Kamali and Tompa (2013a) contains topics only for formula searches (each query contains only math formulae). By contrast, the topics from Stathopoulos and Teufel (2015) are in the form of mathematical questions, and not all these topics contain math expressions.

3 Constructing dependency graphs from math expressions

The meaning of a mathematical expression in a scientific work can often be located in the surrounding text. This text can be a set of words appearing in a fixed-size word window immediately preceding and/or following the math expression (hereinafter called the context window), or a set of terms that precisely describes the meaning of the math expression (hereinafter called the description). However, in addition to this surrounding text, the meaning of the expression can often be located in other regions of the document, especially if the expression appears several times. Additionally, to understand a math expression, we often need to search the examined document for the meaning of each constituent symbol. To address these challenges, we propose a method to capture the dependency relationships between the mathematical expressions within each document. We call the set of such relationships obtained for each document a dependency graph.

3.1 Dependency graph of math expressions

A simple example of a dependency graph is shown in Fig. 2. A dependency graph G(V, E) of mathematical expressions is a directed graph in which the set V of vertices represents the set of distinct mathematical expressions contained in a document. Each math expression in this graph is associated with its textual descriptions, as shown by the textual information contained in each vertex in Fig. 2. To define the dependency graph, we use Assumption 1. The first condition in Assumption 1 uses the semantics of math expressions (i.e., the meaning of expressions captured from the surrounding text). The remaining conditions exploit the syntax (structure) of the math expressions.

Assumption 1

In a document, each of the following conditions is sufficient to determine that two math expressions represent the same math concept:

  • The examined document specifies that both math expressions express the same math concept,

  • Both math expressions have the same visually rendered form,

  • Both math expressions have at least the same base form (i.e., the same structure or letters are used to denote math identifiers, including variables, function names, and symbolic constants) with minor modifications (e.g., superscripts and subscripts), or

  • For a math expression representing an (in)equality, its subexpression on the left-hand side of the (in)equality symbol has the same presentation or base form, with minor modification, as the other math expression or its left-hand side subexpression. Note that it is assumed in this paper that a math expression representing an (in)equality states the property of the subexpression on the left-hand side of the (in)equality symbol. Thus, this left-hand side subexpression is the object explained by the (in)equality.

A further assumption is that, when two math expressions describe the same math concept, these two expressions share the same meaning.
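The notion of a shared base form in the third condition can be approximated for LaTeX input; the regular expressions below handle only simple sub- and superscripts, so this is an illustrative sketch rather than a complete treatment:

```python
import re

def base_form(latex):
    """Strip sub- and superscripts (the 'minor modifications') from a
    LaTeX string, leaving only the base identifiers."""
    stripped = re.sub(r"[_^]\{[^{}]*\}", "", latex)  # braced scripts, e.g. x_{i}
    return re.sub(r"[_^].", "", stripped)            # bare scripts, e.g. x_i

def same_base_form(a, b):
    return base_form(a) == base_form(b)

print(same_base_form("x_i", "x_j"))    # both reduce to the base form x
print(same_base_form("s=vt", "F=ma"))  # different base identifiers
```

Under this approximation, \(x_i\) and \(x_j\) share the base form x and would be related, whereas unrelated formulae such as \(s=vt\) and \(F=ma\) would not.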

If a distinct form of some mathematical expression appears several times in a document, these multiple appearances are considered to share the same meaning. As a result, all textual information extracted for these appearances is merged into a set, and this set is associated with the distinct form of the math expression in the dependency graph.

As mentioned at the beginning of Sect. 3, understanding a math expression often requires access to the meaning of each of its symbols. To meet this requirement, given two different math expressions, \(mathexp_1\) and \(mathexp_2\), occurring in the same scientific article, the dependency graph is designed to relate \(mathexp_1\) to \(mathexp_2\) when \(mathexp_2\) appears as a subexpression of \(mathexp_1\). Within the edge set E, a directed edge from vertex \(mathexp_1\) to \(mathexp_2\) implies that the base form (i.e., without any minor modifications) of \(mathexp_2\) is a subexpression of expression \(mathexp_1\). In this relationship, vertices \(mathexp_1\) and \(mathexp_2\) are called the parent and child, respectively. Additionally, a bidirectional edge \(mathexp_1 \leftrightarrow mathexp_2\) between the two math expressions, \(mathexp_1\) and \(mathexp_2\), means that these two expressions satisfy every requirement specified in Assumption 1. Using the above assumption, these two expressions have the same meaning because they express the same math concept. For instance, in Fig. 2, there is a bidirectional edge between \(p(\theta |\mathbf {x_i}) = \frac{p(\mathbf {x_i}|\theta )p(\theta )}{p(\mathbf {x_i})}\) and \(p(\theta |{\mathbf {x}})\) because both expressions explain \(p(\theta |{\mathbf {x}})\). The former expression defines \(p(\theta |{\mathbf {x}})\) using Bayes’ theorem, whereas the latter simply mentions \(p(\theta |{\mathbf {x}})\) as the posterior probability. By using the dependency graph, we can enrich the textual information of a target math expression by including textual information from all other expressions connected to the target expression by outgoing edges.
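A minimal data structure realizing this idea might look as follows; the class and method names are our own illustrative choices, and a bidirectional edge can be represented simply as a pair of opposing directed edges:

```python
from collections import defaultdict

class DependencyGraph:
    def __init__(self):
        self.descriptions = defaultdict(set)  # vertex -> merged descriptions
        self.children = defaultdict(set)      # vertex -> its subexpressions

    def add_occurrence(self, expr, description):
        # Repeated occurrences of one distinct form share a single vertex,
        # so their textual information is merged into one set.
        self.descriptions[expr].add(description)

    def add_edge(self, parent, child):
        self.children[parent].add(child)

    def enriched_text(self, expr):
        """Descriptions of expr plus those of every expression reachable
        via outgoing edges."""
        text, stack, seen = set(), [expr], set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            text |= self.descriptions[node]
            stack.extend(self.children[node])
        return text

g = DependencyGraph()
g.add_occurrence(r"\frac{Q}{A}", "surface charge density")
g.add_occurrence("Q", "total electric charge")
g.add_occurrence("A", "surface area")
g.add_edge(r"\frac{Q}{A}", "Q")
g.add_edge(r"\frac{Q}{A}", "A")
print(sorted(g.enriched_text(r"\frac{Q}{A}")))
```

Here the textual information of \(\frac{Q}{A}\) is enriched with the descriptions of its constituent symbols Q and A, without widening any context window.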

Fig. 2

Example of a dependency graph for seven mathematical expressions. Each mathematical expression is followed by its corresponding description

3.2 Naive methods to construct a dependency graph

At least two techniques for constructing dependency graphs have been proposed in previous studies: string matching (Kristianto et al. 2014a) and unification (Líška 2013; Růžička et al. 2014). Both methods assume that math expressions are available in Presentation MathML format because several tools can generate Presentation MathML from LaTeX (e.g., LaTeXML), Portable Document Format (PDF) files (e.g., Maxtract), and handwritten text and digitized documents (e.g., Infty Reader). They are briefly described as follows:

  1. String matching: This approach determines whether math expression \(mathexp_2\) is a subexpression of \(mathexp_1\) by performing string matching over the Presentation MathML format of the two expressions. If the Presentation MathML of \(mathexp_1\) contains that of \(mathexp_2\), then string matching considers that the relationship \(mathexp_1 \rightarrow mathexp_2\) holds.

  2. Unification: Líška (2013) and Růžička et al. (2014) applied a modified unification method to the Presentation MathML format of math expressions for math searching. This method substitutes all math identifiers (values of the MathML element mi) with unified symbols and, unlike the common unification method, all numbers (values of the MathML element mn) with one unified symbol. To determine whether two math expressions match, Líška (2013) and Růžička et al. (2014) applied string matching to the generalized forms (i.e., the forms after substitution) of the two expressions. For instance, \(x^5 + y^5\) matches \(a^2+b^3\) because both expressions are unified to \(var_1^{const} + var_2^{const}\) (assuming that x, y, a, and b are represented by mi elements in the MathML format and 2, 3, and 5 are represented by mn elements). To avoid over-generalization, this method does not unify stand-alone variables or constants. Given math expressions, we can implement this approach (i.e., apply the unification method to the math expressions, and then perform string matching over their generalized forms) to capture the dependency relationships between them.

These two methods, however, have some drawbacks. String matching fails to relate expressions with minor modifications, for example, \(x_i\) and \(x_j\), even though they have the same base form. The unification method followed by string matching generalizes all variables and all constants. Thus, it may determine that two math expressions are related when they are not, for example, the displacement formula \(s=vt\) and Newton’s second law of motion \(F=ma\).
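The over-generalization drawback can be demonstrated with a minimal sketch of the unification step. This is a regex-based approximation of our own, not the MIaS implementation, and it ignores the stand-alone-variable exception: the displacement formula and Newton’s second law collapse to the same generalized string.

```python
import re

def unify(mathml: str) -> str:
    """Replace every <mi> identifier with a positional placeholder and
    every <mn> number with a single shared placeholder (sketch only)."""
    placeholders = {}

    def sub_mi(match):
        name = match.group(1)
        if name not in placeholders:
            placeholders[name] = f"var{len(placeholders) + 1}"
        return f"<mi>{placeholders[name]}</mi>"

    generalized = re.sub(r"<mi>([^<]+)</mi>", sub_mi, mathml)
    return re.sub(r"<mn>([^<]+)</mn>", "<mn>const</mn>", generalized)

# s = vt and F = ma become indistinguishable after unification:
s_vt = "<mrow><mi>s</mi><mo>=</mo><mi>v</mi><mi>t</mi></mrow>"
f_ma = "<mrow><mi>F</mi><mo>=</mo><mi>m</mi><mi>a</mi></mrow>"
print(unify(s_vt) == unify(f_ma))
```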

3.3 Dependency graph construction

Given the drawbacks of the two techniques described above, we propose a new heuristic method for constructing dependency graphs. The pseudo code for the proposed method is presented in Algorithm 1, and its implementation is provided online. We base our heuristic method on string matching between math expressions, with five normalization steps applied to each expression prior to the matching procedure. Similar to the two construction methods described above, our method also assumes that math expressions are available in Presentation MathML format.


The normalization steps do not necessarily output an expression equivalent to the input expression. These steps are applied so that the string matching procedure obtains the relationship between math expressions when either:

  • The two expressions have the same base form or, when they are (in)equalities, the same left-hand side subexpression (which implies that the two math expressions have a similar meaning), OR

  • One math expression is a subexpression of another expression (which implies that the meaning of the former helps to determine the meaning of the latter).

The relationships are obtained even though the examined expressions have different representations.

The first normalization step minimizes the number of mrow elements within each math expression. This element is used for grouping other math elements. We deem an mrow element unnecessary if removing it from the Presentation MathML of a math expression produces MathML that is still valid under the MathML document type definition (DTD).
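A simplified sketch of this step is shown below. Instead of the paper’s DTD-based validity check, it unwraps only mrow elements with exactly one child, which is always a redundant grouping under the MathML grammar; the actual check is more general.

```python
import xml.etree.ElementTree as ET

def minimize_mrows(parent):
    """Recursively unwrap <mrow> elements that contain exactly one
    child; such a grouping is redundant, so removing it keeps the
    MathML valid."""
    for i, child in enumerate(list(parent)):
        minimize_mrows(child)
        if child.tag == "mrow" and len(child) == 1:
            parent[i] = child[0]  # splice the single grandchild into place

expr = ET.fromstring("<math><mrow><mrow><mi>x</mi></mrow></mrow></math>")
minimize_mrows(expr)
print(ET.tostring(expr, encoding="unicode"))
```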

The second normalization step obtains the main math expression enclosed by a pair of outermost parentheses. For instance, \((\ A = [a_{i}]\ )\) is normalized to \(A = [a_{i}]\). In all the cases we examined, the outermost parentheses enclosing inline math expressions do not carry significant meaning; they are often used to emphasize the separation of math expressions from the surrounding text. These outermost parentheses may cause string matching to fail to capture a bidirectional relationship between two supposedly identical mathematical expressions; that is, string matching captures only one of the two directions. Removing the outermost parentheses solves this issue.
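The idea behind this step can be sketched at the plain-string level (the actual method operates on Presentation MathML). The balance check ensures that only parentheses enclosing the whole expression are removed.

```python
def strip_outer_parens(expr: str) -> str:
    """Remove outermost parentheses only when they enclose the whole
    expression (plain-string sketch of the normalization step)."""
    expr = expr.strip()
    while expr.startswith("(") and expr.endswith(")"):
        depth = 0
        for i, ch in enumerate(expr):
            if ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
                if depth == 0 and i < len(expr) - 1:
                    return expr  # parens close before the end: keep them
        expr = expr[1:-1].strip()
    return expr

print(strip_outer_parens("( A = [a_i] )"))
```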

The third step is to obtain a base expression from each math expression by removing all attachments, that is, scripts attached to the base expression. For instance, \(x^2\) is normalized to x. Based on the MathML 2.0 specification, there are eight elements targeted by this normalization step: mroot (root index), msub (subscript), msup (superscript), msubsup (subscript and superscript), munder (underscript), mover (overscript), munderover (underscript and overscript), and mmultiscripts (subscripts, superscripts, and prescripts). By applying this step, we obtain a bidirectional relationship between expressions, such as \(x^3\) and \(x^2\).
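A minimal sketch of attachment removal over a Presentation MathML tree, relying on the convention (per the MathML 2.0 specification) that the base expression is the first child of each script element:

```python
import xml.etree.ElementTree as ET

# Script elements whose first child is the base expression (MathML 2.0).
ATTACHMENT_TAGS = {"msub", "msup", "msubsup", "munder", "mover",
                   "munderover", "mroot", "mmultiscripts"}

def strip_attachments(parent):
    """Replace every script element by its base (first child), recursively."""
    for i, child in enumerate(list(parent)):
        strip_attachments(child)
        if child.tag in ATTACHMENT_TAGS and len(child):
            parent[i] = child[0]

expr = ET.fromstring("<math><msup><mi>x</mi><mn>2</mn></msup></math>")
strip_attachments(expr)
print(ET.tostring(expr, encoding="unicode"))
```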

The fourth normalization step, which normalizes math expressions such as \(f(x) = x+c\) to f(x), is applied because subexpressions are often defined several times within a document. For example, the function f(x) may be defined as \(f(x) = x+c\) and \(f(x) = x+5\) in a document. These expressions should be related to each other because both define f(x). To capture a bidirectional relationship between \(f(x) = x+c\) and \(f(x) = x+5\) using string matching, we need to apply string matching to the left-hand side subexpression, that is, f(x), of these two expressions. We assume that the left-hand side subexpression of an equality is the object explained by the equality. Thus, we extract the left-hand side subexpression and not the right-hand side subexpression. We extend this idea to inequalities because when we state an equality or inequality, it is often the case that we want to state a property of its left-hand side subexpression. To obtain the left-hand side subexpression, we locate the top-level node that contains one of the standard (in)equality relation symbols and split the MathML tree at that node.
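A sketch of left-hand side extraction is given below; the set of relation symbols is an assumption on our part (the paper refers to a footnoted list of standard (in)equality symbols).

```python
import xml.etree.ElementTree as ET

RELATION_SYMBOLS = {"=", "<", ">", "≤", "≥", "≠"}  # assumed symbol set

def left_hand_side(row):
    """Return a copy of `row` truncated at the first top-level <mo>
    holding an (in)equality symbol, or `row` itself if none occurs."""
    for i, child in enumerate(row):
        if child.tag == "mo" and (child.text or "").strip() in RELATION_SYMBOLS:
            lhs = ET.Element(row.tag, row.attrib)
            lhs.extend(row[:i])
            return lhs
    return row

row = ET.fromstring(
    "<mrow><mi>f</mi><mo>(</mo><mi>x</mi><mo>)</mo>"
    "<mo>=</mo><mi>x</mi><mo>+</mo><mi>c</mi></mrow>"
)
print(ET.tostring(left_hand_side(row), encoding="unicode"))
```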

We normalize the symbol case, for example, X is normalized to x, as the final normalization step. When uppercase and lowercase versions of a letter appear as a math symbol within a document, they sometimes express a collection (i.e., set, vector, or matrix) and its elements, respectively. Even though a collection and its elements are not equivalent, we consider their meanings to be closely related. An element represents a math object, and a collection is merely an accumulation of several objects. This final step allows us to relate a symbol representing a collection to other symbols representing the elements of that collection.

3.4 Exploiting dependency graphs for math search systems

To demonstrate the potential of dependency graphs, we can apply them to math search systems. In such systems, we use dependency graphs to enrich the textual information of each math expression.

3.4.1 Textual information associated with math expressions

In this paper, we use two types of textual information to capture the meaning of math expressions, that is, words in the context window and descriptions, which are described as follows:

  • Words in the context window represent textual information that can be obtained rapidly because it is not necessary to use a natural language parser to obtain it. In the present study, we used a window size of 10 words preceding and following the target mathematical expression.

  • Descriptions were automatically extracted for each mathematical expression using a support vector machine (SVM) model (Kristianto et al. 2014b) that was trained on a manually annotated corpus created for the NTCIR-10 Math Task. The description extraction problem was treated as a binary classification problem by first pairing each mathematical expression with its description candidates, that is, all noun phrases that occur in the same sentence as the expression, and then classifying the pairs as correct or not.

Examples of these two types of textual information are provided in Table 2. Let us consider a mathematical expression \(d_i\), shown in Fig. 3. By using the surrounding text, we can extract two types of textual information for the target expression \(d_i\), as displayed in Table 2.

Table 2 Example of textual information extracted for the target mathematical expression \(d_i\) in the sentence “Given n points \((x_i, y_i)\), \(1 \le i \le n\), the objective function is defined by \(F = \sum _{i=1}^{n} d_{i}^{2}\) where \(d_i\) is the Euclidean (geometric) distance from the point \((x_i, y_i)\)”

3.4.2 Using a dependency graph to enrich the textual information of math expressions

In this paper, we use a dependency graph to enrich the textual information, that is, words in the context window and descriptions, of each math expression by including textual information from its child expressions. In this scenario, the dependency graph helps to determine the meaning of math expressions, thus improving the search results.

For instance, let us consider the three mathematical expressions shown in Fig. 3, all taken from Chernov and Lesort (2003): \(F = \sum _{i=1}^{n}{d_{i}^{2}}\) (described as the “objective function”), F (described as a “nonlinear function”), and \(d_{i}\) [described as the “Euclidean (geometric) distance from the point \((x_i, y_i)\)”]. From these three math expressions, we build a dependency graph as shown in Fig. 4 by drawing two edges, that is, connecting \(F = \sum _{i=1}^{n}{d_{i}^{2}}\) with F and \(d_{i}\). Subsequently, we search for expressions that are similar to the query formula \(\sum _{i=1}^{n}{d_{i}^{2}}\) and related to the query text “minimizing the sum of squares of geometric distances.” We expect the expression \(F = \sum _{i=1}^{n}{d_{i}^{2}}\) to be highly ranked, which requires this expression to be related to the concept described by the query text. However, although the original paper describes the target expression as the sum of squares of Euclidean (geometric) distances, no term from the description “objective function” of the target expression matches any term from the query text. Consequently, other expressions that resemble the query formula and have only the term “sum of squares” in their descriptions may be ranked higher than the target expression. These other expressions may actually be irrelevant to the query because there is no explanation of \(d_i\), which is the object of squaring and summing, in their descriptions. For instance, these expressions may actually relate to other concepts, such as sum convergence or a geometric series test. Thus, without a dependency graph, the target expression may not appear with a sufficiently high rank in the list of returned results.

When we use a dependency graph, however, we obtain matches between the query text and expanded textual description of the target expression \(F = \sum _{i=1}^{n}{d_{i}^{2}}\). The use of the dependency graph expands the description of the target expression to contain the descriptions of F and \(d_{i}\). The expanded description contains the terms “objective, function, nonlinear, function, Euclidean, geometric, distances, from, the, point.” Two of these terms match the query text, that is, “geometric” and “distances,” with higher inverse document frequency (IDF) weights (Sparck Jones 1972) than common words in the query text, such as “minimizing,” “sum,” and “squares.” The use of a dependency graph ensures that these term matches give the target expression a higher rank than it would have otherwise.
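The effect of IDF weighting here can be illustrated with a toy sketch; the corpus below is invented purely for illustration, with “geometric” and “distances” rare and “sum” ubiquitous.

```python
import math

def idf(term, docs):
    """Inverse document frequency: log(N / df) (Sparck Jones 1972)."""
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df) if df else 0.0

# Toy corpus (invented): rare terms receive higher IDF, so matching
# "geometric" or "distances" lifts the target expression more than
# matching the common term "sum".
docs = [
    {"sum", "squares", "series", "convergence"},
    {"sum", "squares", "ratio", "test"},
    {"sum", "geometric", "distances", "objective"},
    {"sum", "squares", "partial"},
]
print(idf("geometric", docs), idf("squares", docs), idf("sum", docs))
```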

Fig. 3

Excerpt from a scientific article titled “Least squares fitting of circles and lines” (Chernov and Lesort 2003) followed by the MathML representation of the shaded math expressions. The dependency graph of the shaded math expressions is shown in Fig. 4

Fig. 4

Use of a dependency graph to place a relevant math expression at a high rank

4 Dependency graph construction experiment

The first experiment conducted in this paper is to measure the effectiveness of our proposed heuristic method for constructing a dependency graph. In this section, we describe the manual construction of gold-standard dependency graphs from several documents containing math expressions. Then, based on this gold-standard dataset, we discuss the performance of the baseline and our proposed methods for constructing dependency graphs.

4.1 Data construction

To evaluate whether Assumption 1 holds and whether our proposed heuristic method can effectively extract dependency relationships between math expressions, we first had to manually construct gold-standard dependency graphs, each from a document containing math expressions. We chose five scientific documents, as shown in Table 3, from the public Association for Computational Linguistics (ACL) Anthology, and then created one gold-standard dependency graph from each of these documents.

Table 3 List of scientific documents from which gold-standard dependency graphs were created

Because the documents were originally in PDF format, we first converted them into XML format using the InftyReader, which is an Optical Character Recognition (OCR)-based application. We manually checked the XML files to correct OCR and MathML recognition errors.

The manual construction of the dependency graphs involved extracting relationships representing the following:

  • Equivalence of math concepts expressed by the math expressions (specified by points 1, 3, and 4 of the annotation guideline defined below),

  • Collection-and-element relationships (point 5), and

  • Appearance of a math expression as a subexpression of another math expression (point 2).

The guidelines for annotating dependency relationships are given as follows:

  1. A bidirectional edge between two variables indicates that these variables represent the same math concept. For instance, in a document that defines a certain similarity score \(sim(u,v)\) between two data points u and v, we obtain \(u \longleftrightarrow v\) because the examined document specifies that u and v express the same concept, that is, a data point.

  2. Given two different individual math expressions \(m_1\) and \(m_2\), an edge is drawn from \(m_1\) to \(m_2\) if \(m_2\) appears as a subexpression of \(m_1\), for example, \(f(x) = ax + b \longrightarrow ax + b,\ a,\ x,\ b\).

  3. Given an expression m that represents an (in)equality, the subexpression on the left-hand side of the (in)equality symbol is used as another representation of m. Thus, we can draw a bidirectional edge between two (in)equality expressions that have the same left-hand side subexpression, for example, \(f(x) = ax + b \longleftrightarrow f(x) = 5x + 6\).

  4. Reordering variables and constants within an expression is allowed if the reordered expression is equivalent to the original expression, that is, both expressions evaluate to the same value. For instance, given \(ax+b\) and \(b+ax\) from a document, if the annotator determines that the ‘\(+\)’ operator in the examined document is commutative, we have \(ax + b \longleftrightarrow b + ax\).

  5. As an exception to rule 1, a bidirectional edge is also drawn between a collection, such as a set, vector, or matrix (e.g., X), and its elements (e.g., \(x_i\)). They do not express the same concept but are closely related to each other.

The syntax (structure) of math expressions is used in points 2–4 of these guidelines. By contrast, the semantics of math expressions are used in points 1 and 5 of the guidelines. Additionally, we specify the following simplifications to the annotation process.

  • To determine whether two expressions are equivalent, we avoid performing algebraic rewriting (except for the application of commutativity as per rule 4 of the guidelines) and accessing knowledge other than the two given expressions. For instance, given two equivalent expressions \(log(\frac{b}{c})\) and \(log\ b - log\ c\), we do not draw an edge between them because doing so would require additional knowledge about logarithmic identities. As another example, given three expressions \(sim(A,B) = cos(\theta )\), \(sim(A,B)\), and \(cos(\theta )\), we draw edges connecting \(sim(A,B) = cos(\theta )\) with \(sim(A,B)\) and \(sim(A,B) = cos(\theta )\) with \(cos(\theta )\). However, we do not relate \(sim(A,B)\) and \(cos(\theta )\), given that this would require access to the third expression \(sim(A,B) = cos(\theta )\).

  • If several expressions originally composed a single math expression, it is not necessary to draw edges between them. For instance, \(y = f(x) + c = ax + b + c\) can sometimes be recognized in the OCR process as three separate expressions: y, \(f(x) + c\), and \(ax + b + c\). In these situations, we do not draw any edges between these constituent expressions because this requires knowledge that is not provided by the OCR tool.

From the five scientific documents used as the dataset, we obtained 438 math expressions, of which 230 were distinct. The sum of the number of edges over all five dependency graphs manually constructed from the documents was 4,181 (with bidirectional edges counted as two edges). We found that these edges connected 2,815 pairs of math expressions.

4.2 Experimental settings

To implement the dependency graph construction, we proposed a string matching-based method strengthened by the preprocessing steps described in Sect. 3.3. We then evaluated the proposed method using the manually constructed gold-standard dependency graphs. For comparison, we applied the unification method, which is frequently used in math search systems to match two expressions, as the baseline method. We adopted the unification method implemented by the MIaS search system (Růžička et al. 2014) and applied it to the Presentation MathML form of the math expression. We then performed string matching using the generalized expressions to capture the dependency relationships between them.

Among the five normalization steps applied in our method, mrow minimization is the only step that does not generalize math expressions. As this step only addresses the MathML-specific formatting, we also applied this normalization step in the baseline method (prior to the unification process). To evaluate the performance of the dependency graph construction methods, we report the micro and macro-average of the precision, recall, and F1-score metrics, which are defined as follows:

$$\begin{aligned} Precision&= \dfrac{\left| \{ \text {gold-standard dependency rel.} \} \cap \{ \text {extracted dependency rel.} \}\right| }{\left| \{ \text {extracted dependency rel.} \}\right| } \\ Recall&= \dfrac{\left| \{ \text {gold-standard dependency rel.} \} \cap \{ \text {extracted dependency rel.} \}\right| }{\left| \{ \text {gold-standard dependency rel.} \}\right| } \\ F1&= \dfrac{2 \times Precision \times Recall}{Precision + Recall}. \end{aligned}$$

In the micro-average evaluation, we constructed a global confusion matrix that covered five scientific documents used in the experiment, and then calculated the precision and recall using this matrix. In the macro-average evaluation, by contrast, we first calculated the precision and recall for each scientific document, and then determined their average. In each evaluation method, the F1-score is the harmonic mean of the obtained precision and recall.
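The two averaging schemes can be sketched as follows, using hypothetical per-document edge counts. Note that in both schemes, F1 is computed as the harmonic mean of the obtained precision and recall, as stated above.

```python
def prf(tp, fp, fn):
    """Precision, recall, and F1 from edge-level counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# Hypothetical per-document (tp, fp, fn) counts, for illustration only.
per_doc = [(80, 20, 40), (10, 10, 10)]

# Micro-average: pool the confusion counts over all documents first.
tp = sum(c[0] for c in per_doc)
fp = sum(c[1] for c in per_doc)
fn = sum(c[2] for c in per_doc)
micro = prf(tp, fp, fn)

# Macro-average: compute precision/recall per document, then average;
# F1 is the harmonic mean of the averaged precision and recall.
scores = [prf(*c) for c in per_doc]
macro_p = sum(s[0] for s in scores) / len(scores)
macro_r = sum(s[1] for s in scores) / len(scores)
macro_f = 2 * macro_p * macro_r / (macro_p + macro_r)

print("micro:", micro)
print("macro:", (macro_p, macro_r, macro_f))
```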

4.3 Results

The dependency graph construction results are presented in Table 4. Even without using the preprocessing steps described in Sect. 3, our proposed string matching method (method 2 in Table 4) delivered a higher F1-score than the unification method. It also provided much higher precision, but lower recall, than unification (method 1). This lower recall occurs because our string matching method cannot generalize math expressions without the preprocessing steps.

Table 4 Dependency graph construction results

As we complemented our method with the preprocessing steps (methods 3–6), its recall and F1-score improved, although the precision decreased as a consequence. The improvements in recall occurred when we removed attachments (method 4) from the expressions and extracted the left-hand side subexpressions. Based on the order in which the preprocessing steps were applied, attachment removal was the first step in the generalization process. Compared with the generalization conducted using the unification method, this step delivered significantly higher precision and recall. This is because attachment removal maintains the base expression, for example, \(x^2\) is generalized to x, whereas unification generalizes all identifiers and numbers, but does not preserve the base, for example, \(x^2\) is generalized to \(var^{const}\). Preserving the base allowed our method to deliver higher precision, and simply removing the attached information without inserting any placeholder enabled our method to achieve higher recall.

The highest recall was obtained when all preprocessing steps, including case normalization (method 6), were applied. However, while normalizing the case added several points of recall, it significantly degraded the precision, causing the F1-score to decrease significantly (p value \(< 0.05\)). This indicates that math expressions containing the same letter in different cases should be considered different math concepts, and it is unlikely that there is any relationship between them. This suggests that we should complement our string matching method with only the first three preprocessing steps (method 5), that is, without case normalization, to effectively construct dependency graphs.

There are still several edges in the gold-standard dependency graph that were not extracted by our method. The most frequently encountered case was when a math concept was represented using different identifiers. Although, in many cases, such a concept is expressed using identifiers with at least the same base expression, for example, \(x_1\) and \(x_n\) to represent data points, a document may represent this concept with completely different identifiers, for example, variables x and y to represent points from which the Euclidean distance is to be calculated.

We also investigated several false positive edges extracted by our method. A frequent problem was caused by imprecise OCR. When a math identifier is composed of several letters, for example, sim to express a similarity score, it may be recognized by OCR as a multiplication of several variables represented by the constituent letters, for example, sim is recognized as \(s \times i \times m\). This multiplication does not change the presentation of the math expression, even though the MathML representation is different. This issue, also reported by Larson et al. (2013), caused our method to capture incorrect relationships between math expressions, for example, relating sim with the variable i. Another false positive case occurred when an expression held multiple meanings within a document, for example, when the same math identifier was used to express several unrelated math concepts within a document.

5 Math search experiment

The second experiment in this paper was performed to examine the influence of the dependency graph when used in a math search system. In this section, we first describe the general framework of the math search system implemented for this experiment. We use the NTCIR-11 Math-2 Main Task dataset (Aizawa et al. 2014) to discuss the influence of the dependency graphs in the implemented search system. We compare the obtained search performance with that of the baseline system.

5.1 Overview of the math search system implementation

We implemented a math search system to examine the influence of the proposed dependency graphs. Figure 5 shows the framework of the implemented system. It involves the following steps:

  1. Build the database of math expressions:

    1.1. Extract and encode mathematical expressions from a document collection.

    1.2. Capture the textual information describing each expression.

    1.3. For each document, construct the dependency graph of math expressions.

    1.4. Use the relationships within the dependency graph to enrich the textual information for each expression; that is, each math expression is also associated with text describing the related expressions (e.g., individual variables inside the expression).

    1.5. Index each encoded math expression, together with its associated textual information.

  2. Accept a user query, match the query with data stored in the math expression database, and rank the relevant math expressions.

Steps 1.1, 1.5 (usually without indexing the textual information), and 2 are the common pipeline of many math search systems. Steps 1.2 to 1.4 are the additional processes in the proposed system.

Fig. 5

Overview of the math search system used in this paper. A general framework of math search systems is shown by the sequence of boxes with black borders. In this paper, there are two additional processes, shown as boxes with dashed borders, in the pipeline: the extraction of textual information for each math expression and the construction of a dependency graph (to enrich the math expressions) for each document

5.1.1 Indexing math expressions

Based on the results of the NTCIR-10 Math Pilot Task (Aizawa et al. 2013), which allowed participating systems to evaluate their techniques for indexing math expressions, we found that several recent math search systems (MCAT, MIaS, MathWebSearch, and FSE) had a similar capability to capture the content and structure of math expressions. In this paper, without loss of generality, we used the DOM-based technique implemented in our MCAT system (Kristianto et al. 2014a, c). For math expressions represented in Presentation MathML format, this technique encodes the structure and content of each expression and stores the results in a database. The storage schema for the encoding results has three fields: opaths (ordered paths), which store the vertical path of each node in the math expression and preserve the ordering information; upaths (unordered paths), which store the same content as opaths but without ordering information; and sisters, which store the sibling nodes in each subtree. This encoding technique is used at both index-time and query-time. However, at query-time, opaths and upaths always start from the root of the math expression, whereas at index-time, opaths and upaths are generated for every subtree of the math expression being indexed.
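The three fields can be illustrated with a simplified sketch of path extraction. This is our own reading of the schema, and the actual MCAT encoding differs in detail (e.g., the treatment of subtrees at index-time versus query-time).

```python
import xml.etree.ElementTree as ET

def label(e):
    """Label a node by its tag and, if present, its text content."""
    text = (e.text or "").strip()
    return e.tag + ("=" + text if text else "")

def opaths(e, trail=""):
    """Vertical paths with child-position indices (ordering preserved)."""
    out = [trail + "/" + label(e)]
    for i, c in enumerate(e):
        out += opaths(c, f"{trail}/{label(e)}:{i}")
    return out

def upaths(e, trail=""):
    """The same vertical paths without positional indices."""
    out = [trail + "/" + label(e)]
    for c in e:
        out += upaths(c, trail + "/" + label(e))
    return out

def sisters(e):
    """Sibling groups: the child labels of every internal node."""
    out = [[label(c) for c in e]] if len(e) else []
    for c in e:
        out += sisters(c)
    return out

expr = ET.fromstring("<msup><mi>x</mi><mn>2</mn></msup>")
print(opaths(expr))
print(upaths(expr))
print(sisters(expr))
```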

5.1.2 Extracting textual information associated with math expressions

As we described in Sect. 3.4, two types of textual information are used in this paper: words in the context window and descriptions. Prior to indexing each type of extracted textual information, the following additional processes are applied:

  • Tokenize the string,

  • Eliminate any mathematical expressions, and

  • Stem the remaining tokens.
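These steps can be sketched as follows; the suffix-stripping stemmer below is a crude stand-in for a real stemmer (e.g., Porter’s), and the MathML-stripping regex is an assumption about how math expressions are marked up in the text.

```python
import re

def preprocess(text: str) -> list:
    """Tokenize, drop inline <math>...</math> fragments, and apply a
    crude suffix stemmer (illustrative stand-in for a real stemmer)."""
    text = re.sub(r"<math.*?</math>", " ", text, flags=re.S)
    tokens = re.findall(r"[a-z]+", text.lower())

    def stem(t):
        for suffix in ("ing", "es", "s"):
            if t.endswith(suffix) and len(t) > len(suffix) + 2:
                return t[: -len(suffix)]
        return t

    return [stem(t) for t in tokens]

print(preprocess("Minimizing the sums <math><mi>x</mi></math> of squares"))
```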

5.1.3 Constructing dependency graphs of math expressions

A dependency graph of mathematical expressions was created for each document in the dataset. We first applied the set of preprocessing procedures defined in Sect. 3 to the Presentation MathML form of each math expression. However, we excluded the case normalization step because this was found to have a negative influence on the F1-score in the dependency graph construction experiment. Subsequently, a dependency graph was built by performing string matching over the preprocessed and normalized MathML of each expression in the document. An edge was drawn from expression \(m_1\) to \(m_2\) if the MathML representation of \(m_1\) contained the MathML of \(m_2\).
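The containment test can be sketched as follows, with toy normalized MathML strings standing in for the output of the preprocessing steps.

```python
def build_dependency_edges(normalized):
    """Given {expression_id: normalized MathML string}, draw an edge
    m1 -> m2 whenever the MathML of m2 occurs inside that of m1."""
    return {(m1, m2)
            for m1, s1 in normalized.items()
            for m2, s2 in normalized.items()
            if m1 != m2 and s2 in s1}

# Toy normalized forms (invented for illustration).
normalized = {
    "F=sum": "<mrow><mi>F</mi><mo>=</mo><mi>d</mi></mrow>",
    "F":     "<mi>F</mi>",
    "d":     "<mi>d</mi>",
}
edges = build_dependency_edges(normalized)
print(sorted(edges))
```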

For this experiment, we used Apache Solr as the search platform. We imported the encoding result and associated text of each math expression into Solr. There are seven fields in our storage schema: three fields for storing the encoding results (opaths, upaths, and sisters), two fields for textual information (i.e., words in the context window and descriptions), and another two for textual information obtained using the dependency graph (i.e., context windows and descriptions from child expressions, which are expressions connected to the target expression by outgoing edges).

5.2 Experimental settings

5.2.1 Dataset

For the math search experiment, we used the NTCIR-11 Math-2 Main Task dataset (Aizawa et al. 2014). Eight MIR systems participated in this task; thus, we can compare the performance of our proposed system with that of state-of-the-art systems by simply reusing their runs submitted to this task. This dataset consists of a document set, a list of topics, and assessments of the pooled results. The document set contains 105,120 scientific papers, which are automatically divided into 8,301,578 retrieval units (i.e., paragraphs) containing approximately 60 million math expressions. There are 50 topics in the dataset, each of which includes one or more query expressions and/or keywords. The topics' math expressions in the NTCIR-11 Math-2 Task may contain query variables. During the retrieval process, each of these query variables can be matched to a variable, an operator, or even a subexpression. Table 11 lists examples of these topics.

The relevance assessments provided by this dataset were based on the pooling of a total of 20 runs (from eight participating MIR systems, including our system) submitted to the NTCIR-11 Math-2 Task. From the pooled results, 50 retrieval units were selected for each topic. The task organizer selected retrieval units for assessment as evenly as possible from all the runs based on the ranking order in the individual submitted runs. During the assessment process, to ensure sufficient familiarity with the mathematical documents, the evaluators were chosen from third-year and graduate (pure) mathematics students. Each retrieval unit was assessed by two evaluators, each of whom judged it relevant (R), partially relevant (PR), or non-relevant (N). The evaluators assessed the relevance of a hit to the query by comparing it with the formulae and their contexts in the retrieval unit. There was, however, no specific instruction regarding how relevance was to be assessed; the evaluators had to rely on their mathematical intuition, the described information need, and the query itself. Aizawa et al. (2014) reported that the evaluators were relatively lenient, assessing a hit as partially relevant if there was considerable overlap in symbols or if the respective keywords were found in the result.

5.2.2 Experimental design

In this experiment, we investigated the performance of our math search system when exploiting math expressions' structures, textual information (i.e., context windows or descriptions), and dependency graphs (the third and fifth models in Table 6). For comparison, we set the baseline to be the performance of our math search system when exploiting only the structures and textual information of math expressions, without a dependency graph (the second and fourth models in Table 6). Of the two types of textual information used, we expected the use of descriptions to deliver better results than the use of context windows: Kristianto et al. (2014a) found that using textual information specific to each math expression (the description) provides better ranking results than using arbitrary textual information surrounding each expression (the context window), which is obtained without any further check to ensure that it actually describes the target expression.

In addition to the aforementioned models, we also include the performance when our search system used only the structures of math expressions (the first model in Table 6). We compared the search performance of our system with that of MIaS (Růžička et al. 2014), the top performer in the NTCIR-11 Math-2 Task. The performance scores for MIaS were computed using the system’s submission to the NTCIR-11 Math-2 Task, which used the Presentation MathML format of the query formulae.

5.2.3 Evaluation method

Specific to the dataset used in this paper, three matters were taken into account during the experiment: query variables inside query formulae, retrieval unit dissimilarity, and incomplete assessment. First, we addressed the query variables by simply excluding from the encoded results the vertical paths directed to them. Retrieval unit dissimilarity occurred because the retrieval unit in our database schema was a math expression, whereas the assessment result in the dataset expected the retrieval unit to be a paragraph. To manage this issue, we needed a procedure to generate a ranked list of paragraphs after obtaining a ranked list of math expressions for a given query. In this study, we set the score of a paragraph to be the highest score obtained by any expression within that paragraph.
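The paragraph-scoring step described above can be sketched as follows: each paragraph receives the maximum score of the expressions it contains, and paragraphs are then sorted by that score. This is a minimal Python illustration with hypothetical names, not the actual implementation.

```python
def rank_paragraphs(expr_scores, paragraph_of):
    """Aggregate expression scores into paragraph scores by taking, for
    each paragraph, the highest score of any expression it contains,
    then return paragraphs sorted by descending score.
    `expr_scores` maps expression id -> retrieval score;
    `paragraph_of` maps expression id -> containing paragraph id."""
    para_score = {}
    for expr, score in expr_scores.items():
        p = paragraph_of[expr]
        para_score[p] = max(para_score.get(p, float("-inf")), score)
    return sorted(para_score.items(), key=lambda kv: -kv[1])
```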

The incomplete assessment issue arose because the number of assessed units per topic was very small compared with the total number of retrieval units in the dataset. To manage this issue, we first generated a ranked list of all retrieval units that matched the query. Next, we created a condensed list (Sakai 2009) from the raw ranked list by removing all unassessed retrieval units. This evaluation enabled us to accurately compare the search performance with and without a dependency graph because we could use all the assessed units for evaluation. This approach was also used when we compared our system with MIaS (based on the runs these systems submitted to the NTCIR-11 Math-2 Task). We first obtained the top 100 and top 1,000 retrieval units per query from each system (our system and MIaS), and generated condensed lists from the initial ranked lists. The performance of each system was then measured over these condensed lists.
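The condensed list construction (Sakai 2009) amounts to filtering the raw ranked list down to the assessed units while preserving their order, optionally after truncating to the top-k units. The following Python sketch illustrates this; the names are ours and the fragment is not the actual implementation.

```python
def condensed_list(ranked_units, judged, top_k=None):
    """Build a condensed list: keep only retrieval units that appear in the
    set of assessed (judged) units, preserving the original ranking order.
    If `top_k` is given, truncate the raw ranked list first, as in the
    top-100 and top-1000 settings."""
    if top_k is not None:
        ranked_units = ranked_units[:top_k]
    return [u for u in ranked_units if u in judged]
```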

To evaluate the ranking performance, we used trec_eval to report the precision-at-5 (P@5), precision-at-10 (P@10), and mean average precision (MAP) metrics. Because trec_eval accepts only binary relevance judgments, the scores of the two evaluators were converted into an overall relevance score using the mapping table shown in Table 5.

Table 5 Relevance score mapping

5.3 Results

The influence of using dependency graphs in our implemented math search system is shown in Table 6. Because we mostly use condensed lists to report the results, we present in Table 8 the number of unassessed retrieval units that were removed at several rank cutoffs.

Table 6 Influence of using dependency graphs to enrich different types of textual information on search performance

The first part of the table shows the performance of our math search system without condensed list construction. The reported precision is quite low because there are many unassessed retrieval units in the top 100. As shown in Table 8, there are on average four and eight unassessed retrieval units in the top five and top 10 of the initial ranked lists, respectively. We suggest that this also caused the P@5 and P@10 improvement obtained using dependency graphs to be statistically insignificant. However, the MAP improvement delivered by the dependency graphs is statistically significant.

The second part of Table 6 shows the search performance based on the condensed list constructed from the top 100 retrieval units. The use of dependency graphs improves P@5 and P@10, but not significantly. We suggest that this is because there are too few assessed units in the top 100: based on Table 8, there are on average only approximately five assessed retrieval units in the top 100 of the initial ranked lists. There is, however, a statistically significant improvement in MAP resulting from the use of dependency graphs, which indicates that dependency graphs allow our search system to return a better ordering of retrieval units.

The third part of the table shows the precision based on the condensed list from the top 1,000 retrieval units. The trend is the same as above: dependency graphs deliver a precision improvement. Because there are now more retrieval units in the constructed condensed lists, the dependency graphs improve MAP, P@5, and P@10 significantly.

The positive influence of dependency graphs is also evident when we measure precision based on the condensed list created from a deeper initial ranked list, that is, one containing all retrieval units. The results are shown in the last part of Table 6. The use of dependency graphs successfully enriches the textual information (whether context window or description) of each indexed math expression and delivers precision that is statistically significantly (\(p<0.05\)) higher than when no dependency graph is used. Compared with using only math expressions as indexed information, incorporating both textual information and dependency graphs leads to even higher precision. These experimental results indicate that dependency graphs enable textual information from child expressions to be effectively used to represent the target expressions.

5.3.1 Comparing the search performance of using dependency graphs with that of the other system

We analyzed the influence of textual information and dependency graphs in our system by comparing its precision with that of MIaS (Růžička et al. 2014). In this analysis, we used the context window as the textual information, considering that Table 6 suggests it performs better than descriptions. Table 7 summarizes the performance of our system and MIaS.

Table 7 Performance comparison between our system and MIaS
Table 8 Average number of unassessed documents removed in the early ranks, that is, top five, top 10, top 20, top 50, top 100, top 500, and top 1000, for each query

The first part of Table 7 compares the performance of our system and MIaS using condensed lists constructed from the top 100 retrieval units. In the high relevancy setting, the precision of MIaS is higher than that of our system (both with and without dependency graphs), although not statistically significantly so. In the partial relevancy setting, however, there is a significant difference between the performance of MIaS and that of our system. We suggest that this occurs because the performance of MIaS was measured using the run it submitted to the NTCIR-11 Math-2 Task, and consequently there are more assessed retrieval units in its condensed lists than in ours. Table 8 shows that there are on average 20 assessed retrieval units in MIaS’s top 100 units, whereas there are only five in our system’s top 100 units.

The second part of Table 7 shows the search performance when we created a condensed list from the top 1000 units. In the high relevancy setting, the precision achieved by MIaS is still slightly higher than that of our system, except at P@5. However, two-tailed t tests at this relevancy setting show that, in fact, there is no statistically significant difference between MIaS and our system (both with and without dependency graphs). The difference in precision between MIaS and our system also still appears slightly larger for partial relevance than for high relevance. The significance tests indeed show that at P@5 and P@10, MIaS statistically significantly (p value \(< 0.05\)) outperformed our search system using the context window, with p values of 0.0365 and 0.0259, respectively. Introducing the dependency graph into our system reduced these differences, which were then no longer statistically significant. Based on this analysis, in which our ranked lists contain more assessed units than in the top 100 retrieval unit setting, we suggest that our system using both the context window and dependency graphs is comparable with MIaS.

We further evaluated the influence of using dependency graphs in our math search system in the NTCIR-12 MathIR Task (Zanibbi et al. 2016; Kristianto et al. 2016), in which we also compared the performance of our search system with the other participating systems. Based on our runs submitted to this shared task, we found that the influence of dependency graphs is consistent with our findings in this paper: they had a major influence on our system, delivering a significant precision improvement. This shared task also revealed that our search system provided the highest performance in the arXiv subtask.

5.3.2 Comparing the search performance of using the context window with using descriptions

We initially expected the description-based search system to outperform the context window-based system. However, the experimental results show the opposite. For completeness, we briefly examine the cause of this contradiction.

Table 6 indicates that the precision obtained when using descriptions as textual information was lower than when using the context window. To investigate the lower performance delivered by the descriptions, we measured how well each type of textual information matched the textual keywords in the queries. Table 9 presents the precision and recall resulting from this experiment. The precision and recall used for this measurement are defined as follows:

$$\begin{aligned} Precision &= \frac{\left| \text{relevant assessed paragraphs retrieved by textual keywords only}\right| }{\left| \text{all assessed paragraphs retrieved by textual keywords only}\right| } \\ Recall &= \frac{\left| \text{relevant assessed paragraphs retrieved by textual keywords only}\right| }{\left| \text{all relevant assessed paragraphs}\right| }. \end{aligned}$$
(1)

For a given query, a paragraph was retrieved if any of its constituent math expressions was associated with textual information containing any stemmed term from the textual keywords specified in the query.
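Under this retrieval criterion, the precision and recall of Eq. (1) could be computed as in the following Python sketch. The stemmer is passed in as a parameter, and all function and variable names are illustrative assumptions rather than part of our actual implementation.

```python
def keyword_retrieval_pr(paragraphs, query_terms, relevant, stem):
    """Compute the precision and recall of Eq. (1).
    `paragraphs` maps an assessed paragraph id to the set of words in the
    textual information of its expressions; `relevant` is the set of
    assessed paragraphs judged relevant; `stem` is a stemming function.
    A paragraph is retrieved if any of its words, after stemming, matches
    any stemmed query term."""
    terms = {stem(t) for t in query_terms}
    retrieved = {p for p, words in paragraphs.items()
                 if terms & {stem(w) for w in words}}
    rel_retrieved = retrieved & relevant
    precision = len(rel_retrieved) / len(retrieved) if retrieved else 0.0
    recall = len(rel_retrieved) / len(relevant) if relevant else 0.0
    return precision, recall
```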

Table 9 Performance of the context window and descriptions for retrieving relevant paragraphs

Table 9 reports that descriptions enabled relevant paragraphs to be retrieved with slightly higher precision than the context window. Even after being complemented with dependency graphs, however, descriptions retrieved far fewer relevant paragraphs. We believe that this lower recall caused searches using descriptions as textual information to deliver lower precision than searches using the context window.

5.3.3 Search performance of using the context window with different widths

We measured the contribution of the dependency graph when enriching context windows of different widths. The context window used in the experiments above originates from the same sentence as the target math expression. Here, we report the search performance using context windows of width \(w=5\) and \(w=10\), as well as using all words in the containing paragraph as the context window (\(w=para\)). The results are shown in Table 10.
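A context window of width w, bounded by the sentence containing the expression, can be sketched as follows. This is an illustrative Python fragment with hypothetical names, not the system's implementation; the \(w=para\) setting would instead take all words in the containing paragraph.

```python
def context_window(sentence_tokens, expr_index, w):
    """Return up to `w` words on each side of the math expression at
    position `expr_index`, staying within the expression's sentence.
    `sentence_tokens` is the tokenized sentence, with the expression
    occupying a single token position."""
    left = sentence_tokens[max(0, expr_index - w):expr_index]
    right = sentence_tokens[expr_index + 1:expr_index + 1 + w]
    return left + right
```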

Table 10 Performance comparison between different widths of context window. Precision is measured from the condensed list created from the raw ranked list containing all retrieval units

For all width options, the use of dependency graphs significantly improves precision, and among the width options there is no statistically significant difference. Without dependency graphs, the performance of \(w=10\) is the highest. This indicates that \(w=5\) might be too short for many useful terms to appear in the context window, whereas \(w=para\) may introduce many terms that are unrelated to the target expressions.

With the use of dependency graphs, there is no conclusive performance difference between \(w=5\) and \(w=10\). For a high relevancy setting, \(w=10\) appears to perform better than \(w=5\); however, for the partial relevancy setting, the opposite occurs. By contrast, the combination of dependency graphs and \(w=para\) performs worse than \(w=10\) for most measurements.

These experimental results show that the context window with \(w=10\) outperforms \(w=5\) for most measurements, that is, except when dependency graphs and the partial relevancy setting are used. We verified that the \(w=10\) setting covers most words in the containing sentences: for the dataset we used, the average number of words to the left and right of each math expression was 14.37 and 11.30, respectively. Thus, for a context window that originates from the same sentence as the expression, a width that covers most of the terms in that sentence is a good option for a math search system. Additionally, a context window drawn from the same sentence as the target expression is more effective than one covering the entire paragraph. Finally, these results illustrate that for any selected width option, incorporating dependency graphs enhances search precision.

5.4 Analysis of ranking results

Two examples of the ranking results are given in Table 11. This table shows the influence of using both textual information and dependency graphs when searching for math expressions. The first part of the table exhibits the search results when 10 words were used in the context window as textual information, and the second part shows the results of using descriptions as textual information.

5.4.1 Using dependency graphs to enrich the contextual window

The first part of the table uses \(\lim \nolimits _{n\rightarrow \infty }\int \nolimits _{?X}?f_{n}\ d?u= \int \nolimits _{?X}\lim \nolimits _{n\rightarrow \infty }?f_{n}\ d?u\) as the query math expression and the term “dominated convergence and Lebesgue” as the text of the query. This query asks for expressions regarding Lebesgue’s dominated convergence theorem. In this query math expression, identifiers preceded by “?” denote query variables. When we used the textual keywords in the query to complement the query expression and searched for these keywords in the context, we obtained two highly relevant and two partially relevant expressions in the top five results. The highly relevant expressions were ranked first and second because both were similar to the query expression and their contextual information contained all the textual keywords. The remaining three expressions did not contain any of the textual keywords; their appearance in the top five is a result of their similarity to the query expression.

When we searched for the query keywords in the context window obtained from the dependency graph, we obtained an additional relevant expression in the top five: two highly relevant and three partially relevant expressions. The first two were the same as the first two from the previous search, and the third and fourth were the same as the fourth and fifth from the previous search. These passed over the math expression \(\displaystyle \lim \nolimits _{n\rightarrow \infty }\int _{T}{d\psi _{n}}=-\left| T\right|\), the third result in the previous search, because they had matching terms in their extended context (from the dependency graph). Additionally, the final expression in the top five also passed over the previous third expression because its context from the dependency graph contained a matching keyword.

5.4.2 Using dependency graphs to enrich descriptions

The second part of Table 11 provides the results of using \(\frac{1}{n^{?s}}\) as the query expression, and the terms “Riemann” and “zeta” as keywords to search for expressions describing the Riemann zeta function. This query expression denotes a simple fraction with a power in its denominator. Consequently, when a math expression in the database matched the query formula but not the query textual keywords, the expression was likely to contain a fraction but was not necessarily relevant to the query. By incorporating descriptions during the search, we obtained expressions that were not only similar, but also relevant to the query. The system retrieved three highly relevant expressions and one partially relevant expression, all four of which contained the specified textual keywords in their descriptions.

When we complemented this search by applying keywords in the descriptions from the dependency graph, we obtained another highly relevant expression. The first four expressions were the same as the first four from the search without the dependency graph. The final expression in the top five passed over the expression \(\delta \ge \frac{1}{n^{c}}\), which was the fifth expression returned in the previous search, because its extended description contained the query keywords. This last expression, \(\zeta (s)=\sum _{n>0}\frac{1}{n^{s}}\), did not have any associated descriptions. However, the dependency graph constructed using our method captured a dependency between this expression and the notation \(\zeta _K(s)\). Figure 6 displays an excerpt of the dependency graph. This dependency enables the fifth returned expression to extend its descriptions using those extracted for \(\zeta _K(s)\), that is, “Dedekind zeta function” and “an analogue of the Riemann zeta function, which is closely related to the algebraic integers in a number field.” The query keywords “Riemann” and “zeta” appeared in the extended descriptions and improved the score, and ensured that \(\zeta (s)=\sum _{n>0}\frac{1}{n^{s}}\) became a top five result.

Fig. 6

Excerpt from the dependency graph showing the dependencies between \(\zeta (s)=\sum _{n>0}\frac{1}{n^{s}}\) and other math expressions

Table 11 Examples of ranking results

6 Conclusion

We proposed an approach using dependency graphs between mathematical expressions to improve the search results of math search systems. We first introduced a heuristic technique to construct dependency graphs from scientific papers. The experimental results showed that the construction technique delivered an F1-score of 87%. Subsequently, we used dependency graphs to enrich the textual information of mathematical expressions and investigated their influence on a math search system. The experimental results showed that the use of dependency graphs in the math search system delivered 13% higher precision.

In future work, we plan to use dependency graphs of math expressions to construct a graph of concepts from scientific documents. If we can extract concepts or knowledge to represent each math expression, we can use the dependency relationships between math expressions in the dependency graph to build a concept graph. Such graphs could then be used to provide the following: (1) a summary list of concepts expressed by math expressions in each document and (2) a coherent list of concepts required to understand a given concept.

Additionally, there is room for improvement in the heuristic method proposed in this paper. To scale up the coverage of the dependency graphs, that is, to cover multiple scientific documents, the present heuristic method may not be sufficiently robust. Our method is based on the assumption that for two math expressions, having the same (base) representation or the same left-hand side subexpressions is equivalent to sharing the same meaning. However, to develop a dependency graph that covers multiple documents, this assumption cannot be used because different documents may use different math symbols to represent the same math concept and the same math symbol to represent different math concepts. We will consider extracting a dependency graph by exploiting more information related to each math expression, for example, by exploiting both the encoded MathML representation (e.g., opaths, upaths, and sisters) and the words surrounding each math expression.