Optimizing subgraph matching over distributed knowledge graphs using partial evaluation

Song, Yanyan; Qin, Yuzhou; Hao, Wenqi; Liu, Pengkai; Li, Jianxin; Choudhury, Farhana Murtaza; Wang, Xin; Zhang, Qingpeng

doi:10.1007/s11280-022-01075-6

Open access
Published: 08 July 2022

Volume 26, pages 751–771, (2023)

World Wide Web

Yanyan Song¹,
Yuzhou Qin¹,
Wenqi Hao¹,
Pengkai Liu¹,
Jianxin Li²,
Farhana Murtaza Choudhury³,
Xin Wang ORCID: orcid.org/0000-0001-9651-0651¹ &
…
Qingpeng Zhang⁴

Abstract

The partial evaluation and assembly framework has recently been applied for processing subgraph matching queries over large-scale knowledge graphs in the distributed environment. The framework is implemented on the master-slave architecture, endowed with outstanding scalability. However, there are two drawbacks of partial evaluation: if the volume of intermediate results is large, a large number of repeated partial matches will be generated; and the assembly computation handled by the master would be a bottleneck. In this paper, we propose an optimal partial evaluation algorithm and a filter method to reduce partial matches by exploring the computing characteristics of partial evaluation and assembly framework. (1) An index structure named inner boundary node index (IBN-Index) is constructed to prune for graph exploration to improve the searching efficiency of the partial evaluation phase. (2) The boundary characteristics of local partial matches are utilized to construct a boundary node index (BN-Index) to reduce the number of local partial matches. (3) The experimental results over benchmark datasets show that our approach outperforms the state-of-the-art methods.

1 Introduction

Knowledge graphs have become an important cornerstone of the research and development of artificial intelligence technology. In recent years, the scale of knowledge graphs has increased at an unprecedented rate, and processing data with millions of vertices (10⁶) and hundreds of millions of edges (10⁸) has become commonplace [1]. Therefore, it is necessary to consider how to perform distributed query processing to cope with the growing demand for knowledge graphs.

In the Semantic Web community, the Resource Description Framework (RDF) has become a de-facto standard format for knowledge graphs and has been extensively applied [1]. An RDF dataset consists of a set of triples 〈s, p, o〉 and can be transformed into a graph where the resources denoted by s and o are vertices, and the attributes denoted by p are labeled edges. SPARQL [2] is the standard graph query language on RDF graphs. A SPARQL query can be regarded as the subgraph homomorphism problem [3], which is recognized as an NP-complete problem [4, 5].

The efficient processing of subgraph matching queries over large-scale RDF graphs in a distributed setting is a challenging problem. Recently, the partial evaluation technique [6] has been applied to regular path queries in distributed environments [7,8,9]. A query Q is partially evaluated in parallel on each site S_i to obtain partial results on the fragment F_i stored there; all the partial results are then transmitted to a master site, where they are assembled into the final results of Q.

Based on the partial evaluation technique, the partial evaluation and assembly framework has been proposed to answer SPARQL queries [10]. However, a large number of partial results can be generated during the partial evaluation phase, making the assembly phase a computational bottleneck. To improve the efficiency of the assembly phase, Peng et al. [11] proposed the LEC feature-based optimization strategy to prune some unpromising intermediate results. However, existing works focus only on the assembly phase of query processing and ignore the partial evaluation phase; in particular, efficient index-based methods are not exploited to speed up the search during partial evaluation. The following example demonstrates the drawback of computing partial matching results by the method in [11].

Example 1

As shown in Fig. 1a, given a distributed RDF graph G₀ and a query Q = (?a, spouse, ?s) ∧ (?a, director, ?b) ∧ (?b, country, ?c) ∧ (?c, capital, ?d), the query engine traverses the dataset on F_i according to the query graph to obtain the candidate sets of all query variables. In fragment F_i, the candidate sets of the query are denoted as \(B_{g_{i}}\), where the subscript g_i is used to distinguish among different fragments. Based on the candidate generation strategy, the candidates of the internal query nodes of query Q are \(B_{g_{1}}\) = {〈?a, (v₂, v₅)〉, 〈?b, (v₃, v₄, v₂₁)〉, 〈?c, (v₈)〉} and \(B_{g_{2}}\) = {〈?a, (v₁₁, v₁₅, v₂₂)〉, 〈?b, (v₁₃, v₁₄)〉, 〈?c, (v₉)〉}. The internal query nodes are the query nodes that have more than one edge (the nodes ?s and ?d are not internal query nodes because only one edge is connected with each of them in query graph Q). The candidate vertices of ?a, ?b, and ?c are in yellow, green, and blue, respectively, in Fig. 1a. When computing local partial matches on each site, each candidate vertex starts a graph exploration, so the search process iterates six times on the first site and six times on the second. On closer inspection, a graph exploration from either v₂ or v₃ obtains the same local partial match (〈?s, v₁〉, 〈?a, v₂〉, 〈?b, v₃〉, 〈?c, v₉〉, 〈?d, v₁₀〉) on F₁. As a result, this strategy can generate a large number of repeated local partial results, leading to a degradation in the performance of partial evaluation.

To handle this problem, we propose an effective optimization strategy that accelerates the partial evaluation phase with an index named the inner boundary node index (IBN-Index), which exploits characteristics of the partial evaluation and assembly framework (the formal definition is given in Section 5). For the example shown in Fig. 1, the following candidate sets remain after filtering by the IBN-Index: \(B_{g_{1}}\) = {〈?a, NULL〉, 〈?b, (v₃, v₄, v₂₁)〉, 〈?c, (v₈)〉}, \(B_{g_{2}}\) = {〈?a, (v₂₂)〉, 〈?b, (v₁₃, v₁₄)〉, 〈?c, (v₉)〉}. These candidate vertices of ?a, ?b and ?c are colored yellow, green and blue, respectively, in Fig. 1b. The filtered candidate sets are smaller than the candidate sets in Example 1. Furthermore, the local partial match (〈?s, v₁〉, 〈?a, v₂〉, 〈?b, v₃〉, 〈?c, v₉〉, 〈?d, v₁₀〉) is generated only once, by the exploration starting from v₃ on fragment F₁.
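The reduction in exploration starts can be checked with a few lines of Python. This is only a counting sketch over the two candidate-set listings above; the dictionary layout and the helper name `exploration_starts` are illustrative assumptions, not part of the proposed system.

```python
# Candidate sets of the internal query variables, copied from Example 1
# (before filtering) and from the listing after IBN-Index filtering.
before = {
    "F1": {"?a": ["v2", "v5"], "?b": ["v3", "v4", "v21"], "?c": ["v8"]},
    "F2": {"?a": ["v11", "v15", "v22"], "?b": ["v13", "v14"], "?c": ["v9"]},
}
after = {
    "F1": {"?a": [], "?b": ["v3", "v4", "v21"], "?c": ["v8"]},
    "F2": {"?a": ["v22"], "?b": ["v13", "v14"], "?c": ["v9"]},
}


def exploration_starts(candidate_sets):
    """Each candidate vertex triggers one graph exploration."""
    return sum(len(vertices) for vertices in candidate_sets.values())


starts_before = {f: exploration_starts(c) for f, c in before.items()}
starts_after = {f: exploration_starts(c) for f, c in after.items()}
print(starts_before)  # {'F1': 6, 'F2': 6}
print(starts_after)   # {'F1': 4, 'F2': 4}
```

In this small example each fragment goes from six exploration starts to four.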

Since the growing number of partial matches heavily influences the assembly stage, filtering out some of the partial results is an effective way to speed up the query. Therefore, we propose another method that uses a boundary node index (BN-Index; the formal definition is given in Section 6) to filter out some of the false local partial matches in advance, reducing the cost of communication and centralized computation.

We summarize the contributions of the paper with the following three aspects:

Based on partial evaluation and assembly framework, we propose an inner boundary node index (IBN-Index) and a partial evaluation algorithm that utilizes IBN-Index to filter the candidate sets of query nodes, which can significantly speed up partial evaluation phase.
To reduce the number of local partial matches, the boundary node index (BN-Index) is constructed by exploiting the characteristics of local partial matches. Furthermore, a BN-Index-based filter algorithm is proposed.
Extensive experiments on benchmark datasets have been conducted to verify the efficiency and scalability of our method. The experimental results show that our method outperforms the state-of-the-art method.

The rest of this paper is organized as follows. Section 2 reviews the related work. In Section 3, we present the preliminaries and problem definition. An overview of the methods is given in Section 4. In Section 5 and Section 6, we propose the inner boundary node index and the boundary node index, respectively, together with their corresponding algorithms. Finally, experimental evaluations are presented in Section 7, and we draw conclusions in Section 8.

2 Related work

Due to performance, confidentiality, and security factors [12, 13], the cluster-based distributed data management architecture has become an inevitable research trend for dealing with knowledge graphs. In this section, we review research on distributed subgraph matching over large-scale RDF graphs, which can be classified into two categories: MapReduce-based graph systems and specialized RDF systems. Furthermore, existing works on partial evaluation and assembly and on graph indexing methods are summarized.

2.1 MapReduce-based graph systems

SHARD [14], a MapReduce-based triple store for RDF graphs, processes SPARQL queries by decomposing the query graph into a set of triple patterns (triples containing variables) and binding variables to the vertices of the data graph by iterating over the triple patterns, while satisfying all the constraints in the query. Each round of the MapReduce operation adds only one query clause through the join operation. Likewise, the smallest decomposition unit of the query graph in HadoopRDF [15] is also the triple pattern, and it also utilizes the MapReduce framework to divide the RDF triples into multiple small files based on the predicate. However, both methods ignore the structural information of the query graph, require multiple MapReduce iterations, and require a large number of join operations, resulting in high query cost.

2.2 Specialized RDF systems

Trinity.RDF [16], a distributed in-memory key-value store, stores RDF data in its native graph form, with vertex identifiers as keys and adjacency lists of vertices as values. Trinity.RDF finds the optimal exploration plan and reduces the number of intermediate results by using graph exploration instead of join operations, but the final results need to be obtained using a single thread on the master node. In addition, systems based on the partial evaluation and assembly framework have also been studied extensively in recent years.

The method of partial evaluation and assembly is first applied in distributed XML data management by Peter et al. [17]. The key idea is to transmit the whole query graph to each site that is partially evaluated in parallel, and after each node computes the partial results of the query, the results are transmitted as compact Boolean functions to the master node, which are combined to obtain the result. Fan et al. [7, 18, 19] subsequently propose a series of algorithms based on partial evaluation and assembly framework to deal with XQuery on distributed XML data, reachability query and graph simulation on distributed graph. Peng et al. [10, 11] design a subgraph matching query algorithm based on partial evaluation and assembly framework to process SPARQL queries on distributed RDF data and propose relevant optimization strategies.

However, the huge overhead caused by repeated partial matches during partial evaluation is not handled in aforementioned methods. Furthermore, when the number of intermediate results obtained in the local computation phase is extensive, the aforementioned methods may suffer from a performance bottleneck in the assembly phase.

2.3 Graph indexing

As a classic space-for-time strategy, graph indexing has been researched extensively and deeply in the past years. Graph indexing methods can be classified into value-based indexing and structure-based indexing.

Through value-based methods, a graph index is usually constructed on one or more properties of an entity. SB-Tree [20] is a variant of B-Tree, which has a better performance on dynamic data. Hexastore [21] and RDF-3X [22] index the RDF data in six possible ways. Based on S-Tree [23], gStore [24] proposed VS-Tree to prune the search space efficiently.

Structure-based methods focus on mining the features of a graph, such as paths, subtrees, or other substructures, and indexing them to filter the search space and accelerate the query process. Closure-Tree [25] proposed the graph closure, a generalized graph that represents several graphs, and uses it to organize graphs as a tree. Both subgraph queries and similarity queries can benefit from this method. SPath [26] takes the shortest paths around the vertex neighborhood as the basic processing unit and decomposes the query into a set of shortest paths to exploit its indexing. K-path-bisimulation [27] is a path-based index in which paths with identical length and the same edge label sequence are grouped together, and the query is decomposed into a set of paths.

In this paper, the features of partial evaluation and assembly framework are exploited profoundly, and two kinds of indexes are constructed to compensate for the shortcomings of previous methods. Although the proposed index-based methods are value-based, the structural information of the graph is also included in the indexes.

3 Preliminaries

Let U and L be the disjoint infinite sets of URIs and literals. Then, a tuple in the form of 〈s, p, o〉 ∈ U × U × (U ∪ L) is called an RDF triple, where s is the subject, p the predicate, and o the object. Given an RDF dataset as a finite set of triples, it can be converted to its corresponding RDF graph. In this paper, we focus on the problem of subgraph matching query over a distributed RDF graph. This section will present preliminaries for distributed RDF graphs and subgraph matching queries. Table 1 lists the notations frequently used in this paper.

Table 1 Frequently used notations

Definition 1 (RDF Graph)

Given an RDF dataset as a finite set of triples in the form 〈s, p, o〉, its corresponding RDF graph is G = (V, E, Σ), where the set of vertices V is the union of all s and o. For each 〈s, p, o〉, there is a directed edge e ∈ E from the vertex s to the vertex o, where p is the label of that edge e. Here, Σ is the set of all labels, i.e., Σ = {p ∣ 〈s, p, o〉 ∈ G}.
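Definition 1 can be illustrated with a short Python sketch that turns a list of triples into (V, E, Σ). The function name `build_rdf_graph` and the toy triples are hypothetical and chosen only for illustration; they are not taken from the datasets used in this paper.

```python
from collections import defaultdict


def build_rdf_graph(triples):
    """Return (V, E, Sigma) for a finite set of (s, p, o) triples.

    E is kept as an adjacency map: subject -> list of (label, object).
    """
    vertices, labels = set(), set()
    out_edges = defaultdict(list)
    for s, p, o in triples:
        vertices.update((s, o))       # V is the union of all subjects and objects
        labels.add(p)                 # Sigma collects every predicate
        out_edges[s].append((p, o))   # directed edge s -> o labeled p
    return vertices, dict(out_edges), labels


# Hypothetical toy dataset.
toy_triples = [
    ("v2", "spouse", "v1"),
    ("v2", "director", "v3"),
    ("v3", "country", "v9"),
    ("v9", "capital", "v10"),
]
V, E, Sigma = build_rdf_graph(toy_triples)
```

Storing E as an adjacency map is a design choice of this sketch; any representation that preserves the direction and label of each edge would serve equally well.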

Definition 2 (Distributed RDF Graph)

RDF graph G is partitioned into n disjoint ‘entity sets’ {\(\mathcal {E}_{1}\), ..., \(\mathcal {E}_{n}\)}, where each \(\mathcal {E}_{i}\) = (V_i, E_i, Σ_i). Here, (1) for each i ∈{1,...,n}, \(\mathcal {E}_{i}\) is a subset of G, where \(V_{i} \subseteq V\), \(E_{i} \subseteq E\), and \({\Sigma }_{i} \subseteq {\Sigma }\); (2) for each i, j ∈ {1,...,n} ∧ i≠ j, there is \(\mathcal {E}_{i}\) ∩ \(\mathcal {E}_{j}\) = ∅; and (3) \(\bigcup ^{n}_{i=1}\) \(\mathcal {E}_{i}\) = G.

To ensure the integrity and consistency of the RDF graph when partitioned in a distributed system, each computing node needs to store some copies of the edges that cross between different entity sets. Let the copy of the associated edges with other partition be denoted as \({N_{i}^{c}}\).

Definition 3 (Fragment)

Graph G is partitioned into n fragments \(\mathcal {F}\) = {F₁, ..., F_n}, such that F_i = \(\mathcal {E}_{i} \cup {N_{i}^{c}}\). In other words, G can be considered as a distributed RDF graph w.r.t. \(\mathcal {F}\), such that:

1)
For each \(\mathcal {E}_{i}\) = (V_i, E_i, Σ_i), V_i, E_i, and Σ_i represent the set of internal vertices, the set of edges, and the set of labels in F_i, respectively. Formally, V_i = {\(s \mid \langle s,\;p,\;o \rangle \in \mathcal {E}_{i}\)} ∪ {\(o \mid \langle s,p,o \rangle \in \mathcal {E}_{i}\)}, \(E_{i} \subseteq V_{i} \times V_{i}\), and Σ_i = {\(p \mid \langle s,p,o \rangle \in \mathcal {E}_{i}\)};
2)
\({N_{i}^{c}}\) = (\({V_{i}^{e}}\), \({E_{i}^{c}}\), \({{\Sigma }_{i}^{c}}\)), where \({E_{i}^{c}}\) is the set of crossing edges between F_i and other fragments. If an internal vertex of F_i has a direct edge with any vertex v in F_j, where i≠j, then \(v\in {V_{i}^{e}}\). Formally, \({E_{i}^{c}} \subseteq V_{i} \times V_{j}\), \({{\Sigma }_{i}^{c}}\) = {\(p \mid \langle s, p, o \rangle \in {E_{i}^{c}}\) }, the set of boundary vertices between F_i and F_j is \({V_{i}^{e}}\) = {\(s \mid \langle s, p, o \rangle \in {E_{i}^{c}} \wedge s \in V_{j}\)} ∪ {\(o \mid \langle s,\;p,\;o \rangle \in {E_{i}^{c}} \wedge o \in V_{j}\)}, i, j = 1,2,...,n ∧ i≠j;
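As a minimal sketch of Definitions 2 and 3, the following Python function derives the internal edges, the crossing edges, and the boundary vertices of one fragment. It assumes that the partitioning is given as a map from each vertex to its fragment identifier; the names `build_fragment` and `owner`, and the toy data, are assumptions made for illustration.

```python
def build_fragment(triples, owner, i):
    """Split the triples seen by fragment i into internal and crossing edges.

    owner maps every vertex to the id of the fragment that stores it.
    Returns (V_i, internal_edges, crossing_edges, boundary_vertices), where
    boundary_vertices plays the role of V_i^e.
    """
    internal_vertices = {v for v, f in owner.items() if f == i}
    internal_edges, crossing_edges, boundary_vertices = [], [], set()
    for s, p, o in triples:
        s_in, o_in = owner[s] == i, owner[o] == i
        if s_in and o_in:
            internal_edges.append((s, p, o))
        elif s_in or o_in:
            # Exactly one endpoint is internal: the edge is replicated in F_i,
            # and the other endpoint becomes a boundary vertex of F_i.
            crossing_edges.append((s, p, o))
            boundary_vertices.add(o if s_in else s)
    return internal_vertices, internal_edges, crossing_edges, boundary_vertices


# Hypothetical toy graph split into two fragments.
toy_triples = [
    ("v2", "spouse", "v1"),
    ("v2", "director", "v3"),
    ("v3", "country", "v9"),
    ("v9", "capital", "v10"),
]
owner = {"v1": 1, "v2": 1, "v3": 1, "v9": 2, "v10": 2}
V1, E1, E1c, V1e = build_fragment(toy_triples, owner, 1)
V2, E2, E2c, V2e = build_fragment(toy_triples, owner, 2)
```

Note that the crossing edge (v3, country, v9) appears in both fragments, which corresponds to the replicated copy \({N_{i}^{c}}\).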

Let \(\mathcal {S}\) = {S₀, S₁, \(\dots\), S_n} be a set of n + 1 computing nodes, i.e., sites, in a cluster. Without loss of generality, each fragment F_i is stored at a slave site S_i for i ∈ {1, \(\dots\), n}.

Example 2

As shown in Fig. 2, given a distributed RDF graph G₁ extracted from the DBpedia dataset, G₁ can be divided into four parts \(\mathcal {F}\) = {F₁, F₂, F₃, F₄}, which are respectively stored on the corresponding sites {S₁, S₂, S₃, S₄} in the cluster. For a fragment F₂ = \(\mathcal {E}_{2}\) ∪ \({N_{2}^{c}}\), the partition \(\mathcal {E}_{2}\) = (V₂, E₂, Σ₂), and V₂ = {v₄, v₅, v₉, v₁₀, v₁₁, v₁₂, v₁₃, v₁₄, v₁₅}, E₂ = {(v₄, v₅), (v₉, v₁₀), (v₉, v₁₄), (v₁₄, v₁₅), (v₁₃, v₁₁), (v₁₁, v₁₂)}. The copy between F₂ and other fragments \({N_{2}^{c}}\) = (\({V_{2}^{e}}\), \({E_{2}^{c}}\), \({{\Sigma }_{2}^{c}}\)), where \({V_{2}^{e}}\) = {v₃, v₆, v₁₇}, \({E_{2}^{c}}\) = {(v₃, v₉), (v₅, v₆), (v₁₃, v₁₇)}. In particular, we colored the incoming and outgoing vertices of fragment F₂ in blue.

Given an RDF graph G and a query graph Q as a set of triple patterns, a subgraph matching problem is to find the subgraphs over G that satisfy all the triple patterns in Q. Such a subgraph matching problem is a conjunctive query (CQ) on G, which is the focus of this paper. In the following, we formally present the query graphs and the other necessary definitions adapted from [28].

A query graph includes m triple patterns 〈s_r, p_r, o_r〉, where the value of each s_r and o_r can either be a member of V or be ‘not labeled’. If an s_r or o_r is ‘not labeled’, it belongs to a special set Var, and the name of each element in Var starts with the character ‘?’. Similarly, the value of each p_r can either be a member of Σ or of Var.

Definition 4 (Query Graph)

Given an RDF graph G, a CQ Q over G is defined as: Q(z₁, \(\dots\), z_t)\(\gets \bigwedge _{1\le r \le m} tp_{r}\), where tp_r = 〈s_r, p_r, o_r〉 is a triple pattern, s_r, o_r ∈ V ∪ Var, and z_l is a variable with z_l ∈ {s_r∣1 ≤ r ≤ m} ∪ {o_r∣1 ≤ r ≤ m}. A CQ Q is also referred to as a query graph.

Before defining subgraph matching, we recapitulate certain definitions of the mapping. For a mapping μ, dom(μ) is its domain. Two mappings μ₁ and μ₂ are compatible, i.e., μ₁ \(\sim\) μ₂, iff for every element v ∈ dom(μ₁) ∩ dom(μ₂), it holds that μ₁(v) = μ₂(v). Furthermore, the set-union of two compatible mappings, i.e., μ₁ ∪ μ₂, is also a mapping.
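The compatibility test and the set-union of mappings can be written directly from this definition. In the sketch below, mappings are plain Python dictionaries from query variables to data vertices; the function names are illustrative.

```python
def compatible(mu1, mu2):
    """mu1 ~ mu2 iff they agree on every element of dom(mu1) ∩ dom(mu2)."""
    return all(mu1[v] == mu2[v] for v in mu1.keys() & mu2.keys())


def union(mu1, mu2):
    """Set-union of two compatible mappings, which is again a mapping."""
    if not compatible(mu1, mu2):
        raise ValueError("mappings are not compatible")
    return {**mu1, **mu2}


mu1 = {"?a": "v2", "?b": "v3"}
mu2 = {"?b": "v3", "?c": "v9"}
mu3 = {"?b": "v4"}
merged = union(mu1, mu2)  # {'?a': 'v2', '?b': 'v3', '?c': 'v9'}
```

Here `mu1` and `mu3` are not compatible because they disagree on ?b.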

Definition 5 (Subgraph Matching)

The semantics of a CQ Q over an RDF graph G is defined as:

1)
μ is a mapping from the vertices in Q to the vertices in V, i.e., mapping from \(\overline {s}\) = {s₁,...s_m} and \(\overline {o}\) = {o₁,...o_m} to the vertices in V;
2)
\((G, \mu ) \vDash Q\) iff 〈μ(s_r), μ(p_r), μ(o_r)〉∈ E and the labels of s_r, p_r and o_r are the same as that of μ(s_r), μ(p_r), and μ(o_r), respectively, if s_r, p_r, o_r∉Var;
3)
P_Q is the set of all results, where each result satisfies the subgraph matching query Q over G.
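To make the semantics of Definition 5 concrete, the following is a naive, centralized matcher that extends mappings one triple pattern at a time. It is a sketch for illustration only (it needs Python 3.8 or later and is unrelated to the distributed algorithms of this paper); the function name `match_query` and the toy triples, including their edge directions, are assumptions.

```python
def match_query(query, triples):
    """Return all mappings mu with (G, mu) |= Q, by brute-force extension.

    query is a list of (s, p, o) patterns; terms starting with '?' are variables.
    """
    def extend(mu, pattern, triple):
        mu = dict(mu)  # work on a copy so failed attempts leave mu untouched
        for q_term, d_term in zip(pattern, triple):
            if q_term.startswith("?"):
                # Bind the variable, or check consistency with an earlier binding.
                if mu.setdefault(q_term, d_term) != d_term:
                    return None
            elif q_term != d_term:
                return None  # constant in the pattern does not match the data
        return mu

    results = [{}]
    for pattern in query:
        results = [m2 for m in results for t in triples
                   if (m2 := extend(m, pattern, t)) is not None]
    return results


Q = [("?a", "spouse", "?s"), ("?a", "director", "?b"),
     ("?b", "country", "?c"), ("?c", "capital", "?d")]
# Hypothetical toy data; the last triple is noise that must not match.
G = [("v2", "spouse", "v1"), ("v2", "director", "v3"),
     ("v3", "country", "v9"), ("v9", "capital", "v10"),
     ("v4", "country", "v9")]
P_Q = match_query(Q, G)
```

On this toy graph, `P_Q` contains the single mapping ?a→v2, ?s→v1, ?b→v3, ?c→v9, ?d→v10.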

Problem statement

Consider a distributed RDF graph G w.r.t. a fragmentation \(\mathcal {F}\) = {F₁,...,F_n}, and let the fragments be stored in the cluster \(\mathcal {S}\) = {S₀, S₁,...S_n}. For simplicity, we assume that each site S_i hosts one fragment F_i. Given a query graph Q, the problem is to find all subgraph matching results P_Q of Q in G.

Example 3

Given a CQ Q = (?a, spouse, ?s) ∧ (?a, director, ?b) ∧ (?b, country, ?c) ∧ (?c, capital, ?d), Q consists of five query vertices, and its semantics is to find the films directed by a person together with the person's spouse, the film’s country, and the capital of that country. The corresponding query graph is shown in Fig. 3, with one of the query results being highlighted in purple in Fig. 2.

4 Overview

The partial evaluation and assembly framework is extended to answer SPARQL queries over a distributed RDF graph G, as shown in Fig. 4. In the execution model, there are two phases: the partial evaluation phase and the assembly phase. In addition, two optimization strategies are designed and embedded in this framework.

Before the query starts, the entire RDF graph G is divided into multiple fragments according to a certain partitioning strategy, and an index named BN-Index is built on each fragment. The fragments and corresponding BN-Index are then transmitted to each site, and an index named IBN-Index is further constructed on each site locally. When querying, the master node sends the entire query graph to all slave nodes, and the subsequent partial evaluation phase can be summarized into three processes. (1) Each site S_i first receives the complete query graph Q and finds all candidate sets of the query graph variables. (2) The query engine uses the IBN-Index to filter the candidates, and executes the graph exploration algorithm according to the filtered candidates to find local matches. (3) Finally, BN-Index is utilized to filter the local partial matches after graph exploration.

The local partial matches are then sent to the master site to compute the complete SPARQL matches, which is called the assembly stage. Benefiting from the filtering effect of BN-Index, the number of partial matches is drastically reduced, which alleviates the assembly bottleneck problem to a certain extent.

To better illustrate the effect of IBN-Index, it is necessary to explain the partial evaluation process of gStoreD in detail. After each site S_i receives the complete query graph, the candidate set of each query variable is obtained according to the predicates connected with the variables. Specifically, the vertices in the candidate set can be classified into internal candidate vertices and boundary candidate vertices. The internal candidate vertices denote the vertices contained in the subgraph F_i allocated on S_i, while the boundary candidate vertices denote those vertices connected with F_i but do not belong to it. After obtaining the candidate sets, all the sites transmit the internal candidate vertices to the master site, and the collection of all internal candidates is resent to all sites.

In order to find partial results on F_i, graph exploration starts with each internal candidate vertex of the internal query variables. Specifically, for a query graph, the query variables can be classified into internal query variables and satellite variables, depending on the edges connected with the query node. If a query node only has one edge, it is denoted as a satellite variable.
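The distinction between internal query variables and satellite variables depends only on the degree of each variable in the query graph. A small sketch for the query Q of Example 3 follows; the function name `classify_variables` is illustrative.

```python
from collections import Counter


def classify_variables(query):
    """Split query variables into internal (degree > 1) and satellite (degree 1)."""
    degree = Counter()
    for s, _p, o in query:
        for term in (s, o):
            if term.startswith("?"):
                degree[term] += 1
    internal = sorted(v for v, d in degree.items() if d > 1)
    satellite = sorted(v for v, d in degree.items() if d == 1)
    return internal, satellite


Q = [("?a", "spouse", "?s"), ("?a", "director", "?b"),
     ("?b", "country", "?c"), ("?c", "capital", "?d")]
internal_vars, satellite_vars = classify_variables(Q)
print(internal_vars)   # ['?a', '?b', '?c']
print(satellite_vars)  # ['?d', '?s']
```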

The reasons why we choose these special candidate vertices as the starting vertices are considered from two aspects. (1) A boundary vertex is also an internal vertex on another site simultaneously so that it will be set as a starting vertex on that site, and the path connected with it will not be lost. (2) If a partial match on S_i only matches a satellite node, it will also be found on other sites. Take the partial match in purple on S₁ in Fig. 2 as an example, the partial match will be found on S₂ and connected with v₆ as a complete SPARQL match. Therefore, only starting with internal query variables will not lose any partial results.

At each expansion step of graph exploration, query engine judges whether the matching node belongs to the collection of all the internal candidates. The graph exploration process will not stop until all maximal partial matches are obtained.

5 Inner boundary node-based algorithm

Recall that in partial evaluation and assembly framework, given a distributed RDF graph G, each site S_i receives a part of the graph F_i and constructs IBN-Index according to the subgraph. Then, when answering query Q, each site S_i computes local partial matches utilizing the constructed index. In this section, the structure of the IBN-Index is defined, and the IBN-Index construction algorithm is introduced. Then we present the subgraph matching algorithm utilizing IBN-Index on each site and give the complexity analysis of the proposed method.

5.1 Inner boundary node

The local partial match computation algorithm based on the partial evaluation and assembly framework has been proposed in [10]. In this section, we first analyze the performance problem of existing work in computing local matches during partial evaluation, and on this basis propose an optimization using the inner boundary node index.

When computing local partial matches, each internal candidate vertex of the candidate vertices sets starts the graph exploration to find the local partial matches. It should be noted that it is not necessary to traverse all internal nodes as the starting point of graph exploration. The reasons can be considered from the following two aspects.

(1) Intuitively, a complete SPARQL match only needs to be traversed from one vertex so that the candidate vertices in other candidate sets can be discarded. (2) Besides, as mentioned in [10], a local partial match is the overlapping part of an unknown crossing match and a fragment F_i. There must be a crossing edge derived from an internal vertex connected with other fragments. Therefore, only searching from the vertices connected with other fragments can get all the partial matches. These kinds of vertices are defined as inner boundary nodes, and the formal definition is given as follows:

Definition 6 (Inner Boundary Node.)

Given a distributed RDF graph G and a fragmentation \(\mathcal {F}\) = {F₁,...,F_n}, in fragment F_i = \(\mathcal {E}_{i} \cup {N_{i}^{c}}\), if an internal vertex v of F_i has a direct edge with any vertex u in F_j, where i≠j, then v is an inner boundary node.

Based on inner boundary nodes, the internal entity set V_i of fragment F_i can be divided into two mutually exclusive subsets, pure internal node set P_i and inner boundary node set D_i. Formally, V_i = P_i ∪ D_i, where D_i = {\(s \mid \langle s ,p ,o \rangle \in {E_{i}^{c}} \wedge o \in V_{j}\)} ∪ {\(o \mid \langle s ,p ,o \rangle \in {E_{i}^{c}} \wedge s \in V_{j}\)}, i,j = 1,2,...,n ∧ i≠j. The definition of the inner boundary node index is presented as follows:

Definition 7 (Inner Boundary Node Index.)

Given a fragment F_i of RDF graph G, the inner boundary node index, i.e., IBN-Index, of F_i is a key-value map I^IBN where

1)
for any tuple (v, tag) ∈ I^IBN, the key is a vertex v ∈ V_i, and the value tag is a Boolean value denoting if v is an inner boundary node or not; if a vertex v is an inner boundary node, its corresponding tag will be set to boolean True, otherwise it will be set to False;
2)
for any vertex v ∈ V_i, there exists a unique tuple in I^IBN with v as the key and a Boolean tag as the value.

Example 4

As shown in Fig. 2, on site S₁, the inner vertices connected with vertices on other sites (S₂ and S₃), i.e., the inner boundary nodes, are v₃, v₆, and v₈. On site S₂, the inner boundary nodes are v₅, v₉, and v₁₃. As a result, the IBN-Index on site S₁ is \(I_{1}^{IBN}\) = {(v₃, True), (v₆, True), (v₈, True), (v₁, False), (v₂, False), (v₇, False), (v₃₀, False), (v₃₁, False), (v₃₂, False), (v₃₃, False), (v₃₄, False)}, and the IBN-Index on site S₂ is \(I_{2}^{IBN}\) = {(v₅, True), (v₉, True), (v₁₃, True), (v₄, False), (v₁₀, False), (v₁₁, False), (v₁₂, False), (v₁₄, False), (v₁₅, False)}.

To improve space efficiency of IBN-Index, the strategy of dictionary encoding is adopted, such that each vertex is encoded into an integer by hash operation.

The construction approach of the IBN-Index is shown in Algorithm 1. First, the inner boundary node set and IBN-Index are initialized by the fragment identifier F_i (line 1). Then, for each triple 〈s, p, o〉, if it is a crossing edge and the subject s (or object o) is an internal vertex, the s (or o) is put into the inner boundary node set D_i (lines 2-6). For each node v in the internal node set V_i, if it is also an inner boundary node in D_i, a mapping (v, True) will be put into \(I_{i}^{IBN}\); otherwise, a mapping (v, False) will be put into \(I_{i}^{IBN}\) (lines 7-11).
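The construction just described can be sketched in a few lines of Python. This is an illustration under assumptions rather than a transcription of Algorithm 1: the function name `build_ibn_index` is hypothetical, dictionary encoding is omitted, and the edge label "p" is a placeholder because the labels of the crossing edges of F₂ are not listed in the text. The vertex and edge sets are those of Example 2.

```python
def build_ibn_index(internal_vertices, crossing_edges):
    """Map every internal vertex to True iff it is an inner boundary node."""
    inner_boundary = set()  # the set D_i
    for s, _p, o in crossing_edges:
        # The internal endpoint of a crossing edge is an inner boundary node.
        if s in internal_vertices:
            inner_boundary.add(s)
        if o in internal_vertices:
            inner_boundary.add(o)
    return {v: v in inner_boundary for v in internal_vertices}


# Fragment F2 from Example 2; "p" is a placeholder label.
V2 = {"v4", "v5", "v9", "v10", "v11", "v12", "v13", "v14", "v15"}
E2c = [("v3", "p", "v9"), ("v5", "p", "v6"), ("v13", "p", "v17")]
ibn_2 = build_ibn_index(V2, E2c)
```

The result marks v₅, v₉, and v₁₃ as True and all other vertices of V₂ as False, which agrees with \(I_{2}^{IBN}\) in Example 4.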

5.2 IBN-index based partial evaluation

In order to answer query Q, each site S_i computes the local partial matches based on the known fragment F_i. The formal definition of local partial match was defined in [10]. Intuitively, a local partial match PM_i is an overlapping part between a crossing match M and fragment F_i.

Algorithm 2 describes the local partial match computation process utilizing the IBN-Index. The key idea is to use the inner boundary node index to filter the candidate sets, thereby reducing the search space when finding local partial results. Since the result of partial evaluation can be divided into complete SPARQL matches and local partial matches, the correctness of Algorithm 2 can be considered from the following two aspects. (1) Since some complete SPARQL matches contain only pure internal vertices (e.g., the dashed partial match on S₁ in Fig. 2), i.e. they do not have any inner boundary nodes, in order to ensure that all complete SPARQL matches are obtained, we keep a complete candidate set in which all vertices will start graph exploration (line 8). Here a greedy strategy is applied, selecting the candidate set with the smallest size as the reserved set (line 6). (2) In other candidate sets, only the candidate vertex judged as an inner boundary node will start the graph exploration (lines 13-19).

Example 5

We take site S₁ as an example. As shown in Fig. 2, for query Q, the candidate sets of the internal query variables on site S₁ are \(B_{g_{1}}\) = {〈?a, (v₂, v₃₁, v₈)〉, 〈?b, (v₃, v₃₂)〉, 〈?c, (v₃₃)〉}. First, the candidate sets are sorted to find the one with the smallest size, and the set 〈?c, (v₃₃)〉 is reserved. In the candidate set of ?a, the nodes v₂ and v₃₁ are not inner boundary nodes according to the IBN-Index, so they are filtered out. In the candidate set of ?b, v₃ is kept, while v₃₂ is discarded. The filtered candidate sets are \(B_{g_{1}}\) = {〈?a, (v₈)〉, 〈?b, (v₃)〉, 〈?c, (v₃₃)〉}. As a result, graph exploration can find all partial matches on S₁ starting only from v₈, v₃, and v₃₃. Likewise, a large number of candidate vertices are filtered out of the candidate sets on other sites using the same method.
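The filtering step of this example can be reproduced with the following sketch. It covers only the candidate filtering of Algorithm 2 (reserving the smallest candidate set and keeping inner boundary nodes elsewhere), not the graph exploration itself; the function name `filter_candidates` is hypothetical, and the index is the one listed for S₁ in Example 4.

```python
def filter_candidates(candidate_sets, ibn_index):
    """Keep the smallest candidate set whole; elsewhere keep inner boundary nodes."""
    # Greedy choice: reserve the variable with the fewest candidates.
    reserved = min(candidate_sets, key=lambda var: len(candidate_sets[var]))
    filtered = {}
    for var, candidates in candidate_sets.items():
        if var == reserved:
            filtered[var] = list(candidates)
        else:
            filtered[var] = [v for v in candidates if ibn_index.get(v, False)]
    return reserved, filtered


# IBN-Index of site S1 (Example 4) and candidate sets of Example 5.
ibn_1 = {"v3": True, "v6": True, "v8": True, "v1": False, "v2": False,
         "v7": False, "v30": False, "v31": False, "v32": False,
         "v33": False, "v34": False}
B_g1 = {"?a": ["v2", "v31", "v8"], "?b": ["v3", "v32"], "?c": ["v33"]}
reserved, B_g1_filtered = filter_candidates(B_g1, ibn_1)
print(reserved)        # ?c
print(B_g1_filtered)   # {'?a': ['v8'], '?b': ['v3'], '?c': ['v33']}
```

The reserved set is needed because a complete match made only of pure internal vertices contains no inner boundary node and would otherwise never be explored.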

Space complexity of IBN-Index

For each fragment F_i, each vertex corresponds to a tag indicating whether it is an inner boundary node. The extra space is bounded by \(O(|V_{i}| + |{V_{i}^{e}}|)\), where V_i is the set of internal nodes of fragment F_i and \({V_{i}^{e}}\) is the set of boundary nodes of fragment F_i.

6 Filter local partial matches with boundary node index

As shown in Fig. 5, since the RDF data graph is distributed and stored on multiple sites, the boundary node on each site becomes a bridge connecting any two sites. Node 1 (in blue) on F₁ is a boundary node for F₂, while it is an internal node in the view of F₁. As a result, it is named an inner boundary node on F₁, while it is a boundary node for F₂. In partial evaluation and assembly framework, after the local partial matches are obtained, all of them are sent to the master site uniformly. However, not all the partial results can continue to be joined to form a complete SPARQL match on the master site. Therefore, it is unnecessary to transmit local partial matches, which are not associated with partial matches on other sites, to the master site. Based on the above problem, this paper proposes an optimization strategy for pre-judging edge labels from boundary nodes to reduce the communication overhead between local sites and the master site.

Boundary nodes (see the definition of \({V_{i}^{e}}\)) were defined in Section 3: they are vertices that belong to other fragments but are directly connected to internal vertices in F_i. For an RDF graph G, when dividing the data, we record each boundary node's out-edge and in-edge information on the fragment F_i as the boundary node index. The formal definition of the boundary node index is as follows:

Definition 8 (Boundary Node Index)

Given a fragment F_i of RDF graph G, the boundary node index, i.e., the BN-Index of F_i, is a key-value map \(I_{i}^{BN}\) where

1) \(I_{i}^{BN}\) = \(I_{i}^{Out} \cup I_{i}^{In}\);
2) for any tuple \((v,v.Out) \in I_{i}^{Out}\), the key is a vertex \(v \in {V_{i}^{e}}\), and the value v.Out = {(p₁, p₂, ..., p_n) ∣ i ≠ j ∧ 〈v, p_l, u〉 ∈ F_j ∧ l ∈ {1,...,n}};
3) for any tuple \((v,v.In) \in I_{i}^{In}\), the key is a vertex \(v \in {V_{i}^{e}}\), and the value v.In = {(p₁, p₂, ..., p_n) ∣ i ≠ j ∧ 〈u, p_l, v〉 ∈ F_j ∧ l ∈ {1,...,n}}.

The construction of the boundary node index is shown in Algorithm 4. First, the BN-Index is initialized with the identifier F_i (line 1). Then, for each triple 〈s, p, o〉 in RDF graph G, if s (or o) is a boundary node of F_i and the triple 〈s, p, o〉 ∉ F_i, then (s, p) (or (o, p)) is put into \(I_{i}^{Out}\) (or \(I_{i}^{In}\)) (lines 2-8). Algorithm 4 iterates over the triples until none is left.
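The construction described above can be sketched as follows. This is a minimal sketch rather than a transcription of Algorithm 4: the function name `build_bn_index`, the representation of triples as Python tuples, and the toy graph are all assumptions made for illustration. Empty sets play the role of the NULL values that appear in Example 6.

```python
from typing import Dict, Iterable, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def build_bn_index(
    graph: Iterable[Triple],
    fragment: Set[Triple],
    boundary_nodes: Set[str],
) -> Tuple[Dict[str, Set[str]], Dict[str, Set[str]]]:
    """Return (I_out, I_in) for one fragment F_i."""
    # Every boundary node gets an entry, even if it stays empty (NULL).
    i_out: Dict[str, Set[str]] = {v: set() for v in boundary_nodes}
    i_in: Dict[str, Set[str]] = {v: set() for v in boundary_nodes}
    for s, p, o in graph:
        if (s, p, o) in fragment:
            continue  # only triples stored outside F_i are indexed
        if s in boundary_nodes:
            i_out[s].add(p)  # predicate of an out-edge of a boundary node
        if o in boundary_nodes:
            i_in[o].add(p)  # predicate of an in-edge of a boundary node
    return i_out, i_in


# Hypothetical toy graph: vertex "x" lives in another fragment but is
# connected to the internal vertex "a" of F_i, so "x" is a boundary node.
toy_graph = [
    ("a", "knows", "x"),
    ("a", "type", "b"),
    ("x", "spouse", "y"),
    ("z", "director", "x"),
]
toy_fragment = {("a", "knows", "x"), ("a", "type", "b")}

toy_out, toy_in = build_bn_index(toy_graph, toy_fragment, {"x"})
print(toy_out, toy_in)
```

Storing predicates in sets keeps the later containment test (does the index hold a given unmatched predicate?) a constant-time lookup on average.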

Example 6

As shown in Fig. 2, on site S₃, the boundary nodes (with their predicates in orange) are \({V_{3}^{e}}\) = {v₈, v₁₃, v₂₃, v₂₆, v₂₈}. The corresponding boundary node index is \(I_{3}^{Out}\) = {〈v₈, (spouse)〉, 〈v₁₃, (director)〉, 〈v₂₃, (director)〉, 〈v₂₈, (prime_minister)〉, 〈v₂₆, NULL〉}, \(I_{3}^{In}\) = {〈v₂₆, (director)〉, 〈v₈, NULL〉, 〈v₁₃, NULL〉, 〈v₂₃, NULL〉, 〈v₂₈, NULL〉}.

Example 7

As shown in Fig. 6, the local partial matches on S₃ are PM₃ = {(〈?s, null〉, 〈?a, null〉, 〈?b, v₁₃〉, 〈?c, v₁₇〉, 〈?d, v₁₆〉), (〈?s, null〉, 〈?a, null〉, 〈?b, v₂₃〉, 〈?c, v₁₇〉, 〈?d, v₁₆〉), (〈?s, null〉, 〈?a, v₂₆〉, 〈?b, v₁₉〉, 〈?c, v₁₇〉, 〈?d, v₁₆〉), (〈?s, v₂₀〉, 〈?a, v₂₁〉, 〈?b, v₂₂〉, 〈?c, v₂₈〉, 〈?d, null〉)}. According to the boundary node index, the last two local partial matches cannot contribute to any final result, so they are not sent to the master node.

The process of filtering partial matches is presented briefly in Algorithm 5. On site S_i, each partial match is examined at its boundary nodes (line 3). A partial match can be extended further, and is therefore retained and transmitted to the master site, only when the BN-Index value of the boundary node (i.e., the predicates that belong to other fragments) contains the unmatched predicates of the query graph.
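One possible reading of this filtering rule is sketched below. It is not a transcription of Algorithm 5: the names `can_extend` and `filter_matches`, the encoding of a partial match as a dictionary with `None` for unmatched variables, and the toy vertices are assumptions. The sketch treats a query edge as unmatched when exactly one endpoint is bound to a boundary node and the other endpoint is still unbound, and requires the edge's predicate to appear in the corresponding BN-Index entry.

```python
from typing import Dict, List, Optional, Set, Tuple

QueryEdge = Tuple[str, str, str]   # (subject variable, predicate, object variable)
Match = Dict[str, Optional[str]]   # variable -> data vertex, or None if unmatched
Index = Dict[str, Set[str]]        # boundary node -> predicates in other fragments


def can_extend(match: Match, query_edges: List[QueryEdge],
               i_out: Index, i_in: Index) -> bool:
    """True if every unmatched edge at a boundary node is present in the BN-Index."""
    for s_var, pred, o_var in query_edges:
        s, o = match.get(s_var), match.get(o_var)
        # Subject bound to a boundary node, object unbound: need an out-edge.
        if s in i_out and o is None and pred not in i_out[s]:
            return False
        # Object bound to a boundary node, subject unbound: need an in-edge.
        if o in i_in and s is None and pred not in i_in[o]:
            return False
    return True


def filter_matches(matches: List[Match], query_edges: List[QueryEdge],
                   i_out: Index, i_in: Index) -> List[Match]:
    """Keep only the partial matches worth sending to the master site."""
    return [m for m in matches if can_extend(m, query_edges, i_out, i_in)]


# Hypothetical toy data: d1 and d2 are boundary nodes of the local fragment.
edges = [("?x", "memberOf", "?z"), ("?z", "subOrganizationOf", "?y")]
bn_out = {"d1": {"subOrganizationOf"}, "d2": set()}
bn_in = {"d1": set(), "d2": set()}
pm = [
    {"?x": "s1", "?z": "d1", "?y": None},  # d1 can still reach ?y elsewhere
    {"?x": "s2", "?z": "d2", "?y": None},  # d2 has no such out-edge: dropped
]

kept = filter_matches(pm, edges, bn_out, bn_in)
print(kept)
```

Under this reading the check is purely local: it needs only the BN-Index built at partitioning time, so no extra communication is required before a match is discarded.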

Space complexity of BN-Index

For each fragment F_i, the BN-Index contains at most \(O(|{V_{i}^{e}}|)\) entries. As a result, the extra space of the BN-Index is bounded by \(O(|{V_{i}^{e}}|)\), where \({V_{i}^{e}}\) is the set of boundary nodes of fragment F_i.

7 Experimental evaluation

To verify the effectiveness and efficiency of the IBN-Index method and the BN-Index filtering method under the partial evaluation and assembly framework, a comparative experiment with gStoreD [10, 11] was performed over several benchmark RDF datasets. The proposed algorithm is implemented on top of gStoreD and deployed on a 3-node cluster in which each node runs in a Docker container. The three containers are all deployed on one machine with 16-core Intel Xeon Silver 4216 2.10 GHz processors, 512 GB of RAM, and a 1.92 TB SSD, running the 64-bit CentOS 7.7 operating system.

7.1 Datasets and queries

In this experiment, the proposed method and gStoreD are evaluated using the LUBM [29] synthetic benchmark dataset at different scales. The statistics of the datasets are shown in Table 2. We compare the query efficiency of the method based on IBN-Index, the method based on BN-Index, the combination of the two, and the original gStoreD on different queries. This requires gradually changing the number of intermediate results produced by a query while keeping the basic structure of the query fixed. Therefore, we choose benchmark datasets rather than real-world datasets to keep the dataset size positively correlated with the number of intermediate results. In addition, to eliminate the impact of the partitioning strategy on query performance, we use a random partitioning method to divide each dataset into four fragments.

Table 2 Datasets

Table 3 Queries

As shown in Table 3, eight complex queries of different scales on the LUBM dataset are presented, i.e., \(Q_{1}\sim Q_{8}\). To exhibit the effect of IBN-Index and BN-Index, the queries are designed so that they may generate a large number of intermediate results in the partial evaluation phase, as shown in Fig. 7a.

7.2 Experimental results

To verify the effectiveness of the IBN-Index and BN-Index based partial evaluation algorithm, extensive experiments were conducted.

Exp 1. Number of partial matching results. To make the performance of the partial evaluation algorithm easier to observe and evaluate across queries and datasets, the numbers of partial matching results and complete SPARQL matches of \(Q_{1} \sim Q_{8}\) over LUBM10, LUBM20, and LUBM30 are recorded. The maximal total number of partial evaluation results (including complete SPARQL matches and local partial matches) generated on each site is depicted in Fig. 7a; this number determines the time consumption of the partial evaluation phase and influences the assembly phase. As shown in Fig. 7a, for Q₁, the number of partial evaluation results increases approximately linearly with the size of the datasets. For queries \(Q_{2} \sim Q_{8}\), the number of partial evaluation results is largest on LUBM20, which is caused by the partitioning strategy.

Exp 2. The construction time and space occupied by IBN-Index and BN-Index. Figure 8a shows the largest time overhead and space occupation of the IBN-Index among all slave nodes on different datasets. It can be observed that the time to construct the IBN-Index and the size of the IBN-Index are positively correlated with the size of the graph. Figure 8b presents the construction time and space of the BN-Index, which follow trends similar to those of the IBN-Index.

For IBN-Index, since the value of each node is only a Boolean value, the space required for the index is small, which guarantees the time and space complexity of the proposed method. As for BN-Index, the space required is correlated with the number of boundary nodes, which depends on the graph partitioning strategy. Although the random partitioning method we use produces a large number of intermediate results, the experimental results are consistent with the time and space complexity of the BN-Index. Overall, the size of the BN-Index is proportional to the size of the graph, except on LUBM20. The reason is that the number of crossing edges on LUBM20 is even larger than that on LUBM30, which can be verified by the partial evaluation results depicted in Fig. 7a.

Exp 3. Efficiency of IBN-Index Based Optimization. Figure 9 shows the graph exploration time in the partial evaluation phase of \(Q_{1} \sim Q_{8}\) over different datasets. It can be observed that the partial evaluation method based on IBN-Index outperforms gStoreD on all queries and can improve query performance by 1.64 times in the best case. Furthermore, the method combining IBN-Index and BN-Index (the grey lines) improves the query efficiency by 1.79 times in the best case. As the size of the dataset increases, the proportion of time reduction also increases.

Interestingly, the optimization becomes more significant as the number of query nodes increases. The reasons can be summarized in two aspects. (1) When dealing with more query nodes, the number of candidate sets also rises, leading to more repetitive partial evaluation results. Therefore, the IBN-Index can be applied to more candidate sets, resulting in better pruning efficiency. (2) As the length of the partial evaluation results grows, the candidate nodes are more likely to be pure internal nodes that can be filtered by the IBN-Index.

Exp 4. Efficiency of BN-Index Based Optimization. Although gStoreD already has a partial match filtering strategy, the BN-Index-based method can achieve the same filtering effect with even higher efficiency than gStoreD. As shown in Fig. 9, the graph exploration efficiency of the BN-Index-based method (the blue lines) during partial evaluation exceeds that of gStoreD in most cases, and the advantage grows with the scale of the datasets. The reason is that when dealing with a large number of candidate vertices, gStoreD collects all the internal candidate vertices from all slave sites and transmits them back to all sites, which seriously enlarges the search space of graph exploration. In contrast, the BN-Index does not enlarge the candidate sets.

Exp 5. Scalability. To demonstrate the scalability of the IBN-Index and BN-Index based methods, the total query times of the improved partial evaluation and assembly method over the eight queries are presented in Fig. 7b. The query time of our method is nearly linear in the scale of the datasets. Unfortunately, all runs of query Q₃, for gStoreD as well as for the IBN-Index and BN-Index based methods, failed on the LUBM30 dataset due to limited memory.

8 Conclusion

In this paper, we proposed an inner boundary node index-based method and a boundary node index-based method to improve the computing efficiency of subgraph matching queries in distributed settings based on the partial evaluation and assembly framework. Moreover, we showed that the IBN-Index and BN-Index are both time-efficient and space-efficient. The extensive experimental results on benchmark datasets verified the efficiency and scalability of the proposed method, which clearly outperforms gStoreD when large-scale intermediate results need to be processed.

References

Wang, X., Zou, L., Wang, C., Peng, P., Feng, Z.: Research on knowledge graph data management: a survey. J. Softw. 30(7), 2140 (2019)
Consortium, W.W.W., et al.: Sparql 1.1 overview (2013)
Pérez, J., Arenas, M., Gutierrez, C.: Semantics and complexity of sparql. In: International Semantic Web Conference, pp. 30–43. Springer (2006)
Chandra, A.K., Merlin, P.M.: Optimal implementation of conjunctive queries in relational data bases. In: Proceedings of the Ninth Annual ACM Symposium on Theory of Computing, pp. 77–90 (1977)
Ren, X., Wang, J., Han, W.-S., Yu, J.X.: Fast and robust distributed subgraph enumeration. arXiv:1901.07747 (2019)
Jones, N.D.: An introduction to partial evaluation. ACM Computing Surveys (CSUR) 28(3), 480–503 (1996)
Fan, W., Wang, X., Wu, Y.: Performance guarantees for distributed reachability queries. arXiv:1208.0091 (2012)
Wang, X., Wang, J., Zhang, X.: Efficient distributed regular path queries on rdf graphs using partial evaluation. In: Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, pp. 1933–1936 (2016)
Wang, X., Wang, S., Xin, Y., Yang, Y., Li, J., Wang, X.: Distributed pregel-based provenance-aware regular path query processing on rdf knowledge graphs. World Wide Web 23(3), 1465–1496 (2020)
Peng, P., Zou, L., Özsu, M.T., Chen, L., Zhao, D.: Processing sparql queries over distributed rdf graphs. The VLDB Journal 25(2), 243–268 (2016)
Peng, P., Zou, L., Guan, R.: Accelerating partial evaluation in distributed sparql query evaluation. In: 2019 IEEE 35th International Conference on Data Engineering (ICDE), pp. 112–123. IEEE (2019)
Ge, Y.-F., Orlowska, M., Cao, J., Wang, H., Zhang, Y.: Mdde: multitasking distributed differential evolution for privacy-preserving database fragmentation. The VLDB Journal, pp. 1–19 (2022)
Ge, Y.-F., Yu, W.-J., Cao, J., Wang, H., Zhan, Z.-H., Zhang, Y., Zhang, J.: Distributed memetic algorithm for outsourced database fragmentation. IEEE Trans. Cybern. 51(10), 4808–4821 (2020)
Rohloff, K., Schantz, R. E.: Clause-iteration with mapreduce to scalably query datagraphs in the shard graph-store. In: Proceedings of the Fourth International Workshop on Data-intensive Distributed Computing, pp. 35–44 (2011)
Husain, M., McGlothlin, J., Masud, M. M., Khan, L., Thuraisingham, B. M.: Heuristics-based query processing for large rdf graphs using cloud computing. IEEE Trans. Knowl. Data Eng. 23(9), 1312–1327 (2011)
Zeng, K., Yang, J., Wang, H., Shao, B., Wang, Z.: A distributed graph engine for web scale rdf data. Proceedings of the VLDB Endowment 6(4), 265–276 (2013)
Buneman, P., Cong, G., Fan, W., Kementsietsidis, A.: Using partial evaluation in distributed query evaluation. In: Proceedings of the 32nd International Conference on Very Large Data Bases, pp. 211–222 (2006)
Cong, G., Fan, W., Kementsietsidis, A., Li, J., Liu, X.: Partial evaluation for distributed xpath query processing and beyond. ACM Transactions on Database Systems (TODS) 37(4), 1–43 (2012)
Ma, S., Cao, Y., Huai, J., Wo, T.: Distributed graph pattern matching. In: Proceedings of the 21st International Conference on World Wide Web, pp. 949–958 (2012)
O’Neil, P. E.: The sb-tree: an index-sequential structure for high-performance sequential access. Acta Informatica 29(3), 241–265 (1992)
Weiss, C., Karras, P., Bernstein, A.: Hexastore: sextuple indexing for semantic web data management. Proceedings of the VLDB Endowment 1(1), 1008–1019 (2008)
Neumann, T., Weikum, G.: Rdf-3x: a risc-style engine for rdf. Proceedings of the VLDB Endowment 1(1), 647–659 (2008)
Deppisch, U.: S-tree: a dynamic balanced signature index for office retrieval. In: Proceedings of the 9th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 77–87 (1986)
Zou, L., Mo, J., Chen, L., Özsu, M.T., Zhao, D.: gstore: answering sparql queries via subgraph matching. Proceedings of the VLDB Endowment 4(8), 482–493 (2011)
He, H., Singh, A.K.: Closure-tree: An index structure for graph queries. In: 22nd International Conference on Data Engineering (ICDE’06), pp. 38–38. IEEE (2006)
Zhao, P., Han, J.: On graph query optimization in large networks. Proceedings of the VLDB Endowment 3(1-2), 340–351 (2010)
Sasaki, Y., Fletcher, G., Onizuka, M.: Structural indexing for conjunctive path queries. arXiv:2003.03079 (2020)
Wang, X., Chai, L., Xu, Q., Yang, Y., Li, J., Wang, J., Chai, Y.: Efficient subgraph matching on large rdf graphs using mapreduce. Data Sci. Eng. 4(1), 24–43 (2019)
Guo, Y., Pan, Z., Heflin, J.: Lubm: a benchmark for owl knowledge base systems. Journal of Web Semantics 3(2-3), 158–182 (2005)
Xing, J., Liu, B., Li, J., Choudhury, F.M., Wang, X.: Optimal subgraph matching queries over distributed knowledge graphs based on partial evaluation. In: International Conference on Web Information Systems Engineering, pp. 274–289. Springer (2021)

Acknowledgments

This work extends the paper on optimal subgraph matching queries over distributed knowledge graphs based on partial evaluation [30]. It is supported by the National Key Research and Development Program of China (2019YFE0198600) and the National Natural Science Foundation of China (61972275), and partially supported by the Australian Research Council Linkage Project (LP180100750).

Author information

Authors and Affiliations

College of Intelligence and Computing, Tianjin University, Tianjin, China
Yanyan Song, Yuzhou Qin, Wenqi Hao, Pengkai Liu & Xin Wang
School of Information Technology, Deakin University, Melbourne, Australia
Jianxin Li
School of Computing and Information Systems, The University of Melbourne, Melbourne, Australia
Farhana Murtaza Choudhury
School of Data Science, City University of Hong Kong, Hong Kong, China
Qingpeng Zhang

Corresponding author

Correspondence to Xin Wang.

Ethics declarations

Conflict of Interests

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article belongs to the Topical Collection: Special Issue on Web Information Systems Engineering 2021

Guest Editors: Hua Wang, Wenjie Zhang, Lei Zou, and Zakaria Maamar

Appendix: Workload Queries on LUBM

The queries of the workload (\(Q_{1} \sim Q_{8}\)) designed on LUBM are listed as follows:

PREFIX ub: <http://swat.cse.lehigh.edu/onto/univ-bench.owl#>

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

Q1: SELECT ?X ?Y ?Z ?d WHERE{ ?X ub:memberOf ?Z . ?Z ub:subOrganizationOf ?Y . ?d ub:undergraduateDegreeFrom ?Y }
Q2: SELECT ?X ?Y ?Z ?d WHERE{ ?X ub:memberOf ?Z . ?Z ub:subOrganizationOf ?Y . ?d ub:mastersDegreeFrom ?Y . }
Q3: SELECT ?X ?Y ?Z ?d ?c WHERE{ ?X ub:memberOf ?Z . ?Z ub:subOrganizationOf ?Y . ?d ub:undergraduateDegreeFrom ?Y . ?d ub:takesCourse ?c . }
Q4: SELECT ?X ?Y ?Z ?d ?c WHERE{ ?X ub:memberOf ?Z . ?Z ub:subOrganizationOf ?Y . ?d ub:mastersDegreeFrom ?Y . ?d ub:teacherOf ?c . }
Q5: SELECT ?X ?Y ?Z ?d ?c ?t WHERE{ ?X ub:memberOf ?Z . ?Z ub:subOrganizationOf ?Y . ?d ub:undergraduateDegreeFrom ?Y . ?d ub:takesCourse ?c . ?t ub:teachingAssistantOf ?c . }
Q6: SELECT ?X ?Y ?Z ?d ?c ?t WHERE{ ?X ub:memberOf ?Z . ?Z ub:subOrganizationOf ?Y . ?d ub:mastersDegreeFrom ?Y . ?d ub:teacherOf ?c . ?t ub:teachingAssistantOf ?c . }
Q7: SELECT ?X ?Y ?Z ?d ?p ?c ?t WHERE{ ?X ub:advisor ?p . ?X ub:memberOf ?Z . ?Z ub:subOrganizationOf ?Y . ?d ub:mastersDegreeFrom ?Y . ?d ub:takesCourse ?c . ?t ub:teachingAssistantOf ?c . }
Q8: SELECT ?X ?Y ?Z ?d ?c ?t ?p WHERE{ ?X ub:advisor ?p . ?X ub:memberOf ?Z . ?Z ub:subOrganizationOf ?Y . ?d ub:mastersDegreeFrom ?Y . ?d ub:teacherOf ?c . ?t ub:teachingAssistantOf ?c . }

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Song, Y., Qin, Y., Hao, W. et al. Optimizing subgraph matching over distributed knowledge graphs using partial evaluation. World Wide Web 26, 751–771 (2023). https://doi.org/10.1007/s11280-022-01075-6

Received: 28 February 2022
Revised: 12 May 2022
Accepted: 02 June 2022
Published: 08 July 2022
Issue Date: March 2023
DOI: https://doi.org/10.1007/s11280-022-01075-6
