1. Introduction
Knowledge graphs, as semantic networks, have strong expressive power and modeling flexibility. Many knowledge graphs have been published for a variety of practical needs, such as DBpedia [1], Freebase [2], YAGO [3], and IMDb (http://www.imdb.com, accessed on 9 December 2020). The idea of the knowledge graph is widely used in intelligent question answering [4], recommendation systems [5], semantic search [6], and other fields. However, due to the lack of unified representation standards for data and information, and/or differences in the methods of obtaining data [7], knowledge about the same real-world entity is represented in various forms across different knowledge graphs. This is not conducive to knowledge sharing between different domains and applications.
Instance matching (IM) is defined as establishing a specific type of semantic link, the identity link represented by owl:sameAs, between instances. IM is also known as entity alignment [8], record linkage [9], duplicate detection [10], or co-reference resolution [11]. It allows us to explicitly link two instances that refer to the same real-world entity. When merging different knowledge graphs, instance matching is adopted to achieve consistency and integrity.
Instance matching has attracted attention since 2009 [12], but an ultimate solution remains an open research problem. As the scale of the built knowledge graphs increases, the efficiency and cost requirements of instance matching methods become stricter. Matching instances between knowledge graphs corresponds to the Clique problem in graph theory, which is NP-complete [13,14]. The Clique problem is to find cliques in an undirected graph, where a clique is a complete subgraph. Briefly, consider two knowledge graphs as one graph whose vertices are the instances of the two knowledge graphs and whose edges are identity links. Then, a clique represents a set of instances that point to the same real-world entity, and instance matching is the problem of listing all such cliques. Earlier methods [12,15] are not suitable for processing large-scale knowledge graphs containing tens of thousands of instances, mainly because these frameworks usually require pair-by-pair comparison of instances from different knowledge graphs. To the best of our knowledge, there are two main approaches to matching instances between large-scale knowledge graphs. (i) A blocking algorithm can be adopted to reduce the search space. This type of approach divides instances into overlapping blocks and executes the matching process only within blocks. ServOMap [16], VMI [17], RiMOM-IM [18], ScLink [19] and other frameworks [20] leverage this approach. (ii) A distributed architecture can be utilized to provide sufficient computing resources. A distributed file system can store large knowledge graphs, and a distributed computing model such as MapReduce [21] allows the instance matching process to be divided into multiple matching tasks executed by multiple workers. Frameworks that adopt this approach include LINDA [14], BIGMAT [22], etc.
There remain challenges to be solved in large-scale instance matching. (i) A large number of candidate instance pairs need to be compared during the matching process, which adversely impacts matching efficiency. Although blocking algorithms can reduce the number of candidate pairs, the blocking algorithms adopted by conventional frameworks [16,17,18] prefer to achieve high recall by replicating instances into multiple blocks. This redundancy leads to extra candidate pairs, which increases the matching time. (ii) It is difficult to achieve a reasonable balance between matching efficiency and matching quality. Although standalone frameworks [18,20,23] can obtain high-quality matching results, they have high time and computing-resource requirements when matching large-scale knowledge graphs. Meanwhile, several distributed frameworks [14,22] have been proposed and claimed to process large-scale knowledge graphs efficiently, but their matching quality can be further improved.
To tackle the above challenges, we propose a novel blocking algorithm, MultiObJ, to select candidate pairs effectively. The proposed algorithm constructs inverted indexes for instances based on the ordered joint of multiple objects' features, and the results of the joint serve as evidence for blocking. Only instances from different knowledge graphs within the same block can form candidate pairs. Based on the proposed algorithm, we design a distributed instance matching framework, FTRLIM (code: https://github.com/TOJSSE-iData/ftrlim, accessed on 12 May 2021). It first adopts MultiObJ to select candidate pairs. Then, it calculates the similarity of objects under pre-aligned predicates to generate similarity vectors for candidate pairs based on the attributes and relationships of instances. FTRLIM models the instance matching problem as logistic regression and leverages the online logistic regression model FTRL [24] to determine whether candidate pairs are matched. The framework is implemented in a distributed architecture and scales well. In addition, we construct three data collections based on real-world data with different scales and levels of heterogeneity for comprehensive evaluation. These data collections can serve as benchmarks for the quantitative evaluation of blocking algorithms and instance matching frameworks in future research.
FTRLIM participated in the SPIMBENCH Track at OAEI 2019 and outperformed other state-of-the-art frameworks. This paper further evaluates the MultiObJ blocking algorithm and the FTRLIM framework on the three constructed data collections and two real-world datasets. Experimental results show that MultiObJ generates far fewer candidate pairs than RiMOM-IM's blocking method [18] and brings a distinct matching efficiency improvement for the FTRLIM framework. Evaluation results on matching quality show that FTRLIM achieves the same level of F1-score as the best of more than ten advanced frameworks. Besides, FTRLIM is capable of matching instances between large-scale knowledge graphs with satisfactory quality and efficiency, and the time cost of matching decreases as the number of available cores in the distributed cluster increases.
The main contributions of our work can be summarized as follows:
We propose a novel blocking algorithm, MultiObJ, which divides instances into blocks by utilizing the ordered joint of multiple objects' features. Experimental results indicate that the proposed algorithm significantly reduces the number of candidate instance pairs with only an inconspicuous effect on matching quality.
We design and implement FTRLIM, a distributed instance matching framework for large-scale knowledge graphs based on MultiObJ. FTRLIM is able to match instances between large-scale knowledge graphs efficiently, mining matched instances with the online logistic regression model Follow-The-Regularized-Leader (FTRL). Experimental results show that FTRLIM overall outperforms other state-of-the-art frameworks on real-world datasets and has excellent scalability and efficiency.
We construct three data collections with gold standards based on a real-world large-scale knowledge graph. The knowledge graphs in these three data collections have different scales and levels of heterogeneity to meet various evaluation purposes. We evaluate MultiObJ and FTRLIM on these three data collections and two real-world datasets. The constructed data collections and experimental results can be replicated by others and provide a potential baseline for further research.
The rest of this paper is organized as follows. In Section 2, we review related work. We formally describe the instance matching problem in Section 3. In Section 4, we describe the detailed working principle and process of the FTRLIM framework. Experiments and analyses are presented in Section 5. We analyze the time complexity of our framework in Section 6. Section 7 concludes this paper.
2. Related Work
The term knowledge graph (KG) has been widely used since Google published their work in 2012 [25]. Recently, Färber et al. used KG to describe any Resource Description Framework (RDF) graph [26]. RDF is an infrastructure designed for encoding, exchanging, and reusing structured data [27]. It has been widely used in different domains to store and share knowledge. The European Bioinformatics Institute (EBI) developed the EBI RDF platform [28] for describing, publishing, and linking life science data. The Open European Nephrology Science Center leverages the RDF model to share and search medical data among research groups [29]. The GEOLink [30] database provides geoscience metadata repositories in RDF format and allows users to query and reason over them seamlessly. Recently, the team of Ali developed frameworks that transform social network data into structured data for traffic event detection and condition analysis [31] and for intelligent healthcare monitoring [32].
Although RDF is a standard language for describing resources on the network, descriptions can be subjective and vary across applications, which creates obstacles to knowledge sharing within the same domain or even across domains. One way to overcome this obstacle is instance matching. Many methods have been proposed to complete the instance matching task. Several state-of-the-art instance matching methods evolved from ontology matching methods, such as LogMap [33], AML [34], RiMOM-IM [18], and Lily [35]. The first three frameworks adopt the idea of bootstrapping and iteratively discover more matched instance pairs based on pairs that are already matched. PARIS [23] adopts a similar idea and models the probability that two instances match; it is able to match both schema and instances. Lily [35] focuses more on ontology matching, and manual adjustments are required when it completes the instance matching task. VMI [17] and VDLS [20] model the instance matching problem as a document matching problem and build vectors for instances based both on their local information and their neighbors' information; they determine whether two instances are matched by calculating the similarity between their vectors. SERIMI [36] selects the most discriminative attribute of instances by computing the entropy of each attribute and builds pseudo-homonym sets of instances, completing class-based disambiguation with a set similarity function.
Researchers have been exploring machine learning and deep learning methods for instance matching. Supervised learning-based methods [37,38,39] treat instance matching as a binary classification problem and require labeled instances to train the model. Among them, TrAdaBoost [38] adopts a transfer learning algorithm to obtain training data, which reduces the manual labeling work. Moreover, rather than training models to match instances directly, MDedup [40] trains models to discover matching dependencies (MDs) for selecting matched instances, where an MD is one of the relaxed forms [41,42] of functional dependency [43] in data mining. Semi-supervised learning methods [44,45], unsupervised learning methods [46,47], and self-supervised learning models [48] have also been introduced into the field of instance matching. Besides, works on representation learning for matching instances are gradually emerging [49,50,51]. These methods first embed the instances of each graph into separate low-dimensional dense semantic spaces. Then, they align the spaces according to pre-matched instances to find more matched instance pairs. There are also frameworks designed for training and evaluating the embedding models, such as References [52,53,54]. Compared with these works, the FTRL model is more lightweight, and it can give the probability that two instances are matched. We introduce FTRL in more detail in Section 4.3.
How to deal with large-scale data has become an inevitable problem in instance matching. As described in Section 1, the instance matching problem corresponds to the NP-complete Clique problem. The most popular solution for large-scale IM is blocking. This approach divides similar instances into blocks and limits comparisons to within blocks. Some argue that blocking-based instance matching is the best approach for efficient matching [55]. Some blocking algorithms require manual work [56], while automated blocking algorithms are applied by different instance matching frameworks [16,17,18,34]. These methods generate inverted indexes for instances by analyzing their attributes or types, and blocks are generated according to these indexes. The blocking approach can split a large-scale instance matching task into multiple subtasks, so it is usually performed as the first step of large-scale instance matching methods. A more detailed survey is presented in Papadakis's research [57]. The blocking method most similar to ours may be the one proposed in Reference [18]. This method distinguishes the objects related to different predicates and regards an instance pair with a unique index, i.e., a unique pair, as a matched pair. The obvious difference is that we also consider the correlation among different predicates, which further reduces the overlap between blocks. We use the features of an object rather than always using the entire object to construct block keys, which improves robustness. Moreover, we only consider unique pairs a special type of candidate pair, rather than directly matched pairs, to improve precision.
Adopting a distributed architecture is another way to perform large-scale instance matching. The LINDA framework [14] performs instance matching by considering joint evidence for instances and adopts a distributed version of its algorithm. MSBlockSlicer [58] pays attention to the problem of load imbalance and adopts a block slicing strategy to balance the load of each worker in the distributed cluster. The BIGMAT framework [22] applies an affinity-preserving random walk algorithm to express IM as a graph-based node ranking and selection problem in a constructed candidate association graph and selects matching results through a distributed architecture. Our framework leverages the proposed blocking algorithm to divide the matching task into multiple logistic regression tasks that can be executed in a distributed manner. We also introduce a load balancing mechanism to make full use of cluster resources.
As the number of proposed methods increases, researchers established the Ontology Alignment Evaluation Initiative (OAEI) to evaluate these methods. The evaluation is carried out on multiple tracks. The SPIMBENCH Track is one of the newest tracks for instance matching evaluation. FTRLIM was evaluated on this track in 2019 and outperformed other state-of-the-art frameworks.
4. The FTRLIM Framework
This section introduces the detailed working process of FTRLIM. The proposed framework consists of four major components: Blocker, Comparator, FTRL Trainer, and Matcher. An overview of FTRLIM's workflow is presented in Figure 1. Blocker obtains instance pairs to be compared, adopting the proposed MultiObJ blocking algorithm to reduce the number of candidate pairs. Comparator is responsible for generating similarity vectors for each instance pair. FTRL Trainer takes similarity vectors and their scores as inputs to train the FTRL model, while Matcher adopts the trained model to determine whether instances are matched. The training process is optional because FTRLIM allows users to load a pre-trained model. The framework is implemented in a distributed architecture.
4.1. Blocker
Identifying matched instance pairs by comparing every two instances is time- and space-consuming. To solve this problem, FTRLIM adopts the MultiObJ blocking algorithm to efficiently select candidate instance pairs that are more likely to be matched. This work is done by Blocker.
The basic idea of the MultiObJ blocking algorithm is to construct indexes for each instance by leveraging features of its related objects. When constructing indexes, the interactions among different predicates of the instance should also be considered. Features of the objects under different predicates are jointed to form the indexes of the instance, which allows instances to be divided at a fine granularity. This idea is intuitive: in the real world, people can use multiple modifiers when describing an entity, and the more modifiers there are, the easier it is for others to locate the entity.
The MultiObJ blocking algorithm accepts the triples of both the source KG and the target KG and a predefined list of predicates P as inputs, and it outputs candidate instance pairs and unique instance pairs. It includes three phases: Initialization, Indexing, and Candidate Pair Generation. In the Initialization phase, MultiObJ first creates the candidate pair set C, the unique pair set U, the index table K, and the inverted index table B. The index table K stores the indexes of instances, while the inverted index table B stores the mapping from an index to instances. Then, it allocates a common initial inverted index to all instances of both KGs and updates K and B, so that all instances are in the same block at the very beginning. In the Indexing phase, the algorithm sequentially extracts the features of related objects according to the predefined predicate list P and constructs inverted indexes for the instances. Instances with the same index are divided into the same block. As the iteration deepens, large blocks are subdivided into multiple small blocks. Instances whose objects are missing are also supported. The Candidate Pair Generation phase combines instances from different knowledge graphs in the same block into candidate instance pairs.
The core phase of the algorithm is the Indexing phase, which includes three subphases: Explicit Indexing, Unique Pair Generation, and Index Inference. The algorithm processes each predicate p in the predefined predicate list P iteratively. Each iteration owns two additional predicate-specific tables: an index table and an inverted index table, which store provisional results that are passed to K and B at the end of the iteration. The MultiObJ algorithm aims at leveraging object features of instances to divide blocks. It extracts object features and builds inverted indexes for instances in the Explicit Indexing phase, and it utilizes the unique information among these features to generate unique pairs in the Unique Pair Generation phase. However, instances may have no corresponding object under certain predicates; MultiObJ infers possible features of these instances in the Index Inference phase. The following paragraphs introduce the details of the algorithm.
We call the initial index and the indexes generated for an instance i in the previous iteration the pre-indexes of i. In the Explicit Indexing phase, all instances are divided into two parts with the function devideByObjectMissing, depending on whether i has objects under the predicate p. This phase concentrates on building indexes for instances that do have such objects. The features of each related object o are extracted with the function extractObjectFeatureSet; the strategies of feature extraction will be introduced later. The algorithm concatenates the extracted features with each pre-index of the instance i to form its current indexes using the function catPreIdxAndFeature and records the result in the predicate-specific index table. The predicate-specific inverted index table is also updated using the function updateInvertedIndexTable.
In the Unique Pair Generation phase, the algorithm detects unique instance pairs. If and only if exactly one pair of instances from different data sources shares a certain index, these two instances are considered a unique instance pair. When two instances have the same index, and the index is unique in both the source KG and the target KG, they are intuitively the most likely to be matched. FTRLIM realizes this intuition by setting a lower threshold for unique pairs when determining whether two instances are matched, which will be introduced later.
The lack of knowledge is considered in the Index Inference phase to avoid losing candidate instance pairs as much as possible. If the expected object of instance s from the source KG under the predicate p is missing, the lack of knowledge occurs in the source KG. MultiObJ identifies all instances in the target KG that have the same pre-index as s using the function getInstByPreIdx and uses all their indexes generated according to p as indexes of s. Moreover, s is also indexed by a special string NULL to indicate that it has no corresponding object under the predicate p. The same process is performed on instances with missing objects in the target KG. In this way, instances without objects under p have a wildcard as an index, so they can still form candidate pairs with other instances. When an iteration ends, the current indexes of instances become the new pre-indexes. The pseudo-code of the MultiObJ blocking algorithm is shown in Algorithm 1.
Algorithm 1 MultiObJ
Input: S, source knowledge graph; T, target knowledge graph; P, list of predicates used to generate indexes
Output: C, candidate pair set; U, unique pair set
Initialization:
1: create C, U, the index table K, and the inverted index table B
2: assign a common initial index to every instance of S and T, and update K and B
Indexing:
3: for each predicate p in P do
4:   Explicit Indexing: for each instance i that has objects under p, extract the feature set of each object, concatenate each feature with each pre-index of i to form the current indexes, and update the predicate-specific index and inverted index tables
5:   Unique Pair Generation: for each index shared by exactly one source instance and one target instance, add the pair to U
6:   Index Inference: for each instance i without objects under p, let i inherit the current indexes of the instances from the other KG that share a pre-index with i, and additionally index i with its pre-indexes concatenated with NULL
7:   merge the predicate-specific tables into K and B; the current indexes become the new pre-indexes
Candidate Pair Generation:
8: for each block, combine instances from different KGs into candidate pairs and add them to C
9: return C and U
An example of the Indexing phase is given in Table 1 and Table 2. Table 1 shows the relationship between the object features, the current indexes, and the pre-indexes of each instance in an iteration. The current indexes of S3 and T3 are generated in the Index Inference phase due to the lack of related objects. Table 2 shows the blocking results in this case. It should be pointed out that only the pair whose shared index was constructed directly forms a unique pair; the pairs involving the inferred indexes of S3 and T3 do not, because unique pair generation is completed before the Index Inference phase. This setting reflects that inferred indexes are not as reliable as directly constructed indexes.
MultiObJ extracts objects' features to construct instance indexes with the function extractObjFeatureSet. Many methods have been proposed to implement this function, such as extracting keywords with TF-IDF, extracting tokens with q-grams, and using the first three or four letters as tokens [60]. We believe that different feature extraction methods and indexing strategies should be adopted for texts of different lengths and types. During our exploration of the data, we observed that the objects corresponding to some predicates always come from a finite set, while others do not. Specifically, we divide predicates into two types: the enumerative predicate (EP) and the diverse predicate (DP). Objects of an EP form an enumerable set, while objects of a DP vary with subjects. For a subject of type people, the predicate hasGender is an enumerative predicate, while the predicate hasName is a diverse predicate. Therefore, two index construction strategies can be applied. For enumerative predicates, the features of the corresponding objects are the objects themselves, which can be adopted as the construction basis of instance indexes after unified processing; this is the full index construction (FIC) strategy. For diverse predicates, keywords of the corresponding objects can be extracted to form the features for constructing indexes; this is the keyword index construction (KIC) strategy. Since an EP is more reliable than a DP, applying the FIC strategy before the KIC strategy reduces the chance of an instance being incorrectly blocked in MultiObJ.
For the KIC strategy, we also design a new algorithm, CombKey, to deal with long text. It extracts more discriminative features of objects to generate blocks with lower overlap. The algorithm first densely ranks the words in objects by word frequency from low to high; top-ranked words, i.e., those with lower frequency, are considered keywords. After that, CombKey combines keywords in pairs as tokens according to the ranking. Since the possibility of repeated words within an object is low, only word frequency is used as the ranking indicator, considering the cost of calculating TF-IDF and other complex indicators. The details of the CombKey algorithm are shown in Algorithm 2. CombKey is designed for texts whose length is larger than a threshold, which is set to 2 empirically. For shorter texts, each word can be used as an object feature to improve robustness.
Algorithm 2 CombKey
Input: i[p], object of instance i under predicate p; Cp, word frequency counter of objects under predicate p; R, maximum rank of words used for extracting features
Output: set of object features of instance i under predicate p
1: split i[p] into words and look up their frequencies in Cp
2: if i[p] is shorter than the length threshold then
3:   return each word as a feature
4: else
5:   densely rank the words by frequency from low to high
6:   keep the words whose rank is at most R as keywords
7:   combine the keywords in pairs according to the ranking to form tokens
8: return the token set as the feature set
Table 3 demonstrates an example of CombKey results on the Restaurant dataset. The format of the ID is KG-Instance. In CombKey, the names of the given instances are split into words and counted globally. Then, CombKey densely ranks the words by their frequency. In the end, CombKey combines words of different ranks as objects' features. Although the word 'club' occurs in all instances' names, CombKey avoids regarding this single word as a feature and distinguishes the first two instances from the last two.
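The ranking-and-combination steps above can be sketched in Python as follows. The function name, its parameters, and the tie-handling details are assumptions for illustration, not the paper's exact procedure:

```python
from collections import Counter
from itertools import combinations

def comb_key(text, counter, max_rank=2, length_threshold=2):
    """Illustrative sketch of CombKey (names and defaults assumed).

    counter: global word-frequency Counter over all objects under the
    predicate. max_rank: maximum dense rank (by ascending frequency) of
    words kept as keywords. Texts no longer than length_threshold return
    each word as a feature, as suggested for short texts.
    """
    words = text.lower().split()
    if len(words) <= length_threshold:
        return set(words)
    # Dense-rank the distinct frequencies from low (rare) to high (common).
    rank = {f: r + 1 for r, f in enumerate(sorted({counter[w] for w in words}))}
    keywords = sorted({w for w in words if rank[counter[w]] <= max_rank})
    # Combine keywords in pairs to form low-overlap block tokens.
    pairs = {"+".join(pair) for pair in combinations(keywords, 2)}
    return pairs or set(keywords)
```

On names such as "pacific dining club" and "atlantic grill club", the shared word 'club' receives the highest frequency rank and is excluded, so the two restaurants end up with disjoint token sets, mirroring the behavior described for Table 3.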
Another example is given in Figure 1 to illustrate how the MultiObJ blocking algorithm works. The algorithm leverages the objects' features under the predicates p (the orange arrow) and q (the blue arrow) in turn to construct indexes for the instances. The six instances in Figure 1 have the same object under p, so they are first divided into the same block. Then, according to the objects under q, instances A, X, and Y are divided into one new block, while instances B and Z are divided into another block. Note that instance C has no object under q. MultiObJ checks the indexes of instances X, Y, and Z as part of the inference results for C's indexes, because these three instances are in the same block as C (divided according to the objects under p) but are from the target KG. In this case, C is divided into both the block containing X and Y and the block containing Z.
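The Figure 1 scenario can be reproduced with a minimal Python sketch of MultiObJ. The data layout (instance id mapped to a predicate-to-objects dict), the function names, and the single-feature extractor (mimicking the FIC strategy) are illustrative assumptions, not the framework's implementation:

```python
from collections import defaultdict

def multiobj(source, target, predicates, extract=lambda o: {o}):
    """Illustrative sketch of MultiObJ blocking (not the reference code)."""
    graphs = {"S": source, "T": target}
    # Initialization: every instance starts under one common pre-index.
    pre_idx = {(g, i): {"ROOT"} for g, kg in graphs.items() for i in kg}
    unique_pairs = set()

    for p in predicates:  # Indexing: one iteration per predicate
        cur_idx = defaultdict(set)
        inverted = defaultdict(lambda: defaultdict(set))
        missing = []
        # Explicit Indexing: join each object feature with each pre-index.
        for (g, i), pres in pre_idx.items():
            objs = graphs[g][i].get(p)
            if not objs:
                missing.append((g, i))
                continue
            for o in objs:
                for f in extract(o):
                    for pre in pres:
                        key = pre + "|" + str(f)
                        cur_idx[(g, i)].add(key)
                        inverted[key][g].add(i)
        # Unique Pair Generation: runs BEFORE inference, so inferred
        # indexes never yield unique pairs.
        for key, by_g in inverted.items():
            if len(by_g["S"]) == 1 and len(by_g["T"]) == 1:
                unique_pairs.add((next(iter(by_g["S"])), next(iter(by_g["T"]))))
        # Index Inference: an instance without objects under p inherits the
        # indexes of other-KG instances sharing a pre-index, plus a NULL key.
        for g, i in missing:
            other = "T" if g == "S" else "S"
            cur_idx[(g, i)] = {pre + "|NULL" for pre in pre_idx[(g, i)]}
            for (g2, j), pres2 in pre_idx.items():
                if g2 == other and pres2 & pre_idx[(g, i)]:
                    cur_idx[(g, i)] |= cur_idx[(g2, j)]
        pre_idx = {k: set(cur_idx[k]) for k in pre_idx}

    # Candidate Pair Generation: cross-KG pairs inside each block.
    blocks = defaultdict(lambda: defaultdict(set))
    for (g, i), keys in pre_idx.items():
        for key in keys:
            blocks[key][g].add(i)
    candidates = {(s, t) for by_g in blocks.values()
                  for s in by_g["S"] for t in by_g["T"]}
    return candidates, unique_pairs
```

With instances A, B, C in the source KG and X, Y, Z in the target KG sharing one object under p, and the q-objects as in Figure 1, the sketch places C in both blocks and forms only 6 candidate pairs instead of the 9 exhaustive ones.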
In a knowledge graph, if the number of instances with a certain index is much greater than the number of instances with another, the problem of data skew will occur and affect the efficiency of subsequent computation. We introduce a load balancing mechanism to avoid data skew. FTRLIM uses the Fast-AGMS sketch [61] to estimate the distribution of instance indexes, then estimates the workload of each core in the cluster and reassigns work to balance the load.
4.2. Comparator
To obtain the similarity of each pair of instances, all candidate pairs are sent to the Comparator. The Comparator compares two instances under various predicates in different ways. The edit distance similarity is calculated for textual instance attributes, while the overlap similarity or the Jaccard similarity is calculated for instance relations. The calculation results are sorted in order to form the similarity vector. Formally, let the list of predicates adopted by Comparator be (p_1, p_2, ..., p_n); then the similarity vector of the two instances is v = (s_1, s_2, ..., s_n), where s_i is the similarity of the two instances under the i-th predicate.
Table 4 shows an example of similarity vector generation. The listed instances are two documents. The column Sim1 represents the edit distance similarity of their labels, and the column Sim2 represents the overlap similarity of sets of their authors.
When calculating the similarity of instance pairs under a certain predicate, some instances may have no corresponding objects due to data flaws in the knowledge graph itself. A naïve way to obtain the similarity is to assign it a default value of 0. However, this solution may confuse the lack of knowledge with dissimilarity. To differentiate the two cases more clearly, we use the ratio of the number of instances with objects under a predicate to the total number of instances to represent the completeness of this predicate. If most instances have objects under a predicate, an instance may be more distinctive when its object is missing. Based on this consideration, we believe that the higher a predicate's completeness, the lower the similarity between an instance without objects and other instances should be. Formally, we define the default similarity for instance pairs without attributes or relations as a decreasing function of the completeness |I_S^p| / |I_S| (resp. |I_T^p| / |I_T|) of the predicate p, where I_S and I_T denote the instance sets of the source and target knowledge graphs, and I_S^p and I_T^p are the sets of instances with objects corresponding to p.
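As a concrete illustration of the Comparator's similarity vector and of the completeness-based default, here is a minimal Python sketch. The exact default-similarity formula is not fully specified above; the form used here (1 minus the larger of the two completeness values), together with the function names and data layout, is an assumption:

```python
def edit_similarity(a, b):
    """Normalized Levenshtein similarity for textual attributes."""
    if not a and not b:
        return 1.0
    m, n = len(a), len(b)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1,
                         prev[j - 1] + (a[i - 1] != b[j - 1]))
        prev = cur
    return 1.0 - prev[n] / max(m, n)

def overlap_similarity(s, t):
    """Overlap coefficient for relation (set-valued) objects."""
    return len(s & t) / min(len(s), len(t)) if s and t else 0.0

def default_similarity(p, source, target):
    """Assumed form: 1 minus the larger completeness of p in the two KGs."""
    cs = sum(1 for i in source.values() if p in i) / len(source)
    ct = sum(1 for i in target.values() if p in i) / len(target)
    return 1.0 - max(cs, ct)

def similarity_vector(src_inst, tgt_inst, predicates, defaults):
    """Ordered similarity vector for one candidate pair.
    predicates: list of (predicate, kind) with kind "attr" or "rel";
    defaults[p]: completeness-based default used when an object is missing."""
    vec = []
    for p, kind in predicates:
        a, b = src_inst.get(p), tgt_inst.get(p)
        if a is None or b is None:
            vec.append(defaults[p])
        elif kind == "attr":
            vec.append(edit_similarity(a, b))
        else:
            vec.append(overlap_similarity(a, b))
    return vec
```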
4.3. FTRL Trainer
As described in Section 3.2, FTRLIM treats IM as a logistic regression problem, and we introduce the FTRL model [24] to solve it. FTRL is an advanced online logistic regression model with high precision and excellent sparsity. It is designed to apply logistic regression to large-scale datasets and online data streams, which is a difficult situation for conventional batch learning models. FTRL also has a fast training speed. Hence, we choose FTRL to discover matched instance pairs.
Let x be the similarity vector and y be the label of x. The FTRL model gives the predicted label ŷ of x with the sigmoid function:
ŷ = σ(w · x) = 1 / (1 + e^(−w · x)), (3)
where w is the weight vector of the FTRL model.
The loss function of the FTRL model is the binary cross-entropy loss, which is defined as:
ℓ(ŷ, y) = −y log ŷ − (1 − y) log(1 − ŷ).
The formula for updating the FTRL model's weight w at the t-th iteration is
w_{t+1} = argmin_w ( g_{1:t} · w + (1/2) Σ_{s=1}^{t} σ_s ‖w − w_s‖₂² + λ₁ ‖w‖₁ + (λ₂/2) ‖w‖₂² ),
where σ_s is defined by the learning-rate schedule such that Σ_{s=1}^{t} σ_s = 1/η_t, λ₁ and λ₂ are hyperparameters, and g_{1:t} = Σ_{s=1}^{t} g_s is the sum of gradients up to the t-th iteration.
The FTRL model adopts per-coordinate learning rates instead of a global learning rate. This approach is quite suitable for the logistic regression problem based on similarity vectors: the coordinates, or dimensions, of the similarity vector are relatively independent, so per-coordinate learning rates are more reasonable. In FTRL, the formula for updating the learning rate in dimension i at the t-th iteration is
η_{t,i} = α / (β + √(Σ_{s=1}^{t} g_{s,i}²)),
where α and β are hyperparameters.
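The update rules above can be sketched with the standard FTRL-Proximal formulation (following McMahan et al.'s widely used pseudocode; the hyperparameter values and class design are illustrative, not FTRLIM's implementation). Inputs are assumed to be rescaled to a symmetric interval, in line with the remark on symmetry in Section 4.4:

```python
import math

class FTRLProximal:
    """Minimal per-coordinate FTRL-Proximal logistic regression sketch."""
    def __init__(self, dim, alpha=0.1, beta=1.0, l1=0.1, l2=1.0):
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.z = [0.0] * dim  # per-coordinate accumulated signal
        self.n = [0.0] * dim  # per-coordinate sum of squared gradients

    def weights(self):
        # Lazy weights: the L1 term keeps small coordinates exactly 0.
        w = []
        for zi, ni in zip(self.z, self.n):
            if abs(zi) <= self.l1:
                w.append(0.0)
            else:
                w.append(-(zi - math.copysign(self.l1, zi)) /
                         ((self.beta + math.sqrt(ni)) / self.alpha + self.l2))
        return w

    def predict(self, x):
        s = sum(wi * xi for wi, xi in zip(self.weights(), x))
        return 1.0 / (1.0 + math.exp(-s))  # sigmoid of Equation (3)

    def update(self, x, y):
        w = self.weights()
        p = self.predict(x)
        for i, xi in enumerate(x):
            g = (p - y) * xi  # gradient of the log loss
            sigma = (math.sqrt(self.n[i] + g * g)
                     - math.sqrt(self.n[i])) / self.alpha
            self.z[i] += g - sigma * w[i]
            self.n[i] += g * g
```

Calling `update` once per labeled similarity vector trains the model online, which is what makes FTRL suitable for the streaming, large-scale setting described above.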
We develop the FTRL Trainer component to train the FTRL model. It first generates the training set, which is composed of instance pairs' similarity vectors and their similarity scores. The FTRL Trainer first applies an average function to the similarity vectors to obtain initial similarity scores. Then, it selects m instance pairs whose initial similarity scores are higher than a certain threshold and m pairs whose scores are lower than the threshold. These instance pairs are scored by users: the similarity scores of matched pairs are set to 1, while the others are assigned 0.
After generating the training set, the FTRL Trainer component trains the FTRL model according to the hyperparameters in the configuration file. The trained model is stored in HDFS so that it can be reused.
FTRLIM is designed with a user-feedback mechanism that allows users to correct the matching results manually. The corrected results are accepted by the FTRL Trainer to adjust the parameters of the FTRL model. Users can choose a batch of candidate instance pairs and correct their similarity scores, or pick out a particular pair to correct. When updating the FTRL model, since the number of unmatched pairs is much greater than the number of actually matched pairs, the unmatched pairs are subsampled with probability $p$ to avoid the sample imbalance problem. The probability can be configured by the user.
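The subsampling step described above can be sketched as follows (a minimal illustration; the function name and data layout are our own, and `p` stands for the user-configured keep probability for unmatched pairs):

```python
import random

def subsample_unmatched(labeled_pairs, p, seed=42):
    """Keep every matched pair (label 1), but keep an unmatched pair
    (label 0) only with probability p, to ease class imbalance."""
    rng = random.Random(seed)  # fixed seed for reproducible sampling
    sample = []
    for vector, label in labeled_pairs:
        if label == 1 or rng.random() < p:
            sample.append((vector, label))
    return sample
```

With a keep probability of, say, 0.1, roughly one in ten unmatched pairs survives, while all matched pairs are retained for the model update.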
4.4. Matcher
All candidate pairs obtain their final similarity scores in this component. The component loads a trained FTRL model and predicts similarity scores with Equation (3). The similarity scores lie in the interval $(0, 1)$. As defined in Section 3.2, only instance pairs whose scores are larger than a manually set threshold can possibly be matched. In our experiments, we set a higher threshold for candidate pairs and a lower threshold for unique pairs to make unique pairs more likely to be matched. The Matcher component selects only the one-to-one matched pairs as the final matching results. Before being sent to the FTRL model, the elements of similarity vectors are rescaled from $[0, 1]$ to $[-1, 1]$ to satisfy the symmetry of Equation (3).
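Putting the Matcher's steps together, a simplified sketch might look like the following. The greedy one-to-one selection is our own assumption, since the component's deduplication strategy is not detailed here, and a single threshold is used for brevity although candidate and unique pairs are thresholded differently:

```python
def match(pairs, score_fn, threshold):
    """pairs: list of (src_id, tgt_id, sim_vector), similarities in [0, 1].
    Returns one-to-one matches whose model score exceeds the threshold."""
    scored = []
    for src, tgt, vec in pairs:
        rescaled = [2.0 * s - 1.0 for s in vec]   # unify [0, 1] -> [-1, 1]
        score = score_fn(rescaled)                # trained model, e.g. FTRL
        if score > threshold:
            scored.append((score, src, tgt))
    scored.sort(reverse=True)                     # best-scoring pairs first
    used_src, used_tgt, result = set(), set(), []
    for score, src, tgt in scored:
        if src not in used_src and tgt not in used_tgt:
            result.append((src, tgt, score))      # greedy one-to-one selection
            used_src.add(src)
            used_tgt.add(tgt)
    return result
```

Sorting by score before the greedy pass ensures that when one source instance appears in several surviving pairs, only its highest-scoring pair enters the final result.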
4.5. Configuration
FTRLIM allows users to customize the framework through configuration files. Users can set the attributes for index generation, the properties and relations for comparison, the hyperparameters of the FTRL model, and many other detailed parameters.
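For illustration, a configuration could resemble the following Python dictionary; every field name and value here is hypothetical, since the exact schema of the configuration file is not given:

```python
# Hypothetical configuration for an FTRLIM run; the actual file format
# and field names used by the framework may differ.
config = {
    "blocker": {
        # ordered predicate list P used by MultiObJ for index construction
        "index_predicates": ["rdfs:label", "dbo:birthPlace"],
        "strategy": "KIC",            # FIC or KIC index construction
    },
    "comparator": {
        # predicates whose objects are compared to build similarity vectors
        "compare_predicates": ["rdfs:label", "dbo:abstract"],
    },
    "ftrl": {
        "alpha": 0.1, "beta": 1.0,    # per-coordinate learning-rate schedule
        "lambda1": 1.0, "lambda2": 1.0,
        "subsample_p": 0.1,           # keep probability for unmatched pairs
    },
    "matcher": {
        "candidate_threshold": 0.8,   # illustrative values only
        "unique_threshold": 0.5,
    },
}
```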
6. Discussion
In this section, we discuss the time complexity of each component of FTRLIM, focusing on the analysis of the MultiObJ blocking algorithm. We also explain how the time performance of the FTRLIM framework changes with the number of Spark cores in the distributed cluster.
For the source and target knowledge graphs to be matched, we assume that the number of instances in each graph is $N$. Instances from the two graphs form unique and candidate instance pairs via the MultiObJ blocking algorithm in the Blocker. The inputs of MultiObJ are the source knowledge graph $S$, the target knowledge graph $T$, and an ordered list of predicates $P$. The algorithm generates the candidate pair set $C$ and the unique pair set $U$.
The first phase of the MultiObJ blocking algorithm is Initialization. In this phase, the algorithm first creates and initializes the required data structures, including the tables $K$ and $B$. For each instance, the algorithm records the initial index of the instance in the table $K$ and records the instance corresponding to the initial index in the table $B$. Since the algorithm needs to traverse all instances in this phase, the time complexity is $O(N)$.
The second phase of MultiObJ is Indexing, which includes Explicit Indexing, Unique Pair Generation, and Index Inference. The algorithm goes through the loop and constructs indexes for instances according to each predicate in the input $P$ in turn. Let $l$ be the number of loops the algorithm has reached, where $l$ ranges from 1 to $|P|$. As mentioned in Section 4.2, the objects corresponding to a predicate may be missing, and the degree of missingness is described by the completeness rate. Let the average completeness rate over all predicates in $P$ of the source and target knowledge graphs be $c$, whose range is $(0, 1]$. For an instance in the $l$-th loop, we use $E^{ex}_l$ and $E^{in}_l$ to represent the expectation of the number of indexes obtained in the Explicit Indexing phase and in the Index Inference phase, respectively, and we use $E_l$ to denote the expectation of the number of indexes obtained during the entire Indexing phase. $E^{ex}_0$ and $E_0$ are assigned 1, since the only index of an instance before the Indexing phase is the initial index. The number of distinct indexes is the same as the number of blocks; its expectation in the $l$-th loop is denoted as $D_l$. In the following analysis, we first give the time complexity of each phase in terms of these expectations and then give the final time complexity by deducing the relationship between them.
In the Explicit Indexing phase, line 8 of Algorithm 1 divides all instances into four sets, $S_v$, $S_n$, $T_v$, and $T_n$, depending on whether the corresponding objects of the instance under predicate $p$ are missing. The subscript $v$ indicates that the set contains instances whose corresponding objects are not missing, while the subscript $n$ indicates the opposite condition. Therefore, for the expected number of elements in each set, we have $|S_v| = |T_v| = cN$ and $|S_n| = |T_n| = (1-c)N$. This step is completed by traversing all the instances, so its time complexity is $O(N)$. The number of executions of the loop at line 9 is $2cN$. The algorithm extracts features of objects at line 10, and the time complexity of this step is $O(1)$, regardless of the index construction strategy. For the FIC strategy, the object feature of an instance is exactly the object under $p$, so the complexity is $O(1)$. For the KIC strategy, we need to count the word frequency over all objects under $p$ in the two knowledge graphs and store the results in HDFS. The time complexity of these statistics is $O(N)$. However, in practice, the statistics should be carried out in preprocessing. If we assume that the average length of each word is 8 letters, storing a letter requires 2 bytes, and storing a word frequency requires 4 bytes, then storing the word frequencies of $10^6$ words requires only about 20 MB. For Spark cores, the time consumed in reading such a word frequency table from HDFS is negligible. Therefore, after the word frequency table is constructed, the time complexity of looking up a word frequency is $O(1)$. In Algorithm 2, experience has shown that the number of construction results generally does not exceed 5, so it can be regarded as a constant, which means the time complexity of using the KIC strategy to generate object features is also $O(1)$. For each instance, the number of object features in each loop is denoted as $v$. Line 11 of MultiObJ constructs $vE_{l-1}$ indexes for each such instance, so the time complexity of this step is $O(vE_{l-1})$. We thus have

$$E^{ex}_l = v E_{l-1} \quad (7)$$

The update of the inverted index table also needs to traverse the constructed indexes, so its time complexity is likewise $O(vE_{l-1})$. Therefore, the time complexity of the Explicit Indexing phase is $O(cNvE_{l-1})$.
In the Unique Pair Generation phase, MultiObJ needs to traverse the inverted index table to identify unique instance pairs. The keys of this dictionary are the distinct indexes constructed in the Explicit Indexing phase, the minimal number of which is 1. The maximum number of keys is $2cNvE_{l-1}$; this situation occurs when all the indexes constructed in the Explicit Indexing phase are different from each other. On average, the number of keys is the expected number of distinct indexes $D_l$, with

$$1 \le D_l \le 2cNvE_{l-1} \quad (8)$$

Therefore, the time complexity of Unique Pair Generation is $O(D_l)$.
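A minimal sketch of this phase (our own illustration, assuming that a block yields a unique pair exactly when it holds one instance from each knowledge graph):

```python
def unique_pairs(block_table):
    """block_table: dict mapping an index (block key) to a pair of lists
    ([source instance ids], [target instance ids]) built in Explicit Indexing.
    A block yields a unique pair when it holds exactly one instance per graph."""
    return {(srcs[0], tgts[0])
            for srcs, tgts in block_table.values()
            if len(srcs) == 1 and len(tgts) == 1}
```

Traversing the table once touches every key, which is where the $O(D_l)$ cost of this phase comes from.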
The Index Inference phase of MultiObJ infers indexes for instances whose objects under predicates are missing. In this phase, the algorithm searches for all suitable instances $j$ for each instance $i$, where $i$ and $j$ have the same index in loop $l-1$ and the objects of $j$ under the current predicate are not missing. The indexes of instance $j$ in the $l$-th loop become one part of the index set of instance $i$. The other part of the index set is formed by concatenating each index of instance $i$ in the previous loop with NULL. Instance $i$ obtained $E_{l-1}$ indexes in the previous loop, and each index corresponds to multiple eligible instances $j$. In loop $l-1$, the instances in a knowledge graph generate $NE_{l-1}$ indexes, among which the number of distinct indexes is $D_{l-1}$. According to Equation (8), the average number of repetitions of each index is $NE_{l-1}/D_{l-1}$, which is also the number of eligible instances $j$. Instance $j$ obtains $E^{ex}_l$ indexes in the $l$-th loop, so the number of indexes obtained in the Index Inference phase by each instance with missing objects is

$$E^{in}_l = E_{l-1} + E_{l-1} \cdot \frac{NE_{l-1}}{D_{l-1}} \cdot E^{ex}_l \quad (9)$$

There are $(1-c)N$ instances in each of $S_n$ and $T_n$, so the time complexity of the Index Inference phase is $O\big((1-c)NE^{in}_l\big)$.
The relationship between the aforementioned expectations is deduced as follows. In loop $l$, each instance obtains $E_l$ indexes on average in the Indexing phase, which consist of the indexes constructed in the Explicit Indexing phase and the indexes constructed in the Index Inference phase. Therefore,

$$E_l = c\,E^{ex}_l + (1-c)\,E^{in}_l \quad (10)$$

According to Equations (7)–(10), the recurrence relation of $E_l$ can be derived as

$$E_l = (cv + 1 - c)\,E_{l-1} + (1-c)\,\frac{vNE_{l-1}^3}{D_{l-1}} \quad (11)$$
Recall that $E^{ex}_0 = E_0 = 1$; then $E_1$ and $E_2$ follow directly from Equation (11). It can be seen that the closer the average predicate completeness rate $c$ is to 1, the smaller the high-order terms in $E_l$ are, and the smaller the algorithm overhead is. When the number of loops $l$ reaches 3, the exponent of $v$ in $E_l$ becomes so high that the influence of the constant $v$ on the complexity of the algorithm can no longer be ignored. Therefore, it is not recommended to construct indexes with more than 3 predicates. When $c$ is close to 1, the $E_l$ can be regarded as constants, and the time complexity of the Indexing phase is $O(N)$.
The final phase of MultiObJ is Candidate Pair Generation. The number of keys in the dictionary $B$ is $D_{|P|}$, and the number of instances from different knowledge graphs under each key is $NE_{|P|}/D_{|P|}$ on average. The time complexity of this phase is therefore $O\!\left(N^2E_{|P|}^2/D_{|P|}\right)$.
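As a toy illustration of this phase, candidate pairs can be generated by grouping the instances of both graphs by their indexes and pairing across graphs within each block; the nested loop over each block's members is what produces the high-order term in the blocking cost. The function and variable names are our own:

```python
from collections import defaultdict

def candidate_pairs(src_indexes, tgt_indexes):
    """src_indexes/tgt_indexes: dict mapping instance id -> list of indexes.
    Returns cross-graph pairs that share at least one index, deduplicated."""
    blocks = defaultdict(lambda: ([], []))   # index -> ([src ids], [tgt ids])
    for inst, idxs in src_indexes.items():
        for idx in idxs:
            blocks[idx][0].append(inst)
    for inst, idxs in tgt_indexes.items():
        for idx in idxs:
            blocks[idx][1].append(inst)
    pairs = set()
    for srcs, tgts in blocks.values():       # per-block work: |srcs| * |tgts|
        for s in srcs:
            for t in tgts:
                pairs.add((s, t))
    return pairs
```

More discriminative indexes mean more, smaller blocks, so the per-block products shrink even though the number of blocks grows.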
From the above analysis, we know that the time complexity of MultiObJ is $O\!\left(N + N^2E^2/D\right)$, where $E = E_{|P|}$ is the expectation of the number of indexes constructed for an instance with reference to all the predicates in $P$, and $D = D_{|P|}$ is the expectation of the number of distinct indexes among these indexes. MultiObJ leverages the joint features of multiple objects to make the indexes of instances more discriminative. In this way, the algorithm increases $D$ to reduce the high-order term in the time complexity. In particular, the time complexity of MultiObJ is $O(N)$ when $D$ reaches the order of $NE^2$.
After all pairs to be matched are generated, the framework generates a similarity vector for each pair. For every instance pair, FTRLIM sequentially compares the similarity of related objects according to the predicates in the specified comparison predicate set. FTRLIM generates instance pairs through the Blocker, and the total number of instance pairs does not exceed $O(N^2E^2/D)$, because the unique instance pairs are a subset of the candidate instance pairs. Since the number of comparison predicates can be regarded as a constant and each comparison takes constant time on average, the time complexity of the comparison is $O(N^2E^2/D)$. The generated similarity vectors are then judged by the FTRL model. Since the process of training the FTRL model is non-distributed and involves manual operations, we do not consider its impact on the overall running time. The trained FTRL model accepts similarity vectors as inputs and calculates similarity scores for them. Finally, the instance pairs with scores higher than the threshold are filtered and deduplicated to become the final matching results. The time cost of these two processes is proportional to the number of instance pairs, so the time complexity is $O(N^2E^2/D)$. Thus, from the generation of similarity vectors to the generation of the matching results, the time complexity of the framework is $O(N^2E^2/D)$.
In summary, the time complexity of the FTRLIM framework for the complete instance matching task is $O\!\left(N + N^2E^2/D\right)$, where $E$ is the expectation of the number of indexes constructed for an instance with reference to all the predicates in $P$, and $D$ is the expectation of the number of distinct indexes among these indexes, $P$ being the list of predicates specified for constructing inverted indexes for instances. This complexity simplifies to $O(N)$ when $D$ reaches the order of $NE^2$. FTRLIM is deployed on a distributed Spark cluster. One entire matching process is divided into multiple tasks, which are completed by the Spark cores in a distributed manner. Theoretically, increasing the number of Spark cores reduces the computation time for matching. The result of the analysis shows that FTRLIM has approximately linear time complexity in this regime. When encountering large-scale data that is difficult to handle, increasing the number of Spark cores in the cluster improves the matching efficiency.