DOI: https://doi.org/10.1145/3178876.3186128
WWW '18: Proceedings of The Web Conference 2018, Lyon, France, April 2018
Network alignment or graph matching is the classic problem of finding matching vertices between two graphs with applications in network de-anonymization and bioinformatics. There exist a wide variety of algorithms for it, but a challenging scenario for all of the algorithms is aligning two networks without any information about which nodes might be good matches. In this case, the vast majority of principled algorithms demand quadratic memory in the size of the graphs. We show that one such method—the recently proposed and theoretically grounded EigenAlign algorithm—admits a novel implementation which requires memory that is linear in the size of the graphs. The key step to this insight is identifying low-rank structure in the node-similarity matrix used by EigenAlign for determining matches. With an exact, closed-form low-rank structure, we then solve a maximum weight bipartite matching problem on that low-rank matrix to produce the matching between the graphs. For this task, we show a new, a-posteriori, approximation bound for a simple algorithm to approximate a maximum weight bipartite matching problem on a low-rank matrix. The combination of our two new methods then enables us to tackle much larger network alignment problems than previously possible and to do so quickly. Problems that take hours with existing methods take only seconds with our new algorithm. We thoroughly validate our low-rank algorithm against the original EigenAlign approach. We also compare a variety of existing algorithms on problems in bioinformatics and social networks. Our approach can also be combined with existing algorithms to improve their performance and speed.
ACM Reference Format:
Huda Nassar, Nate Veldt, Shahin Mohammadi, Ananth Grama, and David F. Gleich. 2018. Low Rank Spectral Network Alignment. In WWW 2018: The 2018 Web Conference, April 23–27, 2018, Lyon, France. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3178876.3186128
Network alignment is the problem of pairing nodes across two different graphs in a way that preserves edge structure and highlights similarities between the networks. The node pairings can either be one-to-one or many-to-many. While the methods we propose are amenable to both settings with some modification, we focus on the one-to-one case as it has the most extensive literature. Applications of network alignment include (i) finding similar nodes in social networks, which uncovers information about one or both of the paired nodes, and can help with tailoring advertisements and suggesting activities for similar users in a network; (ii) social-network de-anonymization [10]; and (iii) pattern matching in graphs [3]. One very popular example of this problem is the alignment of protein-protein interaction networks in biology [6, 14, 27]. Often in biology one can extract valuable knowledge about proteins for which little information is known by aligning a protein network with another protein network that has been studied more. By doing so one can draw conclusions about proteins in the first network by understanding their similarities to proteins in the second.
There are two major approaches to network alignment problems [1]: local network alignment, where the goal is to find local regions of the graph that are similar to any given node, and global network alignment, where the goal is to understand how two large graphs would align to each other. Many approaches to network alignment rely on solving an optimization problem to compute what amounts to a topological similarity score between pairs of nodes in the two networks. Here, we focus on global alignment with one-to-one matches between the two graphs.
Some applications also come with prior information about which nodes in one network may be good matches for nodes of another network, which implicitly imposes a restriction on the number of similarity scores that must be computed and stored in practice [2]. However, for problems that lack this prior, the data requirement for storing the similarity scores is quadratic, which severely limits the scalability of this class of approaches. For instance, methods such as the Lagrangian relaxation method of Klau et al. [7] require at least quadratic memory. There do exist memory-scalable heuristics for solving network alignment problems with no prior, including the GHOST procedure of Patro et al. [24] and the GRAAL algorithm of Kuchaiev et al. [12] and its variants. However, these usually involve cubic or worse computation in terms of vertex neighborhoods in the graph (e.g., enumeration of all 5-node graphlets within a local region).
One principled approach that avoids the quadratic memory requirement is the Network Similarity Decomposition (NSD) [8, 9, 22], which provides a useful low-rank decomposition of a specific similarity matrix based on the IsoRank method [27]. This method enables alignments to be computed between extremely large networks. However, there have been many improvements to network alignment methods since the publication of IsoRank.
A recent innovation is an eigenvector-based method called EigenAlign. The EigenAlign method uses the dominant eigenvector of a matrix related to the product-graph between the two networks in order to estimate the similarity. The eigenvector information is rounded into a matching between the vertices of the graphs by solving a maximum-weight bipartite matching problem on a dense bipartite graph [5]. The IsoRank method is also based on eigenvectors; more specifically, it uses the PageRank vector of the product-graph of the two networks for the same purpose [27]. In contrast, a key innovation of EigenAlign is that it explicitly models nodes that may not have a match in the other network. In this way, it is able to provably align many simple graph models such as Erdős-Rényi when the graphs do not have too much noise. This gives it a firm theoretical basis, although it still suffers from the quadratic memory requirement.
In our manuscript, we highlight a number of innovations that enable the EigenAlign methodology to work without the quadratic memory requirement. We first show that the EigenAlign solution can be expressed via low-rank factors, and we can compute these low-rank factors exactly and explicitly using a simple procedure. A challenge in using the low-rank information provided by our new method is that there are only a few ideas on how to use the low-rank structure of the similarity scores in the matching step [15, 22]. We contribute a new analysis of a simple idea to use the low-rank structure that gives a computable a-posteriori approximation guarantee. In practice, this approximation guarantee is extremely good: around 1.1. Such a procedure should enable further low-rank applications beyond just network alignment.
We now review the state of network alignment algorithms and our specific setting and objective. A helpful illustration is shown in Figure 1.
For the network alignment problem, we are given two graphs GA and GB with adjacency matrices A and B. The goal is to produce a one-to-one mapping between nodes of GA and GB that preserves topological similarities between the networks [3]. In some cases we additionally receive information about which nodes in one network can be paired with nodes in the other. This additional information is presented in the form of a bipartite graph whose edge weights are stored in a matrix L; if Luv > 0, this indicates outside evidence that node u in GA should be matched to node v in GB . We call this outside evidence a prior on the alignment. When a prior is present, the prior and topological information are taken together to determine an alignment.
More formally, we seek a binary matrix P that encodes a matching between the nodes of the networks and maximizes one of a few possible objective functions discussed below. The matrix P encodes a matching when it satisfies the constraints $\mathrm{P}\mathbf{e} \le \mathbf{e}$, $\mathrm{P}^T\mathbf{e} \le \mathbf{e}$, and $P_{ij} \in \lbrace 0, 1\rbrace$, where $\mathbf{e}$ is the all-ones vector.
The classic formulation of the problem seeks a matrix P that maximizes the number of overlapping edges between GA and GB, i.e., the number of adjacent node pairs $(i_A, j_A)$ in GA that are mapped to an adjacent node pair $(i^{\prime}_B, j^{\prime}_B)$ in GB. This results in the following integer quadratic program:
$\max_{\mathrm{P}} \ \sum_{i_A, j_A} \sum_{i^{\prime}_B, j^{\prime}_B} A_{i_A j_A}\, B_{i^{\prime}_B j^{\prime}_B}\, P_{i_A i^{\prime}_B}\, P_{j_A j^{\prime}_B} = \operatorname{trace}(\mathrm{A}\mathrm{P}\mathrm{B}\mathrm{P}^T)$ subject to $\mathrm{P}$ encoding a matching.
One of the drawbacks of the previous objective functions is that there is no downside to matches that do not produce an overlap, i.e., edges in GA that are mapped to non-edges in GB or vice versa. Neither do these objective functions consider the case where non-edges in GA are mapped to non-edges in GB. The first problem was recognized in [25], which proposed an SDP-based method to minimize the number of conflicting matches. More recently, the EigenAlign objective [5] included explicit terms for three cases: overlaps, non-informative matches, and conflicts; see Figure 1. The alignment score corresponding to P in this case is
$\text{AlignmentScore}(\mathrm{P}) = s_O\,(\#\text{overlaps}) + s_N\,(\#\text{non-informative matches}) + s_C\,(\#\text{conflicts}), \qquad (2)$
where the weights satisfy $s_O > s_N > s_C > 0$.
This objective can be expressed formally by first introducing a massive alignment matrix M defined as follows: for all pairs of nodes $i_A, j_A$ in GA and all pairs $i^{\prime}_B, j^{\prime}_B$ in GB, if $\mathrm{P}(i_A, i^{\prime}_B) = 1$ and $\mathrm{P}(j_A, j^{\prime}_B) = 1$, then
$\mathrm{M}[(i_A, i^{\prime}_B), (j_A, j^{\prime}_B)] = \begin{cases} s_O & \text{if } (i_A, j_A) \text{ is an edge of } G_A \text{ and } (i^{\prime}_B, j^{\prime}_B) \text{ is an edge of } G_B \text{ (an overlap)} \\ s_N & \text{if neither pair is an edge (a non-informative match)} \\ s_C & \text{otherwise (a conflict).} \end{cases}$
We are abusing notation a bit in this definition and using the pairs $(i_A, i^{\prime}_B)$ to index the rows and columns of this matrix. For a straightforward, canonical (column-major) ordering of these pairs, the matrix M can be rewritten in terms of the adjacency matrices A and B:
$\mathrm{M} = s_O (\mathrm{B} \otimes \mathrm{A}) + s_C (\mathrm{B} \otimes \bar{\mathrm{A}} + \bar{\mathrm{B}} \otimes \mathrm{A}) + s_N (\bar{\mathrm{B}} \otimes \bar{\mathrm{A}}),$
where $\otimes$ is the Kronecker product and $\bar{\mathrm{A}} = \mathbf{e}\mathbf{e}^T - \mathrm{A}$, $\bar{\mathrm{B}} = \mathbf{e}\mathbf{e}^T - \mathrm{B}$ are the complements of the adjacency matrices.
Maximizing the alignment score (2) is then equivalent to the following quadratic assignment problem:
$\max_{\mathbf{p}} \ \mathbf{p}^T \mathrm{M} \mathbf{p}$ subject to $\mathbf{p} = \operatorname{vec}(\mathrm{P})$ with $\mathrm{P}$ encoding a matching.
An empirically and theoretically successful method for optimizing this objective is to solve an eigenvector equation instead of the quadratic program. This is exactly the approach of EigenAlign, which computes network alignments using the following two steps: (1) compute the dominant eigenvector $\mathbf{x}$ of $\mathrm{M}$ and reshape it into a similarity matrix $\mathrm{X}$ whose entries score pairs of nodes across the two graphs; (2) round $\mathrm{X}$ into a one-to-one alignment by solving a maximum-weight bipartite matching problem on $\mathrm{X}$.
Our contribution. In our work we extend the foundation laid by EigenAlign by considering improvements to both steps. We first show that the similarity matrix X can be represented exactly through a low-rank factorization, which allows us to avoid the quadratic memory requirement of EigenAlign. We then present several new fast techniques for bipartite matching problems on low-rank matrices. Together these improvements yield a low-rank EigenAlign algorithm that is far more scalable in practice.
Our work shares a number of similarities with the Network Similarity Decomposition (NSD) [8], a technique based on a low-rank factorization of a different similarity matrix, the matrix used by the IsoRank algorithm [27]. The authors of [8] show that this decomposition can be obtained by performing calculations separately on the two graphs, which significantly speeds up the calculation of similarity scores between nodes. Another procedure designed for aligning networks without prior information is the Graph Alignment tool (GRAAL) [12]. GRAAL computes the so-called graphlet degree signature for each node, a vector that generalizes node degree and represents the topological structure of a node's local neighborhood. The method measures distances between graphlet degree signatures to obtain similarity scores, and then uses a greedy seed-and-extend procedure for matching nodes across two networks based on the scores. A number of related algorithms extend the original technique by considering other measures of topological similarity as well as different approaches to rounding similarity scores into an alignment [13, 17, 18, 20]. The seed-and-extend alignment procedure was also employed by the GHOST algorithm [24], which computes topological similarity scores based on a novel spectral signature for each node. Recently, [21] introduced the notion of finding an alignment that maximizes the number of preserved higher-order structures (such as triangles) across networks. This results in an integer programming problem that can be approximated by the Triangular Alignment algorithm (TAME), which obtains similarity scores by solving a tensor eigenvalue problem that relaxes the original objective.
Alternative approaches to improve network alignment include active methods that allow users to select matches from a host of potential near equal matches [16].
The first step of the EigenAlign algorithm is to compute the dominant eigenvector of the symmetric matrix M. Feizi et al. suggest obtaining a similarity matrix X by first forming M, performing a power iteration on this matrix, and reshaping the final output eigenvector x into X [5]. Because of the Kronecker structure in M, this can equivalently be formulated directly as the matrix X that satisfies
$\lambda \mathrm{X} = (s_O + s_N - 2 s_C)\,\mathrm{A}\mathrm{X}\mathrm{B} + (s_C - s_N)(\mathrm{A}\mathrm{X}\mathbf{e}\mathbf{e}^T + \mathbf{e}\mathbf{e}^T\mathrm{X}\mathrm{B}) + s_N\,\mathbf{e}\mathbf{e}^T\mathrm{X}\mathbf{e}\mathbf{e}^T. \qquad (4)$
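For intuition, the power iteration on equation (4) can be implemented directly with quadratic memory. The following numpy sketch is our own illustration (the function name and interface are not from the paper); it assumes dense 0/1 adjacency matrices and is useful only for validating a low-rank implementation on small graphs:

```python
import numpy as np

def eigenalign_similarity_dense(A, B, sO, sN, sC, iters=8):
    """Quadratic-memory reference for the fixed point in (4): repeatedly
    apply the right-hand side and renormalize (power iteration)."""
    nA, nB = A.shape[0], B.shape[0]
    c1, c2, c3 = sO + sN - 2 * sC, sC - sN, sN   # coefficients from (4)
    EA, EB = np.ones((nA, nA)), np.ones((nB, nB))
    X = np.ones((nA, nB))                         # rank-1 start: u = v = e
    for _ in range(iters):
        X = c1 * (A @ X @ B) + c2 * (A @ X @ EB + EA @ X @ B) \
            + c3 * (EA @ X @ EB)
        X /= np.linalg.norm(X)                    # power-method scaling
    return X
```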
Our first major contribution is to show that if the matrix X is estimated with the power-method starting from a rank 1 matrix, then the kth iteration of the power method results in a rank k + 1 matrix that we can explicitly and exactly compute.
In the matrix form of problem (4), one step of the power method corresponds to the iteration:
$\mathrm{X}_{k+1} = \frac{1}{\lambda_k}\left[(s_O + s_N - 2 s_C)\,\mathrm{A}\mathrm{X}_k\mathrm{B} + (s_C - s_N)(\mathrm{A}\mathrm{X}_k\mathbf{e}\mathbf{e}^T + \mathbf{e}\mathbf{e}^T\mathrm{X}_k\mathrm{B}) + s_N\,\mathbf{e}\mathbf{e}^T\mathrm{X}_k\mathbf{e}\mathbf{e}^T\right], \qquad (5)$
where $\lambda_k$ is a normalization constant. Expanding this update symbolically from $\mathrm{X}_0 = \mathbf{u}\mathbf{v}^T$ shows that each iterate factors as $\mathrm{X}_k = \mathrm{U}_k \mathrm{V}_k^T$, because each of the four terms maps a low-rank iterate to another low-rank form; the number of columns in the factors, however, grows by a factor of four at each step.
This form of the factorization is not yet helpful, because the matrix $\mathrm{U}_k$ is of dimension $n_A \times 4^k$. To show that $\mathrm{X}_k$ is indeed a rank-$(k+1)$ matrix, we show that $\mathrm{V}_k = \mathrm{R}_k \mathrm{T}_k$ and $\mathrm{U}_k = \mathrm{S}_k \mathrm{C}_k$, where $\mathrm{S}_k = [\mathrm{A}^k \mathbf{u}, \mathrm{A}^{k-1}\mathbf{e}, \ldots, \mathrm{A}\mathbf{e}, \mathbf{e}]$ and $\mathrm{R}_k = [\mathrm{B}^k \mathbf{v}, \mathrm{B}^{k-1}\mathbf{e}, \ldots, \mathrm{B}\mathbf{e}, \mathbf{e}]$ have only $k+1$ columns each, and $\mathrm{C}_k$ and $\mathrm{T}_k$ are coefficient matrices with $4^k$ columns.
To complete our derivation, we show $\mathrm{U}_k = \mathrm{S}_k \mathrm{C}_k$ again using induction. The base case $k = 0$ is immediate from a simple expansion of the initial definitions, so assume the result holds up to integer $k$. The inductive step then follows by applying update (5) to $\mathrm{X}_k = \mathrm{S}_k \mathrm{C}_k \mathrm{T}_k^T \mathrm{R}_k^T$: each term of the update either multiplies the left factor by $\mathrm{A}$ or replaces it with $\mathbf{e}$, so every resulting column lies in the span of $\mathrm{S}_{k+1}$.
While this four-factor decomposition is useful for revealing the rank of $\mathrm{X}_k$, we do not wish to work with the matrices $\mathrm{C}_k$ and $\mathrm{T}_k$ in practice since each has $4^k$ columns. We now show that their product $\mathrm{C}_k \mathrm{T}_k^T$ yields a simple-to-compute matrix $\mathrm{W}_k$ of size $(k+1) \times (k+1)$, giving us a three-factor decomposition (3FD): $\mathrm{X}_k = \mathrm{S}_k \mathrm{W}_k \mathrm{R}_k^T$.
This decomposition is a step closer to our final goal, but it suffers from poor scaling of the numbers in the factors. We remedy this by introducing diagonal scaling matrices, which yields our final well-scaled three-factor decomposition of $\mathrm{X}_k$, presented as a summarizing theorem:
Theorem 3.1. If $\mathrm{X}_0 = \mathbf{u}\mathbf{v}^T$ for vectors $\mathbf{u} \in \mathbb{R}^{n_A \times 1}$ and $\mathbf{v} \in \mathbb{R}^{n_B \times 1}$, then the $k$th iteration of update (5) permits the low-rank factorization (up to normalization) $\mathrm{X}_k = \tilde{\mathrm{U}}_k \tilde{\mathrm{W}}_k \tilde{\mathrm{V}}_k^T$, where $\tilde{\mathrm{U}}_k = \mathrm{S}_k \mathrm{D}_u^{-1}$, $\tilde{\mathrm{V}}_k = \mathrm{R}_k \mathrm{D}_v^{-1}$, and $\tilde{\mathrm{W}}_k = \mathrm{D}_u \mathrm{W}_k \mathrm{D}_v$ for diagonal scaling matrices $\mathrm{D}_u$ and $\mathrm{D}_v$.
The diagonal matrices in Theorem 3.1 are designed specifically to satisfy $\mathrm{S}_k \mathrm{D}_u^{-1} = \tilde{\mathrm{U}}_k$ and $\mathrm{R}_k \mathrm{D}_v^{-1} = \tilde{\mathrm{V}}_k$, so the equivalence between the scaled and unscaled three-factor decompositions is straightforward. Note that the result is still unnormalized; in practice, we can easily normalize by scaling the matrix $\tilde{\mathrm{W}}_k$ as we see fit.
Note that when computing this decomposition in practice, we do not simply construct S, R, and W and then scale with Du and Dv . Instead, we form the scaled factors recursively by noting the similarities between each factor at step k and the corresponding factor at step k + 1. A pseudo-code for our implementation that directly computes these is shown in Figure 2.
As we shall see in the next section, we would ultimately like to express Xk in terms of just a left and a right low-rank factor in order to apply our techniques for low-rank bipartite matching. It is preferable for our purposes to produce two factors that have roughly equal scaling, so we accomplish this by factorizing $\tilde{\mathrm{W}}_k$ with an SVD and splitting the pieces of $\tilde{\mathrm{W}}_k$ into the left and right terms. The last steps of Figure 2 accomplish this goal.
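Figure 2 is not reproduced here, so as a rough illustration we give a numpy sketch of one way to maintain low-rank factors through update (5). It is not the authors' pseudocode: instead of the explicit recursions of Theorem 3.1, it compresses the factors with a QR/SVD step at each iteration, which also handles the normalization and produces the final left/right split described above. It starts from u = v = e as in our experiments:

```python
import numpy as np

def lowrank_factors(A, B, sO, sN, sC, k=8):
    """Sketch: maintain X_k = U @ V.T through update (5) in factored form."""
    nA, nB = A.shape[0], B.shape[0]
    c1, c2, c3 = sO + sN - 2 * sC, sC - sN, sN
    eA, eB = np.ones((nA, 1)), np.ones((nB, 1))
    U, V = eA.copy(), eB.copy()                  # X_0 = e e^T
    for _ in range(k):
        AU, BV = A @ U, B @ V
        wu, wv = U.T @ eA, V.T @ eB              # factor column sums
        # X_{k+1} = c1 (AU)(BV)^T + c2 (AU wv) e^T + c2 e (BV wu)^T
        #           + c3 (wu . wv) e e^T, before normalization
        U = np.hstack([AU, c2 * (AU @ wv) + c3 * (wu.T @ wv).item() * eA, eA])
        V = np.hstack([c1 * BV, eB, c2 * (BV @ wu)])
        # Compress back to minimal rank; scale by the top singular value.
        Qu, Ru = np.linalg.qr(U)
        Qv, Rv = np.linalg.qr(V)
        P, s, QT = np.linalg.svd(Ru @ Rv.T)
        keep = s > s[0] * 1e-12
        U = Qu @ P[:, keep] * (s[keep] / s[0])   # weights folded into U
        V = Qv @ QT.T[:, keep]
    return U, V                                  # X_k is approx. U @ V.T
```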
In this section, we consider the problem of solving a maximum-weight bipartite matching problem on a low-rank matrix with a useful a-posteriori approximation guarantee. In our network alignment routine, this algorithm is applied to the low-rank matrix from Figure 2. Here, however, we proceed in terms of a general matrix Y with low-rank factors $\mathrm{Y} = \mathrm{U}\mathrm{V}^T$. The matrix Y represents the edge weights of a bipartite graph, so the max-weight matching problem is $\max_{\mathrm{M}} \ \mathrm{M} \bullet \mathrm{Y} = \sum_{i,j} M_{ij} Y_{ij}$ subject to $\mathrm{M}$ encoding a matching, where $\bullet$ denotes the matrix inner product.
We begin by considering optimal matchings for a rank-1 matrix Y = uv T where $\mathrm{u}, \mathrm{v}\in \mathbb {R}^n$ (these results are easily adapted for vectors of different lengths).
Case 1: $\mathrm{u}, \mathrm{v}\in \mathbb {R}^n_{\ge 0}$ or $\mathrm{u}, \mathrm{v}\in \mathbb {R}^n_{\le 0}$ . If u and v contain only non-negative entries, or both contain only non-positive entries, the procedure for finding the optimal matching is the same: we order the entries of both vectors by magnitude and pair up elements as they appear in the sorted list. If any pair contributes a 0 weight, we do not bother to match that pair since it doesn't improve the overall matching score. The optimality of this matching for these special cases can be seen as a direct result of the rearrangement inequality.
Case 2: General $\mathrm{u}, \mathrm{v}\in \mathbb {R}^n$. If u and v have entries that can be positive, negative, or zero, we require a slightly more sophisticated method for finding the optimal matching on Y. In this case, define $\tilde{\mathrm{Y}}$ to be the matrix obtained by copying Y and zeroing out all negative entries. An optimal matching of Y would never pair elements giving a negative weight, so the optimal matching for $\tilde{\mathrm{Y}}$ is the same as for Y. Now let $\mathbf{u}_+$ and $\mathbf{u}_-$ be the vectors that contain the strictly positive and strictly negative elements of u respectively (with zeros elsewhere), and define $\mathbf{v}_+$ and $\mathbf{v}_-$ similarly for v. Then $\tilde{\mathrm{Y}} = \tilde{\mathrm{Y}}_1 + \tilde{\mathrm{Y}}_2$, where $\tilde{\mathrm{Y}}_1 = \mathbf{u}_+\mathbf{v}_+^T$ and $\tilde{\mathrm{Y}}_2 = \mathbf{u}_-\mathbf{v}_-^T$, and we let $\mathrm{M}_1$ and $\mathrm{M}_2$ be the optimal matchings for $\tilde{\mathrm{Y}}_1$ and $\tilde{\mathrm{Y}}_2$ computed by the Case 1 procedure.
The set of nodes matched by $\mathrm{M}_1$ will be disjoint from the set of nodes matched by $\mathrm{M}_2$. The matching $\tilde{\mathrm{M}}$ defined by combining these two matchings will be optimal for Y.
We will prove by contradiction that there are no conflicts between $\mathrm{M}_1$ and $\mathrm{M}_2$. Assume that $\mathrm{M}_1$ contains the match (i, j) and $\mathrm{M}_2$ contains a conflicting match (i, k). Since $\mathrm{M}_1$ contains the match (i, j), $\tilde{\mathrm{Y}}_1(i,j)$ must be nonzero, implying that $u_+(i)$ and $v_+(j)$ are both positive. Similarly, $\mathrm{M}_2$ contains the pair (i, k), so $u_-(i)$ and $v_-(k)$ are both negative. This is a contradiction, since at least one of $u_+(i)$ and $u_-(i)$ must be zero.
It remains to show that $\tilde{\mathrm{M}}$ is an optimal matching for Y. If this were not the case, there would exist some matching $\mathrm{M}$ such that $\mathrm{M} \bullet \tilde{\mathrm{Y}} > \tilde{\mathrm{M}} \bullet \tilde{\mathrm{Y}}$. But for any matching $\mathrm{M}$, since $\tilde{\mathrm{Y}}_1$ and $\tilde{\mathrm{Y}}_2$ have disjoint supports and $\mathrm{M}_1$, $\mathrm{M}_2$ are optimal for them,
$\mathrm{M} \bullet \tilde{\mathrm{Y}} = \mathrm{M} \bullet \tilde{\mathrm{Y}}_1 + \mathrm{M} \bullet \tilde{\mathrm{Y}}_2 \le \mathrm{M}_1 \bullet \tilde{\mathrm{Y}}_1 + \mathrm{M}_2 \bullet \tilde{\mathrm{Y}}_2 = \tilde{\mathrm{M}} \bullet \tilde{\mathrm{Y}},$
a contradiction, which completes the argument.
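Both rank-1 cases admit a compact implementation. A minimal sketch (the function name and interface are ours, not from the paper):

```python
import numpy as np

def rank1_matching(u, v):
    """Optimal matching for Y = u v^T per Section 4.1, as (i, j) pairs.
    Positives pair with positives (largest first) and negatives with
    negatives (most negative first): exactly the two disjoint matchings
    M_1 and M_2 combined. Zero-weight pairs are skipped."""
    pairs = []
    for sign in (1.0, -1.0):
        iu = [i for i in np.argsort(-sign * u) if sign * u[i] > 0]
        iv = [j for j in np.argsort(-sign * v) if sign * v[j] > 0]
        pairs += list(zip(iu, iv))   # zip drops unmatchable leftovers
    return pairs
```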
Now we address the problem of finding a good matching for a matrix $\mathrm{Y} = \mathrm{U}\mathrm{V}^T$, where $\mathrm{Y}\in \mathbb{R}^{m\times n}$, $\mathrm{U}\in \mathbb{R}^{m\times k}$, and $\mathrm{V}\in \mathbb{R}^{n\times k}$. Let $\mathbf{u}_i$ and $\mathbf{v}_i$ be the $i$th columns of U and V, and let $\mathrm{Y}_i = \mathbf{u}_i \mathbf{v}_i^T$; then $\mathrm{Y} = \sum_{i=1}^{k} \mathrm{Y}_i$.
We can find the optimal matching on each $\mathrm{Y}_i$ using the results from Section 4.1. Let $\mathrm{M}_i$ be the matching matrix corresponding to $\mathrm{Y}_i$, and let $\mathrm{M}^*$ be a matching matrix that achieves an optimal maximum weight on Y. Note that $\mathrm{M}^* \bullet \mathrm{Y}_i \le \mathrm{M}_i \bullet \mathrm{Y}_i$, and thus
$\mathrm{M}^* \bullet \mathrm{Y} = \sum_{i=1}^{k} \mathrm{M}^* \bullet \mathrm{Y}_i \le \sum_{i=1}^{k} \mathrm{M}_i \bullet \mathrm{Y}_i.$
We can achieve a D-approximation for the bipartite matching problem by selecting an optimal matching for one of the low-rank factors of Y. Specifically, let $j^* = \arg\max_j \mathrm{M}_j \bullet \mathrm{Y}$ and let D be any constant satisfying $\mathrm{M}_i \bullet \mathrm{Y}_i \le D\,(\mathrm{M}_{j^*} \bullet \mathrm{Y}_i)$ for all i; then $\mathrm{M}^* \bullet \mathrm{Y} \le D\,(\mathrm{M}_{j^*} \bullet \mathrm{Y})$:
$\mathrm{M}^* \bullet \mathrm{Y} \le \sum_{i=1}^{k} \mathrm{M}_i \bullet \mathrm{Y}_i \le \sum_{i=1}^{k} D\, \mathrm{M}_{j^*} \bullet \mathrm{Y}_i = D\,(\mathrm{M}_{j^*} \bullet \mathrm{Y}). \qquad \square$
This procedure (Figure 3) runs in $\mathcal{O}(k^2 n + k n \log n)$ time, where k is the rank and U and V have O(n) rows; the space requirement is $\mathcal{O}(nk)$. In practice, the approximation factors D are less than 1.1 for our problems (see Figure 7). Figure 3 gives pseudocode for this matching algorithm.
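A sketch of this procedure, reusing rank1_matching from above. For the a-posteriori bound we return the directly computable quantity $D = (\sum_i \mathrm{M}_i \bullet \mathrm{Y}_i)/(\mathrm{M}_{j^*} \bullet \mathrm{Y})$, which is valid by the first inequality above; the exact bookkeeping in the paper's Figure 3 may differ:

```python
import numpy as np

def best_rank1_matching(U, V):
    """Sketch of the Figure 3 idea: match each rank-1 factor, evaluate
    every matching against the full Y = U V^T via the factors, and return
    the best matching together with a computable a-posteriori bound D."""
    k = U.shape[1]
    matchings = [rank1_matching(U[:, i], V[:, i]) for i in range(k)]
    # value of each matching on the full matrix Y, through the factors
    vals = [sum(U[a] @ V[b] for a, b in M) for M in matchings]
    # value of M_i on its own rank-1 piece Y_i
    diag = [sum(U[a, i] * V[b, i] for a, b in M)
            for i, M in enumerate(matchings)]
    jstar = int(np.argmax(vals))
    D = sum(diag) / vals[jstar]      # M* . Y <= sum_i M_i . Y_i = D (M_j* . Y)
    return matchings[jstar], D, matchings
```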
Our method (Figure 3) can be improved without substantially changing its runtime or memory requirements. The key idea is to create a sparse max-weight bipartite matching problem that includes the matching $\mathrm{M}_{j^*}$ along with other helpful edges. Solving this sparse problem optimally can only improve the approximation, and although it incurs the cost of an exact solve, sparse max-weight matching solvers are practical and fast even for problems with millions of edges.
Union of matchings. The simplest improvement is to create a sparse graph based on the full set of matchings $\mathrm{M}_1, \ldots, \mathrm{M}_k$. We do this by transforming the complete bipartite network defined by Y into a sparsified network $\hat{\mathrm{Y}}$ whose edge (a, b) is nonzero with weight $Y_{a,b}$ only if nodes a and b were matched by some $\mathrm{M}_i$. Then, we solve a maximum-weight bipartite matching problem on the sparse matrix $\hat{\mathrm{Y}}$ with O(nk) non-zeros or edges. This can only improve the approximation because the matching $\mathrm{M}_{j^*}$ is included.
Expanding non-matchings on rank-1 factors. Since the algorithm of Figure 3 relies on a sorting procedure when building $\mathrm{M}_i$ from the rank-1 factors, and since the sorted values may be very close to each other, we can expand the set of possible matchings and let each node pair up with the c closest values to it in sorted order. By way of example, if c = 3, then instead of pairing the node holding the ith largest value in u only with the node holding the ith largest value in v, we also allow pairs with the nodes holding the (i-1)st and (i+1)st largest values in v. A sketch of both improvements follows.
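The sketch below combines the union of matchings with the c-nearest expansion. For brevity it ignores the positive/negative case split of Section 4.1 and rounds the sparse candidate set greedily rather than with the exact sparse max-weight solver used in the paper; names and interface are ours:

```python
import numpy as np

def expanded_matching(U, V, c=3):
    """Sketch of Section 4.3: build a sparse candidate edge set from the
    rank-1 sortings, expanded to the c nearest sorted positions, then
    round it into a matching (greedily, for simplicity)."""
    edges = set()
    for i in range(U.shape[1]):
        iu, iv = np.argsort(-U[:, i]), np.argsort(-V[:, i])
        for r in range(min(len(iu), len(iv))):
            for t in range(max(0, r - c // 2), min(len(iv), r + c // 2 + 1)):
                edges.add((int(iu[r]), int(iv[t])))   # nearby sorted ranks
    # weight each candidate edge by Y_ab = U[a,:] . V[b,:] and round
    weighted = sorted(((U[a] @ V[b], a, b) for a, b in edges), reverse=True)
    usedA, usedB, match = set(), set(), []
    for w, a, b in weighted:
        if w > 0 and a not in usedA and b not in usedB:
            usedA.add(a); usedB.add(b); match.append((a, b))
    return match
```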
To evaluate our method, we first study the relationship between Low Rank EigenAlign and the original EigenAlign algorithm. The goal of these initial experiments is to show (i) that we need about 8 iterations, which gives a rank-9 matrix, to get results equivalent to EigenAlign (Figure 4); (ii) that our method performs the same over a variety of graph models (Figure 5); (iii) that the method scales better (Figure 6); and (iv) that the computed approximation bounds are better than 1.1 (Figure 7). We also compare against other scalable techniques in Figure 8 and see that our approach performs best. Next, we use a test set of networks with known alignments from biology [28] to evaluate our algorithms (Section 5.2). Finally, we end our experiments with a study on a collaboration network where we seek to align vertex neighborhoods (Section 5.3).
Our low-rank EigenAlign. In all of these experiments, our low-rank techniques use the expanded matching with c = 3 (Section 4.3) and set the initial rank-1 factors to be uniform: u = e, v = e. Let $\alpha = 1 + \frac{\text{nnz}(A)\, \text{nnz}(B)}{ \text{nnz}(A)(n_B^2 - \text{nnz}(B)) + \text{nnz}(B)(n_A^2 - \text{nnz}(A))}$. This equals one plus the ratio of possible overlaps to possible conflicts. Let γ = 0.001; then sO = α + γ, sN = 1 + γ, sC = γ. These parameters correspond to those used in [5] as well; a snippet computing them follows. Finally, we set the number of iterations to 8 for all experiments except those where we explicitly vary the number of iterations.
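As a concrete illustration of the parameter choice above (assuming numpy 0/1 adjacency arrays, so nnz equals the matrix sum):

```python
def eigenalign_weights(A, B, gamma=0.001):
    """Compute alpha and (sO, sN, sC) exactly as specified above."""
    nA, nB = A.shape[0], B.shape[0]
    nzA, nzB = A.sum(), B.sum()          # nnz for 0/1 adjacency matrices
    alpha = 1 + (nzA * nzB) / (nzA * (nB**2 - nzB) + nzB * (nA**2 - nzA))
    return alpha + gamma, 1 + gamma, gamma   # sO, sN, sC
```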
Theoretical runtime. When we combine our low-rank computation and the subsequent expanded low-rank matching, the runtime of our method is $\mathcal{O}\big(k^2(\text{nnz}(\mathrm{A}) + \text{nnz}(\mathrm{B})) + k^2 n + k n \log n\big)$ with $n = \max(n_A, n_B)$, i.e., near-linear in the size of the graphs for a fixed number of iterations k.
EigenAlign baseline. For EigenAlign, we use the same set of parameters sO, sN, sC and run the power method starting from the all-ones vector. We run the power method with normalization as described in (5) until we reach an eigenvalue-eigenvector pair that achieves a residual of $10^{-12}$. This usually occurs after 15-20 iterations.
The goal of our first experiment is to assess the performance of our method compared to EigenAlign. These experiments are all done on synthetic problems with a known alignment between the graphs. The metric we use to assess performance is recovery [5], where larger values are better. Recovery lies between 0 and 1 and is defined as the fraction of nodes that are matched to their true counterparts under the known alignment.
Graph models. To generate the starting undirected network in the problem (GA), we use either Erdős-Rényi with average degree ρ (where the edge probability is ρ/n) or preferential attachment starting from a random 6-node initial graph and adding θ edges with each new vertex.
Noise model. Given a network GA, we add noise to generate our second network GB [5]. With probability $p_{e_1}$ we remove an edge, and with probability $p_{e_2}$ we add an edge. Algebraically, B can then be written as $\mathrm{B} = \mathrm{A} \circ (1 - \mathrm{Q}_1) + (1 - \mathrm{A}) \circ \mathrm{Q}_2$, where $\mathrm{Q}_1$ and $\mathrm{Q}_2$ are undirected Erdős-Rényi graphs with densities $p_{e_1}$ and $p_{e_2}$ respectively and $\circ$ is the Hadamard (element-wise) product. We fix $p_{e_2} = p\,p_{e_1}/(1-p)$, where p is the density of GA, so that the expected number of edges is preserved. Because some algorithms have a bias in the presence of multiple possible solutions, after B is generated we relabel the nodes of B in reverse order.
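A sketch of this noise model (the helper name and interface are ours):

```python
import numpy as np

def noisy_copy(A, pe1, rng=None):
    """Generate B = A o (1 - Q1) + (1 - A) o Q2 with pe2 = p*pe1/(1-p),
    where p is the density of GA, so expected edge counts are preserved."""
    rng = np.random.default_rng() if rng is None else rng
    n = A.shape[0]
    p = A.sum() / (n * (n - 1))                  # density of the 0/1 matrix A
    pe2 = p * pe1 / (1 - p)
    def er_mask(prob):                           # undirected Erdos-Renyi mask
        T = np.triu(rng.random((n, n)) < prob, k=1)
        return (T | T.T).astype(A.dtype)
    Q1, Q2 = er_mask(pe1), er_mask(pe2)
    return A * (1 - Q1) + (1 - A) * Q2
```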
Eight iterations are enough. We first study how the results change with the number of iterations. We use Erdős-Rényi graphs with average degree 20 and analyze the performance of our method as the iteration count varies. Figure 4 shows the recovery (left) and overlap (right) relative to the EigenAlign result, so a value of 1.0 means performance identical to EigenAlign. After 8 iterations the recovery stops increasing, so we perform the rest of our experiments with only 8 iterations.
Our Low-rank EigenAlign matches EigenAlign for Erdős-Rényi and preferential attachment. We next test a variety of graphs as the noise level $p_{e_1}$ varies. For these experiments, we create Erdős-Rényi graphs with average degree 5 and 20 and preferential attachment graphs with θ = 4 and θ = 6 for graphs with 50 nodes. Figure 5 shows these results in terms of the recovery of the true alignment. In the figure, the experimental results over 200 trials are essentially indistinguishable.
Our low-rank method is far more scalable. We next consider what happens to the runtime of the two algorithms as the graphs get larger. Figure 6 shows these results where we let each method run up to two minutes. We look at preferential attachment graphs with θ = 4 and $p_{e_1} = 0.5/n$ . EigenAlign requires a little more than two minutes to solve a problem of size 1000, whereas our low rank formulation can solve a problem that is an order of magnitude bigger in the same amount of time.
Our matching approximations are high quality. We also evaluate the effectiveness of the D-approximation computed in Section 4.2. Here, we compare the computed bound D to the actual approximation ratio of our algorithm and to the actual approximation ratio of a greedy matching algorithm. The greedy algorithm can be implemented in a memory-scalable fashion with an $O(n^3)$ runtime (or $O(n^2 \log n)$ with quadratic memory) and guarantees a 2-approximation, whereas our D value gives better theoretical bounds. Figure 7 shows these results. Our guaranteed approximation factors are always less than 1.1 when the low-rank factors arise from the problems in Figure 6. Surprisingly, greedy matching does exceptionally well in terms of approximation, prompting our next experiment.
Our matching greatly outperforms greedy matching and other low-rank techniques. NSD [8] is another network alignment algorithm which solves the network alignment problem via low-rank factors. In the previous experiment, we saw that greedy matching consistently gave better than expected approximation ratios. Here, we compare the low-rank EigenAlign formulations with our low-rank matching scheme to greedy matching in terms of recovery. The results are shown in Figure 8 and show that the low-rank EigenAlign strategy with our low-rank matching outperforms the other scalable alternatives.
The MultiMagna dataset is a test case in bioinformatics that involves network alignment [19, 28]. It consists of a base yeast network that has been modified in different ways to produce five related networks, which we can think of as different edge sets on the same set of 1004 nodes. This results in 15 pairs of networks to align (6 choose 2). One unique aspect of this data is that there is no side information provided to guide the alignment process, which is exactly where our methods are most useful. In Figure 9, we show results for aligning the MultiMagna networks using low-rank EigenAlign, EigenAlign, belief propagation (BP) [2], and Klau's method [7] in terms of two biologically relevant measures:
F-Node Correctness (F-NC). This is the F-score (harmonic mean) of the precision and recall of the alignment.
NCV-Generalized S3. This measures how well network structure is preserved across the alignment. Let M be a matching matrix for graphs with nA and nB nodes. The node coverage value of an alignment is NCV = 2 nnz(M)/(nA + nB), where nnz(M) counts the number of nonzero entries in M. Let EO be the set of overlapping edges for an alignment M and EC be the set of conflicts, and define GS3 = |EO|/(|EO| + |EC|). The NCV-GS3 score is the geometric mean of NCV and GS3; a sketch of this computation follows.
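As an illustration, a sketch of the NCV-GS3 computation with the matching given as a list of (a, b) pairs (an illustrative interface, not the evaluation code used in the paper):

```python
def ncv_gs3(pairs, A, B):
    """NCV-GS3 for a matching over 0/1 adjacency matrices A and B,
    following the definitions above."""
    nA, nB = A.shape[0], B.shape[0]
    ncv = 2 * len(pairs) / (nA + nB)             # node coverage value
    EO = EC = 0
    for x, (a1, b1) in enumerate(pairs):
        for (a2, b2) in pairs[x + 1:]:           # each aligned pair once
            ea, eb = A[a1, a2], B[b1, b2]
            EO += int(ea == 1 and eb == 1)       # overlap
            EC += int(ea != eb)                  # conflict
    gs3 = EO / (EO + EC) if (EO + EC) else 0.0
    return (ncv * gs3) ** 0.5                    # geometric mean
```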
In this experiment, we find that standard network alignment algorithms (BP and Klau) perform dreadfully in terms of F-NC without any guidance about which nodes might be good matches. To address this, we take the output of our expanded matchings from the low-rank factors and run the Klau and BP methods on this restricted set of candidate matches. This enables them to run in a reasonable amount of time with improved results. The idea here is that we treat Klau and BP as the matching algorithm rather than using bipartite matching for this step, which picks a matching that also yields a good alignment. Our results are comparable with those in [19], a recent paper that applies a number of other algorithms to the same data. The timing results from these experiments are shown in Table 1.
Table 1. Time (sec) to align the MultiMagna networks (min/median/max over the 15 pairs).

| Algorithm | min | median | max |
|---|---|---|---|
| LR | 1.9553 | 2.1971 | 2.9173 |
| EA | 83.6777 | 96.9938 | 194.363 |
| BP | 1985.2 | 2216.3 | 2744.3 |
| Klau | 3031.4 | 3856.0 | 4590.2 |
| LR+BP | 174.06 | 182.58 | 190.44 |
| LR+Klau | 257.59 | 301.86 | 318.83 |
We now use Low Rank EigenAlign in a study on a collaboration network to understand what would be possible with a fully anonymized network. We show that our network alignment technique can identify edges whose endpoints have a high Jaccard similarity: we align the vertex neighborhoods of the two endpoints of each edge and observe that a high overlap implies a high Jaccard similarity score.
In more detail, recall that the Jaccard similarity of two nodes (a and b) is defined as $\frac{|N(a) \cap N(b)|}{|N(a) \cup N(b)|}$ , where N(a) are the neighboring nodes of a. The vertex neighborhood of node a is the induced subgraph of the node and all of its neighbors. Given an edge (i, j), we then compute the Jaccard similarity between i and j, and also align the vertex neighborhood of i to the vertex neighborhood of j using our technique.
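As a small illustration of these definitions, Jaccard similarity and vertex-neighborhood extraction from a 0/1 adjacency matrix (helper names are ours):

```python
import numpy as np

def jaccard(A, i, j):
    """Jaccard similarity of nodes i and j: |N(i) & N(j)| / |N(i) | N(j)|."""
    Ni, Nj = set(np.flatnonzero(A[i])), set(np.flatnonzero(A[j]))
    return len(Ni & Nj) / len(Ni | Nj)

def neighborhood(A, a):
    """Vertex neighborhood of a: the induced subgraph on a and its neighbors."""
    idx = np.concatenate(([a], np.flatnonzero(A[a])))
    return A[np.ix_(idx, idx)]
```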
We use the DBLP collaboration network from [4] and consider pairs of nodes that have sufficiently large neighborhoods and are connected by an edge. Specifically, we consider nodes that have 100 or more neighbors, which yields 15187 such pairs in total. This is an easy experiment with our fast codes and takes less than five minutes. The results are in Figure 10. We score the network alignments in terms of normalized overlap, the ratio of overlapped edges to the maximum possible number for a pair of neighborhoods. We observe that large Jaccard similarities and large overlap scores coincide, which means we could have identified these results without any information about the actual identity of the vertices.
The low-rank spectral network alignment framework we introduce here offers a number of exciting possibilities in new uses of network alignment methods. First, it enables a new level of high-quality results with a scalable, principled method as illustrated by our experiments. This is because it has near-linear runtime and memory requirements in the size of the input networks. Second, in the course of this application, we developed a novel matching routine with high-quality a-posteriori approximation guarantees that will likely be useful in other areas as well.
That said, there are a number of areas that merit further exploration. First, the resulting low-rank factorization uses the matrix Sk, which is related to graph diffusions. There are results in computational geometry that rigorously justify using diffusions to align manifolds [23], and there are likely useful connections to explore here. Second, there are strong relationships between our low-rank methods and fast algorithms for Sylvester and multi-term matrix equations [26] of the form $\mathrm{C}_1 \mathrm{X} \mathrm{D}_1 + \mathrm{C}_2 \mathrm{X} \mathrm{D}_2 + \cdots = \mathrm{F}$. These connections offer new possibilities to improve our methods.
The authors were supported by NSF CCF-1149756, IIS-1422918, IIS-1546488, CCF-0939370, DARPA SIMPLEX, and the Sloan Foundation.