-
ACRONYM: Augmented degree corrected, Community Reticulated Organized Network Yielding Model
Authors:
Benjamin Leinwand,
Vince Lyzinski
Abstract:
Modeling networks can serve as a means of summarizing high-dimensional complex systems. Adapting an approach devised for dense, weighted networks, we propose a new method for generating and estimating unweighted networks. This approach can describe a broader class of potential networks than existing models, including those where nodes in different subnetworks connect to one another via various att…
▽ More
Modeling networks can serve as a means of summarizing high-dimensional complex systems. Adapting an approach devised for dense, weighted networks, we propose a new method for generating and estimating unweighted networks. This approach can describe a broader class of potential networks than existing models, including those where nodes in different subnetworks connect to one another via various attachment mechanisms, inducing flexible and varied community structures. While unweighted edges provide less resolution than continuous weights, restricting to the binary case permits the use of likelihood-based estimation techniques, which can improve estimation of nodal features. The extra flexibility may contribute a different understanding of network generating structures, particularly for networks with heterogeneous densities in different regions.
△ Less
Submitted 10 April, 2024;
originally announced April 2024.
-
Detection of Model-based Planted Pseudo-cliques in Random Dot Product Graphs by the Adjacency Spectral Embedding and the Graph Encoder Embedding
Authors:
Tong Qi,
Vince Lyzinski
Abstract:
In this paper, we explore the capability of both the Adjacency Spectral Embedding (ASE) and the Graph Encoder Embedding (GEE) for capturing an embedded pseudo-clique structure in the random dot product graph setting. In both theory and experiments, we demonstrate that this pairing of model and methods can yield worse results than the best existing spectral clique detection methods, demonstrating a…
▽ More
In this paper, we explore the capability of both the Adjacency Spectral Embedding (ASE) and the Graph Encoder Embedding (GEE) for capturing an embedded pseudo-clique structure in the random dot product graph setting. In both theory and experiments, we demonstrate that this pairing of model and methods can yield worse results than the best existing spectral clique detection methods, demonstrating at once the methods' potential inability to capture even modestly sized pseudo-cliques and the methods' robustness to the model contamination giving rise to the pseudo-clique structure. To further enrich our analysis, we also consider the Variational Graph Auto-Encoder (VGAE) model in our simulation and real data experiments.
△ Less
Submitted 18 December, 2023;
originally announced December 2023.
-
Evaluating the effects of high-throughput structural neuroimaging predictors on whole-brain functional connectome outcomes via network-based vector-on-matrix regression
Authors:
Tong Lu,
Yuan Zhang,
Vince Lyzinski,
Chuan Bi,
Peter Kochunov,
Elliot Hong,
Shuo Chen
Abstract:
The joint analysis of multimodal neuroimaging data is critical in the field of brain research because it reveals complex interactive relationships between neurobiological structures and functions. In this study, we focus on investigating the effects of structural imaging (SI) features, including white matter micro-structure integrity (WMMI) and cortical thickness, on the whole brain functional con…
▽ More
The joint analysis of multimodal neuroimaging data is critical in the field of brain research because it reveals complex interactive relationships between neurobiological structures and functions. In this study, we focus on investigating the effects of structural imaging (SI) features, including white matter micro-structure integrity (WMMI) and cortical thickness, on the whole brain functional connectome (FC) network. To achieve this goal, we propose a network-based vector-on-matrix regression model to characterize the FC-SI association patterns. We have developed a novel multi-level dense bipartite and clique subgraph extraction method to identify which subsets of spatially specific SI features intensively influence organized FC sub-networks. The proposed method can simultaneously identify highly correlated structural-connectomic association patterns and suppress false positive findings while handling millions of potential interactions. We apply our method to a multimodal neuroimaging dataset of 4,242 participants from the UK Biobank to evaluate the effects of whole-brain WMMI and cortical thickness on the resting-state FC. The results reveal that the WMMI on corticospinal tracts and inferior cerebellar peduncle significantly affect functional connections of sensorimotor, salience, and executive sub-networks with an average correlation of 0.81 (p<0.001).
△ Less
Submitted 27 October, 2023;
originally announced October 2023.
-
Gotta match 'em all: Solution diversification in graph matching matched filters
Authors:
Zhirui Li,
Ben Johnson,
Daniel L. Sussman,
Carey E. Priebe,
Vince Lyzinski
Abstract:
We present a novel approach for finding multiple noisily embedded template graphs in a very large background graph. Our method builds upon the graph-matching-matched-filter technique proposed in Sussman et al., with the discovery of multiple diverse matchings being achieved by iteratively penalizing a suitable node-pair similarity matrix in the matched filter algorithm. In addition, we propose alg…
▽ More
We present a novel approach for finding multiple noisily embedded template graphs in a very large background graph. Our method builds upon the graph-matching-matched-filter technique proposed in Sussman et al., with the discovery of multiple diverse matchings being achieved by iteratively penalizing a suitable node-pair similarity matrix in the matched filter algorithm. In addition, we propose algorithmic speed-ups that greatly enhance the scalability of our matched-filter approach. We present theoretical justification of our methodology in the setting of correlated Erdos-Renyi graphs, showing its ability to sequentially discover multiple templates under mild model conditions. We additionally demonstrate our method's utility via extensive experiments both using simulated models and real-world dataset, include human brain connectomes and a large transactional knowledge base.
△ Less
Submitted 4 July, 2024; v1 submitted 25 August, 2023;
originally announced August 2023.
-
On seeded subgraph-to-subgraph matching: The ssSGM Algorithm and matchability information theory
Authors:
Lingyao Meng,
Mengqi Lou,
Jianyu Lin,
Vince Lyzinski,
Donniell E. Fishkind
Abstract:
The subgraph-subgraph matching problem is, given a pair of graphs and a positive integer $K$, to find $K$ vertices in the first graph, $K$ vertices in the second graph, and a bijection between them, so as to minimize the number of adjacency disagreements across the bijection; it is ``seeded" if some of this bijection is fixed. The problem is intractable, and we present the ssSGM algorithm, which u…
▽ More
The subgraph-subgraph matching problem is, given a pair of graphs and a positive integer $K$, to find $K$ vertices in the first graph, $K$ vertices in the second graph, and a bijection between them, so as to minimize the number of adjacency disagreements across the bijection; it is ``seeded" if some of this bijection is fixed. The problem is intractable, and we present the ssSGM algorithm, which uses Frank-Wolfe methodology to efficiently find an approximate solution. Then, in the context of a generalized correlated random Bernoulli graph model, in which the pair of graphs naturally have a core of $K$ matched pairs of vertices, we provide and prove mild conditions for the subgraph-subgraph matching problem solution to almost always be the correct $K$ matched pairs of vertices.
△ Less
Submitted 6 June, 2023;
originally announced June 2023.
-
Adversarial contamination of networks in the setting of vertex nomination: a new trimming method
Authors:
Sheyda Peyman,
Minh Tang,
Vince Lyzinski
Abstract:
As graph data becomes more ubiquitous, the need for robust inferential graph algorithms to operate in these complex data domains is crucial. In many cases of interest, inference is further complicated by the presence of adversarial data contamination. The effect of the adversary is frequently to change the data distribution in ways that negatively affect statistical and algorithmic performance. We…
▽ More
As graph data becomes more ubiquitous, the need for robust inferential graph algorithms to operate in these complex data domains is crucial. In many cases of interest, inference is further complicated by the presence of adversarial data contamination. The effect of the adversary is frequently to change the data distribution in ways that negatively affect statistical and algorithmic performance. We study this phenomenon in the context of vertex nomination, a semi-supervised information retrieval task for network data. Here, a common suite of methods relies on spectral graph embeddings, which have been shown to provide both good algorithmic performance and flexible settings in which regularization techniques can be implemented to help mitigate the effect of an adversary. Many current regularization methods rely on direct network trimming to effectively excise the adversarial contamination, although this direct trimming often gives rise to complicated dependency structures in the resulting graph. We propose a new trimming method that operates in model space which can address both block structure contamination and white noise contamination (contamination whose distribution is unknown). This model trimming is more amenable to theoretical analysis while also demonstrating superior performance in a number of simulations, compared to direct trimming.
△ Less
Submitted 20 August, 2022;
originally announced August 2022.
-
Lost in the Shuffle: Testing Power in the Presence of Errorful Network Vertex Labels
Authors:
Ayushi Saxena,
Vince Lyzinski
Abstract:
Two-sample network hypothesis testing is an important inference task with applications across diverse fields such as medicine, neuroscience, and sociology. Many of these testing methodologies operate under the implicit assumption that the vertex correspondence across networks is a priori known. This assumption is often untrue, and the power of the subsequent test can degrade when there are misalig…
▽ More
Two-sample network hypothesis testing is an important inference task with applications across diverse fields such as medicine, neuroscience, and sociology. Many of these testing methodologies operate under the implicit assumption that the vertex correspondence across networks is a priori known. This assumption is often untrue, and the power of the subsequent test can degrade when there are misaligned/label-shuffled vertices across networks. This power loss due to shuffling is theoretically explored in the context of random dot product and stochastic block model networks for a pair of hypothesis tests based on Frobenius norm differences between estimated edge probability matrices or between adjacency matrices. The loss in testing power is further reinforced by numerous simulations and experiments, both in the stochastic block model and in the random dot product graph model, where the power loss across multiple recently proposed tests in the literature is considered. Lastly, the impact that shuffling can have in real-data testing is demonstrated in a pair of examples from neuroscience and from social network analysis.
△ Less
Submitted 26 May, 2024; v1 submitted 18 August, 2022;
originally announced August 2022.
-
Clustered Graph Matching for Label Recovery and Graph Classification
Authors:
Zhirui Li,
Jesus Arroyo,
Konstantinos Pantazis,
Vince Lyzinski
Abstract:
Given a collection of vertex-aligned networks and an additional label-shuffled network, we propose procedures for leveraging the signal in the vertex-aligned collection to recover the labels of the shuffled network. We consider matching the shuffled network to averages of the networks in the vertex-aligned collection at different levels of granularity. We demonstrate both in theory and practice th…
▽ More
Given a collection of vertex-aligned networks and an additional label-shuffled network, we propose procedures for leveraging the signal in the vertex-aligned collection to recover the labels of the shuffled network. We consider matching the shuffled network to averages of the networks in the vertex-aligned collection at different levels of granularity. We demonstrate both in theory and practice that if the graphs come from different network classes, then clustering the networks into classes followed by matching the new graph to cluster-averages can yield higher fidelity matching performance than matching to the global average graph. Moreover, by minimizing the graph matching objective function with respect to each cluster average, this approach simultaneously classifies and recovers the vertex labels for the shuffled graph. These theoretical developments are further reinforced via an illuminating real data experiment matching human connectomes.
△ Less
Submitted 29 March, 2023; v1 submitted 6 May, 2022;
originally announced May 2022.
-
Signed and Unsigned Partial Information Decompositions of Continuous Network Interactions
Authors:
Jesse Milzman,
Vince Lyzinski
Abstract:
We investigate the partial information decomposition (PID) framework as a tool for edge nomination. We consider both the $I_{\cap}^{\text{min}}$ and $I_{\cap}^{\text{PM}}$ PIDs, from arXiv:1004.2515 and arXiv:1801.09010 respectively, and we both numerically and analytically investigate the utility of these frameworks for discovering significant edge interactions. In the course of our work, we exte…
▽ More
We investigate the partial information decomposition (PID) framework as a tool for edge nomination. We consider both the $I_{\cap}^{\text{min}}$ and $I_{\cap}^{\text{PM}}$ PIDs, from arXiv:1004.2515 and arXiv:1801.09010 respectively, and we both numerically and analytically investigate the utility of these frameworks for discovering significant edge interactions. In the course of our work, we extend both the $I_{\cap}^{\text{min}}$ and $I_{\cap}^{\text{PM}}$ PIDs to a general class of continuous trivariate systems. Moreover, we examine how each PID apportions information into redundant, synergistic, and unique information atoms within the source-bivariate PID framework. Both our simulation experiments and analytic inquiry indicate that the atoms of the $I_{\cap}^{\text{PM}}$ PID have a non-specific sensitivity to high predictor-target mutual information, regardless of whether or not the predictors are truly interacting. By contrast, the $I_{\cap}^{\text{min}}$ PID is quite specific, although simulations suggest that it lacks sensitivity.
△ Less
Submitted 22 December, 2021;
originally announced December 2021.
-
Leveraging semantically similar queries for ranking via combining representations
Authors:
Hayden S. Helm,
Marah Abdin,
Benjamin D. Pedigo,
Shweti Mahajan,
Vince Lyzinski,
Youngser Park,
Amitabh Basu,
Piali~Choudhury,
Christopher M. White,
Weiwei Yang,
Carey E. Priebe
Abstract:
In modern ranking problems, different and disparate representations of the items to be ranked are often available. It is sensible, then, to try to combine these representations to improve ranking. Indeed, learning to rank via combining representations is both principled and practical for learning a ranking function for a particular query. In extremely data-scarce settings, however, the amount of l…
▽ More
In modern ranking problems, different and disparate representations of the items to be ranked are often available. It is sensible, then, to try to combine these representations to improve ranking. Indeed, learning to rank via combining representations is both principled and practical for learning a ranking function for a particular query. In extremely data-scarce settings, however, the amount of labeled data available for a particular query can lead to a highly variable and ineffective ranking function. One way to mitigate the effect of the small amount of data is to leverage information from semantically similar queries. Indeed, as we demonstrate in simulation settings and real data examples, when semantically similar queries are available it is possible to gainfully use them when ranking with respect to a particular query. We describe and explore this phenomenon in the context of the bias-variance trade off and apply it to the data-scarce settings of a Bing navigational graph and the Drosophila larva connectome.
△ Less
Submitted 23 June, 2021;
originally announced June 2021.
-
The Phantom Alignment Strength Conjecture: Practical use of graph matching alignment strength to indicate a meaningful graph match
Authors:
Donniell E. Fishkind,
Felix Parker,
Hamilton Sawczuk,
Lingyao Meng,
Eric Bridgeford,
Avanti Athreya,
Carey E. Priebe,
Vince Lyzinski
Abstract:
The alignment strength of a graph matching is a quantity that gives the practitioner a measure of the correlation of the two graphs, and it can also give the practitioner a sense for whether the graph matching algorithm found the true matching. Unfortunately, when a graph matching algorithm fails to find the truth because of weak signal, there may be "phantom alignment strength" from meaningless m…
▽ More
The alignment strength of a graph matching is a quantity that gives the practitioner a measure of the correlation of the two graphs, and it can also give the practitioner a sense for whether the graph matching algorithm found the true matching. Unfortunately, when a graph matching algorithm fails to find the truth because of weak signal, there may be "phantom alignment strength" from meaningless matchings that, by random noise, have fewer disagreements than average (sometimes substantially fewer); this alignment strength may give the misleading appearance of significance. A practitioner needs to know what level of alignment strength may be phantom alignment strength and what level indicates that the graph matching algorithm obtained the true matching and is a meaningful measure of the graph correlation. The {\it Phantom Alignment Strength Conjecture} introduced here provides a principled and practical means to approach this issue. We provide empirical evidence for the conjecture, and explore its consequences.
△ Less
Submitted 23 August, 2021; v1 submitted 28 February, 2021;
originally announced March 2021.
-
Subgraph nomination: Query by Example Subgraph Retrieval in Networks
Authors:
Al-Fahad M. Al-Qadhi,
Carey E. Priebe,
Hayden S. Helm,
Vince Lyzinski
Abstract:
This paper introduces the subgraph nomination inference task, in which example subgraphs of interest are used to query a network for similarly interesting subgraphs. This type of problem appears time and again in real world problems connected to, for example, user recommendation systems and structural retrieval tasks in social and biological/connectomic networks. We formally define the subgraph no…
▽ More
This paper introduces the subgraph nomination inference task, in which example subgraphs of interest are used to query a network for similarly interesting subgraphs. This type of problem appears time and again in real world problems connected to, for example, user recommendation systems and structural retrieval tasks in social and biological/connectomic networks. We formally define the subgraph nomination framework with an emphasis on the notion of a user-in-the-loop in the subgraph nomination pipeline. In this setting, a user can provide additional post-nomination light supervision that can be incorporated into the retrieval task. After introducing and formalizing the retrieval task, we examine the nuanced effect that user-supervision can have on performance, both analytically and across real and simulated data examples.
△ Less
Submitted 19 December, 2022; v1 submitted 29 January, 2021;
originally announced January 2021.
-
Vertex nomination between graphs via spectral embedding and quadratic programming
Authors:
Runbing Zheng,
Vince Lyzinski,
Carey E. Priebe,
Minh Tang
Abstract:
Given a network and a subset of interesting vertices whose identities are only partially known, the vertex nomination problem seeks to rank the remaining vertices in such a way that the interesting vertices are ranked at the top of the list. An important variant of this problem is vertex nomination in the multi-graphs setting. Given two graphs $G_1, G_2$ with common vertices and a vertex of intere…
▽ More
Given a network and a subset of interesting vertices whose identities are only partially known, the vertex nomination problem seeks to rank the remaining vertices in such a way that the interesting vertices are ranked at the top of the list. An important variant of this problem is vertex nomination in the multi-graphs setting. Given two graphs $G_1, G_2$ with common vertices and a vertex of interest $x \in G_1$, we wish to rank the vertices of $G_2$ such that the vertices most similar to $x$ are ranked at the top of the list. The current paper addresses this problem and proposes a method that first applies adjacency spectral graph embedding to embed the graphs into a common Euclidean space, and then solves a penalized linear assignment problem to obtain the nomination lists. Since the spectral embedding of the graphs are only unique up to orthogonal transformations, we present two approaches to eliminate this potential non-identifiability. One approach is based on orthogonal Procrustes and is applicable when there are enough vertices with known correspondence between the two graphs. Another approach uses adaptive point set registration and is applicable when there are few or no vertices with known correspondence. We show that our nomination scheme leads to accurate nomination under a generative model for pairs of random graphs that are approximately low-rank and possibly with pairwise edge correlations. We illustrate our algorithm's performance through simulation studies on synthetic data as well as analysis of a high-school friendship network and analysis of transition rates between web pages on the Bing search engine.
△ Less
Submitted 27 March, 2022; v1 submitted 24 October, 2020;
originally announced October 2020.
-
The Importance of Being Correlated: Implications of Dependence in Joint Spectral Inference across Multiple Networks
Authors:
Konstantinos Pantazis,
Avanti Athreya,
Jesús Arroyo,
William N. Frost,
Evan S. Hill,
Vince Lyzinski
Abstract:
Spectral inference on multiple networks is a rapidly-developing subfield of graph statistics. Recent work has demonstrated that joint, or simultaneous, spectral embedding of multiple independent networks can deliver more accurate estimation than individual spectral decompositions of those same networks. Such inference procedures typically rely heavily on independence assumptions across the multipl…
▽ More
Spectral inference on multiple networks is a rapidly-developing subfield of graph statistics. Recent work has demonstrated that joint, or simultaneous, spectral embedding of multiple independent networks can deliver more accurate estimation than individual spectral decompositions of those same networks. Such inference procedures typically rely heavily on independence assumptions across the multiple network realizations, and even in this case, little attention has been paid to the induced network correlation in such joint embeddings. Here, we present a generalized omnibus embedding methodology and provide a detailed analysis of this embedding across both independent and correlated networks, the latter of which significantly extends the reach of such procedures. We describe how this omnibus embedding can itself induce correlation, leading us to distinguish between inherent correlation -- the correlation that arises naturally in multisample network data -- and induced correlation, which is an artifice of the joint embedding methodology. We show that the generalized omnibus embedding procedure is flexible and robust, and prove both consistency and a central limit theorem for the embedded points. We examine how induced and inherent correlation can impact inference for network time series data, and we provide network analogues of classical questions such as the effective sample size for more generally correlated data. Further, we show how an appropriately calibrated generalized omnibus embedding can detect changes in real biological networks that previous embedding procedures could not discern, confirming that the effect of inherent and induced correlation can be subtle and transformative, with import in theory and practice.
△ Less
Submitted 17 June, 2021; v1 submitted 31 July, 2020;
originally announced August 2020.
-
Vertex Nomination in Richly Attributed Networks
Authors:
Keith Levin,
Carey E. Priebe,
Vince Lyzinski
Abstract:
Vertex nomination is a lightly-supervised network information retrieval task in which vertices of interest in one graph are used to query a second graph to discover vertices of interest in the second graph. Similar to other information retrieval tasks, the output of a vertex nomination scheme is a ranked list of the vertices in the second graph, with the heretofore unknown vertices of interest ide…
▽ More
Vertex nomination is a lightly-supervised network information retrieval task in which vertices of interest in one graph are used to query a second graph to discover vertices of interest in the second graph. Similar to other information retrieval tasks, the output of a vertex nomination scheme is a ranked list of the vertices in the second graph, with the heretofore unknown vertices of interest ideally concentrating at the top of the list. Vertex nomination schemes provide a useful suite of tools for efficiently mining complex networks for pertinent information. In this paper, we explore, both theoretically and practically, the dual roles of content (i.e., edge and vertex attributes) and context (i.e., network topology) in vertex nomination. We provide necessary and sufficient conditions under which vertex nomination schemes that leverage both content and context outperform schemes that leverage only content or context separately. While the joint utility of both content and context has been demonstrated empirically in the literature, the framework presented in this paper provides a novel theoretical basis for understanding the potential complementary roles of network features and topology.
△ Less
Submitted 4 May, 2023; v1 submitted 29 April, 2020;
originally announced May 2020.
-
On a complete and sufficient statistic for the correlated Bernoulli random graph model
Authors:
Donniell E. Fishkind,
Avanti Athreya,
Lingyao Meng,
Vince Lyzinski,
Carey E. Priebe
Abstract:
Inference on vertex-aligned graphs is of wide theoretical and practical importance.There are, however, few flexible and tractable statistical models for correlated graphs, and even fewer comprehensive approaches to parametric inference on data arising from such graphs. In this paper, we consider the correlated Bernoulli random graph model (allowing different Bernoulli coefficients and edge correla…
▽ More
Inference on vertex-aligned graphs is of wide theoretical and practical importance.There are, however, few flexible and tractable statistical models for correlated graphs, and even fewer comprehensive approaches to parametric inference on data arising from such graphs. In this paper, we consider the correlated Bernoulli random graph model (allowing different Bernoulli coefficients and edge correlations for different pairs of vertices), and we introduce a new variance-reducing technique -- called \emph{balancing} -- that can refine estimators for model parameters. Specifically, we construct a disagreement statistic and show that it is complete and sufficient; balancing can be interpreted as Rao-Blackwellization with this disagreement statistic. We show that for unbiased estimators of functions of model parameters, balancing generates uniformly minimum variance unbiased estimators (UMVUEs). However, even when unbiased estimators for model parameters do {\em not} exist -- which, as we prove, is the case with both the heterogeneity correlation and the total correlation parameters -- balancing is still useful, and lowers mean squared error. In particular, we demonstrate how balancing can improve the efficiency of the alignment strength estimator for the total correlation, a parameter that plays a critical role in graph matchability and graph matching runtime complexity.
△ Less
Submitted 30 March, 2021; v1 submitted 23 February, 2020;
originally announced February 2020.
-
Graph matching between bipartite and unipartite networks: to collapse, or not to collapse, that is the question
Authors:
Jesús Arroyo,
Carey E. Priebe,
Vince Lyzinski
Abstract:
Graph matching consists of aligning the vertices of two unlabeled graphs in order to maximize the shared structure across networks; when the graphs are unipartite, this is commonly formulated as minimizing their edge disagreements. In this paper, we address the common setting in which one of the graphs to match is a bipartite network and one is unipartite. Commonly, the bipartite networks are coll…
▽ More
Graph matching consists of aligning the vertices of two unlabeled graphs in order to maximize the shared structure across networks; when the graphs are unipartite, this is commonly formulated as minimizing their edge disagreements. In this paper, we address the common setting in which one of the graphs to match is a bipartite network and one is unipartite. Commonly, the bipartite networks are collapsed or projected into a unipartite graph, and graph matching proceeds as in the classical setting. This potentially leads to noisy edge estimates and loss of information. We formulate the graph matching problem between a bipartite and a unipartite graph using an undirected graphical model, and introduce methods to find the alignment with this model without collapsing. We theoretically demonstrate that our methodology is consistent, and provide non-asymptotic conditions that ensure exact recovery of the matching solution. In simulations and real data examples, we show how our methods can result in a more accurate matching than the naive approach of transforming the bipartite networks into unipartite, and we demonstrate the performance gains achieved by our method in simulated and real data networks, including a co-authorship-citation network pair, and brain structural and functional data.
△ Less
Submitted 12 April, 2021; v1 submitted 5 February, 2020;
originally announced February 2020.
-
Multiplex graph matching matched filters
Authors:
Konstantinos Pantazis,
Daniel L. Sussman,
Youngser Park,
Zhirui Li,
Carey E. Priebe,
Vince Lyzinski
Abstract:
We consider the problem of detecting a noisy induced multiplex template network in a larger multiplex background network. Our approach, which extends the framework of Sussman et al. (2019) to the multiplex setting, leverages a multiplex analogue of the classical graph matching problem to use the template as a matched filter for efficiently searching the background for candidate template matches. T…
▽ More
We consider the problem of detecting a noisy induced multiplex template network in a larger multiplex background network. Our approach, which extends the framework of Sussman et al. (2019) to the multiplex setting, leverages a multiplex analogue of the classical graph matching problem to use the template as a matched filter for efficiently searching the background for candidate template matches. The effectiveness of our approach is demonstrated both theoretically and empirically, with particular attention paid to the potential benefits of considering multiple channels.
△ Less
Submitted 3 December, 2021; v1 submitted 22 July, 2019;
originally announced August 2019.
-
Vertex Nomination, Consistent Estimation, and Adversarial Modification
Authors:
Joshua Agterberg,
Youngser Park,
Jonathan Larson,
Christopher White,
Carey E. Priebe,
Vince Lyzinski
Abstract:
Given a pair of graphs $G_1$ and $G_2$ and a vertex set of interest in $G_1$, the vertex nomination (VN) problem seeks to find the corresponding vertices of interest in $G_2$ (if they exist) and produce a rank list of the vertices in $G_2$, with the corresponding vertices of interest in $G_2$ concentrating, ideally, at the top of the rank list. In this paper, we define and derive the analogue of B…
▽ More
Given a pair of graphs $G_1$ and $G_2$ and a vertex set of interest in $G_1$, the vertex nomination (VN) problem seeks to find the corresponding vertices of interest in $G_2$ (if they exist) and produce a rank list of the vertices in $G_2$, with the corresponding vertices of interest in $G_2$ concentrating, ideally, at the top of the rank list. In this paper, we define and derive the analogue of Bayes optimality for VN with multiple vertices of interest, and we define the notion of maximal consistency classes in vertex nomination. This theory forms the foundation for a novel VN adversarial contamination model, and we demonstrate with real and simulated data that there are VN schemes that perform effectively in the uncontaminated setting, and adversarial network contamination adversely impacts the performance of our VN scheme. We further define a network regularization method for mitigating the impact of the adversarial contamination, and we demonstrate the effectiveness of regularization in both real and synthetic data.
△ Less
Submitted 14 April, 2020; v1 submitted 5 May, 2019;
originally announced May 2019.
-
Maximum Likelihood Estimation and Graph Matching in Errorfully Observed Networks
Authors:
Jesús Arroyo,
Daniel L. Sussman,
Carey E. Priebe,
Vince Lyzinski
Abstract:
Given a pair of graphs with the same number of vertices, the inexact graph matching problem consists in finding a correspondence between the vertices of these graphs that minimizes the total number of induced edge disagreements. We study this problem from a statistical framework in which one of the graphs is an errorfully observed copy of the other. We introduce a corrupting channel model, and sho…
▽ More
Given a pair of graphs with the same number of vertices, the inexact graph matching problem consists in finding a correspondence between the vertices of these graphs that minimizes the total number of induced edge disagreements. We study this problem from a statistical framework in which one of the graphs is an errorfully observed copy of the other. We introduce a corrupting channel model, and show that in this model framework, the solution to the graph matching problem is a maximum likelihood estimator. Necessary and sufficient conditions for consistency of this MLE are presented, as well as a relaxed notion of consistency in which a negligible fraction of the vertices need not be matched correctly. The results are used to study matchability in several families of random graphs, including edge independent models, random regular graphs and small-world networks. We also use these results to introduce measures of matching feasibility, and experimentally validate the results on simulated and real-world networks.
△ Less
Submitted 2 July, 2020; v1 submitted 26 December, 2018;
originally announced December 2018.
-
Alignment Strength and Correlation for Graphs
Authors:
Donniell E. Fishkind,
Lingyao Meng,
Ao Sun,
Carey E. Priebe,
Vince Lyzinski
Abstract:
When two graphs have a correlated Bernoulli distribution, we prove that the alignment strength of their natural bijection strongly converges to a novel measure of graph correlation $ρ_T$ that neatly combines intergraph with intragraph distribution parameters. Within broad families of the random graph parameter settings, we illustrate that exact graph matching runtime and also matchability are both…
▽ More
When two graphs have a correlated Bernoulli distribution, we prove that the alignment strength of their natural bijection strongly converges to a novel measure of graph correlation $ρ_T$ that neatly combines intergraph with intragraph distribution parameters. Within broad families of the random graph parameter settings, we illustrate that exact graph matching runtime and also matchability are both functions of $ρ_T$, with thresholding behavior starkly illustrated in matchability.
△ Less
Submitted 17 January, 2020; v1 submitted 25 August, 2018;
originally announced August 2018.
-
On a 'Two Truths' Phenomenon in Spectral Graph Clustering
Authors:
Carey E. Priebe,
Youngser Park,
Joshua T. Vogelstein,
John M. Conroy,
Vince Lyzinski,
Minh Tang,
Avanti Athreya,
Joshua Cape,
Eric Bridgeford
Abstract:
Clustering is concerned with coherently grouping observations without any explicit concept of true groupings. Spectral graph clustering - clustering the vertices of a graph based on their spectral embedding - is commonly approached via K-means (or, more generally, Gaussian mixture model) clustering composed with either Laplacian or Adjacency spectral embedding (LSE or ASE). Recent theoretical resu…
▽ More
Clustering is concerned with coherently grouping observations without any explicit concept of true groupings. Spectral graph clustering - clustering the vertices of a graph based on their spectral embedding - is commonly approached via K-means (or, more generally, Gaussian mixture model) clustering composed with either Laplacian or Adjacency spectral embedding (LSE or ASE). Recent theoretical results provide new understanding of the problem and solutions, and lead us to a 'Two Truths' LSE vs. ASE spectral graph clustering phenomenon convincingly illustrated here via a diffusion MRI connectome data set: the different embedding methods yield different clustering results, with LSE capturing left hemisphere/right hemisphere affinity structure and ASE capturing gray matter/white matter core-periphery structure.
△ Less
Submitted 11 February, 2019; v1 submitted 23 August, 2018;
originally announced August 2018.
-
Tractable Graph Matching via Soft Seeding
Authors:
Fei Fang,
Daniel L. Sussman,
Vince Lyzinski
Abstract:
The graph matching problem aims to discover a latent correspondence between the vertex sets of two observed graphs. This problem has proven to be quite challenging, with few satisfying methods that are computationally tractable and widely applicable. The FAQ algorithm has proven to have good performance on benchmark problems and works with a indefinite relaxation of the problem. Due to the indefin…
▽ More
The graph matching problem aims to discover a latent correspondence between the vertex sets of two observed graphs. This problem has proven to be quite challenging, with few satisfying methods that are computationally tractable and widely applicable. The FAQ algorithm has proven to have good performance on benchmark problems and works with a indefinite relaxation of the problem. Due to the indefinite relaxation, FAQ is not guaranteed to find the global maximum. However, if prior information is known about the true correspondence, this can be leveraged to initialize the algorithm near the truth. We show that given certain properties of this initial matrix, with high probability the FAQ algorithm will converge in two steps to the truth under a flexible model for pairs of random graphs. Importantly, this result implies that there will be no local optima near the global optima, providing a method to assess performance.
△ Less
Submitted 24 July, 2018;
originally announced July 2018.
-
Matched Filters for Noisy Induced Subgraph Detection
Authors:
Daniel L. Sussman,
Youngser Park,
Carey E. Priebe,
Vince Lyzinski
Abstract:
The problem of finding the vertex correspondence between two noisy graphs with different number of vertices where the smaller graph is still large has many applications in social networks, neuroscience, and computer vision. We propose a solution to this problem via a graph matching matched filter: centering and padding the smaller adjacency matrix and applying graph matching methods to align it to…
▽ More
The problem of finding the vertex correspondence between two noisy graphs with different number of vertices where the smaller graph is still large has many applications in social networks, neuroscience, and computer vision. We propose a solution to this problem via a graph matching matched filter: centering and padding the smaller adjacency matrix and applying graph matching methods to align it to the larger network. The centering and padding schemes can be incorporated into any algorithm that matches using adjacency matrices. Under a statistical model for correlated pairs of graphs, which yields a noisy copy of the small graph within the larger graph, the resulting optimization problem can be guaranteed to recover the true vertex correspondence between the networks.
However, there are currently no efficient algorithms for solving this problem. To illustrate the possibilities and challenges of such problems, we use an algorithm that can exploit a partially known correspondence and show via varied simulations and applications to {\it Drosophila} and human connectomes that this approach can achieve good performance.
△ Less
Submitted 1 July, 2019; v1 submitted 6 March, 2018;
originally announced March 2018.
-
Vertex nomination: The canonical sampling and the extended spectral nomination schemes
Authors:
Jordan Yoder,
Li Chen,
Henry Pao,
Eric Bridgeford,
Keith Levin,
Donniell Fishkind,
Carey Priebe,
Vince Lyzinski
Abstract:
Suppose that one particular block in a stochastic block model is of interest, but block labels are only observed for a few of the vertices in the network. Utilizing a graph realized from the model and the observed block labels, the vertex nomination task is to order the vertices with unobserved block labels into a ranked nomination list with the goal of having an abundance of interesting vertices…
▽ More
Suppose that one particular block in a stochastic block model is of interest, but block labels are only observed for a few of the vertices in the network. Utilizing a graph realized from the model and the observed block labels, the vertex nomination task is to order the vertices with unobserved block labels into a ranked nomination list with the goal of having an abundance of interesting vertices near the top of the list. There are vertex nomination schemes in the literature, including the optimally precise canonical nomination scheme~$\mathcal{L}^C$ and the consistent spectral partitioning nomination scheme~$\mathcal{L}^P$. While the canonical nomination scheme $\mathcal{L}^C$ is provably optimally precise, it is computationally intractable, being impractical to implement even on modestly sized graphs.
With this in mind, an approximation of the canonical scheme---denoted the {\it canonical sampling nomination scheme} $\mathcal{L}^{CS}$---is introduced; $\mathcal{L}^{CS}$ relies on a scalable, Markov chain Monte Carlo-based approximation of $\mathcal{L}^{C}$, and converges to $\mathcal{L}^{C}$ as the amount of sampling goes to infinity. The spectral partitioning nomination scheme is also extended to the {\it extended spectral partitioning nomination scheme}, $\mathcal{L}^{EP}$, which introduces a novel semisupervised clustering framework to improve upon the precision of $\mathcal{L}^P$. Real-data and simulation experiments are employed to illustrate the precision of these vertex nomination schemes, as well as their empirical computational complexity.
Keywords: vertex nomination, Markov chain Monte Carlo, spectral partitioning, Mclust
MSC[2010]: 60J22, 65C40, 62H30, 62H25
△ Less
Submitted 22 January, 2020; v1 submitted 14 February, 2018;
originally announced February 2018.
-
On consistent vertex nomination schemes
Authors:
Vince Lyzinski,
Keith Levin,
Carey E. Priebe
Abstract:
Given a vertex of interest in a network $G_1$, the vertex nomination problem seeks to find the corresponding vertex of interest (if it exists) in a second network $G_2$. A vertex nomination scheme produces a list of the vertices in $G_2$, ranked according to how likely they are judged to be the corresponding vertex of interest in $G_2$. The vertex nomination problem and related information retriev…
▽ More
Given a vertex of interest in a network $G_1$, the vertex nomination problem seeks to find the corresponding vertex of interest (if it exists) in a second network $G_2$. A vertex nomination scheme produces a list of the vertices in $G_2$, ranked according to how likely they are judged to be the corresponding vertex of interest in $G_2$. The vertex nomination problem and related information retrieval tasks have attracted much attention in the machine learning literature, with numerous applications to social and biological networks. However, the current framework has often been confined to a comparatively small class of network models, and the concept of statistically consistent vertex nomination schemes has been only shallowly explored. In this paper, we extend the vertex nomination problem to a very general statistical model of graphs. Further, drawing inspiration from the long-established classification framework in the pattern recognition literature, we provide definitions for the key notions of Bayes optimality and consistency in our extended vertex nomination framework, including a derivation of the Bayes optimal vertex nomination scheme. In addition, we prove that no universally consistent vertex nomination schemes exist. Illustrative examples are provided throughout.
△ Less
Submitted 9 December, 2018; v1 submitted 15 November, 2017;
originally announced November 2017.
-
Statistical inference on random dot product graphs: a survey
Authors:
Avanti Athreya,
Donniell E. Fishkind,
Keith Levin,
Vince Lyzinski,
Youngser Park,
Yichen Qin,
Daniel L. Sussman,
Minh Tang,
Joshua T. Vogelstein,
Carey E. Priebe
Abstract:
The random dot product graph (RDPG) is an independent-edge random graph that is analytically tractable and, simultaneously, either encompasses or can successfully approximate a wide range of random graphs, from relatively simple stochastic block models to complex latent position graphs. In this survey paper, we describe a comprehensive paradigm for statistical inference on random dot product graph…
▽ More
The random dot product graph (RDPG) is an independent-edge random graph that is analytically tractable and, simultaneously, either encompasses or can successfully approximate a wide range of random graphs, from relatively simple stochastic block models to complex latent position graphs. In this survey paper, we describe a comprehensive paradigm for statistical inference on random dot product graphs, a paradigm centered on spectral embeddings of adjacency and Laplacian matrices. We examine the analogues, in graph inference, of several canonical tenets of classical Euclidean inference: in particular, we summarize a body of existing results on the consistency and asymptotic normality of the adjacency and Laplacian spectral embeddings, and the role these spectral embeddings can play in the construction of single- and multi-sample hypothesis tests for graph data. We investigate several real-world applications, including community detection and classification in large social networks and the determination of functional and biologically relevant network properties from an exploratory data analysis of the Drosophila connectome. We outline requisite background and current open problems in spectral graph inference.
△ Less
Submitted 16 September, 2017;
originally announced September 2017.
-
A central limit theorem for an omnibus embedding of multiple random graphs and implications for multiscale network inference
Authors:
Keith Levin,
Avanti Athreya,
Minh Tang,
Vince Lyzinski,
Youngser Park,
Carey E. Priebe
Abstract:
Performing statistical analyses on collections of graphs is of import to many disciplines, but principled, scalable methods for multi-sample graph inference are few. Here we describe an "omnibus" embedding in which multiple graphs on the same vertex set are jointly embedded into a single space with a distinct representation for each graph. We prove a central limit theorem for this embedding and de…
▽ More
Performing statistical analyses on collections of graphs is of import to many disciplines, but principled, scalable methods for multi-sample graph inference are few. Here we describe an "omnibus" embedding in which multiple graphs on the same vertex set are jointly embedded into a single space with a distinct representation for each graph. We prove a central limit theorem for this embedding and demonstrate how it streamlines graph comparison, obviating the need for pairwise subspace alignments. The omnibus embedding achieves near-optimal inference accuracy when graphs arise from a common distribution and yet retains discriminatory power as a test procedure for the comparison of different graphs. Moreover, this joint embedding and the accompanying central limit theorem are important for answering multiscale graph inference questions, such as the identification of specific subgraphs or vertices responsible for similarity or difference across networks. We illustrate this with a pair of analyses of connectome data derived from dMRI and fMRI scans of human subjects. In particular, we show that this embedding allows the identification of specific brain regions associated with population-level differences. Finally, we sketch how the omnibus embedding can be used to address pressing open problems, both theoretical and practical, in multisample graph inference.
△ Less
Submitted 25 June, 2019; v1 submitted 25 May, 2017;
originally announced May 2017.
-
Semiparametric spectral modeling of the Drosophila connectome
Authors:
Carey E. Priebe,
Youngser Park,
Minh Tang,
Avanti Athreya,
Vince Lyzinski,
Joshua T. Vogelstein,
Yichen Qin,
Ben Cocanougher,
Katharina Eichler,
Marta Zlatic,
Albert Cardona
Abstract:
We present semiparametric spectral modeling of the complete larval Drosophila mushroom body connectome. Motivated by a thorough exploratory data analysis of the network via Gaussian mixture modeling (GMM) in the adjacency spectral embedding (ASE) representation space, we introduce the latent structure model (LSM) for network modeling and inference. LSM is a generalization of the stochastic block m…
▽ More
We present semiparametric spectral modeling of the complete larval Drosophila mushroom body connectome. Motivated by a thorough exploratory data analysis of the network via Gaussian mixture modeling (GMM) in the adjacency spectral embedding (ASE) representation space, we introduce the latent structure model (LSM) for network modeling and inference. LSM is a generalization of the stochastic block model (SBM) and a special case of the random dot product graph (RDPG) latent position model, and is amenable to semiparametric GMM in the ASE representation space. The resulting connectome code derived via semiparametric GMM composed with ASE captures latent connectome structure and elucidates biologically relevant neuronal properties.
△ Less
Submitted 9 May, 2017;
originally announced May 2017.
-
Matchability of heterogeneous networks pairs
Authors:
Vince Lyzinski,
Daniel L. Sussman
Abstract:
We consider the problem of graph matchability in non-identically distributed networks. In a general class of edge-independent networks, we demonstrate that graph matchability can be lost with high probability when matching the networks directly. We further demonstrate that under mild model assumptions, matchability is almost perfectly recovered by centering the networks using Universal Singular Va…
▽ More
We consider the problem of graph matchability in non-identically distributed networks. In a general class of edge-independent networks, we demonstrate that graph matchability can be lost with high probability when matching the networks directly. We further demonstrate that under mild model assumptions, matchability is almost perfectly recovered by centering the networks using Universal Singular Value Thresholding before matching. These theoretical results are then demonstrated in both real and synthetic simulation settings. We also recover analogous core-matchability results in a very general core-junk network model, wherein some vertices do not correspond between the graph pair.
△ Less
Submitted 20 March, 2019; v1 submitted 5 May, 2017;
originally announced May 2017.
-
Vertex Nomination Via Seeded Graph Matching
Authors:
Heather G. Patsolic,
Youngser Park,
Vince Lyzinski,
Carey E. Priebe
Abstract:
Consider two networks on overlapping, non-identical vertex sets. Given vertices of interest in the first network, we seek to identify the corresponding vertices, if any exist, in the second network. While in moderately sized networks graph matching methods can be applied directly to recover the missing correspondences, herein we present a principled methodology appropriate for situations in which…
▽ More
Consider two networks on overlapping, non-identical vertex sets. Given vertices of interest in the first network, we seek to identify the corresponding vertices, if any exist, in the second network. While in moderately sized networks graph matching methods can be applied directly to recover the missing correspondences, herein we present a principled methodology appropriate for situations in which the networks are too large for brute-force graph matching. Our methodology identifies vertices in a local neighborhood of the vertices of interest in the first network that have verifiable corresponding vertices in the second network. Leveraging these known correspondences, referred to as seeds, we match the induced subgraphs in each network generated by the neighborhoods of these verified seeds, and rank the vertices of the second network in terms of the most likely matches to the original vertices of interest. We demonstrate the applicability of our methodology through simulations and real data examples.
△ Less
Submitted 5 November, 2019; v1 submitted 1 May, 2017;
originally announced May 2017.
-
Numerical tolerance for spectral decompositions of random matrices
Authors:
Avanti Athreya,
Michael Kane,
Bryan Lewis,
Zachary Lubberts,
Vince Lyzinski,
Youngser Park,
Carey E. Priebe,
Minh Tang
Abstract:
We precisely quantify the impact of statistical error in the quality of a numerical approximation to a random matrix eigendecomposition, and under mild conditions, we use this to introduce an optimal numerical tolerance for residual error in spectral decompositions of random matrices. We demonstrate that terminating an eigendecomposition algorithm when the numerical error and statistical error are…
▽ More
We precisely quantify the impact of statistical error in the quality of a numerical approximation to a random matrix eigendecomposition, and under mild conditions, we use this to introduce an optimal numerical tolerance for residual error in spectral decompositions of random matrices. We demonstrate that terminating an eigendecomposition algorithm when the numerical error and statistical error are of the same order results in computational savings with no loss of accuracy. We also repair a flaw in a ubiquitous termination condition, one in wide employ in several computational linear algebra implementations. We illustrate the practical consequences of our stopping criterion with an analysis of simulated and real networks. Our theoretical results and real-data examples establish that the tradeoff between statistical and numerical error is of significant import for data science.
△ Less
Submitted 30 January, 2020; v1 submitted 1 August, 2016;
originally announced August 2016.
-
On the Consistency of the Likelihood Maximization Vertex Nomination Scheme: Bridging the Gap Between Maximum Likelihood Estimation and Graph Matching
Authors:
Vince Lyzinski,
Keith Levin,
Donniell E. Fishkind,
Carey E. Priebe
Abstract:
Given a graph in which a few vertices are deemed interesting a priori, the vertex nomination task is to order the remaining vertices into a nomination list such that there is a concentration of interesting vertices at the top of the list. Previous work has yielded several approaches to this problem, with theoretical results in the setting where the graph is drawn from a stochastic block model (SBM…
▽ More
Given a graph in which a few vertices are deemed interesting a priori, the vertex nomination task is to order the remaining vertices into a nomination list such that there is a concentration of interesting vertices at the top of the list. Previous work has yielded several approaches to this problem, with theoretical results in the setting where the graph is drawn from a stochastic block model (SBM), including a vertex nomination analogue of the Bayes optimal classifier. In this paper, we prove that maximum likelihood (ML)-based vertex nomination is consistent, in the sense that the performance of the ML-based scheme asymptotically matches that of the Bayes optimal scheme. We prove theorems of this form both when model parameters are known and unknown. Additionally, we introduce and prove consistency of a related, more scalable restricted-focus ML vertex nomination scheme. Finally, we incorporate vertex and edge features into ML-based vertex nomination and briefly explore the empirical effectiveness of this approach.
△ Less
Submitted 27 August, 2016; v1 submitted 5 July, 2016;
originally announced July 2016.
-
Information Recovery in Shuffled Graphs via Graph Matching
Authors:
Vince Lyzinski
Abstract:
While many multiple graph inference methodologies operate under the implicit assumption that an explicit vertex correspondence is known across the vertex sets of the graphs, in practice these correspondences may only be partially or errorfully known. Herein, we provide an information theoretic foundation for understanding the practical impact that errorfully observed vertex correspondences can hav…
▽ More
While many multiple graph inference methodologies operate under the implicit assumption that an explicit vertex correspondence is known across the vertex sets of the graphs, in practice these correspondences may only be partially or errorfully known. Herein, we provide an information theoretic foundation for understanding the practical impact that errorfully observed vertex correspondences can have on subsequent inference, and the capacity of graph matching methods to recover the lost vertex alignment and inferential performance. Working in the correlated stochastic blockmodel setting, we establish a duality between the loss of mutual information due to an errorfully observed vertex correspondence and the ability of graph matching algorithms to recover the true correspondence across graphs. In the process, we establish a phase transition for graph matchability in terms of the correlation across graphs, and we conjecture the analogous phase transition for the relative information loss due to shuffling vertex labels. We demonstrate the practical effect that graph shuffling---and matching---can have on subsequent inference, with examples from two sample graph hypothesis testing and joint spectral graph clustering.
△ Less
Submitted 27 September, 2017; v1 submitted 8 May, 2016;
originally announced May 2016.
-
Laplacian Eigenmaps from Sparse, Noisy Similarity Measurements
Authors:
Keith Levin,
Vince Lyzinski
Abstract:
Manifold learning and dimensionality reduction techniques are ubiquitous in science and engineering, but can be computationally expensive procedures when applied to large data sets or when similarities are expensive to compute. To date, little work has been done to investigate the tradeoff between computational resources and the quality of learned representations. We present both theoretical and e…
▽ More
Manifold learning and dimensionality reduction techniques are ubiquitous in science and engineering, but can be computationally expensive procedures when applied to large data sets or when similarities are expensive to compute. To date, little work has been done to investigate the tradeoff between computational resources and the quality of learned representations. We present both theoretical and experimental explorations of this question. In particular, we consider Laplacian eigenmaps embeddings based on a kernel matrix, and explore how the embeddings behave when this kernel matrix is corrupted by occlusion and noise. Our main theoretical result shows that under modest noise and occlusion assumptions, we can (with high probability) recover a good approximation to the Laplacian eigenmaps embedding based on the uncorrupted kernel matrix. Our results also show how regularization can aid this approximation. Experimentally, we explore the effects of noise and occlusion on Laplacian eigenmaps embeddings of two real-world data sets, one from speech processing and one from neuroscience, as well as a synthetic data set.
△ Less
Submitted 16 August, 2016; v1 submitted 12 March, 2016;
originally announced March 2016.
-
Semi-External Memory Sparse Matrix Multiplication for Billion-Node Graphs
Authors:
Da Zheng,
Disa Mhembere,
Vince Lyzinski,
Joshua Vogelstein,
Carey E. Priebe,
Randal Burns
Abstract:
Sparse matrix multiplication is traditionally performed in memory and scales to large matrices using the distributed memory of multiple nodes. In contrast, we scale sparse matrix multiplication beyond memory capacity by implementing sparse matrix dense matrix multiplication (SpMM) in a semi-external memory (SEM) fashion; i.e., we keep the sparse matrix on commodity SSDs and dense matrices in memor…
▽ More
Sparse matrix multiplication is traditionally performed in memory and scales to large matrices using the distributed memory of multiple nodes. In contrast, we scale sparse matrix multiplication beyond memory capacity by implementing sparse matrix dense matrix multiplication (SpMM) in a semi-external memory (SEM) fashion; i.e., we keep the sparse matrix on commodity SSDs and dense matrices in memory. Our SEM-SpMM incorporates many in-memory optimizations for large power-law graphs. It outperforms the in-memory implementations of Trilinos and Intel MKL and scales to billion-node graphs, far beyond the limitations of memory. Furthermore, on a single large parallel machine, our SEM-SpMM operates as fast as the distributed implementations of Trilinos using five times as much processing power. We also run our implementation in memory (IM-SpMM) to quantify the overhead of keeping data on SSDs. SEM-SpMM achieves almost 100% performance of IM-SpMM on graphs when the dense matrix has more than four columns; it achieves at least 65% performance of IM-SpMM on all inputs. We apply our SpMM to three important data analysis tasks--PageRank, eigensolving, and non-negative matrix factorization--and show that our SEM implementations significantly advance the state of the art.
△ Less
Submitted 14 October, 2016; v1 submitted 9 February, 2016;
originally announced February 2016.
-
Scalable Out-of-Sample Extension of Graph Embeddings Using Deep Neural Networks
Authors:
Aren Jansen,
Gregory Sell,
Vince Lyzinski
Abstract:
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In…
▽ More
Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
△ Less
Submitted 14 June, 2016; v1 submitted 18 August, 2015;
originally announced August 2015.
-
A Joint Graph Inference Case Study: the C.elegans Chemical and Electrical Connectomes
Authors:
Li Chen,
Joshua T. Vogelstein,
Vince Lyzinski,
Carey E. Priebe
Abstract:
We investigate joint graph inference for the chemical and electrical connectomes of the \textit{Caenorhabditis elegans} roundworm. The \textit{C.elegans} connectomes consist of $253$ non-isolated neurons with known functional attributes, and there are two types of synaptic connectomes, resulting in a pair of graphs. We formulate our joint graph inference from the perspectives of seeded graph match…
▽ More
We investigate joint graph inference for the chemical and electrical connectomes of the \textit{Caenorhabditis elegans} roundworm. The \textit{C.elegans} connectomes consist of $253$ non-isolated neurons with known functional attributes, and there are two types of synaptic connectomes, resulting in a pair of graphs. We formulate our joint graph inference from the perspectives of seeded graph matching and joint vertex classification. Our results suggest that connectomic inference should proceed in the joint space of the two connectomes, which has significant neuroscientific implications.
△ Less
Submitted 5 August, 2015; v1 submitted 30 July, 2015;
originally announced July 2015.
-
Community Detection and Classification in Hierarchical Stochastic Blockmodels
Authors:
Vince Lyzinski,
Minh Tang,
Avanti Athreya,
Youngser Park,
Carey E. Priebe
Abstract:
We propose a robust, scalable, integrated methodology for community detection and community comparison in graphs. In our procedure, we first embed a graph into an appropriate Euclidean space to obtain a low-dimensional representation, and then cluster the vertices into communities. We next employ nonparametric graph inference techniques to identify structural similarity among these communities. Th…
▽ More
We propose a robust, scalable, integrated methodology for community detection and community comparison in graphs. In our procedure, we first embed a graph into an appropriate Euclidean space to obtain a low-dimensional representation, and then cluster the vertices into communities. We next employ nonparametric graph inference techniques to identify structural similarity among these communities. These two steps are then applied recursively on the communities, allowing us to detect more fine-grained structure. We describe a hierarchical stochastic blockmodel---namely, a stochastic blockmodel with a natural hierarchical structure---and establish conditions under which our algorithm yields consistent estimates of model parameters and motifs, which we define to be stochastically similar groups of subgraphs. Finally, we demonstrate the effectiveness of our algorithm in both simulated and real data. Specifically, we address the problem of locating similar subcommunities in a partially reconstructed Drosophila connectome and in the social network Friendster.
△ Less
Submitted 25 August, 2016; v1 submitted 6 March, 2015;
originally announced March 2015.
-
Fast Embedding for JOFC Using the Raw Stress Criterion
Authors:
Vince Lyzinski,
Youngser Park,
Carey E. Priebe,
Michael W. Trosset
Abstract:
The Joint Optimization of Fidelity and Commensurability (JOFC) manifold matching methodology embeds an omnibus dissimilarity matrix consisting of multiple dissimilarities on the same set of objects. One approach to this embedding optimizes the preservation of fidelity to each individual dissimilarity matrix together with commensurability of each given observation across modalities via iterative ma…
▽ More
The Joint Optimization of Fidelity and Commensurability (JOFC) manifold matching methodology embeds an omnibus dissimilarity matrix consisting of multiple dissimilarities on the same set of objects. One approach to this embedding optimizes the preservation of fidelity to each individual dissimilarity matrix together with commensurability of each given observation across modalities via iterative majorization of a raw stress error criterion by successive Guttman transforms. In this paper, we exploit the special structure inherent to JOFC to exactly and efficiently compute the successive Guttman transforms, and as a result we are able to greatly speed up the JOFC procedure for both in-sample and out-of-sample embedding. We demonstrate the scalability of our implementation on both real and simulated data examples.
△ Less
Submitted 31 October, 2016; v1 submitted 11 February, 2015;
originally announced February 2015.
-
A nonparametric two-sample hypothesis testing problem for random dot product graphs
Authors:
Minh Tang,
Avanti Athreya,
Daniel L. Sussman,
Vince Lyzinski,
Carey E. Priebe
Abstract:
We consider the problem of testing whether two finite-dimensional random dot product graphs have generating latent positions that are independently drawn from the same distribution, or distributions that are related via scaling or projection. We propose a test statistic that is a kernel-based function of the adjacency spectral embedding for each graph. We obtain a limiting distribution for our tes…
▽ More
We consider the problem of testing whether two finite-dimensional random dot product graphs have generating latent positions that are independently drawn from the same distribution, or distributions that are related via scaling or projection. We propose a test statistic that is a kernel-based function of the adjacency spectral embedding for each graph. We obtain a limiting distribution for our test statistic under the null and we show that our test procedure is consistent across a broad range of alternatives.
△ Less
Submitted 12 November, 2015; v1 submitted 8 September, 2014;
originally announced September 2014.
-
Graph Matching: Relax at Your Own Risk
Authors:
Vince Lyzinski,
Donniell Fishkind,
Marcelo Fiori,
Joshua T. Vogelstein,
Carey E. Priebe,
Guillermo Sapiro
Abstract:
Graph matching---aligning a pair of graphs to minimize their edge disagreements---has received wide-spread attention from both theoretical and applied communities over the past several decades, including combinatorics, computer vision, and connectomics. Its attention can be partially attributed to its computational difficulty. Although many heuristics have previously been proposed in the literatur…
▽ More
Graph matching---aligning a pair of graphs to minimize their edge disagreements---has received wide-spread attention from both theoretical and applied communities over the past several decades, including combinatorics, computer vision, and connectomics. Its attention can be partially attributed to its computational difficulty. Although many heuristics have previously been proposed in the literature to approximately solve graph matching, very few have any theoretical support for their performance. A common technique is to relax the discrete problem to a continuous problem, therefore enabling practitioners to bring gradient-descent-type algorithms to bear. We prove that an indefinite relaxation (when solved exactly) almost always discovers the optimal permutation, while a common convex relaxation almost always fails to discover the optimal permutation. These theoretical results suggest that initializing the indefinite algorithm with the convex optimum might yield improved practical performance. Indeed, experimental results illuminate and corroborate these theoretical findings, demonstrating that excellent results are achieved in both benchmark and real data problems by amalgamating the two approaches.
△ Less
Submitted 9 January, 2015; v1 submitted 13 May, 2014;
originally announced May 2014.
-
A semiparametric two-sample hypothesis testing problem for random dot product graphs
Authors:
Minh Tang,
Avanti Athreya,
Daniel L. Sussman,
Vince Lyzinski,
Carey E. Priebe
Abstract:
Two-sample hypothesis testing for random graphs arises naturally in neuroscience, social networks, and machine learning. In this paper, we consider a semiparametric problem of two-sample hypothesis testing for a class of latent position random graphs. We formulate a notion of consistency in this context and propose a valid test for the hypothesis that two finite-dimensional random dot product grap…
▽ More
Two-sample hypothesis testing for random graphs arises naturally in neuroscience, social networks, and machine learning. In this paper, we consider a semiparametric problem of two-sample hypothesis testing for a class of latent position random graphs. We formulate a notion of consistency in this context and propose a valid test for the hypothesis that two finite-dimensional random dot product graphs on a common vertex set have the same generating latent positions or have generating latent positions that are scaled or diagonal transformations of one another. Our test statistic is a function of a spectral decomposition of the adjacency matrix for each graph and our test procedure is consistent across a broad range of alternatives. We apply our test procedure to real biological data: in a test-retest data set of neural connectome graphs, we are able to distinguish between scans from different subjects; and in the {\em C.elegans} connectome, we are able to distinguish between chemical and electrical networks. The latter example is a concrete demonstration that our test can have power even for small sample sizes. We conclude by discussing the relationship between our test procedure and generalized likelihood ratio tests.
△ Less
Submitted 18 June, 2015; v1 submitted 27 March, 2014;
originally announced March 2014.
-
Strong Stationary Duality for Diffusion Processes
Authors:
James Allen Fill,
Vince Lyzinski
Abstract:
We develop the theory of strong stationary duality for diffusion processes on compact intervals. We analytically derive the generator and boundary behavior of the dual process and recover a central tenet of the classical Markov chain theory in the diffusion setting by linking the separation distance in the primal diffusion to the absorption time in the dual diffusion. We also exhibit our strong st…
▽ More
We develop the theory of strong stationary duality for diffusion processes on compact intervals. We analytically derive the generator and boundary behavior of the dual process and recover a central tenet of the classical Markov chain theory in the diffusion setting by linking the separation distance in the primal diffusion to the absorption time in the dual diffusion. We also exhibit our strong stationary dual as the natural limiting process of the strong stationary dual sequence of a well chosen sequence of approximating birth-and-death Markov chains, allowing for simultaneous numerical simulations of our primal and dual diffusion processes. Lastly, we show how our new definition of diffusion duality allows the spectral theory of cutoff phenomena to extend naturally from birth-and-death Markov chains to the present diffusion context.
△ Less
Submitted 17 April, 2015; v1 submitted 23 February, 2014;
originally announced February 2014.
-
Seeded Graph Matching Via Joint Optimization of Fidelity and Commensurability
Authors:
Heather Patsolic,
Sancar Adali,
Joshua T. Vogelstein,
Youngser Park,
Carey E. Friebe,
Gongkai Li,
Vince Lyzinski
Abstract:
We present a novel approximate graph matching algorithm that incorporates seeded data into the graph matching paradigm. Our Joint Optimization of Fidelity and Commensurability (JOFC) algorithm embeds two graphs into a common Euclidean space where the matching inference task can be performed. Through real and simulated data examples, we demonstrate the versatility of our algorithm in matching graph…
▽ More
We present a novel approximate graph matching algorithm that incorporates seeded data into the graph matching paradigm. Our Joint Optimization of Fidelity and Commensurability (JOFC) algorithm embeds two graphs into a common Euclidean space where the matching inference task can be performed. Through real and simulated data examples, we demonstrate the versatility of our algorithm in matching graphs with various characteristics--weightedness, directedness, loopiness, many-to-one and many-to-many matchings, and soft seedings.
△ Less
Submitted 8 December, 2019; v1 submitted 15 January, 2014;
originally announced January 2014.
-
Vertex nomination schemes for membership prediction
Authors:
D. E. Fishkind,
V. Lyzinski,
H. Pao,
L. Chen,
C. E. Priebe
Abstract:
Suppose that a graph is realized from a stochastic block model where one of the blocks is of interest, but many or all of the vertices' block labels are unobserved. The task is to order the vertices with unobserved block labels into a ``nomination list'' such that, with high probability, vertices from the interesting block are concentrated near the list's beginning. We propose several vertex nomin…
▽ More
Suppose that a graph is realized from a stochastic block model where one of the blocks is of interest, but many or all of the vertices' block labels are unobserved. The task is to order the vertices with unobserved block labels into a ``nomination list'' such that, with high probability, vertices from the interesting block are concentrated near the list's beginning. We propose several vertex nomination schemes. Our basic - but principled - setting and development yields a best nomination scheme (which is a Bayes-Optimal analogue), and also a likelihood maximization nomination scheme that is practical to implement when there are a thousand vertices, and which is empirically near-optimal when the number of vertices is small enough to allow comparison to the best nomination scheme. We then illustrate the robustness of the likelihood maximization nomination scheme to the modeling challenges inherent in real data, using examples which include a social network involving human trafficking, the Enron Graph, a worm brain connectome and a political blog network.
△ Less
Submitted 17 November, 2015; v1 submitted 9 December, 2013;
originally announced December 2013.
-
Spectral Clustering for Divide-and-Conquer Graph Matching
Authors:
Vince Lyzinski,
Daniel L. Sussman,
Donniell E. Fishkind,
Henry Pao,
Li Chen,
Joshua T. Vogelstein,
Youngser Park,
Carey E. Priebe
Abstract:
We present a parallelized bijective graph matching algorithm that leverages seeds and is designed to match very large graphs. Our algorithm combines spectral graph embedding with existing state-of-the-art seeded graph matching procedures. We justify our approach by proving that modestly correlated, large stochastic block model random graphs are correctly matched utilizing very few seeds through ou…
▽ More
We present a parallelized bijective graph matching algorithm that leverages seeds and is designed to match very large graphs. Our algorithm combines spectral graph embedding with existing state-of-the-art seeded graph matching procedures. We justify our approach by proving that modestly correlated, large stochastic block model random graphs are correctly matched utilizing very few seeds through our divide-and-conquer procedure. We also demonstrate the effectiveness of our approach in matching very large graphs in simulated and real data examples, showing up to a factor of 8 improvement in runtime with minimal sacrifice in accuracy.
△ Less
Submitted 12 March, 2015; v1 submitted 4 October, 2013;
originally announced October 2013.
-
Perfect Clustering for Stochastic Blockmodel Graphs via Adjacency Spectral Embedding
Authors:
Vince Lyzinski,
Daniel Sussman,
Minh Tang,
Avanti Athreya,
Carey Priebe
Abstract:
Vertex clustering in a stochastic blockmodel graph has wide applicability and has been the subject of extensive research. In thispaper, we provide a short proof that the adjacency spectral embedding can be used to obtain perfect clustering for the stochastic blockmodel and the degree-corrected stochastic blockmodel. We also show an analogous result for the more general random dot product graph mod…
▽ More
Vertex clustering in a stochastic blockmodel graph has wide applicability and has been the subject of extensive research. In thispaper, we provide a short proof that the adjacency spectral embedding can be used to obtain perfect clustering for the stochastic blockmodel and the degree-corrected stochastic blockmodel. We also show an analogous result for the more general random dot product graph model.
△ Less
Submitted 15 January, 2015; v1 submitted 1 October, 2013;
originally announced October 2013.
-
A central limit theorem for scaled eigenvectors of random dot product graphs
Authors:
Avanti Athreya,
Vince Lyzinski,
David J. Marchette,
Carey E. Priebe,
Daniel L. Sussman,
Minh Tang
Abstract:
We prove a central limit theorem for the components of the largest eigenvectors of the adjacency matrix of a finite-dimensional random dot product graph whose true latent positions are unknown. In particular, we follow the methodology outlined in \citet{sussman2012universally} to construct consistent estimates for the latent positions, and we show that the appropriately scaled differences between…
▽ More
We prove a central limit theorem for the components of the largest eigenvectors of the adjacency matrix of a finite-dimensional random dot product graph whose true latent positions are unknown. In particular, we follow the methodology outlined in \citet{sussman2012universally} to construct consistent estimates for the latent positions, and we show that the appropriately scaled differences between the estimated and true latent positions converge to a mixture of Gaussian random variables. As a corollary, we obtain a central limit theorem for the first eigenvector of the adjacency matrix of an Erdös-Renyi random graph.
△ Less
Submitted 23 December, 2013; v1 submitted 31 May, 2013;
originally announced May 2013.
-
Seeded graph matching for correlated Erdős-Rényi graphs
Authors:
Vince Lyzinski,
Donniell E. Fishkind,
Carey E. Priebe
Abstract:
Graph matching is an important problem in machine learning and pattern recognition. Herein, we present theoretical and practical results on the consistency of graph matching for estimating a latent alignment function between the vertex sets of two graphs, as well as subsequent algorithmic implications when the latent alignment is partially observed. In the correlated Erdős-Rényi graph setting, we…
▽ More
Graph matching is an important problem in machine learning and pattern recognition. Herein, we present theoretical and practical results on the consistency of graph matching for estimating a latent alignment function between the vertex sets of two graphs, as well as subsequent algorithmic implications when the latent alignment is partially observed. In the correlated Erdős-Rényi graph setting, we prove that graph matching provides a strongly consistent estimate of the latent alignment in the presence of even modest correlation. We then investigate a tractable, restricted-focus version of graph matching, which is only concerned with adjacency involving vertices in a partial observation of the latent alignment; we prove that a logarithmic number of vertices whose alignment is known is sufficient for this restricted-focus version of graph matching to yield a strongly consistent estimate of the latent alignment of the remaining vertices. We show how Frank-Wolfe methodology for approximate graph matching, when there is a partially observed latent alignment, inherently incorporates this restricted focus graph matching. Lastly, we illustrate the relationship between seeded graph matching and restricted-focus graph matching by means of an illuminating example from human connectomics.
△ Less
Submitted 1 August, 2014; v1 submitted 29 April, 2013;
originally announced April 2013.