Collective graph identification
ACM Transactions on Knowledge Discovery from Data (TKDD), 2016•dl.acm.org
Data describing networks—such as communication networks, transaction networks, disease
transmission networks, collaboration networks, etc.—are becoming increasingly available.
While observational data can be useful, it often only hints at the actual underlying process
that governs interactions and attributes. For example, an email communication network
provides insight into its users and their relationships, but is not the same as the “real”
underlying social network. In this article, we introduce the problem of graph identification, ie …
transmission networks, collaboration networks, etc.—are becoming increasingly available.
While observational data can be useful, it often only hints at the actual underlying process
that governs interactions and attributes. For example, an email communication network
provides insight into its users and their relationships, but is not the same as the “real”
underlying social network. In this article, we introduce the problem of graph identification, ie …
Data describing networks—such as communication networks, transaction networks, disease transmission networks, collaboration networks, etc.—are becoming increasingly available. While observational data can be useful, it often only hints at the actual underlying process that governs interactions and attributes. For example, an email communication network provides insight into its users and their relationships, but is not the same as the “real” underlying social network. In this article, we introduce the problem of graph identification, i.e., discovering the latent graph structure underlying an observed network. We cast the problem as a probabilistic inference task, in which we must infer the nodes, edges, and node labels of a hidden graph, based on evidence. This entails solving several canonical problems in network analysis: entity resolution (determining when two observations correspond to the same entity), link prediction (inferring the existence of links), and node labeling (inferring hidden attributes). While each of these subproblems has been well studied in isolation, here we consider them as a single, collective task. We present a simple, yet novel, approach to address all three subproblems simultaneously. Our approach, which we refer to as C3, consists of a collection of Coupled Collective Classifiers that are applied iteratively to propagate inferred information among the subproblems. We consider variants of C3 using different learning and inference techniques and empirically demonstrate that C3 is superior, both in terms of predictive accuracy and running time, to state-of-the-art probabilistic approaches on four real problems.
