Abstract
This study addresses the problem of summarizing a static graph, known as graph summarization, effectively and efficiently. The resulting compact graph is referred to as a summary graph. Based on the minimum description length (MDL) principle, we propose a novel graph summarization algorithm called graph summarization with latent variable probabilistic models (GSL). The MDL principle asserts that the best statistical decision strategy is the one that best compresses the data. The key idea of GSL is to encode the original and summary graphs simultaneously using latent variable probabilistic models with two-part coding: we first encode a summary graph and then encode the original graph given the summary graph. When encoding these graphs, we can use various latent variable probabilistic models, and can therefore encode more complex graph structures than conventional graph summarization algorithms can. We demonstrate the effectiveness of GSL on both synthetic and real datasets.
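To make the two-part coding idea concrete, the following is a minimal sketch, not the authors' implementation: it scores a candidate summary by \(L(S) + L(G \mid S)\), encoding node-to-supernode assignments with a uniform code and each block of the adjacency matrix with a maximum-likelihood Bernoulli code. All names (`two_part_codelength`, `z`) are illustrative assumptions; GSL itself encodes the summary with latent variable probabilistic models rather than this naive uniform code.

```python
import numpy as np

def two_part_codelength(A, z, eps=1e-12):
    """Illustrative total code-length L(G : S) = L(S) + L(G | S).

    A : (n, n) 0/1 adjacency matrix of the original graph G.
    z : (n,) supernode assignment of each node, values in 0..k-1.
    """
    n = A.shape[0]
    k = int(z.max()) + 1
    # L(S): encode each node's supernode with a uniform code over k choices.
    L_summary = n * np.log2(k) if k > 1 else 0.0
    # L(G | S): per-block Bernoulli code with maximum-likelihood densities.
    L_data = 0.0
    for a in range(k):
        for b in range(k):
            block = A[np.ix_(z == a, z == b)]
            m = block.size
            if m == 0:
                continue
            e = block.sum()
            p = min(max(e / m, eps), 1.0 - eps)  # clamp to avoid log(0)
            L_data += -(e * np.log2(p) + (m - e) * np.log2(1.0 - p))
    return L_summary + L_data
```

Minimizing this quantity over candidate assignments z (and over the number of supernodes k) instantiates the MDL criterion reviewed in the appendix.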
References
Beg, M.A., Ahmad, M., Zaman, A., Khan, I.: Scalable approximation algorithm for graph summarization. In: Proceedings of the 22nd Pacific-Asia Conference on Knowledge Discovery and Data Mining, PAKDD 2018, pp. 502–514 (2018)
Fukushima, S., Yamanishi, K.: Hierarchical change detection in latent variable models. In: Proceedings of 2020 IEEE International Conference on Data Mining, ICDM2020, pp. 1128–1134 (2020)
Grünwald, P.: The Minimum Description Length Principle. MIT Press (2007)
Hric, D., Peixoto, T.P., Fortunato, S.: Network structure, metadata, and the prediction of missing nodes and annotations. Phys. Rev. X 6(3), 031038 (2016)
Kontkanen, P., Myllymäki, P.: A linear-time algorithm for computing the multinomial stochastic complexity. Inf. Process. Lett. 103(6), 227–233 (2007)
Koutra, D., Kang, U., Vreeken, J., Faloutsos, C.: VOG: summarizing and understanding large graphs. In: Proceedings of the 2014 SIAM International Conference on Data Mining, SDM 2014, pp. 91–99 (2014)
Lee, K., Jo, H., Ko, J., Lim, S., Shin, K.: SSumM: sparse summarization of massive graphs. In: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2020, pp. 144–154 (2020)
LeFevre, K., Terzi, E.: GraSS: graph structure summarization. In: Proceedings of the 2010 SIAM International Conference on Data Mining, SDM 2010, pp. 454–465 (2010)
Leskovec, J., Kleinberg, J., Faloutsos, C.: Graphs over time: densification laws, shrinking diameters and possible explanations. In: Proceedings of the 11th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2005, pp. 177–187 (2005)
Leskovec, J., Kleinberg, J., Faloutsos, C.: Graph evolution: densification and shrinking diameters. ACM Trans. Knowl. Discov. Data 1(1), 2-es (2007)
Leskovec, J., Lang, K.J., Dasgupta, A., Mahoney, M.W.: Community structure in large networks: natural cluster sizes and the absence of large well-defined clusters. Internet Math. 6(1), 29–123 (2009)
Liu, Y., Safavi, T., Dighe, A., Koutra, D.: Graph summarization methods and applications: a survey. ACM Comput. Surv. 51(3), 62:1–62:34 (2018)
Mariadassou, M., Robin, S., Vacher, C.: Uncovering latent structure in valued graphs: a variational approach. Ann. Appl. Stat. 4(2), 715–742 (2010)
McAuley, J., Leskovec, J.: Learning to discover social circles in ego networks. In: Proceedings of the 25th Advances in Neural Information Processing Systems, NIPS 2012, pp. 539–547 (2012)
Navlakha, S., Rastogi, R., Shrivastava, N.: Graph summarization with bounded error. In: Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, SIGMOD 2008, pp. 419–432 (2008)
Peixoto, T.P.: Entropy of stochastic blockmodel ensembles. Phys. Rev. E 85(5), 056122 (2012)
Peixoto, T.P.: Parsimonious module inference in large networks. Phys. Rev. Lett. 110, 148701 (2013)
Peixoto, T.P.: Efficient Monte Carlo and greedy heuristic for the inference of stochastic block models. Phys. Rev. E 89, 012804 (2014)
Peixoto, T.P.: Hierarchical block structures and high-resolution model selection in large networks. Phys. Rev. X 4, 011047 (2014)
Peixoto, T.P.: Model selection and hypothesis testing for large-scale network models with overlapping groups. Phys. Rev. X 5, 011033 (2015)
Riondato, M., García-Soriano, D., Bonchi, F.: Graph summarization with quality guarantees. Data Min. Knowl. Disc. 31(2), 314–349 (2016). https://doi.org/10.1007/s10618-016-0468-8
Rissanen, J.: Modeling by shortest data description. Automatica 14, 465–471 (1978)
Rissanen, J.: A universal prior for integers and estimation by minimum description length. Ann. Stat. 11(2), 416–431 (1983)
Rissanen, J.: Optimal Estimation of Parameters. Cambridge University Press (2012)
Shtar’kov, Y.M.: Universal sequential coding of single messages. Probl. Inf. Transm. 23(3), 3–17 (1987)
Snijders, T.A.B., Nowicki, K.: Estimation and prediction for stochastic blockmodels for graphs with latent block structure. J. Classif. 14(1), 75–100 (1997)
Vallès-Català, T., Peixoto, T.P., Sales-Pardo, M., Guimerà, R.: Consistencies and inconsistencies between model selection and link prediction in networks. Phys. Rev. E 97, 062316 (2018)
Wu, T., Sugawara, S., Yamanishi, K.: Decomposed normalized maximum likelihood codelength criterion for selecting hierarchical latent variable models. In: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2017, pp. 1165–1174 (2017)
Wu, Y., Zhong, Z., Xiong, W., Jing, N.: Graph summarization for attributed graphs. In: Proceedings of 2014 International Conference on Information Science, Electronics and Electrical Engineering, ISEEE 2014, vol. 1, pp. 503–507 (2014)
Yamanishi, K., Fukushima, S.: Model change detection with the MDL principle. IEEE Trans. Inf. Theory 64(9), 6115–6126 (2018)
Yamanishi, K., Miyaguchi, K.: Detecting gradual changes from data stream using MDL-change statistics. In: Proceedings of 2016 IEEE International Conference on BigData, BigData 2016, pp. 156–163 (2016)
Yamanishi, K., Wu, T., Sugawara, S., Okada, M.: The decomposed normalized maximum likelihood code-length criterion for selecting hierarchical latent variable models. Data Min. Knowl. Disc. 33(4), 1017–1058 (2019). https://doi.org/10.1007/s10618-019-00624-4
Zhou, H., Liu, S., Lee, K., Shin, K., Shen, H., Cheng, X.: DPGS: degree-preserving graph summarization. In: Proceedings of the 2021 SIAM International Conference on Data Mining, SDM 2021, pp. 280–288 (2021)
Acknowledgements
This work was partially supported by JST KAKENHI 191400000190 and JST-AIP JPMJCR19U4.
A MDL Principle
Two-Part Coding. Let Y be a random variable and y be a realization. Assume that we observe a set of data \(y^{n} = y_{1}, \dots , y_{n} \in \mathcal {Y}^{n}\), where \(y_{i} \in \mathbb {R}^{d} \, (i=1, \dots , n)\) and \(\mathcal {Y}\) denotes the data domain. The fundamental idea of the MDL principle is to select as the best model the one that minimizes the total code-length of the model and the data. Let us consider the following parametric model family:
\(\mathcal {F} = \{ f(Y^{n}; \theta , M) : \theta \in \Theta (M) \}, \quad M \in \mathcal {M}, \qquad (13)\)
where \(\mathcal {M}\) is a model space and M is a model. Here, in the case of the SBM, a model refers to a family of distributions with a fixed number of groups (clusters), and \(\mathcal {M}\) is the set of such families with various numbers of groups (clusters). Let us consider how to encode the data \(y^{n}\) and a model M simultaneously. When we use a prefix code-length L, the total code-length required to encode \(y^{n}\) and M decomposes into the sum of the code-length of the data given the model and that of the model itself:
\(L(y^{n} : M) = L(y^{n} \mid M) + L(M),\)
where \(L(y^{n} : M)\), \(L(y^{n} \mid M)\), and L(M) denote the total code-length required to encode \(y^{n}\) and M, the code-length required to encode \(y^{n}\) given M, and the code-length required to encode M, respectively. A prefix code-length is one obtained with a prefix coding that satisfies the following Kraft inequalities [24]: \(\sum _{ y \in \mathcal {Y} } 2^{-L(y \mid M) } \le 1\) and \(\sum _{M \in \mathcal {M}} 2^{-L(M)} \le 1\). The MDL criterion then asserts that the total code-length \(L(y^{n} : M)\) should be minimized with respect to M:
\(\hat{M} = \mathop {\mathrm {arg\,min}}_{M \in \mathcal {M}} \left\{ L(y^{n} \mid M) + L(M) \right\}.\)
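As a concrete illustration, here is a minimal sketch of this two-part selection rule, assuming a uniform code over a finite candidate set for L(M) and the classical \((d/2) \log n\) two-part approximation for the parameter cost; the function and model names are hypothetical.

```python
import numpy as np

def two_part_select(y, models):
    """Return the model minimizing L(y^n : M) = L(y^n | M) + L(M).

    models : list of (name, neg_log2_lik, d) triples, where
             neg_log2_lik(y) = -log2 f(y; theta_hat(y), M) and d is
             the number of free parameters of M.
    """
    n = len(y)
    L_M = np.log2(len(models))  # uniform prefix code over the candidate set
    scores = {name: nll(y) + 0.5 * d * np.log2(n) + L_M
              for name, nll, d in models}
    return min(scores, key=scores.get)

# Example: compare a fitted Bernoulli model against a fair-coin model.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)

def bern_nll(y):
    h, n = int(y.sum()), len(y)
    p = min(max(h / n, 1e-12), 1 - 1e-12)
    return -(h * np.log2(p) + (n - h) * np.log2(1 - p))

fair_nll = lambda y: float(len(y))  # fair coin: exactly 1 bit per symbol
print(two_part_select(y, [("bernoulli", bern_nll, 1), ("fair", fair_nll, 0)]))
```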
Normalized Maximum Likelihood (NML) Code-Length. We now consider how to achieve the shortest code-length for the family of distributions in Eq. (13). To this end, we introduce the normalized maximum likelihood (NML) code-length, which is optimal in the sense that it achieves Shtarkov's minimax regret [25]. For the model class in Eq. (13), the minimax regret is defined as
\(\min _{g} \max _{y^{n}} \log \frac{ f(y^{n}; \hat{\theta }(y^{n}), M) }{ g(y^{n}) },\)
where the minimizing distribution g is called the normalized maximum likelihood (NML) distribution. The NML distribution \(f_{\mathrm {NML}}\) is given by \(f_{\mathrm {NML}}(y^{n}) = \frac{ f(y^{n}; \hat{\theta }(y^{n}), M) }{ \int f(Y^{n}; \hat{\theta }(Y^{n}), M) \, d{Y^{n}} }\), where \(\hat{\theta }(y^{n})\) is the maximum likelihood estimator \(\hat{\theta }(y^{n}) = \mathop {\mathrm {arg\,max}}_{\theta } f(y^{n}; \theta , M)\). Using the NML distribution, \(y^{n}\) can be encoded with the following code-length:
\(L_{\mathrm {NML}}(y^{n}; M) = -\log f_{\mathrm {NML}}(y^{n}) = -\log f(y^{n}; \hat{\theta }(y^{n}), M) + \log \int f(Y^{n}; \hat{\theta }(Y^{n}), M) \, dY^{n}. \qquad (15)\)
The code-length \(L_{\mathrm {NML}}(y^{n}; M)\) in Eq. (15) is called the NML code-length, and the last term on the right-hand side is called the parametric complexity of \(\mathcal {F}\) in Eq. (13) with data length n.
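For the Bernoulli model (the two-category multinomial), the normalizing integral reduces to a finite sum over the sufficient statistic, so the NML code-length can be computed exactly; a linear-time algorithm for general multinomials is given by Kontkanen and Myllymäki [5]. The sketch below is illustrative, with hypothetical function names.

```python
from math import comb, log2

def bernoulli_complexity(n):
    """Parametric complexity of the Bernoulli class for sample size n:
    sum_{h=0}^{n} C(n, h) (h/n)^h ((n-h)/n)^(n-h).
    (Python's 0.0 ** 0 == 1.0 handles the boundary terms h = 0, h = n.)"""
    return sum(comb(n, h) * (h / n) ** h * ((n - h) / n) ** (n - h)
               for h in range(n + 1))

def nml_codelength(y):
    """L_NML(y^n; M) = -log2 f(y^n; theta_hat) + log2(parametric complexity)."""
    n, h = len(y), sum(y)
    neg_ll = 0.0  # -log2 of the maximized likelihood, with 0 log 0 := 0
    if h > 0:
        neg_ll -= h * log2(h / n)
    if h < n:
        neg_ll -= (n - h) * log2((n - h) / n)
    return neg_ll + log2(bernoulli_complexity(n))

print(nml_codelength([0, 1, 1, 0, 1, 1, 1, 0]))  # NML code-length in bits
```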