Abstract
This study addresses the problem of summarizing a static graph, known as graph summarization, effectively and efficiently. The resulting compact graph is referred to as a summary graph. Based on the minimum description length (MDL) principle, we propose a novel graph summarization algorithm called graph summarization with latent variable probabilistic models (GSL). The MDL principle asserts that the best statistical decision strategy is the one that best compresses the data. The key idea of GSL is to encode the original and summary graphs simultaneously using latent variable probabilistic models with two-part coding: we first encode a summary graph and then encode the original graph given the summary graph. When encoding these graphs, we can use various latent variable probabilistic models, and can therefore encode more complex graph structures than conventional graph summarization algorithms can. We demonstrate the effectiveness of GSL on both synthetic and real datasets.
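To make the two-part coding idea concrete, the following is a minimal sketch, not the authors' implementation: it scores a candidate summary by \(L(S) + L(G \mid S)\), encoding node-to-supernode assignments with a uniform code and each block of the adjacency matrix with a maximum-likelihood Bernoulli code. All names (`two_part_codelength`, `z`) are illustrative assumptions; GSL itself encodes the summary with latent variable probabilistic models rather than this naive uniform code.

```python
import numpy as np

def two_part_codelength(A, z, eps=1e-12):
    """Illustrative total code-length L(G : S) = L(S) + L(G | S).

    A : (n, n) 0/1 adjacency matrix of the original graph G.
    z : (n,) supernode assignment of each node, values in 0..k-1.
    """
    n = A.shape[0]
    k = int(z.max()) + 1
    # L(S): encode each node's supernode with a uniform code over k choices.
    L_summary = n * np.log2(k) if k > 1 else 0.0
    # L(G | S): per-block Bernoulli code with maximum-likelihood densities.
    L_data = 0.0
    for a in range(k):
        for b in range(k):
            block = A[np.ix_(z == a, z == b)]
            m = block.size
            if m == 0:
                continue
            e = block.sum()
            p = min(max(e / m, eps), 1.0 - eps)  # clamp to avoid log(0)
            L_data += -(e * np.log2(p) + (m - e) * np.log2(1.0 - p))
    return L_summary + L_data
```

Minimizing this quantity over candidate assignments z (and over the number of supernodes k) instantiates the MDL criterion reviewed in the appendix.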
References
Beg, M.A., Ahmad, M., Zaman, A., Khan, I.: Scalable approximation algorithm for graph summarization. In: Proceedings of the 22nd Pacific-Asia Conference on Knowledge Discovery and Data Mining, PAKDD 2018, pp. 502–514 (2018)
Fukushima, S., Yamanishi, K.: Hierarchical change detection in latent variable models. In: Proceedings of 2020 IEEE International Conference on Data Mining, ICDM2020, pp. 1128–1134 (2020)
Grünwald, P.: The Minimum Description Length Principle. MIT Press (2007)
Hric, D., Peixoto, T.P., Fortunato, S.: Network structure, metadata, and the prediction of missing nodes and annotations. Phys. Rev. X 6(3), 031038 (2016)
Kontkanen, P., Myllymäki, P.: A linear-time algorithm for computing the multinomial stochastic complexity. Inf. Process. Lett. 103(6), 227–233 (2007)
Koutra, D., Kang, U., Vreeken, J., Faloutsos, C.: VOG: summarizing and understanding large graphs. In: Proceedings of the 2014 SIAM International Conference on Data Mining, SDM 2014, pp. 91–99 (2014)
Lee, K., Jo, H., Ko, J., Lim, S., Shin, K.: SSumM: sparse summarization of massive graphs. In: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2020, pp. 144–154 (2020)
LeFevre, K., Terzi, E.: GraSS: graph structure summarization. In: Proceedings of the 2010 SIAM International Conference on Data Mining, SDM 2010, pp. 454–465 (2010)
Leskovec, J., Kleinberg, J., Faloutsos, C.: Graphs over time: densification laws, shrinking diameters and possible explanations. In: Proceedings of the 11th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2005, pp. 177–187 (2005)
Leskovec, J., Kleinberg, J., Faloutsos, C.: Graph evolution: densification and shrinking diameters. ACM Trans. Knowl. Discov. Data 1(1), 2-es (2007)
Leskovec, J., Lang, K.J., Dasgupta, A., Mahoney, M.W.: Community structure in large networks: natural cluster sizes and the absence of large well-defined clusters. Internet Math. 6(1), 29–123 (2009)
Liu, Y., Safavi, T., Dighe, A., Koutra, D.: Graph summarization methods and applications: a survey. ACM Comput. Surv. 51(3), 62:1–62:34 (2018)
Mariadassou, M., Robin, S., Vacher, C.: Uncovering latent structure in valued graphs: a variational approach. Ann. Appl. Stat. 4(2), 715–742 (2010)
McAuley, J., Leskovec, J.: Learning to discover social circles in ego networks. In: Proceedings of the 25th Advances in Neural Information Processing Systems, NIPS 2012, pp. 539–547 (2012)
Navlakha, S., Rastogi, R., Shrivastava, N.: Graph summarization with bounded error. In: Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, SIGMOD 2008, pp. 419–432 (2008)
Peixoto, T.P.: Entropy of stochastic blockmodel ensembles. Phys. Rev. E 85(5), 056122 (2012)
Peixoto, T.P.: Parsimonious module inference in large networks. Phys. Rev. Lett. 110, 148701 (2013)
Peixoto, T.P.: Efficient Monte Carlo and greedy heuristic for the inference of stochastic block models. Phys. Rev. E 89, 012804 (2014)
Peixoto, T.P.: Hierarchical block structures and high-resolution model selection in large networks. Phys. Rev. X 4, 011047 (2014)
Peixoto, T.P.: Model selection and hypothesis testing for large-scale network models with overlapping groups. Phys. Rev. X 5, 011033 (2015)
Riondato, M., García-Soriano, D., Bonchi, F.: Graph summarization with quality guarantees. Data Min. Knowl. Disc. 31(2), 314–349 (2016). https://doi.org/10.1007/s10618-016-0468-8
Rissanen, J.: Modeling by shortest data description. Automatica 14, 465–471 (1978)
Rissanen, J.: A universal prior for integers and estimation by minimum description length. Ann. Stat. 11(2), 416–431 (1983)
Rissanen, J.: Optimal Estimation of Parameters. Cambridge University Press (2012)
Shtar’kov, Y.M.: Universal sequential coding of single messages. Probl. Inf. Transm. 23(3), 3–17 (1987)
Snijders, T.A.B., Nowicki, K.: Estimation and prediction for stochastic blockmodels for graphs with latent block structure. J. Classif. 14(1), 75–100 (1997)
Vallès-Català, T., Peixoto, T.P., Sales-Pardo, M., Guimerà, R.: Consistencies and inconsistencies between model selection and link prediction in networks. Phys. Rev. E 97, 062316 (2018)
Wu, T., Sugawara, S., Yamanishi, K.: Decomposed normalized maximum likelihood codelength criterion for selecting hierarchical latent variable models. In: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2017, pp. 1165–1174 (2017)
Wu, Y., Zhong, Z., Xiong, W., Jing, N.: Graph summarization for attributed graphs. In: Proceedings of 2014 International Conference on Information Science, Electronics and Electrical Engineering, ISEEE 2014, vol. 1, pp. 503–507 (2014)
Yamanishi, K., Fukushima, S.: Model change detection with the MDL principle. IEEE Trans. Inf. Theory 64(9), 6115–6126 (2018)
Yamanishi, K., Miyaguchi, K.: Detecting gradual changes from data stream using MDL-change statistics. In: Proceedings of 2016 IEEE International Conference on BigData, BigData 2016, pp. 156–163 (2016)
Yamanishi, K., Wu, T., Sugawara, S., Okada, M.: The decomposed normalized maximum likelihood code-length criterion for selecting hierarchical latent variable models. Data Min. Knowl. Disc. 33(4), 1017–1058 (2019). https://doi.org/10.1007/s10618-019-00624-4
Zhou, H., Liu, S., Lee, K., Shin, K., Shen, H., Cheng, X.: DPGS: degree-preserving graph summarization. In: Proceedings of the 2021 SIAM International Conference on Data Mining, SDM 2021, pp. 280–288 (2021)
Acknowledgements
This work was partially supported by JST KAKENHI 191400000190 and JST-AIP JPMJCR19U4.
A MDL Principle
Two-Part Coding. Let Y be a random variable and y be a realization. Assume that we observe a set of data \(y^{n} = y_{1}, \dots , y_{n} \in \mathcal {Y}^{n}\), where \(y_{i} \in \mathbb {R}^{d} \, (i=1, \dots , n)\) and \(\mathcal {Y}\) denotes the data domain. The fundamental idea of the MDL principle is to select as the best model the one that minimizes the total code-length of the model and the data. Let us consider the following parametric model family:
\(\mathcal {F} = \{ f(Y^{n}; \theta , M) : \theta \in \Theta (M) \}, \quad M \in \mathcal {M}, \qquad (13)\)
where \(\mathcal {M}\) is a model space and M is a model. Here, in the case of the SBM, a model refers to a family of distributions with a fixed number of groups (clusters), and \(\mathcal {M}\) is the set of such families with various numbers of groups (clusters). Let us consider how to encode the data \(y^{n}\) and a model M simultaneously. When we use a prefix code-length L, the total code-length required to encode \(y^{n}\) and M decomposes into the sum of the code-length of the data given the model and that of the model itself:
\(L(y^{n} : M) = L(y^{n} \mid M) + L(M),\)
where \(L(y^{n} : M)\), \(L(y^{n} \mid M)\), and L(M) denote the total code-length required to encode \(y^{n}\) and M, the code-length required to encode \(y^{n}\) given M, and the code-length required to encode M, respectively. A prefix code-length is one obtained with a prefix coding that satisfies the following Kraft inequalities [24]: \(\sum _{ y \in \mathcal {Y} } 2^{-L(y \mid M) } \le 1\) and \(\sum _{M \in \mathcal {M}} 2^{-L(M)} \le 1\). The MDL criterion then asserts that the total code-length \(L(y^{n} : M)\) should be minimized with respect to M:
\(\hat{M} = \mathop {\mathrm {arg\,min}}_{M \in \mathcal {M}} \left\{ L(y^{n} \mid M) + L(M) \right\}.\)
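As a concrete illustration, here is a minimal sketch of this two-part selection rule, assuming a uniform code over a finite candidate set for L(M) and the classical \((d/2) \log n\) two-part approximation for the parameter cost; the function and model names are hypothetical.

```python
import numpy as np

def two_part_select(y, models):
    """Return the model minimizing L(y^n : M) = L(y^n | M) + L(M).

    models : list of (name, neg_log2_lik, d) triples, where
             neg_log2_lik(y) = -log2 f(y; theta_hat(y), M) and d is
             the number of free parameters of M.
    """
    n = len(y)
    L_M = np.log2(len(models))  # uniform prefix code over the candidate set
    scores = {name: nll(y) + 0.5 * d * np.log2(n) + L_M
              for name, nll, d in models}
    return min(scores, key=scores.get)

# Example: compare a fitted Bernoulli model against a fair-coin model.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)

def bern_nll(y):
    h, n = int(y.sum()), len(y)
    p = min(max(h / n, 1e-12), 1 - 1e-12)
    return -(h * np.log2(p) + (n - h) * np.log2(1 - p))

fair_nll = lambda y: float(len(y))  # fair coin: exactly 1 bit per symbol
print(two_part_select(y, [("bernoulli", bern_nll, 1), ("fair", fair_nll, 0)]))
```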
Normalized Maximum Likelihood (NML) Code-Length. We now consider how to achieve the shortest code-length for the family of distributions in Eq. (13). To this end, we introduce the normalized maximum likelihood (NML) code-length, which is optimal in the sense that it achieves Shtarkov's minimax regret [25]. For the model class in Eq. (13), the minimax regret is defined as
\(\min _{g} \max _{y^{n}} \log \frac{ f(y^{n}; \hat{\theta }(y^{n}), M) }{ g(y^{n}) },\)
where the minimizing distribution g is called the normalized maximum likelihood (NML) distribution. The NML distribution \(f_{\mathrm {NML}}\) is given by \(f_{\mathrm {NML}}(y^{n}) = \frac{ f(y^{n}; \hat{\theta }(y^{n}), M) }{ \int f(Y^{n}; \hat{\theta }(Y^{n}), M) \, d{Y^{n}} }\), where \(\hat{\theta }(y^{n})\) is the maximum likelihood estimator \(\hat{\theta }(y^{n}) = \mathop {\mathrm {arg\,max}}_{\theta } f(y^{n}; \theta , M)\). Using the NML distribution, \(y^{n}\) can be encoded with the following code-length:
\(L_{\mathrm {NML}}(y^{n}; M) = -\log f_{\mathrm {NML}}(y^{n}) = -\log f(y^{n}; \hat{\theta }(y^{n}), M) + \log \int f(Y^{n}; \hat{\theta }(Y^{n}), M) \, dY^{n}. \qquad (15)\)
The code-length \(L_{\mathrm {NML}}(y^{n}; M)\) in Eq. (15) is called the NML code-length, and the last term on the right-hand side is called the parametric complexity of \(\mathcal {F}\) in Eq. (13) with data length n.
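For the Bernoulli model (the two-category multinomial), the normalizing integral reduces to a finite sum over the sufficient statistic, so the NML code-length can be computed exactly; a linear-time algorithm for general multinomials is given by Kontkanen and Myllymäki [5]. The sketch below is illustrative, with hypothetical function names.

```python
from math import comb, log2

def bernoulli_complexity(n):
    """Parametric complexity of the Bernoulli class for sample size n:
    sum_{h=0}^{n} C(n, h) (h/n)^h ((n-h)/n)^(n-h).
    (Python's 0.0 ** 0 == 1.0 handles the boundary terms h = 0, h = n.)"""
    return sum(comb(n, h) * (h / n) ** h * ((n - h) / n) ** (n - h)
               for h in range(n + 1))

def nml_codelength(y):
    """L_NML(y^n; M) = -log2 f(y^n; theta_hat) + log2(parametric complexity)."""
    n, h = len(y), sum(y)
    neg_ll = 0.0  # -log2 of the maximized likelihood, with 0 log 0 := 0
    if h > 0:
        neg_ll -= h * log2(h / n)
    if h < n:
        neg_ll -= (n - h) * log2((n - h) / n)
    return neg_ll + log2(bernoulli_complexity(n))

print(nml_codelength([0, 1, 1, 0, 1, 1, 1, 0]))  # NML code-length in bits
```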