
Distributed Pseudo-Likelihood Method for Community Detection in Large-Scale Networks

Published: 19 June 2024
Abstract

This paper proposes a distributed pseudo-likelihood (DPL) method for identifying the community structure of large-scale networks. Specifically, we first propose a block-wise splitting method that divides the large-scale network data into several subnetworks and distributes them among multiple workers. For simplicity, we assume the classical stochastic block model. The DPL algorithm is then implemented iteratively to optimize, in a distributed fashion, the sum of the local pseudo-likelihood functions. At each iteration, each worker updates its local community labels and communicates with the master; the master then broadcasts the combined estimator back to the workers for the next iteration. By exploiting the distributed system, DPL significantly reduces the computational complexity of the traditional single-machine pseudo-likelihood method. Furthermore, to ensure statistical accuracy, we theoretically characterize the required worker sample size. We also extend the DPL method to estimate degree-corrected stochastic block models. The superior performance of the proposed distributed algorithm is demonstrated through extensive numerical studies and real data analysis.
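To make the iteration described above concrete, the following is a minimal single-process sketch of a master/worker pseudo-likelihood loop under the standard stochastic block model, using a Poisson-type pseudo-likelihood for the label updates. All function names (`simulate_sbm`, `worker_step`, `master_step`, `distributed_pl`), the random initialization, and the exact update rules are illustrative assumptions, not the authors' DPL algorithm; the workers are simulated sequentially in one process purely to show the communication pattern.

```python
# Hedged sketch only: a stand-in for the kind of block-split, worker-update,
# master-aggregate-and-broadcast loop described in the abstract.
import numpy as np

def simulate_sbm(n, K, p_in=0.10, p_out=0.02, seed=0):
    """Generate a symmetric adjacency matrix from a planted K-block SBM."""
    rng = np.random.default_rng(seed)
    z = rng.integers(K, size=n)                                # ground-truth labels
    B = np.full((K, K), p_out) + np.eye(K) * (p_in - p_out)    # connectivity matrix
    P = B[z][:, z]                                             # n x n edge probabilities
    A = (rng.random((n, n)) < P).astype(float)
    A = np.triu(A, 1)
    return A + A.T, z                                          # symmetric, no self-loops

def worker_step(A_rows, labels_all, B_hat, K):
    """Worker update: relabel this worker's nodes given the current global estimates."""
    # b[i, k] = number of edges from local node i into current community k.
    b = np.stack([A_rows[:, labels_all == k].sum(axis=1) for k in range(K)], axis=1)
    sizes = np.bincount(labels_all, minlength=K).astype(float)
    logB = np.log(np.clip(B_hat, 1e-10, 1.0))
    # Poisson-type pseudo-log-likelihood of assigning local node i to class g:
    #   sum_k [ b[i, k] * log B[g, k] - n_k * B[g, k] ]  (g-independent terms dropped)
    ll = b @ logB.T - sizes @ B_hat.T
    return ll.argmax(axis=1), b

def master_step(all_b, labels_all, K):
    """Master update: combine the workers' block counts into a new estimate of B."""
    b = np.vstack(all_b)                                       # n x K block counts
    sizes = np.bincount(labels_all, minlength=K).astype(float)
    B_hat = np.zeros((K, K))
    for g in range(K):
        rows_g = b[labels_all == g]
        B_hat[g] = rows_g.sum(axis=0) / np.maximum(rows_g.shape[0] * sizes, 1.0)
    return B_hat

def distributed_pl(A, K, n_workers=4, n_iter=20, seed=1):
    """Alternate worker label updates and master aggregation/broadcast (DPL-style)."""
    n = A.shape[0]
    labels = np.random.default_rng(seed).integers(K, size=n)   # crude start; a spectral
                                                               # initializer is typical
    blocks = np.array_split(np.arange(n), n_workers)           # block-wise row split
    B_hat = np.full((K, K), A.mean())
    for _ in range(n_iter):
        all_b = []
        for idx in blocks:                                     # in practice: one process per worker
            new_local, b = worker_step(A[idx], labels, B_hat, K)
            labels[idx] = new_local
            all_b.append(b)
        # The block counts were computed under the pre-update labels; this mild
        # inconsistency is accepted to keep the sketch short.
        B_hat = master_step(all_b, labels, K)                  # combined estimator, then "broadcast"
    return labels, B_hat

if __name__ == "__main__":
    A, z = simulate_sbm(n=600, K=3)
    labels, B_hat = distributed_pl(A, K=3)
    print("estimated community sizes:", np.bincount(labels, minlength=3))
```

In a genuinely distributed run, each worker would hold only its own row block of the adjacency matrix, and per-iteration communication would be limited to small summaries (here, the local block counts sent up and the K x K estimate broadcast back), which is where the computational and communication savings described in the abstract would come from.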



Published In

ACM Transactions on Knowledge Discovery from Data, Volume 18, Issue 7
August 2024, 505 pages
ISSN: 1556-4681
EISSN: 1556-472X
DOI: 10.1145/3613689
Editor: Jian Pei

Publisher

Association for Computing Machinery, New York, NY, United States

        Publication History

        Published: 19 June 2024
        Online AM: 16 April 2024
        Accepted: 29 March 2024
        Revised: 20 March 2024
        Received: 16 February 2023
        Published in TKDD Volume 18, Issue 7


        Author Tags

        1. Community detection
        2. computational efficiency
        3. distributed algorithm
        4. large-scale networks
        5. pseudo-likelihood

        Qualifiers

        • Research-article

        Funding Sources

        • National Natural Science Foundation of China
        • MOE Project of Key Research Institute of Humanities and Social Sciences

