DOI: 10.1145/3627673.3680018
research-article

A Multi-Node Multi-GPU Distributed GNN Training Framework for Large-Scale Online Advertising

Published: 21 October 2024

Abstract

Graph Neural Networks (GNNs) have become critical in domains such as online advertising, but they face scalability challenges as graph data grows, creating a need for advanced distributed GPU computation strategies across multiple nodes. This paper presents PGLBox-Cluster, a robust distributed graph learning framework built atop the PaddlePaddle platform and designed to efficiently process graphs comprising billions of nodes and edges. By strategically partitioning the model, node attributes, and graph data, and by leveraging industrial-grade RPC and NCCL for communication, PGLBox-Cluster enables effective distributed computation. Extensive experimental results confirm that PGLBox-Cluster achieves a 1.94x to 2.93x speedup over the single-node configuration. With its novel asynchronous communication and graph partitioning techniques, it handles datasets exceeding 3 billion nodes and 120 billion edges, significantly improving GNN scalability and efficiency. The repository has been publicly released.
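
The abstract does not spell out how the graph is partitioned, so the following is only a minimal sketch of one common approach, modulo hashing of node IDs across machines and GPUs. It is not PGLBox-Cluster's actual scheme. The function names `owner` and `partition_edges`, the toy edge list, and the 2-machine, 2-GPU layout are all hypothetical, chosen for illustration. The sketch assigns each edge to the shard that owns its source node and counts the edges whose endpoints land on different machines, since those are the ones that would need network communication.

```python
from collections import defaultdict


def owner(node_id, num_machines, gpus_per_machine):
    """Map a node ID to a (machine, gpu) pair by modulo hashing (assumed scheme)."""
    shard = node_id % (num_machines * gpus_per_machine)
    return shard // gpus_per_machine, shard % gpus_per_machine


def partition_edges(edges, num_machines, gpus_per_machine):
    """Place each edge on the shard owning its source node.

    Returns the per-shard edge lists and the number of edges whose
    endpoints are owned by different machines.
    """
    shards = defaultdict(list)
    cross_machine = 0
    for src, dst in edges:
        src_owner = owner(src, num_machines, gpus_per_machine)
        dst_owner = owner(dst, num_machines, gpus_per_machine)
        shards[src_owner].append((src, dst))
        if src_owner[0] != dst_owner[0]:
            cross_machine += 1
    return dict(shards), cross_machine


# Toy graph (hypothetical data): a 4-cycle plus one extra edge.
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (4, 5)]
shards, cross = partition_edges(edges, num_machines=2, gpus_per_machine=2)
print(shards)
print("cross-machine edges:", cross)
```

Modulo hashing balances node counts but ignores graph structure, so it tends to cut many edges. Structure-aware partitioners trade extra preprocessing for fewer cross-machine edges.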

    Published In

    CIKM '24: Proceedings of the 33rd ACM International Conference on Information and Knowledge Management
    October 2024
    5705 pages
    ISBN: 9798400704369
    DOI: 10.1145/3627673

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. GNN
    2. distributed GPU computation
    3. graph learning

    Acceptance Rates

    Overall Acceptance Rate 1,861 of 8,427 submissions, 22%