research-article

Open access

Just Accepted

Enhancing High-Throughput GPU Random Walks Through Multi-Task Concurrency Orchestration

Authors:

Baoping Hao

ACM Transactions on Architecture and Code Optimization

Accepted on 07 December 2024

https://doi.org/10.1145/3711820

Online AM: 10 January 2025

Abstract

Random walk is a powerful tool for large-scale graph learning, but its high computational demand presents a challenge. While GPUs can accelerate random walk tasks, current frameworks fail to fully utilize GPU parallelism because of an imbalance between memory bandwidth and compute capacity. This paper proposes CoWalker, an efficient GPU framework that executes random walk tasks concurrently to achieve high overall throughput. CoWalker features three novel designs. First, it incorporates a multi-level execution model that orchestrates diverse walk tasks and reduces GPU stalls based on multiple graph characteristics. Second, it jointly manages graph data and streaming multiprocessors to minimize memory access interference and maximize core utilization under concurrent tasks. Finally, a multi-dimensional scheduler selects compatible combinations of random walk tasks based on their memory footprints to maximize throughput. By mitigating concurrency overheads and harnessing GPU parallelism, CoWalker significantly improves throughput over state-of-the-art baselines. Extensive evaluations on real-world workloads show that CoWalker achieves 2.75× higher overall system throughput than commercial tools and 1.56× higher than the state-of-the-art academic system.
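The abstract does not give the scheduler's algorithm, so the following is only a toy sketch of the general idea of choosing co-running tasks by memory footprint. It assumes that a combination is "compatible" when its combined footprint fits a fixed memory budget, and it uses the amount of memory filled as a crude stand-in for throughput. The function name `pick_concurrent_tasks`, the task names, and all numbers are hypothetical and are not taken from CoWalker.

```python
from itertools import combinations


def pick_concurrent_tasks(footprints_mb, budget_mb, max_concurrent=2):
    """Pick the task combination that best fills a memory budget.

    Assumption (not from the paper): tasks are compatible when their
    combined footprint fits in budget_mb, and a fuller budget is used
    as a rough proxy for higher throughput.
    """
    best, best_total = (), 0
    names = sorted(footprints_mb)  # fixed order makes ties deterministic
    for k in range(1, max_concurrent + 1):
        for combo in combinations(names, k):
            total = sum(footprints_mb[name] for name in combo)
            # Keep the combination only if it fits and fills more memory.
            if total <= budget_mb and total > best_total:
                best, best_total = combo, total
    return best, best_total


# Hypothetical footprints (MB) for three illustrative walk tasks.
tasks = {"deepwalk": 600, "node2vec": 900, "ppr": 300}
combo, total = pick_concurrent_tasks(tasks, budget_mb=1200)
print(combo, total)
```

A real scheduler would presumably weigh more dimensions than memory alone (the abstract calls it "multi-dimensional"); the exhaustive search here is only practical for a handful of tasks.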

Index Terms

1. Computer systems organization
  1. Architectures
    1. Parallel architectures
      1. Single instruction, multiple data
2. Computing methodologies
  1. Concurrent computing methodologies
    1. Concurrent algorithms
  2. Parallel computing methodologies
    1. Parallel algorithms
      1. Massively parallel algorithms

Published In

ACM Transactions on Architecture and Code Optimization Just Accepted

EISSN:1544-3973

Copyright © 2025 held by the owner/author(s).

This work is licensed under Creative Commons Attribution International 4.0.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Online AM: 10 January 2025

Accepted: 07 December 2024

Revised: 05 December 2024

Received: 09 September 2024

Qualifiers

Research-article
