research-article
Open access
Just Accepted

Enhancing High-Throughput GPU Random Walks Through Multi-Task Concurrency Orchestration

Online AM: 10 January 2025

Abstract

Random walks are a powerful tool for large-scale graph learning, but their high computational demand presents a challenge. GPUs can accelerate random walk tasks, yet current frameworks fail to fully utilize GPU parallelism because of a memory-to-compute bandwidth imbalance. This paper proposes CoWalker, an efficient GPU framework that executes random walks concurrently to achieve high overall throughput. CoWalker features three novel designs. First, a multi-level execution model orchestrates diverse walk tasks and reduces GPU stalls based on multiple graph characteristics. Second, CoWalker manages graph data and streaming multiprocessors collaboratively to minimize memory access interference and maximize core utilization under concurrent tasks. Third, a multi-dimensional scheduler selects compatible combinations of random walk tasks based on their memory footprints to maximize throughput. By mitigating concurrency overheads and effectively harnessing GPU parallelism, CoWalker significantly improves throughput over state-of-the-art baselines. Extensive evaluations on real-world workloads show that CoWalker achieves 2.75× higher overall system throughput than commercial tools and 1.56× higher than the state-of-the-art academic system.
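The abstract does not specify how the scheduler combines tasks, so the following is only an illustrative sketch of the general idea of grouping tasks by memory footprint. It assumes a single memory budget per concurrent group and uses a plain first-fit-decreasing heuristic; the names `WalkTask`, `footprint_mb`, and `group_by_footprint` are hypothetical and are not taken from CoWalker.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WalkTask:
    # Hypothetical task descriptor: a label and an assumed memory footprint.
    name: str
    footprint_mb: int


def group_by_footprint(tasks, budget_mb):
    """Pack tasks into groups whose total footprint fits within budget_mb.

    First-fit decreasing: consider tasks from largest to smallest footprint
    and place each one in the first group that still has room for it.
    This is an assumed stand-in heuristic, not the paper's scheduler.
    """
    groups = []  # each entry: {"used": int, "tasks": [WalkTask, ...]}
    for task in sorted(tasks, key=lambda t: t.footprint_mb, reverse=True):
        if task.footprint_mb > budget_mb:
            raise ValueError(f"{task.name} exceeds the memory budget")
        for group in groups:
            if group["used"] + task.footprint_mb <= budget_mb:
                group["tasks"].append(task)
                group["used"] += task.footprint_mb
                break
        else:
            # No existing group has room, so open a new one.
            groups.append({"used": task.footprint_mb, "tasks": [task]})
    return [group["tasks"] for group in groups]


tasks = [
    WalkTask("A", 600),
    WalkTask("B", 300),
    WalkTask("C", 500),
    WalkTask("D", 400),
]
groups = group_by_footprint(tasks, budget_mb=1000)
group_names = [[t.name for t in g] for g in groups]
print(group_names)
```

Each inner list is one set of tasks that could run concurrently under the assumed budget. A real scheduler would also have to weigh compute demand and memory access interference, which this sketch ignores.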

Information

Published In

ACM Transactions on Architecture and Code Optimization, Just Accepted
EISSN:1544-3973
This work is licensed under Creative Commons Attribution International 4.0.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Online AM: 10 January 2025
Accepted: 07 December 2024
Revised: 05 December 2024
Received: 09 September 2024

Author Tags

  1. Graph computing
  2. Random Walk
  3. GPU

Qualifiers

  • Research-article
