AsynGraph: Maximizing Data Parallelism for Efficient Iterative Graph Processing on GPUs

Authors:

Bingsheng He

ACM Transactions on Architecture and Code Optimization (TACO), Volume 17, Issue 4

Article No.: 29, Pages 1 - 21

https://doi.org/10.1145/3416495

Published: 30 September 2020

Abstract

GPU-accelerated systems have recently been proposed for running iterative graph algorithms. However, existing GPU-based solutions still underutilize the GPU's parallelism during iterative graph processing. Because natural graphs follow a power law, the paths between a small set of important vertices (e.g., high-degree vertices) play a particularly important role in how fast iterative graph processing converges. Based on this observation, this article develops a novel system, called AsynGraph, that maximizes data parallelism for faster iterative graph processing on GPUs. It first proposes an efficient structure-aware asynchronous processing method. By efficiently handling the paths between the important vertices, this method lets the state propagations of most vertices proceed concurrently on the GPU, yielding a higher GPU utilization ratio. Specifically, a graph sketch (consisting of the paths between the important vertices) is extracted from the original graph to serve as a fast bridge for most state propagations. By processing this sketch multiple times within each round of graph processing, AsynGraph exploits more of the GPU's parallelism to accelerate most state propagations. In addition, a forward-backward intra-path processing scheme asynchronously handles the vertices on each path, aiming to further speed up propagation along paths while reducing data access cost. Compared with the existing GPU-based systems Gunrock, Groute, Tigr, and DiGraph, AsynGraph speeds up iterative graph processing by 3.06–11.52, 2.47–5.40, 2.23–9.65, and 1.41–4.05 times, respectively.
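
The forward-backward intra-path idea mentioned in the abstract can be illustrated with a small CPU-side sketch. This is not AsynGraph's implementation, which targets GPUs and is not described on this page. It is a minimal illustration that assumes a single undirected weighted path and a shortest-distance style update, and every name in it (`sync_round`, `forward_backward_round`, `rounds_to_converge`, the toy path) is hypothetical. It contrasts a synchronous round, where each vertex reads only the previous round's states, with an asynchronous forward-then-backward sweep, where each update is immediately visible to the next vertex on the path.

```python
import math


def sync_round(dist, weights):
    """One synchronous round on the path v0 - v1 - ... - vn.

    weights[i] is the weight of the edge between v_i and v_{i+1}.
    Every vertex reads only the states produced by the previous round.
    """
    old = list(dist)
    new = list(dist)
    for i in range(len(dist)):
        if i > 0:
            new[i] = min(new[i], old[i - 1] + weights[i - 1])
        if i < len(dist) - 1:
            new[i] = min(new[i], old[i + 1] + weights[i])
    return new


def forward_backward_round(dist, weights):
    """One asynchronous round: a forward sweep, then a backward sweep.

    Updates are applied in place, so a vertex sees the state its
    neighbour computed earlier in the same sweep.
    """
    dist = list(dist)
    for i in range(1, len(dist)):              # forward: v1 .. vn
        dist[i] = min(dist[i], dist[i - 1] + weights[i - 1])
    for i in range(len(dist) - 2, -1, -1):     # backward: v(n-1) .. v0
        dist[i] = min(dist[i], dist[i + 1] + weights[i])
    return dist


def rounds_to_converge(step, dist, weights):
    """Apply `step` until the states stop changing.

    Returns (number of rounds that changed some state, final states).
    """
    rounds = 0
    while True:
        nxt = step(dist, weights)
        if nxt == dist:
            return rounds, dist
        dist = nxt
        rounds += 1


if __name__ == "__main__":
    # Toy path with six vertices; the source is v2 (distance 0).
    weights = [1, 2, 1, 3, 1]
    start = [math.inf, math.inf, 0, math.inf, math.inf, math.inf]
    print(rounds_to_converge(sync_round, start, weights))
    print(rounds_to_converge(forward_backward_round, start, weights))
```

On this toy path both variants reach the same distances, but the synchronous version needs three rounds (one per hop to the farthest vertex) while the forward-backward sweep needs one. The example only shows why in-place sweeps along a path can shorten propagation; it says nothing about the GPU scheduling or memory layout used by the actual system.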

Index Terms

AsynGraph: Maximizing Data Parallelism for Efficient Iterative Graph Processing on GPUs
1. Computer systems organization
  1. Architectures
    1. Other architectures
      1. Special purpose systems
    2. Parallel architectures
      1. Single instruction, multiple data

Published In

ACM Transactions on Architecture and Code Optimization Volume 17, Issue 4

December 2020

430 pages

ISSN:1544-3566

EISSN:1544-3973

DOI:10.1145/3427420

Editor:
David Kaeli
Northeastern University, USA

Copyright © 2020 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 30 September 2020

Accepted: 01 August 2020

Revised: 01 July 2020

Received: 01 April 2020

Published in TACO Volume 17, Issue 4

Qualifiers

Research-article
Research
Refereed

Funding Sources

National Natural Science Foundation of China
National Key Research and Development Program of China
