High-Performance and Scalable GPU Graph Traversal

Authors:

Michael Garland,

Andrew Grimshaw

ACM Transactions on Parallel Computing (TOPC), Volume 1, Issue 2

Article No.: 14, Pages 1 - 30

https://doi.org/10.1145/2717511

Published: 18 February 2015

Abstract

Breadth-First Search (BFS) is a core primitive for graph traversal and a basis for many higher-level graph analysis algorithms. It is also representative of a class of parallel computations whose memory accesses and work distribution are both irregular and data dependent. Recent work has demonstrated the plausibility of GPU sparse graph traversal, but has tended to focus on asymptotically inefficient algorithms that perform poorly on graphs with nontrivial diameter.

We present a BFS parallelization focused on fine-grained task management constructed from efficient prefix sum computations that achieves an asymptotically optimal O(|V| + |E|) work complexity. Our implementation delivers excellent performance on diverse graphs, achieving traversal rates in excess of 3.3 billion and 8.3 billion traversed edges per second using single- and quad-GPU configurations, respectively. This level of performance is several times faster than state-of-the-art implementations on both CPU and GPU platforms.
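To make the role of prefix sums concrete, the following is a minimal sequential sketch in Python, not the authors' GPU implementation. It assumes a graph stored in compressed sparse row (CSR) form; the names `exclusive_scan`, `bfs_prefix_sum`, `row_offsets`, and `col_indices` are hypothetical and chosen only for illustration. Each BFS level computes an exclusive prefix sum over the frontier's neighbor counts, so every frontier vertex learns where to write its neighbors in a shared output array without any coordination. On a GPU the loops marked "parallel" would be distributed across threads.

```python
from itertools import accumulate


def exclusive_scan(values):
    """Return (offsets, total) with offsets[i] == sum(values[:i])."""
    running = [0] + list(accumulate(values))
    return running[:-1], running[-1]


def bfs_prefix_sum(row_offsets, col_indices, source):
    """Level-synchronous BFS over a CSR graph (illustrative sketch).

    row_offsets has length |V| + 1; the neighbors of vertex v are
    col_indices[row_offsets[v]:row_offsets[v + 1]].
    Returns the BFS depth of every vertex, or -1 if unreachable.
    """
    num_vertices = len(row_offsets) - 1
    dist = [-1] * num_vertices
    dist[source] = 0
    frontier = [source]
    depth = 0

    while frontier:
        depth += 1
        # Parallel step 1: each frontier vertex reports its neighbor count.
        degrees = [row_offsets[v + 1] - row_offsets[v] for v in frontier]
        # Step 2: the exclusive scan turns counts into write offsets.
        offsets, total = exclusive_scan(degrees)
        # Parallel step 3: each vertex copies its neighbors into its own
        # disjoint slice of the output, so no two writers collide.
        gathered = [0] * total
        for i, v in enumerate(frontier):
            start = row_offsets[v]
            for k in range(degrees[i]):
                gathered[offsets[i] + k] = col_indices[start + k]
        # Step 4: keep only unvisited vertices, dropping duplicates.
        next_frontier = []
        for u in gathered:
            if dist[u] == -1:
                dist[u] = depth
                next_frontier.append(u)
        frontier = next_frontier

    return dist


if __name__ == "__main__":
    # Edges: 0->1, 0->2, 1->3, 2->3; vertex 4 is isolated.
    print(bfs_prefix_sum([0, 2, 3, 4, 4, 4], [1, 2, 3, 3], 0))
    # prints [0, 1, 1, 2, -1]
```

Because every vertex enters the frontier at most once, each adjacency list is gathered at most once, which is where the O(|V| + |E|) total work bound comes from in this sketch.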


Published In

ACM Transactions on Parallel Computing Volume 1, Issue 2

Special Issue on PPOPP 2012

January 2015

224 pages

ISSN: 2329-4949

EISSN: 2329-4957

DOI: 10.1145/2737841

Editor:
Phillip B. Gibbons
Intel Labs, Pittsburgh, USA

Copyright © 2015 Owner/Author.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 18 February 2015

Accepted: 01 November 2014

Revised: 01 November 2014

Received: 01 June 2013

Qualifiers

Research-article
Research
Refereed

Funding Sources

NVIDIA Graduate Fellowship
