
High-Performance and Scalable GPU Graph Traversal

Published: 18 February 2015

Abstract

Breadth-First Search (BFS) is a core primitive for graph traversal and a basis for many higher-level graph analysis algorithms. It is also representative of a class of parallel computations whose memory accesses and work distribution are both irregular and data dependent. Recent work has demonstrated the plausibility of GPU sparse graph traversal, but has tended to focus on asymptotically inefficient algorithms that perform poorly on graphs with nontrivial diameter.
We present a BFS parallelization focused on fine-grained task management constructed from efficient prefix sum computations that achieves an asymptotically optimal O(|V| + |E|) work complexity. Our implementation delivers excellent performance on diverse graphs, achieving traversal rates in excess of 3.3 billion and 8.3 billion traversed edges per second using single- and quad-GPU configurations, respectively. This level of performance is several times faster than state-of-the-art implementations on both CPU and GPU platforms.
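The role of prefix sums in a linear-work traversal can be illustrated with a small sequential sketch. This is not the authors' GPU implementation. It assumes a graph stored in compressed sparse row (CSR) form, and the names `exclusive_scan`, `bfs_levels`, `row_offsets`, and `column_indices` are hypothetical, chosen only for this example. In each BFS level, an exclusive scan over the frontier's vertex degrees gives every vertex a private range of output slots for its neighbors, and a second scan over "newly visited" flags compacts those neighbors into the next frontier. On a GPU the loops would be executed by many threads at once, which is why the slot offsets have to be computed up front.

```python
from itertools import accumulate


def exclusive_scan(values):
    """Return (offsets, total), where offsets[i] is the sum of values[:i]."""
    offsets = [0]
    offsets.extend(accumulate(values))
    total = offsets.pop()  # the last running sum is the overall total
    return offsets, total


def bfs_levels(row_offsets, column_indices, source):
    """Level-synchronous BFS over a CSR graph; returns depth per vertex (-1 if unreached)."""
    num_vertices = len(row_offsets) - 1
    depth = [-1] * num_vertices
    depth[source] = 0
    frontier = [source]
    level = 0
    while frontier:
        # Expansion: a scan over degrees assigns each frontier vertex
        # a disjoint slot range in the edge frontier.
        degrees = [row_offsets[v + 1] - row_offsets[v] for v in frontier]
        offsets, total = exclusive_scan(degrees)
        edge_frontier = [0] * total
        for v, base in zip(frontier, offsets):
            start = row_offsets[v]
            for k in range(row_offsets[v + 1] - start):
                edge_frontier[base + k] = column_indices[start + k]

        # Contraction: flag neighbors seen for the first time, then use a
        # second scan to compact them into the next vertex frontier.
        level += 1
        flags = []
        for n in edge_frontier:
            if depth[n] == -1:
                depth[n] = level
                flags.append(1)
            else:
                flags.append(0)
        slots, count = exclusive_scan(flags)
        next_frontier = [0] * count
        for n, flag, slot in zip(edge_frontier, flags, slots):
            if flag:
                next_frontier[slot] = n
        frontier = next_frontier
    return depth
```

Because each vertex enters a frontier at most once and each edge is gathered at most once, the total work in this sketch is proportional to |V| + |E|, matching the bound stated in the abstract.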


Information & Contributors

Information

Published In

ACM Transactions on Parallel Computing  Volume 1, Issue 2
Special Issue on PPOPP 2012
January 2015
224 pages
ISSN:2329-4949
EISSN:2329-4957
DOI:10.1145/2737841
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 18 February 2015
Accepted: 01 November 2014
Revised: 01 November 2014
Received: 01 June 2013
Published in TOPC Volume 1, Issue 2

Author Tags

  1. Breadth-first search
  2. GPU
  3. graph algorithms
  4. graph traversal
  5. parallel algorithms
  6. prefix sum
  7. sparse graphs

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • NVIDIA Graduate Fellowship
