Parallel breadth-first search on distributed memory systems

Authors:

Kamesh Madduri

SC '11: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis

Article No.: 65, Pages 1 - 12

https://doi.org/10.1145/2063384.2063471

Published: 12 November 2011

Abstract

Data-intensive, graph-based computations are pervasive in several scientific applications, and are known to be quite challenging to implement on distributed memory systems. In this work, we explore the design space of parallel algorithms for Breadth-First Search (BFS), a key subroutine in several graph algorithms. We present two highly-tuned parallel approaches for BFS on large parallel systems: a level-synchronous strategy that relies on a simple vertex-based partitioning of the graph, and a two-dimensional sparse matrix partitioning-based approach that mitigates parallel communication overhead. For both approaches, we also present hybrid versions with intra-node multithreading. Our novel hybrid two-dimensional algorithm reduces communication times by up to a factor of 3.5, relative to a common vertex-based approach. Our experimental study identifies execution regimes in which these approaches will be competitive, and we demonstrate extremely high performance on leading distributed-memory parallel systems. For instance, for a 40,000-core parallel execution on Hopper, an AMD Magny-Cours based system, we achieve a BFS performance rate of 17.8 billion edge visits per second on an undirected graph of 4.3 billion vertices and 68.7 billion edges with skewed degree distribution.
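
To make the level-synchronous, vertex-partitioned strategy named in the abstract concrete, the following is a minimal single-process sketch. It is not the authors' implementation: the function name `bfs_1d_partitioned`, the modulo ownership rule in `owner`, and the in-memory "exchange" standing in for a real all-to-all message exchange are all assumptions made for illustration.

```python
from collections import defaultdict


def bfs_1d_partitioned(adj, source, num_ranks):
    """Level-synchronous BFS with vertices split among simulated ranks.

    adj: dict mapping each vertex (int) to a list of neighbor vertices.
    Returns a dict {vertex: BFS level} for every vertex reachable from source.
    """
    # Hypothetical ownership rule: vertex v belongs to rank v mod num_ranks.
    def owner(v):
        return v % num_ranks

    # Each rank records levels only for the vertices it owns.
    level = [dict() for _ in range(num_ranks)]
    frontier = [[] for _ in range(num_ranks)]

    level[owner(source)][source] = 0
    frontier[owner(source)].append(source)
    depth = 0

    # One loop iteration corresponds to one BFS level (one synchronization step).
    while any(frontier):
        depth += 1

        # Step 1: every rank scans its local frontier and buckets the
        # discovered neighbors by the rank that owns them.
        outbox = [defaultdict(list) for _ in range(num_ranks)]
        for rank in range(num_ranks):
            for u in frontier[rank]:
                for v in adj.get(u, ()):
                    outbox[rank][owner(v)].append(v)

        # Step 2: simulated all-to-all exchange. In a distributed run this
        # would be a collective communication call instead of list copies.
        inbox = [[] for _ in range(num_ranks)]
        for rank in range(num_ranks):
            for dest, vertices in outbox[rank].items():
                inbox[dest].extend(vertices)

        # Step 3: each owner keeps only previously unvisited vertices;
        # these form its share of the next frontier.
        for rank in range(num_ranks):
            next_frontier = []
            for v in inbox[rank]:
                if v not in level[rank]:
                    level[rank][v] = depth
                    next_frontier.append(v)
            frontier[rank] = next_frontier

    # Gather the per-rank results into one dictionary for inspection.
    merged = {}
    for part in level:
        merged.update(part)
    return merged
```

In this sketch every discovered neighbor is sent to its owner at every level, so the exchange in Step 2 involves all ranks. That per-level all-to-all traffic is the kind of communication overhead the abstract says the two-dimensional partitioning is designed to mitigate.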


Published In

SC '11: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis

November 2011

866 pages

ISBN: 9781450307710

DOI: 10.1145/2063384

Conference Chair: Scott Lathrop, University of Chicago

Program Chairs: Jim Costa, Sandia National Laboratories; William Kramer, National Center for Supercomputing Applications

Copyright © 2011 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGARCH: ACM Special Interest Group on Computer Architecture
IEEE-CS: Computer Society

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

U.S. Department of Energy

Conference

SC '11

Sponsor:

SIGARCH
IEEE-CS

SC '11: International Conference for High Performance Computing, Networking, Storage and Analysis

November 12 - 18, 2011

Seattle, Washington

Acceptance Rates

SC '11 Paper Acceptance Rate 74 of 352 submissions, 21%;

Overall Acceptance Rate 1,516 of 6,373 submissions, 24%
