DOI: 10.1145/2063384.2063471
research-article

Parallel breadth-first search on distributed memory systems

Published: 12 November 2011

    Abstract

    Data-intensive, graph-based computations are pervasive in several scientific applications, and are known to be quite challenging to implement on distributed memory systems. In this work, we explore the design space of parallel algorithms for Breadth-First Search (BFS), a key subroutine in several graph algorithms. We present two highly-tuned parallel approaches for BFS on large parallel systems: a level-synchronous strategy that relies on a simple vertex-based partitioning of the graph, and a two-dimensional sparse matrix partitioning-based approach that mitigates parallel communication overhead. For both approaches, we also present hybrid versions with intra-node multithreading. Our novel hybrid two-dimensional algorithm reduces communication times by up to a factor of 3.5, relative to a common vertex-based approach. Our experimental study identifies execution regimes in which these approaches will be competitive, and we demonstrate extremely high performance on leading distributed-memory parallel systems. For instance, for a 40,000-core parallel execution on Hopper, an AMD Magny-Cours-based system, we achieve a BFS performance rate of 17.8 billion edge visits per second on an undirected graph of 4.3 billion vertices and 68.7 billion edges with skewed degree distribution.
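
The level-synchronous, vertex-partitioned strategy named in the abstract can be illustrated with a minimal single-process sketch. This is not the authors' implementation: the block partition in `owner`, the function name `bfs_1d`, and the use of plain Python lists to stand in for the all-to-all message exchange are all assumptions made for illustration. A real distributed version would replace the simulated exchange with actual inter-process communication.

```python
def owner(v, n, p):
    """Rank owning vertex v under an assumed block partition of n vertices over p ranks."""
    block = (n + p - 1) // p  # ceil(n / p) vertices per rank
    return v // block


def bfs_1d(adj, n, p, source):
    """Level-synchronous BFS over p simulated ranks.

    adj[u] lists the neighbors of u. Returns the BFS level of every
    vertex, with -1 for vertices unreachable from source.
    """
    level = [-1] * n
    level[source] = 0
    # frontier[r] holds the current-level vertices owned by rank r.
    frontier = [[] for _ in range(p)]
    frontier[owner(source, n, p)].append(source)
    depth = 0
    while any(frontier):
        # Step 1: each rank scans its local frontier and buckets the
        # discovered neighbors by the rank that owns them.
        outbox = [[[] for _ in range(p)] for _ in range(p)]
        for r in range(p):
            for u in frontier[r]:
                for v in adj[u]:
                    outbox[r][owner(v, n, p)].append(v)
        # Step 2: all-to-all exchange, simulated here by list concatenation.
        inbox = [[v for s in range(p) for v in outbox[s][r]] for r in range(p)]
        # Step 3: each rank marks the unvisited vertices it owns; these
        # form the next frontier. All ranks advance one level together.
        depth += 1
        frontier = [[] for _ in range(p)]
        for r in range(p):
            for v in inbox[r]:
                if level[v] == -1:
                    level[v] = depth
                    frontier[r].append(v)
    return level


# Small undirected example: edges 0-1, 0-2, 1-3, 2-3, 3-4; vertex 5 is isolated.
example_adj = [[1, 2], [0, 3], [0, 3], [1, 2, 4], [3], []]
print(bfs_1d(example_adj, n=6, p=2, source=0))
```

In this sketch every frontier edge produces one message entry in step 2, regardless of whether the target was already visited. That exchange is the communication cost that a two-dimensional partitioning of the adjacency matrix aims to reduce.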

    Published In

    SC '11: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
    November 2011
    866 pages
    ISBN:9781450307710
    DOI:10.1145/2063384

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Acceptance Rates

    SC '11 Paper Acceptance Rate 74 of 352 submissions, 21%;
    Overall Acceptance Rate 1,516 of 6,373 submissions, 24%
