DOI: 10.1145/1504176.1504212

Petascale computing with accelerators

Published: 14 February 2009
    Abstract

    A trend is developing in high performance computing in which commodity processors are coupled to various types of computational accelerators. Such systems are commonly called hybrid systems. In this paper, we describe our experience developing an implementation of the Linpack benchmark for a petascale hybrid system, the LANL Roadrunner cluster built by IBM for Los Alamos National Laboratory. This system combines traditional x86-64 host processors with IBM PowerXCell 8i accelerator processors. The implementation of Linpack we developed was the first to achieve a performance result in excess of 1.0 PFLOPS, and made Roadrunner the #1 system on the Top500 list in June 2008. We describe the design and implementation of hybrid Linpack, including the special optimizations we developed for this hybrid architecture. We then present actual results for single-node and multi-node executions. From this work, we conclude that it is possible to achieve high performance for certain applications on hybrid architectures when careful attention is given to efficient use of memory bandwidth, scheduling of data movement between the host and accelerator memories, and proper distribution of work between the host and accelerator processors.
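    The abstract's point about scheduling data movement between host and accelerator memories is essentially a double-buffering pipeline: while the accelerator computes on one tile, the host stages the transfer of the next. The sketch below is a hypothetical illustration of that overlap pattern, not the authors' code; `stage_tile` and `compute_tile` are stand-ins for a host-to-accelerator DMA transfer and an accelerator kernel (e.g., a DGEMM update).

    ```python
    # Minimal sketch of double buffering: overlap the transfer of tile i+1
    # with the computation on tile i, hiding transfer latency behind compute.
    from concurrent.futures import ThreadPoolExecutor

    def stage_tile(tile):
        # Stand-in for a host->accelerator data transfer.
        return [float(x) for x in tile]

    def compute_tile(staged):
        # Stand-in for an accelerator kernel working on staged data.
        return sum(staged)

    def pipelined_update(tiles):
        """Process tiles so that staging of tile i+1 overlaps compute on tile i."""
        results = []
        with ThreadPoolExecutor(max_workers=1) as xfer:
            next_staged = xfer.submit(stage_tile, tiles[0])
            for i in range(len(tiles)):
                staged = next_staged.result()           # wait for tile i's transfer
                if i + 1 < len(tiles):
                    next_staged = xfer.submit(stage_tile, tiles[i + 1])
                results.append(compute_tile(staged))    # compute while next tile stages
        return results

    print(pipelined_update([[1, 2], [3, 4], [5, 6]]))   # [3.0, 7.0, 11.0]
    ```

    On real hardware the transfer thread would be replaced by asynchronous DMA, but the scheduling structure, one buffer in flight while another is being consumed, is the same.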




    Reviews

    William I. Thacker

    There continues to be an increasing demand for more powerful computers. The current trend is to improve performance by including more processing elements, rather than improving the performance of a single processor. The difficulty with this scheme is communicating values between processors. The paper addresses this issue for one of the fastest supercomputers currently available. The Linpack benchmark has become a standard for measuring some aspects of a system's performance. This paper describes the techniques used to implement the Linpack benchmark on the Los Alamos National Laboratory's Roadrunner cluster, in order to achieve performance of over 1.0 PFLOPS. This system has 17 clusters of 180 compute nodes, where each node has two PowerXCell 8i accelerator processors connected to a four-core x86-64 processor. This well-written paper describes how Kistler et al. optimize computational kernels (some in assembly language), find ways to minimize communications, overlap communication with computation, and divide the work among heterogeneous processors, in order to "achieve high performance for certain applications on hybrid architectures." This is an excellent case study of how to adapt a frequently used system to take advantage of a specialized computer system. The techniques used can be adapted for many other types of systems.

    Online Computing Reviews Service
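    The review's phrase "divide the work among heterogeneous processors" usually means a static split proportional to each processor's measured throughput, so that host and accelerator finish their shares at the same time. The function below is a hypothetical illustration of that idea (the names and GFLOPS figures are made up for the example, not taken from the paper).

    ```python
    # Split n_rows of a matrix update between a host core and an accelerator
    # in proportion to their measured throughputs, so both finish together.
    def split_rows(n_rows, host_gflops, accel_gflops):
        total = host_gflops + accel_gflops
        host_share = round(n_rows * host_gflops / total)
        return host_share, n_rows - host_share

    # With an 8 GFLOPS host core and a 100 GFLOPS accelerator,
    # the host handles about 7% of the rows.
    print(split_rows(1000, 8.0, 100.0))   # (74, 926)
    ```

    In practice such ratios are tuned empirically, since effective throughput depends on problem size and memory bandwidth, not just peak FLOPS.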


    Published In

    PPoPP '09: Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming, February 2009, 322 pages. ISBN: 9781605583976. DOI: 10.1145/1504176

    Also published in ACM SIGPLAN Notices, Volume 44, Issue 4 (PPoPP '09), April 2009, 294 pages. ISSN: 0362-1340. EISSN: 1558-1160. DOI: 10.1145/1594835

    Publisher

    Association for Computing Machinery, New York, NY, United States


    Author Tags

    1. accelerators
    2. hybrid programming models

    Qualifiers

    • Research-article

    Conference

    PPoPP '09

    Acceptance Rates

    Overall Acceptance Rate 230 of 1,014 submissions, 23%


    Article Metrics

    • Downloads (Last 12 months)7
    • Downloads (Last 6 weeks)1

    Cited By
    • (2020) Parallel programming models for heterogeneous many-cores: a comprehensive survey. CCF Transactions on High Performance Computing 2:4, 382-400. DOI: 10.1007/s42514-020-00039-4
    • (2015) Accelerating LINPACK with MPI-OpenCL on Clusters of Multi-GPU Nodes. IEEE Transactions on Parallel and Distributed Systems 26:7, 1814-1825. DOI: 10.1109/TPDS.2014.2321742
    • (2015) A Flexible and Portable Large-Scale DGEMM Library for Linpack on Next-Generation Multi-GPU Systems. Proceedings of the 23rd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, 664-668. DOI: 10.1109/PDP.2015.89
    • (2012) Work Stealing on Hybrid Architectures. 13th Symposium on Computer Systems, 17-24. DOI: 10.1109/WSCAD-SSC.2012.28
    • (2011) Optimizing linpack benchmark on GPU-accelerated petascale supercomputer. Journal of Computer Science and Technology 26:5, 854-865. DOI: 10.1007/s11390-011-0184-1
    • (2010) State-of-the-art in heterogeneous computing. Scientific Programming 18:1, 1-33. DOI: 10.1155/2010/540159
    • (2010) Linpack evaluation on a supercomputer with heterogeneous accelerators. 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS), 1-8. DOI: 10.1109/IPDPS.2010.5470353
    • (2010) COMIC++: A software SVM system for heterogeneous multicore accelerator clusters. The Sixteenth International Symposium on High-Performance Computer Architecture (HPCA-16), 1-12. DOI: 10.1109/HPCA.2010.5416633
    • (2010) Adaptive Optimization for Petascale Heterogeneous CPU/GPU Computing. Proceedings of the 2010 IEEE International Conference on Cluster Computing, 19-28. DOI: 10.1109/CLUSTER.2010.12
    • (2009) Programming the Linpack benchmark for the IBM PowerXCell 8i processor. Scientific Programming 17:1-2, 43-57. DOI: 10.1155/2009/401691
