DOI: 10.1145/1504176.1504212

Petascale computing with accelerators

Published: 14 February 2009
    Abstract

    A trend is developing in high performance computing in which commodity processors are coupled to various types of computational accelerators. Such systems are commonly called hybrid systems. In this paper, we describe our experience developing an implementation of the Linpack benchmark for a petascale hybrid system, the LANL Roadrunner cluster built by IBM for Los Alamos National Laboratory. This system combines traditional x86-64 host processors with IBM PowerXCell 8i accelerator processors. The implementation of Linpack we developed was the first to achieve a performance result in excess of 1.0 PFLOPS, and made Roadrunner the #1 system on the Top500 list in June 2008. We describe the design and implementation of hybrid Linpack, including the special optimizations we developed for this hybrid architecture. We then present actual results for single-node and multi-node executions. From this work, we conclude that it is possible to achieve high performance for certain applications on hybrid architectures when careful attention is given to efficient use of memory bandwidth, scheduling of data movement between the host and accelerator memories, and proper distribution of work between the host and accelerator processors.
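    The abstract's point about scheduling data movement between host and accelerator memories is essentially a double-buffering pipeline: while the accelerator computes on one tile, the host stages the transfer of the next. The sketch below is a hypothetical illustration of that overlap pattern, not the authors' code; `stage_tile` and `compute_tile` are stand-ins for a host-to-accelerator DMA transfer and an accelerator kernel (e.g., a DGEMM update).

    ```python
    # Minimal sketch of double buffering: overlap the transfer of tile i+1
    # with the computation on tile i, hiding transfer latency behind compute.
    from concurrent.futures import ThreadPoolExecutor

    def stage_tile(tile):
        # Stand-in for a host->accelerator data transfer.
        return [float(x) for x in tile]

    def compute_tile(staged):
        # Stand-in for an accelerator kernel working on staged data.
        return sum(staged)

    def pipelined_update(tiles):
        """Process tiles so that staging of tile i+1 overlaps compute on tile i."""
        results = []
        with ThreadPoolExecutor(max_workers=1) as xfer:
            next_staged = xfer.submit(stage_tile, tiles[0])
            for i in range(len(tiles)):
                staged = next_staged.result()           # wait for tile i's transfer
                if i + 1 < len(tiles):
                    next_staged = xfer.submit(stage_tile, tiles[i + 1])
                results.append(compute_tile(staged))    # compute while next tile stages
        return results

    print(pipelined_update([[1, 2], [3, 4], [5, 6]]))   # [3.0, 7.0, 11.0]
    ```

    On real hardware the transfer thread would be replaced by asynchronous DMA, but the scheduling structure, one buffer in flight while another is being consumed, is the same.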




    Reviews

    William I. Thacker

    There continues to be an increasing demand for more powerful computers. The current trend is to improve performance by including more processing elements, rather than improving the performance of a single processor. The difficulty with this scheme is communicating values between processors. The paper addresses this issue for one of the fastest supercomputers currently available. The Linpack benchmark has become a standard for measuring some aspects of a system's performance. This paper describes the techniques used to implement the Linpack benchmark on the Los Alamos National Laboratory's Roadrunner cluster, in order to achieve performance of over 1.0 PFLOPS. This system has 17 clusters of 180 compute nodes, where each node has two PowerXCell 8i accelerator processors connected to a four-core x86-64 processor. This well-written paper describes how Kistler et al. optimize computational kernels (some in assembly language), find ways to minimize communications, overlap communication with computation, and divide the work among heterogeneous processors, in order to "achieve high performance for certain applications on hybrid architectures." This is an excellent case study of how to adapt a frequently used system to take advantage of a specialized computer system. The techniques used can be adapted for many other types of systems.

    Online Computing Reviews Service
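    The review's phrase "divide the work among heterogeneous processors" usually means a static split proportional to each processor's measured throughput, so that host and accelerator finish their shares at the same time. The function below is a hypothetical illustration of that idea (the names and GFLOPS figures are made up for the example, not taken from the paper).

    ```python
    # Split n_rows of a matrix update between a host core and an accelerator
    # in proportion to their measured throughputs, so both finish together.
    def split_rows(n_rows, host_gflops, accel_gflops):
        total = host_gflops + accel_gflops
        host_share = round(n_rows * host_gflops / total)
        return host_share, n_rows - host_share

    # With an 8 GFLOPS host core and a 100 GFLOPS accelerator,
    # the host handles about 7% of the rows.
    print(split_rows(1000, 8.0, 100.0))   # (74, 926)
    ```

    In practice such ratios are tuned empirically, since effective throughput depends on problem size and memory bandwidth, not just peak FLOPS.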


    Published In

    PPoPP '09: Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming, February 2009, 322 pages. ISBN: 9781605583976. DOI: 10.1145/1504176

    Also published in ACM SIGPLAN Notices, Volume 44, Issue 4 (PPoPP '09), April 2009, 294 pages. ISSN: 0362-1340. EISSN: 1558-1160. DOI: 10.1145/1594835

    Publisher

    Association for Computing Machinery, New York, NY, United States


    Author Tags

    1. accelerators
    2. hybrid programming models

    Qualifiers

    • Research-article

    Conference

    PPoPP '09

    Acceptance Rates

    Overall Acceptance Rate 230 of 1,014 submissions, 23%


    Article Metrics

    • Downloads (Last 12 months)7
    • Downloads (Last 6 weeks)1

    Cited By
    • (2020) Parallel programming models for heterogeneous many-cores: a comprehensive survey. CCF Transactions on High Performance Computing 2:4, 382-400. DOI: 10.1007/s42514-020-00039-4
    • (2015) Accelerating LINPACK with MPI-OpenCL on Clusters of Multi-GPU Nodes. IEEE Transactions on Parallel and Distributed Systems 26:7, 1814-1825. DOI: 10.1109/TPDS.2014.2321742
    • (2015) A Flexible and Portable Large-Scale DGEMM Library for Linpack on Next-Generation Multi-GPU Systems. Proceedings of the 23rd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, 664-668. DOI: 10.1109/PDP.2015.89
    • (2012) Work Stealing on Hybrid Architectures. 13th Symposium on Computer Systems, 17-24. DOI: 10.1109/WSCAD-SSC.2012.28
    • (2011) Optimizing linpack benchmark on GPU-accelerated petascale supercomputer. Journal of Computer Science and Technology 26:5, 854-865. DOI: 10.1007/s11390-011-0184-1
    • (2010) State-of-the-art in heterogeneous computing. Scientific Programming 18:1, 1-33. DOI: 10.1155/2010/540159
    • (2010) Linpack evaluation on a supercomputer with heterogeneous accelerators. 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS), 1-8. DOI: 10.1109/IPDPS.2010.5470353
    • (2010) COMIC++: A software SVM system for heterogeneous multicore accelerator clusters. The Sixteenth International Symposium on High-Performance Computer Architecture (HPCA-16), 1-12. DOI: 10.1109/HPCA.2010.5416633
    • (2010) Adaptive Optimization for Petascale Heterogeneous CPU/GPU Computing. Proceedings of the 2010 IEEE International Conference on Cluster Computing, 19-28. DOI: 10.1109/CLUSTER.2010.12
    • (2009) Programming the Linpack benchmark for the IBM PowerXCell 8i processor. Scientific Programming 17:1-2, 43-57. DOI: 10.1155/2009/401691
