research-article

Extracting ultra-scale Lattice Boltzmann performance via hierarchical and distributed auto-tuning

Authors: Samuel Williams, Jonathan Carter, and John Shalf

SC '11: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis

November 2011

Article No.: 55, Pages 1 - 12

https://doi.org/10.1145/2063384.2063458

Published: 12 November 2011

Abstract

We are witnessing a rapid evolution of HPC node architectures and on-chip parallelism as power and cooling constraints limit increases in microprocessor clock speeds. In this work, we demonstrate a hierarchical approach towards effectively extracting performance for a variety of emerging multicore-based supercomputing platforms. Our examined application is a structured grid-based Lattice Boltzmann computation that simulates homogeneous isotropic turbulence in magnetohydrodynamics. First, we examine sophisticated sequential auto-tuning techniques including loop transformations, virtual vectorization, and use of ISA-specific intrinsics. Next, we present a variety of parallel optimization approaches including programming model exploration (flat MPI, MPI/OpenMP, and MPI/Pthreads), as well as data and thread decomposition strategies designed to mitigate communication bottlenecks. Finally, we evaluate the impact of our hierarchical tuning techniques using a variety of problem sizes via large-scale simulations on state-of-the-art Cray XT4, Cray XE6, and IBM BlueGene/P platforms. Results show that our unique tuning approach improves performance and energy requirements by up to 3.4x using 49,152 cores, while providing a portable optimization methodology for a variety of numerical methods on forthcoming HPC systems.
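
The search-based tuning that the abstract describes can be illustrated with a minimal sketch. It is not the authors' tuner, whose implementation this page does not show. The 3-point stencil kernel, the `stencil_blocked` and `autotune` names, and the candidate block sizes are all hypothetical, chosen only to show the general pattern of timing several variants of one kernel and keeping the fastest.

```python
import time


def stencil_blocked(src, block):
    """Apply a 3-point averaging stencil to the interior of src, one block at a time.

    Hypothetical stand-in kernel: `block` is the tuning parameter and only
    changes the loop structure, never the numerical result.
    """
    n = len(src)
    dst = list(src)  # boundary values are copied unchanged
    for start in range(1, n - 1, block):
        stop = min(start + block, n - 1)
        for i in range(start, stop):
            dst[i] = (src[i - 1] + src[i] + src[i + 1]) / 3.0
    return dst


def autotune(kernel, data, candidates, trials=3):
    """Time kernel(data, c) for every candidate c and return (best, timings)."""
    timings = {}
    for c in candidates:
        best = float("inf")
        for _ in range(trials):
            t0 = time.perf_counter()
            kernel(data, c)
            best = min(best, time.perf_counter() - t0)
        timings[c] = best  # keep the fastest of the trials to reduce noise
    return min(timings, key=timings.get), timings


grid = [float(i % 7) for i in range(4096)]
candidates = [16, 64, 256, 1024]  # assumed search space, for illustration only
best_block, timings = autotune(stencil_blocked, grid, candidates)
print("fastest block size on this machine:", best_block)
```

The winning block size depends on the machine and on timing noise, which is the reason for searching empirically rather than fixing one value in the source. A hierarchical tuner in the spirit of the abstract would run a search like this per node first and then repeat it over parallel decomposition choices.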

Cited By

Yousef A, Draeger E, Randles A (2023). Low-Cost Post Hoc Reconstruction of HPC Simulations at Full Resolution. 2023 IEEE 13th Symposium on Large Data Analysis and Visualization (LDAV), pp. 17-21. Online publication date: 23-Oct-2023.
https://doi.org/10.1109/LDAV60332.2023.00009
Cámara J, Cuenca J, Giménez D (2020). Integrating software and hardware hierarchies in an autotuning method for parallel routines in heterogeneous clusters. The Journal of Supercomputing. Online publication date: 7-Mar-2020.
https://doi.org/10.1007/s11227-020-03235-9
Loffeld J, Hittinger J (2019). On the arithmetic intensity of high-order finite-volume discretizations for hyperbolic systems of conservation laws. International Journal of High Performance Computing Applications, 33:1, pp. 25-52. Online publication date: 1-Jan-2019.
https://dl.acm.org/doi/10.1177/1094342017691876

Index Terms

Extracting ultra-scale Lattice Boltzmann performance via hierarchical and distributed auto-tuning
1. Mathematics of computing
  1. Mathematical analysis
    1. Numerical analysis
      1. Numerical differentiation
  2. Mathematical software
2. Theory of computation
  1. Design and analysis of algorithms

Published In

SC '11: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis

November 2011

866 pages

ISBN:9781450307710

DOI:10.1145/2063384

Conference Chair: Scott Lathrop (University of Chicago)
Program Chairs: Jim Costa (Sandia National Laboratories), William Kramer (National Center for Supercomputing Applications)

Copyright © 2011 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGARCH: ACM Special Interest Group on Computer Architecture
IEEE-CS: Computer Society

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

Advanced Scientific Computing Research

Conference

SC '11

SC '11: International Conference for High Performance Computing, Networking, Storage and Analysis

November 12 - 18, 2011

Seattle, Washington

Acceptance Rates

SC '11 Paper Acceptance Rate 74 of 352 submissions, 21%;

Overall Acceptance Rate 1,516 of 6,373 submissions, 24%

Bibliometrics

Total Citations: 32
Total Downloads: 219
Downloads (Last 12 months): 6
Downloads (Last 6 weeks): 3

Citations

Zhao T, Basu P, Williams S, Hall M, Johansen H, Taufer M, Balaji P, Peña A (2019). Exploiting reuse and vectorization in blocked stencil computations on CPUs and GPUs. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-44. Online publication date: 17-Nov-2019.
https://dl.acm.org/doi/10.1145/3295500.3356210
Vardhan M, Gounley J, Hegele L, Draeger E, Randles A, Taufer M, Balaji P, Peña A (2019). Moment representation in the lattice Boltzmann method on massively parallel hardware. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-21. Online publication date: 17-Nov-2019.
https://dl.acm.org/doi/10.1145/3295500.3356204
Ames J, Rizzi S, Insley J, Patel S, Hernaandez B, Draeger E, Randles A (2019). Low-Overhead In Situ Visualization Using Halo Replay. 2019 IEEE 9th Symposium on Large Data Analysis and Visualization (LDAV), pp. 16-26. Online publication date: Oct-2019.
https://doi.org/10.1109/LDAV48142.2019.8944265
Stawinoga N, Field T (2018). Predictable Thread Coarsening. ACM Transactions on Architecture and Code Optimization, 15:2, pp. 1-26. Online publication date: 12-Jun-2018.
https://dl.acm.org/doi/10.1145/3194242
Zhao T, Williams S, Hall M, Johansen H (2018). Delivering Performance-Portable Stencil Computations on CPUs and GPUs Using Bricks. 2018 IEEE/ACM International Workshop on Performance, Portability and Productivity in HPC (P3HPC), pp. 59-70. Online publication date: Nov-2018.
https://doi.org/10.1109/P3HPC.2018.00009
Geier M, Schönherr M (2017). Esoteric Twist: An Efficient in-Place Streaming Algorithmus for the Lattice Boltzmann Method on Massively Parallel Hardware. Computation, 5:4 (19). Online publication date: 23-Mar-2017.
https://doi.org/10.3390/computation5020019
Chaimov N, Ibrahim K, Williams S, Iancu C (2017). Reaching bandwidth saturation using transparent injection parallelization. International Journal of High Performance Computing Applications, 31:5, pp. 405-421. Online publication date: 1-Sep-2017.
https://dl.acm.org/doi/10.1177/1094342016672720