Research article, DOI: 10.1145/2063384.2063458

Extracting ultra-scale Lattice Boltzmann performance via hierarchical and distributed auto-tuning

Published: 12 November 2011

    Abstract

    We are witnessing a rapid evolution of HPC node architectures and on-chip parallelism as power and cooling constraints limit increases in microprocessor clock speeds. In this work, we demonstrate a hierarchical approach towards effectively extracting performance for a variety of emerging multicore-based supercomputing platforms. Our examined application is a structured grid-based Lattice Boltzmann computation that simulates homogeneous isotropic turbulence in magnetohydrodynamics. First, we examine sophisticated sequential auto-tuning techniques including loop transformations, virtual vectorization, and use of ISA-specific intrinsics. Next, we present a variety of parallel optimization approaches including programming model exploration (flat MPI, MPI/OpenMP, and MPI/Pthreads), as well as data and thread decomposition strategies designed to mitigate communication bottlenecks. Finally, we evaluate the impact of our hierarchical tuning techniques using a variety of problem sizes via large-scale simulations on state-of-the-art Cray XT4, Cray XE6, and IBM BlueGene/P platforms. Results show that our unique tuning approach improves performance and energy requirements by up to 3.4x using 49,152 cores, while providing a portable optimization methodology for a variety of numerical methods on forthcoming HPC systems.
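    The sequential auto-tuning idea mentioned in the abstract (generate variants of a loop nest, time each one, keep the fastest) can be illustrated with a minimal sketch. This is not the authors' code: the 3-point periodic stencil, the function names `sweep_reference`, `sweep_blocked`, and `autotune`, and the candidate block sizes are all assumptions chosen for illustration, and pure Python is used only to keep the example self-contained.

```python
import time


def sweep_reference(src):
    """One 3-point averaging sweep over a periodic 1D grid (untransformed loop)."""
    n = len(src)
    return [(src[(i - 1) % n] + src[i] + src[(i + 1) % n]) / 3.0 for i in range(n)]


def sweep_blocked(src, block):
    """Same sweep with the loop split into chunks of `block` points (loop blocking)."""
    n = len(src)
    dst = [0.0] * n
    for start in range(0, n, block):
        for i in range(start, min(start + block, n)):
            dst[i] = (src[(i - 1) % n] + src[i] + src[(i + 1) % n]) / 3.0
    return dst


def autotune(src, candidates, trials=3):
    """Time every candidate block size; return (fastest_block, timings)."""
    timings = {}
    for block in candidates:
        best = float("inf")
        for _ in range(trials):
            t0 = time.perf_counter()
            sweep_blocked(src, block)
            best = min(best, time.perf_counter() - t0)
        timings[block] = best  # keep the best of `trials` runs to reduce noise
    return min(timings, key=timings.get), timings


# Hypothetical problem size and search space.
grid = [float(i % 7) for i in range(4096)]
candidates = [16, 64, 256, 1024]
best_block, timings = autotune(grid, candidates)
print("fastest block size on this machine:", best_block)
```

    Every variant computes the same result as the reference sweep, so the search only has to compare run times. Which block size wins depends on the machine, which is why the choice is made empirically rather than fixed in the source.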



          Published In

          SC '11: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
          November 2011
          866 pages
          ISBN:9781450307710
          DOI:10.1145/2063384
          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

          Publisher

          Association for Computing Machinery

          New York, NY, United States


          Author Tags

          1. BlueGene
          2. Lattice Boltzmann
          3. OpenMP
          4. SIMD
          5. auto-tuning
          6. hybrid programming models

          Qualifiers

          • Research-article

          Conference

          SC '11
          Acceptance Rates

          SC '11 Paper Acceptance Rate 74 of 352 submissions, 21%;
          Overall Acceptance Rate 1,516 of 6,373 submissions, 24%
