
An Optimizing Code Generator for a Class of Lattice-Boltzmann Computations

Published: 27 May 2015 Publication History

Abstract

The Lattice-Boltzmann method (LBM), a promising new particle-based simulation technique for complex and multiscale fluid flows, has seen tremendous adoption in recent years in computational fluid dynamics. Even with a state-of-the-art LBM solver such as Palabos, a user still has to write the program manually using library-supplied primitives. We propose an automated code generator for a class of LBM computations with the objective of achieving high performance on modern architectures. Few studies have looked at time tiling for LBM codes. We exploit a key similarity between stencils and LBM to enable polyhedral optimizations and, in turn, time tiling for LBM. We also characterize the performance of LBM with the Roofline performance model. Experimental results for standard LBM simulations like Lid Driven Cavity, Flow Past Cylinder, and Poiseuille Flow show that our scheme consistently outperforms Palabos, on average by up to 3×, while running on 16 cores of an Intel Xeon (Sandy Bridge). We also obtain an improvement of 2.47× on the SPEC LBM benchmark.
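As a rough illustration of why time tiling matters under the Roofline model mentioned above, the sketch below computes the attainable performance as the minimum of peak compute and memory bandwidth times arithmetic intensity. Every number in it (peak rate, bandwidth, FLOPs per lattice update, the D3Q19-style count of 19 distributions, and the tiling depth) is a hypothetical placeholder rather than a figure from the paper. The assumption that tiling over T time steps divides memory traffic by T is an idealization that ignores tile-boundary overhead and write-allocate traffic.

```python
def roofline_gflops(peak_gflops, bandwidth_gbs, flops_per_update, bytes_per_update):
    """Attainable GFLOP/s under the Roofline model: min(peak, bandwidth * intensity)."""
    intensity = flops_per_update / bytes_per_update  # FLOPs per byte of memory traffic
    return min(peak_gflops, bandwidth_gbs * intensity)


# Hypothetical machine parameters (not taken from the paper).
PEAK_GFLOPS = 300.0
BANDWIDTH_GBS = 60.0

# Hypothetical kernel parameters for a lattice with Q distributions per node.
Q = 19                        # assumed D3Q19-style lattice
FLOPS_PER_UPDATE = 250.0      # assumed FLOPs per node update
BYTES_PER_UPDATE = 2 * Q * 8  # read and write Q doubles per node update

# Without time tiling, every time step streams the whole grid through memory.
untiled = roofline_gflops(PEAK_GFLOPS, BANDWIDTH_GBS, FLOPS_PER_UPDATE, BYTES_PER_UPDATE)

# Idealized time tiling: data loaded once is reused for T time steps,
# so memory traffic per node update shrinks by a factor of T.
T = 4
tiled = roofline_gflops(PEAK_GFLOPS, BANDWIDTH_GBS, FLOPS_PER_UPDATE, BYTES_PER_UPDATE / T)

print(f"untiled bound: {untiled:.1f} GFLOP/s")
print(f"tiled bound (T={T}): {tiled:.1f} GFLOP/s")
```

With these placeholder values the untiled kernel is bandwidth-bound, and raising the arithmetic intensity through temporal reuse lifts the bound toward the compute roof. Measured speedups depend on the real machine and kernel.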

Supplementary Material

TACO1202-14 (taco1202-14.pdf)
Slide deck associated with this paper

    Information

    Published In

    ACM Transactions on Architecture and Code Optimization, Volume 12, Issue 2
    July 2015
    410 pages
    ISSN: 1544-3566
    EISSN: 1544-3973
    DOI: 10.1145/2775085

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 27 May 2015
    Accepted: 01 February 2015
    Received: 01 January 2015
    Published in TACO Volume 12, Issue 2

    Author Tags

    1. Lattice-Boltzmann method
    2. performance modeling
    3. polyhedral framework
    4. time tiling

    Qualifiers

    • Research-article
    • Research
    • Refereed
