An Optimizing Code Generator for a Class of Lattice-Boltzmann Computations

Authors:

Irshad Pananilath,

Aravind Acharya,

Uday Bondhugula

ACM Transactions on Architecture and Code Optimization (TACO), Volume 12, Issue 2

Article No.: 14, Pages 1 - 23

https://doi.org/10.1145/2739047

Published: 27 May 2015

Abstract

The Lattice-Boltzmann method (LBM), a promising new particle-based simulation technique for complex and multiscale fluid flows, has seen tremendous adoption in recent years in computational fluid dynamics. Even with a state-of-the-art LBM solver such as Palabos, a user still has to write the program manually using library-supplied primitives. We propose an automated code generator for a class of LBM computations, with the objective of achieving high performance on modern architectures. Few studies have looked at time tiling for LBM codes. We exploit a key similarity between stencils and LBM to enable polyhedral optimizations and, in turn, time tiling for LBM. We also characterize the performance of LBM with the Roofline performance model. Experimental results for standard LBM simulations such as Lid Driven Cavity, Flow Past Cylinder, and Poiseuille Flow show that our scheme consistently outperforms Palabos, on average by up to 3× while running on 16 cores of an Intel Xeon (Sandy Bridge). We also obtain an improvement of 2.47× on the SPEC LBM benchmark.
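As a rough illustration of the Roofline model mentioned in the abstract, the sketch below computes the basic Roofline bound: attainable performance is the smaller of the machine's peak compute rate and its memory bandwidth multiplied by the kernel's arithmetic intensity. The function name and all machine and kernel figures are hypothetical, chosen only for illustration. They are not measurements from the paper.

```python
def roofline_gflops(peak_gflops: float,
                    bandwidth_gbs: float,
                    intensity_flops_per_byte: float) -> float:
    """Attainable GFLOP/s under the basic Roofline model.

    The bound is the smaller of the compute roof (peak_gflops) and the
    memory roof (bandwidth * arithmetic intensity).
    """
    return min(peak_gflops, bandwidth_gbs * intensity_flops_per_byte)


# Hypothetical machine figures (assumptions, not values from the paper).
PEAK_GFLOPS = 300.0     # peak floating-point rate
BANDWIDTH_GBS = 60.0    # sustained memory bandwidth

# Hypothetical arithmetic intensities in flops per byte of memory traffic.
for intensity in (0.5, 2.0, 8.0):
    bound = roofline_gflops(PEAK_GFLOPS, BANDWIDTH_GBS, intensity)
    regime = "memory-bound" if bound < PEAK_GFLOPS else "compute-bound"
    print(f"AI={intensity:4.1f} flops/byte -> {bound:6.1f} GFLOP/s ({regime})")
```

Under this model, a kernel whose intensity falls left of the ridge point (here 300 / 60 = 5 flops/byte) is limited by memory bandwidth, so a transformation that reuses data across time steps can raise the bound only by raising the effective intensity.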

Index Terms

Software and its engineering → Software notations and tools → Compilers

Published In

ACM Transactions on Architecture and Code Optimization Volume 12, Issue 2

July 2015

410 pages

ISSN:1544-3566

EISSN:1544-3973

DOI:10.1145/2775085

Editor:
Koen De Bosschere
Ghent University

Copyright © 2015 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 27 May 2015

Accepted: 01 February 2015

Received: 01 January 2015

Published in TACO Volume 12, Issue 2

Qualifiers

Research-article
Research
Refereed
