Performance Evaluation of a Two-Dimensional Lattice Boltzmann Solver Using CUDA and PGAS UPC Based Parallelisation

Authors:

Tamás István Józsa,

Ádám Koleszár,

Irene Moulitsas,

László Könözsy

ACM Transactions on Mathematical Software (TOMS), Volume 44, Issue 1

Article No.: 8, Pages 1 - 22

https://doi.org/10.1145/3085590

Published: 14 July 2017

Abstract

The Unified Parallel C (UPC) language from the Partitioned Global Address Space (PGAS) family unifies the advantages of shared and local memory spaces and offers relatively straightforward code parallelisation on the Central Processing Unit (CPU). In contrast, the Compute Unified Device Architecture (CUDA) development kit provides a tool for making use of the Graphics Processing Unit (GPU). We provide a detailed comparison between these novel techniques through the parallelisation of a two-dimensional lattice Boltzmann method-based fluid flow solver. Our comparison between the CUDA and UPC parallelisation takes into account the required conceptual effort, the performance gain, and the limitations of the approaches from the application-oriented developers’ point of view. We demonstrated that UPC led to competitive efficiency with the local memory implementation. However, the performance of the shared memory code fell short of our expectations, and we concluded that the investigated UPC compilers could not treat the shared memory space efficiently. The CUDA implementation proved to be more complex than the UPC approach, mainly because of the complicated memory structure of the graphics card, which is also what makes GPUs suitable for the parallelisation of the lattice Boltzmann method.
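To illustrate why a two-dimensional lattice Boltzmann solver lends itself to parallelisation, the following is a minimal serial sketch in Python. It is not the authors' code. It assumes a D2Q9 lattice, a single-relaxation-time BGK collision operator, and periodic boundaries, none of which the abstract specifies. The function names and the parameters `nx`, `ny`, `tau`, and `steps` are hypothetical and chosen only for illustration.

```python
# Illustrative D2Q9 lattice Boltzmann sketch (assumed BGK collision, periodic box).
# Discrete velocities: rest, four axis directions, four diagonals.
CX = [0, 1, 0, -1, 0, 1, -1, -1, 1]
CY = [0, 0, 1, 0, -1, 1, 1, -1, -1]
# Standard D2Q9 weights for the rest, axis, and diagonal directions.
W = [4 / 9] + [1 / 9] * 4 + [1 / 36] * 4


def equilibrium(rho, ux, uy):
    """Second-order equilibrium distribution for a single lattice node."""
    usq = ux * ux + uy * uy
    feq = []
    for k in range(9):
        cu = CX[k] * ux + CY[k] * uy
        feq.append(W[k] * rho * (1.0 + 3.0 * cu + 4.5 * cu * cu - 1.5 * usq))
    return feq


def collide(f, tau):
    """BGK collision. It uses only data stored at the node itself,
    so every node could be updated by an independent thread."""
    for row in f:
        for j, node in enumerate(row):
            rho = sum(node)
            ux = sum(CX[k] * node[k] for k in range(9)) / rho
            uy = sum(CY[k] * node[k] for k in range(9)) / rho
            feq = equilibrium(rho, ux, uy)
            row[j] = [node[k] - (node[k] - feq[k]) / tau for k in range(9)]


def stream(f):
    """Streaming with periodic wrap-around. Each population moves to a
    nearest neighbour, which is the only inter-node communication."""
    nx, ny = len(f), len(f[0])
    out = [[[0.0] * 9 for _ in range(ny)] for _ in range(nx)]
    for i in range(nx):
        for j in range(ny):
            for k in range(9):
                out[(i + CX[k]) % nx][(j + CY[k]) % ny][k] = f[i][j][k]
    return out


def total_mass(f):
    """Sum of all populations over the whole lattice."""
    return sum(sum(node) for row in f for node in row)


def run(nx=8, ny=8, tau=0.8, steps=10):
    """Start at rest with a small density bump and advance `steps` time steps.
    Returns the total mass before and after the run."""
    f = [[equilibrium(1.0, 0.0, 0.0) for _ in range(ny)] for _ in range(nx)]
    f[nx // 2][ny // 2] = equilibrium(1.1, 0.0, 0.0)
    m0 = total_mass(f)
    for _ in range(steps):
        collide(f, tau)
        f = stream(f)
    return m0, total_mass(f)
```

Calling `run()` returns the total mass before and after the time steps. The two values should agree to rounding error, because both the collision and the periodic streaming step conserve mass. In a parallel version, the collision loop is the part that maps naturally onto independent threads, while the streaming step is where neighbouring sub-domains would need to exchange data.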

Published In

ACM Transactions on Mathematical Software, Volume 44, Issue 1

March 2018

308 pages

ISSN: 0098-3500

EISSN: 1557-7295

DOI: 10.1145/3071076

Editor: Daniel Kressner, EPF Lausanne, Switzerland

Copyright © 2017 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 14 July 2017

Accepted: 01 April 2017

Revised: 01 September 2016

Received: 01 January 2016

Published in TOMS Volume 44, Issue 1

Qualifiers

Research-article
Research
Refereed

Cited By

M. Colbrook. 2024. Another look at residual dynamic mode decomposition in the regime of fewer snapshots than dictionary size. Physica D: Nonlinear Phenomena (134341). Online publication date: Aug-2024. https://doi.org/10.1016/j.physd.2024.134341

M. Colbrook. 2024. The multiverse of dynamic mode decomposition algorithms. Numerical Analysis Meets Machine Learning (127-230). Online publication date: 2024. https://doi.org/10.1016/bs.hna.2024.05.004

M. Colbrook, L. Ayton, and M. Szőke. 2023. Residual dynamic mode decomposition: robust and verified Koopmanism. Journal of Fluid Mechanics 955. Online publication date: 17-Jan-2023. https://doi.org/10.1017/jfm.2022.1052

M. Takáč and I. Petráš. 2021. Cross-Platform GPU-Based Implementation of Lattice Boltzmann Method Solver Using ArrayFire Library. Mathematics 9:15 (1793). Online publication date: 28-Jul-2021. https://doi.org/10.3390/math9151793

T. Józsa. 2019. Analytical solutions of incompressible laminar channel and pipe flows driven by in-plane wall oscillations. Physics of Fluids 31:8. Online publication date: 12-Aug-2019. https://doi.org/10.1063/1.5104356

E. Prokhorova and G. Krivovichev. 2018. Parallel realization of the computational algorithm based on the implicit lattice Boltzmann equations. Journal of Physics: Conference Series 1038 (012041). Online publication date: 14-Jun-2018. https://doi.org/10.1088/1742-6596/1038/1/012041