
Automatic Code Generation for High-performance Discontinuous Galerkin Methods on Modern Architectures

Published: 08 December 2020

Abstract

SIMD vectorization has lately become a key challenge in high-performance computing. However, hand-written explicitly vectorized code often poses a threat to the software’s sustainability. In this publication, we solve this sustainability and performance portability issue by enriching the simulation framework dune-pdelab with a code generation approach. The approach is based on the well-known domain-specific language UFL but combines it with loopy, a more powerful intermediate representation for the computational kernel. Given this flexible tool, we present and implement a new class of vectorization strategies for the assembly of Discontinuous Galerkin methods on hexahedral meshes exploiting the finite element’s tensor product structure. The performance-optimal variant from this class is chosen by the code generator through an auto-tuning approach. The implementation is done within the open source PDE software framework Dune and the discretization module dune-pdelab. The strength of the proposed approach is illustrated with performance measurements for DG schemes for a scalar diffusion reaction equation and the Stokes equation. In our measurements, we utilize both the AVX2 and the AVX512 instruction sets, achieving 30% to 40% of the machine’s theoretical peak performance for one matrix-free application of the operator.
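
The abstract does not spell out how the tensor product structure is exploited. One common technique for this is sum factorization, and the following is a minimal sketch of that idea in two dimensions, not the paper's generated code. It assumes a 1D matrix `A` of basis function values at quadrature points and a coefficient vector `x` stored row-major; the function names and the test data are hypothetical and chosen only for illustration.

```python
# Illustrative sketch (an assumption, not the paper's implementation):
# apply the tensor-product operator (A ⊗ A) to a coefficient vector x.
# A has m rows (1D quadrature points) and n columns (1D basis functions).
# x holds n*n coefficients, with entry (i, j) stored at index i*n + j.


def apply_kron_naive(A, x):
    """Evaluate y[p, q] = sum_{i,j} A[p][i] * A[q][j] * x[i, j] directly.

    Cost is O(m^2 * n^2) multiply-adds.
    """
    m, n = len(A), len(A[0])
    y = [0.0] * (m * m)
    for p in range(m):
        for q in range(m):
            s = 0.0
            for i in range(n):
                for j in range(n):
                    s += A[p][i] * A[q][j] * x[i * n + j]
            y[p * m + q] = s
    return y


def apply_kron_sum_factorized(A, x):
    """Same result as apply_kron_naive, computed as two 1D contractions.

    Cost is O(m * n^2 + m^2 * n) multiply-adds.
    """
    m, n = len(A), len(A[0])
    # Step 1: contract the second index, tmp[i][q] = sum_j A[q][j] * x[i, j].
    tmp = [
        [sum(A[q][j] * x[i * n + j] for j in range(n)) for q in range(m)]
        for i in range(n)
    ]
    # Step 2: contract the first index, y[p, q] = sum_i A[p][i] * tmp[i][q].
    y = [0.0] * (m * m)
    for p in range(m):
        for q in range(m):
            y[p * m + q] = sum(A[p][i] * tmp[i][q] for i in range(n))
    return y


if __name__ == "__main__":
    # Hypothetical data: 3 quadrature points, 2 basis functions per direction.
    A = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.25]]
    x = [1.0, 2.0, 3.0, 4.0]
    naive = apply_kron_naive(A, x)
    fast = apply_kron_sum_factorized(A, x)
    print(max(abs(a - b) for a, b in zip(naive, fast)))
```

Each 1D contraction is a small dense matrix product. Kernels of this shape are a natural target for SIMD vectorization, although the specific vectorization strategies the abstract refers to are not shown here.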


Published In

ACM Transactions on Mathematical Software, Volume 47, Issue 1
March 2021
219 pages
ISSN: 0098-3500
EISSN: 1557-7295
DOI: 10.1145/3441641
This work is licensed under a Creative Commons Attribution-ShareAlike International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 08 December 2020
Accepted: 01 September 2020
Revised: 01 April 2020
Received: 01 December 2018
Published in TOMS Volume 47, Issue 1


Author Tags

  1. Code generation
  2. Galerkin methods

Qualifiers

  • Research-article
  • Research
  • Refereed

