research-article

Open access

Reducing communication in the conjugate gradient method: a case study on high-order finite elements

Authors:

Niclas Jansson,

Philipp Schlatter,

Stefano Markidis

PASC '22: Proceedings of the Platform for Advanced Scientific Computing Conference

Article No.: 2, Pages 1 - 11

https://doi.org/10.1145/3539781.3539785

Published: 12 July 2022

Abstract

Communication is currently a major bottleneck for many scientific computations, both horizontal communication between processors and vertical communication between levels of the memory hierarchy. With this bottleneck in mind, we target a notoriously communication-bound solver at the core of many high-performance applications: the conjugate gradient method (CG). To reduce communication, we present lower bounds on the vertical data movement in CG and use them to construct a CG solver with reduced data movement. Guided by this theoretical analysis, we apply our CG solver to a high-performance discretization used in practice, the spectral element method (SEM). We show that for the Poisson equation on modern GPUs, performance improves by 30% when we both rematerialize the discrete system and reformulate it to work on unique degrees of freedom. To investigate how horizontal communication can be reduced, we compare CG to two communication-reducing techniques: communication-avoiding CG and pipelined CG. In strong-scaling experiments on up to 4096 CPU cores, pipelined CG improves performance by upwards of 70% compared to standard CG when applied to SEM at scale. In addition to the improved scaling of the solver, initial measurements indicate that the convergence of SEM is largely unaffected by pipelined CG.
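To make the horizontal communication cost concrete, the following is a minimal sketch of textbook, unpreconditioned CG in plain Python. It is not the authors' solver or their SEM operator: the names `dot`, `matvec`, and `conjugate_gradient`, the dense list-of-lists matrix, and the `reductions` counter are all assumptions made for illustration. The point it shows is that each iteration contains two inner products, and in a distributed-memory run each of them would require a global reduction across all processes. Those are the synchronization points that pipelined and communication-avoiding variants try to overlap or amortize.

```python
def dot(u, v):
    """Inner product; in a distributed solver this is a global reduction."""
    return sum(a * b for a, b in zip(u, v))


def matvec(A, x):
    """Dense matrix-vector product standing in for the operator application."""
    return [dot(row, x) for row in A]


def conjugate_gradient(A, b, tol=1e-10, max_iter=100):
    """Textbook CG for a symmetric positive definite matrix A.

    Returns (x, iterations, reductions). `reductions` is an illustrative
    counter of the inner products that would each need a global reduction
    in a parallel run (the local dot products inside matvec are not counted).
    """
    x = [0.0] * len(b)
    r = list(b)                 # residual r = b - A x for the zero initial guess
    p = list(r)                 # first search direction
    rr = dot(r, r)
    reductions = 1
    for k in range(max_iter):
        if rr ** 0.5 < tol:     # converged: residual norm below tolerance
            return x, k, reductions
        Ap = matvec(A, p)
        alpha = rr / dot(p, Ap)             # first reduction of the iteration
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rr_new = dot(r, r)                  # second reduction of the iteration
        reductions += 2
        beta = rr_new / rr
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return x, max_iter, reductions


if __name__ == "__main__":
    A = [[4.0, 1.0], [1.0, 3.0]]
    b = [1.0, 2.0]
    x, iters, reductions = conjugate_gradient(A, b)
    print(x, iters, reductions)
```

In this sketch the second reduction depends on the updated residual, which in turn depends on the first reduction, so the two cannot simply be merged without restructuring the recurrences. That dependency is what makes standard CG latency-bound at scale.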

Cited By

Karp M, Massaro D, Jansson N, Hart A, Wahlgren J, Schlatter P, Markidis S. (2023). Large-Scale direct numerical simulations of turbulence using GPUs and modern Fortran. International Journal of High Performance Computing Applications 37:5 (487-502). DOI: 10.1177/10943420231158616. Online publication date: 1-Sep-2023.
https://dl.acm.org/doi/10.1177/10943420231158616
Ju Y, Li M, Perez A, Bellentani L, Jansson N, Markidis S, Schlatter P, Laure E. (2023). In-Situ Techniques on GPU-Accelerated Data-Intensive Applications. 2023 IEEE 19th International Conference on e-Science (e-Science) (1-10). DOI: 10.1109/e-Science58273.2023.10254865. Online publication date: 9-Oct-2023.
https://doi.org/10.1109/e-Science58273.2023.10254865

Index Terms

Reducing communication in the conjugate gradient method: a case study on high-order finite elements
1. Computing methodologies
  1. Parallel computing methodologies
    1. Parallel algorithms
2. Mathematics of computing
  1. Mathematical software
    1. Solvers

Information & Contributors

Information

Published In

PASC '22: Proceedings of the Platform for Advanced Scientific Computing Conference

June 2022

181 pages

ISBN:9781450394109

DOI:10.1145/3539781

Conference Chair:
Timothy Robinson
ETH Zurich/CSCS

Copyright © 2022 Owner/Author.

This work is licensed under a Creative Commons Attribution International 4.0 License.

Sponsors

SIGHPC: ACM Special Interest Group on High Performance Computing

In-Cooperation

CSCS: Swiss National Supercomputing Centre

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

PASC '22

Sponsor:

SIGHPC

PASC '22: Platform for Advanced Scientific Computing Conference

June 27 - 29, 2022

Basel, Switzerland

Acceptance Rates

PASC '22 Paper Acceptance Rate 17 of 22 submissions, 77%;

Overall Acceptance Rate 109 of 221 submissions, 49%
