Exposing Fine-Grained Parallelism in Algebraic Multigrid Methods

Authors:

Luke N. Olson

SIAM Journal on Scientific Computing, Volume 34, Issue 4

Pages C123 - C152

https://doi.org/10.1137/110838844

Published: 01 January 2012

Abstract

Algebraic multigrid methods for large, sparse linear systems are a necessity in many computational simulations, yet parallel algorithms for such solvers are generally decomposed into coarse-grained tasks suitable for distributed computers with traditional processing cores. However, accelerating multigrid methods on massively parallel throughput-oriented processors, such as graphics processing units (GPUs), demands algorithms with abundant fine-grained parallelism. In this paper, we develop a parallel algebraic multigrid method that exposes substantial fine-grained parallelism in both the construction of the multigrid hierarchy (the setup phase) and the cycling, or solve, stage. Our algorithms are expressed in terms of scalable parallel primitives that are efficiently implemented on the GPU. The resulting solver achieves an average speedup of $1.8\times$ in the setup phase and $5.7\times$ in the cycling phase when compared to a representative CPU implementation.
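To illustrate what "expressed in terms of scalable parallel primitives" can mean in practice, the following is a minimal sketch, not the paper's implementation, of a sparse matrix-vector product written as a sequence of data-parallel steps: a gather, an elementwise transform, and a reduction by key. The sketch assumes the matrix is stored in coordinate (COO) format with entries sorted by row. The helper names `gather`, `transform`, `reduce_by_key`, and `coo_spmv` are hypothetical and written here in sequential Python for readability; on a GPU each step would map to a parallel primitive.

```python
from itertools import groupby
from operator import itemgetter


def gather(indices, values):
    """Gather primitive: out[i] = values[indices[i]]."""
    return [values[i] for i in indices]


def transform(op, a, b):
    """Elementwise binary transform: out[i] = op(a[i], b[i])."""
    return [op(x, y) for x, y in zip(a, b)]


def reduce_by_key(keys, values):
    """Sum each run of consecutive equal keys (a segmented reduction)."""
    out_keys, out_vals = [], []
    for key, group in groupby(zip(keys, values), key=itemgetter(0)):
        out_keys.append(key)
        out_vals.append(sum(v for _, v in group))
    return out_keys, out_vals


def coo_spmv(n_rows, rows, cols, vals, x):
    """Compute y = A x for A in COO format, entries sorted by row (assumed)."""
    xs = gather(cols, x)                                 # x[col] for each nonzero
    products = transform(lambda a, b: a * b, vals, xs)   # A_ij * x_j
    keys, sums = reduce_by_key(rows, products)           # sum products per row
    y = [0.0] * n_rows                                   # rows without entries stay 0
    for r, s in zip(keys, sums):                         # scatter row sums into y
        y[r] = s
    return y


# Example matrix A = [[2, 0, 1], [0, 3, 0], [4, 0, 5]] in COO form.
rows = [0, 0, 1, 2, 2]
cols = [0, 2, 1, 0, 2]
vals = [2.0, 1.0, 3.0, 4.0, 5.0]
x = [1.0, 2.0, 3.0]
y = coo_spmv(3, rows, cols, vals, x)
print(y)
```

Each step operates independently on every nonzero entry or on every row segment, which is the kind of fine-grained parallelism the abstract refers to. The decomposition shown is one common choice and is an assumption of this sketch.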

Published In

SIAM Journal on Scientific Computing Volume 34, Issue 4

2012

767 pages

ISSN:1064-8275

DOI:10.1137/sjoce3.34.4

© 2012, Society for Industrial and Applied Mathematics.

Publisher

Society for Industrial and Applied Mathematics

United States
