DOI: 10.1145/3572848.3577516 · Research Article · Public Access

Fast Symmetric Eigenvalue Decomposition via WY Representation on Tensor Core

Published: 21 February 2023

    Abstract

    Symmetric eigenvalue decomposition (EVD) is a fundamental analytic and numerical tool used in many scientific areas. The best-performing algorithms are typically based on two-stage tridiagonalization. Its first stage, successive band reduction (SBR), reduces a symmetric matrix to a band form and usually dominates the computational cost. When Tensor Cores (specialized matrix-computation accelerators) are used to accelerate this expensive EVD, the conventional ZY-representation-based method yields suboptimal performance because the resulting matrix computations have unfavorable shapes. In this paper, we propose a new method that uses the WY representation instead of the ZY representation (see Section 3.2 for details); it offers a better combination of locality and parallelism and therefore performs better on Tensor Cores. Experimentally, the proposed method achieves up to a 3.7x speedup in SBR and 2.3x in the entire EVD compared to state-of-the-art implementations.
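    As background on the WY representation named in the abstract: a product of Householder reflectors Q = H_1 H_2 ... H_k can be accumulated compactly as Q = I + W Yᵀ, which turns the reflector applications into matrix-matrix products. A minimal NumPy sketch is shown below; this is illustrative only (the function names `householder` and `wy_accumulate` are ours, and it is not the paper's Tensor Core kernel).

```python
import numpy as np

def householder(x):
    # Build a Householder reflector H = I - tau * v v^T with H x = alpha * e1.
    v = x.astype(float).copy()
    alpha = -np.copysign(np.linalg.norm(x), x[0])
    v[0] -= alpha
    tau = 2.0 / (v @ v)
    return v, tau

def wy_accumulate(reflectors, n):
    # Accumulate Q = H_1 H_2 ... H_k in WY form, Q = I + W @ Y.T.
    # Update rule: Q_{j+1} = Q_j H_{j+1} appends column -tau * Q_j v to W
    # and column v to Y, so each step is GEMM-shaped work.
    W = np.zeros((n, 0))
    Y = np.zeros((n, 0))
    for v, tau in reflectors:
        z = -tau * (v + W @ (Y.T @ v))   # -tau * Q_j v
        W = np.column_stack([W, z])
        Y = np.column_stack([Y, v])
    return W, Y

# Verify against the explicit product of reflectors.
np.random.seed(0)
n = 8
refl = [householder(np.random.randn(n)) for _ in range(3)]
W, Y = wy_accumulate(refl, n)
Q = np.eye(n)
for v, tau in refl:
    Q = Q @ (np.eye(n) - tau * np.outer(v, v))
assert np.allclose(np.eye(n) + W @ Y.T, Q)
```

    Because W and Y grow one column per reflector, applying Q to a block of vectors costs two matrix multiplies, which is the property that makes the representation attractive on matrix-multiply hardware such as Tensor Cores.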


    Published In

    PPoPP '23: Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming
    February 2023
    480 pages
    ISBN:9798400700156
    DOI:10.1145/3572848

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. GPGPU
    2. HPC
    3. eigenvalue decomposition
    4. low rank approximation
    5. matrix computation
    6. mixed-precision computation
    7. numerical linear algebra
    8. singular value decomposition
    9. tensor core

    Conference

    PPoPP '23

    Acceptance Rates

    Overall Acceptance Rate 230 of 1,014 submissions, 23%
