DOI: 10.1145/3572848.3577516 · Research Article · Public Access

Fast Symmetric Eigenvalue Decomposition via WY Representation on Tensor Core

Published: 21 February 2023

    Abstract

    Symmetric eigenvalue decomposition (EVD) is a fundamental analytic and numerical tool used in many scientific areas. The best-performing algorithms are typically based on two-stage tridiagonalization. Its first stage, successive band reduction (SBR), reduces a symmetric matrix to a band form and usually dominates the computational cost. When Tensor Cores (specialized matrix-computation accelerators) are used to accelerate this expensive EVD, the conventional ZY-representation-based method yields suboptimal performance because the resulting matrix computations have unfavorable shapes. In this paper, we propose a new method that uses the WY representation instead of the ZY representation (see Section 3.2 for details); it offers a better combination of locality and parallelism and therefore performs better on Tensor Cores. Experimentally, the proposed method achieves up to a 3.7x speedup in SBR and 2.3x in the entire EVD compared to state-of-the-art implementations.
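    As background on the WY representation named in the abstract: a product of Householder reflectors Q = H_1 H_2 ... H_k can be accumulated compactly as Q = I + W Yᵀ, which turns the reflector applications into matrix-matrix products. A minimal NumPy sketch is shown below; this is illustrative only (the function names `householder` and `wy_accumulate` are ours, and it is not the paper's Tensor Core kernel).

```python
import numpy as np

def householder(x):
    # Build a Householder reflector H = I - tau * v v^T with H x = alpha * e1.
    v = x.astype(float).copy()
    alpha = -np.copysign(np.linalg.norm(x), x[0])
    v[0] -= alpha
    tau = 2.0 / (v @ v)
    return v, tau

def wy_accumulate(reflectors, n):
    # Accumulate Q = H_1 H_2 ... H_k in WY form, Q = I + W @ Y.T.
    # Update rule: Q_{j+1} = Q_j H_{j+1} appends column -tau * Q_j v to W
    # and column v to Y, so each step is GEMM-shaped work.
    W = np.zeros((n, 0))
    Y = np.zeros((n, 0))
    for v, tau in reflectors:
        z = -tau * (v + W @ (Y.T @ v))   # -tau * Q_j v
        W = np.column_stack([W, z])
        Y = np.column_stack([Y, v])
    return W, Y

# Verify against the explicit product of reflectors.
np.random.seed(0)
n = 8
refl = [householder(np.random.randn(n)) for _ in range(3)]
W, Y = wy_accumulate(refl, n)
Q = np.eye(n)
for v, tau in refl:
    Q = Q @ (np.eye(n) - tau * np.outer(v, v))
assert np.allclose(np.eye(n) + W @ Y.T, Q)
```

    Because W and Y grow one column per reflector, applying Q to a block of vectors costs two matrix multiplies, which is the property that makes the representation attractive on matrix-multiply hardware such as Tensor Cores.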


    Published In

    PPoPP '23: Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming
    February 2023
    480 pages
    ISBN:9798400700156
    DOI:10.1145/3572848

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. GPGPU
    2. HPC
    3. eigenvalue decomposition
    4. low rank approximation
    5. matrix computation
    6. mixed-precision computation
    7. numerical linear algebra
    8. singular value decomposition
    9. tensor core

    Conference

    PPoPP '23

    Acceptance Rates

    Overall Acceptance Rate 230 of 1,014 submissions, 23%
