DOI: 10.1145/3613424.3623783
Research article · Open access

Spatula: A Hardware Accelerator for Sparse Matrix Factorization

Published: 08 December 2023

Abstract

    Solving sparse systems of linear equations is a crucial component in many science and engineering problems, like simulating physical systems. Sparse matrix factorization dominates a large class of these solvers. Efficient factorization algorithms have two key properties that make them challenging for existing architectures: they consist of small tasks that are structured and compute-intensive, and sparsity induces long chains of data dependences among these tasks. Data dependences make GPUs struggle, while CPUs and prior sparse linear algebra accelerators also suffer from low compute throughput.
    We present Spatula, an architecture for accelerating sparse matrix factorization algorithms. Spatula hardware combines systolic processing elements that execute structured tasks at high throughput with a flexible scheduler that handles challenging data dependences. Spatula enables a novel scheduling algorithm that avoids stalls and load imbalance while reducing data movement, achieving high compute utilization. As a result, Spatula outperforms a GPU running the state-of-the-art sparse Cholesky and LU factorization implementations by a geometric mean of 47× across a wide range of matrices, and by up to thousands of times on some challenging matrices.
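    The dependence chains the abstract describes can be seen even in a dense Cholesky factorization: each column can only be factored after every earlier column it depends on has been computed. The sketch below is a minimal left-looking dense Cholesky in Python — an illustration of the algorithmic structure, not the paper's Spatula design or a production sparse code (supernodal solvers such as CHOLMOD exploit sparsity and are far more involved).

    ```python
    # Minimal left-looking Cholesky sketch: column j is "updated" by all
    # previously factored columns k < j, then "factored" (scaled by the
    # square root of its diagonal). These update->factor dependences are
    # the chains that serialize sparse factorization.
    import numpy as np

    def cholesky_left_looking(A):
        """Factor a symmetric positive-definite A into L such that L @ L.T == A."""
        n = A.shape[0]
        L = np.zeros_like(A, dtype=float)
        for j in range(n):
            # Update: subtract the contributions of earlier columns.
            s = A[j:, j] - L[j:, :j] @ L[j, :j]
            # Factor: scale the column by the square root of its diagonal.
            L[j, j] = np.sqrt(s[0])
            L[j + 1:, j] = s[1:] / L[j, j]
        return L

    # Usage: factor a small SPD matrix and check the reconstruction.
    A = np.array([[4.0, 2.0, 0.0],
                  [2.0, 5.0, 1.0],
                  [0.0, 1.0, 3.0]])
    L = cholesky_left_looking(A)
    assert np.allclose(L @ L.T, A)
    ```

    In a sparse factorization, most of the entries of `L[j:, :j]` are zero, so the update step touches only a sparsity-dependent subset of earlier columns — which is exactly what turns the loop above into an irregular task graph.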

    Cited By

    • (2024) MuchiSim: A Simulation Framework for Design Exploration of Multi-Chip Manycore Systems. 2024 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pp. 48-60. DOI: 10.1109/ISPASS61541.2024.00015. Online publication date: 5 May 2024.
    • (2024) Trapezoid: A Versatile Accelerator for Dense and Sparse Matrix Multiplications. 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), pp. 931-945. DOI: 10.1109/ISCA59077.2024.00072. Online publication date: 29 June 2024.


      Published In

      MICRO '23: Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture
      October 2023
      1528 pages
      ISBN:9798400703294
      DOI:10.1145/3613424
      This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.


      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Author Tags

      1. Cholesky
      2. Hardware accelerators
      3. LU
      4. matrix factorization
      5. sparse linear algebra

      Qualifiers

      • Research-article
      • Research
      • Refereed limited


      Conference

      MICRO '23

      Acceptance Rates

      Overall Acceptance Rate 484 of 2,242 submissions, 22%


      Article Metrics
      Article Metrics

      • Downloads (last 12 months): 1,216
      • Downloads (last 6 weeks): 127
      Reflects downloads up to 11 August 2024.
