DOI: 10.1145/3613424.3623783
Research article · Open access

Spatula: A Hardware Accelerator for Sparse Matrix Factorization

Published: 08 December 2023

Abstract

    Solving sparse systems of linear equations is a crucial component in many science and engineering problems, like simulating physical systems. Sparse matrix factorization dominates a large class of these solvers. Efficient factorization algorithms have two key properties that make them challenging for existing architectures: they consist of small tasks that are structured and compute-intensive, and sparsity induces long chains of data dependences among these tasks. Data dependences make GPUs struggle, while CPUs and prior sparse linear algebra accelerators also suffer from low compute throughput.
    We present Spatula, an architecture for accelerating sparse matrix factorization algorithms. Spatula hardware combines systolic processing elements that execute structured tasks at high throughput with a flexible scheduler that handles challenging data dependences. Spatula enables a novel scheduling algorithm that avoids stalls and load imbalance while reducing data movement, achieving high compute utilization. As a result, Spatula outperforms a GPU running the state-of-the-art sparse Cholesky and LU factorization implementations by a geometric mean of 47× across a wide range of matrices, and by up to thousands of times on some challenging matrices.
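    The dependence chains the abstract describes can be seen even in a dense Cholesky factorization: each column can only be factored after every earlier column it depends on has been computed. The sketch below is a minimal left-looking dense Cholesky in Python — an illustration of the algorithmic structure, not the paper's Spatula design or a production sparse code (supernodal solvers such as CHOLMOD exploit sparsity and are far more involved).

    ```python
    # Minimal left-looking Cholesky sketch: column j is "updated" by all
    # previously factored columns k < j, then "factored" (scaled by the
    # square root of its diagonal). These update->factor dependences are
    # the chains that serialize sparse factorization.
    import numpy as np

    def cholesky_left_looking(A):
        """Factor a symmetric positive-definite A into L such that L @ L.T == A."""
        n = A.shape[0]
        L = np.zeros_like(A, dtype=float)
        for j in range(n):
            # Update: subtract the contributions of earlier columns.
            s = A[j:, j] - L[j:, :j] @ L[j, :j]
            # Factor: scale the column by the square root of its diagonal.
            L[j, j] = np.sqrt(s[0])
            L[j + 1:, j] = s[1:] / L[j, j]
        return L

    # Usage: factor a small SPD matrix and check the reconstruction.
    A = np.array([[4.0, 2.0, 0.0],
                  [2.0, 5.0, 1.0],
                  [0.0, 1.0, 3.0]])
    L = cholesky_left_looking(A)
    assert np.allclose(L @ L.T, A)
    ```

    In a sparse factorization, most of the entries of `L[j:, :j]` are zero, so the update step touches only a sparsity-dependent subset of earlier columns — which is exactly what turns the loop above into an irregular task graph.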

    Cited By

    • (2024) MuchiSim: A Simulation Framework for Design Exploration of Multi-Chip Manycore Systems. 2024 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pp. 48-60. DOI: 10.1109/ISPASS61541.2024.00015. Online publication date: 5 May 2024.
    • (2024) Trapezoid: A Versatile Accelerator for Dense and Sparse Matrix Multiplications. 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), pp. 931-945. DOI: 10.1109/ISCA59077.2024.00072. Online publication date: 29 June 2024.


      Published In

      MICRO '23: Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture
      October 2023
      1528 pages
      ISBN:9798400703294
      DOI:10.1145/3613424
      This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.


      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Author Tags

      1. Cholesky
      2. Hardware accelerators
      3. LU
      4. matrix factorization
      5. sparse linear algebra

      Qualifiers

      • Research-article
      • Research
      • Refereed limited


      Conference

      MICRO '23

      Acceptance Rates

      Overall Acceptance Rate 484 of 2,242 submissions, 22%


      Article Metrics
      Article Metrics

      • Downloads (last 12 months): 1,216
      • Downloads (last 6 weeks): 127
      Reflects downloads up to 11 August 2024.
