DOI: 10.1109/PACT52795.2021.00016

Research article

InnerSP: A Memory Efficient Sparse Matrix Multiplication Accelerator with Locality-aware Inner Product Processing

Published: 26 November 2024

Abstract

Sparse matrix multiplication is one of the key computational kernels in large-scale data analytics. A naive implementation, however, suffers from irregular memory accesses caused by the sparse representation. To mitigate these overheads, recent accelerator designs have advocated outer product processing, which minimizes input accesses but generates intermediate products that must be merged into the final output matrix. Using real-world sparse matrices, this study first identifies a memory bloating problem in outer product designs caused by the unpredictable volume of intermediate products. Such an unpredictable increase in memory requirements during computation can limit the applicability of accelerators. To address the memory bloating problem, this study revisits the alternative inner product approach and proposes a new accelerator design called InnerSP. The study shows that nonzero element distributions in real-world sparse matrices exhibit a certain level of locality, which a smart caching scheme designed for inner products can exploit effectively with a modest on-chip cache. However, the row-wise inner product relies on on-chip aggregation of intermediate products, and uneven sparsity per row can cause overflows or underflows of the on-chip aggregation storage. To maximize parallelism while avoiding costly overflows, the proposed accelerator uses pre-scanning for row splitting and merging. Simulation results show that InnerSP matches or exceeds the performance of prior outer product approaches without any memory bloating.
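To illustrate the dataflow contrast the abstract describes (not the paper's hardware design), the following sketch implements both SpGEMM orderings on a simple dict-of-dicts sparse format, where a matrix maps row index to {column index: value}. The row-wise (inner-product-style) version aggregates each output row before writing it out, so partial sums stay bounded per row; the outer-product version materializes one partial matrix per shared dimension index and merges them afterwards, which is where unpredictable intermediate storage can accumulate.

```python
def rowwise_spgemm(A, B):
    """Row-wise product: accumulate each output row of C = A * B fully
    on the fly, mirroring on-chip aggregation of one row at a time."""
    C = {}
    for i, row in A.items():
        acc = {}  # per-row accumulator: bounded by the row's output nonzeros
        for k, a_ik in row.items():
            for j, b_kj in B.get(k, {}).items():
                acc[j] = acc.get(j, 0) + a_ik * b_kj
        if acc:
            C[i] = acc
    return C


def outer_spgemm(A, B):
    """Outer product: column k of A times row k of B yields a partial
    matrix; all partials are kept and merged, so intermediate storage
    grows with the (input-dependent) number of partial products."""
    cols = {}  # transpose A into column-major form
    for i, row in A.items():
        for k, v in row.items():
            cols.setdefault(k, {})[i] = v
    partials = []
    for k, col in cols.items():
        part = {i: {j: v * b_kj for j, b_kj in B.get(k, {}).items()}
                for i, v in col.items()}
        partials.append(part)
    C = {}  # merge phase: combine all partial matrices into the output
    for part in partials:
        for i, row in part.items():
            for j, v in row.items():
                C.setdefault(i, {})[j] = C.get(i, {}).get(j, 0) + v
    return C


# Tiny example: A is 2x3, B is 3x2 (only nonzeros stored).
A = {0: {0: 1.0, 2: 2.0}, 1: {1: 3.0}}
B = {0: {1: 4.0}, 1: {0: 5.0}, 2: {1: 6.0}}
assert rowwise_spgemm(A, B) == outer_spgemm(A, B)
```

Both orderings compute the same product; they differ in when intermediate products are reduced, which is the distinction InnerSP's locality-aware inner-product design exploits.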



Published In

PACT '21: Proceedings of the 30th International Conference on Parallel Architectures and Compilation Techniques
September 2021, 370 pages
ISBN: 9781665442787
Editor: Jaejin Lee

Publisher

IEEE Press

Author Tags

  1. hardware accelerator
  2. inner product
  3. sparse matrix multiplication

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

PACT '21
Acceptance Rates

Overall Acceptance Rate 121 of 471 submissions, 26%
