DOI: 10.1109/SC.2014.68

Efficient sparse matrix-vector multiplication on GPUs using the CSR storage format

Published: 16 November 2014

Abstract

The performance of sparse matrix-vector multiplication (SpMV) is important to computational scientists. Compressed sparse row (CSR) is the most frequently used format for storing sparse matrices. However, CSR-based SpMV on graphics processing units (GPUs) performs poorly due to irregular memory access patterns, load imbalance, and reduced parallelism. This has led researchers to propose new storage formats. Unfortunately, dynamically transforming CSR into these formats incurs significant runtime and storage overheads.
We propose a novel algorithm, CSR-Adaptive, which keeps the CSR format intact and maps well to GPUs. Our implementation addresses the aforementioned challenges by (i) accessing DRAM efficiently by streaming data into local scratchpad memory and (ii) dynamically assigning different numbers of rows to each parallel GPU compute unit. CSR-Adaptive achieves an average speedup of 14.7× over existing CSR-based algorithms and 2.3× over clSpMV cocktail, which uses an assortment of matrix formats.
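CSR-Adaptive itself is a GPU kernel and no source code accompanies this page, but the CSR layout the abstract builds on can be sketched with a minimal serial SpMV. This is an illustrative CPU sketch only, not the paper's implementation; the function name and example matrix are ours:

```python
# CSR stores a sparse matrix as three arrays: the nonzero values, their
# column indices, and row pointers, where row_ptr[i]..row_ptr[i+1]
# delimit row i's nonzeros in the other two arrays.

def csr_spmv(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a matrix A given in CSR form."""
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        # Gather this row's nonzeros; col_idx makes the reads of x irregular,
        # which is one source of the poor GPU behavior the abstract describes.
        for j in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[j] * x[col_idx[j]]
        y[i] = acc
    return y

# 3x3 example:  [[1, 0, 2],
#                [0, 3, 0],
#                [4, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [1.0, 2.0, 3.0, 4.0, 5.0]
print(csr_spmv(row_ptr, col_idx, values, [1.0, 1.0, 1.0]))  # [3.0, 3.0, 9.0]
```

Because rows can have wildly different numbers of nonzeros, a naive one-row-per-thread GPU mapping of this loop load-balances poorly, which is the imbalance CSR-Adaptive's dynamic row assignment targets.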



Published In

SC '14: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
November 2014
1054 pages
ISBN:9781479955008
  • General Chair: Trish Damkroger
  • Program Chair: Jack Dongarra

Publisher

IEEE Press

Author Tags

  1. AMD
  2. compressed sparse row (CSR)
  3. general purpose computation on graphics processing units (GPGPU)
  4. performance acceleration
  5. sparse matrix-vector multiplication (SpMV)

Qualifiers

  • Research-article

Conference

SC '14
Acceptance Rates

SC '14 paper acceptance rate: 83 of 394 submissions (21%)
Overall acceptance rate: 1,516 of 6,373 submissions (24%)

Cited By

  • (2024) Revisiting thread configuration of SpMV kernels on GPU. Journal of Parallel and Distributed Computing, 185:C. DOI: 10.1016/j.jpdc.2023.104799
  • (2023) Streaming Sparse Data on Architectures with Vector Extensions using Near Data Processing. Proceedings of the International Symposium on Memory Systems, pp. 1-12. DOI: 10.1145/3631882.3631898
  • (2023) GTLB: A Load-Balanced SpMV Computation Method on GPU. Proceedings of the 2023 7th International Conference on High Performance Compilation, Computing and Communications, pp. 101-107. DOI: 10.1145/3606043.3606057
  • (2023) Connectivity-Aware Link Analysis for Skewed Graphs. Proceedings of the 52nd International Conference on Parallel Processing, pp. 482-491. DOI: 10.1145/3605573.3605579
  • (2023) Building a Virtual Weakly-Compressible Wind Tunnel Testing Facility. ACM Transactions on Graphics, 42(4), pp. 1-20. DOI: 10.1145/3592394
  • (2023) ClipSim: A GPU-friendly Parallel Framework for Single-Source SimRank with Accuracy Guarantee. Proceedings of the ACM on Management of Data, 1(1), pp. 1-26. DOI: 10.1145/3588707
  • (2023) Efficient Algorithm Design of Optimizing SpMV on GPU. Proceedings of the 32nd International Symposium on High-Performance Parallel and Distributed Computing, pp. 115-128. DOI: 10.1145/3588195.3593002
  • (2023) Large-Scale Simulation of Structural Dynamics Computing on GPU Clusters. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-14. DOI: 10.1145/3581784.3607082
  • (2023) DASP: Specific Dense Matrix Multiply-Accumulate Units Accelerated General Sparse Matrix-Vector Multiplication. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-14. DOI: 10.1145/3581784.3607051
  • (2023) Optimizing Multi-grid Computation and Parallelization on Multi-cores. Proceedings of the 37th International Conference on Supercomputing, pp. 227-239. DOI: 10.1145/3577193.3593726
