DOI: 10.1109/SC.2014.68

Efficient sparse matrix-vector multiplication on GPUs using the CSR storage format

Published: 16 November 2014

Abstract

The performance of sparse matrix-vector multiplication (SpMV) is important to computational scientists. Compressed sparse row (CSR) is the most frequently used format for storing sparse matrices. However, CSR-based SpMV on graphics processing units (GPUs) performs poorly due to irregular memory access patterns, load imbalance, and reduced parallelism. This has led researchers to propose new storage formats. Unfortunately, dynamically transforming CSR into these formats incurs significant runtime and storage overheads.
We propose a novel algorithm, CSR-Adaptive, which keeps the CSR format intact and maps well to GPUs. Our implementation addresses the aforementioned challenges by (i) accessing DRAM efficiently by streaming data into local scratchpad memory and (ii) dynamically assigning different numbers of rows to each parallel GPU compute unit. CSR-Adaptive achieves an average speedup of 14.7× over existing CSR-based algorithms and 2.3× over clSpMV cocktail, which uses an assortment of matrix formats.
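CSR-Adaptive itself is a GPU kernel and no source code accompanies this page, but the CSR layout the abstract builds on can be sketched with a minimal serial SpMV. This is an illustrative CPU sketch only, not the paper's implementation; the function name and example matrix are ours:

```python
# CSR stores a sparse matrix as three arrays: the nonzero values, their
# column indices, and row pointers, where row_ptr[i]..row_ptr[i+1]
# delimit row i's nonzeros in the other two arrays.

def csr_spmv(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a matrix A given in CSR form."""
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        # Gather this row's nonzeros; col_idx makes the reads of x irregular,
        # which is one source of the poor GPU behavior the abstract describes.
        for j in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[j] * x[col_idx[j]]
        y[i] = acc
    return y

# 3x3 example:  [[1, 0, 2],
#                [0, 3, 0],
#                [4, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [1.0, 2.0, 3.0, 4.0, 5.0]
print(csr_spmv(row_ptr, col_idx, values, [1.0, 1.0, 1.0]))  # [3.0, 3.0, 9.0]
```

Because rows can have wildly different numbers of nonzeros, a naive one-row-per-thread GPU mapping of this loop load-balances poorly, which is the imbalance CSR-Adaptive's dynamic row assignment targets.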



Published In

SC '14: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
November 2014
1054 pages
ISBN:9781479955008
  • General Chair: Trish Damkroger
  • Program Chair: Jack Dongarra

Publisher

IEEE Press

Author Tags

  1. AMD
  2. compressed sparse row (CSR)
  3. general purpose computation on graphics processing units (GPGPU)
  4. performance acceleration
  5. sparse matrix-vector multiplication (SpMV)

Qualifiers

  • Research-article

Conference

SC '14
Acceptance Rates

SC '14 paper acceptance rate: 83 of 394 submissions (21%)
Overall acceptance rate: 1,516 of 6,373 submissions (24%)

Cited By

  • (2024) Revisiting thread configuration of SpMV kernels on GPU. Journal of Parallel and Distributed Computing, 185:C. DOI: 10.1016/j.jpdc.2023.104799
  • (2023) Streaming Sparse Data on Architectures with Vector Extensions using Near Data Processing. Proceedings of the International Symposium on Memory Systems, pp. 1-12. DOI: 10.1145/3631882.3631898
  • (2023) GTLB: A Load-Balanced SpMV Computation Method on GPU. Proceedings of the 2023 7th International Conference on High Performance Compilation, Computing and Communications, pp. 101-107. DOI: 10.1145/3606043.3606057
  • (2023) Connectivity-Aware Link Analysis for Skewed Graphs. Proceedings of the 52nd International Conference on Parallel Processing, pp. 482-491. DOI: 10.1145/3605573.3605579
  • (2023) Building a Virtual Weakly-Compressible Wind Tunnel Testing Facility. ACM Transactions on Graphics, 42(4), pp. 1-20. DOI: 10.1145/3592394
  • (2023) ClipSim: A GPU-friendly Parallel Framework for Single-Source SimRank with Accuracy Guarantee. Proceedings of the ACM on Management of Data, 1(1), pp. 1-26. DOI: 10.1145/3588707
  • (2023) Efficient Algorithm Design of Optimizing SpMV on GPU. Proceedings of the 32nd International Symposium on High-Performance Parallel and Distributed Computing, pp. 115-128. DOI: 10.1145/3588195.3593002
  • (2023) Large-Scale Simulation of Structural Dynamics Computing on GPU Clusters. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-14. DOI: 10.1145/3581784.3607082
  • (2023) DASP: Specific Dense Matrix Multiply-Accumulate Units Accelerated General Sparse Matrix-Vector Multiplication. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-14. DOI: 10.1145/3581784.3607051
  • (2023) Optimizing Multi-grid Computation and Parallelization on Multi-cores. Proceedings of the 37th International Conference on Supercomputing, pp. 227-239. DOI: 10.1145/3577193.3593726
