DOI: 10.1145/2597652.2597678

An efficient two-dimensional blocking strategy for sparse matrix-vector multiplication on GPUs

Published: 10 June 2014

Abstract

Sparse matrix-vector multiplication (SpMV) is one of the key operations in linear algebra. Optimizing SpMV on GPUs is challenging because the sparsity and irregularity of the input matrix lead to thread divergence, load imbalance, and non-coalesced, indirect memory accesses.
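These access-pattern problems are easiest to see in a plain CSR kernel. The following is a minimal sketch, shown only as the common baseline rather than as the paper's format; the small matrix used is illustrative:

```python
import numpy as np

def spmv_csr(row_ptr, col_idx, vals, x):
    """y = A @ x for A stored in CSR form.

    The gather x[col_idx[k]] is the indirect, potentially non-coalesced
    access; rows of very different lengths cause the thread divergence
    and load imbalance described above when each row maps to one thread.
    """
    n = len(row_ptr) - 1
    y = np.zeros(n)
    for i in range(n):  # on a GPU, each i would be one thread in the naive mapping
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += vals[k] * x[col_idx[k]]  # indirect load through col_idx
    return y

# A = [[4, 0, 1], [0, 2, 0], [3, 0, 5]]
row_ptr = np.array([0, 2, 3, 5])
col_idx = np.array([0, 2, 1, 0, 2])
vals = np.array([4.0, 1.0, 2.0, 3.0, 5.0])
x = np.array([1.0, 2.0, 3.0])
print(spmv_csr(row_ptr, col_idx, vals, x))  # y = [7, 4, 18]
```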
In this paper we present a new blocked row-column (BRC) storage format with a novel two-dimensional blocking mechanism that effectively addresses these challenges: it reduces thread divergence by reordering rows of the input matrix and grouping rows with nearly equal numbers of non-zero elements onto the same execution units (i.e., warps), and it improves load balance by partitioning rows into blocks with a constant number of non-zeros, so that different warps perform the same amount of work. We also present an efficient auto-tuning technique that optimizes BRC performance by judiciously selecting the block size based on the sparsity characteristics of the matrix. A CUDA implementation of BRC outperforms NVIDIA's CUSP and cuSPARSE libraries, as well as other state-of-the-art SpMV formats, on a range of unstructured sparse matrices from multiple application domains. The BRC format has been integrated with PETSc, enabling its use in PETSc's solvers.
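The row-reordering step described in the abstract can be sketched in a few lines. This is a hedged illustration of the idea only: `WARP`, `brc_like_reorder`, and the example `row_ptr` are hypothetical names and data, and the actual BRC format additionally splits long rows across two-dimensional blocks with a fixed number of non-zeros per block, which is omitted here:

```python
import numpy as np

WARP = 4  # illustrative group size; real NVIDIA warps have 32 threads

def brc_like_reorder(row_ptr):
    """Permute rows so rows with similar non-zero counts share a warp.

    Sorting rows by length and cutting the permutation into warp-sized
    groups means the threads of one warp execute nearly equal-length
    inner loops, which is the divergence reduction the abstract
    describes. (A sketch only, not the full BRC layout.)
    """
    nnz = np.diff(row_ptr)                       # non-zeros per row
    perm = np.argsort(nnz, kind="stable")[::-1]  # longest rows first
    groups = [perm[i:i + WARP] for i in range(0, len(perm), WARP)]
    return perm, groups

# 8 rows with non-zero counts 1, 4, 1, 6, 2, 1, 8, 1
row_ptr = np.array([0, 1, 5, 6, 12, 14, 15, 23, 24])
perm, groups = brc_like_reorder(row_ptr)
# first group holds the four longest rows (lengths 8, 6, 4, 2);
# second group holds the four length-1 rows
```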



Published In

ICS '14: Proceedings of the 28th ACM international conference on Supercomputing
June 2014
378 pages
ISBN:9781450326421
DOI:10.1145/2597652

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. brc
  2. cuda
  3. gpu
  4. spmv

Qualifiers

  • Research-article

Conference

ICS'14

Acceptance Rates

ICS '14 paper acceptance rate: 34 of 160 submissions (21%)
Overall acceptance rate: 629 of 2,180 submissions (29%)


Article Metrics

  • Downloads (last 12 months): 34
  • Downloads (last 6 weeks): 5
Reflects downloads up to 03 Oct 2024

Cited By

  • (2024) Accelerating SpMV for Scale-Free Graphs with Optimized Bins. 2024 IEEE 40th International Conference on Data Engineering (ICDE), pp. 2407-2420. DOI: 10.1109/ICDE60146.2024.00190. Online: 13-May-2024
  • (2024) Optimizing CSR-Based SpMV on a New MIMD Architecture Pezy-SC3s. Algorithms and Architectures for Parallel Processing, pp. 22-39. DOI: 10.1007/978-981-97-0801-7_2. Online: 1-Mar-2024
  • (2023) Optimizing Tensor Computations: From Applications to Compilation and Runtime Techniques. Companion of the 2023 International Conference on Management of Data, pp. 53-59. DOI: 10.1145/3555041.3589407. Online: 4-Jun-2023
  • (2023) Path-Based Processing using In-Memory Systolic Arrays for Accelerating Data-Intensive Applications. 2023 IEEE/ACM International Conference on Computer Aided Design (ICCAD), pp. 1-9. DOI: 10.1109/ICCAD57390.2023.10323622. Online: 28-Oct-2023
  • (2023) Memory Bandwidth Conservation for SpMV Kernels Through Adaptive Lossy Data Compression. Parallel and Distributed Computing, Applications and Technologies, pp. 467-480. DOI: 10.1007/978-3-031-29927-8_36. Online: 8-Apr-2023
  • (2022) Vectorizing SpMV by Exploiting Dynamic Regular Patterns. Proceedings of the 51st International Conference on Parallel Processing, pp. 1-12. DOI: 10.1145/3545008.3545042. Online: 29-Aug-2022
  • (2022) VCSR: An Efficient GPU Memory-Aware Sparse Format. IEEE Transactions on Parallel and Distributed Systems, 33(12):3977-3989. DOI: 10.1109/TPDS.2022.3177291. Online: 1-Dec-2022
  • (2022) AlphaSparse: Generating High Performance SpMV Codes Directly from Sparse Matrices. SC22: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-15. DOI: 10.1109/SC41404.2022.00071. Online: Nov-2022
  • (2021) Variable-sized blocks for locality-aware SpMV. Proceedings of the 2021 IEEE/ACM International Symposium on Code Generation and Optimization, pp. 211-221. DOI: 10.1109/CGO51591.2021.9370327. Online: 27-Feb-2021
  • (2021) An effective SPMV based on block strategy and hybrid compression on GPU. The Journal of Supercomputing, 78(5):6318-6339. DOI: 10.1007/s11227-021-04123-6. Online: 18-Oct-2021
