DOI: 10.1145/2597652.2597678

An efficient two-dimensional blocking strategy for sparse matrix-vector multiplication on GPUs

Published: 10 June 2014

Abstract

Sparse matrix-vector multiplication (SpMV) is one of the key operations in linear algebra. Optimizing SpMV on GPUs is challenging because the sparsity and irregularity of the input matrix lead to thread divergence, load imbalance, and non-coalesced, indirect memory accesses.
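These access-pattern problems are easiest to see in a plain CSR kernel. The following is a minimal sketch, shown only as the common baseline rather than as the paper's format; the small matrix used is illustrative:

```python
import numpy as np

def spmv_csr(row_ptr, col_idx, vals, x):
    """y = A @ x for A stored in CSR form.

    The gather x[col_idx[k]] is the indirect, potentially non-coalesced
    access; rows of very different lengths cause the thread divergence
    and load imbalance described above when each row maps to one thread.
    """
    n = len(row_ptr) - 1
    y = np.zeros(n)
    for i in range(n):  # on a GPU, each i would be one thread in the naive mapping
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += vals[k] * x[col_idx[k]]  # indirect load through col_idx
    return y

# A = [[4, 0, 1], [0, 2, 0], [3, 0, 5]]
row_ptr = np.array([0, 2, 3, 5])
col_idx = np.array([0, 2, 1, 0, 2])
vals = np.array([4.0, 1.0, 2.0, 3.0, 5.0])
x = np.array([1.0, 2.0, 3.0])
print(spmv_csr(row_ptr, col_idx, vals, x))  # y = [7, 4, 18]
```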
In this paper we present a new blocked row-column (BRC) storage format with a novel two-dimensional blocking mechanism that effectively addresses these challenges: it reduces thread divergence by reordering rows of the input matrix and grouping rows with nearly equal numbers of non-zero elements onto the same execution units (i.e., warps), and it improves load balance by partitioning rows into blocks with a constant number of non-zeros, so that different warps perform the same amount of work. We also present an efficient auto-tuning technique that optimizes BRC performance by judiciously selecting the block size based on the sparsity characteristics of the matrix. A CUDA implementation of BRC outperforms NVIDIA's CUSP and cuSPARSE libraries, as well as other state-of-the-art SpMV formats, on a range of unstructured sparse matrices from multiple application domains. The BRC format has been integrated with PETSc, enabling its use in PETSc's solvers.
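The row-reordering step described in the abstract can be sketched in a few lines. This is a hedged illustration of the idea only: `WARP`, `brc_like_reorder`, and the example `row_ptr` are hypothetical names and data, and the actual BRC format additionally splits long rows across two-dimensional blocks with a fixed number of non-zeros per block, which is omitted here:

```python
import numpy as np

WARP = 4  # illustrative group size; real NVIDIA warps have 32 threads

def brc_like_reorder(row_ptr):
    """Permute rows so rows with similar non-zero counts share a warp.

    Sorting rows by length and cutting the permutation into warp-sized
    groups means the threads of one warp execute nearly equal-length
    inner loops, which is the divergence reduction the abstract
    describes. (A sketch only, not the full BRC layout.)
    """
    nnz = np.diff(row_ptr)                       # non-zeros per row
    perm = np.argsort(nnz, kind="stable")[::-1]  # longest rows first
    groups = [perm[i:i + WARP] for i in range(0, len(perm), WARP)]
    return perm, groups

# 8 rows with non-zero counts 1, 4, 1, 6, 2, 1, 8, 1
row_ptr = np.array([0, 1, 5, 6, 12, 14, 15, 23, 24])
perm, groups = brc_like_reorder(row_ptr)
# first group holds the four longest rows (lengths 8, 6, 4, 2);
# second group holds the four length-1 rows
```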



Published In

ICS '14: Proceedings of the 28th ACM international conference on Supercomputing
June 2014
378 pages
ISBN:9781450326421
DOI:10.1145/2597652

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. brc
  2. cuda
  3. gpu
  4. spmv

Qualifiers

  • Research-article

Conference

ICS'14

Acceptance Rates

ICS '14 paper acceptance rate: 34 of 160 submissions (21%)
Overall acceptance rate: 629 of 2,180 submissions (29%)


Article Metrics

  • Downloads (last 12 months): 34
  • Downloads (last 6 weeks): 5
Reflects downloads up to 03 Oct 2024

Cited By

  • (2024) Accelerating SpMV for Scale-Free Graphs with Optimized Bins. 2024 IEEE 40th International Conference on Data Engineering (ICDE), pp. 2407-2420. DOI: 10.1109/ICDE60146.2024.00190. Online: 13-May-2024
  • (2024) Optimizing CSR-Based SpMV on a New MIMD Architecture Pezy-SC3s. Algorithms and Architectures for Parallel Processing, pp. 22-39. DOI: 10.1007/978-981-97-0801-7_2. Online: 1-Mar-2024
  • (2023) Optimizing Tensor Computations: From Applications to Compilation and Runtime Techniques. Companion of the 2023 International Conference on Management of Data, pp. 53-59. DOI: 10.1145/3555041.3589407. Online: 4-Jun-2023
  • (2023) Path-Based Processing using In-Memory Systolic Arrays for Accelerating Data-Intensive Applications. 2023 IEEE/ACM International Conference on Computer Aided Design (ICCAD), pp. 1-9. DOI: 10.1109/ICCAD57390.2023.10323622. Online: 28-Oct-2023
  • (2023) Memory Bandwidth Conservation for SpMV Kernels Through Adaptive Lossy Data Compression. Parallel and Distributed Computing, Applications and Technologies, pp. 467-480. DOI: 10.1007/978-3-031-29927-8_36. Online: 8-Apr-2023
  • (2022) Vectorizing SpMV by Exploiting Dynamic Regular Patterns. Proceedings of the 51st International Conference on Parallel Processing, pp. 1-12. DOI: 10.1145/3545008.3545042. Online: 29-Aug-2022
  • (2022) VCSR: An Efficient GPU Memory-Aware Sparse Format. IEEE Transactions on Parallel and Distributed Systems, 33(12):3977-3989. DOI: 10.1109/TPDS.2022.3177291. Online: 1-Dec-2022
  • (2022) AlphaSparse: Generating High Performance SpMV Codes Directly from Sparse Matrices. SC22: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-15. DOI: 10.1109/SC41404.2022.00071. Online: Nov-2022
  • (2021) Variable-sized blocks for locality-aware SpMV. Proceedings of the 2021 IEEE/ACM International Symposium on Code Generation and Optimization, pp. 211-221. DOI: 10.1109/CGO51591.2021.9370327. Online: 27-Feb-2021
  • (2021) An effective SPMV based on block strategy and hybrid compression on GPU. The Journal of Supercomputing, 78(5):6318-6339. DOI: 10.1007/s11227-021-04123-6. Online: 18-Oct-2021
