A Unified Sparse Matrix Data Format for Efficient General Sparse Matrix-Vector Multiplication on Modern Processors with Wide SIMD Units

Published: 01 January 2014

Abstract

Sparse matrix-vector multiplication (spMVM) is the most time-consuming kernel in many numerical algorithms and has been studied extensively on all modern processor and accelerator architectures. However, the optimal sparse matrix data storage format is highly hardware-specific, which could become an obstacle when using heterogeneous systems. Also, it is as yet unclear how the wide single instruction multiple data (SIMD) units in current multi- and many-core processors should be used most efficiently if there is no structure in the sparsity pattern of the matrix. We suggest SELL-$C$-$\sigma$, a variant of Sliced ELLPACK, as a SIMD-friendly data format which combines long-standing ideas from general-purpose graphics processing units and vector computer programming. We discuss the advantages of SELL-$C$-$\sigma$ compared to established formats like Compressed Row Storage and ELLPACK and show its suitability on a variety of hardware platforms (Intel Sandy Bridge, Intel Xeon Phi, and Nvidia Tesla K20) for a wide range of test matrices from different application areas. Using appropriate performance models we develop deep insight into the data transfer properties of the SELL-$C$-$\sigma$ spMVM kernel. SELL-$C$-$\sigma$ comes with two tuning parameters whose performance impact across the range of test matrices is studied and for which reasonable choices are proposed. This leads to a hardware-independent ("catch-all") sparse matrix format, which achieves very high efficiency for all test matrices across all hardware platforms.
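
To make the chunk-wise layout concrete, the following is a minimal C sketch of an spMVM kernel (y = A*x) over a SELL-C-sigma-like structure, written from the description in the abstract: rows are sorted by length within windows of sigma rows, grouped into chunks of C rows, and each chunk is padded to its longest row and stored column-major. The struct fields and function names are illustrative assumptions, not the authors' reference implementation.

    /* Minimal SELL-C-sigma style spMVM sketch (assumptions: every chunk is
     * stored with full height C; padding entries carry val = 0.0 and col = 0
     * so they contribute nothing; y must be zero-initialized by the caller). */
    #include <stddef.h>

    typedef struct {
        size_t n_rows;      /* number of matrix rows (possibly permuted)     */
        size_t n_chunks;    /* ceil(n_rows / C)                              */
        size_t C;           /* chunk height, matched to the SIMD width       */
        size_t *chunk_ptr;  /* offset of each chunk in val[] / col[]         */
        size_t *chunk_len;  /* padded width (longest row) of each chunk      */
        double *val;        /* nonzeros plus padding, column-major per chunk */
        size_t *col;        /* column indices, same layout as val[]          */
    } sell_matrix;

    void spmv_sell(const sell_matrix *A, const double *x, double *y)
    {
        for (size_t c = 0; c < A->n_chunks; ++c) {
            size_t base = A->chunk_ptr[c];
            size_t rows = A->C;
            if ((c + 1) * A->C > A->n_rows)      /* short trailing chunk */
                rows = A->n_rows - c * A->C;
            for (size_t j = 0; j < A->chunk_len[c]; ++j) {
                /* Unit-stride inner loop over the C rows of the chunk:
                 * this is the loop intended for SIMD vectorization.     */
                for (size_t r = 0; r < rows; ++r) {
                    size_t idx = base + j * A->C + r;
                    y[c * A->C + r] += A->val[idx] * x[A->col[idx]];
                }
            }
        }
    }

Note that sigma does not appear in the kernel itself: it only controls the scope of the row sorting applied when the matrix is assembled, which in turn determines how much zero padding the chunks carry and in which (permuted) order the entries of y are produced.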

Published In

SIAM Journal on Scientific Computing, Volume 36, Issue 5
Special Section on Two Themes: Planet Earth and Big Data
2014
977 pages
ISSN:1064-8275
DOI:10.1137/sjoce3.36.5

Publisher

Society for Industrial and Applied Mathematics

United States

Author Tags

  1. sparse matrix
  2. sparse matrix-vector multiplication
  3. data format
  4. performance model
  5. SIMD

MSC Codes

  1. 65Y10
  2. 65Y20
  3. 65F50

Qualifiers

  • Research-article

Cited By

  • (2024) Optimizing SpMV on Heterogeneous Multi-Core DSPs through Improved Locality and Vectorization. Proceedings of the 53rd International Conference on Parallel Processing, pp. 1145-1155. DOI: 10.1145/3673038.3673061. Online publication date: 12-Aug-2024.
  • (2024) Machine Learning-Based Kernel Selector for SpMV Optimization in Graph Analysis. ACM Transactions on Parallel Computing, 11(2), pp. 1-25. DOI: 10.1145/3652579. Online publication date: 8-Jun-2024.
  • (2024) Extending Sparse Patterns to Improve Inverse Preconditioning on GPU Architectures. Proceedings of the 33rd International Symposium on High-Performance Parallel and Distributed Computing, pp. 200-213. DOI: 10.1145/3625549.3658683. Online publication date: 3-Jun-2024.
  • (2024) HAM-SpMSpV: An Optimized Parallel Algorithm for Masked Sparse Matrix-Sparse Vector Multiplications on Multi-core CPUs. Proceedings of the 33rd International Symposium on High-Performance Parallel and Distributed Computing, pp. 160-173. DOI: 10.1145/3625549.3658680. Online publication date: 3-Jun-2024.
  • (2024) Tandem Processor: Grappling with Emerging Operators in Neural Networks. Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, pp. 1165-1182. DOI: 10.1145/3620665.3640365. Online publication date: 27-Apr-2024.
  • (2024) Exploring the Design Space of Distributed Parallel Sparse Matrix-Multiple Vector Multiplication. IEEE Transactions on Parallel and Distributed Systems, 35(11), pp. 1977-1988. DOI: 10.1109/TPDS.2024.3452478. Online publication date: 1-Nov-2024.
  • (2024) DBSR: An Efficient Storage Format for Vectorizing Sparse Triangular Solvers on Structured Grids. Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, pp. 1-14. DOI: 10.1109/SC41406.2024.00065. Online publication date: 17-Nov-2024.
  • (2024) AmgT: Algebraic Multigrid Solver on Tensor Cores. Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, pp. 1-16. DOI: 10.1109/SC41406.2024.00058. Online publication date: 17-Nov-2024.
  • (2024) A Conflict-aware Divide-and-Conquer Algorithm for Symmetric Sparse Matrix-Vector Multiplication. Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, pp. 1-15. DOI: 10.1109/SC41406.2024.00054. Online publication date: 17-Nov-2024.
  • (2024) Block-wise dynamic mixed-precision for sparse matrix-vector multiplication on GPUs. The Journal of Supercomputing, 80(10), pp. 13681-13713. DOI: 10.1007/s11227-024-05949-6. Online publication date: 1-Jul-2024.
  • Show More Cited By
