A Unified Sparse Matrix Data Format for Efficient General Sparse Matrix-Vector Multiplication on Modern Processors with Wide SIMD Units

Published: 01 January 2014

Abstract

Sparse matrix-vector multiplication (spMVM) is the most time-consuming kernel in many numerical algorithms and has been studied extensively on all modern processor and accelerator architectures. However, the optimal sparse matrix data storage format is highly hardware-specific, which could become an obstacle when using heterogeneous systems. Also, it is as yet unclear how the wide single instruction multiple data (SIMD) units in current multi- and many-core processors should be used most efficiently if there is no structure in the sparsity pattern of the matrix. We suggest SELL-$C$-$\sigma$, a variant of Sliced ELLPACK, as a SIMD-friendly data format which combines long-standing ideas from general-purpose graphics processing units and vector computer programming. We discuss the advantages of SELL-$C$-$\sigma$ compared to established formats like Compressed Row Storage and ELLPACK and show its suitability on a variety of hardware platforms (Intel Sandy Bridge, Intel Xeon Phi, and Nvidia Tesla K20) for a wide range of test matrices from different application areas. Using appropriate performance models we develop deep insight into the data transfer properties of the SELL-$C$-$\sigma$ spMVM kernel. SELL-$C$-$\sigma$ comes with two tuning parameters whose performance impact across the range of test matrices is studied and for which reasonable choices are proposed. This leads to a hardware-independent ("catch-all") sparse matrix format, which achieves very high efficiency for all test matrices across all hardware platforms.
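
To make the chunk-wise layout concrete, the following is a minimal C sketch of an spMVM kernel (y = A*x) over a SELL-C-sigma-like structure, written from the description in the abstract: rows are sorted by length within windows of sigma rows, grouped into chunks of C rows, and each chunk is padded to its longest row and stored column-major. The struct fields and function names are illustrative assumptions, not the authors' reference implementation.

    /* Minimal SELL-C-sigma style spMVM sketch (assumptions: every chunk is
     * stored with full height C; padding entries carry val = 0.0 and col = 0
     * so they contribute nothing; y must be zero-initialized by the caller). */
    #include <stddef.h>

    typedef struct {
        size_t n_rows;      /* number of matrix rows (possibly permuted)     */
        size_t n_chunks;    /* ceil(n_rows / C)                              */
        size_t C;           /* chunk height, matched to the SIMD width       */
        size_t *chunk_ptr;  /* offset of each chunk in val[] / col[]         */
        size_t *chunk_len;  /* padded width (longest row) of each chunk      */
        double *val;        /* nonzeros plus padding, column-major per chunk */
        size_t *col;        /* column indices, same layout as val[]          */
    } sell_matrix;

    void spmv_sell(const sell_matrix *A, const double *x, double *y)
    {
        for (size_t c = 0; c < A->n_chunks; ++c) {
            size_t base = A->chunk_ptr[c];
            size_t rows = A->C;
            if ((c + 1) * A->C > A->n_rows)      /* short trailing chunk */
                rows = A->n_rows - c * A->C;
            for (size_t j = 0; j < A->chunk_len[c]; ++j) {
                /* Unit-stride inner loop over the C rows of the chunk:
                 * this is the loop intended for SIMD vectorization.     */
                for (size_t r = 0; r < rows; ++r) {
                    size_t idx = base + j * A->C + r;
                    y[c * A->C + r] += A->val[idx] * x[A->col[idx]];
                }
            }
        }
    }

Note that sigma does not appear in the kernel itself: it only controls the scope of the row sorting applied when the matrix is assembled, which in turn determines how much zero padding the chunks carry and in which (permuted) order the entries of y are produced.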

Published In

SIAM Journal on Scientific Computing, Volume 36, Issue 5
Special Section on Two Themes: Planet Earth and Big Data
2014
977 pages
ISSN:1064-8275
DOI:10.1137/sjoce3.36.5

Publisher

Society for Industrial and Applied Mathematics

United States

Author Tags

  1. sparse matrix
  2. sparse matrix-vector multiplication
  3. data format
  4. performance model
  5. SIMD

MSC Codes

  1. 65Y10
  2. 65Y20
  3. 65F50

Qualifiers

  • Research-article

Cited By

  • (2024) Optimizing SpMV on Heterogeneous Multi-Core DSPs through Improved Locality and Vectorization. Proceedings of the 53rd International Conference on Parallel Processing, pp. 1145-1155. DOI: 10.1145/3673038.3673061. Online publication date: 12-Aug-2024.
  • (2024) Machine Learning-Based Kernel Selector for SpMV Optimization in Graph Analysis. ACM Transactions on Parallel Computing, 11(2), pp. 1-25. DOI: 10.1145/3652579. Online publication date: 8-Jun-2024.
  • (2024) Extending Sparse Patterns to Improve Inverse Preconditioning on GPU Architectures. Proceedings of the 33rd International Symposium on High-Performance Parallel and Distributed Computing, pp. 200-213. DOI: 10.1145/3625549.3658683. Online publication date: 3-Jun-2024.
  • (2024) HAM-SpMSpV: An Optimized Parallel Algorithm for Masked Sparse Matrix-Sparse Vector Multiplications on Multi-core CPUs. Proceedings of the 33rd International Symposium on High-Performance Parallel and Distributed Computing, pp. 160-173. DOI: 10.1145/3625549.3658680. Online publication date: 3-Jun-2024.
  • (2024) Tandem Processor: Grappling with Emerging Operators in Neural Networks. Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, pp. 1165-1182. DOI: 10.1145/3620665.3640365. Online publication date: 27-Apr-2024.
  • (2024) Exploring the Design Space of Distributed Parallel Sparse Matrix-Multiple Vector Multiplication. IEEE Transactions on Parallel and Distributed Systems, 35(11), pp. 1977-1988. DOI: 10.1109/TPDS.2024.3452478. Online publication date: 1-Nov-2024.
  • (2024) DBSR: An Efficient Storage Format for Vectorizing Sparse Triangular Solvers on Structured Grids. Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, pp. 1-14. DOI: 10.1109/SC41406.2024.00065. Online publication date: 17-Nov-2024.
  • (2024) AmgT: Algebraic Multigrid Solver on Tensor Cores. Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, pp. 1-16. DOI: 10.1109/SC41406.2024.00058. Online publication date: 17-Nov-2024.
  • (2024) A Conflict-aware Divide-and-Conquer Algorithm for Symmetric Sparse Matrix-Vector Multiplication. Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, pp. 1-15. DOI: 10.1109/SC41406.2024.00054. Online publication date: 17-Nov-2024.
  • (2024) Block-wise dynamic mixed-precision for sparse matrix-vector multiplication on GPUs. The Journal of Supercomputing, 80(10), pp. 13681-13713. DOI: 10.1007/s11227-024-05949-6. Online publication date: 1-Jul-2024.
  • Show More Cited By
