DOI: 10.1145/1362622.1362674

Optimization of sparse matrix-vector multiplication on emerging multicore platforms

Published: 10 November 2007

Abstract

We are witnessing a dramatic change in computer architecture due to the multicore paradigm shift, as every electronic device from cell phones to supercomputers confronts parallelism of unprecedented scale. To fully unleash the potential of these systems, the HPC community must develop multicore-specific optimization methodologies for important scientific computations. In this work, we examine sparse matrix-vector multiply (SpMV) - one of the most heavily used kernels in scientific computing - across a broad spectrum of multicore designs. Our experimental platform includes the homogeneous AMD dual-core and Intel quad-core designs, the heterogeneous STI Cell, as well as the first scientific study of the highly multithreaded Sun Niagara2. We present several optimization strategies especially effective for the multicore environment, and demonstrate significant performance improvements compared to existing state-of-the-art serial and parallel SpMV implementations. Additionally, we present key insights into the architectural tradeoffs of leading multicore design strategies, in the context of demanding memory-bound numerical algorithms.
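For context on the kernel the paper studies, the sketch below shows a plain compressed sparse row (CSR) SpMV in C. It is a generic baseline illustration, not the authors' tuned code: the array names, the tiny example matrix, and the function name are invented for this sketch. The indirect, irregular reads of the source vector x and the low flop-to-byte ratio visible in the inner loop are what make SpMV memory-bound and a natural target for the multicore optimizations the paper describes.

```c
/* Minimal CSR (compressed sparse row) SpMV reference kernel, y = A*x.
 * Generic illustration only; array names and the 3x3 example matrix
 * are made up and do not come from the paper. */
#include <stdio.h>

/* CSR layout: row_ptr has n_rows+1 entries; col_idx/values hold the
 * nonzeros of each row contiguously. */
static void spmv_csr(int n_rows, const int *row_ptr, const int *col_idx,
                     const double *values, const double *x, double *y)
{
    for (int i = 0; i < n_rows; i++) {
        double sum = 0.0;
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
            sum += values[k] * x[col_idx[k]];   /* indirect access into x */
        y[i] = sum;
    }
}

int main(void)
{
    /* 3x3 example matrix: [2 0 1; 0 3 0; 4 0 5] */
    int    row_ptr[] = {0, 2, 3, 5};
    int    col_idx[] = {0, 2, 1, 0, 2};
    double values[]  = {2, 1, 3, 4, 5};
    double x[] = {1, 1, 1}, y[3];

    spmv_csr(3, row_ptr, col_idx, values, x, y);
    for (int i = 0; i < 3; i++)
        printf("y[%d] = %g\n", i, y[i]);   /* expect 3, 3, 9 */
    return 0;
}
```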




Published In

SC '07: Proceedings of the 2007 ACM/IEEE conference on Supercomputing
November 2007
723 pages
ISBN: 9781595937643
DOI: 10.1145/1362622

Publisher

Association for Computing Machinery

New York, NY, United States


Qualifiers

  • Research-article

Conference

SC '07

Acceptance Rates

SC '07 paper acceptance rate: 54 of 268 submissions (20%)
Overall acceptance rate: 1,516 of 6,373 submissions (24%)

