DOI: 10.1145/1362622.1362674

Optimization of sparse matrix-vector multiplication on emerging multicore platforms

Published: 10 November 2007

Abstract

We are witnessing a dramatic change in computer architecture due to the multicore paradigm shift, as every electronic device from cell phones to supercomputers confronts parallelism of unprecedented scale. To fully unleash the potential of these systems, the HPC community must develop multicore-specific optimization methodologies for important scientific computations. In this work, we examine sparse matrix-vector multiply (SpMV) - one of the most heavily used kernels in scientific computing - across a broad spectrum of multicore designs. Our experimental platform includes the homogeneous AMD dual-core and Intel quad-core designs, the heterogeneous STI Cell, as well as the first scientific study of the highly multithreaded Sun Niagara2. We present several optimization strategies especially effective for the multicore environment, and demonstrate significant performance improvements compared to existing state-of-the-art serial and parallel SpMV implementations. Additionally, we present key insights into the architectural tradeoffs of leading multicore design strategies, in the context of demanding memory-bound numerical algorithms.
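For context on the kernel the paper studies, the sketch below shows a plain compressed sparse row (CSR) SpMV in C. It is a generic baseline illustration, not the authors' tuned code: the array names, the tiny example matrix, and the function name are invented for this sketch. The indirect, irregular reads of the source vector x and the low flop-to-byte ratio visible in the inner loop are what make SpMV memory-bound and a natural target for the multicore optimizations the paper describes.

```c
/* Minimal CSR (compressed sparse row) SpMV reference kernel, y = A*x.
 * Generic illustration only; array names and the 3x3 example matrix
 * are made up and do not come from the paper. */
#include <stdio.h>

/* CSR layout: row_ptr has n_rows+1 entries; col_idx/values hold the
 * nonzeros of each row contiguously. */
static void spmv_csr(int n_rows, const int *row_ptr, const int *col_idx,
                     const double *values, const double *x, double *y)
{
    for (int i = 0; i < n_rows; i++) {
        double sum = 0.0;
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
            sum += values[k] * x[col_idx[k]];   /* indirect access into x */
        y[i] = sum;
    }
}

int main(void)
{
    /* 3x3 example matrix: [2 0 1; 0 3 0; 4 0 5] */
    int    row_ptr[] = {0, 2, 3, 5};
    int    col_idx[] = {0, 2, 1, 0, 2};
    double values[]  = {2, 1, 3, 4, 5};
    double x[] = {1, 1, 1}, y[3];

    spmv_csr(3, row_ptr, col_idx, values, x, y);
    for (int i = 0; i < 3; i++)
        printf("y[%d] = %g\n", i, y[i]);   /* expect 3, 3, 9 */
    return 0;
}
```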




Published In

SC '07: Proceedings of the 2007 ACM/IEEE conference on Supercomputing
November 2007
723 pages
ISBN: 9781595937643
DOI: 10.1145/1362622

Publisher

Association for Computing Machinery

New York, NY, United States


Qualifiers

  • Research-article

Conference

SC '07

Acceptance Rates

SC '07 paper acceptance rate: 54 of 268 submissions (20%)
Overall acceptance rate: 1,516 of 6,373 submissions (24%)

