
Exploiting dense substructures for fast sparse matrix vector multiplication

Published: 01 August 2011

Abstract

The execution time of many scientific computing applications is dominated by the time spent in performing sparse matrix vector multiplication (SMV; y ← A · x). We consider improving the performance of SMV on multicores by exploiting the dense substructures that are inherently present in many sparse matrices derived from partial differential equation models. First, we identify indistinguishable vertices, i.e., vertices with the same adjacency structure, in a graph representation of the sparse matrix (A) and group them into a supernode. Next, we identify effectively dense blocks within the matrix by grouping rows and columns in each supernode. Finally, by using a suitable data structure for this representation of the matrix, we reduce the number of load operations during SMV while exactly preserving the original sparsity structure of A. In addition, we use ordering techniques to enhance locality in accesses to the vector, x, to yield an SMV kernel that exploits the effectively dense substructures in the matrix. We evaluate our scheme on Intel Nehalem and AMD Shanghai processors. We observe that for larger matrices on the Intel Nehalem processor, our method improves performance on average by 37.35% compared with the traditional compressed sparse row scheme (a blocked compressed form improves performance on average by 30.27%). Benefits of our new format are similar for the AMD processor. More importantly, if we pick for each matrix the best among our method and the blocked compressed scheme, the average performance improvements increase to 40.85%. Additional results indicate that the best performing scheme varies depending on the matrix and the system. We therefore propose an effective density measure that could be used for method selection, thus adding to the variety of options for an auto-tuned, optimized SMV kernel that can exploit sparse matrix properties and hardware attributes for high performance.
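To make the idea concrete, the following C sketch contrasts the baseline compressed sparse row (CSR) kernel with a supernodal variant in which rows sharing an identical adjacency pattern reuse one copy of the column indices. This is a minimal illustration of the principle, not the authors' actual data structure; all names, signatures, and the storage layout (row-major values within each supernode) are assumptions made for the example.

```c
#include <stddef.h>

/* Baseline CSR SpMV: y = A * x.
 * row_ptr has n+1 entries; col_idx and val hold the nonzeros row by row. */
static void spmv_csr(int n, const int *row_ptr, const int *col_idx,
                     const double *val, const double *x, double *y)
{
    for (int i = 0; i < n; i++) {
        double sum = 0.0;
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
            sum += val[k] * x[col_idx[k]];   /* one index load per nonzero */
        y[i] = sum;
    }
}

/* Supernodal variant (illustrative): rows sn_ptr[s] .. sn_ptr[s+1]-1 belong
 * to supernode s and share one column pattern, stored once at
 * pattern[pat_ptr[s] .. pat_ptr[s+1]-1].  Values are laid out row by row,
 * plen doubles per row, so the index array is read once per supernode
 * rather than once per row. */
static void spmv_supernodal(int nsup, const int *sn_ptr,
                            const int *pat_ptr, const int *pattern,
                            const double *val, const double *x, double *y)
{
    const double *v = val;
    for (int s = 0; s < nsup; s++) {
        const int *cols = pattern + pat_ptr[s];
        int plen = pat_ptr[s + 1] - pat_ptr[s];
        for (int i = sn_ptr[s]; i < sn_ptr[s + 1]; i++) {
            double sum = 0.0;
            for (int k = 0; k < plen; k++)
                sum += v[k] * x[cols[k]];    /* indices reused across rows */
            v += plen;
            y[i] = sum;
        }
    }
}
```

Because every row in a supernode touches the same entries of x, the shared pattern both cuts index loads and improves temporal locality in accesses to x, which is the source of the improvement described in the abstract.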


Cited By

  • (2020) Optimizing the Linear Fascicle Evaluation Algorithm for Multi-core and Many-core Systems. ACM Transactions on Parallel Computing 7(4): 1-45. https://doi.org/10.1145/3418075
  • (2019) Optimizing the linear fascicle evaluation algorithm for many-core systems. In Proceedings of the ACM International Conference on Supercomputing (ICS '19), pp. 425-437. https://doi.org/10.1145/3330345.3332469
  • (2015) Loop and data transformations for sparse matrix code. ACM SIGPLAN Notices 50(6): 521-532. https://doi.org/10.1145/2813885.2738003
  • (2015) Loop and data transformations for sparse matrix code. In Proceedings of the 36th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '15), pp. 521-532. https://doi.org/10.1145/2737924.2738003


Published In

International Journal of High Performance Computing Applications, Volume 25, Issue 3 (August 2011), 92 pages

Publisher

Sage Publications, Inc., United States


Author Tags

  1. compressed storage formats
  2. envelope ordering
  3. performance
  4. sparse matrix vector multiplication
  5. supernodes

