
Exploiting dense substructures for fast sparse matrix vector multiplication

Published: 01 August 2011

Abstract

The execution time of many scientific computing applications is dominated by the time spent in performing sparse matrix vector multiplication (SMV; y ← A · x). We consider improving the performance of SMV on multicores by exploiting the dense substructures that are inherently present in many sparse matrices derived from partial differential equation models. First, we identify indistinguishable vertices, i.e., vertices with the same adjacency structure, in a graph representation of the sparse matrix (A) and group them into a supernode. Next, we identify effectively dense blocks within the matrix by grouping rows and columns in each supernode. Finally, by using a suitable data structure for this representation of the matrix, we reduce the number of load operations during SMV while exactly preserving the original sparsity structure of A. In addition, we use ordering techniques to enhance locality in accesses to the vector, x, to yield an SMV kernel that exploits the effectively dense substructures in the matrix. We evaluate our scheme on Intel Nehalem and AMD Shanghai processors. We observe that for larger matrices on the Intel Nehalem processor, our method improves performance on average by 37.35% compared with the traditional compressed sparse row scheme (a blocked compressed form improves performance on average by 30.27%). Benefits of our new format are similar for the AMD processor. More importantly, if we pick for each matrix the best among our method and the blocked compressed scheme, the average performance improvements increase to 40.85%. Additional results indicate that the best performing scheme varies depending on the matrix and the system. We therefore propose an effective density measure that could be used for method selection, thus adding to the variety of options for an auto-tuned, optimized SMV kernel that can exploit sparse matrix properties and hardware attributes for high performance.
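To make the idea concrete, the following C sketch contrasts the baseline compressed sparse row (CSR) kernel with a supernodal variant in which rows sharing an identical adjacency pattern reuse one copy of the column indices. This is a minimal illustration of the principle, not the authors' actual data structure; all names, signatures, and the storage layout (row-major values within each supernode) are assumptions made for the example.

```c
#include <stddef.h>

/* Baseline CSR SpMV: y = A * x.
 * row_ptr has n+1 entries; col_idx and val hold the nonzeros row by row. */
static void spmv_csr(int n, const int *row_ptr, const int *col_idx,
                     const double *val, const double *x, double *y)
{
    for (int i = 0; i < n; i++) {
        double sum = 0.0;
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
            sum += val[k] * x[col_idx[k]];   /* one index load per nonzero */
        y[i] = sum;
    }
}

/* Supernodal variant (illustrative): rows sn_ptr[s] .. sn_ptr[s+1]-1 belong
 * to supernode s and share one column pattern, stored once at
 * pattern[pat_ptr[s] .. pat_ptr[s+1]-1].  Values are laid out row by row,
 * plen doubles per row, so the index array is read once per supernode
 * rather than once per row. */
static void spmv_supernodal(int nsup, const int *sn_ptr,
                            const int *pat_ptr, const int *pattern,
                            const double *val, const double *x, double *y)
{
    const double *v = val;
    for (int s = 0; s < nsup; s++) {
        const int *cols = pattern + pat_ptr[s];
        int plen = pat_ptr[s + 1] - pat_ptr[s];
        for (int i = sn_ptr[s]; i < sn_ptr[s + 1]; i++) {
            double sum = 0.0;
            for (int k = 0; k < plen; k++)
                sum += v[k] * x[cols[k]];    /* indices reused across rows */
            v += plen;
            y[i] = sum;
        }
    }
}
```

Because every row in a supernode touches the same entries of x, the shared pattern both cuts index loads and improves temporal locality in accesses to x, which is the source of the improvement described in the abstract.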


Cited By

  • (2020) Optimizing the Linear Fascicle Evaluation Algorithm for Multi-core and Many-core Systems. ACM Transactions on Parallel Computing 7(4): 1-45. https://doi.org/10.1145/3418075
  • (2019) Optimizing the linear fascicle evaluation algorithm for many-core systems. In Proceedings of the ACM International Conference on Supercomputing (ICS '19), pp. 425-437. https://doi.org/10.1145/3330345.3332469
  • (2015) Loop and data transformations for sparse matrix code. ACM SIGPLAN Notices 50(6): 521-532. https://doi.org/10.1145/2813885.2738003
  • (2015) Loop and data transformations for sparse matrix code. In Proceedings of the 36th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '15), pp. 521-532. https://doi.org/10.1145/2737924.2738003


Published In

International Journal of High Performance Computing Applications, Volume 25, Issue 3 (August 2011), 92 pages

Publisher

Sage Publications, Inc., United States


Author Tags

  1. compressed storage formats
  2. envelope ordering
  3. performance
  4. sparse matrix vector multiplication
  5. supernodes

