DOI: 10.1109/SC41406.2024.00052

Distributed-Memory Parallel Algorithms for Sparse Matrix and Sparse Tall-and-Skinny Matrix Multiplication

Published: 17 November 2024

Abstract

We consider a sparse matrix-matrix multiplication (SpGEMM) setting in which one matrix is square and the other is tall and skinny. This special variant, TS-SpGEMM, has important applications in multi-source breadth-first search, influence maximization, sparse graph embedding, and algebraic multigrid solvers. Unfortunately, popular distributed algorithms such as sparse SUMMA deliver suboptimal performance for TS-SpGEMM. To address this limitation, we develop a novel distributed-memory algorithm tailored to TS-SpGEMM. Our approach employs customized 1D partitioning for all matrices involved and leverages sparsity-aware tiling for efficient data transfers. In addition, it minimizes communication overhead by combining local and remote computations. On average, our TS-SpGEMM algorithm attains a 5× performance gain over 2D and 3D SUMMA. Furthermore, we use our algorithm to implement multi-source breadth-first search and sparse graph embedding algorithms and demonstrate their scalability up to 512 nodes (65,536 cores) on NERSC Perlmutter.
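As a point of reference for the kernel the abstract describes, the following is a minimal serial sketch of row-wise (Gustavson-style) SpGEMM computing C = A·B, where A is an n×n sparse matrix and B is an n×k tall-and-skinny sparse matrix (k ≪ n). The dict-of-rows representation and the `spgemm` helper are illustrative assumptions for exposition only; the paper's distributed algorithm additionally applies 1D partitioning and sparsity-aware tiling, which are not shown here.

```python
# Serial row-wise SpGEMM sketch: each row of C is accumulated by scaling
# and merging the rows of B selected by the nonzeros in the matching row
# of A. Rows are stored as {column_index: value} dicts.

def spgemm(A_rows, B_rows):
    """A_rows, B_rows: lists of {col: val} dicts, one dict per row.
    Returns C = A @ B in the same dict-of-rows form."""
    C_rows = []
    for a_row in A_rows:
        acc = {}  # sparse accumulator for one output row
        for j, a_val in a_row.items():
            for col, b_val in B_rows[j].items():
                acc[col] = acc.get(col, 0) + a_val * b_val
        C_rows.append(acc)
    return C_rows

# Toy example: 3x3 sparse A times a 3x2 tall-and-skinny sparse B.
A = [{0: 1.0, 2: 2.0}, {1: 3.0}, {0: 4.0}]
B = [{0: 1.0}, {1: 5.0}, {0: 2.0, 1: 1.0}]
C = spgemm(A, B)  # C[0] == {0: 5.0, 1: 2.0}
```

Because B has only k columns, each accumulator stays small; a distributed TS-SpGEMM exploits exactly this structure to keep communication volume low.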


Published In

SC '24: Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis
November 2024
1758 pages
ISBN: 9798350352917

Publisher

IEEE Press



Qualifiers

  • Research-article
  • Research
  • Refereed limited

Acceptance Rates

Overall Acceptance Rate 1,516 of 6,373 submissions, 24%
