DOI: 10.1109/SC41406.2024.00052

Distributed-Memory Parallel Algorithms for Sparse Matrix and Sparse Tall-and-Skinny Matrix Multiplication

Published: 17 November 2024

Abstract

We consider a sparse matrix-matrix multiplication (SpGEMM) setting in which one matrix is square and the other is tall and skinny. This special variant, TS-SpGEMM, has important applications in multi-source breadth-first search, influence maximization, sparse graph embedding, and algebraic multigrid solvers. Unfortunately, popular distributed algorithms such as sparse SUMMA deliver suboptimal performance for TS-SpGEMM. To address this limitation, we develop a novel distributed-memory algorithm tailored to TS-SpGEMM. Our approach employs customized 1D partitioning for all matrices involved and leverages sparsity-aware tiling for efficient data transfers. In addition, it minimizes communication overhead by combining local and remote computations. On average, our TS-SpGEMM algorithm attains a 5× performance gain over 2D and 3D SUMMA. Furthermore, we use our algorithm to implement multi-source breadth-first search and sparse graph embedding algorithms and demonstrate their scalability up to 512 nodes (65,536 cores) on NERSC Perlmutter.
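As a point of reference for the kernel the abstract describes, the following is a minimal serial sketch of row-wise (Gustavson-style) SpGEMM computing C = A·B, where A is an n×n sparse matrix and B is an n×k tall-and-skinny sparse matrix (k ≪ n). The dict-of-rows representation and the `spgemm` helper are illustrative assumptions for exposition only; the paper's distributed algorithm additionally applies 1D partitioning and sparsity-aware tiling, which are not shown here.

```python
# Serial row-wise SpGEMM sketch: each row of C is accumulated by scaling
# and merging the rows of B selected by the nonzeros in the matching row
# of A. Rows are stored as {column_index: value} dicts.

def spgemm(A_rows, B_rows):
    """A_rows, B_rows: lists of {col: val} dicts, one dict per row.
    Returns C = A @ B in the same dict-of-rows form."""
    C_rows = []
    for a_row in A_rows:
        acc = {}  # sparse accumulator for one output row
        for j, a_val in a_row.items():
            for col, b_val in B_rows[j].items():
                acc[col] = acc.get(col, 0) + a_val * b_val
        C_rows.append(acc)
    return C_rows

# Toy example: 3x3 sparse A times a 3x2 tall-and-skinny sparse B.
A = [{0: 1.0, 2: 2.0}, {1: 3.0}, {0: 4.0}]
B = [{0: 1.0}, {1: 5.0}, {0: 2.0, 1: 1.0}]
C = spgemm(A, B)  # C[0] == {0: 5.0, 1: 2.0}
```

Because B has only k columns, each accumulator stays small; a distributed TS-SpGEMM exploits exactly this structure to keep communication volume low.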


Published In

SC '24: Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis
November 2024
1758 pages
ISBN: 9798350352917

Publisher

IEEE Press



Qualifiers

  • Research-article
  • Research
  • Refereed limited

Acceptance Rates

Overall Acceptance Rate 1,516 of 6,373 submissions, 24%
