research-article

Partitioning Models for Scaling Parallel Sparse Matrix-Matrix Multiplication

Authors:

Oguz Selvitopi,

Cevdet AykanatAuthors Info & Claims

ACM Transactions on Parallel Computing (TOPC), Volume 4, Issue 3

Article No.: 13, Pages 1 - 34

https://doi.org/10.1145/3155292

Published: 03 January 2018 Publication History

Abstract

We investigate outer-product--parallel, inner-product--parallel, and row-by-row-product--parallel formulations of sparse matrix-matrix multiplication (SpGEMM) on distributed memory architectures. For each of these three formulations, we propose a hypergraph model and a bipartite graph model for distributing SpGEMM computations based on one-dimensional (1D) partitioning of input matrices. We also propose a communication hypergraph model for each formulation for distributing communication operations. The computational graph and hypergraph models adopted in the first phase aim at minimizing the total message volume and balancing the computational loads of processors, whereas the communication hypergraph models adopted in the second phase aim at minimizing the total message count and balancing the message volume loads of processors. That is, the computational partitioning models reduce the bandwidth cost and the communication hypergraph models reduce the latency cost. Our extensive parallel experiments on up to 2048 processors for a wide range of realistic SpGEMM instances show that although the outer-product--parallel formulation scales better, the row-by-row-product--parallel formulation is more viable due to its significantly lower partitioning overhead and competitive scalability. For computational partitioning models, our experimental findings indicate that the proposed bipartite graph models are attractive alternatives to their hypergraph counterparts because of their lower partitioning overhead. Finally, we show that by reducing the latency cost besides the bandwidth cost through using the communication hypergraph models, the parallel SpGEMM time can be further improved up to 32%.

References

[1]

Kadir Akbudak and Cevdet Aykanat. 2014. Simultaneous input and output matrix partitioning for outer-product--parallel sparse matrix-matrix multiplication. SIAM Journal on Scientific Computing 36, 5 (2014), C568--C590. arXiv:http://dx.doi.org/10.1137/13092589X

[2]

Kadir Akbudak and Cevdet Aykanat. 2017. Exploiting locality in sparse matrix-matrix multiplication on many-core architectures. IEEE Transactions on Parallel and Distributed Systems 28, 8 (2017), 2258--2271.

[3]

Cevdet Aykanat, B. Barla Cambazoglu, and Bora Uçar. 2008. Multi-level direct K-way hypergraph partitioning with multiple constraints and fixed vertices. Journal of Parallel and Distributed Computing 68, 5 (May 2008), 609--625.

Digital Library

[4]

Ariful Azad, Aydın Buluç, and John R. Gilbert. 2015. Parallel triangle counting and enumeration using matrix algebra. In Proceedings of the IPDPSW, Workshop on Graph Algorithm Building Blocks (GABB’15). 804--811.

Digital Library

[5]

Grey Ballard, Aydin Buluc, James Demmel, Laura Grigori, Benjamin Lipshitz, Oded Schwartz, and Sivan Toledo. 2013. Communication optimal parallel multiplication of sparse random matrices. In Proceedings of the 25th Annual ACM Symposium on Parallelism in Algorithms and Architectures (SPAA’13). ACM, New York, 222--231.

Digital Library

[6]

Grey Ballard, Alex Druinsky, Nicholas Knight, and Oded Schwartz. 2015. Brief announcement: Hypergraph partitioning for parallel sparse matrix-matrix multiplication. In Proceedings of the 27th ACM on Symposium on Parallelism in Algorithms and Architectures (SPAA’15). ACM, New York, 86--88.

Digital Library

[7]

Nathan Bell, Steven Dalton, and Luke N. Olson. 2012. Exposing fine-grained parallelism in algebraic multigrid methods. SIAM Journal on Scientific Computing 34, 4 (2012), C123--C152.

Digital Library

[8]

Rob H. Bisseling, Timothy M. Doup, and L. Daniel J. C. Loyens. 1993. A parallel interior point algorithm for linear programming on a network of transputers. Annals of Operations Research 43 (1993), 51--86.

[9]

Erik G. Boman, Ojas Parekh, and Cynthia Phillips. 2005. LDRD Final Report on Massively-Parallel Linear Programming: The parPCx System. Technical Report. SAND2004-6440, Sandia National Laboratories.

[10]

Urban Borštnik, Joost VandeVondele, Valéry Weber, and Jürg Hutter. 2014. Sparse matrix multiplication: The distributed block-compressed sparse row library. Parallel Computing 40, 5 (2014), 47--58.

[11]

William L. Briggs, Van Emden Henson, and Steve F. McCormick. 2000. A Multigrid Tutorial. 2nd ed. Siam, Philadelphia.

[12]

Aydın Buluç and John R. Gilbert. 2008. On the representation and multiplication of hypersparse matrices. In Proceedings of the IEEE International Symposium on Parallel and Distributed Processing (IPDPS’08). 1--11.

[13]

Aydin Buluç and John R. Gilbert. 2011. The combinatorial BLAS: Design, implementation, and applications. International Journal of High Performance Computing Applications 25, 4 (2011), 496--509.

Digital Library

[14]

Aydın Buluç and John R. Gilbert. 2012. Parallel sparse matrix-matrix multiplication and indexing: Implementation and experiments. SIAM Journal of Scientific Computing (SISC) 34, 4 (2012), 170--191.

[15]

Lynn Cannon. 1969. A Cellular Computer to Implement the Kalman Filter Algorithm. Ph.D. Dissertation. Montana State University, Bozeman, MN.

[16]

Ümit V. Çatalyürek and Cevdet Aykanat. 1999a. Hypergraph-partitioning based decomposition for parallel sparse-matrix vector multiplication. IEEE Transactions on Parallel Distributed Systems 10, 7 (1999), 673--693.

Digital Library

[17]

Ümit V. Çatalyürek and Cevdet Aykanat. 1999b. PaToH: A Multilevel Hypergraph Partitioning Tool, Version 3.0. Computer Engineering Department, Bilkent University, Ankara, Turkey.

[18]

Matt Challacombe. 1999. A simplified density matrix minimization for linear scaling self-consistent field theory. Journal of Chemical Physics 110, 5 (1999), 2332--2342.

[19]

Matt Challacombe. 2000. A general parallel sparse-blocked matrix multiply for linear scaling SCF theory. Computer Physics Communications 128, 12 (2000), 93--107.

[20]

CP2K. 2016. CP2K home page. Retrieved from http://www.cp2k.org/.

[21]

Steven Dalton, Nathan Bell, and Luke Olson. 2013. Optimizing Sparse Matrix-Matrix Multiplication for the GPU. Technical report. NVIDIA.

[22]

Andrew D. Daniels, John M. Millam, and Gustavo E. Scuseria. 1997. Semiempirical methods with conjugate gradient density matrix search to replace diagonalization for molecular systems containing thousands of atoms. Journal of Chemical Physics 107, 2 (1997), 425--431.

[23]

Timothy A. Davis. 2006. Direct Methods for Sparse Linear Systems. Society for Industrial and Applied Mathematics. .arXiv:http://epubs.siam.org/doi/pdf/10.1137/1.9780898718881

[24]

Timothy A. Davis and Yifan Hu. 2011. The university of florida sparse matrix collection. ACM Transactions on Mathematical Software 38, 1 (2011), 1.

Digital Library

[25]

James Demmel, David Eliahu, Armando Fox, Shoaib Kamil, Benjamin Lipshitz, Oded Schwartz, and Omer Spillinger. 2013. Communication-optimal parallel recursive rectangular matrix multiplication. In Proceedings of 27th International Parallel Distributed Processing Symposium. IEEE, 261--272.

Digital Library

[26]

Elizabeth D. Dolan and Jorge J. Moré. 2002. Benchmarking optimization software with performance profiles. Mathematical Programming 91, 2 (2002), 201--213.

[27]

Felix Gremse, Andreas Hofter, Lars Ole Schwen, Fabian Kiessling, and Uwe Naumann. 2015. GPU-accelerated sparse matrix-matrix multiplication by iterative row merging. SIAM Journal on Scientific Computing 37, 1 (2015), C54--C71.

Digital Library

[28]

Fred G. Gustavson. 1978. Two fast algorithms for sparse matrices: Multiplication and permuted transposition. ACM Transactions on Mathematical Software 4, 3 (1978), 250--269.

Digital Library

[29]

Václav Hapla, David Horák, and Michal Merta. 2013. Use of direct solvers in TFETI massively parallel implementation. In Applied Parallel and Scientific Computing, Pekka Manninen and Per Ster (Eds.). Springer, Berlin, 192--205.

Digital Library

[30]

Michael Heroux, Roscoe Bartlett, Vicki Howle, Robert Hoekstra, Jonathan Hu, Tamara Kolda, Richard Lehoucq, Kevin Long, Roger Pawlowski, Eric Phipps, Andrew Salinger, Heidi Thornquist, Ray Tuminaro, James Willenbring, and Alan Williams. 2003. An Overview of Trilinos. Technical Report SAND2003-2927. Sandia National Laboratories, Albuquerque, NM.

[31]

Satoshi Itoh, Pablo Ordejn, and Richard M. Martin. 1995. Order-N tight-binding molecular dynamics on parallel computers. Computer Physics Communications 88, 2--3 (1995), 173--185. 10.1016/0010-4655(95)00031-A

[32]

George Karypis, Anshul Gupta, and Vipin Kumar. 1994. A parallel formulation of interior point algorithms. In Supercomputing’94.

[33]

George Karypis and Vipin Kumar. 1998. A parallel algorithm for multilevel graph partitioning and sparse matrix ordering. Journal of Parallel and Distributed Computing 48, 1 (1998), 71--95.

Digital Library

[34]

George Karypis and Vipin Kumar. 1999. A fast and highly quality multilevel scheme for partitioning irregular graphs. SIAM Journal on Scientific Computing 20, 1 (1999), 359--392. Also available at http://www.cs.umn.edu/&sim;karypis. A short version appears in International Conference on Parallel Processing 1995.

Digital Library

[35]

Thomas Lengauer. 1990. Combinatorial Algorithms for Integrated Circuit Layout. Willey--Teubner, Chichester, UK.

[36]

X.-P. Li, R. W. Nunes, and David Vanderbilt. 1993. Density-matrix electronic-structure method with linear system-size scaling. Physical Review B 47, 16 (Apr. 1993), 10891--10894.

[37]

Greg Linden, Brent Smith, and Jeremy York. 2003. Amazon.com recommendations: Item-to-item collaborative filtering. IEEE Internet Computing 7, 1 (2003), 76--80.

Digital Library

[38]

Weifeng Liu and Brian Vinter. 2014. An efficient GPU general sparse matrix-matrix multiplication for irregular data. In Proceedings of the 2014 IEEE 28th International Parallel and Distributed Processing Symposium. IEEE, 370--381.

Digital Library

[39]

John M. Millam and Gustavo E. Scuseria. 1997. Linear scaling conjugate gradient density matrix search as an alternative to diagonalization for first principles electronic structure calculations. Journal of Chemical Physics 106, 13 (1997), 5569--5577.

[40]

Intel MKL. 2015. Math Kernel Library (MKL). Retrieved from http://software.intel.com/en-us/articles/intel-mkl/.

[41]

Kurtis L. Nusbaum. 2011. Optimizing Tpetra’s Sparse Matrix-Matrix Multiplication Routine. Technical Report. SAND2011-6036, Sandia National Laboratories.

[42]

Carlos Ordonez. 2010. Optimization of linear recursive queries in SQL. IEEE Transactions on Knowledge and Data Engineering 22, 2 (2010), 264--277.

Digital Library

[43]

Carlos Ordonez, Yiqun Zhang, and Wellington Cabrera. 2016. The gamma matrix to summarize dense and sparse data sets for big data analytics. IEEE Transactions on Knowledge and Data Engineering 28, 7 (2016), 1905--1918.

[44]

Md Mostofa Ali Patwary, Nadathur Rajagopalan Satish, Narayanan Sundaram, Jongsoo Park, Michael J. Anderson, Satya Gautam Vadlamudi, Dipankar Das, Sergey G. Pudov, Vadim O. Pirogov, and Pradeep Dubey. 2015. Parallel efficient sparse matrix-matrix multiplication on multicore platforms. In High Performance Computing. Springer, 48--57.

[45]

H. Bernhard Schlegel, John M. Millam, Srinivasan S. Iyengar, Gregory A. Voth, Andrew D. Daniels, Gustavo E. Scuseria, and Michael J. Frisch. 2001. Ab initio molecular dynamics: Propagating the density matrix with Gaussian orbitals. Journal of Chemical Physics 114, 22 (2001), 9758--9763.

[46]

Oguz Selvitopi and Cevdet Aykanat. 2016. Reducing latency cost in 2D sparse matrix partitioning models. Parallel Computing 57, Supplement C (2016), 1--24.

Digital Library

[47]

Edgar Solomonik, Abhinav Bhatele, and James Demmel. 2011. Improving communication performance in dense linear algebra via topology aware collectives. In Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC’11). ACM, New York, Article 77, 11 pages.

Digital Library

[48]

Total-FETI. 2016. Total-FETI Massively Parallel Implementation Research Group. Retrieved from http://spomech.vsb.cz/feti/.

[49]

Bora Uçar and Cevdet Aykanat. 2004. Encapsulating multiple communication-cost metrics in partitioning sparse rectangular matrices for parallel matrix-vector multiplies. SIAM Journal on Scientific Computing 26, 6 (2004), 1837--1859.

Digital Library

[50]

Robert A. van de Geijn and Jerrell Watts. 1997. SUMMA: Scalable universal matrix multiplication algorithm. Concurrency - Practice and Experience 9, 4 (1997), 255--274.

[51]

Joost VandeVondele, Urban Borstnik, and Jurg Hutter. 2012. Linear scaling self-consistent field calculations with millions of atoms in the condensed phase. Journal of Chemical Theory and Computation 8, 10 (2012), 3565--3573.

Cited By

Xiao GZhou TChen YHu YLi K(2024)Machine Learning-Based Kernel Selector for SpMV Optimization in Graph AnalysisACM Transactions on Parallel Computing10.1145/365257911:2(1-25)Online publication date: 8-Jun-2024
https://dl.acm.org/doi/10.1145/3652579
Hong YBuluç A(2024)A Sparsity-Aware Distributed-Memory Algorithm for Sparse-Sparse Matrix MultiplicationProceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis10.1109/SC41406.2024.00053(1-14)Online publication date: 17-Nov-2024
https://dl.acm.org/doi/10.1109/SC41406.2024.00053
Ranawaka IHussain MBlock CGerogiannis GTorrellas JAzad A(2024)Distributed-Memory Parallel Algorithms for Sparse Matrix and Sparse Tall-and-Skinny Matrix MultiplicationProceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis10.1109/SC41406.2024.00052(1-17)Online publication date: 17-Nov-2024
https://dl.acm.org/doi/10.1109/SC41406.2024.00052
Show More Cited By

Index Terms

Partitioning Models for Scaling Parallel Sparse Matrix-Matrix Multiplication

Recommendations

Adaptive sparse tiling for sparse matrix multiplication
PPoPP '19: Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming

Tiling is a key technique for data locality optimization and is widely used in high-performance implementations of dense matrix-matrix multiplication for multicore/manycore CPUs and GPUs. However, the irregular and matrix-dependent data access pattern ...
Hypergraph Partitioning for Sparse Matrix-Matrix Multiplication

We propose a fine-grained hypergraph model for sparse matrix-matrix multiplication (SpGEMM), a key computational kernel in scientific computing and data analysis whose performance is often communication bound. This model correctly describes both the ...
Scaling sparse matrix-matrix multiplication in the accumulo database
Abstract
We propose and implement a sparse matrix-matrix multiplication (SpGEMM) algorithm running on top of Accumulo’s iterator framework which enables high performance distributed parallelism. The proposed algorithm provides write-locality while ...

Comments

Information & Contributors

Information

Published In

cover image ACM Transactions on Parallel Computing

ACM Transactions on Parallel Computing Volume 4, Issue 3

September 2017

120 pages

ISSN:2329-4949

EISSN:2329-4957

DOI:10.1145/3175004

Editor:
Phillip B. Gibbons
Carnegie Mellon University, Pittsburgh, USA

Issue’s Table of Contents

Copyright © 2018 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 03 January 2018

Accepted: 01 August 2017

Revised: 01 February 2017

Received: 01 February 2016

Published in TOPC Volume 4, Issue 3

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article
Research
Refereed

Funding Sources

the Scientific and Technological Research Council of Turkey (TUBITAK)
European Cooperation in Science and Technology

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

20
Total Citations
View Citations
556
Total Downloads

Downloads (Last 12 months)54
Downloads (Last 6 weeks)7

Reflects downloads up to 01 Jan 2025

Other Metrics

View Author Metrics

Citations

Cited By

Xiao GZhou TChen YHu YLi K(2024)Machine Learning-Based Kernel Selector for SpMV Optimization in Graph AnalysisACM Transactions on Parallel Computing10.1145/365257911:2(1-25)Online publication date: 8-Jun-2024
https://dl.acm.org/doi/10.1145/3652579
Hong YBuluç A(2024)A Sparsity-Aware Distributed-Memory Algorithm for Sparse-Sparse Matrix MultiplicationProceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis10.1109/SC41406.2024.00053(1-14)Online publication date: 17-Nov-2024
https://dl.acm.org/doi/10.1109/SC41406.2024.00053
Ranawaka IHussain MBlock CGerogiannis GTorrellas JAzad A(2024)Distributed-Memory Parallel Algorithms for Sparse Matrix and Sparse Tall-and-Skinny Matrix MultiplicationProceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis10.1109/SC41406.2024.00052(1-17)Online publication date: 17-Nov-2024
https://dl.acm.org/doi/10.1109/SC41406.2024.00052
Kim JJang MNam HKim S(2023)HARP: Hardware-Based Pseudo-Tiling for Sparse Matrix Multiplication AcceleratorProceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture10.1145/3613424.3623790(1148-1162)Online publication date: 28-Oct-2023
https://dl.acm.org/doi/10.1145/3613424.3623790
Le Fèvre VCasas MButt AMi NChard K(2023)Efficient Execution of SpGEMM on Long Vector ArchitecturesProceedings of the 32nd International Symposium on High-Performance Parallel and Distributed Computing10.1145/3588195.3593000(101-113)Online publication date: 7-Aug-2023
https://dl.acm.org/doi/10.1145/3588195.3593000
Gao JJi WChang FHan SWei BLiu ZWang Y(2023)A Systematic Survey of General Sparse Matrix-matrix MultiplicationACM Computing Surveys10.1145/357115755:12(1-36)Online publication date: 2-Mar-2023
https://dl.acm.org/doi/10.1145/3571157
He XChen KFeng SKim HBlaauw DDreslinski RMudge TKloeckner AMoreira J(2022)Squaring the circleProceedings of the International Conference on Parallel Architectures and Compilation Techniques10.1145/3559009.3569665(148-159)Online publication date: 8-Oct-2022
https://dl.acm.org/doi/10.1145/3559009.3569665
Malkin NWagner DEgelman S(2022)Can Humans Detect Malicious Always-Listening Assistants? A Framework for Crowdsourcing Test DrivesProceedings of the ACM on Human-Computer Interaction10.1145/35556136:CSCW2(1-28)Online publication date: 11-Nov-2022
https://dl.acm.org/doi/10.1145/3555613
Sharma NColucci-Gray Lvan der Wal RSiddharthan A(2022)Consensus Building in On-Line Citizen ScienceProceedings of the ACM on Human-Computer Interaction10.1145/35555356:CSCW2(1-26)Online publication date: 11-Nov-2022
https://dl.acm.org/doi/10.1145/3555535
Tran ABoone ALe Dantec CDiSalvo C(2022)Careful Data TinkeringProceedings of the ACM on Human-Computer Interaction10.1145/35555326:CSCW2(1-29)Online publication date: 11-Nov-2022
https://dl.acm.org/doi/10.1145/3555532
Show More Cited By

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Media

Figures

Other

Tables

View Issue’s Table of Contents