Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
skip to main content
research-article

Partitioning Models for Scaling Parallel Sparse Matrix-Matrix Multiplication

Published: 03 January 2018 Publication History

Abstract

We investigate outer-product--parallel, inner-product--parallel, and row-by-row-product--parallel formulations of sparse matrix-matrix multiplication (SpGEMM) on distributed memory architectures. For each of these three formulations, we propose a hypergraph model and a bipartite graph model for distributing SpGEMM computations based on one-dimensional (1D) partitioning of input matrices. We also propose a communication hypergraph model for each formulation for distributing communication operations. The computational graph and hypergraph models adopted in the first phase aim at minimizing the total message volume and balancing the computational loads of processors, whereas the communication hypergraph models adopted in the second phase aim at minimizing the total message count and balancing the message volume loads of processors. That is, the computational partitioning models reduce the bandwidth cost and the communication hypergraph models reduce the latency cost. Our extensive parallel experiments on up to 2048 processors for a wide range of realistic SpGEMM instances show that although the outer-product--parallel formulation scales better, the row-by-row-product--parallel formulation is more viable due to its significantly lower partitioning overhead and competitive scalability. For computational partitioning models, our experimental findings indicate that the proposed bipartite graph models are attractive alternatives to their hypergraph counterparts because of their lower partitioning overhead. Finally, we show that by reducing the latency cost besides the bandwidth cost through using the communication hypergraph models, the parallel SpGEMM time can be further improved up to 32%.

References

[1]
Kadir Akbudak and Cevdet Aykanat. 2014. Simultaneous input and output matrix partitioning for outer-product--parallel sparse matrix-matrix multiplication. SIAM Journal on Scientific Computing 36, 5 (2014), C568--C590. arXiv:http://dx.doi.org/10.1137/13092589X
[2]
Kadir Akbudak and Cevdet Aykanat. 2017. Exploiting locality in sparse matrix-matrix multiplication on many-core architectures. IEEE Transactions on Parallel and Distributed Systems 28, 8 (2017), 2258--2271.
[3]
Cevdet Aykanat, B. Barla Cambazoglu, and Bora Uçar. 2008. Multi-level direct K-way hypergraph partitioning with multiple constraints and fixed vertices. Journal of Parallel and Distributed Computing 68, 5 (May 2008), 609--625.
[4]
Ariful Azad, Aydın Buluç, and John R. Gilbert. 2015. Parallel triangle counting and enumeration using matrix algebra. In Proceedings of the IPDPSW, Workshop on Graph Algorithm Building Blocks (GABB’15). 804--811.
[5]
Grey Ballard, Aydin Buluc, James Demmel, Laura Grigori, Benjamin Lipshitz, Oded Schwartz, and Sivan Toledo. 2013. Communication optimal parallel multiplication of sparse random matrices. In Proceedings of the 25th Annual ACM Symposium on Parallelism in Algorithms and Architectures (SPAA’13). ACM, New York, 222--231.
[6]
Grey Ballard, Alex Druinsky, Nicholas Knight, and Oded Schwartz. 2015. Brief announcement: Hypergraph partitioning for parallel sparse matrix-matrix multiplication. In Proceedings of the 27th ACM on Symposium on Parallelism in Algorithms and Architectures (SPAA’15). ACM, New York, 86--88.
[7]
Nathan Bell, Steven Dalton, and Luke N. Olson. 2012. Exposing fine-grained parallelism in algebraic multigrid methods. SIAM Journal on Scientific Computing 34, 4 (2012), C123--C152.
[8]
Rob H. Bisseling, Timothy M. Doup, and L. Daniel J. C. Loyens. 1993. A parallel interior point algorithm for linear programming on a network of transputers. Annals of Operations Research 43 (1993), 51--86.
[9]
Erik G. Boman, Ojas Parekh, and Cynthia Phillips. 2005. LDRD Final Report on Massively-Parallel Linear Programming: The parPCx System. Technical Report. SAND2004-6440, Sandia National Laboratories.
[10]
Urban Borštnik, Joost VandeVondele, Valéry Weber, and Jürg Hutter. 2014. Sparse matrix multiplication: The distributed block-compressed sparse row library. Parallel Computing 40, 5 (2014), 47--58.
[11]
William L. Briggs, Van Emden Henson, and Steve F. McCormick. 2000. A Multigrid Tutorial. 2nd ed. Siam, Philadelphia.
[12]
Aydın Buluç and John R. Gilbert. 2008. On the representation and multiplication of hypersparse matrices. In Proceedings of the IEEE International Symposium on Parallel and Distributed Processing (IPDPS’08). 1--11.
[13]
Aydin Buluç and John R. Gilbert. 2011. The combinatorial BLAS: Design, implementation, and applications. International Journal of High Performance Computing Applications 25, 4 (2011), 496--509.
[14]
Aydın Buluç and John R. Gilbert. 2012. Parallel sparse matrix-matrix multiplication and indexing: Implementation and experiments. SIAM Journal of Scientific Computing (SISC) 34, 4 (2012), 170--191.
[15]
Lynn Cannon. 1969. A Cellular Computer to Implement the Kalman Filter Algorithm. Ph.D. Dissertation. Montana State University, Bozeman, MN.
[16]
Ümit V. Çatalyürek and Cevdet Aykanat. 1999a. Hypergraph-partitioning based decomposition for parallel sparse-matrix vector multiplication. IEEE Transactions on Parallel Distributed Systems 10, 7 (1999), 673--693.
[17]
Ümit V. Çatalyürek and Cevdet Aykanat. 1999b. PaToH: A Multilevel Hypergraph Partitioning Tool, Version 3.0. Computer Engineering Department, Bilkent University, Ankara, Turkey.
[18]
Matt Challacombe. 1999. A simplified density matrix minimization for linear scaling self-consistent field theory. Journal of Chemical Physics 110, 5 (1999), 2332--2342.
[19]
Matt Challacombe. 2000. A general parallel sparse-blocked matrix multiply for linear scaling SCF theory. Computer Physics Communications 128, 12 (2000), 93--107.
[20]
CP2K. 2016. CP2K home page. Retrieved from http://www.cp2k.org/.
[21]
Steven Dalton, Nathan Bell, and Luke Olson. 2013. Optimizing Sparse Matrix-Matrix Multiplication for the GPU. Technical report. NVIDIA.
[22]
Andrew D. Daniels, John M. Millam, and Gustavo E. Scuseria. 1997. Semiempirical methods with conjugate gradient density matrix search to replace diagonalization for molecular systems containing thousands of atoms. Journal of Chemical Physics 107, 2 (1997), 425--431.
[23]
Timothy A. Davis. 2006. Direct Methods for Sparse Linear Systems. Society for Industrial and Applied Mathematics. .arXiv:http://epubs.siam.org/doi/pdf/10.1137/1.9780898718881
[24]
Timothy A. Davis and Yifan Hu. 2011. The university of florida sparse matrix collection. ACM Transactions on Mathematical Software 38, 1 (2011), 1.
[25]
James Demmel, David Eliahu, Armando Fox, Shoaib Kamil, Benjamin Lipshitz, Oded Schwartz, and Omer Spillinger. 2013. Communication-optimal parallel recursive rectangular matrix multiplication. In Proceedings of 27th International Parallel Distributed Processing Symposium. IEEE, 261--272.
[26]
Elizabeth D. Dolan and Jorge J. Moré. 2002. Benchmarking optimization software with performance profiles. Mathematical Programming 91, 2 (2002), 201--213.
[27]
Felix Gremse, Andreas Hofter, Lars Ole Schwen, Fabian Kiessling, and Uwe Naumann. 2015. GPU-accelerated sparse matrix-matrix multiplication by iterative row merging. SIAM Journal on Scientific Computing 37, 1 (2015), C54--C71.
[28]
Fred G. Gustavson. 1978. Two fast algorithms for sparse matrices: Multiplication and permuted transposition. ACM Transactions on Mathematical Software 4, 3 (1978), 250--269.
[29]
Václav Hapla, David Horák, and Michal Merta. 2013. Use of direct solvers in TFETI massively parallel implementation. In Applied Parallel and Scientific Computing, Pekka Manninen and Per Ster (Eds.). Springer, Berlin, 192--205.
[30]
Michael Heroux, Roscoe Bartlett, Vicki Howle, Robert Hoekstra, Jonathan Hu, Tamara Kolda, Richard Lehoucq, Kevin Long, Roger Pawlowski, Eric Phipps, Andrew Salinger, Heidi Thornquist, Ray Tuminaro, James Willenbring, and Alan Williams. 2003. An Overview of Trilinos. Technical Report SAND2003-2927. Sandia National Laboratories, Albuquerque, NM.
[31]
Satoshi Itoh, Pablo Ordejn, and Richard M. Martin. 1995. Order-N tight-binding molecular dynamics on parallel computers. Computer Physics Communications 88, 2--3 (1995), 173--185. 10.1016/0010-4655(95)00031-A
[32]
George Karypis, Anshul Gupta, and Vipin Kumar. 1994. A parallel formulation of interior point algorithms. In Supercomputing’94.
[33]
George Karypis and Vipin Kumar. 1998. A parallel algorithm for multilevel graph partitioning and sparse matrix ordering. Journal of Parallel and Distributed Computing 48, 1 (1998), 71--95.
[34]
George Karypis and Vipin Kumar. 1999. A fast and highly quality multilevel scheme for partitioning irregular graphs. SIAM Journal on Scientific Computing 20, 1 (1999), 359--392. Also available at http://www.cs.umn.edu/∼karypis. A short version appears in International Conference on Parallel Processing 1995.
[35]
Thomas Lengauer. 1990. Combinatorial Algorithms for Integrated Circuit Layout. Willey--Teubner, Chichester, UK.
[36]
X.-P. Li, R. W. Nunes, and David Vanderbilt. 1993. Density-matrix electronic-structure method with linear system-size scaling. Physical Review B 47, 16 (Apr. 1993), 10891--10894.
[37]
Greg Linden, Brent Smith, and Jeremy York. 2003. Amazon.com recommendations: Item-to-item collaborative filtering. IEEE Internet Computing 7, 1 (2003), 76--80.
[38]
Weifeng Liu and Brian Vinter. 2014. An efficient GPU general sparse matrix-matrix multiplication for irregular data. In Proceedings of the 2014 IEEE 28th International Parallel and Distributed Processing Symposium. IEEE, 370--381.
[39]
John M. Millam and Gustavo E. Scuseria. 1997. Linear scaling conjugate gradient density matrix search as an alternative to diagonalization for first principles electronic structure calculations. Journal of Chemical Physics 106, 13 (1997), 5569--5577.
[40]
Intel MKL. 2015. Math Kernel Library (MKL). Retrieved from http://software.intel.com/en-us/articles/intel-mkl/.
[41]
Kurtis L. Nusbaum. 2011. Optimizing Tpetra’s Sparse Matrix-Matrix Multiplication Routine. Technical Report. SAND2011-6036, Sandia National Laboratories.
[42]
Carlos Ordonez. 2010. Optimization of linear recursive queries in SQL. IEEE Transactions on Knowledge and Data Engineering 22, 2 (2010), 264--277.
[43]
Carlos Ordonez, Yiqun Zhang, and Wellington Cabrera. 2016. The gamma matrix to summarize dense and sparse data sets for big data analytics. IEEE Transactions on Knowledge and Data Engineering 28, 7 (2016), 1905--1918.
[44]
Md Mostofa Ali Patwary, Nadathur Rajagopalan Satish, Narayanan Sundaram, Jongsoo Park, Michael J. Anderson, Satya Gautam Vadlamudi, Dipankar Das, Sergey G. Pudov, Vadim O. Pirogov, and Pradeep Dubey. 2015. Parallel efficient sparse matrix-matrix multiplication on multicore platforms. In High Performance Computing. Springer, 48--57.
[45]
H. Bernhard Schlegel, John M. Millam, Srinivasan S. Iyengar, Gregory A. Voth, Andrew D. Daniels, Gustavo E. Scuseria, and Michael J. Frisch. 2001. Ab initio molecular dynamics: Propagating the density matrix with Gaussian orbitals. Journal of Chemical Physics 114, 22 (2001), 9758--9763.
[46]
Oguz Selvitopi and Cevdet Aykanat. 2016. Reducing latency cost in 2D sparse matrix partitioning models. Parallel Computing 57, Supplement C (2016), 1--24.
[47]
Edgar Solomonik, Abhinav Bhatele, and James Demmel. 2011. Improving communication performance in dense linear algebra via topology aware collectives. In Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC’11). ACM, New York, Article 77, 11 pages.
[48]
Total-FETI. 2016. Total-FETI Massively Parallel Implementation Research Group. Retrieved from http://spomech.vsb.cz/feti/.
[49]
Bora Uçar and Cevdet Aykanat. 2004. Encapsulating multiple communication-cost metrics in partitioning sparse rectangular matrices for parallel matrix-vector multiplies. SIAM Journal on Scientific Computing 26, 6 (2004), 1837--1859.
[50]
Robert A. van de Geijn and Jerrell Watts. 1997. SUMMA: Scalable universal matrix multiplication algorithm. Concurrency - Practice and Experience 9, 4 (1997), 255--274.
[51]
Joost VandeVondele, Urban Borstnik, and Jurg Hutter. 2012. Linear scaling self-consistent field calculations with millions of atoms in the condensed phase. Journal of Chemical Theory and Computation 8, 10 (2012), 3565--3573.

Cited By

View all
  • (2024)Machine Learning-Based Kernel Selector for SpMV Optimization in Graph AnalysisACM Transactions on Parallel Computing10.1145/365257911:2(1-25)Online publication date: 8-Jun-2024
  • (2024)A Sparsity-Aware Distributed-Memory Algorithm for Sparse-Sparse Matrix MultiplicationProceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis10.1109/SC41406.2024.00053(1-14)Online publication date: 17-Nov-2024
  • (2024)Distributed-Memory Parallel Algorithms for Sparse Matrix and Sparse Tall-and-Skinny Matrix MultiplicationProceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis10.1109/SC41406.2024.00052(1-17)Online publication date: 17-Nov-2024
  • Show More Cited By

Recommendations

Comments

Information & Contributors

Information

Published In

cover image ACM Transactions on Parallel Computing
ACM Transactions on Parallel Computing  Volume 4, Issue 3
September 2017
120 pages
ISSN:2329-4949
EISSN:2329-4957
DOI:10.1145/3175004
Issue’s Table of Contents
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 03 January 2018
Accepted: 01 August 2017
Revised: 01 February 2017
Received: 01 February 2016
Published in TOPC Volume 4, Issue 3

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. SpGEMM
  2. Sparse matrix-matrix multiplication
  3. bandwidth
  4. communication cost
  5. graph partitioning
  6. hypergraph partitioning
  7. latency

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)54
  • Downloads (Last 6 weeks)7
Reflects downloads up to 01 Jan 2025

Other Metrics

Citations

Cited By

View all
  • (2024)Machine Learning-Based Kernel Selector for SpMV Optimization in Graph AnalysisACM Transactions on Parallel Computing10.1145/365257911:2(1-25)Online publication date: 8-Jun-2024
  • (2024)A Sparsity-Aware Distributed-Memory Algorithm for Sparse-Sparse Matrix MultiplicationProceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis10.1109/SC41406.2024.00053(1-14)Online publication date: 17-Nov-2024
  • (2024)Distributed-Memory Parallel Algorithms for Sparse Matrix and Sparse Tall-and-Skinny Matrix MultiplicationProceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis10.1109/SC41406.2024.00052(1-17)Online publication date: 17-Nov-2024
  • (2023)HARP: Hardware-Based Pseudo-Tiling for Sparse Matrix Multiplication AcceleratorProceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture10.1145/3613424.3623790(1148-1162)Online publication date: 28-Oct-2023
  • (2023)Efficient Execution of SpGEMM on Long Vector ArchitecturesProceedings of the 32nd International Symposium on High-Performance Parallel and Distributed Computing10.1145/3588195.3593000(101-113)Online publication date: 7-Aug-2023
  • (2023)A Systematic Survey of General Sparse Matrix-matrix MultiplicationACM Computing Surveys10.1145/357115755:12(1-36)Online publication date: 2-Mar-2023
  • (2022)Squaring the circleProceedings of the International Conference on Parallel Architectures and Compilation Techniques10.1145/3559009.3569665(148-159)Online publication date: 8-Oct-2022
  • (2022)Can Humans Detect Malicious Always-Listening Assistants? A Framework for Crowdsourcing Test DrivesProceedings of the ACM on Human-Computer Interaction10.1145/35556136:CSCW2(1-28)Online publication date: 11-Nov-2022
  • (2022)Consensus Building in On-Line Citizen ScienceProceedings of the ACM on Human-Computer Interaction10.1145/35555356:CSCW2(1-26)Online publication date: 11-Nov-2022
  • (2022)Careful Data TinkeringProceedings of the ACM on Human-Computer Interaction10.1145/35555326:CSCW2(1-29)Online publication date: 11-Nov-2022
  • Show More Cited By

View Options

Login options

Full Access

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media