Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
skip to main content
SUMMA: Scalable Universal Matrix Multiplication AlgorithmApril 1995
1995 Technical Report
Publisher:
  • University of Texas at Austin
  • Computer Science Dept. Taylor Hall 2.124 Austin, TX
  • United States
Published:01 April 1995
Reflects downloads up to 25 Jan 2025Bibliometrics
Skip Abstract Section
Abstract

In this paper, we give a straight forward, highly efficient, scalable implementation of common matrix multiplication operations. The algorithms are much simpler than previously published methods, yield better performance, and require less work space. MPI implementations are given, as are performance results on the Intel Paragon system.

Cited By

  1. Okanovic P, Kwasniewski G, Labini P, Besta M, Vella F and Hoefler T High Performance Unstructured SpMM Computation Using Tensor Cores Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, (1-14)
  2. Hong Y and Buluç A A Sparsity-Aware Distributed-Memory Algorithm for Sparse-Sparse Matrix Multiplication Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, (1-14)
  3. Akbudak K (2024). Hypergraph-based locality-enhancing methods for graph operations in Big Data applications, International Journal of High Performance Computing Applications, 38:3, (210-224), Online publication date: 1-May-2024.
  4. ACM
    Munch P, Kormann K and Kronbichler M (2021). hyper.deal: An Efficient, Matrix-free Finite-element Library for High-dimensional Partial Differential Equations, ACM Transactions on Mathematical Software, 47:4, (1-34), Online publication date: 31-Dec-2022.
  5. Liu Y, Shi L, Zhang J and Robertazzi T Layer based partition for matrix multiplication on heterogeneous mesh networks Proceedings of the High Performance Computing Symposium, (1-12)
  6. ACM
    Wang M, Huang C and Li J Supporting Very Large Models using Automatic Dataflow Graph Partitioning Proceedings of the Fourteenth EuroSys Conference 2019, (1-17)
  7. Huang J, Smith T, Henry G and van de Geijn R Strassen's algorithm reloaded Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, (1-12)
  8. Rico-Gallego J, Díaz-Martín J and Lastovetsky A (2016). Extending τ -Lop to model concurrent MPI communications in multicore clusters, Future Generation Computer Systems, 61:C, (66-82), Online publication date: 1-Aug-2016.
  9. Kelefouras V, Kritikakou A, Mporas I and Kolonias V (2016). A high-performance matrix---matrix multiplication methodology for CPU and GPU architectures, The Journal of Supercomputing, 72:3, (804-844), Online publication date: 1-Mar-2016.
  10. ACM
    Pellauer M, Parashar A, Adler M, Ahsan B, Allmon R, Crago N, Fleming K, Gambhir M, Jaleel A, Krishna T, Lustig D, Maresh S, Pavlov V, Rayess R, Zhai A and Emer J (2015). Efficient Control and Communication Paradigms for Coarse-Grained Spatial Architectures, ACM Transactions on Computer Systems, 33:3, (1-32), Online publication date: 11-Sep-2015.
  11. ACM
    Bae S, Choi J, Qiu J and Fox G Dimension reduction and visualization of large high-dimensional data via interpolation Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing, (203-214)
  12. ACM
    Nishtala R, Almasi G and Cascaval C Performance without pain = productivity Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming, (99-110)
  13. ACM
    Krishnan M and Nieplocha J Memory efficient parallel matrix multiplication operation for irregular problems Proceedings of the 3rd conference on Computing frontiers, (229-240)
  14. Zekri A and Sedukhin S The general matrix multiply-add operation on 2D torus Proceedings of the 20th international conference on Parallel and distributed processing, (309-309)
  15. ACM
    Bikshandi G, Guo J, Hoeflinger D, Almasi G, Fraguela B, Garzarán M, Padua D and von Praun C Programming for parallelism and locality with hierarchically tiled arrays Proceedings of the eleventh ACM SIGPLAN symposium on Principles and practice of parallel programming, (48-57)
  16. Zekri A and Sedukhin S Computationally efficient parallel matrix-matrix multiplication on the torus Proceedings of the 6th international symposium on high-performance computing and 1st international conference on Advanced low power systems, (219-226)
  17. El-Qawasmeh E, Al-Ayyoub A and Abu-Ghazaleh N (2004). Quick Matrix Multiplication on Clusters of Workstations, Informatica, 15:2, (203-218), Online publication date: 1-Apr-2004.
  18. Wang C and Sahni S (2001). Matrix Multiplication on the OTIS-Mesh Optoelectronic Computer, IEEE Transactions on Computers, 50:7, (635-646), Online publication date: 1-Jul-2001.
  19. Chen Y (2000). Automated Tuning of Parallel I/O Systems, IEEE Transactions on Software Engineering, 26:4, (362-383), Online publication date: 1-Apr-2000.
  20. Petitet A and Dongarra J (1999). Algorithmic Redistribution Methods for Block-Cyclic Decompositions, IEEE Transactions on Parallel and Distributed Systems, 10:12, (1201-1216), Online publication date: 1-Dec-1999.
  21. Choi J A New Parallel Matrix Multiplication Algorithm on Distributed-Memory Concurrent Computers Proceedings of the High-Performance Computing on the Information Superhighway, HPC-Asia '97
  22. Agarwal R, Balle S, Gustavson F, Joshi M and Palkar P (1995). A three-dimensional approach to parallel matrix multiplication, IBM Journal of Research and Development, 39:5, (575-582), Online publication date: 1-Sep-1995.
Contributors
  • The University of Texas at Austin
  • California Institute of Technology

Recommendations