DOI: 10.1145/1128022.1128054
Article

Memory efficient parallel matrix multiplication operation for irregular problems

Published: 03 May 2006
Abstract

    Regular distributions for storing dense matrices on parallel systems are not always used in practice. In many scientific applications matrices are distributed irregularly; we extended a parallel matrix multiplication algorithm (SRUMMA) [1] to handle irregularly distributed matrices. Our approach relies on a distribution-independent algorithm that provides dynamic load balancing by exploiting data locality, and it achieves performance as good as the traditional approach, which handles the irregular case through temporary arrays with a regular distribution, data redistribution, and matrix multiplication for regular matrices. The proposed algorithm is memory-efficient because temporary matrices are not needed. This feature is critical for systems like the IBM Blue Gene/L that offer a very limited amount of memory per node. The experimental results demonstrate very good performance across the range of matrix distributions and problem sizes motivated by real applications.
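    The abstract's key ideas (blocked multiplication over irregularly cut matrices, a task list over output blocks, and no temporary redistributed copies) can be illustrated with a minimal serial sketch. This is not the authors' implementation: the cut-point representation, the `blocks` helper, and the explicit task list are illustrative assumptions, and in the real algorithm idle processes claim tasks dynamically and fetch remote blocks with one-sided get operations rather than indexing a local array.

    ```python
    # Hedged sketch (not the paper's code): a distribution-independent
    # blocked matrix multiply. C is partitioned by irregular cut points;
    # each "task" computes one C block by reading only the A and B blocks
    # it needs, so no temporary regularly-distributed copies are created.

    def matmul(A, B):
        # Plain triple-loop reference for checking the blocked version.
        n, k, m = len(A), len(A[0]), len(B[0])
        C = [[0.0] * m for _ in range(n)]
        for i in range(n):
            for j in range(m):
                C[i][j] = sum(A[i][t] * B[t][j] for t in range(k))
        return C

    def blocks(cuts):
        # Turn cut points [0, 2, 3] into half-open block ranges [(0,2), (2,3)].
        # Uneven cuts mimic an irregularly distributed matrix.
        return list(zip(cuts[:-1], cuts[1:]))

    def blocked_matmul(A, B, row_cuts, inner_cuts, col_cuts):
        n, m = len(A), len(B[0])
        C = [[0.0] * m for _ in range(n)]
        # One task per (row-block, col-block) of C. In the parallel setting,
        # processes would claim tasks from a shared counter (dynamic load
        # balancing), preferring tasks whose A/B blocks are local.
        tasks = [(rb, cb) for rb in blocks(row_cuts) for cb in blocks(col_cuts)]
        for (r0, r1), (c0, c1) in tasks:
            for (t0, t1) in blocks(inner_cuts):
                # "Get" the needed A and B blocks, multiply, accumulate into C.
                for i in range(r0, r1):
                    for j in range(c0, c1):
                        C[i][j] += sum(A[i][t] * B[t][j] for t in range(t0, t1))
        return C
    ```

    Because every task touches a disjoint block of C, the tasks are independent and can run in any order, which is what makes fully dynamic load balancing safe.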

    References

    [1] M. Krishnan, J. Nieplocha, "SRUMMA: A Matrix Multiplication Algorithm Suitable for Clusters and Scalable Shared Memory Systems", Proc. IPDPS'04, 2004.
    [2] L. E. Cannon, "A cellular computer to implement the Kalman Filter Algorithm", Ph.D. dissertation, Montana State University, 1969.
    [3] G. C. Fox, S. W. Otto, A. J. G. Hey, "Matrix algorithms on a hypercube I: Matrix multiplication", Parallel Computing, vol. 4, pp. 17--31, 1987.
    [4] G. C. Fox, M. Johnson, G. Lyzenga, S. Otto, J. Salmon, and D. Walker, Solving Problems on Concurrent Processors, vol. 1, Prentice Hall, 1988.
    [5] G. H. Golub, C. H. Van Loan, Matrix Computations, Johns Hopkins University Press, 1989.
    [6] J. Berntsen, "Communication efficient matrix multiplication on hypercubes", Parallel Computing, vol. 12, 1989.
    [7] A. Gupta and V. Kumar, "Scalability of Parallel Algorithms for Matrix Multiplication", ICPP'93, 1993.
    [8] C. Lin and L. Snyder, "A matrix product algorithm and its comparative performance on hypercubes", SHPCC'92, 1992.
    [9] Q. Luo and J. B. Drake, "A Scalable Parallel Strassen's Matrix Multiply Algorithm for Distributed Memory Computers", http://citeseer.nj.nec.com/517382.html
    [10] S. Huss-Lederman, E. M. Jacobson, and A. Tsao, "Comparison of Scalable Parallel Matrix Multiplication Libraries", in Proceedings of the Scalable Parallel Libraries Conference, 1994.
    [11] C. T. Ho, S. L. Johnsson, A. Edelman, "Matrix multiplication on hypercubes using full bandwidth and constant storage", Proc. 6th Distributed Memory Computing Conference, 1991.
    [12] H. Gupta and P. Sadayappan, "Communication Efficient Matrix Multiplication on Hypercubes", in Proceedings of the Sixth ACM Symposium on Parallel Algorithms and Architectures, 1994.
    [13] J. Li, A. Skjellum, and R. D. Falgout, "A Poly-Algorithm for Parallel Dense Matrix Multiplication on Two-Dimensional Process Grid Topologies", Concurrency: Practice and Experience, vol. 9(5), pp. 345--389, 1997.
    [14] E. Dekel, D. Nassimi, S. Sahni, "Parallel matrix and graph algorithms", SIAM J. Computing, vol. 10, 1981.
    [15] S. Ranka, S. Sahni, Hypercube Algorithms for Image Processing and Pattern Recognition, Springer-Verlag, 1990.
    [16] J. Choi, J. Dongarra, and D. W. Walker, "PUMMA: Parallel Universal Matrix Multiplication Algorithms on distributed memory concurrent computers", Concurrency: Practice and Experience, vol. 6(7), 1994.
    [17] S. Huss-Lederman, E. Jacobson, A. Tsao, and G. Zhang, "Matrix Multiplication on the Intel Touchstone DELTA", Concurrency: Practice and Experience, vol. 6(7), Oct. 1994.
    [18] R. C. Agarwal, F. Gustavson, M. Zubair, "A high performance matrix multiplication algorithm on a distributed memory parallel computer using overlapped communication", IBM J. Research and Development, vol. 38(6), 1994.
    [19] R. van de Geijn and J. Watts, "SUMMA: Scalable Universal Matrix Multiplication Algorithm", Concurrency: Practice and Experience, vol. 9(4), pp. 255--274, April 1997.
    [20] J. Choi, J. Dongarra, S. Ostrouchov, A. Petitet, D. Walker, and R. C. Whaley, "A Proposal for a Set of Parallel Basic Linear Algebra Subprograms", University of Tennessee, Tech. Rep. CS-95-292, May 1995.
    [21] L. S. Blackford et al., ScaLAPACK Users' Guide, SIAM, Philadelphia, PA, 1997.
    [22] J. Choi, "A Fast Scalable Universal Matrix Multiplication Algorithm on Distributed-Memory Concurrent Computers", in Proceedings of the 11th International Parallel Processing Symposium (IPPS'97), 1997.
    [23] C. Addison and Y. Ren, "OpenMP Issues Arising in the Development of Parallel BLAS and LAPACK Libraries", in Proceedings of EWOMP'01, 2001.
    [24] G. R. Luecke, W. Lin, "Scalability and Performance of OpenMP and MPI on a 128-Processor SGI Origin 2000", Concurrency and Computation: Practice and Experience, vol. 13, pp. 905--928, 2001.
    [25] M. Wu, S. Aluru, and R. A. Kendall, "Mixed Mode Matrix Multiplication", Intl. Conf. on Cluster Computing, 2002.
    [26] T. Betcke, "Performance analysis of various parallelization methods for BLAS3 routines on cluster architectures", John von Neumann-Institut für Computing, Tech. Rep. FZJ-ZAM-IB-2000-15, 2000.
    [27] J. L. Träff, H. Ritzdorf, R. Hempel, "The Implementation of MPI-2 One-Sided Communication for the NEC SX-5", in Proceedings of Supercomputing, 2000.
    [28] J. Liu, J. Wu, S. Kini, P. Wyckoff, and D. K. Panda, "High Performance RDMA-Based MPI Implementation over InfiniBand", ACM Intl. Conference on Supercomputing, 2003.
    [29] J. Nieplocha, V. Tipparaju, M. Krishnan, G. Santhanaraman, and D. K. Panda, "Optimizing Mechanisms for Latency Tolerance in Remote Memory Access Communication on Clusters", IEEE CLUSTER'03, 2003.
    [30] A. Grama, A. Gupta, G. Karypis, and V. Kumar, Introduction to Parallel Computing, Addison Wesley, 2003.
    [31] Cray online documentation, Optimizing Applications on the Cray X1 System. http://www.cray.com/craydoc/20/manuals/S-2315-50/html-S-2315-50/S-2315-50-toc.html
    [32] J. Nieplocha, B. Carpenter, "ARMCI: A Portable Remote Memory Copy Library for Distributed Array Libraries and Compiler Run-time Systems", RTSPP, IPPS/SPDP, 1999.
    [33] ARMCI web page. http://www.emsl.pnl.gov/docs/parsoft/armci/
    [34] J. Nieplocha, V. Tipparaju, J. Ju, and E. Apra, "One-sided communication on Myrinet", Cluster Computing, 2003.
    [35] J. Nieplocha, V. Tipparaju, A. Saify, and D. Panda, "Protocols and Strategies for Optimizing Remote Memory Operations on Clusters", Proc. CAC/IPDPS'02, 2002.
    [36] M. Krishnan and J. Nieplocha, "Optimizing Parallel Multiplication Operation for Rectangular and Transposed Matrices", in Proceedings of the 10th IEEE ICPADS, 2004.
    [37] T. H. Dunning, Jr., J. Chem. Phys., vol. 90, p. 1007, 1989.
    [38] Y. Alexeev, M. Valiev, D. A. Dixon, T. L. Windus, "Ab initio study of catalytic GTP hydrolysis", J. of ACS, 2004.
    [39] I. Foster et al., "Toward High-Performance Computational Chemistry: I. Scalable Fock Matrix Construction Algorithms", Journal of Computational Chemistry, vol. 17, no. 1, pp. 109--123, 1996.
    [40] C. Hsu, Y. Chung, and C. Dow, "Efficient Methods for Multi-Dimensional Array Redistribution", Journal of Supercomputing, vol. 17, pp. 23--46, 2000.
    [41] C. Edwards, P. Geng, A. Patra, and R. van de Geijn, "Parallel Matrix Distributions: Have we been doing it all wrong?", Tech. Rep. TR-95-39, Department of Computer Sciences, University of Texas, Oct. 1995.
    [42] Hyuk-Jae Lee, J. A. B. Fortes, "Toward data distribution independent parallel matrix multiplication", IPDPS, 1995.
    [43] S. D. Kaushik, C.-H. Huang, R. W. Johnson, and P. Sadayappan, "An Approach to Communication-Efficient Data Redistribution", Proc. Supercomputing'94, pp. 364--373, 1994.
    [44] I. Banicescu and R. Lu, "Experiences with Fractiling in N-Body Simulations", HPC Symposium, 1998.
    [45] R. Kendall et al., "High Performance Computational Chemistry: an Overview of NWChem, a Distributed Parallel Application", Computer Phys. Comm., vol. 128, pp. 260--283, 2000.

    Cited By

    • (2016) "A high-performance matrix-matrix multiplication methodology for CPU and GPU architectures", The Journal of Supercomputing, 72(3), pp. 804--844, Mar. 2016. doi:10.1007/s11227-015-1613-7
    • (2014) "A Matrix-Matrix Multiplication methodology for single/multi-core architectures using SIMD", The Journal of Supercomputing, 68(3), pp. 1418--1440, Jun. 2014. doi:10.1007/s11227-014-1098-9
    • (2007) "Optimal solution to matrix parenthesization problem employing parallel processing approach", Proceedings of the 8th WSEAS International Conference on Evolutionary Computing, pp. 235--240, Jun. 2007. doi:10.5555/1347992.1347994

    Published In

    CF '06: Proceedings of the 3rd conference on Computing frontiers
    May 2006
    430 pages
    ISBN:1595933026
    DOI:10.1145/1128022
    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. SRUMMA
    2. global arrays
    3. irregular distribution
    4. parallel linear algebra
    5. parallel matrix multiplication
    6. parallel programming
    7. remote memory access


    Conference

    CF06: Computing Frontiers Conference
    May 3-5, 2006
    Ischia, Italy

    Acceptance Rates

    Overall Acceptance Rate 273 of 785 submissions, 35%

