DOI: 10.1145/1128022.1128054
Article

Memory efficient parallel matrix multiplication operation for irregular problems

Published: 03 May 2006
Abstract

    Regular distributions for storing dense matrices on parallel systems are not always used in practice. In many scientific applications matrices are distributed irregularly; we extended a parallel matrix multiplication algorithm (SRUMMA) [1] to handle irregularly distributed matrices. Our approach relies on a distribution-independent algorithm that provides dynamic load balancing by exploiting data locality, and it achieves performance as good as the traditional approach, which handles the irregular case through temporary arrays with a regular distribution, data redistribution, and matrix multiplication for regular matrices. The proposed algorithm is memory-efficient because temporary matrices are not needed. This feature is critical for systems like the IBM Blue Gene/L that offer a very limited amount of memory per node. The experimental results demonstrate very good performance across the range of matrix distributions and problem sizes motivated by real applications.
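    The abstract's key ideas (blocked multiplication over irregularly cut matrices, a task list over output blocks, and no temporary redistributed copies) can be illustrated with a minimal serial sketch. This is not the authors' implementation: the cut-point representation, the `blocks` helper, and the explicit task list are illustrative assumptions, and in the real algorithm idle processes claim tasks dynamically and fetch remote blocks with one-sided get operations rather than indexing a local array.

    ```python
    # Hedged sketch (not the paper's code): a distribution-independent
    # blocked matrix multiply. C is partitioned by irregular cut points;
    # each "task" computes one C block by reading only the A and B blocks
    # it needs, so no temporary regularly-distributed copies are created.

    def matmul(A, B):
        # Plain triple-loop reference for checking the blocked version.
        n, k, m = len(A), len(A[0]), len(B[0])
        C = [[0.0] * m for _ in range(n)]
        for i in range(n):
            for j in range(m):
                C[i][j] = sum(A[i][t] * B[t][j] for t in range(k))
        return C

    def blocks(cuts):
        # Turn cut points [0, 2, 3] into half-open block ranges [(0,2), (2,3)].
        # Uneven cuts mimic an irregularly distributed matrix.
        return list(zip(cuts[:-1], cuts[1:]))

    def blocked_matmul(A, B, row_cuts, inner_cuts, col_cuts):
        n, m = len(A), len(B[0])
        C = [[0.0] * m for _ in range(n)]
        # One task per (row-block, col-block) of C. In the parallel setting,
        # processes would claim tasks from a shared counter (dynamic load
        # balancing), preferring tasks whose A/B blocks are local.
        tasks = [(rb, cb) for rb in blocks(row_cuts) for cb in blocks(col_cuts)]
        for (r0, r1), (c0, c1) in tasks:
            for (t0, t1) in blocks(inner_cuts):
                # "Get" the needed A and B blocks, multiply, accumulate into C.
                for i in range(r0, r1):
                    for j in range(c0, c1):
                        C[i][j] += sum(A[i][t] * B[t][j] for t in range(t0, t1))
        return C
    ```

    Because every task touches a disjoint block of C, the tasks are independent and can run in any order, which is what makes fully dynamic load balancing safe.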

    References

    [1] M. Krishnan, J. Nieplocha, "SRUMMA: A Matrix Multiplication Algorithm Suitable for Clusters and Scalable Shared Memory Systems", Proc. IPDPS'04, 2004.
    [2] L. E. Cannon, "A cellular computer to implement the Kalman Filter Algorithm", Ph.D. dissertation, Montana State University, 1969.
    [3] G. C. Fox, S. W. Otto, A. J. G. Hey, "Matrix algorithms on a hypercube I: Matrix multiplication", Parallel Computing, vol. 4, pp. 17--31, 1987.
    [4] G. C. Fox, M. Johnson, G. Lyzenga, S. Otto, J. Salmon, and D. Walker, Solving Problems on Concurrent Processors, vol. 1, Prentice Hall, 1988.
    [5] G. H. Golub, C. H. Van Loan, Matrix Computations, Johns Hopkins University Press, 1989.
    [6] J. Berntsen, "Communication efficient matrix multiplication on hypercubes", Parallel Computing, vol. 12, 1989.
    [7] A. Gupta and V. Kumar, "Scalability of Parallel Algorithms for Matrix Multiplication", ICPP'93, 1993.
    [8] C. Lin and L. Snyder, "A matrix product algorithm and its comparative performance on hypercubes", SHPCC'92, 1992.
    [9] Q. Luo and J. B. Drake, "A Scalable Parallel Strassen's Matrix Multiply Algorithm for Distributed Memory Computers", http://citeseer.nj.nec.com/517382.html
    [10] S. Huss-Lederman, E. M. Jacobson, and A. Tsao, "Comparison of Scalable Parallel Matrix Multiplication Libraries", in Proceedings of the Scalable Parallel Libraries Conference, 1994.
    [11] C. T. Ho, S. L. Johnsson, A. Edelman, "Matrix multiplication on hypercubes using full bandwidth and constant storage", Proc. 6th Distributed Memory Computing Conference, 1991.
    [12] H. Gupta and P. Sadayappan, "Communication Efficient Matrix Multiplication on Hypercubes", in Proceedings of the Sixth ACM Symposium on Parallel Algorithms and Architectures, 1994.
    [13] J. Li, A. Skjellum, and R. D. Falgout, "A Poly-Algorithm for Parallel Dense Matrix Multiplication on Two-Dimensional Process Grid Topologies", Concurrency: Practice and Experience, vol. 9(5), pp. 345--389, 1997.
    [14] E. Dekel, D. Nassimi, S. Sahni, "Parallel matrix and graph algorithms", SIAM J. Computing, vol. 10, 1981.
    [15] S. Ranka, S. Sahni, Hypercube Algorithms for Image Processing and Pattern Recognition, Springer-Verlag, 1990.
    [16] J. Choi, J. Dongarra, and D. W. Walker, "PUMMA: Parallel Universal Matrix Multiplication Algorithms on distributed memory concurrent computers", Concurrency: Practice and Experience, vol. 6(7), 1994.
    [17] S. Huss-Lederman, E. Jacobson, A. Tsao, and G. Zhang, "Matrix Multiplication on the Intel Touchstone DELTA", Concurrency: Practice and Experience, vol. 6(7), Oct. 1994.
    [18] R. C. Agarwal, F. Gustavson, M. Zubair, "A high performance matrix multiplication algorithm on a distributed memory parallel computer using overlapped communication", IBM J. Research and Development, vol. 38(6), 1994.
    [19] R. van de Geijn and J. Watts, "SUMMA: Scalable Universal Matrix Multiplication Algorithm", Concurrency: Practice and Experience, vol. 9(4), pp. 255--274, April 1997.
    [20] J. Choi, J. Dongarra, S. Ostrouchov, A. Petitet, D. Walker, and R. C. Whaley, "A Proposal for a Set of Parallel Basic Linear Algebra Subprograms", University of Tennessee, Tech. Rep. CS-95-292, May 1995.
    [21] L. S. Blackford et al., ScaLAPACK Users' Guide, SIAM, Philadelphia, PA, 1997.
    [22] J. Choi, "A Fast Scalable Universal Matrix Multiplication Algorithm on Distributed-Memory Concurrent Computers", in Proceedings of the 11th International Parallel Processing Symposium (IPPS'97), 1997.
    [23] C. Addison and Y. Ren, "OpenMP Issues Arising in the Development of Parallel BLAS and LAPACK Libraries", in Proceedings of EWOMP'01, 2001.
    [24] G. R. Luecke, W. Lin, "Scalability and Performance of OpenMP and MPI on a 128-Processor SGI Origin 2000", Concurrency and Computation: Practice and Experience, vol. 13, pp. 905--928, 2001.
    [25] M. Wu, S. Aluru, and R. A. Kendall, "Mixed Mode Matrix Multiplication", Intl. Conf. on Cluster Computing, 2002.
    [26] T. Betcke, "Performance analysis of various parallelization methods for BLAS3 routines on cluster architectures", John von Neumann-Institut für Computing, Tech. Rep. FZJ-ZAM-IB-2000-15, 2000.
    [27] J. L. Träff, H. Ritzdorf, R. Hempel, "The Implementation of MPI-2 One-Sided Communication for the NEC SX-5", in Proceedings of Supercomputing, 2000.
    [28] J. Liu, J. Wu, S. Kini, P. Wyckoff, and D. K. Panda, "High Performance RDMA-Based MPI Implementation over InfiniBand", ACM Intl. Conference on Supercomputing, 2003.
    [29] J. Nieplocha, V. Tipparaju, M. Krishnan, G. Santhanaraman, and D. K. Panda, "Optimizing Mechanisms for Latency Tolerance in Remote Memory Access Communication on Clusters", IEEE CLUSTER'03, 2003.
    [30] A. Grama, A. Gupta, G. Karypis, and V. Kumar, Introduction to Parallel Computing, Addison Wesley, 2003.
    [31] Cray online documentation, Optimizing Applications on the Cray X1 System. http://www.cray.com/craydoc/20/manuals/S-2315-50/html-S-2315-50/S-2315-50-toc.html
    [32] J. Nieplocha, B. Carpenter, "ARMCI: A Portable Remote Memory Copy Library for Distributed Array Libraries and Compiler Run-time Systems", RTSPP, IPPS/SPDP, 1999.
    [33] ARMCI web page. http://www.emsl.pnl.gov/docs/parsoft/armci/
    [34] J. Nieplocha, V. Tipparaju, J. Ju, and E. Apra, "One-sided communication on Myrinet", Cluster Computing, 2003.
    [35] J. Nieplocha, V. Tipparaju, A. Saify, and D. Panda, "Protocols and Strategies for Optimizing Remote Memory Operations on Clusters", Proc. CAC/IPDPS'02, 2002.
    [36] M. Krishnan and J. Nieplocha, "Optimizing Parallel Multiplication Operation for Rectangular and Transposed Matrices", in Proceedings of the 10th IEEE ICPADS, 2004.
    [37] T. H. Dunning, Jr., J. Chem. Phys., vol. 90, p. 1007, 1989.
    [38] Y. Alexeev, M. Valiev, D. A. Dixon, T. L. Windus, "Ab initio study of catalytic GTP hydrolysis", J. of ACS, 2004.
    [39] I. Foster et al., "Toward High-Performance Computational Chemistry: I. Scalable Fock Matrix Construction Algorithms", Journal of Computational Chemistry, vol. 17, no. 1, pp. 109--123, 1996.
    [40] C. Hsu, Y. Chung, and C. Dow, "Efficient Methods for Multi-Dimensional Array Redistribution", Journal of Supercomputing, vol. 17, pp. 23--46, 2000.
    [41] C. Edwards, P. Geng, A. Patra, and R. van de Geijn, "Parallel Matrix Distributions: Have we been doing it all wrong?", Tech. Rep. TR-95-39, Department of Computer Sciences, University of Texas, Oct. 1995.
    [42] Hyuk-Jae Lee, J. A. B. Fortes, "Toward data distribution independent parallel matrix multiplication", IPDPS, 1995.
    [43] S. D. Kaushik, C.-H. Huang, R. W. Johnson, and P. Sadayappan, "An Approach to Communication-Efficient Data Redistribution", Proc. Supercomputing'94, pp. 364--373, 1994.
    [44] I. Banicescu and R. Lu, "Experiences with Fractiling in N-Body Simulations", HPC Symposium, 1998.
    [45] R. Kendall et al., "High Performance Computational Chemistry: an Overview of NWChem, a Distributed Parallel Application", Computer Phys. Comm., vol. 128, pp. 260--283, 2000.

    Cited By

    • (2016) "A high-performance matrix-matrix multiplication methodology for CPU and GPU architectures", The Journal of Supercomputing, 72(3), pp. 804--844, Mar. 2016. doi:10.1007/s11227-015-1613-7
    • (2014) "A Matrix-Matrix Multiplication methodology for single/multi-core architectures using SIMD", The Journal of Supercomputing, 68(3), pp. 1418--1440, Jun. 2014. doi:10.1007/s11227-014-1098-9
    • (2007) "Optimal solution to matrix parenthesization problem employing parallel processing approach", Proceedings of the 8th WSEAS International Conference on Evolutionary Computing, pp. 235--240, Jun. 2007. doi:10.5555/1347992.1347994

    Published In

    CF '06: Proceedings of the 3rd conference on Computing frontiers
    May 2006
    430 pages
    ISBN:1595933026
    DOI:10.1145/1128022
    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. SRUMMA
    2. global arrays
    3. irregular distribution
    4. parallel linear algebra
    5. parallel matrix multiplication
    6. parallel programming
    7. remote memory access


    Conference

    CF06: Computing Frontiers Conference
    May 3-5, 2006
    Ischia, Italy

    Acceptance Rates

    Overall Acceptance Rate 273 of 785 submissions, 35%

