research-article

Distributed SBP Cholesky factorization algorithms with near-optimal scheduling

Authors:

Fred G. Gustavson,

Bo Kågström

ACM Transactions on Mathematical Software (TOMS), Volume 36, Issue 2

Article No.: 11, Pages 1 - 25

https://doi.org/10.1145/1499096.1499100

Published: 07 April 2009

Abstract

The minimal block storage Distributed Square Block Packed (DSBP) format for distributed memory computing on symmetric and triangular matrices is presented. Three algorithm variants (Basic, Static, and Dynamic) of the blocked right-looking Cholesky factorization are designed for the DSBP format, implemented, and evaluated. On our target machine, all variants outperform standard full-storage implementations while saving almost half the storage. Communication overhead is shown to be virtually eliminated by the Static and Dynamic variants, both of which take advantage of hardware parallelism to hide communication costs. The Basic variant is shown to yield comparable or slightly better performance than the full-storage ScaLAPACK routine PDPOTRF, while being clearly outperformed by both Static and Dynamic. Models of execution assuming zero communication costs and overhead are developed. For medium- and larger-sized problems, the Static schedule is near optimal on our target machine, based on comparisons with these models and measurements of synchronization overhead.
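To make the abstract's two central ideas concrete (square blocks stored only for the lower triangle, and a blocked right-looking Cholesky factorization), the following is a minimal serial sketch in plain Python. It is not the authors' DSBP implementation: there is no distribution across processes, no communication, and no Basic, Static, or Dynamic scheduling. All function names (`pack_lower_blocks`, `chol_block`, `solve_right_lt`, `subtract_abt`, `blocked_cholesky`) are hypothetical, and the sketch assumes the block size divides the matrix order.

```python
import math

def pack_lower_blocks(a, nb):
    """Split a symmetric n-by-n matrix into nb-by-nb blocks, keeping only i >= j."""
    n = len(a)
    assert n % nb == 0, "sketch assumes the block size divides n"
    nblk = n // nb
    blocks = {}
    for i in range(nblk):
        for j in range(i + 1):
            rows = a[i * nb:(i + 1) * nb]
            blocks[i, j] = [row[j * nb:(j + 1) * nb] for row in rows]
    return blocks, nblk

def chol_block(b):
    """Unblocked Cholesky of one diagonal block; returns lower-triangular L."""
    nb = len(b)
    l = [[0.0] * nb for _ in range(nb)]
    for j in range(nb):
        d = b[j][j] - sum(l[j][k] ** 2 for k in range(j))
        l[j][j] = math.sqrt(d)  # raises ValueError if d is negative
        for i in range(j + 1, nb):
            s = b[i][j] - sum(l[i][k] * l[j][k] for k in range(j))
            l[i][j] = s / l[j][j]
    return l

def solve_right_lt(b, l):
    """Return X such that X * L^T = B, solving each row by forward substitution."""
    nb = len(l)
    x = [[0.0] * nb for _ in range(len(b))]
    for r in range(len(b)):
        for j in range(nb):
            s = b[r][j] - sum(x[r][k] * l[j][k] for k in range(j))
            x[r][j] = s / l[j][j]
    return x

def subtract_abt(c, a, b):
    """In-place update C -= A * B^T."""
    for i in range(len(c)):
        for j in range(len(c[0])):
            c[i][j] -= sum(a[i][k] * b[j][k] for k in range(len(a[0])))

def blocked_cholesky(blocks, nblk):
    """Right-looking blocked Cholesky over the packed lower-triangular blocks."""
    for k in range(nblk):
        # Factor the current diagonal block.
        blocks[k, k] = chol_block(blocks[k, k])
        # Panel step: solve for every block below the diagonal in block column k.
        for i in range(k + 1, nblk):
            blocks[i, k] = solve_right_lt(blocks[i, k], blocks[k, k])
        # Trailing update: touch only the stored lower-triangular blocks.
        for j in range(k + 1, nblk):
            for i in range(j, nblk):
                subtract_abt(blocks[i, j], blocks[i, k], blocks[j, k])
    return blocks

# Small symmetric positive definite example (illustrative data only).
a = [[4.0, 2.0, 0.0, 2.0],
     [2.0, 10.0, 3.0, 1.0],
     [0.0, 3.0, 5.0, 2.0],
     [2.0, 1.0, 2.0, 3.0]]
blocks, nblk = pack_lower_blocks(a, 2)
blocked_cholesky(blocks, nblk)
print(len(blocks), "of", nblk * nblk, "blocks stored")
```

With N block rows, this packed layout keeps N(N+1)/2 of the N^2 square blocks, which approaches one half as N grows. That ratio is what this sketch offers as an illustration of the storage saving the abstract mentions; the paper's own format additionally defines how blocks are mapped to processes, which is not modelled here.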

Index Terms

1. Computing methodologies
  1. Symbolic and algebraic manipulation
    1. Symbolic and algebraic algorithms
      1. Linear algebra algorithms
2. Mathematics of computing
  1. Mathematical analysis
    1. Numerical analysis
      1. Computations on matrices
  2. Mathematical software

Published In

ACM Transactions on Mathematical Software Volume 36, Issue 2

March 2009

149 pages

ISSN:0098-3500

EISSN:1557-7295

DOI:10.1145/1499096

Copyright © 2009 ACM.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 07 April 2009

Accepted: 01 October 2008

Revised: 01 May 2008

Received: 01 July 2007

Published in TOMS Volume 36, Issue 2

Qualifiers

Research-article
Research
Refereed
