Distributed SBP Cholesky factorization algorithms with near-optimal scheduling

Published: 07 April 2009

Abstract

The minimal block storage Distributed Square Block Packed (DSBP) format for distributed memory computing on symmetric and triangular matrices is presented. Three algorithm variants (Basic, Static, and Dynamic) of the blocked right-looking Cholesky factorization are designed for the DSBP format, implemented, and evaluated. On our target machine, all variants outperform standard full-storage implementations while saving almost half the storage. Communication overhead is shown to be virtually eliminated by the Static and Dynamic variants, both of which take advantage of hardware parallelism to hide communication costs. The Basic variant is shown to yield performance comparable to or slightly better than that of the full-storage ScaLAPACK routine PDPOTRF, while being clearly outperformed by both Static and Dynamic. Models of execution assuming zero communication costs and overhead are developed. For medium- and larger-sized problems, the Static schedule is near optimal on our target machine, based on comparisons with these models and measurements of synchronization overhead.
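As a rough illustration of the two ideas named in the abstract, square block packed storage and the blocked right-looking Cholesky factorization, the following is a minimal serial sketch. It is not the authors' DSBP implementation: it runs on a single process, has no communication or scheduling, and assumes the matrix order is a multiple of the block size. The names `pack_lower`, `potrf`, `trsm`, `update`, and `cholesky_sbp` are hypothetical and chosen only to echo the usual LAPACK/BLAS roles.

```python
import math

def pack_lower(a, nb):
    """Keep only the nb-by-nb blocks on or below the block diagonal.
    Assumption: len(a) is a multiple of nb. Diagonal blocks are kept as
    full squares, so storage is a bit more than half of the full matrix."""
    n = len(a) // nb
    blocks = {(i, j): [row[j * nb:(j + 1) * nb] for row in a[i * nb:(i + 1) * nb]]
              for i in range(n) for j in range(i + 1)}
    return blocks, n

def potrf(a):
    """Unblocked Cholesky of one block: returns lower-triangular L, a = L L^T.
    Only the lower triangle of a is read."""
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for j in range(n):
        d = a[j][j] - sum(l[j][k] ** 2 for k in range(j))
        l[j][j] = math.sqrt(d)  # ValueError if the block is not positive definite
        for i in range(j + 1, n):
            s = a[i][j] - sum(l[i][k] * l[j][k] for k in range(j))
            l[i][j] = s / l[j][j]
    return l

def trsm(b, l):
    """Solve X L^T = B for X, with L lower triangular."""
    n = len(l)
    x = [[0.0] * n for _ in b]
    for r, row in enumerate(b):
        for j in range(n):
            s = row[j] - sum(x[r][k] * l[j][k] for k in range(j))
            x[r][j] = s / l[j][j]
    return x

def update(c, a, b):
    """In place: C -= A B^T."""
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            c[i][j] -= sum(x * y for x, y in zip(ai, bj))

def cholesky_sbp(blocks, n):
    """Right-looking blocked Cholesky over the packed lower blocks, in place."""
    for k in range(n):
        blocks[k, k] = potrf(blocks[k, k])                  # factor diagonal block
        for i in range(k + 1, n):
            blocks[i, k] = trsm(blocks[i, k], blocks[k, k])  # scale panel below it
        for j in range(k + 1, n):                            # update trailing matrix
            for i in range(j, n):
                update(blocks[i, j], blocks[i, k], blocks[j, k])
    return blocks
```

Calling `pack_lower(a, nb)` on a symmetric positive definite matrix and passing the result to `cholesky_sbp` overwrites the stored blocks with the corresponding blocks of the Cholesky factor. The "right-looking" character is visible in the loop order: once block column `k` is finished, all blocks to its right are updated immediately.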

Published In

ACM Transactions on Mathematical Software  Volume 36, Issue 2
March 2009
149 pages
ISSN:0098-3500
EISSN:1557-7295
DOI:10.1145/1499096
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 07 April 2009
Accepted: 01 October 2008
Revised: 01 May 2008
Received: 01 July 2007
Published in TOMS Volume 36, Issue 2

Author Tags

  1. Cholesky factorization
  2. Real symmetric matrices
  3. distributed square block format
  4. packed storage
  5. parallel algorithms
  6. parallel computing
  7. positive definite matrices

Qualifiers

  • Research-article
  • Research
  • Refereed

Cited By

  • (2021) A software cache autotuning strategy for dataflow computing with UPC++ DepSpawn. Computational and Mathematical Methods. DOI: 10.1002/cmm4.1148. Online publication date: 22-Feb-2021
  • (2020) A Makespan Lower Bound for the Tiled Cholesky Factorization Based on ALAP Schedule. Euro-Par 2020: Parallel Processing (134-150). DOI: 10.1007/978-3-030-57675-2_9. Online publication date: 24-Aug-2020
  • (2019) Easy Dataflow Programming in Clusters with UPC++ DepSpawn. IEEE Transactions on Parallel and Distributed Systems 30:6 (1267-1282). DOI: 10.1109/TPDS.2018.2884716. Online publication date: 1-Jun-2019
  • (2014) Leveraging task-parallelism in message-passing dense matrix factorizations using SMPSs. Parallel Computing 40:5-6 (113-128). DOI: 10.1016/j.parco.2014.04.001. Online publication date: May-2014
  • (2014) An efficient distributed randomized algorithm for solving large dense symmetric indefinite linear systems. Parallel Computing 40:7 (213-223). DOI: 10.1016/j.parco.2013.12.003. Online publication date: 1-Jul-2014
  • (2014) SPAGHETtI: Scheduling/Placement Approach for Task-Graphs on HETerogeneous archItecture. Euro-Par 2014 Parallel Processing (174-185). DOI: 10.1007/978-3-319-09873-9_15. Online publication date: 2014
  • (2012) Parallel and Cache-Efficient In-Place Matrix Storage Format Conversion. ACM Transactions on Mathematical Software 38:3 (1-32). DOI: 10.1145/2168773.2168775. Online publication date: 1-Apr-2012
  • (2012) DAGuE. Parallel Computing 38:1-2 (37-51). DOI: 10.1016/j.parco.2011.10.003. Online publication date: 1-Jan-2012
  • (2012) From serial loops to parallel execution on distributed systems. Proceedings of the 18th international conference on Parallel Processing (246-257). DOI: 10.1007/978-3-642-32820-6_25. Online publication date: 27-Aug-2012
  • (2012) Dense Linear Algebra on Accelerated Multicore Hardware. High-Performance Scientific Computing (123-146). DOI: 10.1007/978-1-4471-2437-5_5. Online publication date: 2012
