Programming matrix algorithms-by-blocks for thread-level parallelism

Published: 23 July 2009

Abstract

With the emergence of thread-level parallelism as the primary means for continued performance improvement, the programmability issue has reemerged as an obstacle to the use of architectural advances. We argue that evolving legacy libraries for dense and banded linear algebra is not a viable solution due to constraints imposed by early design decisions. We propose a philosophy of abstraction and separation of concerns that provides a promising solution in this problem domain. The first abstraction, FLASH, allows algorithms to express computation with matrices consisting of contiguous blocks, facilitating algorithms-by-blocks. Operand descriptions are registered for a particular operation a priori by the library implementor. A runtime system, SuperMatrix, uses this information to identify data dependencies between suboperations, allowing them to be scheduled to threads out-of-order and executed in parallel. But not all classical algorithms in linear algebra lend themselves to conversion to algorithms-by-blocks. We show how our recently proposed LU factorization with incremental pivoting and a closely related algorithm-by-blocks for the QR factorization, both originally designed for out-of-core computation, overcome this difficulty. Anecdotal evidence regarding the development of routines with a core functionality demonstrates how the methodology supports high productivity while experimental results suggest that high performance is abundantly achievable.
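
To make the abstract's central idea concrete, the sketch below expresses a right-looking Cholesky factorization as an algorithm-by-blocks in plain C. It is a minimal illustration under assumed names, not the paper's FLASH API: the block_t layout, the kernels chol_block, trsm_block, and gemm_block, and the chol_by_blocks driver are hypothetical stand-ins for the level-3 BLAS/LAPACK kernels an actual implementation would call.

    #include <math.h>

    /* Hypothetical storage-by-blocks: the matrix is an n-by-n grid of
       b-by-b blocks, each stored contiguously in column-major order,
       with A[i][j] pointing at block (i,j).  This mirrors the kind of
       layout that FLASH provides. */
    typedef double *block_t;

    /* Unblocked Cholesky factorization of a b-by-b diagonal block. */
    static void chol_block(block_t A, int b)
    {
        for (int k = 0; k < b; k++) {
            A[k + k * b] = sqrt(A[k + k * b]);
            for (int i = k + 1; i < b; i++)
                A[i + k * b] /= A[k + k * b];
            for (int j = k + 1; j < b; j++)
                for (int i = j; i < b; i++)
                    A[i + j * b] -= A[i + k * b] * A[j + k * b];
        }
    }

    /* Triangular solve with the factored diagonal block:
       B := B * inv(L)^T, where L is lower triangular. */
    static void trsm_block(const block_t L, block_t B, int b)
    {
        for (int j = 0; j < b; j++) {
            for (int k = 0; k < j; k++)
                for (int i = 0; i < b; i++)
                    B[i + j * b] -= B[i + k * b] * L[j + k * b];
            for (int i = 0; i < b; i++)
                B[i + j * b] /= L[j + j * b];
        }
    }

    /* Trailing update of one block: C := C - A * B^T. */
    static void gemm_block(const block_t A, const block_t B, block_t C, int b)
    {
        for (int j = 0; j < b; j++)
            for (int k = 0; k < b; k++)
                for (int i = 0; i < b; i++)
                    C[i + j * b] -= A[i + k * b] * B[j + k * b];
    }

    /* Right-looking Cholesky as an algorithm-by-blocks: every kernel
       call touches whole blocks only.  Here the calls run in program
       order; in the paper's approach a runtime would enqueue each call
       as a task instead. */
    void chol_by_blocks(block_t **A, int n, int b)
    {
        for (int k = 0; k < n; k++) {
            chol_block(A[k][k], b);
            for (int i = k + 1; i < n; i++)
                trsm_block(A[k][k], A[i][k], b);
            for (int j = k + 1; j < n; j++)
                for (int i = j; i < n; i++)
                    gemm_block(A[i][k], A[j][k], A[i][j], b);
        }
    }

Because every suboperation names exactly the blocks it reads and writes, the dependency structure of the factorization is visible to a runtime such as SuperMatrix without any analysis of loop indices; the runtime can record each kernel call as a task and execute independent tasks on separate threads.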

Reviews

Wolfgang Schreiner

For the last few years, the power of processors has grown by decomposing them into multiple cores that operate independently, in parallel, within a shared address space. This paper presents a new method for programming dense linear algebra algorithms that achieves better performance on such architectures than the traditional approach of using libraries such as the linear algebra package (LAPACK). The core of this algorithms-by-blocks method is to express matrix algorithms in terms of operations on submatrices rather than on scalars. FLASH, a corresponding programming interface, allows these algorithms to be conveniently implemented as C programs that internally represent matrices as hierarchies of submatrices. The programs are executed in parallel by a runtime system that stores each invocation of a sub-algorithm as a task in a queue and manages dependencies between tasks in a dataflow-like fashion.

The paper is very well organized and provides links to numerous detailed conference publications. After a motivating example, the first part of the paper presents the overall methodology. The second part reports on practical experience with programming concrete linear algebra subprograms, and demonstrates that the performance of FLASH programs is considerably better than that of their LAPACK counterparts. Unfortunately, the authors only present summarizing giga floating point operations per second (GFLOPS) figures, rather than execution times with a varying number of processors; thus, the scalability of the programs can only be estimated. The methodology and programming library clearly demonstrate how architectural developments influence algorithm and program design; this is an important message for a wide audience.

Online Computing Reviews Service
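
The dataflow scheduling the review describes can be sketched in a few dozen lines of C. The sketch below illustrates the principle only, with hypothetical names (task_t, conflicts, run); it is not the SuperMatrix implementation. Each task declares the blocks it reads and writes; a later task depends on an earlier one whenever their accesses to a common block conflict (read-after-write, write-after-write, or write-after-read); and every task whose conflicting predecessors have completed is eligible to run, where a thread pool would execute each eligible set in parallel.

    #include <stdio.h>

    #define MAX_TASKS 64
    #define MAX_ARGS   4

    /* A task records which blocks (identified by small integers) it
       reads and writes -- the information a dataflow scheduler needs. */
    typedef struct {
        const char *name;
        int reads[MAX_ARGS];  int nreads;
        int writes[MAX_ARGS]; int nwrites;
    } task_t;

    /* Two accesses to the same block conflict unless both are reads. */
    static int conflicts(const task_t *a, const task_t *b)
    {
        for (int i = 0; i < a->nwrites; i++) {
            for (int j = 0; j < b->nreads; j++)
                if (a->writes[i] == b->reads[j]) return 1;   /* RAW */
            for (int j = 0; j < b->nwrites; j++)
                if (a->writes[i] == b->writes[j]) return 1;  /* WAW */
        }
        for (int i = 0; i < a->nreads; i++)
            for (int j = 0; j < b->nwrites; j++)
                if (a->reads[i] == b->writes[j]) return 1;   /* WAR */
        return 0;
    }

    /* Repeatedly collect every not-yet-run task whose earlier
       conflicting tasks have all completed, then "run" that set.
       A real runtime would hand each set to worker threads. */
    static void run(const task_t *tasks, int n)
    {
        int done[MAX_TASKS] = {0}, remaining = n;
        while (remaining > 0) {
            int ready[MAX_TASKS], nready = 0;
            for (int t = 0; t < n; t++) {
                if (done[t]) continue;
                int ok = 1;
                for (int u = 0; u < t; u++)
                    if (!done[u] && conflicts(&tasks[u], &tasks[t]))
                        ok = 0;
                if (ok) ready[nready++] = t;
            }
            for (int i = 0; i < nready; i++) {
                printf("run %s\n", tasks[ready[i]].name);
                done[ready[i]] = 1;
            }
            remaining -= nready;
        }
    }

    int main(void)
    {
        /* Tasks of a 2x2 blocked Cholesky; block ids: 0 = A00,
           1 = A10, 2 = A11. */
        task_t tasks[] = {
            { .name = "CHOL(A00)",     .nreads = 0,               .writes = {0}, .nwrites = 1 },
            { .name = "TRSM(A00,A10)", .reads = {0}, .nreads = 1, .writes = {1}, .nwrites = 1 },
            { .name = "SYRK(A10,A11)", .reads = {1}, .nreads = 1, .writes = {2}, .nwrites = 1 },
            { .name = "CHOL(A11)",     .nreads = 0,               .writes = {2}, .nwrites = 1 },
        };
        run(tasks, 4);
        return 0;
    }

Run on these four tasks, the sketch prints them in dependency order: CHOL(A00), then TRSM(A00,A10), then SYRK(A10,A11), then CHOL(A11), since each conflicts with its predecessor on a shared block.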

Published In

ACM Transactions on Mathematical Software, Volume 36, Issue 3 (July 2009), 122 pages
ISSN: 0098-3500
EISSN: 1557-7295
DOI: 10.1145/1527286
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

Published: 23 July 2009
Accepted: 01 November 2008
Revised: 01 July 2008
Received: 01 December 2007
Published in TOMS Volume 36, Issue 3

Author Tags

  1. Linear algebra
  2. high-performance
  3. libraries
  4. multithreaded architectures

Qualifiers

  • Research-article
  • Research
  • Refereed

Cited By

  • (2024) Experiences with nested parallelism in task-parallel applications using malleable BLAS on multicore processors. International Journal of High Performance Computing Applications 38(2), 55–68. https://doi.org/10.1177/10943420231157653
  • (2023) A Review of High-Performance Computing Methods for Power Flow Analysis. Mathematics 11(11), 2461. https://doi.org/10.3390/math11112461
  • (2023) Task-based Parallel Programming for Scalable Matrix Product Algorithms. ACM Transactions on Mathematical Software 49(2), 1–23. https://doi.org/10.1145/3583560
  • (2023) Programming parallel dense matrix factorizations and inversion for new-generation NUMA architectures. Journal of Parallel and Distributed Computing 175, 51–65. https://doi.org/10.1016/j.jpdc.2023.01.004
  • (2023) Tall-and-Skinny QR Factorization for Clusters of GPUs Using High-Performance Building Blocks. Euro-Par 2023: Parallel Processing Workshops, 306–317. https://doi.org/10.1007/978-3-031-50684-0_24
  • (2023) Computing rank-revealing factorizations of matrices stored out-of-core. Concurrency and Computation: Practice and Experience 35(22). https://doi.org/10.1002/cpe.7726
  • (2022) Parallel QR Factorization of Block Low-rank Matrices. ACM Transactions on Mathematical Software 48(3), 1–28. https://doi.org/10.1145/3538647
  • (2022) Algorithm 1022: Efficient Algorithms for Computing a Rank-Revealing UTV Factorization on Parallel Computing Architectures. ACM Transactions on Mathematical Software 48(2), 1–42. https://doi.org/10.1145/3507466
  • (2022) Portable and Efficient Dense Linear Algebra in the Beginning of the Exascale Era. 2022 IEEE/ACM International Workshop on Performance, Portability and Productivity in HPC (P3HPC), 36–46. https://doi.org/10.1109/P3HPC56579.2022.00009
  • (2022) QR Factorization Using Malleable BLAS on Multicore Processors. High Performance Computing. ISC High Performance 2022 International Workshops, 176–189. https://doi.org/10.1007/978-3-031-23220-6_12
