Programming matrix algorithms-by-blocks for thread-level parallelism

Published: 23 July 2009


With the emergence of thread-level parallelism as the primary means for continued performance improvement, the programmability issue has reemerged as an obstacle to the use of architectural advances. We argue that evolving legacy libraries for dense and banded linear algebra is not a viable solution due to constraints imposed by early design decisions. We propose a philosophy of abstraction and separation of concerns that provides a promising solution in this problem domain. The first abstraction, FLASH, allows algorithms to express computation with matrices consisting of contiguous blocks, facilitating algorithms-by-blocks. Operand descriptions are registered for a particular operation a priori by the library implementor. A runtime system, SuperMatrix, uses this information to identify data dependencies between suboperations, allowing them to be scheduled to threads out-of-order and executed in parallel. But not all classical algorithms in linear algebra lend themselves to conversion to algorithms-by-blocks. We show how our recently proposed LU factorization with incremental pivoting and a closely related algorithm-by-blocks for the QR factorization, both originally designed for out-of-core computation, overcome this difficulty. Anecdotal evidence regarding the development of routines with a core functionality demonstrates how the methodology supports high productivity while experimental results suggest that high performance is abundantly achievable.
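The dependency analysis described in the abstract can be illustrated with a minimal sketch. It is not the SuperMatrix implementation or its API: the names `Task`, `find_dependencies`, and `schedule_in_waves` are hypothetical, and the task list is an assumed right-looking Cholesky factorization on a 3 x 3 grid of blocks, chosen only because its block operations are well known. Each task declares which blocks it reads and writes, and any overlap involving a write forces an ordering between two tasks.

```python
from collections import namedtuple

# Hypothetical task record: one suboperation plus the blocks it reads/writes.
Task = namedtuple("Task", ["name", "reads", "writes"])

def find_dependencies(tasks):
    """Return {task index: set of earlier task indices it must wait for}."""
    deps = {i: set() for i in range(len(tasks))}
    for j, later in enumerate(tasks):
        for i in range(j):
            earlier = tasks[i]
            raw = set(earlier.writes) & set(later.reads)   # read-after-write
            war = set(earlier.reads) & set(later.writes)   # write-after-read
            waw = set(earlier.writes) & set(later.writes)  # write-after-write
            if raw or war or waw:
                deps[j].add(i)
    return deps

def schedule_in_waves(tasks):
    """Group tasks into waves; all tasks in one wave could run in parallel."""
    deps = find_dependencies(tasks)
    done, waves = set(), []
    while len(done) < len(tasks):
        ready = [i for i in range(len(tasks))
                 if i not in done and deps[i] <= done]
        waves.append([tasks[i].name for i in ready])
        done.update(ready)
    return waves

# Assumed example: Cholesky by blocks on the lower triangle of a 3 x 3 grid,
# listed in the sequential order in which the algorithm would issue the tasks.
cholesky_tasks = [
    Task("CHOL(A00)",   [],             ["A00"]),
    Task("TRSM(A10)",   ["A00"],        ["A10"]),
    Task("TRSM(A20)",   ["A00"],        ["A20"]),
    Task("SYRK(A11)",   ["A10"],        ["A11"]),
    Task("GEMM(A21)",   ["A20", "A10"], ["A21"]),
    Task("SYRK(A22)#1", ["A20"],        ["A22"]),
    Task("CHOL(A11)",   [],             ["A11"]),
    Task("TRSM(A21)",   ["A11"],        ["A21"]),
    Task("SYRK(A22)#2", ["A21"],        ["A22"]),
    Task("CHOL(A22)",   [],             ["A22"]),
]

waves = schedule_in_waves(cholesky_tasks)
for number, wave in enumerate(waves, start=1):
    print(number, wave)
```

In this sketch the two TRSM tasks land in the same wave, as do the three trailing updates that follow them, which is the kind of out-of-order parallelism the abstract refers to. A real runtime would dispatch ready tasks to threads as soon as their inputs are available rather than in synchronized waves.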


Wolfgang Schreiner

For the last few years, processor performance has grown mainly by dividing processors into multiple cores that operate independently and in parallel within a shared address space. This paper presents a new method for programming dense linear algebra algorithms that achieves better performance on such architectures than the traditional approach of using libraries such as the Linear Algebra PACKage (LAPACK). The core of this algorithms-by-blocks method is to express matrix algorithms in terms of operations on submatrices rather than on scalars. FLASH, a corresponding programming interface, allows these algorithms to be conveniently implemented as C programs that internally represent matrices as hierarchies of submatrices. The programs are executed in parallel by a runtime system that stores each invocation of a sub-algorithm as a task in a queue and manages dependencies between tasks in a dataflow-like fashion.

The paper is very well organized and provides links to numerous detailed conference publications. After a motivating example, the first part of the paper presents the overall methodology. The second part reports on practical experience with programming concrete linear algebra subprograms, and demonstrates that the performance of FLASH programs is considerably better than that of their LAPACK counterparts. Unfortunately, the authors present only summary figures in giga floating-point operations per second (GFLOPS), rather than execution times for a varying number of processors; thus, the scalability of the programs can only be estimated. The methodology and programming library clearly demonstrate how architectural developments influence algorithm and program design; this is an important message for a wide audience.

Online Computing Reviews Service
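To make the idea of operating on submatrices rather than scalars concrete, here is a minimal sketch in Python. It does not use FLASH or any real library interface: `to_blocks`, `from_blocks`, `gemm_block`, and `matmul_by_blocks` are hypothetical helpers, the matrix is stored as a single-level grid of blocks rather than a deeper hierarchy, and the block size is assumed to divide the matrix dimension evenly.

```python
def to_blocks(a, b):
    """Split an n x n list-of-lists matrix into a grid of b x b blocks.

    Assumes b divides n evenly. Each block is stored as its own small matrix.
    """
    nb = len(a) // b
    return [[[row[j * b:(j + 1) * b] for row in a[i * b:(i + 1) * b]]
             for j in range(nb)]
            for i in range(nb)]

def from_blocks(blocks):
    """Reassemble a grid of square blocks into a flat list-of-lists matrix."""
    rows = []
    for block_row in blocks:
        for r in range(len(block_row[0])):
            rows.append([x for blk in block_row for x in blk[r]])
    return rows

def gemm_block(c, a, b):
    """Scalar-level kernel on one block: c += a * b, updating c in place."""
    for i in range(len(a)):
        for j in range(len(b[0])):
            c[i][j] += sum(a[i][k] * b[k][j] for k in range(len(b)))

def matmul_by_blocks(A, B, C):
    """Algorithm-by-blocks: the loops index whole blocks, not scalars."""
    nb = len(A)
    for i in range(nb):
        for j in range(nb):
            for k in range(nb):
                # Each call is one suboperation on blocks; a runtime could
                # record it as a task instead of executing it immediately.
                gemm_block(C[i][j], A[i][k], B[k][j])

# Demo: multiplying a 4 x 4 matrix by the identity, using 2 x 2 blocks.
A_flat = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
I_flat = [[1 if i == j else 0 for j in range(4)] for i in range(4)]
A = to_blocks(A_flat, 2)
B = to_blocks(I_flat, 2)
C = to_blocks([[0] * 4 for _ in range(4)], 2)
matmul_by_blocks(A, B, C)
print(from_blocks(C) == A_flat)
```

The point of the design is that the outer algorithm never touches individual entries: all scalar work is confined to the per-block kernel, so each kernel call is a natural unit for scheduling.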





Published In

ACM Transactions on Mathematical Software  Volume 36, Issue 3
July 2009
122 pages


Association for Computing Machinery

New York, NY, United States

Publication History

Published: 23 July 2009
Accepted: 01 November 2008
Revised: 01 July 2008
Received: 01 December 2007
Published in TOMS Volume 36, Issue 3



Author Tags

  1. Linear algebra
  2. high-performance
  3. libraries
  4. multithreaded architectures


  • Research-article
  • Research
  • Refereed

