research-article

Tiled QR factorization algorithms

Authors:

Henricus Bouwmeester,

Mathias Jacquelin,

Yves Robert

SC '11: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis

Article No.: 7, Pages 1 - 11

https://doi.org/10.1145/2063384.2063393

Published: 12 November 2011

Abstract

This work revisits existing algorithms for the QR factorization of rectangular matrices composed of p × q tiles, where p ≥ q. Within this framework, we study the critical paths and performance of algorithms such as Sameh-Kuck, Fibonacci, Greedy, and those found within PLASMA. Although neither Fibonacci nor Greedy is optimal, both are shown to be asymptotically optimal for all matrices of size p = q²f(q), where f is any function such that lim_{q→+∞} f(q) = 0. This novel and important complexity result applies to all matrices where p and q are proportional, p = λq, with λ ≥ 1, thereby encompassing many important situations in practice (least squares). We provide an extensive set of experiments that show the superiority of the new algorithms for tall matrices.
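To give a feel for what "critical path" means for an elimination ordering such as Greedy, the following is a minimal sketch in the classical coarse-grain model, where every tile elimination costs one time step and a row can take part in at most one elimination per step. This is an illustrative assumption: the paper itself weights the tile kernels, so its critical-path lengths differ from the counts computed here. The function name `greedy_coarse_steps` is hypothetical and does not come from the paper or from PLASMA.

```python
def greedy_coarse_steps(p: int, q: int) -> int:
    """Count time steps of a greedy elimination on a p x q tile grid.

    Assumed model (not the paper's weighted-kernel model): one elimination
    pairs two rows whose leading nonzero tile is in the same column, takes
    one time step, and zeroes that tile in one of the two rows.
    """
    if q < 1 or p < q:
        raise ValueError("expected p >= q >= 1")

    # ready[k] = number of rows whose first nonzero tile lies in column k.
    # Index q collects rows that are entirely zero in all q columns.
    ready = [0] * (q + 1)
    ready[0] = p

    steps = 0
    # Each column must end with exactly one surviving row (the diagonal tile).
    while any(ready[k] > 1 for k in range(q)):
        # Greedy choice: in every column, pair up as many ready rows as
        # possible. Counts are taken before any update, so a row zeroed in
        # this step only becomes usable in the next column at the next step.
        moved = [ready[k] // 2 for k in range(q)]
        for k in range(q):
            ready[k] -= moved[k]
            ready[k + 1] += moved[k]
        steps += 1
    return steps


if __name__ == "__main__":
    for p, q in [(8, 1), (4, 2), (16, 4), (64, 8)]:
        print(f"p={p:3d} q={q:2d} -> {greedy_coarse_steps(p, q)} steps")
```

For a single tile column the count reduces to a binary reduction tree, so 8 tile rows need 3 steps; for wider matrices the columns are pipelined, which is where the choice of ordering starts to matter.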

Published In

SC '11: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis

November 2011

866 pages

ISBN:9781450307710

DOI:10.1145/2063384

Conference Chair: Scott Lathrop (University of Chicago)

Program Chairs: Jim Costa (Sandia National Laboratories), William Kramer (National Center for Supercomputing Applications)

Copyright © 2011 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGARCH: ACM Special Interest Group on Computer Architecture
IEEE-CS: Computer Society

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

SC '11: International Conference for High Performance Computing, Networking, Storage and Analysis

November 12 - 18, 2011

Seattle, Washington

Acceptance Rates

SC '11 Paper Acceptance Rate 74 of 352 submissions, 21%;

Overall Acceptance Rate 1,516 of 6,373 submissions, 24%
