Direct Methods for Sparse Linear Systems
Timothy A. Davis
Fundamentals of Algorithms
Editor-in-Chief: Nicholas J. Higham, University of Manchester
The goal of the series is to produce a collection of short books, written by experts on numerical
methods, that include an explanation of each method and a summary of theoretical background.
What distinguishes a book in this series is both its emphasis on explaining how to best choose
a method, algorithm, or software package to solve a specific type of problem and its descrip-
tions of when a given algorithm or method succeeds or fails.
Editorial Board
Peter Benner, Technische Universität Chemnitz
Dianne P. O'Leary, University of Maryland
John R. Gilbert, University of California, Santa Barbara
Robert D. Russell, Simon Fraser University
Michael T. Heath, University of Illinois—Urbana-Champaign
Robert D. Skeel, Purdue University
C. T. Kelley, North Carolina State University
Danny Sorensen, Rice University
Cleve Moler, The MathWorks, Inc.
Andrew J. Wathen, Oxford University
James G. Nagy, Emory University
Henry Wolkowicz, University of Waterloo
Series Volumes
Davis, T. A. Direct Methods for Sparse Linear Systems
Kelley, C. T. Solving Nonlinear Equations with Newton's Method
Timothy A. Davis
University of Florida
Gainesville, Florida
Direct Methods for Sparse Linear Systems
All rights reserved. Printed in the United States of America. No part of this book may be
reproduced, stored, or transmitted in any manner without the written permission of the publisher.
For information, write to the Society for Industrial and Applied Mathematics, 3600 University City
Science Center, Philadelphia, PA 19104-2688 USA.
Trademarked names may be used in this book without the inclusion of a trademark symbol. These
names are used in an editorial context only; no infringement of trademark is intended.
MATLAB is a registered trademark of The MathWorks, Inc. For MATLAB product information, please
contact The MathWorks, Inc., 3 Apple Hill Drive, Natick, MA 01760-2098 USA,
508-647-7000, Fax: 508-647-7101, info@mathworks.com, www.mathworks.com
No warranties, expressed or implied, are made by the publisher, author, and their employers that the
programs contained in this volume are free of error. They should not be relied on as the sole basis to
solve a problem whose incorrect solution could result in injury to person or property. If the programs
are employed in such a manner, it is at the user's own risk and the publisher, author, and their
employers disclaim all liability for such misuse.
The algorithms presented in this book were developed with support from various sources, including
Sandia National Laboratory, the National Science Foundation (ASC-9111263, DMS-9223088,
DMS-9504974, DMS-9803599, and CCR-0203720), The MathWorks, Inc., and the University of Florida.
Davis, Timothy A.
Direct methods for sparse linear systems / Timothy A. Davis.
p. cm. — (Fundamentals of algorithms)
Includes bibliographical references and index.
ISBN-13: 978-0-898716-13-9 (pbk.)
ISBN-10: 0-89871-613-6 (pbk.)
1. Sparse matrices. 2. Linear systems. I. Title.
QA188.D386 2006
512.9'434—dc22
2006044387
ISBN-13: 978-0-898716-13-9
ISBN-10: 0-89871-613-6
SIAM is a registered trademark.
For Connie, Emily, and Timothy 3. ("TJ")
Contents
Preface xi
1 Introduction 1
1.1 Linear algebra 2
1.2 Graph theory, algorithms, and data structures 4
1.3 Further reading 6
2 Basic algorithms 7
2.1 Sparse matrix data structures 7
2.2 Matrix-vector multiplication 9
2.3 Utilities 10
2.4 Triplet form 12
2.5 Transpose 14
2.6 Summing up duplicate entries 15
2.7 Removing entries from a matrix 16
2.8 Matrix multiplication 17
2.9 Matrix addition 19
2.10 Vector permutation 20
2.11 Matrix permutation 21
2.12 Matrix norm 22
2.13 Reading a matrix from a file 23
2.14 Printing a matrix 23
2.15 Sparse matrix collections 24
2.16 Further reading 24
Exercises 24
4 Cholesky factorization 37
4.1 Elimination tree 38
5 Orthogonal methods 69
5.1 Householder reflections 69
5.2 Left- and right-looking QR factorization 70
5.3 Householder-based sparse QR factorization 71
5.4 Givens rotations 79
5.5 Row-merge sparse QR factorization 79
5.6 Further reading 81
Exercises 82
6 LU factorization 83
6.1 Upper bound on fill-in 83
6.2 Left-looking LU 85
6.3 Right-looking and multifrontal LU 88
6.4 Further reading 94
Exercises 95
7 Fill-reducing orderings 99
7.1 Minimum degree ordering 99
7.2 Maximum matching 112
7.3 Block triangular form 118
7.4 Dulmage-Mendelsohn decomposition 122
7.5 Bandwidth and profile reduction 127
7.6 Nested dissection 128
7.7 Further reading 130
Exercises 133
9 CSparse 145
9.1 Primary CSparse routines and definitions 146
9.2 Secondary CSparse routines and definitions 149
9.3 Tertiary CSparse routines and definitions 154
9.4 Examples 158
Bibliography 195
Index 211
Preface
This book presents the fundamentals of sparse matrix algorithms, from theory
to algorithms and data structures to working code. The focus is on direct methods
for solving systems of linear equations; iterative methods and solvers for eigenvalue
problems are beyond the scope of this book.
The goal is to impart a working knowledge of the underlying theory and prac-
tice of sparse matrix algorithms, so that you will have the foundation to understand
more complex (but faster) algorithms. Methods that operate on dense submatrices
of a larger sparse matrix (multifrontal and supernodal methods) are much faster, but
a complete sparse matrix package based on these methods can be tens of thousands
of lines long. The sparse LU, Cholesky, and QR factorization codes in MATLAB®,
for example, total about 100,000 lines of code. Trying to understand the sparse
matrix technique by starting with such huge codes is a daunting task. To overcome
this obstacle, a sparse matrix package, CSparse,1 has been written specifically for
this book.2 It can solve Ax = b when A is unsymmetric, symmetric positive defi-
nite, or rectangular, using about 2,200 lines of code. Although simple and concise,
it is based on recently developed methods and theory. All of CSparse is printed in
this book. Take your time to read and understand these codes; do not gloss over
them. You will find them much easier to comprehend and learn from than their
larger (yet faster) cousins. The larger packages you may use in practice are based
on much of the theory and some of the algorithms presented more concisely and
simply in CSparse. For example, the MATLAB statement x = A\b relies on the the-
ory and algorithms from almost every section of this book. Parallel sparse matrix
algorithms are excluded, yet they too rely on the theory discussed here.
For the computational scientist with a problem to solve using sparse matrix
methods, these larger packages may be faster, but you need to understand how
they work to use them effectively. They might not have every function needed to
interface them into your application. You may need to write some code of your own
to manipulate your matrix prior to or after using a large sparse matrix package.
One of the goals of this book is to equip you for these tasks. The same question
applies to MATLAB. You might ask, "What is the most efficient way of solving
my sparse matrix problem in MATLAB?" The short answer is to always operate on
whole matrices, large submatrices, or column vectors in MATLAB and to not rely
1 CSparse: a Concise Sparse matrix package.
2 The index gives page numbers in bold that contain CSparse and related software.
heavily on accessing the rows or individual entries of a sparse matrix. The long
answer to this question is to read this book. MATLAB and the C programming
language are a strong emphasis of this book. In particular, one goal of the book is
to explain how MATLAB performs its sparse matrix computations.
Algorithms are presented in a mixture of pseudocode, MATLAB, and C, so
knowledge of these is assumed. Also required is a basic knowledge of linear algebra,
graph theory, algorithms, and data structures. A short review of these topics is
provided. Each chapter includes a set of exercises to reinforce the topic.3
CSparse is written in C, using a spartan coding style. Using C instead of
(say) Java or C++ allows for concise exposition, full disclosure of time and memory
complexity, efficiency, and portability. CSparse can be downloaded from SIAM at
www.siam.org/books/fa02. MATLAB 7.2 (R2006a) was used for this book. CSparse
handles only real matrices and int integers. CXSparse is an extended version that
includes support for real and complex matrices and int and long integers and can
also be downloaded from www.siam.org/books/fa02.
The genesis of this book was a collection of lecture notes for a course on sparse
matrix algorithms I taught at Stanford in 2003. I would like to thank Gene Golub,
Esmond Ng, and Horst Simon for enabling me to spend a sabbatical at Stanford and
Lawrence Berkeley National Laboratory for the 2002-2003 academic year. Several
extended visits to Sandia National Laboratory at Mike Heroux's invitation enabled
me to develop my versions of the left-looking sparse LU factorization algorithm
and the Dulmage-Mendelsohn decomposition for use in Sandia's circuit simulation
efforts. The algorithms presented here were developed with support from various
sources, including Sandia National Laboratory, the National Science Foundation
(ASC-9111263, DMS-9223088, DMS-9504974, DMS-9803599, and CCR-0203720),
The MathWorks, Inc., and the University of Florida. I would like to thank David
Bateman for adding support for complex matrices and long integers to CXSparse.
Nick Higham, Cleve Moler, and the other members of the Editorial Board
of the SIAM Fundamentals of Algorithms book series encouraged me to turn these
lecture notes and codes into the printed page before you by inviting me to write this
book for the series. Finally, I would like to thank David Day, John Gilbert, Chen
Greif, Nick Higham, Sara Murphy, Pat Quillen, David Riegelhaupt, Ken Stanley,
Linda Thiel, and my Spring 2006 sparse matrix class (Suranjit Adhikari, Pawan
Aurora, Okiemute Brume, Yanqing "Morris" Chen, Eric Dattoli, Bing Jian, Nick
Lord, Siva Rajamanickam, and Ozlem Subakan), who provided helpful feedback on
the content and presentation of the book.
Tim Davis
University of Florida, Gainesville, Florida
www.cise.ufl.edu/~davis
April 2006
3 Instructors: please do not post solutions on the web where they are publicly readable. Use a password-protected web page instead.
Chapter 1
Introduction
This book presents the fundamentals of sparse matrix algorithms for the direct
solution of sparse linear systems, from theory to algorithms and data structures to
working code. The algorithms presented here have been chosen with these goals
in mind: they must embody much of the theory behind sparse matrix algorithms;
they must be either asymptotically optimal in their run time and memory usage or
be fast in practice; they must be concise so as to be easy to understand and short
enough to print in their entirety in this book; they must cover a wide spectrum of
matrix operations; and they must be accurate and robust.
Algorithms are presented in a mixture of pseudocode, MATLAB, and C, so
knowledge of these is assumed. Also required is a basic knowledge of linear algebra,
graph theory, algorithms, and data structures. This background is reviewed below
and in an appendix on the C programming language.
Chapter 2 presents basic data structures and algorithms, including matrix
multiplication, addition, transpose, and data structure manipulations. Chapter 3
considers the solution of triangular systems of equations. Chapters 4 through 6
present the three most commonly used decompositions: Cholesky, QR, and LU.
Factorization methods specifically for symmetric indefinite matrices are not dis-
cussed. Section 4.10 presents a method for updating and downdating a sparse
Cholesky factorization after a low-rank change. Chapter 7 discusses ordering meth-
ods that reduce work and memory requirements. Chapter 8 draws on the theory
and algorithms presented in Chapters 1 through 7 to solve a sparse linear system
Ax = b, where A can be symmetric positive definite, unsymmetric, or rectangu-
lar, just like the backslash operator in MATLAB, x=A\b, when A is sparse and b
is a dense column vector. Chapter 9 is a summary of the CSparse sparse matrix
package. Finally, Chapter 10 explains how to use sparse matrices in MATLAB.
To avoid breaking the flow of discussion, few citations appear in the body of
each chapter. They are discussed at the end of each chapter instead in a "Further
reading" section, which gives an overview of software, books, and papers related to
that chapter. Notable exceptions to this rule are the theorems stated in the book.
The final section in each chapter is a set of exercises for that chapter.
where A_ij is m_i-by-n_j, if m_i is the size of the ith row subset and n_j is the size of
the jth column subset. Two block matrices can be added if they are partitioned
identically. Two block matrices can be multiplied, C = AB, if the columns of A
are partitioned identically to the rows of B; the rows of C and A are partitioned
identically, as are the columns of C and B. If c is the number of partitions of the
columns of A and rows of B, (1.1) becomes
A set of vectors a_1, a_2, ..., a_n is linearly independent if ∑ α_j a_j = 0 implies α_j
is zero for all j. The span of a set of vectors is the set of vectors that can be written
as a linear combination of vectors in the set; span(a_1, a_2, ..., a_n) = {∑ α_j a_j}. The
range of a matrix A is the span of its column vectors. The rank of a matrix A is
the maximal size of the subsets of the columns of A that are linearly independent.
An n-by-n matrix is singular if its rank is less than n. An m-by-n matrix is rank
deficient if its rank is less than min(m, n); it has full rank otherwise.
The 1-norm of a column vector x or row vector x^T is ||x||_1 = ∑ |x_i|, its 2-norm
is ||x||_2 = sqrt(∑ x_i^2), and its ∞-norm is ||x||_∞ = max |x_i|. The 1-norm of a matrix is
the largest 1-norm of its column vectors. The ∞-norm of a matrix is the largest
1-norm of its row vectors.
The inverse of a matrix A is A^{-1}, where A A^{-1} = A^{-1} A = I. It exists only
if A is square and nonsingular. Two vectors x and y are orthogonal if x^T y = 0. A
matrix Q is orthonormal if Q^T Q = I. A real square orthonormal matrix Q is called
orthogonal, in which case Q^T Q = Q Q^T = I (that is, Q^T = Q^{-1} if Q is orthogonal).
The 2-norms of a vector x and the product Qx are identical if Q is orthogonal.
The kth diagonal of an m-by-n matrix A is a vector d consisting of the set
of entries {a_ij}, where j − i = k. The term diagonal, by itself, refers to the 0th
diagonal, or main diagonal, of a matrix. The kth diagonal entry of A is a_kk.
The number of nonzero entries (nonzeros for short) in a matrix or vector is
|A|, and |a| denotes the absolute value of a scalar.
A permutation matrix P is a row or column permutation of the identity matrix.
Any given row or column of P contains a single nonzero entry, equal to 1. The
LU factorization of a square nonsingular matrix A has the form LU = A, where
L is lower triangular and U is upper triangular. With partial pivoting and row
interchanges, the factorization is LU = PA. A matrix A is positive definite if and
only if x^T A x > 0 for all nonzero vectors x. It is positive semidefinite if x^T A x ≥ 0.
The Cholesky factorization of a square symmetric positive definite matrix A has the
form LL^T = A, where L is lower triangular with positive diagonal entries. Pivoting
is not required for stability. A square matrix A is diagonally dominant by rows
if |a_ii| ≥ ∑_{j≠i} |a_ij| for all i. It is strictly diagonally dominant by rows if |a_ii| >
∑_{j≠i} |a_ij| for all i. A is (strictly) diagonally dominant by columns if A^T is (strictly)
diagonally dominant by rows. A square strictly diagonally dominant matrix is
nonsingular. Gaussian elimination without pivoting (a form of LU factorization) is
stable for any diagonally dominant matrix (by rows or by columns).
A QR factorization of a rectangular matrix A is QR = A, where Q is or-
thogonal and R is upper triangular. For a square matrix A, Ax = λx holds for an
eigenvalue λ and its eigenvector x.
Sets are denoted in calligraphic letters A, B, C, E, R, V, W, X, and Y. These
typically arise from the nonzero pattern of the corresponding matrix or vector. For
example, A_{*j} = {i | a_ij ≠ 0}, and X = {i | x_i ≠ 0}. The * in the subscript is
dropped when the context is clear.
The terms dense and sparse refer to the data structure used to store a matrix.
A matrix A ∈ R^{m×n} is dense if it is stored as a full array of m rows and n columns
with mn entries. This is called a full matrix in MATLAB. All entries are stored,
even if some of them are zero. A sparse matrix is stored in a data structure that can
exploit sparsity by not storing numerically zero entries. Numerically zero entries
may be stored in a sparse matrix, typically as a result of numerical cancellation.
where ⌊x⌋ is the largest integer not greater than x. The total time for n insertions
is less than or equal to 3n. The amortized time for any one insertion is at most 3.
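As a concrete illustration of this doubling strategy (a minimal sketch, not part of CSparse; the type and function names here are hypothetical), an array that doubles its capacity whenever it runs out of space gives each insertion an O(1) amortized cost:

#include <stdlib.h>

typedef struct { int *data ; int n ; int nmax ; } ivec ;

int ivec_push (ivec *v, int value)
{
    if (v->n >= v->nmax)                        /* no space left: double the capacity */
    {
        int nmax = 2 * (v->nmax > 0 ? v->nmax : 1) ;
        int *data = realloc (v->data, nmax * sizeof (int)) ;
        if (!data) return (0) ;                 /* out of memory */
        v->data = data ;
        v->nmax = nmax ;
    }
    v->data [v->n++] = value ;                  /* O(1) amortized per insertion */
    return (1) ;
}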
A common class of graph algorithms consists of methods for traversing the
nodes and edges of a graph. The depth-first search of a graph starts at a node j and
finds all nodes reachable from node j. It explores recursively by always examining
the outgoing edges of the latest node i just seen. When all edges of i have been
explored, it backtracks to the node from which i was first discovered. Nodes are
marked so that they are not searched twice. The time taken by a depth-first search
is O(s + e), where s = |Reach(j)| and e is the number of edges in the subgraph
induced by s. This subgraph is connected by the way it is constructed. Traversing
the entire graph in a depth-first manner requires the traversal to be repeated until
all nodes are visited. A depth-first search produces a list of nodes of a DAG in
topological order: i appears before j if i ⇝ j is a path in G.
The breadth-first search traverses a graph in a different order. Starting at
node i, it first examines all nodes adjacent to i. Next, it examines all nodes j whose
shortest path i ⇝ j is of length 2, then length 3, and so on. Like the depth-first
search, it too traverses all nodes in Reach(i). Unlike the depth-first search, it
traverses these nodes in order of the shortest path from i, not in topological order.
A graph is denoted as G or 𝒢, and T denotes a tree or forest. S denotes the
element lists in the minimum degree ordering algorithm, discussed in Chapter 7.
Chapter 2
Basic algorithms
A sparse matrix is one whose entries are mostly zero. There are many ways of storing
a sparse matrix. Whichever method is chosen, some form of compact data structure
is required that avoids storing the numerically zero entries in the matrix. It needs
to be simple and flexible so that it can be used in a wide range of matrix operations.
This need is met by the primary data structure in CSparse, a compressed-column
matrix. Basic matrix operations that operate on this data structure are presented
below, including matrix-vector multiplication, matrix-matrix multiplication, matrix
addition, and transpose.
int    i [ ] = { 2,   1,   3,   0,   1,   3,   3,   1,   0,   2   } ;
int    j [ ] = { 2,   0,   3,   2,   1,   0,   1,   3,   0,   1   } ;
double x [ ] = { 3.0, 3.1, 1.0, 3.2, 2.9, 3.5, 0.4, 0.9, 4.5, 1.7 } ;
The triplet form is simple to create but difficult to use in most sparse matrix
algorithms. The compressed-column form is more useful and is used in almost all
functions in CSparse. An m-by-n sparse matrix that can contain up to nzmax entries
is represented with an integer array p of length n+1, an integer array i of length
nzmax, and a real array x of length nzmax. Row indices of entries in column j are
stored in i[p[j]] through i[p[j+1]-1], and the corresponding numerical values
are stored in the same locations in x. The first entry p[0] is always zero, and p[n]
≤ nzmax is the number of actual entries in the matrix. The example matrix (2.1)
is represented as
int    p [ ] = { 0, 3, 6, 8, 10 } ;
int    i [ ] = { 0,   1,   3,   1,   2,   3,   0,   2,   1,   3   } ;
double x [ ] = { 4.5, 3.1, 3.5, 2.9, 1.7, 0.4, 3.2, 3.0, 0.9, 1.0 } ;
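To make the column-pointer convention concrete, the following short sketch (not a CSparse routine; the function name is hypothetical) walks the three arrays above and prints every stored entry, one column at a time:

#include <stdio.h>

static void print_csc (int n, const int *p, const int *i, const double *x)
{
    int j, q ;
    for (j = 0 ; j < n ; j++)                   /* for each column j */
    {
        for (q = p [j] ; q < p [j+1] ; q++)     /* entries in column j */
        {
            printf ("A(%d,%d) = %g\n", i [q], j, x [q]) ;
        }
    }
}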
MATLAB uses a compressed-column data structure much like cs for its sparse
matrices. It requires the row indices in each column to appear in ascending order,
and no zero entries may be present. Those two restrictions are relaxed in CSparse.
The triplet form and the compressed-column data structures are both encapsulated
in the cs structure:
typedef struct cs_sparse /* matrix in compressed-column or triplet form */
{
int nzmax ; /* maximum number of entries */
int m ; /* number of rows */
int n ; /* number of columns */
int *p ; /* column pointers (size n+1) or col indices (size nzmax) */
int *i ; /* row indices, size nzmax */
double *x ; /* numerical values, size nzmax */
int nz ; /* # of entries in triplet matrix, -1 for compressed-col */
} cs ;
The array p contains the column pointers for the compressed-column form (of
size n+1) or the column indices for the triplet form (of size nzmax). The matrix is
in compressed-column form if nz is negative. Any given CSparse function expects
its sparse matrix input in one form or the other, except for cs_print, cs_spalloc,
cs_spfree, and cs_sprealloc, which can operate on either form.
Within a mexFunction written in C or Fortran (but callable from MATLAB),
several functions are available that extract the parts of a MATLAB sparse matrix;
mxGetJc returns a pointer to the equivalent of the A->p column pointer array of the
cs matrix A. The functions mxGetIr, mxGetPr, mxGetM, mxGetN, and mxGetNzmax
return A->i, A->x, A->m, A->n, and A->nzmax, respectively. These mx functions are
not available to a MATLAB statement typed in the MATLAB command window
or in a MATLAB M-file but only in a compiled C or Fortran mexFunction. The
compressed-column data structures used in MATLAB and CSparse are identical,
except that MATLAB can handle complex matrices as well. MATLAB 7.2 forbids
explicit zero entries and requires row indices to be in order in each column.
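As an illustration only (this helper is hypothetical, not part of CSparse or MATLAB, and assumes an int-based MATLAB build such as MATLAB 7.2 along with the cs typedef above), a mexFunction can view a MATLAB sparse matrix as a cs matrix without copying:

#include "mex.h"

static cs cs_from_mx (const mxArray *M)     /* hypothetical helper */
{
    cs A ;
    A.nzmax = mxGetNzmax (M) ;      /* maximum number of entries */
    A.m = mxGetM (M) ;              /* number of rows */
    A.n = mxGetN (M) ;              /* number of columns */
    A.p = (int *) mxGetJc (M) ;     /* column pointers */
    A.i = (int *) mxGetIr (M) ;     /* row indices */
    A.x = mxGetPr (M) ;             /* numerical values */
    A.nz = -1 ;                     /* compressed-column form */
    return (A) ;
}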
Allowing the result to overwrite the input vector y, the jth iteration computes
y = y + A_{*j} x_j. The pseudocode for computing y = Ax + y is given below.

for j = 0 to n−1 do
    for each i for which a_ij ≠ 0 do
        y_i = y_i + a_ij x_j
Most algorithms are presented here directly in C, since the pseudocode directly
translates into C with little modification. Below is the complete C version of the
algorithm. Note how the for (p = . . .) loop in the cs_gaxpy function takes the
place of the for each i loop in the pseudocode (the name is short for generalized A
times x plus y). The MATLAB equivalent of cs_gaxpy(A,x,y) is y=A*x+y. Detailed
descriptions of the inputs, outputs, and return values of all CSparse functions are
given in Chapter 9.
int cs_gaxpy (const cs *A, const double *x, double *y)
{
int p, j, n, *Ap, *Ai ;
double *Ax ;
if (!CS_CSC (A) || !x || !y) return (0) ; /* check inputs */
n = A->n ; Ap = A->p ; Ai = A->i ; Ax = A->x ;
for (j = 0 ; j < n ; j++)
{
for (p = Ap [j] ; p < Ap [j+1] ; p++)
{
y [Ai [p]] += Ax [p] * x [j] ;
}
}
return (1) ;
}
#define CS_CSC(A) (A && (A->nz == -1))
#define CS_TRIPLET(A) (A && (A->nz >= 0))
The function first checks its inputs to ensure they exist, and returns false (zero)
if they do not. This protects against a caller that ran out of memory. CS_CSC(A) is
true for a compressed-column matrix; CS_TRIPLET(A) is true for a matrix in triplet
form. The next line (n=A->n ; . . .) extracts the contents of the matrix A—its
dimension, column pointers, row indices, and numerical values.
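As a usage sketch (hypothetical code, assuming a CSparse header cs.h and the example arrays from Section 2.1), the compressed-column arrays can be wrapped in a cs struct and passed directly to cs_gaxpy:

#include "cs.h"

static void gaxpy_demo (void)
{
    int    p [ ] = { 0, 3, 6, 8, 10 } ;
    int    i [ ] = { 0, 1, 3, 1, 2, 3, 0, 2, 1, 3 } ;
    double v [ ] = { 4.5, 3.1, 3.5, 2.9, 1.7, 0.4, 3.2, 3.0, 0.9, 1.0 } ;
    cs A = { 10, 4, 4, p, i, v, -1 } ;      /* the 4-by-4 example matrix (2.1) */
    double x [4] = { 1, 2, 3, 4 } ;         /* dense vector x */
    double y [4] = { 0, 0, 0, 0 } ;         /* dense vector y */
    cs_gaxpy (&A, x, y) ;                   /* y = A*x + y */
}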
2.3 Utilities
A sparse matrix algorithm such as cs_gaxpy requires a sparse matrix in cs form
as input. A few utility functions are required to create this data structure. The
cs_malloc, cs_calloc, cs_realloc, and cs_free functions are simple wrappers
around the equivalent ANSI C or MATLAB memory management functions.
void *cs_malloc (int n, size_t size)
{
return (malloc (CS_MAX (n,1) * size)) ;
}
The cs_spalloc function creates an m-by-n sparse matrix that can hold up to
nzmax entries. Numerical values are allocated if values is true. A triplet or
compressed-column matrix is allocated depending on whether triplet is true or
false. cs_spfree frees a sparse matrix, and cs_sprealloc changes the maximum
number of entries that a cs sparse matrix can contain (either triplet or compressed-
column).
cs *cs_spalloc (int m, int n, int nzmax, int values, int triplet)
{
cs *A = cs_calloc (1, sizeof (cs)) ; /* allocate the cs struct */
if (!A) return (NULL) ; /* out of memory */
A->m = m ; /* define dimensions and nzmax */
A->n = n ;
A->nzmax = nzmax = CS_MAX (nzmax, 1) ;
A->nz = triplet ? 0 : -1 ; /* allocate triplet or comp.col */
A->p = cs_malloc (triplet ? nzmax : n+1, sizeof (int)) ;
A->i = cs_malloc (nzmax, sizeof (int)) ;
A->x = values ? cs_malloc (nzmax, sizeof (double)) : NULL ;
return ((!A->p || !A->i || (values && !A->x)) ? cs_spfree (A) : A) ;
}
MATLAB provides similar utilities. cs_spalloc(m,n,nzmax,1,0) is identical to the
MATLAB spalloc(m,n,nzmax), and cs_spfree(A) is the same as clear A. The
cs *T ;
int *Ti, *Tj ;
double *Tx ;
T = cs_spalloc (m, n, nz, 1, 1) ;
Ti = T->i ; Tj = T->p ; Tx = T->x ;
Next, place each entry of the sparse matrix in the Ti, Tj, and Tx arrays. The
kth entry has row index i = Ti [k], column index j = Tj [k], and numerical value
a_ij = Tx [k]. The entries can appear in arbitrary order. Set T->nz to be the number
of entries in the matrix. Section 2.1 gives an example of a matrix in triplet form.
If multiple entries with identical row and column indices exist, the corresponding
numerical value is the sum of all such duplicate entries.
The cs_entry function is useful if the number of entries in the matrix is not
known when the matrix is first allocated. If space is not sufficient for the next entry,
the size of the T->i, T->j, and T->x arrays is doubled. The dimensions of T are
increased as needed.
int cs_entry (cs *T, int i, int j, double x)
{
if (!CS_TRIPLET (T) || i < 0 || j < 0) return (0) ; /* check inputs */
if (T->nz >= T->nzmax && !cs_sprealloc (T,2*(T->nzmax))) return (0) ;
if (T->x) T->x [T->nz] = x ;
T->i [T->nz] = i ;
T->p [T->nz++] = j ;
T->m = CS_MAX (T->m, i+1) ;
T->n = CS_MAX (T->n, j+1) ;
return (1) ;
}
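A usage sketch follows (hypothetical code; it assumes cs_compress, the CSparse routine referred to in Section 2.5 that converts a triplet matrix to compressed-column form): build a matrix one entry at a time, then convert it.

#include "cs.h"

static cs *triplet_demo (void)
{
    cs *T, *A ;
    T = cs_spalloc (0, 0, 1, 1, 1) ;    /* allocate an empty triplet matrix */
    if (!T) return (NULL) ;
    cs_entry (T, 2, 2, 3.0) ;           /* T and its dimensions grow as needed */
    cs_entry (T, 1, 0, 3.1) ;
    cs_entry (T, 3, 3, 1.0) ;
    /* ... the remaining entries of the matrix (2.1) would be added here ... */
    A = cs_compress (T) ;               /* convert to compressed-column form */
    cs_spfree (T) ;                     /* T is no longer needed */
    return (A) ;
}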
The cs_done function returns a cs sparse matrix and frees any workspace.
2.5 Transpose
The algorithm for transposing a sparse matrix (C = A^T) is very similar to the
cs_compress function because it can be viewed not just as a linear algebraic func-
tion but as a method for converting a compressed-column sparse matrix into a
compressed-row sparse matrix as well. The algorithm computes the row counts of
A, computes the cumulative sum to obtain the row pointers, and then iterates over
each nonzero entry in A, placing the entry in its appropriate row vector. If the
resulting sparse matrix C is interpreted as a matrix in compressed-row form, then
C is equal to A, just in a different format. If C is viewed as a compressed-column
matrix, then C contains A^T. It is simpler to describe cs_transpose with C as a
row-oriented matrix.
cs *cs_transpose (const cs *A, int values)
{
int p, q, j, *Cp, *Ci, n, m, *Ap, *Ai, *w ;
double *Cx, *Ax ;
cs *C ;
if (!CS_CSC (A)) return (NULL) ; /* check inputs */
m = A->m ; n = A->n ; Ap = A->p ; Ai = A->i ; Ax = A->x ;
C = cs_spalloc (n, m, Ap [n], values && Ax, 0) ; /* allocate result */
w = cs_calloc (m, sizeof (int)) ; /* get workspace */
if (!C || !w) return (cs_done (C, w, NULL, 0)) ; /* out of memory */
Cp = C->p ; Ci = C->i ; Cx = C->x ;
for (p = 0 ; p < Ap [n] ; p++) w [Ai [p]]++ ; /* row counts */
cs_cumsum (Cp, w, m) ; /* row pointers */
for (j = 0 ; j < n ; j++)
{
for (p = Ap [j] ; p < Ap [j+1] ; p++)
{
Ci [q = w [Ai [p]]++] = j ; /* place A(i,j) as entry C(j,i) */
if (Cx) Cx [q] = Ax [p] ;
}
}
return (cs_done (C, w, NULL, 1)) ; /* success; free w and return C */
}
First, the output matrix C and workspace w are allocated. Next, the row
counts and their cumulative sum are computed. The cumulative sum defines the
row pointer array Cp. Finally, cs_transpose traverses each column j of A, placing
column index j into each row i of C for which a_ij is nonzero. The position q of this
entry in C is given by q = w[i], which is then postincremented to prepare for the
next entry to be inserted into row i. Compare cs_transpose and cs_compress.
Their only significant difference is what kind of data structure their inputs are in.
The statement C=cs_transpose(A) is identical to the MATLAB statement C=A',
except that the latter can also compute the complex conjugate transpose. For
real matrices the MATLAB statements C=A' and C=A.' are identical. The values
parameter is true (nonzero) to signify that the numerical values of C are to be
computed or false (zero) otherwise.
Sorting the columns of a sparse matrix is particularly simple. The statement
C=cs_transpose(A) computes the transpose of A. Each row of C is constructed
one column index at a time, from column 0 to C->n-1. Thus, it is a sorted matrix;
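The listing for this duplicate-summing routine (cs_dupl in CSparse) falls on a page not reproduced above. The sketch below is a hedged reconstruction that is consistent with the description in the next paragraph, not necessarily the original code:

int cs_dupl (cs *A)
{
    int i, j, p, q, nz = 0, n, m, *Ap, *Ai, *w ;
    double *Ax ;
    if (!CS_CSC (A)) return (0) ;               /* check inputs */
    m = A->m ; n = A->n ; Ap = A->p ; Ai = A->i ; Ax = A->x ;
    w = cs_malloc (m, sizeof (int)) ;           /* get workspace */
    if (!w) return (0) ;                        /* out of memory */
    for (i = 0 ; i < m ; i++) w [i] = -1 ;      /* row i not yet seen */
    for (j = 0 ; j < n ; j++)
    {
        q = nz ;                                /* column j will start at q */
        for (p = Ap [j] ; p < Ap [j+1] ; p++)
        {
            i = Ai [p] ;                        /* A(i,j) is nonzero */
            if (w [i] >= q)                     /* already seen in column j */
            {
                Ax [w [i]] += Ax [p] ;          /* sum the duplicate entry */
            }
            else                                /* keep A(i,j) */
            {
                w [i] = nz ;                    /* record where row i occurs */
                Ai [nz] = i ;
                Ax [nz++] = Ax [p] ;
            }
        }
        Ap [j] = q ;                            /* record start of column j */
    }
    Ap [n] = nz ;                               /* finalize A */
    cs_free (w) ;                               /* free workspace */
    return (cs_sprealloc (A, 0)) ;              /* remove extra space from A */
}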
The function uses a size-m integer workspace; w[i] records the location in Ai
and Ax of the most recent entry with row index i. If this position is within the
current column j, then it is a duplicate entry and must be summed. Otherwise, the
entry is kept and w[i] is updated to reflect the position of this entry.
Additional arguments can be passed to fkeep via the void * other parameter to
cs_fkeep. This is demonstrated by cs_droptol, which removes entries whose mag-
nitude is less than or equal to tol.
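As an illustration, cs_droptol can be expressed in just a few lines on top of cs_fkeep. The exact cs_fkeep prototype is assumed here (its callback receives the row index, column index, numerical value, and the other pointer, and returns nonzero to keep an entry); treat this as a sketch rather than the book's listing:

#include <math.h>

static int cs_tol (int i, int j, double aij, void *tol)
{
    return (fabs (aij) > *((double *) tol)) ;   /* keep entries larger than tol */
}

int cs_droptol (cs *A, double tol)
{
    return (cs_fkeep (A, &cs_tol, &tol)) ;      /* drop entries with |aij| <= tol */
}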
Theorem 2.1 (Gilbert [101]). The nonzero pattern of C_{*j} is the set union of the
nonzero patterns of A_{*i} for all i for which b_ij is nonzero. If C_j, A_i, and B_j denote
the sets of row indices of nonzero entries in C_{*j}, A_{*i}, and B_{*j}, then
A matrix multiplication algorithm must compute both C_{*j} and C_j. Note that
(2.3) is correct only if numerical cancellation is ignored. It is implemented with
cs_scatter and cs_multiply below. A dense vector x is used to construct C_{*j}.
The set C_j is stored directly in C, but another work vector w is needed to determine
if a given row index i is in the set already. The vector w starts out cleared. When
computing column j, w[i]<j+1 will denote a row index i that is not yet in C_j.
When i is inserted in C_j, w[i] is set to j+1. The cs_scatter function computes one
iteration of (2.2) and (2.3) for a single value of i, using a scatter operation to copy
a sparse vector into a dense one. The matrix multiplication function cs_multiply
first allocates the w and x workspace and the output matrix C. Next, it iterates over
each column j of the result C. After a series of scatter operations, the dense vector
x is gathered into a sparse vector (a column of C). Since the number of nonzeros in
C is not known at the beginning, it is increased in size as needed.
Computing nnz (A*B) is actually much harder than computing nnz (chol (A) ).
The latter is discussed in Chapter 4. An alternate approach that computes nnz(A*B)
in an initial pass and then C=A*B in a second pass is left as an exercise (Prob-
lem 2.20).
int cs_scatter (const cs *A, int j, double beta, int *w, double *x, int mark,
cs *C, int nz)
{
int i, p, *Ap, *Ai, *Ci ;
double *Ax ;
if (!CS_CSC (A) || !w || !CS_CSC (C)) return (-1) ; /* check inputs */
Ap = A->p ; Ai = A->i ; Ax = A->x ; Ci = C->i ;
for (p = Ap [j] ; p < Ap [j+1] ; p++)
{
i = Ai [p] ; /* A(i,j) is nonzero */
if (w [i] < mark)
{
w [i] = mark ; /* i is new entry in column j */
Ci [nz++] = i ; /* add i to pattern of C(:,j) */
if (x) x [i] = beta * Ax [p] ; /* x(i) = beta*A(i,j) */
}
else if (x) x [i] += beta * Ax [p] ; /* i exists in C(:,j) already */
}
return (nz) ;
}
position nz. The new value of nz is returned. Row index i is in the pattern of x if
w[i] is equal to mark.
The time taken by cs_multiply is O(n + f + |B|), where f is the number of
floating-point operations performed (f dominates the run time unless A has one or
more columns with no entries, in which case either n or |B| can be greater than f).
If the columns of C need to be sorted, either C = ((AB)^T)^T or C = (B^T A^T)^T can
be computed. The latter is better if C has many more entries than A or B. The
MATLAB equivalent C=A*B uses a similar algorithm to the one presented here.
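A usage sketch (hypothetical helper, assuming a CSparse header cs.h): form C = A*B and then sort its columns by transposing twice, as suggested above.

#include "cs.h"

static cs *multiply_sorted (const cs *A, const cs *B)
{
    cs *C, *CT, *D ;
    C = cs_multiply (A, B) ;        /* columns of C are not sorted */
    if (!C) return (NULL) ;
    CT = cs_transpose (C, 1) ;      /* CT = C', constructed with sorted columns */
    cs_spfree (C) ;
    if (!CT) return (NULL) ;
    D = cs_transpose (CT, 1) ;      /* D = (C')' = C, now with sorted columns */
    cs_spfree (CT) ;
    return (D) ;
}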
int cs_pvec (const int *p, const double *b, double *x, int n)
{
int k ;
if (!x || !b) return (0) ; /* check inputs */
for (k = 0 ; k < n ; k++) x [k] = b [p ? p [k] : k] ;
return (1) ;
}
int cs_ipvec (const int *p, const double *b, double *x, int n)
{
int k ;
if (!x || !b) return (0) ; /* check inputs */
for (k = 0 ; k < n ; k++) x [p ? p [k] : k] = b [k] ;
return (1) ;
}
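A small sketch (hypothetical demo) shows the relationship between the two routines: cs_pvec computes x = b(p) and cs_ipvec computes x(p) = b in MATLAB notation, so applying one after the other recovers the original vector.

#include "cs.h"

static void pvec_demo (void)
{
    int    p [4] = { 2, 0, 3, 1 } ;             /* a permutation vector */
    double b [4] = { 10, 20, 30, 40 } ;
    double x [4], y [4] ;
    cs_pvec  (p, b, x, 4) ;     /* x = b(p) = { 30, 10, 40, 20 } */
    cs_ipvec (p, x, y, 4) ;     /* y(p) = x, so y = b once again */
}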
CSparse functions that operate on symmetric matrices use just the upper
triangular part, just like chol in MATLAB. If A is symmetric with only the upper
triangular part stored, C=A(p,p) is not upper triangular. The cs_symperm function
computes C=A(p,p) for a symmetric matrix A whose upper triangular part is stored,
returning C in the same format. Entries below the diagonal are ignored.
The first for j loop counts how many entries are in each column of C. Suppose
i ≤ j, and A(i,j) is permuted to become entry C(i2,j2). If i2 ≤ j2, this entry
is in the upper triangular part of C. Otherwise, C(i2,j2) is in the lower triangular
part of C, and the entry must be placed in C as C(j2,i2) instead. After the column
counts of C are computed (in w), the cumulative sum is computed to obtain the
column pointers Cp. The second for loop constructs C, much like cs_permute.
cs *cs_symperm (const cs *A, const int *pinv, int values)
{
int i, j, p, q, i2, j2, n, *Ap, *Ai, *Cp, *Ci, *w ;
double *Cx, *Ax ;
cs *C ;
if (!CS_CSC (A)) return (NULL) ; /* check inputs */
n = A->n ; Ap = A->p ; Ai = A->i ; Ax = A->x ;
C = cs_spalloc (n, n, Ap [n], values && (Ax != NULL), 0) ; /* alloc result*/
w = cs_calloc (n, sizeof (int)) ; /* get workspace */
if (!C || !w) return (cs_done (C, w, NULL, 0)) ; /* out of memory */
Exercises
2.1. Write a cs_gatxpy function that computes y = A^T x + y without forming A^T.
2.2. Write a function cs_f ind that converts a cs matrix into a triplet-form matrix,
like the find function in MATLAB.
2.3. Write a variant of cs_gaxpy that computes y = Ax + y, where A is a symmetric
matrix with only the upper triangular part present. Ignore entries in the lower
triangular part.
2.4. Write a function with prototype void cs_scale(cs *A, double *r, double
*c) that overwrites A with RAC, where R and C are diagonal matrices; r [k]
and c [k] are the kth diagonal entries of R and C, respectively.
2.5. Write a function similar to cs_entry that adds a dense submatrix to a triplet
4 www.cse.clrc.ac.uk/nag/hb
5 math.nist.gov/MatrixMarket
6 www.cise.ufl.edu/research/sparse/matrices; see also www.siam.org/books/fa02
7 www.cse.clrc.ac.uk/nag/hsl
8 www.boeing.com/phantom/bcslib-ext
where L_22 is the lower right (n−1)-by-(n−1) submatrix of L; l_21, x_2, and b_2 are
column vectors of length n−1; and l_11, x_1, and b_1 are scalars. This leads to two
equations, l_11 x_1 = b_1 and l_21 x_1 + L_22 x_2 = b_2. To solve Lx = b, the first can be
solved (x_1 = b_1 / l_11) to obtain the first entry in x. The second equation is a lower
triangular system of the form L_22 x_2 = b_2 − l_21 x_1 that can be solved recursively for
x_2. Unwinding the tail recursion leads naturally to an algorithm that iterates over
the columns of L. Note that b_1 and b_2 are used just once; this allows x to overwrite
b in the implementation:
If x is a dense vector but L is sparse, the algorithm and code are very similar to the
matrix-vector multiplication, cs_gaxpy. On input, x contains the right-hand side
b; on output, it contains the solution to Lx = b.
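The listing itself is on a page not reproduced above. The sketch below is a hedged reconstruction of a column-oriented lower triangular solver in the style of cs_usolve in the next section (in CSparse this routine is cs_lsolve); it assumes the diagonal entry of L is present and appears first in each column:

int cs_lsolve (const cs *L, double *x)
{
    int p, j, n, *Lp, *Li ;
    double *Lx ;
    if (!CS_CSC (L) || !x) return (0) ;     /* check inputs */
    n = L->n ; Lp = L->p ; Li = L->i ; Lx = L->x ;
    for (j = 0 ; j < n ; j++)
    {
        x [j] /= Lx [Lp [j]] ;              /* x(j) /= L(j,j) */
        for (p = Lp [j]+1 ; p < Lp [j+1] ; p++)
        {
            x [Li [p]] -= Lx [p] * x [j] ;  /* x(i) -= L(i,j) * x(j) */
        }
    }
    return (1) ;
}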
where U_11 is (n−1)-by-(n−1). This results in the two equations U_11 x_1 + u_12 x_2 = b_1
and u_22 x_2 = b_2. The second equation can be solved for x_2 = b_2 / u_22, and the first
becomes U_11 x_1 = b_1 − u_12 x_2. These equations are encapsulated in the function
cs_usolve. It assumes the diagonal entry is always present and appears as the last
entry in each column. Row indices in the columns of U can otherwise be in any
order.
int cs_usolve (const cs *U, double *x)
{
int p, j, n, *Up, *Ui ;
double *Ux ;
if (!CS_CSC (U) || !x) return (0) ; /* check inputs */
n = U->n ; Up = U->p ; Ui = U->i ; Ux = U->x ;
for (j = n-1 ; j >= 0 ; j--)
{
x [j] /= Ux [Up [j+1]-1] ;
for (p = Up [j] ; p < Up [j+1]-1 ; p++)
{
x [Ui [p]] -= Ux [p] * x [j] ;
}
}
return (1) ;
}
x = b
for j = 0 to n−1 do
    if x_j ≠ 0
        for each i > j for which l_ij ≠ 0 do
            x_i = x_i − l_ij x_j
The sparse vector x can be temporarily stored in a dense vector of size n, assumed
to be initially zero. Thus, the two statements x = b and x_i = x_i − l_ij x_j can be done
efficiently. If this algorithm is implemented as in the above pseudocode, the time
taken would be O(n + |b| + f), where f is the number of floating-point operations
performed and |b| is the number of nonzeros in b. Normally, |b| < f, so the time
is O(n + f). This looks efficient, but it is not. The floating-point operation count
can easily be dominated by n. If b is all zero except for b_n, f is O(1), but the total
work is O(n). Basing an LU factorization algorithm on this method for solving
Lx = b would lead to an O(n^2)-time factorization, which is clearly unacceptable.
Factorizing a tridiagonal matrix should take O(n) time, not O(n^2) time.
The problem is the for j loop. A better method would assume that the algo-
rithm starts with a list of indices j for which x_j will be nonzero, X = {j | x_j ≠ 0},
sorted in ascending order. The algorithm would then be

    x = b
    for each j ∈ X do
        for each i > j for which l_ij ≠ 0 do
            x_i = x_i − l_ij x_j

Assuming X is already given, the run time drops to O(|b| + f), which is essentially
O(f), an ideal target.
The problem now becomes how to determine X and how to sort it. Entries in
x become nonzero in two places, the first and last lines of the above pseudocode.
If numerical cancellation is neglected, these two statements can be written as two
logical implications:

    b_i ≠ 0  ⇒  x_i ≠ 0
    x_j ≠ 0  and  l_ij ≠ 0  ⇒  x_i ≠ 0

These two rules can be expressed as a graph traversal problem. Consider a directed
graph G_L = (V, E), where V = {1, ..., n} and E = {(j, i) | l_ij ≠ 0} (note that this is
actually the graph of L^T). The graph is acyclic. If marked nodes in G_L correspond
to nonzero entries in x, rule one translates into marking all those nodes i ∈ B, where
B = {i | b_i ≠ 0}. Rule two states that if node j is marked, and there is an edge from
node j to node i, then node i must be marked. The set X becomes the set of all
nodes in G_L that can be reached via a path from one or more nodes in B. These
rules are illustrated in Figure 3.1. In graph terminology, X = Reach_{G_L}(B), or more
simply X = Reach_L(B), to avoid double subscripts. This gives a formal proof of
the following theorem.
Theorem 3.1 (Gilbert and Peierls [109]). Define the directed graph G_L = (V, E)
with nodes V = {1, ..., n} and edges E = {(j, i) | l_ij ≠ 0}. Let Reach_L(i) denote
the set of nodes reachable from node i via paths in G_L, and let Reach_L(B), for a
set B, be the set of all nodes reachable from any node in B. The nonzero pattern
X = {j | x_j ≠ 0} of the solution x to the sparse linear system Lx = b is given by
X = Reach_L(B), where B = {i | b_i ≠ 0}, assuming no numerical cancellation.
The set X can be computed by a depth-first search of the directed graph G_L,
starting at nodes in B. The time taken by a depth-first search is proportional to
the number of edges traversed, plus the number of initial nodes in B. Each edge
reflects exactly two floating-point operations in the numerical solution to Lx = b,
so the total time is thus O(|b| + f). A depth-first search does not sort the set X,
however. Fortunately, the update x_i = x_i − l_ij x_j can be computed as soon as x_j
is known. This update translates into two nodes j and i in X with an edge from
j to i in the directed graph G_L. An ordering of X that preserves this precedence
is called a topological order, and a depth-first search can compute X in topological
order (a breadth-first search cannot).
A depth-first search is most easily written as a recursive algorithm, stated
in pseudocode below. The reach function computes X = Reach_L(B) by starting a
depth-first search at each node i ∈ B.
function X = reach (L, B)
assume all nodes are unmarked
for each i for which b_i ≠ 0 do
    if node i is unmarked
        dfs (i)
int reachr (const cs *L, const cs *B, int *xi, int *w)
{
int p, n = L->n ;
int top = n ; /* stack is empty */
for (p = B->p [0] ; p < B->p [1] ; p++) /* for each i in pattern of b */
{
if (w [B->i [p]] != 1) /* if i is unmarked */
{
dfsr (B->i [p], L, &top, xi, w) ; /* start a dfs at i */
}
}
return (top) ; /* return top of stack */
}
void dfsr (int j, const cs *L, int *top, int *xi, int *w)
{
int p ;
w [j] = 1 ; /* mark node j */
for (p = L->p [j] ; p < L->p [j+1] ; p++) /* for each i in L ( : , j ) */
{
if (w [L->i [p]] != 1) /* if i is unmarked */
{
dfsr (L->i [p], L, top, xi, w) ; /* start a dfs at i */
}
}
xi [--(*top)] = j ; /* push j onto the stack */
}
int cs_reach (cs *G, const cs *B, int k, int *xi, const int *pinv)
{
int p, n, top, *Bp, *Bi, *Gp ;
if (!CS_CSC (G) || !CS_CSC (B) || !xi) return (-1) ; /* check inputs */
n = G->n ; Bp = B->p ; Bi = B->i ; Gp = G->p ;
top = n ;
for (p = Bp [k] ; p < Bp [k+1] ; p++)
{
if (!CS_MARKED (Gp, Bi [p])) /* start a dfs at unmarked node i */
{
top = cs_dfs (Bi [p], G, top, xi, xi+n, pinv) ;
}
}
for (p = top ; p < n ; p++) CS_MARK (Gp, xi [p]) ; /* restore G */
return (top) ;
}
int cs_dfs (int j, cs *G, int top, int *xi, int *pstack, const int *pinv)
{
int i, p, p2, done, jnew, head = 0, *Gp, *Gi ;
if (!CS_CSC (G) || !xi || !pstack) return (-1) ; /* check inputs */
Gp = G->p ; Gi = G->i ;
xi [0] = j ; /* initialize the recursion stack */
while (head >= 0)
{
j = xi [head] ; /* get j from the top of the recursion stack */
jnew = pinv ? (pinv [j]) : j ;
if (!CS_MARKED (Gp, j))
{
CS_MARK (Gp, j) ; /* mark node j as visited */
pstack [head] = (jnew < 0) ? 0 : CS_UNFLIP (Gp [jnew]) ;
}
done = 1 ; /* node j done if no unvisited neighbors */
p2 = (jnew < 0) ? 0 : CS_UNFLIP (Gp [jnew+1]) ;
for (p = pstack [head] ; p < p2 ; p++) /* examine all neighbors of j */
{
i = Gi [p] ; /* consider neighbor node i */
if (CS_MARKED (Gp, i)) continue ; /* skip visited node i */
pstack [head] = p ; /* pause depth-first search of node j */
xi [++head] = i ; /* start dfs at node i */
done = 0 ; /* node j is not done */
break ; /* break, to start dfs (i) */
}
if (done) /* depth-first search at node j is done */
{
head-- ; /* remove j from the recursion stack */
xi [--top] = j ; /* and place in the output stack */
}
}
return (top) ;
}
The cs_dfs function starts by placing j in the recursion stack at xi [0].
iteration of the while loop starts, or continues, the jth instance of cs_dfs. If j
is on the recursion stack and it is not marked, then this is the first time it has
been visited. In this case, the node is marked, and pstack [head] is set to point to
the first outgoing edge of node j. If an unmarked node i is found, it is placed on
the recursion stack, and the iteration for node j is paused. The next while loop
iteration will then start the depth-first search for node i. When the depth-first
search for node j eventually finishes, it is removed from the recursion stack and
placed in the output stack.
The cs_reach function is nearly identical to reachr. It computes X =
Reach_G(B_k), where B_k is the nonzero pattern of column k of B.
With cs_reach defined, solving Lx = b, where L, x, and b are all sparse,
becomes a straightforward translation of the pseudocode. The cs_spsolve function
computes the solution to Lx = b_k (if lo is nonzero), where b_k is the kth column of
B. When lo is nonzero, the function assumes G = L is lower triangular with the
diagonal entry as the first entry in each column. It takes an optimal O(|b| + f) time.
Solving an upper triangular system Ux = b is almost identical to solving Lx = b.
Its derivation is left as an exercise. With lo equal to zero, the cs_spsolve function
assumes G = U is upper triangular with the diagonal entry as the last entry in each
column.
int cs_spsolve (cs *G, const cs *B, int k, int *xi, double *x, const int *pinv,
int lo)
{
int j, J, p, q, px, top, n, *Gp, *Gi, *Bp, *Bi ;
double *Gx, *Bx ;
if (!CS_CSC (G) || !CS_CSC (B) || !xi || !x) return (-1) ;
Gp = G->p ; Gi = G->i ; Gx = G->x ; n = G->n ;
Bp = B->p ; Bi = B->i ; Bx = B->x ;
top = cs_reach (G, B, k, xi, pinv) ; /* xi[top..n-l]=Reach(B(:,k)) */
for (p = top ; p < n ; p++) x [xi [p]] = 0 ; /* clear x */
for (p = Bp [k] ; p < Bp [k+1] ; p++) x [Bi [p]] = Bx [p] ; /* scatter B */
for (px = top ; px < n ; px++)
{
j = xi [px] ; /* x(j) is nonzero */
J = pinv ? (pinv [j]) : j ; /* j maps to col J of G */
if (J < 0) continue ; /* column J is empty */
x [j] /= Gx [lo ? (Gp [J]) : (Gp [J+1]-1)] ;/* x(j) /= G(j,j) */
p = lo ? (Gp [J]+1) : (Gp [J]) ; /* lo: L(j,j) 1st entry */
q = lo ? (Gp [J+1]) : (Gp [J+1]-1) ; /* up: U(j,j) last entry */
for ( ; p < q ; p++)
{
x [Gi [p]] -= Gx [p] * x [j] ; /* x(i) -= G ( i , j ) * x ( j ) */
}
}
return (top) ; /* return top of stack */
}
The function returns the nonzero pattern X in xi [top] through xi [n-1] ; xi is an
array of size 2*n. The first n entries of xi hold the output stack and the recursion
stack for j. The second n entries hold the stack for p in cs_dfs. The numerical values
are in the dense vector x, which need not be initialized on input. To solve Lx = b,
a NULL pointer must be passed for pinv, and lo must be nonzero.
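A usage sketch (hypothetical helper, assuming a CSparse header cs.h): solve Lx = b for a sparse right-hand side stored as column 0 of B, then print the nonzero entries of the solution.

#include <stdio.h>
#include "cs.h"

static void spsolve_demo (cs *L, const cs *B)
{
    int p, top, n = L->n ;
    int *xi = cs_malloc (2*n, sizeof (int)) ;       /* workspace of size 2*n */
    double *x = cs_malloc (n, sizeof (double)) ;    /* dense vector for the result */
    if (!xi || !x) { cs_free (xi) ; cs_free (x) ; return ; }
    top = cs_spsolve (L, B, 0, xi, x, NULL, 1) ;    /* lo=1: L is lower triangular */
    for (p = top ; p < n ; p++)
    {
        printf ("x(%d) = %g\n", xi [p], x [xi [p]]) ;   /* nonzero entries of x */
    }
    cs_free (xi) ; cs_free (x) ;
}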
Exercises
3.1. Derive the algorithm used by cs_utsolve.
3.2. Try to describe an algorithm for solving Lx = b, where L is stored in triplet
form, and x and b are dense vectors. What goes wrong?
Chapter 4
Cholesky factorization
Theorem 4.1 (Rose, Tarjan, and Lueker [175]). The edge (i, j) is in the undirected
graph G_{L+L^T} of L+L^T if and only if there exists a path i ⇝ j in the undirected graph
of A where all nodes in the path except i and j are numbered less than min(i, j).
where L_11 and A_11 are (n−1)-by-(n−1). The three equations that lead to the
up-looking Cholesky factorization algorithm are L_11 L_11^T = A_11, L_11 l_12 = a_12, and
l_12^T l_12 + l_22^2 = a_22. The first equation can be solved recursively to obtain L_11,
followed by a sparse triangular solve using the second equation to compute l_12.
Finally, a sparse dot product and scalar square root, l_22 = sqrt(a_22 − l_12^T l_12), result in
l_22. If the matrix is positive definite, then a_22 > l_12^T l_12 holds, and the Cholesky
Figure 4.1. Pruning the directed graph G_L yields the elimination tree T
Theorem 4.3 (Parter [165]). For a Cholesky factorization LL^T = A, and neglect-
ing numerical cancellation, i < j < k and l_ji ≠ 0 and l_ki ≠ 0 imply l_kj ≠ 0. That is, if both
l_ji and l_ki are nonzero where i < j < k, then l_kj will be nonzero as well.
Since there is a path from i to k via j that does not traverse the edge (i, k),
the edge (i, k) is not needed to compute Reach(i). The set Reach(t) for any other
node t < i with a path t ⇝ i is also not affected if (i, k) is removed from the directed
graph G_L. This removal of edges leaves at most one outgoing edge from node i in
the pruned graph, all the while not affecting Reach(i). If j > i is the least numbered
node for which l_ji ≠ 0, all other nonzeros l_ki where k > j are redundant.
The result is the elimination tree. The parent of node i in the tree is j, where
the first off-diagonal nonzero in column i has row index j (the smallest j > i for
which l_ji ≠ 0). Node i is a root of the tree if column i has no off-diagonal nonzero
entries; it has no parent. The tree may actually be a forest, with multiple roots, if
the graph of A consists of multiple connected components (there will be one tree
per component of A). By convention, it is still called a tree. Assume the edges of
the tree are directed, from a child to its parent. Let T denote the elimination tree
of L, and let T_k denote the elimination tree of the submatrix L_{1..k,1..k}, the first k rows
and columns of L. An example matrix A, its Cholesky factor L, and its elimination
tree T are shown in Figure 4.2. In the factor L, fill-in (entries that appear in L but
not in A) are shown as circled x's. Diagonal entries are numbered for easy reference.
The existence of the elimination tree has been shown; it is now necessary to
compute it. A few more theorems are required.
Proof. Refer to Figure 4.3. The proof is by induction on i for a fixed k. Let
j = min {j | l_ji ≠ 0 and j > i} be the parent of i. The parent j > i must exist because
l_ki ≠ 0. For the base case, if k = j, then k is the parent of i and thus i ⇝ k is a
path in T. For the inductive step, k > j > i must hold, and there are thus two
nonzero entries l_ki and l_ji. From Theorem 4.3, l_kj ≠ 0. By induction, l_kj ≠ 0 implies
the path j ⇝ k exists in T. Combined with the edge (i, j), this means there is a
path i ⇝ k in T.
Removing edges from the directed graph G_L to obtain the elimination tree T
does not affect the reach of any node. The result is the following theorem.
Theorem 4.5 (Liu [148]). The nonzero pattern ℒ_k of the kth row of L is given by

Theorem 4.5 defines ℒ_k. The kth row subtree, denoted T^k, is the subtree of T
induced by the nodes in ℒ_k. The 11 row subtrees T^1, ..., T^11 of the matrix shown
in Figure 4.2 are shown in Figure 4.4. Each row subtree is characterized by its
leaves, which correspond to entries in A. This fact is summarized by the following
theorem.
Theorem 4.6 (Liu [148]). Node j is a leaf of T^k if and only if both a_jk ≠ 0 and
a_ik = 0 for every descendant i of j in the elimination tree T.
Corollary 4.7 (Liu [148]). For a Cholesky factorization LL^T = A, and neglecting
numerical cancellation, a_ki ≠ 0 and k > i imply that i is a descendant of k in the
elimination tree T; equivalently, i ⇝ k is a path in T.
Theorem 4.4 and Corollary 4.7 lead to an algorithm that computes the elimi-
nation tree T in almost O(|A|) time. Suppose T_{k−1} is known. This tree is a subset
of T_k. To compute T_k from T_{k−1}, the children of k (which are root nodes in T_{k−1})
must be found. Since a_ki ≠ 0 implies the path i ⇝ k exists in T, this path can be
traversed in T_{k−1} until reaching a root node. This node must be a child of k, since
the path i ⇝ k must exist.
To speed up the traversal of the partially constructed elimination tree T_{k−1},
a set of ancestors is kept. The ancestor of i, ideally, would simply be the root r
of the partially constructed tree T_{k−1} that contains i. Traversing the path i ⇝ r
would take O(1) time, simply by finding the ancestor r of i. This goal can nearly be
met by a disjoint-set-union data structure. An optimal one would result in a total
time complexity of O(|A| α(|A|, n)) for the |A| path traversals that need to be made,
where α(|A|, n) is the inverse Ackermann function, a very slowly growing function.
However, a simpler method is used that leads to an O(|A| log n) time algorithm. The
log n upper bound is never reached in practice, however, and the resulting algorithm
takes practically O(|A|) time and is faster (in practice, not asymptotically) than the
O(|A| α(|A|, n))-time disjoint-set-union algorithm. The time complexity of cs_etree
is called nearly O(|A|) time.
The cs_etree function computes the elimination tree of the Cholesky factor-
ization of A (assuming ata is false), using just A and returning the int array parent
of size n. It iterates over each column k and considers every entry a_ik in the upper
triangular part of A. It updates the tree, following the path from i to the root of
the tree. Rather than following the path via the parent array, an array ancestor
is kept, where ancestor [i] is the highest known ancestor of i, not necessarily the
root of the tree in T_{k−1} containing i. If r is a root, it has no ancestor (ancestor [r]
is -1). Since the path is guaranteed to lead to node k in T_k, the ancestors of all
nodes along this path are set to k (path compression). If a root node is reached in
T_{k−1} that is not k, it must be a child of k in T_k; parent is updated to reflect this.
If the input parameter ata is true, cs_etree computes the elimination tree
of A^T A without forming A^T A. This is the column elimination tree. It will be
used in the QR and LU factorization algorithms in Chapters 5 and 6. Row i of A
creates a dense submatrix, or clique, in the graph of A^T A. Rather than using the
graph of A^T A (with one node corresponding to each column of A), a new graph
is constructed dynamically (also with one node per column of A). If the nonzero
pattern of row i contains column indices j_1, j_2, j_3, j_4, ..., the new graph is given
edges (j_1, j_2), (j_2, j_3), (j_3, j_4), and so on. Each row i creates a path in this new
graph. In the tree, these edges ensure j_1 ⇝ j_2 ⇝ j_3 ⇝ j_4 ... is a path in T.
The clique in A^T A has edges between all nodes j_1, j_2, j_3, j_4, ... and will have the
same ancestor/descendant relationship. Thus, the elimination tree of A^T A and this
new graph will be the same. The path is constructed dynamically as the algorithm
progresses, using the prev array. prev[i] starts out equal to -1 for all i. Let
A_k be the nonzero pattern of A(:,k). When column k is considered, the edge
(prev[i], k) is created for each i ∈ A_k. This edge is used to update the elimination
tree, traversing from prev[i] up to k in the tree. After this traversal, prev[i] is
set to k, to prepare for the next edge in this row i, for a subsequent column in the
outer for k loop.
int *cs_etree (const cs *A, int ata)
{
int i, k, p, m, n, inext, *Ap, *Ai, *w, *parent, *ancestor, *prev ;
if (!CS_CSC (A)) return (NULL) ; /* check inputs */
m = A->m ; n = A->n ; Ap = A->p ; Ai = A->i ;
parent = cs_malloc (n, sizeof (int)) ; /* allocate result */
w = cs_malloc (n + (ata ? m : 0), sizeof (int)) ; /* get workspace */
if (!w || !parent) return (cs_idone (parent, NULL, w, 0)) ;
ancestor = w ; prev = w + n ;
if (ata) for (i = 0 ; i < m ; i++) prev [i] = -1 ;
for (k = 0 ; k < n ; k++)
{
parent [k] = -1 ; /* node k has no parent yet */
ancestor [k] = -1 ; /* nor does k have an ancestor */
for (p = Ap [k] ; p < Ap [k+1] ; p++)
{
i = ata ? (prev [Ai [p]]) : (Ai [p]) ;
for ( ; i != -1 && i < k ; i = inext) /* traverse from i to k */
{
inext = ancestor [i] ; /* inext = ancestor of i */
ancestor [i] = k ; /* path compression */
if (inext == -1) parent [i] = k ; /* no anc., parent is k */
}
if (ata) prev [Ai [p]] = k ;
}
}
return (cs_idone (parent, NULL, w, 1)) ;
}
The cs_idone function used by cs_etree returns an int array and frees any
workspace.
int *cs_idone (int *p, cs *C, void *w, int ok)
{
cs_spfree (C) ; /* free temporary matrix */
cs_free (w) ; /* free workspace */
return (ok ? p : cs_free (p)) ; /* return result if OK, else free it */
}
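A usage sketch (hypothetical helper): compute the elimination tree of a matrix whose upper triangular part is stored, and print the parent of each node.

#include <stdio.h>
#include "cs.h"

static void etree_demo (const cs *A)
{
    int k, *parent = cs_etree (A, 0) ;  /* ata=0: elimination tree of A itself */
    if (!parent) return ;               /* out of memory or invalid input */
    for (k = 0 ; k < A->n ; k++)
    {
        printf ("parent of %d: %d\n", k, parent [k]) ;  /* -1 denotes a root */
    }
    cs_free (parent) ;
}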
The total time taken by the algorithm is O(|ℒ_k|), the number of nonzeros in
row k of L. This is much faster in general than the cs_reach function, which computes
the corresponding reach for an arbitrary lower triangular L. Solving Lx = b is an integrated part
of the up-looking Cholesky factorization; a stand-alone Lx = b solver when L is a
Cholesky factor is left as an exercise.
The cs_ereach function can be used to construct the elimination tree itself
by extending the tree one node at a time. The time taken would be O(|L|). As a
by-product, the entries in L are created one at a time. They can be kept to obtain
the nonzero pattern of L, or they can be counted and discarded to obtain a count of
nonzeros in each row and column of L. The latter function is done more efficiently
with cs_post and cs_counts, discussed in the next three sections.
MATLAB uses a code similar to cs_ereach in its sparse Cholesky factoriza-
tion methods (an up-looking sparse Cholesky, cholmod_rowfac, and a supernodal
symbolic factorization, cholmod_super_symbolic). However, it cannot use the tree
when computing x=L\b, for several reasons. The elimination tree is discarded after
L is computed. MATLAB does not keep track of how L was computed, and L may
be modified prior to using it in x=L\b. It may be an arbitrary sparse lower trian-
gular system, whose nonzero pattern is not governed by the tree. Numerically zero
entries are dropped from L, so even if L is not modified by the application, the tree
cannot be determined from the first off-diagonal entry in each column of L. For
these reasons, the MATLAB statement x=L\b determines only that L is sparse and
lower triangular (see Section 8.5) and uses an algorithm much like cs_lsolve.
Theorem 4.8 (Liu [150]). The filled graphs of A and PAPT are isomorphic if P
is a postordering of the elimination tree of A. Likewise, the elimination trees of A
and PAPT are isomorphic.
However, the depth of the elimination tree can easily be O(n), causing stack overflow
for large matrices. A nonrecursive implementation is better, as shown in the cs_post
function below.
int *cs_post (const int *parent, int n)
{
int j, k = 0, *post, *w, *head, *next, *stack ;
if (!parent) return (NULL) ; /* check inputs */
post = cs_malloc (n, sizeof (int)) ; /* allocate result */
w = cs_malloc (3*n, sizeof (int)) ; /* get workspace */
if (!w || !post) return (cs_idone (post, NULL, w, 0)) ;
head = w ; next = w + n ; stack = w + 2*n ;
for (j = 0 ; j < n ; j++) head [j] = -1 ; /* empty linked lists */
for (j = n-1 ; j >= 0 ; j--) /* traverse nodes in reverse order */
{
if (parent [j] == -1) continue ; /* j is a root */
next [j] = head [parent [j]] ; /* add j to list of its parent */
head [parent [j]] = j ;
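}
/* The remainder of cs_post is sketched here, following the description below;
   the exact CSparse code may differ.  A depth-first search is started at each
   root of the tree, using cs_tdfs, to build the postordering. */
for (j = 0 ; j < n ; j++)
{
if (parent [j] != -1) continue ; /* skip j if it is not a root */
k = cs_tdfs (j, k, head, next, post, stack) ;
}
return (cs_idone (post, NULL, w, 1)) ; /* success; free w, return post */
}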
First, workspace is allocated, and a set of n linked lists is initialized. The jth linked
list contains a list of all the children of node j in ascending order. Next, nodes j
from 0 to n-1 are traversed, corresponding to the for j loop in the postorder pseudo-
code. If j is a root, a depth-first search of the tree is performed, using cs_tdfs.
The cs_tdfs function places the root j on a stack. Each iteration of the while
loop considers the node p at the top of the stack. If it has no unordered children
left, it is removed from the stack and ordered as the kth node in the postordering.
Otherwise, its youngest unordered child i is removed from the head of the pth linked
list and placed on the stack. The next iteration of the while loop will commence
the depth-first search at this node i.
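A nonrecursive depth-first search matching this description can be sketched as
follows; the actual cs_tdfs in CSparse may differ in its details.

int cs_tdfs (int j, int k, int *head, const int *next, int *post, int *stack)
{
    int i, p, top = 0 ;
    if (!head || !next || !post || !stack) return (-1) ;   /* check inputs */
    stack [0] = j ;                 /* place the root j on the stack */
    while (top >= 0)                /* while the stack is not empty */
    {
        p = stack [top] ;           /* p = node at the top of the stack */
        i = head [p] ;              /* i = youngest unordered child of p */
        if (i == -1)
        {
            top-- ;                 /* p has no unordered children left */
            post [k++] = p ;        /* node p is the kth postordered node */
        }
        else
        {
            head [p] = next [i] ;   /* remove i from the children of p */
            stack [++top] = i ;     /* start a dfs at child node i */
        }
    }
    return (k) ;
}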
The cs_post function takes as input the elimination tree T represented as the
parent array of size n. The parent array is not modified. The function returns
a pointer to an integer vector post, of size n, that contains the postordering. The
total time taken by the postordering is O(n).
Figure 4.5 illustrates the matrix PAPT, its Cholesky factor, and its elimina-
tion tree, where P is the postordering of the elimination tree in Figure 4.2.
The MATLAB statement [parent,post] = etree(A) computes the elimi-
nation tree and its postordering, using the same algorithms (cholmod_etree and
cholmod_postorder). Node i is the kth node in the postordered tree if post(k)=i.
A matrix can be permuted with this postordering via C=A(post,post); the number
of nonzeros in chol(A) and chol(C) will be the same. Looking ahead, Chap-
ter 7 discusses how to find a fill-reducing ordering, p, where the permuted matrix
is A(p,p). This permutation vector p can be combined with the postordering of
A(p,p), using the following MATLAB code:
[parent, post] = etree (A (p,p)) ;
p = p (post) ;
row i is the number of nodes in the row subtree T^i, and the column counts can
be accumulated while traversing each node of the ith row subtree. The number
of nonzeros in column j of L is the number of row subtrees that contain node
j. However, this method requires O(|L|) time. The goal of this section and the
following one is to show how to compute the row and column counts in nearly
O(|A|) time.
To reduce the time complexity to nearly O(|A|), five concepts must be intro-
duced: (1) the least common ancestor of two nodes, (2) path decomposition, (3) the
first descendant of each node, (4) the level of a node in the elimination tree, and (5)
the skeleton matrix. The basic idea is to decompose each row subtree into a set of
disjoint paths, each starting with a leaf node and terminating at the least common
ancestor of the current leaf and the prior leaf node. The paths are not traversed
one node at a time. Instead, the lengths of these paths are found via the difference
in the levels of their starting and ending nodes, where the level of a node is its
distance from the root. The row count algorithm exploits the fact that all subtrees
are related to each other; they are all subtrees of the elimination tree.
The first step in the row count algorithm is to find the level and first descendant
of each node of the elimination tree. The first descendant of a node j is the smallest
postordering of any descendant of j. The first descendant and level of each node of
the tree can be easily computed in O(n) time by the firstdesc function below.
void firstdesc (int n, int *parent, int *post, int *first, int *level)
{
int len, i, k, r, s ;
for (i = 0 ; i < n ; i++) first [i] = -1 ;
for (k = 0 ; k < n ; k++)
{
i = post [k] ; /* node i of etree is kth postordered node */
len = 0 ; /* traverse from i towards the root */
for (r = i ; r != -1 && first [r] == -1 ; r = parent [r], len++)
first [r] = k ;
len += (r == -1) ? (-1) : level [r] ; /* root node or end of path */
for (s = i ; s != r ; s = parent [s]) level [s] = len-- ;
}
}
A node i whose first descendant has not yet been computed has first[i]
equal to -1. The function starts at the first node (k=0) in the postordered elimina-
tion tree and traverses up towards the root. All nodes r along this path from node
zero to the root have a first descendant of zero, and first [r]=k is set accordingly.
For k>0, the traversal can also terminate at a node r whose first [r] and level [r]
have already been determined. Once the path has been found, it is retraversed to
set the levels of each node along this path.
Once the first descendant and level of each node are found, the row subtree
is decomposed into disjoint paths. To do this, the leaves of the row subtrees must
be found. The entries corresponding to these leaves form the skeleton matrix Â; an
entry aij is defined to be in the skeleton matrix Â of A if node j is a leaf of the
ith row subtree. The nonzero patterns of the Cholesky factorization of the skeleton
matrix of A and the original matrix A are identical. If node j is a leaf of the ith row
subtree, aij must be nonzero, but the converse is not true. For example, consider
row 11 of the matrix A in Figure 4.2 and its corresponding row subtree T^11 in
Figure 4.4. The nonzero entries in row 11 of A are in columns 3, 5, 7, 8, 10, and
11, but only the first three are leaves of the 11th row subtree.
Suppose the matrix and the elimination tree are postordered. The first de-
scendant of each node determines the leaves of the row subtrees, using the following
skeleton function.
function skeleton
maxfirst[0...n-1] = -1
for j = 0 to n-1 do
    for each i > j for which aij ≠ 0
        if first[j] > maxfirst[i]
            node j is a leaf in the ith subtree
            maxfirst[i] = first[j]
The algorithm considers node j in all row subtrees i that contain node j, where j it-
erates from 0 to n-1. Let first[j] be the first descendant of node j in the elimina-
tion tree. Let maxfirst[i] be the largest first[j] seen so far for any nonzero aij in
the ith subtree. If first[j] is less than or equal to maxfirst[i], then node j must
have a descendant d < j in the ith row subtree, for which first[d]=maxfirst[i]
will equal or exceed first[j]. Node j is thus not a leaf of the ith row subtree.
If first[j] exceeds maxfirst[i], then node j has no descendant in the ith row
subtree, and node j is a leaf. The correctness of skeleton depends on Corollary 4.11
below.
Lemma 4.9. Let fj ≤ j denote the first descendant of j in a postordered tree. The
descendants of j are all nodes fj, fj+1, ..., j-1, j.
Theorem 4.10. Consider two nodes t < j in a postordered tree. Then either (1)
ft ≤ t < fj ≤ j and t is not a descendant of j, or (2) fj ≤ ft ≤ t < j and t is a
descendant of j.
Proof. The two cases of Theorem 4.10 are illustrated in Figure 4.6. A triangle
represents the subtree rooted at a node j, and a small circle represents fj. Case
(1): Node t is not a descendant of j if and only if t < fj, because of Lemma 4.9.
Case (2): If t is a descendant of j, then ft is also a descendant of j, and thus fj ≤ ft.
If fj ≤ ft, then all nodes ft through t must be descendants of j (Lemma 4.9). □

Figure 4.7. Postordered skeleton matrix, its factor, and its elimination tree
Corollary 4.11. Consider a node j in a postordered tree and any set of nodes S
where all nodes s ∈ S are numbered less than j. Let t be the node in S with the
largest first descendant ft. Node j has a descendant in S if and only if ft ≥ fj.
Figure 4.7 shows the postordered skeleton matrix of A, denoted Â, its factor L,
and its elimination tree (compare with Figure 4.5). Figure 4.8 shows the postordered
row subtrees (compare with Figure 4.4). Entry aij (where i > j) is present in the
skeleton matrix if and only if j is a leaf of the (postordered) ith subtree; they are
shown as dark circles (the corresponding entry aji is also shown in the upper triangular
part of Â). A white circle denotes an entry in A that is not in the skeleton matrix Â.
The leaves of the row subtree can be used to decompose the row subtree into
a set of disjoint paths in a process called path decomposition. Consider the first
(least numbered) leaf of a row subtree. The first disjoint path starts at this node
and leads all the way to the root. Consider two consecutive leaves of a row subtree,
jprev < j. The next disjoint path starts at j and ends at the child of the least
common ancestor of jprev and j. The least common ancestor of two nodes a and b
is the least numbered node that is an ancestor of both a and b and is denoted as
q = lca(a, b). In T^11, shown in Figure 4.9, the first path is from node 3 (the 2nd
node in the postordered tree) to the root node 11. The next path starts at node
5 (the 3rd node in the postordered tree) and terminates at node 5 (node 8 is the
least common ancestor of the two leaves 3 and 5). The third and last path starts at
node 7 and terminates at the child node 9 of the least common ancestor (node 10)
of nodes 5 and 7. Figure 4.9 shows the path decomposition of T^11 into these three
disjoint paths. Each node is labeled with its corresponding column index in A and
its postordering (node 8 is the 4th node in the postordered tree, for example).
Once the kth row subtree is decomposed into its disjoint paths, the kth row
count is computed as the sum of the lengths of these paths. An efficient method
for finding the least common ancestors of consecutive leaves jprev and j of the row
subtree is needed. Given the least common ancestor q of these two leaves, the length
of the path from j to q can be added to the row count (excluding q itself). The
lengths of the paths can be found by taking the difference of the starting and ending
nodes of each path.
Theorem 4.12. Assume that the elimination tree T is postordered. The least
common ancestor of two nodes a and b, where a < b, can be found by traversing the
path from a towards the root. The first node q ≥ b found along this path is the least
common ancestor of a and b.
The rowcnt function takes as input the matrix A, its elimination tree, and a
postordering of the elimination tree. It uses a disjoint-set-union data structure to
efficiently compute the least common ancestors of successive pairs of leaves of the
row subtrees. Since it is not actually part of CSparse, it does not check any out-of-
memory error conditions. Unlike cs_etree, the function uses the lower triangular
part of A only and omits an option for computing the row counts of the Cholesky
factor of ATA. The cs_leaf function determines if j is a leaf of the ith row subtree,
T^i. If it is, it computes the lca of jprev (the previous leaf found in T^i) and node j.
To compute q = lca(jprev, j) efficiently in cs_leaf, an ancestor of each node
is maintained, using a disjoint-set-union data structure. Initially, each node is in
its own set, and ancestor[i] = i for all nodes i. If a node i is the root of a set, it is
its own ancestor. For all other nodes i, traversing the ancestor tree and hitting a
root q determines the representative (q) of the set containing node i.
int *rowcnt (cs *A, int *parent, int *post) /* return rowcount [0..n-1] */
{
int i, j, k, len, s, p, jprev, q, n, sparent, jleaf, *Ap, *Ai, *maxfirst,
*ancestor, *prevleaf, *w, *first, *level, *rowcount ;
n = A->n ; Ap = A->p ; Ai = A->i ; /* get A */
w = cs_malloc (5*n, sizeof (int)) ; /* get workspace */
ancestor = w ; maxfirst = w+n ; prevleaf = w+2*n ; first = w+3*n ;
level = w+4*n ;
rowcount = cs_malloc (n, sizeof (int)) ; /* allocate result */
firstdesc (n, parent, post, first, level) ; /* find first and level */
for (i = 0 ; i < n ; i++)
{
rowcount [i] = 1 ; /* count the diagonal of L */
prevleaf [i] = -1 ; /* no previous leaf of the ith row subtree */
maxfirst [i] = -1 ; /* max first [j] for node j in ith subtree */
ancestor [i] = i ; /* every node is in its own set, by itself */
}
for (k = 0 ; k < n ; k++)
{
j = post [k] ; /* j is the kth node in the postordered etree */
for (p = Ap [j] ; p < Ap [j+1] ; p++)
{
i = Ai [p] ;
q = cs_leaf (i, j, first, maxfirst, prevleaf, ancestor, &jleaf) ;
if (jleaf) rowcount [i] += (level [j] - level [q]) ;
}
if (parent [j] != -1) ancestor [j] = parent [j] ;
}
cs_free (w) ;
return (rowcount) ;
}
int cs_leaf (int i, int j, const int *first, int *maxfirst, int *prevleaf,
int *ancestor, int *jleaf)
{
int q, s, sparent, jprev ;
if (!first || !maxfirst || !prevleaf || !ancestor || !jleaf) return (-1) ;
*jleaf = 0 ;
if (i <= j || first [j] <= maxfirst [i]) return (-1) ; /* j not a leaf */
maxfirst [i] = first [j] ; /* update max first [j] seen so far */
jprev = prevleaf [i] ; /* jprev = previous leaf of ith subtree */
prevleaf [i] = j ;
*jleaf = (jprev == -1) ? 1 : 2 ; /* j is first or subsequent leaf */
if (*jleaf == 1) return (i) ; /* if 1st leaf, q = root of ith subtree */
for (q = jprev ; q != ancestor [q] ; q = ancestor [q]) ;
for (s = jprev ; s != q ; s = sparent)
{
sparent = ancestor [s] ; /* path compression */
ancestor [s] = q ;
}
return (q) ; /* q = least common ancestor (jprev,j) */
}
one at a time, where j iterates from 0 to n-1. For each node j, all row subtrees i
that contain it are considered (all row indices i corresponding to nonzero entries aij,
where i > j). Since jprev and j are leaves of the same row subtree, they will have a
least common ancestor q that is greater than j and which will be the representative
of the set containing jprev. Traversing from node jprev towards the root determines
node q. After this path is traversed, it is compressed to speed up any remaining path
traversals. After all row subtrees containing node j are considered, it is merged into
the set corresponding to its parent. Assuming the elimination tree is connected, no
nodes 0 to j are now root nodes of any set. This ensures that traversing a path in
the ancestor tree will find the least common ancestor for subsequent nodes j (see
Theorem 4.12).
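As a usage sketch (error checks omitted; the variable names are illustrative), the
row counts of the Cholesky factor of a matrix A in CSparse form can be obtained
by combining the functions above:

int n = A->n ;
int *parent = cs_etree (A, 0) ;             /* elimination tree of A */
int *post = cs_post (parent, n) ;           /* postorder the elimination tree */
int *rowcount = rowcnt (A, parent, post) ;  /* rowcount [i] = nnz (L (i,:)) */
cs_free (post) ;
cs_free (parent) ;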
Theorem 4.13 (George and Liu [89]). If Lj denotes the nonzero pattern of the
jth column of L, and Aj denotes the nonzero pattern of the strictly lower triangular
part of the jth column of A, then

    Lj = Aj ∪ {j} ∪ ⋃ { Ls \ {s} : j = parent(s) }        (4.3)

Proof. Refer to Figure 4.10. Consider any descendant d of j and any row i ∈ Ld.
That is, lid ≠ 0 and the path d ⇝ j exists in T. Theorem 4.5 states that the nonzero
pattern of row i is given by the ith row subtree, T^i. Thus, the path d ⇝ s → j
exists in T^i for some s, and row index i is present in Lj and in Ls of the child s of
j (also true if d = s). To construct the nonzero pattern of column j (Lj), only Ls
of the children s of j need to be considered. Likewise, there can be no i ∈ Lj not
accounted for by (4.3). If i ∈ Lj, then j must be in T^i. Either j is a leaf (and thus
i ∈ Âj ⊆ Aj), or it is not a leaf (and thus j has a child s in T^i, and i ∈ Ls).
Corollary 4.14 (Schreiber [181]). The nonzero pattern of the jth column of L is
a subset of the path j ⇝ r from j to the root of the elimination tree T.
Figure 4.10. The nonzero pattern of L(:,j) is the union of its children
Computing the column counts cj = |Lj| can be done in O(|L|) time by using
Theorem 4.13 or by traversing each row subtree explicitly and keeping track of how
many times j appears in any row subtree. Using the least common ancestors of
successive pairs of leaves of the row subtree reduces the time to nearly O(|A|).
Consider the case where j is a leaf of the elimination tree T. The column
count cj is simply cj = |Aj| + 1 = |Âj| + 1, since j has no children and each entry
in column j of A is also in the skeleton matrix Â. Consider the case where j is not
a leaf of the elimination tree. Theorem 4.13 states that Lj is the union of Aj ∪ {j}
and the nonzero patterns of its children, Ls \ {s}. Since Aj is disjoint from each
child Ls, and since s ∈ Ls,
The overlap o4 = 2, because rows 4 and 11 each appear twice in the children. The
number of children is b4 = 2. Thus, c4 = 0 - 2 - 2 + 4 + 3 = 3. If the overlap
and the skeleton matrix are known for each column j, (4.4) can be used to compute
the column counts. Note that the diagonal entry j = 4 does not appear in Â4.
Instead, the entry 4 ∈ L4 appears in each child, and the overlap accounts for all but
one of these entries.
The overlap can be computed by considering the row subtrees. There are
three cases to consider. Recall that the ith row subtree T^i determines the nonzero
pattern of the ith row of L. Node j is present in the ith subtree if and only if i ∈ Lj.
static void init_ata (cs *AT, const int *post, int *w, int **head, int **next)
{
int i, k, p, m = AT->n, n = AT->m, *ATp = AT->p, *ATi = AT->i ;
*head = w+4*n, *next = w+5*n+1 ;
for (k = 0 ; k < n ; k++) w [post [k]] = k ; /* invert post */
for (i = 0 ; i < m ; i++)
{
for (k = n, p = ATp [i] ; p < ATp [i+1] ; p++) k = CS_MIN (k, w [ATi [p]]) ;
(*next) [i] = (*head) [k] ; /* place row i in linked list k */
(*head) [k] = i ;
}
}
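The HEAD and NEXT macros used by cs_counts below select either the linked lists
built by init_ata (for the ATA case) or the single column j (for the Cholesky case).
They are presumably defined along the following lines:

#define HEAD(k,j) (ata ? head [k] : j)
#define NEXT(J)   (ata ? next [J] : -1)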
int *cs_counts (const cs *A, const int *parent, const int *post, int ata)
{
int i, j, k, n, m, J, s, p, q, jleaf, *ATp, *ATi, *maxfirst, *prevleaf,
*ancestor, *head = NULL, *next = NULL, *colcount, *w, *first, *delta ;
cs *AT ;
if (!CS_CSC (A) || !parent || !post) return (NULL) ; /* check inputs */
m = A->m ; n = A->n ;
s = 4*n + (ata ? (n+m+1) : 0) ;
delta = colcount = cs_malloc (n, sizeof (int)) ; /* allocate result */
w = cs_malloc (s, sizeof (int)) ; /* get workspace */
AT = cs_transpose (A, 0) ; /* AT = A' */
if (!AT || !colcount || !w) return (cs_idone (colcount, AT, w, 0)) ;
ancestor = w ; maxfirst = w+n ; prevleaf = w+2*n ; first = w+3*n ;
for (k = 0 ; k < s ; k++) w [k] = -1 ; /* clear workspace w [0..s-1] */
for (k = 0 ; k < n ; k++) /* find first [j] */
{
j = post [k] ;
delta [j] = (first [j] == -1) ? 1 : 0 ; /* delta[j]=1 if j is a leaf */
for ( ; j != -1 && first [j] == -1 ; j = parent [j]) first [j] = k ;
}
ATp = AT->p ; ATi = AT->i ;
if (ata) init_ata (AT, post, w, &head, &next) ;
for (i = 0 ; i < n ; i++) ancestor [i] = i ; /* each node in its own set */
for (k = 0 ; k < n ; k++)
{
j = post [k] ; /* j is the kth node in postordered etree */
if (parent [j] != -1) delta [parent [j]]-- ; /* j is not a root */
for (J = HEAD (k,j) ; J != -1 ; J = NEXT (J)) /* J=j for LL'=A case */
{
for (p = ATp [J] ; p < ATp [J+l] ; p++)
{
i = ATi [p] ;
q = cs_leaf (i, j, first, maxfirst, prevleaf, ancestor, &jleaf) ;
if (jleaf >= 1) delta [j]++ ; /* A(i,j) is in skeleton */
if (jleaf == 2) delta [q]-- ; /* account for overlap in q */
}
}
if (parent [j] != -1) ancestor [j] = parent [j] ;
}
for (j = 0 ; j < n ; j++) /* sum up delta's of each child */
{
if (parent [j] != -1) colcount [parent [j]] += colcount [j] ;
}
return (cs_idone (colcount, AT, w, 1)) ; /* success: free workspace */
}
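For example (a sketch, with error checks omitted), the column counts of the Cholesky
factor of ATA, the ata=1 case used below for QR and LU factorization, can be
computed without forming ATA explicitly:

int *parent = cs_etree (A, 1) ;                  /* etree of A'*A, A'*A not formed */
int *post = cs_post (parent, A->n) ;             /* postorder the elimination tree */
int *colcount = cs_counts (A, parent, post, 1) ; /* counts for chol (A'*A) */
cs_free (post) ;
cs_free (parent) ;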
The column count algorithm presented here can also be used for the QR and
LU factorization of a square or rectangular matrix A. For QR factorization, the
nonzero pattern of R is identical to LT in the Cholesky factorization LLT = ATA
(assuming no numerical cancellation and mild assumptions discussed in Chapter 5).
This same matrix R provides an upper bound on the nonzero pattern of U for an
LU factorization of A. Details are presented in Chapters 5 and 6.
One method for finding the row counts of R is to compute ATA explicitly and
then find the column counts of its Cholesky factorization. This can be expensive
both in time and memory. A better method taking nearly O(|A| + n + m) time is to
find a symmetric matrix with fewer nonzeros than ATA but whose Cholesky factor
has the same nonzero pattern as AT A. One matrix that satisfies this property is the
star matrix. It has O(|A|) entries and can be found in O(|A| + n + m) time. Each
row of A defines a clique in the graph of ATA. Let Ai denote the nonzero pattern
of the ith row of A. Consider the lowest numbered column index k of nonzeros
in row i of A; that is, k = min Ai. The clique in ATA corresponding to row i of
A is the set of entries Ai × Ai. Consider an entry (ATA)ab, where a ∈ Ai and
b ∈ Ai. If both a > k and b > k, then this entry is not needed. It can be removed
without changing the nonzero pattern of the Cholesky factor of ATA. Without
loss of generality, assume a > b. The entries (ATA)bk and (ATA)ak will both be
nonzero. Theorems 4.2 and 4.3 imply that lab is nonzero, regardless of whether or
not (ATA)ab is nonzero.
The nonzero pattern of the kth row and column of the star matrix is thus
the union of all Ai, where k = min Ai. Fortunately, this union need not be formed
explicitly, since the row and column count algorithms (and specifically the skele-
ton function) implicitly ignore duplicate entries. To traverse all entries in the kth
column of the star matrix, all rows Ai, where k = min Ai, are considered. In
cs_counts, this is implemented by placing each row in a linked list corresponding
to its least numbered nonzero column index (using a head array of size n+1 and a
next array of size n).
In MATLAB, c = symbfact(A) uses the same algorithms given here, return-
ing the column counts of the Cholesky factorization of A. The column counts of the
Cholesky factorization of ATA are given by c = symbfact(A, 'col'). Both forms
use the CHOLMOD function cholmod_rowcolcounts.
the symbolic analysis for the up-looking sparse Cholesky factorization presented in
the next section.
typedef struct cs_symbolic /* symbolic Cholesky, LU, or QR analysis */
{
int *pinv ; /* inverse row perm. for QR, fill red. perm for Chol */
int *q ; /* fill-reducing column permutation for LU and QR */
int *parent ; /* elimination tree for Cholesky and QR */
int *cp ; /* column pointers for Cholesky, row counts for QR */
int *leftmost ; /* leftmost [i] = min (find (A (i,:))), for QR */
int m2 ; /* # of rows for QR, after adding fictitious rows */
double Inz ; /* # entries in L for LU or Cholesky; in V for QR */
double unz ; /* # entries in U for LU; in R for QR */
} css ;
cs_schol does not compute the nonzero pattern of L. First, a css structure S
is allocated. For a sparse Cholesky factorization, S->pinv is the fill-reducing permu-
tation (stored as an inverse permutation vector), S->parent is the elimination tree,
S->cp is the column pointer of L, and S->lnz = |L|. This symbolic structure will
also be used for sparse LU and QR factorizations. Next, p is found via a minimum
degree ordering (cs_amd).
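A sketch of how these steps can be assembled from the functions in this chapter is
given below. Error checks are abbreviated, cs_pinv, cs_symperm, and cs_cumsum are
CSparse utilities from Chapter 2, and the actual cs_schol may differ in its details.

css *cs_schol (int order, const cs *A)
{
    int n, *c, *post, *P ;
    cs *C ;
    css *S ;
    if (!CS_CSC (A)) return (NULL) ;                /* check inputs */
    n = A->n ;
    S = cs_calloc (1, sizeof (css)) ;               /* allocate result S */
    if (!S) return (NULL) ;                         /* out of memory */
    P = cs_amd (order, A) ;                         /* P = amd(A+A') or natural */
    S->pinv = cs_pinv (P, n) ;                      /* find inverse permutation */
    cs_free (P) ;
    if (order && !S->pinv) return (cs_sfree (S)) ;
    C = cs_symperm (A, S->pinv, 0) ;                /* C = pattern of triu(A(p,p)) */
    S->parent = cs_etree (C, 0) ;                   /* find elimination tree of C */
    post = cs_post (S->parent, n) ;                 /* postorder the etree */
    c = cs_counts (C, S->parent, post, 0) ;         /* column counts of chol(C) */
    cs_free (post) ;
    cs_spfree (C) ;
    S->cp = cs_malloc (n+1, sizeof (int)) ;         /* allocate column pointers */
    S->unz = S->lnz = cs_cumsum (S->cp, c, n) ;     /* column pointers for L; |L| */
    cs_free (c) ;
    return ((S->lnz >= 0) ? S : cs_sfree (S)) ;
}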
csn *cs_chol (const cs *A, const css *S)
{
double d, lki, *Lx, *x, *Cx ;
int top, i, p, k, n, *Li, *Lp, *cp, *pinv, *s, *c, *parent, *Cp, *Ci ;
cs *L, *C, *E ;
csn *N ;
if (!CS_CSC (A) || !S || !S->cp || !S->parent) return (NULL) ;
n = A->n ;
N = cs_calloc (1, sizeof (csn)) ; /* allocate result */
c = cs_malloc (2*n, sizeof (int)) ; /* get int workspace */
x = cs_malloc (n, sizeof (double)) ; /* get double workspace */
cp = S->cp ; pinv = S->pinv ; parent = S->parent ;
C = pinv ? cs_symperm (A, pinv, 1) : ((cs *) A) ;
E = pinv ? C : NULL ; /* E is alias for A, or a copy E=A(p,p) */
if (!N || !c || !x || !C) return (cs_ndone (N, E, c, x, 0)) ;
s = c +n ;
Cp = C->p ; Ci = C->i ; Cx = C->x ;
N->L = L = cs_spalloc (n, n, cp [n], 1, 0) ; /* allocate result */
if (!L) return (cs_ndone (N, E, c, x, 0)) ;
Lp = L->p ; Li = L->i ; Lx = L->x ;
for (k = 0 ; k < n ; k++) Lp [k] = c [k] = cp [k] ;
for (k = 0 ; k < n ; k++) /* compute L(:,k) for L*L' = C */
{
/* Nonzero pattern of L(k,:) */
top = cs_ereach (C, k, parent, s, c) ; /* find pattern of L ( k , : ) */
x [k] = 0 ; /* x (0:k) is now zero */
for (p = Cp [k] ; p < Cp [k+1] ; p++) /* x = full(triu(C(:,k))) */
{
if (Ci [p] <= k) x [Ci [p]] = Cx [p] ;
}
d = x [k] ; /* d = C(k,k) */
x [k] = 0 ; /* clear x for k+1st iteration */
/* Triangular solve */
for ( ; top < n ; top++) /* solve L(0:k-l,0:k-l) * x = C(:,k) */
{
i = s [top] ; /* s [top..n-l] is pattern of L(k,:) */
lki = x [i] / Lx [Lp [i]] ; /* L(k,i) = x (i) / L(i,i) */
x [i] = 0 ; /* clear x for k+1st iteration */
for (p = Lp [i] + 1 ; p < c [i] ; p++)
{
x [Li [p]] -= Lx [p] * lki ;
}
d -= lki * lki ; /* d = d - L(k,i)*L(k,i) */
p = c [i]++ ;
Li [p] = k ; /* store L(k,i) in column i */
Lx [p] = lki ;
}
/* Compute L(k,k) */
if (d <= 0) return (cs_ndone (N, E, c, x, 0)) ; /* not pos def */
p = c [k]++ ;
Li [p] = k ; /* store L(k,k) = sqrt (d) in column k */
Lx [p] = sqrt (d) ;
}
Lp [n] = cp [n] ; /* finalize L */
return (cs_ndone (N, E, c, x, 1)) ; /* success: free E,s,x; return N */
}
For a sparse Cholesky factorization, only N->L is used. cs_nfree frees a numeric
factorization. cs_ndone frees any workspace and returns a numeric factorization.
typedef struct cs_numeric /* numeric Cholesky, LU, or QR factorization */
{
cs *L ; /* L for LU and Cholesky, V for QR */
cs *U ; /* U for LU, R for QR, not used for Cholesky */
int *pinv ; /* partial pivoting for LU */
double *B ; /* beta [0..n-1] for QR */
} csn ;
csn *cs_ndone (csn *N, cs *C, void *w, void *x, int ok)
{
cs_spfree (C) ; /* free temporary matrix */
cs_free (w) ; /* free workspace */
cs_free (x) ;
return (ok ? N : cs_nfree (N)) ; /* return result if OK, else free it */
}
It computes L one column at a time and can be derived from the expression

    [ L11            ] [ L11^T  l12  L31^T ]   [ A11    a12  A31^T ]
    [ l12^T  l22     ] [        l22  l32^T ] = [ a12^T  a22  a32^T ]        (4.6)
    [ L31    l32  L33] [              L33^T]   [ A31    a32  A33   ]

where the middle row and column of each matrix are the kth row and column
of L, LT, and A, respectively. If the first k-1 columns of L are known, l22 =
sqrt(a22 - l12^T l12) can be computed first, followed by l32 = (a32 - L31 l12)/l22. For the
sparse case, an amplified version is given below.
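A dense MATLAB rendering of this left-looking derivation is sketched below; it is
an illustration only, not the sparse amplified version referred to above.

function L = chol_left (A)
n = size (A,1) ;
L = zeros (n) ;
for k = 1:n
    % column k of L from A(:,k) minus the contributions of columns 1:k-1
    L (k,k) = sqrt (A (k,k) - L (k,1:k-1) * L (k,1:k-1)') ;
    L (k+1:n,k) = (A (k+1:n,k) - L (k+1:n,1:k-1) * L (k,1:k-1)') / L (k,k) ;
end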
Consider (4.6), and let the middle row and column of the three matrices
represent a block of sj ≥ 1 rows and columns. This block of columns is selected
so that the nonzero patterns of these sj columns are all identical, except for the
diagonal block L22, which is dense. In MATLAB notation, s is an integer vector
where all(s>0) is true, and sum(s)=n. The jth supernode consists of s(j) columns
of L which can be stored as a dense matrix of dimension |Lf| by sj, where f is the
column of L represented as the leftmost column in the jth supernode. chol_super
relies on four key operations, all of which can exploit dense matrix kernels:
The first equation, l11^2 = a11, is solved for l11, followed by l21 = a21/l11. Next,
the Cholesky factorization L22 L22^T = A22 - l21 l21^T is computed. The chol_right
function is the MATLAB expression of this algorithm.
function L = chol_right (A)
n = size (A,1) ;
L = zeros (n) ;
for k = 1:n
L (k,k) = sqrt (A (k,k)) ;
L (k+1:n,k) = A (k+1:n,k) / L (k,k) ;
A (k+1:n,k+1:n) = A (k+1:n,k+1:n) - L (k+1:n,k) * L (k+1:n,k)' ;
end
It forms the basis for the multifrontal method, which is similar to chol_right,
except that the summation of the outer product l21 l21^T is postponed. Only a brief
overview of the multifrontal method is given here. See Section 6.3 for more details.
Just as in the supernodal method, the columns of L are grouped together; each
group is represented by a dense frontal matrix. Let Lf be the nonzero pattern of
the first column in a frontal matrix. The frontal matrix has dimension |Lf|-by-|Lf|.
Within this frontal matrix, k ≥ 1 steps of factorization are computed, and a rank-k
outer product is computed. These steps can use the dense matrix BLAS, and thus
they too can obtain very high performance.
Unlike chol_right, the outer product computed in the frontal matrix is not
immediately added into the sparse matrix A. Let column e be the last pivot column
of L represented by a frontal matrix (e = k2-1 in chol_super). Its contribution is
held in the frontal matrix until its parent is factorized. Its parent is that frontal
matrix whose first column is the parent of e in the elimination tree of L. When a
frontal matrix is factorized, the contribution blocks of its children are first added
together.
MATLAB does not use a multifrontal sparse Cholesky method. It does use
the multifrontal method for its sparse LU factorization (see Section 6.3).
Consider the sparse case. A key observation is to note that the columns of L
that are modified correspond to the nonzero pattern of the solution to the triangular
system Lx = w. At the jth step, the variable alpha is equal to xj. This can be
seen by removing everything from the algorithm except the modifications to w; all
that is left is just a lower triangular solve. If alpha is zero, beta2 and beta are
identical, gamma is zero, and delta is one. No change is made to the jth column of
L in this case. Thus, the jth step can be skipped if xj = 0.
matrix. Tarjan [195] discusses how the disjoint-set-union data structure can be used
efficiently to compute a sequence of least common ancestors.
Many software packages are available for factorizing sparse symmetric positive
definite or symmetric indefinite matrices. Details of these packages are summarized
in Section 8.6, including references to papers that discuss the supernodal and mul-
tifrontal methods. Gould, Hu, and Scott [116] compare many of these packages.
The BLAS (Dongarra et al. [46]) and LAPACK (Anderson et al. [8]) are two of
the many software packages that provide dense matrix operations and factorization
methods. Optimized BLAS can obtain near-peak performance on many computers
(Goto and van de Geijn [115]; see www.tacc.utexas.edu/resources/software).
Applications that require the update or downdate of a sparse Cholesky fac-
torization include optimization algorithms, least squares problems in statistics, the
analysis of electrical circuits and power systems, structural mechanics, boundary
condition changes in partial differential equations, domain decomposition meth-
ods, and boundary element methods, as discussed by Hager [124]. Gill et al. [110]
and Stewart [190] provide an extensive summary of update/downdate methods.
Stewart [189] introduced the term downdating and analyzed its error properties.
LINPACK includes a rank-1 dense update/downdate [45]; it is used in the MAT-
LAB cholupdate function. The chol_update function above is Carlson's algo-
rithm [20], and chol_downdate is from Pan [163]. The cs_updown function is
based on Bischof, Pan, and Tang's combination of Carlson's update and Pan's
downdate [16]. Davis and Hager developed an optimal sparse multiple-rank super-
nodal update/downdate method, including a method to add and delete rows from
A (CHOLMOD [35, 36, 37]).
Exercises
4.1. Use cs_ereach to implement an O(|L|)-time algorithm for computing the
elimination tree and the number of entries in each row and column of L. It
should operate using only the matrix A and O(n) additional workspace. The
matrix A should not be modified.
4.2. Compare and contrast cs_chol with the LDL package [29] and with
cholmod_rowf ac. Both can be downloaded from www.siam.org/books/fa02.
4.3. Write a function that solves Lx = b when L, x, and b are sparse and L
comes from a Cholesky factorization, using the elimination tree. Assume the
elimination tree is already available; it can be passed in a parent array, or it
can be found by examining L directly, since L has sorted columns.
4.4. Repeat Problem 4.3, but solve LTx = b instead.
4.5. The cs_ereach function can be simplified if A is known to be permuted ac-
cording to a postordering of its elimination tree and if the row indices in each
column of A are known to be sorted. Consider two successive row indices i1
and i2 in a column of A. When traversing up the elimination tree from node
i1, the least common ancestor of i1 and i2 is the first node a ≥ i2. Let p be
the next-to-the-last node along the path i1 ⇝ a (where p < i2 ≤ a). Include
the path i1 ⇝ p in an output queue (not a stack). Continue traversing the
tree, starting at node i2. The resulting queue will be in perfectly sorted order.
The while (len>0) loop in cs_ereach can then be removed.
4.6. Compute the height of the elimination tree, which is the length of the longest
path from a root to any leaf. The time taken should be O(n). The result
should be the same as the second output of the MATLAB symbfact function.
4.7. Why is head of size n+1 in cs_counts?
4.8. How does the skeleton function implicitly ignore duplicate entries?
4.9. The cs_schol function computes a postordering, but does not combine it
with the fill-reducing ordering, because the ordering from cs_amd includes an
approximate postordering of the elimination tree. However, cs_amd might
not be called. Add an option to cs_schol to combine the fill-reducing order
(or the natural ordering) with the postordering.
4.10. Write a function that computes the symbolic Cholesky factorization of A (the
nonzero pattern of L). Hint: start with cs_chol and remove any numerical
computations. The algorithm should compute the pattern of L in O(|L|) time
and return a matrix L with sorted columns. The s array can be removed,
since the row indices can be stored immediately into L in any order. It should
allocate both N->L->i and N->L->x for use in Problem 4.11. Allocating
N->L->x can be postponed, but allocating it here makes it simpler to write
a MATLAB mexFunction interface for this problem.
4.11. Write a sparse left-looking Cholesky factorization algorithm with prototype
int cs_leftchol(cs *A, css *S, csn *N). It should assume the nonzero
pattern of L has already been computed (see Problem 4.10). Compare its
performance with cs_chol and cs_rechol in Problem 4.12. The algorithm is
very similar to cs_chol. The initializations are identical, except that x should
be created with cs_calloc, not cs_malloc. The N structure should be passed
in with all of N->L preallocated. The s array is not needed if cs_ereach is
merged with cs_leftchol (the topological order is not required).
4.12. Write a function with prototype int cs_rechol(cs *A, css *S, csn *N)
that computes the Cholesky factorization of A using the up-looking method.
It should assume that the nonzero pattern of L has already been computed
in a prior call to cs_chol (or by Problem 4.10). The nonzero pattern of A
should be the same as in the prior call to cs_chol.
4.13. An incomplete Cholesky factorization computes an approximation to L with
fewer nonzeros. It is useful as a preconditioner for iterative methods, as
discussed in detail by Saad [178]. One method for computing it is to drop
small entries from L as they are computed. Another is to use a fixed nonzero
pattern (typically the nonzero entries in A) and keep only entries in L within
that pattern. Write an incomplete Cholesky factorization based on cs_chol
or cs_leftchol (Problem 4.11). See Problem 6.13 for more details. See also
the MATLAB cholinc function.
Chapter 5
Orthogonal methods
The most reliable methods for solving least squares problems use orthogonal trans-
formations. This chapter considers QR factorization based on Householder reflec-
tions and Givens rotations.
The left-looking algorithm qr_left applies the Householder reflections only to the
current column k, one column at a time, and is simpler to implement for the sparse
case.
function [V,Beta,R] = qr_left (A)
[m n] = size (A) ;
V = zeros (m,n) ;
Beta = zeros (l,n) ;
R = zeros (m,n) ;
for k = 1:n
x = A (:,k) ;
for i = 1:k-1
v = V (i:m,i) ;
beta = Beta (i) ;
x (i:m) = x (i:m) - v * (beta * (v' * x (i:m))) ;
end
[v,beta,s] = gallery ('house', x (k:m), 2) ;
V (k:m,k) = v ;
Beta (k) = beta ;
R (1:(k-1),k) = x (1:(k-1)) ;
R (k,k) = s ;
end
Some of the theorems stated here require the diagonal entry akk to be structurally nonzero
(that is, it is an entry in the data structure even if numerically zero). If this is
not the case, then the rows of A can be permuted, or the sparse matrix A can be
modified by adding explicit zero entries, to ensure that this condition holds. All
of the theorems ignore numerical cancellation. Some theorems require the matrix
to have the strong Hall property; they provide loose upper bounds on the nonzero
pattern otherwise. Because of space constraints, the proofs are brief or omitted.
The definition of strong Hall is given in Section 7.3.
That is, in HA, the nonzero pattern of any modified row i ∈ V is replaced with the
set union of all rows that are modified by the Householder reflection H.
Proof. From Theorem 2.1, the nonzero pattern of vTA = (ATv)T is given by
(5.1). The entry (v(vTA))ij (the scalar β can be ignored) is nonzero if and only if i ∈ V
and j ∈ ⋃i∈V Ai*. The matrix v(vTA) is then subtracted from A to obtain HA,
modifying all rows i ∈ V. Each of the corresponding rows of A was used to construct
the set union (5.1), so all of them now have the same nonzero pattern, given by
(5.1).
Theorem 5.3 (Golub and Van Loan [114]). If ATA is positive definite, and its
Cholesky factorization is LLT = ATA, then
Proof.
Theorem 5.4 (George and Ng [97]). If akk is structurally nonzero for all k,
then
Theorem 5.5 (Coleman, Edenbrandt, and Gilbert [22], George and Heath [83]).
Assuming the matrix A has the strong Hall property, R*k = Lk*, where Lk* denotes
the nonzero pattern of the kth row of the symbolic Cholesky factor of ATA. If A
does not have the strong Hall property,
Corollary 5.6. R*k = Reach(Ck), where Ck is the nonzero pattern of the upper
triangular part of column k of ATA (assuming A has the strong Hall property).
A more concise method for computing R*k is based on the following theorem.
Hall property).
Theorem 5.8.
where each set in the above expression is disjoint from all other sets, and A has the
strong Hall property. That is,
If A does not have the strong Hall property, this is an upper bound on V.
Theorem 5.9.
where A has the strong Hall property. Otherwise, (5.3) is an upper bound.
turally nonzero or, equivalently, k ∈ Vk. Compare the following pseudocode function
with the MATLAB function qr_lef t.
The cs_sqr function does the ordering and analysis for a sparse QR factor-
ization. Two parameters determine behavior of cs_sqr: order and qr. The order
parameter specifies the ordering to use; the natural ordering (order=0) or a min-
imum degree ordering of ATA (order=3) are good choices for a QR factorization.
The qr parameter must be true (nonzero) for a sparse QR factorization. cs_sqr
first finds a fill-reducing column permutation S->q. The function then finds the
permuted matrix C = AQ (where Q is the column permutation, not the orthogonal
factor Q), determines the elimination tree of CTC and postorders it, and finds the
column counts of L (equivalently, the row counts of R). It then calls cs_vcount to
find the column counts |V1|, ..., |Vn| of the V matrix that holds the Householder vectors.
The cs_qr function performs the numerical QR factorization.
css *cs_sqr (int order, const cs *A, int qr)
{
int n, k, ok = 1, *post ;
css *S ;
if (!CS_CSC (A)) return (NULL) ; /* check inputs */
n = A->n ;
S = cs_calloc (1, sizeof (css)) ; /* allocate result S */
if (!S) return (NULL) ; /* out of memory */
S->q = cs_amd (order, A) ; /* fill-reducing ordering */
if (order && !S->q) return (cs_sfree (S)) ;
if (qr) /* QR symbolic analysis */
{
cs *C = order ? cs_permute (A, NULL, S->q, 0) : ((cs *) A) ;
S->parent = cs_etree (C, 1) ; /* etree of C'*C, where C=A(:,q) */
post = cs_post (S->parent, n) ;
S->cp = cs_counts (C, S->parent, post, 1) ; /* col counts chol(C'*C) */
cs_free (post) ;
ok = C && S->parent && S->cp && cs_vcount (C, S) ;
if (ok) for (S->unz = 0 , k = 0 ; k < n ; k++) S->unz += S->cp [k] ;
ok = ok && S->lnz >= 0 && S->unz >= 0 ; /* int overflow guard */
if (order) cs_spfree (C) ;
}
else
{
S->unz = 4*(A->p [n]) + n ; /* for LU factorization only, */
S->lnz = S->unz ; /* guess nnz(L) and nnz(U) */
}
return (ok ? S : cs_sfree (S)) ; /* return result S */
}
The cs_qr function uses the symbolic analysis computed by cs_sqr: the
column elimination tree S->parent, column preordering S->q, row permutation
S->pinv, the S->leftmost array, the number of nonzeros in R and V (S->unz and
S->lnz, respectively), and the number of rows S->m2 after adding fictitious rows if
A is structurally rank deficient.
The function first extracts the contents of S, allocates the result N, and allo-
cates and initializes some workspace. Next, each column k of V and R is computed.
The body of the for k loop first determines where V(:,k) and R(:,k) start, and
finds the column A(:,col) corresponding to C(:,k). The nonzero pattern R*k of
the kth column of R is found using a symbolic sparse triangular solve (Theorem
5.7). Prior Householder reflections are applied to column k, one for each nonzero
entry in R*k, and Vk is computed (5.3). The modified column x = Hk-1 · · · H1 C*k
is gathered from its dense vector representation x as the kth column of V, and over-
written with the kth Householder vector. A complete symbolic and numeric QR
factorization, including a fill-reducing column preordering, can be computed with
S=cs_sqr(3,A,l) followed by N=cs_qr(A,S).
In MATLAB, [Q,R]=qr(A) computes the QR factorization of A. The fill-
reducing column permutation must be applied to A prior to calling qr. The MAT-
LAB qr function is based on Givens rotations, not Householder reflections. It
returns the orthogonal matrix Q, not the more compact representation of V, Beta,
and pinv that cs_qr uses.
The cs_qright and cs_qleft M-files apply the Householder reflections (V,
Beta, and p as computed by cs_qr) to the left or right of a matrix. cs_qleft is
similar to cs_happly, except that it applies all of the Householder vectors.
function X = cs_qright (V, Beta, p, Y)
%CS_QRIGHT apply Householder vectors on the right.
% X = cs_qright(V,Beta,p,Y) computes X = Y*P'*H1*H2*...*Hn = Y*Q where Q is
% represented by the Householder vectors V, coefficients Beta, and
% permutation p. p can be [], which denotes the identity permutation.
% To obtain Q itself, use Q = cs_qright(V,Beta,p,speye(size(V,1))).
%
% See also CS_QR, CS_QLEFT.
[m n] = size (V) ;
X = Y ;
if ("isempty (p)) X = X (:,p) ; end
for k = l:n
X = X - (X * (Beta (k) * V (:,k))) * V (:,k)' ;
end
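The loop above completes cs_qright. The lines that follow belong to the companion
cs_qleft M-file; its opening lines presumably resemble the sketch below, which obtains
m2, m, and ny from the sizes of V and Y before the code continues.

function X = cs_qleft (V, Beta, p, Y)
%CS_QLEFT apply Householder vectors on the left.
% X = cs_qleft(V,Beta,p,Y) computes X = Hn*...*H2*H1*P*Y = Q'*Y where Q is
% represented by the Householder vectors V, coefficients Beta, and
% permutation p.
%
% See also CS_QR, CS_QRIGHT.
[m2 n] = size (V) ;
[m ny] = size (Y) ;
X = Y ;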
if (m2 > m)
if (issparse (Y))
X = [X ; sparse(m2-m,ny)] ;
else
X = [X ; zeros(m2-m,ny)] ;
end
end
if ("isempty (p)) X = X (p,:) ; end
for k = l:n
X = X - V (:,k) * (Beta (k) * (V (:,k)' * X)) ;
end
The Householder vectors stored in V are typically much sparser than the ex-
plicit representation of Q. Try this short experiment in MATLAB, which compares
Q (with 38,070 nonzeros) and V (with only 3,906 nonzeros):
load west0479
q = colamd (west0479) ;
[Q,R] = qr (west0479 (:,q)) ;
[V,beta,p,R2] = cs_qr (west0479 (:,q)) ;
Q2 = cs_qright (V, beta, p, speye(size(V,l))) ;
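The nonzero counts quoted above can be checked directly; the exact values depend
on the ordering produced by colamd and may differ between MATLAB versions.

nnz (Q)     % roughly 38,070 nonzeros
nnz (V)     % roughly 3,906 nonzeros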
where r = ±||x||2. The coefficients c and s are computed using the following givens2
MATLAB function.
function g = givens2(a,b)
if (b == 0)
c = 1 ; s = 0 ;
elseif (abs (b) > abs (a))
tau = -a/b ; s = 1 / sqrt (1+tau^2) ; c = s*tau ;
else
tau = -b/a ; c = 1 / sqrt (1+tau^2) ; s = c*tau ;
end
g = [c -s ; s c] ;
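For example, a single rotation zeros out the second entry of a 2-by-1 vector:

g = givens2 (3, 4) ;    % g = [-0.6 -0.8 ; 0.8 -0.6], and g*g' = I
g * [3 ; 4]             % returns [-5 ; 0], so r = -5 in this case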
advantages in the sparse case. Rows can be ordered to reduce the work below that
of the Householder-based sparse QR. The disadvantage of using Givens rotations
is that the resulting QR factorization is less suitable for multifrontal techniques.
MATLAB uses Givens rotations for its sparse QR factorization. It operates on
the rows of R and A. The matrix R starts out equal to zero but with enough space
allocated to hold the final R. Each step of the factorization brings in a new row of
A and eliminates its entries with the existing R until it is either all zero or it can be
placed as a new row of R. The qr_givens_full algorithm for full matrices is shown
below. It assumes the diagonal of A is nonzero. The innermost loop annihilates the
aik entry via a Givens rotation of the incoming ith row of A and the kth row of R.
function R = qr_givens_full (A)
[m n] = size (A) ;
for i = 2:m
for k = 1:min(i-1,n)
A ([k i],k:n) = givens2 (A(k,k), A(i,k)) * A ([k i],k:n) ;
A (i,k) = 0 ;
end
end
R = A ;
For the sparse case, the rotation to zero out aik must be skipped if it is already
zero. The entries k that must be annihilated correspond to the nonzero pattern Vi*
of the ith row of the Householder matrix V, discussed in the previous section.
Theorem 5.10 (George, Liu, and Ng [95]). Assume A has a zero-free diagonal.
The nonzero pattern Vi* of the ith row of the Householder matrix V is given by the
path f ⇝ min(i, r) in the elimination tree T of ATA, where f = min Ai* is the
leftmost nonzero entry in the ith row, and r is the root of the tree.
size n. The matrix R is allocated with enough space for each final row, but it starts
out empty. The ith row of A is annihilated with rows k in the path f ⇝ min(i, r)
until encountering an empty row k of R, at which point the elimination stops and
the partially eliminated row of A becomes the kth row of R. If A has a zero-
free diagonal, this happens when i = k. This method is called the row-merge QR
factorization algorithm; it could also be called an up-looking sparse QR, since at
the ith step it accesses only row i of A and rows 1 to min(i, n) of R. The qr_givens
function assumes A has a zero-free diagonal.
tions; see Lu and Barlow [154], Matstoms [156], Amestoy, Duff, and Puglisi [6], and
Pierce and Lewis [167] (who also present an approximate rank-revealing multifrontal
QR algorithm).
Underdetermined systems can be solved with QR factorization applied to AT',
as discussed by George, Heath, and Ng [84].
Exercises
5.1. Write a function that computes the nonzero pattern of V and R.
5.2. Modify cs_qr so that it can handle m-by-n matrices where m < n. One simple
solution is to append empty rows onto A, but this will not be efficient if m is
much smaller than n.
5.3. Write a function cs_reqr (cs *A, css *S, csn *N) that computes a QR
factorization. It should assume that the nonzero patterns of V and R are
already computed.
5.4. Add column pivoting to cs_qr. If a column has a norm less than or equal
to a given tolerance, permute it to the end of the matrix. The matrices V
and R will need to be dynamically reallocated (see cs_lu in Chapter 6), since
permuting the columns breaks the symbolic preanalysis.
5.5. Combine the postordering with the fill-reducing ordering in cs_sqr (see Prob-
lem 4.9 for details).
Chapter 6
LU factorization
Of the three factorization methods (Cholesky, QR, and LU) presented here, LU
factorization is the oldest. As a factorization method, it factors a matrix A into the
product LU, where L is lower triangular and U is upper triangular. The historical
method for dense matrices is a right-looking one (Gaussian elimination); both it
and a left-looking method are presented here. The latter is used in CSparse, since
it leads to a much simpler implementation for the sparse case.
Theorem 6.1 (George and Ng [97], Gilbert [101], and Gilbert and Ng [106]). If
the matrix A is strong Hall, R is an upper bound on the nonzero pattern of U. More
precisely, uij can be nonzero if and only if rij ≠ 0.
all candidate pivot rows. This proof also establishes a bound on L, namely, the
nonzero pattern of V.
Theorem 6.2 (Gilbert [101] and Gilbert and Ng [106]). If the matrix A is strong
Hall, and assuming akk ≠ 0 for all k, the Householder matrix V is an upper
bound on the nonzero pattern of L obtained with partial pivoting. More precisely,
lij can be nonzero if and only if vij ≠ 0.
¹⁰This is called static pivoting; it can be used even if the matrix is not quite diagonally dominant,
if iterative refinement is used after the solution has been found.
If the qr parameter of cs_sqr is true, the QR upper bound is found for the
permuted matrix AQ (here, Q is the column permutation, not the orthogonal factor
Q). In this case, LU factorization can proceed using a statically allocated memory
space. This bound can be quite high, however (a comparison between the upper
bound and the actual |L| and |U| is left as an exercise). It is sometimes better just
to make a guess at the final \L\ and \U\ or to guess that no partial pivoting will
be needed and to use a symbolic Cholesky analysis to determine a guess for |L|
and |U| (this is left as an exercise). Sometimes a good guess is available from the
LU factorization of a similar matrix in the same application. If qr is false, cs_sqr
makes an optimistic guess that |L| = |U| = 4|A| + n. This guess is suitable for some
matrices but too low for others. After calling cs_sqr, the guess S->lnz and S->unz
can be easily modified as desired. The only penalty for making a wrong guess is
that the memory space for |L| or |U| must be reallocated if the guess is too low, or
memory may run out if the guess is too high.
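For example, the guess can be enlarged between the symbolic analysis and the
numeric factorization (a sketch; the call assumes the CSparse prototype
csn *cs_lu (const cs *A, const css *S, double tol)):

css *S = cs_sqr (0, A, 0) ;     /* natural ordering; guess |L| = |U| = 4|A| + n */
if (S)
{
    S->lnz = 2 * S->lnz ;       /* enlarge the guess for |L| ... */
    S->unz = 2 * S->unz ;       /* ... and for |U| */
}
csn *N = cs_lu (A, S, 1.0) ;    /* numeric LU; tol = 1.0 gives partial pivoting */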
6.2 Left-looking LU
The left-looking LU factorization algorithm computes L and U one column at a
time. At the kth step, it accesses columns 1 to k-1 of L and column k of A. If
partial pivoting is ignored, it can be derived from the following 3-by-3 block matrix
expression, which is very similar to (4.6) for the left-looking Cholesky factorization
algorithm. The matrix L is assumed to have a unit diagonal.
    [ L11             ] [ U11  u12  U13   ]   [ A11    a12  A13   ]
    [ l21^T  1        ] [      u22  u23^T ] = [ a21^T  a22  a23^T ]
    [ L31    l32  L33 ] [           U33   ]   [ A31    a32  A33   ]

The middle row and column of each matrix is the kth row and column of L, U, and
A, respectively. If the first k-1 columns of L and U are known, three equations
can be used to derive the kth columns of L and U: L11 u12 = a12 is a triangular
system that can be solved for u12 (the kth column of U), l21^T u12 + u22 = a22 can be
solved for the pivot entry u22, and L31 u12 + l32 u22 = a32 can then be solved for l32
(the kth column of L). However, these three equations can be rearranged so that
nearly all of them are given by the solution to a single triangular system:

    [ L11            ] [ x1 ]   [ a12 ]
    [ l21^T  1       ] [ x2 ] = [ a22 ]        (6.2)
    [ L31    0    I  ] [ x3 ]   [ a32 ]

The solution to this system gives u12 = x1, u22 = x2, and l32 = x3/u22. The algo-
rithm is expressed in the MATLAB function lu_left, except that partial pivoting
with row interchanges has been added. It returns L, U, and P so that L*U = P*A. It
does not exploit sparsity.
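A dense MATLAB sketch of such a left-looking LU with row interchanges, following
the derivation above (not necessarily identical to lu_left), is:

function [L,U,P] = lu_left (A)
n = size (A,1) ;
P = eye (n) ;
L = zeros (n) ;
U = zeros (n) ;
for k = 1:n
    x = [ L(:,1:k-1) [ zeros(k-1,n-k+1) ; eye(n-k+1) ] ] \ (P * A (:,k)) ;  % (6.2)
    [a i] = max (abs (x (k:n))) ;       % find the partial pivot row i
    i = i + k - 1 ;
    L ([k i],:) = L ([i k],:) ;         % swap rows k and i of L, P, and x
    P ([k i],:) = P ([i k],:) ;
    x ([k i]) = x ([i k]) ;
    U (1:k,k) = x (1:k) ;               % the kth column of U
    L (k,k) = 1 ;
    L (k+1:n,k) = x (k+1:n) / x (k) ;   % the kth column of L (unit diagonal)
end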
Lp [n] = lnz ;
Up [n] = unz ;
Li = L->i ; /* fix row indices of L for final pinv */
for (p = 0 ; p < lnz ; p++) Li [p] = pinv [Li [p]] ;
cs_sprealloc (L, 0) ; /* remove extra space from L and U */
cs_sprealloc (U, 0) ;
return (cs_ndone (N, NULL, xi, x, 1)) ; /* success */
}
The first part of the cs_lu function allocates workspace and obtains the in-
formation from the symbolic ordering and analysis. The number of nonzeros in L
and U is not known; S->lnz and S->unz are either upper bounds computed from a
symbolic QR factorization or simply a guess.
Triangular solve: The kth iteration of the for k loop first records the start
of the kth columns of L and U and then reallocates these two matrices if space
might not be sufficient. Next, the triangular system (6.2) is solved for x. No post-
permutation is required for U, since pinv[i] is well defined.
Find pivot: The largest nonpivotal entry in the pivot column is found. An
entry x [i] corresponding to a row i that is already pivotal is copied directly into U.
If no nonpivotal row index i is found (ipiv is -1), the matrix is structurally rank
deficient. If the largest entry in nonpivotal rows is numerically zero (a is zero), the
matrix is numerically rank deficient. The diagonal entry (x[col], where col is the
kth column of AQ and Q is the fill-reducing column ordering), is selected if it is
large enough compared with the partial pivoting choice (x[ipiv]).
Divide by pivot: The pivot entry is saved as U(k,k), the last entry in
U(: ,k), as required by cs_usolve. A unit diagonal entry is stored as the first entry
in L(: ,k), as required by cs_lsolve. Note that ipiv corresponds to a row index
of A, not PA.
Finalize L and U: The last column pointers for L and U are recorded, the row
indices of L are fixed to refer to their permuted ordering, and any extra space is
removed from L and U.
The algorithm takes O(n + |A| + f) time, where f is the number of floating-point
operations performed. This is essentially O(f), except when A is diagonal (for exam-
ple). MATLAB uses the algorithm above for the [L,U,P]=lu(A) syntax (GPLU).
It uses a right-looking multifrontal method (UMFPACK) for [L,U,P,Q]=lu(A) and
x=A\b when A is sparse, square, and not symmetric positive definite.
Cholesky factorization,
where l11 = 1 is a scalar, and all three matrices are square and partitioned identi-
cally. Other choices for l11 are possible; this choice leads to a unit lower triangular
L and the four equations
Solving each equation in turn leads to the recursive lu_rightr, written in MATLAB
below. This function is meant as a working description of the algorithm, not an
efficient implementation.
function [L,U] = lu_rightr (A)
n = size (A,1) ;
if (n == 1)
L = 1 ;
U = A ;
else
u11 = A (1,1) ;                                    % (6.4)
u12 = A (1,2:n) ;                                  % (6.5)
l21 = A (2:n,1) / u11 ;                            % (6.6)
[L22,U22] = lu_rightr (A (2:n,2:n) - l21*u12) ;    % (6.7)
L = [ 1 zeros(1,n-1) ; l21 L22 ] ;
U = [ u11 u12 ; zeros(n-1,1) U22 ] ;
end
The lu_rightr function uses tail recursion, where the recursive call is the very
last step (the last two lines do not do any work; they just define the
contents of L and U computed via (6.4) through (6.7)). Tail recursion can easily be
converted into an iterative algorithm, as shown by the lu_right function. This is
how a right-looking LU factorization algorithm would normally be written, except
that in the dense case, A is normally overwritten with L and U.
function [L,U] = lu_right (A)
n = size (A,1) ;
L = eye (n) ;
U = zeros (n) ;
for k = 1:n
U (k,k:n) = A (k,k:n) ;                                             % (6.4) and (6.5)
L (k+1:n,k) = A (k+1:n,k) / U (k,k) ;                               % (6.6)
A (k+1:n,k+1:n) = A (k+1:n,k+1:n) - L (k+1:n,k) * U (k,k+1:n) ;     % (6.7)
end
and |a11| > max |a21|. If (6.3) and its equivalent form in (6.4) through (6.7) are
used directly on A, the inductive hypothesis (6.8) cannot be used. If LU — PA is
the statement being proven, the inductive hypothesis must be applied to a matrix
of smaller dimension but with the same form; (6.8) does not have a permutation
matrix. The inductive hypothesis
function does. For a dense matrix factorization, access to the rows of L is much sim-
pler, and the permutations can be applied immediately, as done by lu_left. Either
method leads to the same LU = PA factorization. After replacing the recursion
in lu_rightpr with its nonrecursive implementation and allowing A to be over-
written with its LU factorization, the conventional outer-product form of Gaussian
elimination is obtained, as demonstrated by the lu_rightp function, shown below.
function [L,U,P] = lu_rightpr (A)
n = size (A,1) ;
if (n == 1)
P = 1 ;
L = 1 ;
U = A ;
else
[x,i] = max (abs (A (1:n,1))) ;    % partial pivoting
P1 = eye (n) ;
P1 ([1 i],:) = P1 ([i 1],:) ;
A = P1*A ;
u11 = A (1,1) ;                                        % (6.10)
u12 = A (1,2:n) ;                                      % (6.11)
l21 = A (2:n,1) / u11 ;                                % (6.12)
[L22,U22,P2] = lu_rightpr (A (2:n,2:n) - l21*u12) ;    % (6.9) or (6.13)
o = zeros (1,n-1) ;
L = [ 1 o ; P2*l21 L22 ] ;                             % (6.14)
U = [ u11 u12 ; o' U22 ] ;
P = [ 1 o ; o' P2 ] * P1 ;
end
The frontal matrices are related to one another via the assembly tree, which is
a coarser version of the elimination tree (some nodes having been merged together
via amalgamation). To factorize a frontal matrix, the original entries of A are
added, along with a summation of the contribution blocks of its children (called the
assembly). One or more steps of dense LU factorization are performed within the
frontal matrix, leaving behind its contribution block (the Schur complement of its
pivot rows and columns). A high level of performance can be obtained using dense
matrix kernels (the BLAS). The contribution block is placed on a stack, and deleted
when it is assembled into its parent.
An example is shown in Figure 6.1. Black circles represent the original entries
of A. Circled x's represent fill-in entries. White circles represent entries in the
contribution block of each frontal matrix. The arrows between the frontal matrices
represent both the data flow and the parent/child relationship of the assembly tree.
A symbolic analysis phase determines the elimination tree and the amalga-
mated assembly tree. During numerical factorization, numerical pivoting may be
required. In this case it may be possible to pivot within the fully assembled rows
and columns of the frontal matrix. For example, consider the frontal matrix holding
diagonal elements a77 and a99 in Figure 6.1. If a77 is numerically unacceptable, it
may be possible to select a79 and a97 as the next two pivot entries instead. If this is
not possible, the contribution block of frontal matrix 7 will be larger than expected.
This larger frontal matrix is assembled into its parent, causing the parent frontal
matrix to be larger than expected. Within the parent, all pivots originally assigned
to the parent and all failed pivots from the children (or any descendants) comprise
the set of pivot candidates. If all of these are numerically acceptable, the parent
contribution block is the same size as expected by the symbolic analysis.
If the nonzero pattern of A is unsymmetric, the frontal matrices become rect-
angular. They are related either by the column elimination tree (the elimination
tree of ATA) or by a directed acyclic graph. An example is shown in Figure 6.2.
This is the same matrix used for the QR factorization example in Figure 5.1.
Using a column elimination tree, arbitrary partial pivoting can be accommodated
without any change to the tree. The size of each frontal matrix is bounded by the
size of the Householder update for the QR factorization of A (the kth frontal matrix
is at most |Vk|-by-|Rk*| in size), regardless of any partial pivoting. In the LU factors
in Figure 6.2, original entries of A are shown as black circles. Fill-in entries when
no partial pivoting occurs are shown as circled x's. White circles represent entries
that could become fill-in because of partial pivoting. In this small example, they all
happen to be in U, but in general they can appear in both L and U. Amalgamation
can be done, just as in the symmetric-pattern case; in Figure 6.2, nodes 5 and 6,
and nodes 7 and 8, have been merged together. The upper bound on the size of
each frontal matrix is large enough to hold all candidate pivot rows, but this space
does not normally need to be allocated.
In Figure 6.2, the assembly tree has been expanded to illustrate each frontal
matrix. The tree represents the relationship between the frontal matrices but not
the data flow. The assembly of contribution blocks can occur not just between par-
ent and child but between ancestor and descendant. For example, the contribution
to a77 made by frontal matrix 2 could be included into its parent 3, but this would
require one additional column to be added to frontal matrix 3. The upper bound
of the size of this frontal matrix is 2-by-4, but only a 2-by-2 frontal matrix needs
to be allocated if no partial pivoting occurs. Instead of augmenting frontal matrix
3 to include the a77 entry, the entry is assembled into the ancestor frontal matrix
4. The data flow between frontal matrices is thus represented by a directed acyclic
graph.
One advantage of the right-looking method over left-looking sparse LU factor-
ization is that it can select a sparse pivot row. The left-looking method does not
keep track of the nonzero pattern of the A^[k] submatrix, and thus cannot determine
the number of nonzeros in its pivot rows. The disadvantage of the right-looking
method is that it is significantly more difficult to implement.
MATLAB uses the unsymmetric-pattern multifrontal method (UMFPACK)
in x=A\b when A is sparse and either unsymmetric or symmetric but not positive
definite. It is also used in [L,U,P,Q]=lu(A). For the [L,U,P]=lu(A) syntax when
A is sparse, MATLAB uses GPLU, a left-looking sparse LU factorization much like
cs_lu.
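For example (a usage sketch; A is any sparse square matrix and b a conforming right-hand side):
[L,U,P,Q] = lu (A) ;            % UMFPACK: P*A*Q = L*U
x1 = Q * (U \ (L \ (P*b))) ;    % solve A*x = b with these factors
[L,U,P] = lu (A) ;              % GPLU (left-looking): P*A = L*U
x2 = U \ (L \ (P*b)) ;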
Exercises
6.1. Use cs_lu, cs_ltsolve, and cs_utsolve to solve ATx = b without forming
AT. See Problem 6.15 for an example application.
6.2. Reduce the size of the workspace xi in cs_lu. Note that L and U both contain
at least n unused space after the call to cs_sprealloc. This space could be
used in a modified cs_spsolve for pstack. Also note that L is unit diagonal,
which simplifies cs_spsolve.
6.3. Implement column pivoting in cs_lu. If no pivot is found in a column or if
the largest pivot candidate is below a given tolerance, permute it to the end
of the matrix and try the next column in its place. Do not modify S->q.
6.4. Write a function with prototype void cs_relu (cs *A, csn *N, css *S)
that computes the LU factorization of A. It should assume that the nonzero
patterns of L and U have already been computed in a prior call to cs_lu. The
nonzero pattern of A should be the same as in the prior call to cs_lu. Use
the same pivot permutation.
6.5. Modify cs_lu so that it can factorize both numerically and structurally rank-
deficient matrices.
6.6. Modify cs_lu so that it can factorize rectangular matrices.
6.7. Derive an LU factorization algorithm that computes the kth column of L and
the kth row of U at the kth step of factorization (Crout's method). Write a
MATLAB prototype and then a C function that implements this factorization
for a sparse matrix A. Optionally include partial pivoting.
6.8. Derive an LU factorization algorithm that computes the kth row of L and
the kth column of U at the kth step of factorization. Why is it difficult to add
partial pivoting to this algorithm?
6.9. The MATLAB interface for cs_lu sorts both L and U with a double transpose.
Modify it so that it requires only one transpose for L and another for U. Hint:
see Problem 6.1.
6.10. Create cs_slu, identical to cs_sqr except for one additional option: a sym-
bolic Cholesky analysis, used for the case when order=1. Use this as a guess
for S->lnz and S->unz.
6.11. If cs_sprealloc fails in cs_lu, the function simply halts and reports that it
is out of memory. The requested memory space is far more than what might
be needed, however. Implement a scheme where 2|L| + n is attempted (for
|L|, as in the current cs_lu). If this fails, reduce the request slowly until the
request succeeds or until requesting the bare minimum (|L| + n - k) fails.
The bare minimum for U is |U| + k + 1. This feature cannot be tested via a
MATLAB mexFunction, because mxRealloc terminates a mexFunction if it
fails.
6.12. Write a version of lu_rightpr that uses a permutation vector p instead of a
permutation matrix.
6.13. An incomplete LU factorization computes approximations of L and U with
Fill-reducing orderings
The kth step updates A with the outer product L(:,k)*L(:,k)'. Let A^[k] denote
the matrix A(k:n,k:n) at the start of the kth iteration, above. (This use of the A^[k]
notation differs from its use in Chapter 5, in which A^[k] = Hk-1 ... H1 A.) Consider the
excluding the diagonal (no self-edges occur in G or Q). The degree di of node i
is the size of the set (7.1). When node k is eliminated, any elements adjacent to
it are no longer required to represent the nonzero pattern of A^[k] (a consequence
of Theorem 4.13); these elements can be removed (called element absorption). An
example sequence of graphs G and quotient graphs Q is given in Figure 7.1. In
the graphs, a plain circle represents a node in G, while a dark circle represents an
element. In the matrices, a filled-in circle represents an edge between two nodes in
G, a circle is an edge no longer in G, and a circled x is an edge in G represented by
an element in Q.
Additional terms in Q can be pruned. If two nodes i and j are both in the
pivotal element Lk, then j and i can be removed from Ai and Aj, respectively.
They may have been adjacent due to an original entry aij and are still adjacent in
Q because they are both adjacent to element k. That is, Ai can be replaced with
the smaller set Ai \ Lk for all i ∈ Lk (referred to here as pruning). With element
absorption and pruning of the Ai sets, the graph Q can be represented in place (its
size never exceeds |A|).
With this graph representation Q, the minimum degree algorithm consists
simply of a greedy reordering of the nodes. Rather than selecting node k at the
kth step, the algorithm selects the node with the least degree. When an element k
is created, the degree of all nodes i ∈ Lk must be recomputed, using (7.1). This is
the most costly part of the algorithm.
The cost can be reduced by exploiting supernodes, also called indistinguishable
nodes. If two nodes i and j become identical (Ei = Ej and Ai = Aj), they will
remain identical until one of them is eliminated (either both are adjacent to the
pivotal element or both are not adjacent). When one of them becomes the node
of least degree, the other node is also a node of least degree, and eliminating both
causes no more edges in Q than when eliminating just one of them. Thus, if two
nodes i and j are found to be indistinguishable, they can be merged into a single
supernode that represents both of them. This is done by removing one of the nodes
(j, say) and letting i be a representative of the supernode containing both i and j (j
has been absorbed into i). The minimum degree algorithm selects a supernode k of
least degree and eliminates it. All nodes start out simply representing themselves.
After k is eliminated, if a node i is left with just an edge to k (Ei = {k} and Ai is
empty), it can be eliminated immediately (called mass elimination). Let |i| denote
the number of nodes represented by supernode i. To keep the notation simple, when
dealing with set expressions the use of i as a member of a set should be interpreted
as the set of nodes represented by i.
Supernodes and mass elimination reduce the number of times (7.1) must be
evaluated. Another technique discards the use of (7.1), and uses an approximation
d̄i to the true degree di instead, which is cheaper to compute, where

    d̄i = |Ai \ i| + |Lk \ i| + Σ_{e ∈ Ei \ {k}} |Le \ Lk|                (7.2)

and k is the current pivot element, and where Ai is assumed to have already been
pruned. Note that d̄i = di if |Ei| <= 2 (where Ei includes k), because Ai and Lk
are disjoint after pruning. Otherwise d̄i >= di. Using d̄ in place of d results in the
approximate minimum degree algorithm, or AMD. At first glance, (7.2) looks no
simpler than computing the set (7.1) and then finding its size. The scan1 algorithm
below shows how to compute the set differences |Le \ Lk| efficiently. In (7.2) and in
the rest of the discussion, |Ai| and |Le| refer to the sum of |j| for the nodes j that
they contain.
function scan1
    assume w(e) < 0 for all e = 1, ..., n
    for each node i ∈ Lk do
        for each element e ∈ Ei do
            if (w(e) < 0) then w(e) = |Le|
            w(e) = w(e) - |i|
Then w(e) = |Le \ Lk| if w(e) >= 0. If w(e) < 0, then the sets Le and Lk are
disjoint, and |Le \ Lk| = |Le|. Once the set differences are known, a second pass
over all i ∈ Lk evaluates (7.2) to compute d̄i. The amortized time for computing
the set differences, computing d̄i, and pruning the set Ai is O(|Ai| + |Ei|). This is
much less than the O(di) required to compute (7.1).
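A tiny MATLAB illustration of this counting trick (the sets Lk and Le below are made up, and the loop structure differs from scan1, which visits only the elements adjacent to nodes in Lk; only the arithmetic is the same):
Lk = [2 5 7] ;                  % nodes in the current pivot element k
Le = {[2 3 5], [7 9], [1 4]} ;  % node lists of three other elements
w  = -ones (1,3) ;              % w(e) < 0 means element e not yet touched
for i = Lk
    for e = 1:3
        if (any (Le{e} == i))
            if (w (e) < 0), w (e) = numel (Le{e}) ; end
            w (e) = w (e) - 1 ;     % |i| = 1 here (no supernodes in this toy)
        end
    end
end
% now w = [1 1 -1]: |Le \ Lk| for the first two elements; the third is disjoint from Lk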
The minimum degree algorithm (AMD) is the most complex of the codes pre-
sented in this book. A concise version of AMD is presented below as the cs_amd
function. It uses slightly more workspace than AMD, leading to a simpler code. It
uses the tree postordering cs_tdfs, rather than AMD's own postordering. It has
no control parameters, as AMD does, and does not compute any of the statistics
that AMD does (such as |L| and the floating-point operation count for a subsequent
Cholesky factorization). It also has a simpler dense-node removal strategy. How-
ever, even with these simplifications, cs_amd generates orderings of the same quality
as AMD and is just as fast.
Construct matrix C: The function accepts the matrix A as input and returns
a permutation vector p. The cs_amd function operates on a symmetric matrix, so
one of three symmetric matrices is formed. If order is 0, a natural ordering p=NULL
is returned. If order is 1 and the matrix is square, C=A+A' is formed, which is
appropriate for a Cholesky factorization or an LU factorization of a matrix with
substantial entries on the diagonal and a roughly symmetric nonzero pattern (using
tol<1 for cs_lu). If order is 2, C=A'*A is formed after removing "dense" rows from
A. This is suitable for LU factorization of unsymmetric matrices and is similar to
what COLAMD computes. If order is 3, C=A'*A is computed, which is best used
for QR factorization or for LU factorization if A has no dense rows. A "dense" row
is a row with more than 10√n entries.
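In MATLAB terms (a usage sketch, with the cs_amd interface also used in the exercises below), the choice of order corresponds to calls such as:
p = cs_amd (A) ;        % orders A+A', for Cholesky or symmetric-pattern LU
q = cs_amd (A, 2) ;     % orders A'*A with dense rows of A dropped, for LU
r = cs_amd (A, 3) ;     % orders A'*A, for QR (or LU when A has no dense rows)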
Diagonal entries are removed from C (since cs_amd requires a graph with no
self-edges), and extra "elbow room" is added to C->i via cs_sprealloc. The con-
tents of C will be destroyed during the elimination (it holds Q^[k]). After C is formed,
the output p and workspace of size 8(n + 1) are allocated. The input A is not mod-
ified. To simplify the remainder of the discussion, the superscript [k] is dropped.
Initialize quotient graph: The quotient graph Q is represented by the
arrays Cp, Ci, w, nv, elen, len, and degree, each of size n+1, except for Ci of
size nzmax. There are four kinds of nodes and elements that must be represented:
• A live node is a node i (or a supernode) that has not been selected as a
pivot and has not been merged into another supernode. In this case, Ei
is represented as Ci [Cp[i] ... Cp[i]+elen[i]-1], where elen[i] >= 0.
The set Ai is represented as Ci [Cp[i]+elen[i] ... Cp[i]+len[i]-1].
Note that Cp [i] is greater than or equal to zero. The number of original
nodes represented by i is given by |i| = nv[i], which is thus greater than
zero. The degree di is degree [i].
• A dead node i is one that has been removed from the graph, having been
absorbed into node r = CS_FLIP(Cp[i]), where CS_FLIP(x) is defined as
-(x)-2. Note that Cp[i] is less than zero. Note that node r might itself be
absorbed into yet another node. In this case Cp forms an assembly tree, very
similar to the elimination tree. The adjacency list of i is not stored. elen[i]
is set to -1 to denote the fact that node i is dead. The size of node i, |i| =
nv[i], is zero.
• A live element e is one that is in the quotient graph Q, having been formed when
node e was selected as the pivot. elen[e] is set to -2, and w[e] will always
be greater than zero. The sets Ae and Ee do not exist. Instead, the set Le
is stored in Ci [Cp[e] ... Cp[e]+len[e]-1]. degree[e] is |Le|, which is
not the same as len [e] ; the latter is smaller because Le is a list of supernodes.
The size of node e, |e| = nv[e], is greater than zero. It represents the number
of nodes represented by supernode e when it was selected as the pivot.
• A dead element e is one that has been absorbed into a subsequent element s
= CS_FLIP(Cp[e]). elen[e] is -2 and w[e] is set to zero to denote that e is
a dead element. |e| = nv[e] > 0 is the same as for live elements.
cs_amd initializes the quotient graph Q and two sets of n linked lists: the degree
lists and the hash buckets. Degree list d is a doubly linked list containing a list of
all nodes with approximate degree d. The head of list d is head [d]. The nodes
preceding and following node i in the list are last [i] and next [i], respectively.
The hash buckets share some of this workspace; hhead [h] is the head of the hth
hash bucket, a singly linked list. Since a node is never in both lists, next [i] is the
node following i in the list, and last [i] is the hash key for node i. The degree
lists are used to determine the node of minimum degree, and the hash buckets are
used for supernode detection.
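A small MATLAB sketch of the degree-list idea (hypothetical arrays, not the C data structure used in cs_amd; -1 plays the role of an empty link):
n = 6 ; deg = [2 1 2 3 1 2] ;           % made-up approximate degrees
head = -ones (1,n) ; next = -ones (1,n) ; last = -ones (1,n) ;
for i = n:-1:1                          % insert node i at the head of list deg(i)
    d = deg (i) ;
    next (i) = head (d) ;
    if (head (d) ~= -1), last (head (d)) = i ; end
    head (d) = i ;
end
mindeg = find (head ~= -1, 1) ;         % head(mindeg) is a node of least degree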
Initialize degree lists: Each node is placed in its degree list. Nodes of zero
degree are eliminated immediately. Nodes with degree > dense are also eliminated
and merged into a placeholder node n, a dead element. These dense nodes will
appear last in the output permutation p.
Select node of minimum approximate degree: cs_amd is now ready to
start eliminating the graph. It first finds a node k of minimum degree and removes
it from its degree list. The variable nel keeps track of how many nodes have been
eliminated; the elimination of k increments this count by |k| = nv[k]. Because
nodes are not eliminated in order 0 through n-1, this pivot node k is not equivalent
to the k discussed above, but it serves the same purpose.
Garbage collection: The new element Lk requires space in Ci. It will be
placed at the end of this array, in Ci [cnz ... cnz + |Lk| - 1], if |Lk| > 0 (more
precisely, less space than this will be used; the exact degree dk = |Lk| is an upper
bound, and d̄k >= dk is yet a higher upper bound). If not enough space is available,
garbage collection is performed to pack Q in the Ci array.
Live nodes and elements need not appear in any particular order in Ci. To
pack Ci efficiently, the method relies on the fact that all entries in Ci[0...cnz-1]
are nonnegative, and Cp will be redefined for live nodes and elements. The for j
loop copies the first entry from each live node and element j into Cp[j] and places
CS_FLIP(j) in the first position of each live object. The second loop scans all of
Ci, looking for negative entries. When a negative entry is found, the live node or
element j is compacted. Garbage collection occurs rarely.
Construct new element: The new element Lk is constructed, using (7.1).
It is constructed in place if |Ek| = 0. nv [i] is negated for all nodes i ∈ Lk to flag
them as members of this set. Each node i is removed from the degree lists. All
elements e ∈ Ek are absorbed into element k.
Find set differences: The scan1 function now computes the set differences
|Le \ Lk| for all elements e. At the start of the scan, no entry in the w array is
greater than or equal to mark. A test is made to ensure mark + max |Le| does not
cause integer overflow. If it does, then w is safely reset and the algorithm continues.
The value of mark is used as an offset; w(e) in the scan1 pseudocode is replaced
with w[e]-mark in the code.
Degree update: The second pass (scan2) computes the approximate degree
d̄i using (7.2), prunes the sets Ei and Ai, and computes a hash function h(i) for
all nodes in Lk. The hash function will be used in the next step for supernode
detection. If a live element e is found where |Le \ Lk| = 0, then aggressive element
absorption is performed. Element e is a subset of k, so it is not needed to represent
Q. At this point, degree[i] = d = d̄i - |Lk \ {i}| is computed for node i (7.2). The
|Lk \ {i}| term is added later, after mass elimination and supernode detection. If d
is zero, node i is mass eliminated along with element k; otherwise, node i remains
alive. Element k is added to Ei, and node i is placed in the hth hash bucket. Finally,
mark is incremented by max |Le| to ensure that all entries in w are less than mark.
Supernode detection: Supernode detection relies on the hash function h(i)
computed for each node i. If two nodes have identical adjacency lists, their hash
functions will be identical. Each hash bucket containing any node i ∈ Lk is consid-
ered. The first node i in the hash bucket is compared with all other nodes j; this
is repeated until the hash bucket is empty. To compare i with a set of other nodes
j, w[s]=mark is set for each node or element s in Ai or Ei. These lists have been
pruned in scan2 of all dead nodes and elements. If the adjacency lists of i and j
have the same lengths and all s in Aj or Ej are flagged, then i and j are identical.
In this case, j is absorbed into i and removed from the hash bucket. The mark is
incremented to clear the array w for the next iteration.
Finalize new element: The elimination of node k is nearly complete. All
nodes i in Lk are scanned one last time. Node i is removed from Lk if it is dead (it
may have been absorbed during supernode detection). The flagged status of nv[i]
is cleared. The degree d̄i is finalized, and node i is placed in its corresponding degree
list. The new minimum degree is found when nodes are placed back into the degree
lists. Note that the degree of the current element, dk = |Lk \ {i}|, is finally added to
the degree of each node during this final pass to complete the approximate degree
computation for (7.2). This term was not added in scan2, because it was modified
during that scan due to mass elimination. Finally, nv [k] is updated to reflect the
number of nodes represented by k (this may have increased, since k was selected as
the pivot, due to mass elimination). If the set Lk is empty, element k is a root of
the assembly tree, and element k is removed from the graph.
Postordering: The elimination is complete, but no permutation has been
computed. All that is left of the graph is the assembly tree (Cp) and a set of dead
nodes and elements (i is a dead node if nv[i] is zero and a dead element if nv[i]
> 0). It is from this information only that the final permutation is computed. The
tree is restored by unflipping all of Cp. It now forms a tree; Cp [x] is the parent of
x or -1 if x is a root. This is not the elimination tree, but it is quite similar.
If an element e has been absorbed into its parent Cp [e], then e must precede
Cp[e] in the output permutation p. Likewise, a node i must precede its parent
Cp[i]. A distinction must be made between nodes and elements. The parent of an
element is always an element. The parent of a node can be either another node or
an element, but a node can never be a root of the tree. The children of an element
e must appear before it in p, but all child elements must appear before all child
nodes, because child nodes (and their descendants in the assembly tree) reflect a
set of nodes that were absorbed into supernode e when it was selected as a pivot.
A postordering of the assembly tree gives the permutation p. The list of
children of any node x are partitioned; the child elements appear first, followed
by the child nodes. The dead element n is a placeholder for any dense rows and
columns of C, so it too is included in the postordering; it and its descendants will
be ordered at the very last, followed by n=p[n] itself. Thus, p[0...n-1] is the
resulting fill-reducing permutation. This postordering is much simpler than the
postordering in AMD, yet just as effective.
int *cs_amd (int order, const cs *A)    /* order 0:natural, 1:Chol, 2:LU, 3:QR */
{
    cs *C, *A2, *AT ;
    int *Cp, *Ci, *last, *W, *len, *nv, *next, *P, *head, *elen, *degree, *w,
        *hhead, *ATp, *ATi, d, dk, dext, lemax = 0, e, elenk, eln, i, j, k, k1,
        k2, k3, jlast, ln, dense, nzmax, mindeg = 0, nvi, nvj, nvk, mark, wnvi,
        ok, cnz, nel = 0, p, p1, p2, p3, p4, pj, pk, pk1, pk2, pn, q, n, m, t ;
    unsigned int h ;
    /* Construct matrix C */
    if (!CS_CSC (A) || order <= 0 || order > 3) return (NULL) ;    /* check */
    AT = cs_transpose (A, 0) ;                      /* compute A' */
    if (!AT) return (NULL) ;
    m = A->m ; n = A->n ;
    dense = CS_MAX (16, 10 * sqrt ((double) n)) ;   /* find dense threshold */
    dense = CS_MIN (n-2, dense) ;
    if (order == 1 && n == m)
    {
        C = cs_add (A, AT, 0, 0) ;                  /* C = A+A' */
    }
    else if (order == 2)
    {
        ATp = AT->p ;                               /* drop dense columns from AT */
        ATi = AT->i ;
The cs_wclear function is used in cs_amd to clear the w array. The condition
is true just once in the first call to cs_wclear and then when integer overflow is
near (in which case w is safely reset and the algorithm continues). cs_diag is used
to drop diagonal entries.
static int cs_wclear (int mark, int lemax, int *w, int n)
{
    int k ;
    if (mark < 2 || (mark + lemax < 0))
    {
        for (k = 0 ; k < n ; k++) if (w [k] != 0) w [k] = 1 ;
        mark = 2 ;
    }
    return (mark) ;     /* at this point, w [0..n-1] < mark holds */
}
static int cs_diag (int i, int j, double aij, void *other) { return (i != j) ; }
matched edge (i1,j1), where j1 = jmatch[i1], and then an unmatched edge (j1,i2).
The path continues until it stops at an unmatched row. The path is alternating
because every other edge in the path is matched. In general the path can be of
any odd length, and no node or edge appears in the path twice. An alternating
augmenting path of length 7 with matched edges shown in bold,
is shown in Figure 7.2. The figure also gives a matrix view of the same path.
Matched edges in the graph are shown in bold; the same edges correspond to the
diagonal elements of the permuted submatrix (drawn as a box). In the matrix view,
entries corresponding to unmatched edges are circled. To extend the matching (as
shown in Figure 7.3), k and i4 are added to C and R, respectively, and the matching
along the path is changed so that the path becomes
That is, k = jmatch[i1], j1 = jmatch[i2], and so on. The matching has been
extended by one additional edge. Note that any matched row or column remains
matched; k and i4 are added to C and R, respectively, and no nodes are removed
from C or R. If no unmatched row i4 is found, no such path exists and the matching
is not extended (this can occur only if A is structurally rank deficient). The modified
graph and matrix are shown in Figure 7.3. The four entries circled in Figure 7.2
are still circled in Figure 7.3; the rows have been permuted to place them on the
diagonal, and the box denoting the current match is one row and column larger.
The three formerly matched entries are no longer on the diagonal. There can of
course be other unmatched edges incident on these 8 nodes, and correspondingly
more off-diagonal entries in the matrix. They are left out to make the figures clearer.
The path (7.3) can be found via a depth-first search of G, starting at an
unmatched column k and traversing only alternating paths. If the whole graph is
searched at every step, the time complexity is O(|A|n) to find the entire maximum
matching for a square matrix, but typically only a small part of G needs to be
traversed before finding an alternating path (at which point the search stops). To
reduce the typical cost of finding a path, a one-step breadth-first search is performed
at each column j before continuing in a depth-first manner (called a cheap match).
Once a row i is matched, it remains matched (although the column jmatch[i] it is
matched to may change). The breadth-first search exploits this fact by splitting Aj
into two parts, the first of which contains only matched rows. When considering
a column j, rows in the second part of Aj are considered, and the splitting is
extended until an unmatched row (if any) is found. Thus, any edge is considered
only once in this breadth-first search, adding only O(|A|) to the time for finding
the entire maximum matching but greatly reducing the average-case complexity of
the algorithm.
The maxtrans and augment functions are not part of CSparse, since they
rely on a recursive depth-first search, which can cause stack overflow for very large
matrices. However, they are simpler to understand than the nonrecursive versions,
so they are discussed first. maxtrans allocates workspace (w and cheap) and
the result jmatch. Initially, all rows are unmatched. w[j] is used to mark column
node j during a depth-first search; w[j]=k if column j has been visited during
the kth step of the algorithm or w[j]<k otherwise. The splitting of Aj for the
one-step breadth-first search is given by cheap [j]; Ai [Ap [j] ... cheap [j]-1]
is known to contain only matched rows, whereas Ai [cheap [j] ... Ap [j+1]-1]
may contain both matched and unmatched rows.
After these initializations, the one-line for k loop computes the matching. It
searches for an alternating augmenting path starting at column k and augments the
matching if this path is found. At the start of the kth iteration, C is a subset of
{0 ... k-1}, and it may be extended by adding column k. When the algorithm
completes, j=jmatch[i] >= 0 if row i is matched to column j. If row i is not
matched, jmatch[i]=-1.
The recursive function augment is called by maxtrans, starting at node j=k (j
will be modified when augment calls itself recursively, but k is kept unchanged).
When at node j, it first attempts to find a cheap match (the first for loop);
cheap [j] is modified to point to the remaining part of Aj. If no cheap match
is found, all edges i ∈ Aj are considered in a depth-first manner. All of these rows i
must be already matched; otherwise, a cheap match would have already been found
and this loop would be skipped. If column jmatch[i] has not been considered
during this kth step, a recursive call is made to find an augmenting path starting at
column jmatch [i] (corresponding, for example, to column j1 in Figure 7.2). The
loop is terminated if a path is found, and the matching is revised by matching this
new row i to column j.
int *maxtrans (cs *A)   /* returns jmatch [0..m-1] */
{
    int i, j, k, n, m, *Ap, *jmatch, *w, *cheap ;
    if (!A) return (NULL) ;                             /* check inputs */
    n = A->n ; m = A->m ; Ap = A->p ;
    jmatch = cs_malloc (m, sizeof (int)) ;              /* allocate result */
    w = cs_malloc (2*n, sizeof (int)) ;                 /* allocate workspace */
    if (!w || !jmatch) return (cs_idone (jmatch, NULL, w, 0)) ;
    cheap = w + n ;
    for (j = 0 ; j < n ; j++) cheap [j] = Ap [j] ;      /* for cheap assignment */
    for (j = 0 ; j < n ; j++) w [j] = -1 ;              /* all columns unflagged */
    for (i = 0 ; i < m ; i++) jmatch [i] = -1 ;         /* no rows matched yet */
    for (k = 0 ; k < n ; k++) augment (k, A, jmatch, cheap, w, k) ;
    return (cs_idone (jmatch, NULL, w, 1)) ;
}
int augment (int k, cs *A, int *jmatch, int *cheap, int *w, int j)
{
    int found = 0, p, i = -1, *Ap = A->p, *Ai = A->i ;
    /* Start depth-first-search at node j */
    w [j] = k ;                                 /* mark j as visited for kth path */
    for (p = cheap [j] ; p < Ap [j+1] && !found ; p++)
    {
        i = Ai [p] ;                            /* try a cheap assignment (i,j) */
        found = (jmatch [i] == -1) ;
    }
    cheap [j] = p ;                             /* start here next time for j */
    /* Depth-first-search of neighbors of j */
    for (p = Ap [j] ; p < Ap [j+1] && !found ; p++)
    {
        i = Ai [p] ;                            /* consider row i */
        if (w [jmatch [i]] == k) continue ;     /* skip col jmatch [i] if marked */
        found = augment (k, A, jmatch, cheap, w, jmatch [i]) ;
    }
    if (found) jmatch [i] = j ;                 /* augment jmatch if path found */
    return (found) ;
}
The cs_augment function does the same thing as augment (k, ..., k). First,
the node j is placed on the stack js. The while loop continues until either the
stack is empty or the last node in an augmenting path is found.
int *cs_maxtrans (const cs *A, int seed)    /* [jmatch [0..m-1]; imatch [0..n-1]] */
{
    int i, j, k, n, m, p, n2 = 0, m2 = 0, *Ap, *jimatch, *w, *cheap, *js, *is,
        *ps, *Ai, *Cp, *jmatch, *imatch, *q ;
    cs *C ;
    if (!CS_CSC (A)) return (NULL) ;                    /* check inputs */
    n = A->n ; m = A->m ; Ap = A->p ; Ai = A->i ;
    w = jimatch = cs_calloc (m+n, sizeof (int)) ;       /* allocate result */
    if (!jimatch) return (NULL) ;
    for (k = 0, j = 0 ; j < n ; j++)        /* count nonempty rows and columns */
    {
        n2 += (Ap [j] < Ap [j+1]) ;
        for (p = Ap [j] ; p < Ap [j+1] ; p++)
        {
            w [Ai [p]] = 1 ;
            k += (j == Ai [p]) ;            /* count entries already on diagonal */
        }
    }
    if (k == CS_MIN (m,n))                  /* quick return if diagonal zero-free */
    {
        jmatch = jimatch ; imatch = jimatch + m ;
        for (i = 0 ; i < k ; i++) jmatch [i] = i ;
        for (      ; i < m ; i++) jmatch [i] = -1 ;
        for (j = 0 ; j < k ; j++) imatch [j] = j ;
        for (      ; j < n ; j++) imatch [j] = -1 ;
        return (cs_idone (jimatch, NULL, NULL, 1)) ;
    }
    for (i = 0 ; i < m ; i++) m2 += w [i] ;
    C = (m2 < n2) ? cs_transpose (A,0) : ((cs *) A) ;   /* transpose if needed */
    if (!C) return (cs_idone (jimatch, (m2 < n2) ? C : NULL, NULL, 0)) ;
    n = C->n ; m = C->m ; Cp = C->p ;
    jmatch = (m2 < n2) ? jimatch + n : jimatch ;
    imatch = (m2 < n2) ? jimatch : jimatch + m ;
    w = cs_malloc (5*n, sizeof (int)) ;                 /* get workspace */
    if (!w) return (cs_idone (jimatch, (m2 < n2) ? C : NULL, w, 0)) ;
    cheap = w + n ; js = w + 2*n ; is = w + 3*n ; ps = w + 4*n ;
    for (j = 0 ; j < n ; j++) cheap [j] = Cp [j] ;      /* for cheap assignment */
    for (j = 0 ; j < n ; j++) w [j] = -1 ;              /* all columns unflagged */
    for (i = 0 ; i < m ; i++) jmatch [i] = -1 ;         /* nothing matched yet */
    q = cs_randperm (n, seed) ;                         /* q = random permutation */
    for (k = 0 ; k < n ; k++)               /* augment, starting at column q[k] */
    {
        cs_augment (q ? q [k] : k, C, jmatch, cheap, w, js, is, ps) ;
    }
    cs_free (q) ;
    for (j = 0 ; j < n ; j++) imatch [j] = -1 ;         /* find row match */
    for (i = 0 ; i < m ; i++) if (jmatch [i] >= 0) imatch [jmatch [i]] = i ;
    return (cs_idone (jimatch, (m2 < n2) ? C : NULL, w, 1)) ;
}
static void cs_augment (int k, const cs *A, int *jmatch, int *cheap, int *w,
    int *js, int *is, int *ps)
{
    int found = 0, p, i = -1, *Ap = A->p, *Ai = A->i, head = 0, j ;
    js [0] = k ;                            /* start with just node k in jstack */
    while (head >= 0)
    {
        /* Start (or continue) depth-first-search at node j */
        j = js [head] ;                     /* get j from top of jstack */
        if (w [j] != k)                     /* 1st time j visited for kth path */
        {
            w [j] = k ;                     /* mark j as visited for kth path */
            for (p = cheap [j] ; p < Ap [j+1] && !found ; p++)
            {
                i = Ai [p] ;                /* try a cheap assignment (i,j) */
                found = (jmatch [i] == -1) ;
            }
            cheap [j] = p ;                 /* start here next time j is traversed */
            if (found)
            {
                is [head] = i ;             /* column j matched with row i */
                break ;                     /* end of augmenting path */
            }
            ps [head] = Ap [j] ;            /* no cheap match: start dfs for j */
        }
        /* Depth-first-search of neighbors of j */
        for (p = ps [head] ; p < Ap [j+1] ; p++)
        {
            i = Ai [p] ;                            /* consider row i */
            if (w [jmatch [i]] == k) continue ;     /* skip jmatch [i] if marked */
            ps [head] = p + 1 ;                     /* pause dfs of node j */
            is [head] = i ;                 /* i will be matched with j if found */
            js [++head] = jmatch [i] ;      /* start dfs at column jmatch [i] */
            break ;
        }
        if (p == Ap [j+1]) head-- ;         /* node j is done; pop from stack */
    }                                       /* augment the match if path found: */
    if (found) for (p = head ; p >= 0 ; p--) jmatch [is [p]] = js [p] ;
}
jmatch[i] at the top of the stack. If the for loop terminates after searching
all rows i ∈ Aj, then no match is found (yet), and j is popped from the stack by
decrementing head. Finally, if an augmenting path is found the "recursion" unwinds
by revising the matching for all unmatched edges (i, j) in this path, corresponding
to the jmatch[i]=j statement in augment.
If a column-perfect matching is found, imatch[0...n-1] is a permutation of
a subset of the rows of A (or all of the rows if A is square), and A(imatch,:)
has a zero-free diagonal. The MATLAB statement p=dmperm(A) is identical (that
is, p is imatch).
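For instance (a small check in MATLAB; the 3-by-3 matrix is made up):
A = sparse ([0 1 0 ; 1 0 1 ; 0 1 1]) ;
p = dmperm (A) ;            % p(j) = i if column j is matched to row i
all (diag (A (p,:)) ~= 0)   % true: the permuted matrix has a zero-free diagonal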
The worst-case time complexity of cs_maxtrans is O(|A|n), but this rarely
occurs in practice. cs_maxtrans can match the columns of A in reverse order (from
n-1 to 0), or in a randomized order, which can help avoid this worst-case behavior.
The cs_randperm function computes a random permutation used by cs_maxtrans. If seed
is zero, the identity permutation is returned (p=NULL). If seed is -1, the reverse
permutation is returned (p=n-1:-1:0 in MATLAB notation). Otherwise, a random
permutation is returned.
where each diagonal block is square with a zero-free diagonal and has the strong Hall
property. The strong Hall property implies full structural rank. The block trian-
gular form (7.5) is unique, except that the blocks can sometimes be interchanged.
There is often a choice of ordering within the blocks (the diagonal must remain
zero-free). To solve Ax = b with LU factorization, only the diagonal blocks need to
be factorized, followed by a block backsolve for the off-diagonal blocks. No fill-in
occurs in the off-diagonal blocks. Because each diagonal block is strong Hall, the
theorems in Chapters 5 and 6 provide tighter bounds on the nonzero pattern of the
factors.
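As an illustration of the block backsolve (a MATLAB sketch only, using dmperm and MATLAB's backslash on each diagonal block; it assumes A is square with full structural rank and a nonsingular block diagonal):
[p, q, r] = dmperm (A) ;            % block triangular form: C = A(p,q)
C = A (p,q) ; y = b (p) ;
x = zeros (size (b)) ;
nb = numel (r) - 1 ;                % number of diagonal blocks
for k = nb:-1:1                     % block backsolve, last block first
    rows = r (k) : r (k+1)-1 ;
    x (rows) = C (rows,rows) \ y (rows) ;                           % solve the diagonal block
    y (1:r(k)-1) = y (1:r(k)-1) - C (1:r(k)-1,rows) * x (rows) ;    % update earlier blocks
end
x (q) = x ;                         % undo the column permutation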
The inverse of a strong Hall matrix has no zero entries (ignoring numerical
cancellation), and thus should very rarely be computed in practice.
Permuting a square matrix with a zero-free diagonal into block triangular
form is identical to finding the strongly connected components of a directed graph,
G(A). The directed graph is defined as G(A) = (V,E), where V = {1, ..., n} and
E = {(i,j) | aij ≠ 0}. That is, the nonzero pattern of A is the adjacency matrix
of the directed graph G(A). A strongly connected component is a maximal set of
nodes such that for any pair of nodes i and j in the component, the paths i ⇝ j
and j ⇝ i both exist in the graph.
The strongly connected components of a graph can be found in many ways.
The simplest method uses two depth-first traversals, one of G(A) and the second of
the graph G(A^T). This is simple in CSparse, because cs_dfs can already perform
this depth-first traversal. It was presented in the context of a directed acyclic
graph (the graph of L) to find the nonzero pattern X = Reach(B) for the sparse
triangular solve in Section 3.2, but nothing in the design of cs_dfs limits it to
acyclic graphs. In general, the graph of A can have cycles, unlike the graph of L.
The first depth-first search returns a set X that contains all the nodes of the
graph. As a set, this is not very interesting. However, the order in which nodes
appear in X is very important. A node j is placed in the stack X in the order in
which its corresponding dfs(j) finishes. A second depth-first traversal of G(A^T),
where nodes are considered in the reverse order of their finish times (from the top
of the stack X to the bottom), determines the strongly connected components.
Every new node i found in a new depth-first search in the second pass, and all
nodes reachable from it in G(AT), define a unique strongly connected component
of G(A), denoted as Cb. The algorithm and its implementation (cs_scc) are given
below. The components are actually computed in reverse order; this detail is not
included in the sec function below. See Section 7.7 for more details on why it works.
Since A is stored in compressed-column form, Aj is the adjacency list of node
j in the graph G(A^T). However, the scc algorithm will find a permutation that puts
the adjacency matrix of the graph in block lower triangular form. A block upper
triangular form is more conventional for sparse matrix computations, so cs_scc
can be applied to the transpose. These two transposes cancel each other, so the
function scc(A)
    X = Reach_A({1 ... n})
    b = 0
    for each node i ∈ X
        if i is unmarked
            b = b + 1
            Cb = dfs(i) of the graph G(A^T)
The last part of the cs_scc function sorts the permutation vector p in linear
time (O(n)), so that the rows and columns in each block appear in their natural
order. This is not essential but useful because a subsequent fill-reducing ordering
algorithm can tend to give slightly better results if it is provided with the matrix in
natural order, P = I if the matrix consists of a single strongly connected component,
and cs_dmspy looks prettier in MATLAB.
The cs_dalloc, cs_dfree, and cs_ddone functions allocate, free, and return
a csd object that represents the strongly connected components found by cs_scc.
Note that all rows in R1 are matched; if they were not, an alternating augmenting
path could be found, extending the maximum matching (which is a contradiction).
Likewise, all columns in C3 are matched. Also note that if C1 and C3 had a column
in common, there would be an alternating augmenting path from R to C through
that column. Similarly, R1 and R3 are disjoint. All rows in R1 are matched to
some column in C1. Thus, R is divided into three disjoint subsets R1, R2, and R3,
and C is divided into three disjoint subsets C1, C2, and C3. Given this four-way
partition of the rows and columns, any matrix A can be permuted into the 4-by-4
block matrix
where A12, A23, and A34 are square with a zero-free diagonal. The transpose of the
matrix
are both rectangular and have the strong Hall property. The matrix (7.7) has a
perfect row-matching, and the matrix (7.8) has a perfect column-matching. If C is
empty, the matrix A has a column-perfect matching, and both R1 and C1 will be
empty. Likewise, if R is empty, the matrix A has a row-perfect matching, and both
R3 and C3 will be empty. Thus, it is possible for the two matrices (7.7) and (7.8)
to be empty (with no rows and columns). If they do exist, (7.7) will have more
columns than rows, corresponding to the structurally underdetermined part of the
system Ax = b, and (7.8) will have more rows than columns, corresponding to the
structurally overdetermined part of the system Ax = b. The matrix A23 need not
have the strong Hall property. If it does not, it can be permuted into block upper
triangular form, as described in Section 7.3. It has structural full rank because it is
square with a zero-free diagonal. Thus, for any matrix A, LU or QR factorization
can be applied to submatrices, all of which have the strong Hall property.
The permutation and partitioning of A given in (7.6) is unique, except that
a different maximum matching can swap columns between C and C1 and can swap
rows between R and R3 (but not arbitrarily; C1 must still be matchable to the
set R1). Otherwise the eight sets, and their sizes, are unique. The row or column
ordering within each of the eight sets is not unique in general.
The cs_dmperm function computes the Dulmage-Mendelsohn decomposition.
It returns a csd object containing the row and column permutation vectors p and
q. The four subsets of these permutation vectors are given by cc and rr; this
determines the coarse decomposition, given in (7.6). The eight sets are given by
The fine decomposition includes the permutation of the A23 submatrix into its
strongly connected components, (7.5). It is given by r and s. If C=A(p,q), the kth
block consists of rows r [k] through r [k+1] -1 and columns s [k] through s [k+1] -1
of C. The first block is the rectangular matrix (7.7) and the last block is (7.8) if
they are not 0-by-0. The middle blocks are the strongly connected components of
A23. Note that (7.7) can have columns but no rows (A11 is 0-by-cc[1] and A12 is
0-by-0). Similarly, (7.8) can have rows but no columns.
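In MATLAB, the same information is available from dmperm (a usage sketch; cs_dmperm returns the corresponding fields of its csd object):
[p, q, r, s, cc, rr] = dmperm (A) ;
C = A (p,q) ;                           % coarse and fine decomposition of A
k = 2 ;                                 % examine one block (assuming it exists)
Ck = C (r(k):r(k+1)-1, s(k):s(k+1)-1) ; % the kth diagonal block of C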
Maximum matching: cs_dmperm finds the maximum matching jmatch[i]=j
and its inverse imatch[j]=i.
Coarse decomposition: A breadth-first search starting from unmatched
columns C finds C1 and R1. Using the matrix A^T, another breadth-first search
starting from rows in R determines C3 and R3. In the code, C and R are C0 and
R0, respectively. At this point, the coarse decomposition is determined solely by
the flag arrays wi and wj and the matching jmatch and imatch. These arrays are
scanned to find the permutations p and q and the coarse set sizes cc and rr.
Fine decomposition: The strongly connected components of the A(R2,C2)
submatrix are found. The A(R2,C2) matrix is formed by first computing C=A(p,q)
and then removing all rows and columns not in the set R2 or C2. The cs_fkeep
function is used to drop entries not in R2, and then the size of R1 is subtracted from
all row indices. The strongly connected components could be found without forming
C=A(R2,C2) explicitly, but this method results in a simpler cs_scc function.
Combine decompositions: The fine and coarse decompositions are com-
bined into the r and s vectors that determine the boundaries of the blocks. The
permutation vectors p and q from the coarse decomposition are combined with the
permutation scc->p of A(R2,C2) that reveals its strongly connected components.
    r [0] = s [0] = 0 ;
    if (cc [2] > 0) nb2++ ;                 /* leading coarse block A (R1, [C0 C1]) */
    for (k = 0 ; k < nb1 ; k++)             /* coarse block A (R2,C2) */
    {
        r [nb2] = rs [k] + rr [1] ;         /* A (R2,C2) splits into nb1 fine blocks */
        s [nb2] = rs [k] + cc [2] ;
        nb2++ ;
    }
    if (rr [2] < m)
    {
        r [nb2] = rr [2] ;                  /* trailing coarse block A ([R3 R0], C3) */
        s [nb2] = cc [3] ;
        nb2++ ;
    }
    r [nb2] = m ;
    s [nb2] = n ;
    D->nb = nb2 ;
    cs_dfree (scc) ;
    return (cs_ddone (D, C, NULL, 1)) ;
}
The breadth-first search is performed by the cs_bfs function below.
static int cs_bfs (const cs *A, int n, int *wi, int *wj, int *queue,
    const int *imatch, const int *jmatch, int mark)
{
    int *Ap, *Ai, head = 0, tail = 0, j, i, p, j2 ;
    cs *C ;
    for (j = 0 ; j < n ; j++)               /* place all unmatched nodes in queue */
    {
        if (imatch [j] >= 0) continue ;     /* skip j if matched */
        wj [j] = 0 ;                        /* j in set C0 (R0 if transpose) */
        queue [tail++] = j ;                /* place unmatched col j in queue */
    }
    if (tail == 0) return (1) ;             /* quick return if no unmatched nodes */
    C = (mark == 1) ? ((cs *) A) : cs_transpose (A, 0) ;
    if (!C) return (0) ;                    /* bfs of C=A' to find R3,C3 from R0 */
    Ap = C->p ; Ai = C->i ;
    while (head < tail)                     /* while queue is not empty */
    {
        j = queue [head++] ;                /* get the head of the queue */
        for (p = Ap [j] ; p < Ap [j+1] ; p++)
        {
            i = Ai [p] ;
            if (wi [i] >= 0) continue ;     /* skip if i is marked */
            wi [i] = mark ;                 /* i in set R1 (C3 if transpose) */
            j2 = jmatch [i] ;               /* traverse alternating path to j2 */
            if (wj [j2] >= 0) continue ;    /* skip j2 if it is marked */
            wj [j2] = mark ;                /* j2 in set C1 (R3 if transpose) */
            queue [tail++] = j2 ;           /* add j2 to queue */
        }
    }
    if (mark != 1) cs_spfree (C) ;          /* free A' if it was created */
    return (1) ;
}
To find R1 and C1, cs_bfs starts at unmatched column nodes in C and tra-
verses alternating paths, according to the maximum matching found by cs_maxtrans.
The order of the nodes in the sets R1 and C1 is not important, so a simpler breadth-
first search can be used instead of a more complicated depth-first search. To find
R3 and C3, it starts at unmatched row nodes in R and searches the transpose of the
graph of A. The queue array is workspace for the breadth-first queue. cs_dmperm
passes p and q to cs_bfs to use as workspace for the breadth-first search queue.
cs_matched constructs the portions of the output permutations corresponding
to the matched submatrices (A ([R1 R2 R3], [C1 C2 C3])).
static void cs_matched (int n, const int *wj, const int *imatch, int *p, int *q,
    int *cc, int *rr, int set, int mark)
{
    int kc = cc [set], j ;
    int kr = rr [set-1] ;
    for (j = 0 ; j < n ; j++)
    {
        if (wj [j] != mark) continue ;      /* skip if j is not in C set */
        p [kr++] = imatch [j] ;
        q [kc++] = j ;
    }
    cc [set+1] = kc ;
    rr [set] = kr ;
}
n = size (A,1) ;
if (n < 2), p = 1 ; v = 1 ; d = 0 ; return ; end
opt.disp = 0 ;                          % turn off printing in eigs
opt.tol = sqrt (eps) ;
S = A | A' | speye (n) ;                % compute the Laplacian of A
S = diag (sum (S)) - S ;
[v,d] = eigs (S, 2, 'SA', opt) ;        % find the Fiedler vector v
v = v (:,2) ;
d = d (2,2) ;
[ignore p] = sort (v) ;                 % sort it to get p
There are many methods for finding a good node separator. One class of
methods starts with an edge separator and then converts it into a node separator.
Likewise, an edge separator can be found in many ways; the method discussed here
is based on the profile-reducing methods discussed in the previous section. There
are many ways of finding a nested dissection ordering; this method was chosen for
its simplicity of implementation. State-of-the-art methods are highlighted at the
end of this section. See also Section 7.7.
Suppose a profile-reducing ordering P has been found. Divide the matrix
PAP^T into its first ⌊n/2⌋ rows and columns and its last n - ⌊n/2⌋ rows and columns.
Since the profile of PAP^T has been reduced, the number of entries in A12 will be
small. If these entries (edges in the graph G) are removed, the graph splits into
two components of equal size (possibly more than two connected components if the
graphs of A11 or A22 are unconnected). The edges in A12 are an edge separator of
G. In practice, the matrix is not divided equally, since the size of the edge separator
can often be reduced if the two subgraphs are allowed to differ in size (they are kept
roughly equal in size; otherwise, a good ordering is not obtained). The cs_esep
M-file shown below finds an edge separator using symrcm.
The Dulmage-Mendelsohn decomposition can convert this edge separator into
a node separator by finding a minimal node cover of the edges in A12. Consider the
The method can select as the node cover either the set R1 ∪ C2 ∪ C3 or R1 ∪
R2 ∪ C3, both with size equal to the size of the maximum matching of A12 (its
structural rank). Any edge in S will be incident on at least one node in one of
these two sets. The cs_sep M-file selects R1 ∪ C2 ∪ C3. In practice, the set is chosen
that best balances the sizes of the left and right subgraphs. The cs_nsep M-file
constructs an edge separator and converts it into a node separator. The recursive
cs_nd M-file finds a node separator using cs_nsep and then recursively bisects the
two subgraphs. Small graphs (of order 500 or less) are ordered with cs_amd.
function [a,b] = cs_esep (A)
%CS_ESEP find an edge separator of a symmetric matrix A
% [a,b] = cs_esep(A) finds an edge separator s that splits the graph of A
% into two parts a and b of roughly equal size. The edge separator is the
% set of entries in A(a,b).
%
% See also CS_NSEP, CS_SEP, CS_ND, SYMRCM.
p = symrcm (A) ;
n2 = fix (size(A,1)/2) ;
a = p (1:n2) ;
b = p (n2+1:end) ;
[a b] = cs_esep (A) ;
[ s a b ] = cs_sep (A, a, b) ;
function p = cs_nd (A)
n = size (A,1) ;
if (n == 1)
    p = 1 ;
elseif (n < 500)
    p = cs_amd (A) ;                % use cs_amd on small graphs
else
    [s a b] = cs_nsep (A) ;         % find a node separator
    a = a (cs_nd (A (a,a))) ;       % order A(a,a) recursively
    b = b (cs_nd (A (b,b))) ;       % order A(b,b) recursively
    p = [a b s] ;                   % concatenate to obtain the final ordering
end
The Fiedler vector, or other eigenvector techniques, can lead to a smaller node
separator and orderings with lower fill-in, but they are prohibitively expensive to
compute for large graphs. To overcome this problem, the graph G of A can be
successively coarsened. A node in the coarse graph Gc of A represents a unique
set of nodes in G with node weight equal to the number of nodes it represents.
Edge weights are used to reflect the number of edges (the sum of their weights)
between sets of nodes in G. A sequence of coarser and coarser graphs G (the
original graph), G1, G2, ..., Gk is found until Gk is small enough to use powerful
edge or node separator methods efficiently. Next, the edge or node separator is
mapped to the graph of Gk-1, and refinement techniques (such as the Kernighan-
Lin algorithm) are used to improve this partition of Gk-1. The refinement process
continues until a separator of G is obtained.
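One coarsening step can be sketched in MATLAB as a greedy heavy-edge matching (an illustration only, with W a hypothetical symmetric sparse matrix of edge weights; this is not the algorithm used by METIS or the other packages cited below):
match = zeros (size (W,1), 1) ;
for i = 1:size (W,1)
    if (match (i)), continue, end               % node i is already matched
    nbrs = find (W (:,i)) ;
    nbrs = nbrs (match (nbrs) == 0 & nbrs ~= i) ;
    if (isempty (nbrs))
        match (i) = i ;                         % i becomes a singleton coarse node
    else
        [~, t] = max (W (nbrs,i)) ;             % pick the heaviest available edge
        j = nbrs (t) ;
        match (i) = j ; match (j) = i ;         % merge i and j into one coarse node
    end
end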
Figure 7.4 is the graph of the matrix in Figure 4.2 on page 39 with both node
and edge separators highlighted.
If applied to a 2D s-by-s mesh with node separators selected along a mesh
line that most evenly divides the graph, nested dissection leads to an asymptot-
ically optimal ordering, with 31(n log2 n)/8 + O(n) nonzeros in L, and requiring
829(n^(3/2))/84 + O(n log n) floating-point operations to compute, where n = s^2 is the
dimension of the matrix. For a 3D s-by-s-by-s mesh the dimension of the matrix is
n = s^3. There are O(n^(4/3)) nonzeros in L, and O(n^2) floating-point operations are
required to compute L, which is also asymptotically optimal.
the node were selected as the pivot. Methods based on variations of approximate
deficiency have been developed by Rothberg and Eisenstat [176], Ng and Raghavan
[161], and Pellegrini, Roman, and Amestoy [166].
Nested dissection is a "top-down" approach, since the first level separator
includes the root of the elimination tree and its immediate descendants. Kernighan
and Lin [140] present an early graph partitioning technique based on exchanging
pairs of nodes in the graph. Fiduccia and Mattheyses present a more efficient node-
swapping method [77]. Hager, Park, and Davis extend this idea to exchange blocks
of nodes [126]. The first nested dissection algorithm for ordering sparse matrices
is due to George [81, 82]. It relies on finding a good pseudoperipheral node, as
discussed by George and Liu [86]. See also Duff, Erisman, and Reid's discussion of
George's nested dissection method [52].
More recent approaches to nested dissection are based on multilevel meth-
ods and eigenvector techniques (in particular the Fiedler vector [78]). These in-
clude methods by Pothen, Simon, and Liou [170], Karypis and Kumar (METIS
[139]), Hendrickson and Leland (CHACO [131]), Pellegrini, Roman, and Amestoy
(SCOTCH [166]), and Walshaw, Cross, and Everett (JOSTLE [197]). Since many
matrices arise in problems with 2D and 3D geometry, Heath and Raghavan [129] and
Gilbert, Miller, and Teng (MESHPART [104]) present partitioning methods based
on the geometric position of nodes in the graph. CHOLMOD includes both AMD
and a partitioning method that combines METIS with a constrained approximate
column minimum degree ordering algorithm, CCOLAMD [30].
Many of the early sparse matrix factorization methods used a profile or enve-
lope data structure, so reducing the profile of a matrix had a direct impact on the
memory usage of the method. Profile reduction is still a useful method for more
recent factorization techniques. It can form a first step in finding a good edge or
node separator for graph partitioning and nested dissection. Cuthill and McKee [25]
developed one of the first techniques, which is still in use. Liu and Sherman [153]
showed that reversing the Cuthill-McKee ordering never increases the profile and
often reduces it. Chan and George [21] present an efficient implementation. Other
profile reduction techniques include those by Crane et al. [24], Gibbs [99], Gibbs,
Poole, and Stockmeyer [100], Hager [125], Reid and Scott [172, 173], Lewis [146],
and Sloan [187]. Eigenvector techniques are also an effective method for reducing
the profile, as discussed by Barnard, Pothen, and Simon [14], Pothen, Simon, and
Liou [170], and Kumfert and Pothen [142].
Cormen, Leiserson, and Rivest [23] describe the scc algorithm discussed in
Section 7.3. The earliest algorithm for finding the strongly connected components
of a graph is due to Tarjan [194]; Duff and Reid [59, 60] implement the algorithm.
Gustavson [120] discusses both the maximum matching and an implementation of
Tarjan's algorithm. Duff [48, 49] presents the O(|A|n)-time maximum matching
algorithm used by cs_maxtrans. Duff and Wiberg [71] implement an O(|A|√n)-
time maximum matching algorithm of Hopcroft and Karp [137] that is not always
faster than the O(|A|n)-time algorithm in practice. Pothen and Fan [169] compare
various methods for computing the block triangular form. Duff and Koster [57, 58]
present a maximum weighted matching.
Ordering methods in MATLAB are discussed in Chapter 10.
Exercises
7.1. Download a copy of AMD from www.siam.org/books/fa02 (or from www.acm.
org as ACM Algorithm 837). Compare it with cs_amd and make a list
of the differences between the two codes. Compare the run time, mem-
ory usage, and ordering quality on a large range of symmetric matrices (use
p=cs_amd(A) and compare with p=amd(A) in MATLAB). The MATLAB ex-
pression lnz=sum(symbfact(A(p,p))) gives the number of nonzeros in the
Cholesky factor L of the matrix A(p,p) (ignoring numerical cancellation).
7.2. Compare the MATLAB statement q=cs_amd(A, 2) with the permutation com-
puted by the MATLAB statement q=colamd(A). Compare the ordering time
and memory usage. Use the column ordering in [L,U,P]=lu(A(:,q)) to
compare the ordering quality (nnz(L)+nnz(U)). Add code to cs_amd and
COLAMD to compute their memory usage. COLAMD orders ATA without
forming it explicitly, so it will tend to use much less memory than cs_amd(A).
Both drop dense rows from A.
7.3. Compare the MATLAB statement p=cs_amd(A,3) with the permutation com-
puted by the MATLAB statement p=amd(A'*A) (see Problem 7.1). Compare
the ordering time and memory usage. Compare ordering quality; use rnz
=sum(symbfact(A(:,q),'col')) (where rnz is the same as nnz(qr(A,0)),
ignoring numerical cancellation and assuming A is not structurally rank defi-
cient).
7.4. Write a function that solves Ax = b by combining LU factorization with the
block triangular form (the fine Dulmage-Mendelsohn decomposition). Find
the blocks with cs_dmperm and then analyze and factorize each block with
cs_sqr and cs_lu, respectively. Next, solve Ax = b via a block backsolve.
Compare with cs_lu on matrices with many diagonal blocks (these include
matrices arising in circuit simulation and chemical process simulation). See
also Section 8.4.
7.5. Why is the block triangular form not helpful for sparse Cholesky factoriza-
tion? Hint: consider the elimination tree postordering. What happens if the
elimination tree is a forest?
7.6. Heuristics for placing large entries on the diagonal of a matrix are useful
methods for reducing the need for partial pivoting during factorization (see
[57, 58], for example). Try the following method. First, scale a copy of the
matrix A so that the largest entry in each column is equal to one. Next,
remove small entries from the matrix and use cs_maxtrans to find a zero-free
diagonal. If too many entries were dropped, decrease the drop tolerance and
try again, or simply complete the matching arbitrarily. Use the matching as
a column preordering Q and then order AQ + (AQ)^T with minimum degree.
Use a small pivot tolerance in cs_lu and determine how many off-diagonal
pivots are found.
7.7. Compare the run time of cs_dmperm with different values of seed (0, -1, and
1) on a wide range of matrices from real applications. Symmetric indefinite
Theorem 8.1 (Gilbert [101]). Ignoring numerical cancellation, the nonzero pattern
of the solution to Ax = b, where A has a zero-free diagonal, is X = Reach_A(B).
Ignoring numerical cancellation, the solution x has no zero entries if A is strong
Hall.
Theorem 8.2 (Gilbert [101]). The transitive closure of the directed graph of A is
the graph C, where C_i = Reach_A(i). Ignoring numerical cancellation, C gives the
nonzero pattern of A^-1. Every edge is present in C, and A^-1 has no zero entries,
if A is strong Hall.
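Theorem 8.1 can be exercised directly with the cs_reach routine described earlier in the book. The fragment below is an illustrative sketch (solution_pattern is not a CSparse routine): it computes X for a sparse right-hand side stored as column k of B, assuming xi has size 2n.
#include "cs.h"
/* sketch: nonzero pattern X = Reach_A(B) of the solution to Ax = b,
 * where b = B(:,k) is sparse; the pattern is returned in xi [top..n-1],
 * and xi must have size 2*n */
static int solution_pattern (cs *A, const cs *B, int k, int *xi)
{
    if (!CS_CSC (A) || !CS_CSC (B) || !xi) return (-1) ;   /* check inputs */
    return (cs_reach (A, B, k, xi, NULL)) ;                /* returns top */
}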
p = amd (A) ;
L = chol (A (p,p))' ;
x = L' \ (L \ b (p)) ;
x (p) = x ;
int cs_lusol (int order, const cs *A, double *b, double tol)
{
    double *x ;
    css *S ;
    csn *N ;
    int n, ok ;
    if (!CS_CSC (A) || !b) return (0) ;     /* check inputs */
    n = A->n ;
    S = cs_sqr (order, A, 0) ;              /* ordering and symbolic analysis */
    N = cs_lu (A, S, tol) ;                 /* numeric LU factorization */
    x = cs_malloc (n, sizeof (double)) ;    /* get workspace */
    ok = (S && N && x) ;
    if (ok)
    {
        cs_ipvec (N->pinv, b, x, n) ;       /* x = b(p) */
        cs_lsolve (N->L, x) ;               /* x = L\x */
        cs_usolve (N->U, x) ;               /* x = U\x */
        cs_ipvec (S->q, x, b, n) ;          /* b(q) = x */
    }
    cs_free (x) ;
    cs_sfree (S) ;
    cs_nfree (N) ;
    return (ok) ;
}
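A minimal calling sketch (solve_with_lu is not a CSparse routine): assuming A and b have already been constructed, for example with cs_load and cs_compress, overwrite b with the solution of Ax = b. Here order 2 selects the fill-reducing ordering intended for LU factorization, and tol = 1 requests ordinary partial pivoting.
#include "cs.h"
/* sketch: solve Ax = b in place with cs_lusol; returns 1 if successful, 0 on error */
int solve_with_lu (const cs *A, double *b)
{
    return (cs_lusol (2, A, b, 1)) ;
}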
where €22 = ^23- The overdetermined system £33X3 = 63 can first be solved for #3
using a QR factorization to obtain a least squares solution. Next, €22X2 = &2~C*23#3
can be solved for #2 using an LU factorization. When solving this system the block
upper triangular form of 622 should be exploited (see Problem 7.4). Finally, the
underdetermined system GHX\ = bi — €12X2 — Ci3x^ can be solved for x\ using the
QR factorization of C^.
This method is illustrated by the cs_dmsol M-file. It is able to find consistent
solutions to rank-deficient problems, assuming the structural rank and numeric rank
are equal. It finds the least squares solution if A is overdetermined, even though it
relies on an LU factorization for the well-determined part of the system.
[m n] = size (A) ;
[p q r s cc rr] = cs_dmperm (A) ;
C = A (p,q) ;
b = b (p) ;
x = zeros (n,1) ;
if (rr(3) <= m && cc(4) <= n)
    x (cc(4):n) = cs_qrsol (C (rr(3):m, cc(4):n), b (rr(3):m)) ;
    b (1:rr(3)-1) = b (1:rr(3)-1) - C (1:rr(3)-1, cc(4):n) * x (cc(4):n) ;
end
if (rr(2) < rr(3) && cc(3) < cc(4))
    x (cc(3):cc(4)-1) = ...
        cs_lusol (C (rr(2):rr(3)-1, cc(3):cc(4)-1), b (rr(2):rr(3)-1)) ;
    b (1:rr(2)-1) = ...
        b (1:rr(2)-1) - C (1:rr(2)-1, cc(3):cc(4)-1) * x (cc(3):cc(4)-1) ;
end
if (rr(2) > 1 && cc(3) > 1)
    x (1:cc(3)-1) = cs_qrsol (C (1:rr(2)-1, 1:cc(3)-1), b (1:rr(2)-1)) ;
end
x (q) = x ;
error is used [9], [135, Chap. 12]. UMFPACK relies on the theory and some
of the algorithms presented in nearly the whole book.
9. If A is square and full, LAPACK is used.
10. If A is sparse and not square, a sparse QR factorization based on Givens
rotations is used (Section 5.5).
11. If A is full and not square, a QR factorization based on Householder reflections
is used (in LAPACK).
The x=b/A statement in MATLAB is called the forward slash, or matrix right-
division (mrdivide). It is translated immediately into x=(A'\b')', and the above
algorithm for backslash is used. Type doc mldivide in MATLAB for more details.
Even with its host of supporting solvers, the backslash operator in MAT-
LAB 7.2 has its limitations. It does not attempt to use iterative methods. It makes
no use of ordering methods based on graph partitioning methods, and so its fill-in
can be higher than it might be otherwise. It does not use the Dulmage-Mendelsohn
decomposition. It uses LU factorization for symmetric indefinite matrices, rather
than methods that exploit symmetry.
Gilbert, Moler, and Schreiber [105] developed the original sparse backslash for
MATLAB 4.0.
Package Method
BCSLIB-EXT multifrontal
CHOLMOD left-looking supernodal
CSparse various
DSCPACK multifrontal
GPLU left-looking
KLU left-looking
LDL up-looking
MA27 multifrontal
MA28 right-looking Markowitz
MA32 frontal
MA37 multifrontal
MA38 unsymmetric multifrontal
MA41 multifrontal
MA42 frontal
HSL_MP42 frontal
MA46 finite-element multifrontal
MA47 multifrontal
MA48 left-looking
HSL_MP48 left-looking
MA49 multifrontal
MA57 multifrontal
MA62 frontal
HSL_MP62 frontal
MA67 right-looking Markowitz
Mathematica various
MATLAB various
Meschach right-looking
MUMPS multifrontal
NSPIV up-looking
Oblio left, right, multifrontal
PARDISO left/right supernodal
PaStiX left-looking supernodal
PSPASES multifrontal
RF product form of inverse
S+ right-looking supernodal
Sparse 1.4 right-looking Markowitz
SPARSPAK left-looking
SPRSBLKLLT left-looking supernodal
SPOOLES left-looking, multifrontal
SuperLU left-looking supernodal
SuperLU_MT left-looking supernodal
SuperLU_DIST right-looking supernodal
TAUCS left-looking, multifrontal
UMFPACK multifrontal
WSMP multifrontal
Y12M right-looking Markowitz
Exercises
8.1. Write a sparse backslash algorithm, just like x=A\b in MATLAB, that solves
Ax = b. Assume A is sparse, and b can be a full or sparse vector. Examine A
and determine its properties. If A is upper or lower triangular, use cs_usolve
or cs_lsolve (or cs_spsolve if b is sparse). If it is square, symmetric, and
all its diagonal entries are greater than zero, try cs_chol. Otherwise (or if
cs_chol fails), use cs_lu if it is square or cs_qr if it is rectangular. Order
the matrix as appropriate, factorize it, and then perform the appropriate
forward/backsolves. For LU factorization, optionally select order and tol
based on how symmetric the nonzero pattern is and how large the diagonal
entries are relative to the off-diagonal entries. For yet more possibilities, see
Sections 8.4 and 8.5. Optionally allow b to be a matrix.
8.2. CSparse does very little error checking of its inputs. CSparse checks only a
few key error conditions: if it runs out of memory, if the matrix is singular
for an LU factorization, if the matrix is not positive definite for a Cholesky
factorization, if the matrix has the wrong type (compressed-column versus
triplet), or if the row or column index is negative in cs_entry. Add more
error checking to CSparse. See also Problem 2.12.
8.3. Add a floating-point operation (flop) counter to CSparse. Avoid adding state-
ments such as flop++. Use a global double flop variable.
8.4. Modify cs_qrsol so that it returns x, r, and ||r||_2.
8.5. Iterative refinement is a process that can improve the accuracy of the solu-
tion to Ax = b. In the sparse case, it is most useful in LU factorization when a
small pivot tolerance is used. In MATLAB notation, x=A\b ; x=x+A\(b-A*x),
where of course A needs to be factorized only once. Add iterative refinement
to cs_lusol. Note that b-A*x can be computed with cs_gaxpy. MATLAB
uses iterative refinement with sparse backward error [9], [135, Chap. 12] in
x=A\b when A is sparse and unsymmetric, so in this case iterative refinement
will not lead to any improvement, since it has already been done. Compare
with cs_lu with a very small pivot tolerance and no iterative refinement.
8.6. Modify cs_lusol, adding a qrbound parameter. Use this as the qr input to
cs_sqr. For a wide range of matrices, determine how close |L| and |U| are
to their upper bounds, computed when qrbound=1. Determine how good the
guess is when qrbound=0. Add a parameter a to cs_lusol and cs_sqr that
modifies the initial guess (replace |U| = 4|A| + n in cs_sqr with a times the
upper bound) and experiment with this parameter.
8.7. Modify cs_lusol so that b can have multiple columns.
8.8. Write a version of cs_lusol where b is a sparse n-by-k matrix.
8.9. Repeat Problems 8.7 and 8.8 for cs_cholsol and cs_qrsol.
8.10. Repeat Problem 8.1, where b can have multiple columns.
8.11. Write a MATLAB interface for CXSparse. Note that MATLAB and CX-
Sparse use different methods for storing complex values.
Chapter 9
CSparse
cs_add: C = αA + βB
cs *cs_add (const cs *A, const cs *B, double alpha, double beta) ;
Adds two sparse matrices, C = αA + βB.
A in sparse matrix
B in sparse matrix
alpha in scalar
beta in scalar
returns C=alpha*A+beta*B; NULL on error
cs_cholsol: solve Ax = b using Cholesky factorization
int cs_cholsol (int order, const cs *A, double *b) ;
Solves Ax = b, where A is symmetric positive definite.
order in ordering method to use (0 or 1)
A in sparse matrix; only upper triangular part used
b in/out size n; b on input, x on output
returns 1 if successful; 0 on error
cs_compress: triplet form to compressed-column conversion
cs *cs_compress (const cs *T) ;
Converts a triplet-form matrix T into a compressed-column matrix C. The
columns of C are not sorted, and duplicate entries may be present in C.
T in sparse matrix in triplet form
returns C if successful; NULL on error
cs_dupl: remove duplicate entries
int cs_dupl (cs *A) ;
Removes and sums duplicate entries in a sparse matrix.
A in/out sparse matrix; duplicates summed on output
returns 1 if successful; 0 on error
cs_norm: 1-norm of a sparse matrix
double cs_norm (const cs *A) ;
Computes the 1-norm of a sparse matrix (the maximum column sum of absolute values).
A in sparse matrix
returns the 1-norm if successful; -1 on error
cs_print: print a sparse matrix
int cs_print (const cs *A, int brief) ;
Prints a compressed-column or triplet-form sparse matrix.
A in sparse matrix
brief in print all of A if zero, a few entries otherwise
returns 1 if successful; 0 on error
cs_qrsol: solve a least squares or underdetermined problem
int cs_qrsol (int order, const cs *A, double *b) ;
Solves a least squares problem (min ||Ax-b||_2, where A is m-by-n with m >= n),
or an underdetermined system (Ax = b, where m < n).
order in ordering method to use (0 or 3)
A in sparse matrix
b in/out size max(m,n); b (size m) on input, x (size n) on output
returns 1 if successful; 0 on error
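A minimal calling sketch (least_squares is not a CSparse routine): b must be allocated with max(m,n) entries and holds b on input; the solution x (length n) overwrites its first n entries. Here order 3 requests the amd(A'*A) ordering.
#include "cs.h"
/* sketch: least squares solve with cs_qrsol; returns 1 if successful, 0 on error */
int least_squares (const cs *A, double *b)
{
    return (cs_qrsol (3, A, b)) ;
}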
cs_transpose: C = AT
cs *cs_transpose (const cs *A, int values) ;
A in sparse matrix
values in pattern only if 0, both pattern and values otherwise
returns C=A'; NULL on error
cs_ipvec: x = P^T b
int cs_ipvec (const int *p, const double *b, double *x, int n) ;
Permutes a vector; x = P^T b. In MATLAB notation, x(p)=b.
p in permutation vector
b in input vector
x out x(p)=b, output vector
n in length of p, b, and x
returns 1 if successful; 0 on error
cs_lsolve: solve a lower triangular system Lx = b
int cs_lsolve (const cs *L, double *x) ;
Solves a lower triangular system Lx = b, where x and b are dense vectors.
The diagonal of L must be the first entry of each column.
L in lower triangular matrix
x in/out size n; right-hand side on input, solution on output
returns 1 if successful; 0 on error
Scatters and sums a sparse vector A(:,j) into a dense vector, x = x + beta*A(:,j).
A in the sparse vector is A (:, j)
j in the column of A to use
beta in scalar multiplied by A (:, j)
w in/out size m; node i is marked if w[i]=mark
x in/out size m; ignored if NULL
mark in mark value for w
C in/out pattern of x accumulated in C->i
nz in pattern of x placed in C starting at C->i [nz]
returns new value of nz; -1 on error
cs_scc: strongly connected components of a square matrix
csd *cs_scc (cs *A) ;
A in matrix to analyze (A->p modified then restored)
returns strongly connected components D; NULL on error
cs_spsolve: sparse lower or upper triangular solve, Lx = b or Ux = b
int cs_spsolve (cs *G, const cs *B, int k, int *xi, double *x,
const int *pinv, int lo) ;
If lo is zero, Ux = b is solved, where G = U is upper triangular and b is the
kth column of B. Otherwise Lx = b is solved, where G = L. Both b and x are
sparse; X is the nonzero pattern of x.
G in lower or upper triangular matrix (L or U)
B in right-hand side, b = B(:,k)
k in use kth column of B as right-hand side
xi out size 2*n; X in xi[top...n-1]
x out size n; x in x[xi[top...n-1]]
pinv in mapping of rows to columns of L, ignored if NULL
lo in 1 if lower triangular, 0 if upper
returns top; -1 on error
cs_tdfs: postorder a tree
int cs_tdfs (int j, int k, int *head, const int *next, int *post,
int *stack) ;
All arrays are of size n, where n is the number of nodes in the tree.
j in postorder the tree rooted at node j
k in number of nodes ordered so far
head in/out head[i] is first child of node i; -1 on output
next in next [i] is next sibling of i or -1 if none
post in/out postordering
stack work size n
returns new value of k; -1 on error
9.3. Tertiary CSparse routines and definitions 157
Each one-line prototype described in this chapter is listed in cs.h. The cs.h
file ends with the following lines, which define the CSparse macros.
#define CS_MAX(a,b) (((a) > (b)) ? (a) : (b))
#define CS_MIN(a,b) (((a) < (b)) ? (a) : (b))
#define CS_FLIP(i) (-(i)-2)
#define CS_UNFLIP(i) (((i) < 0) ? CS_FLIP(i) : (i))
#define CS_MARKED(w,j) (w [j] < 0)
#define CS_MARK(w,j) { w [j] = CS_FLIP (w [j]) ; }
#define CS_CSC(A) (A && (A->nz == -1))
#define CS_TRIPLET(A) (A && (A->nz >= 0))
#endif
9.4 Examples
The three example programs below exercise every routine and nearly every line of
code in CSparse (all but out-of-memory condition handling).
#include "cs.h"
int main (void)
{
cs *T, *A, *Eye, *AT, *C, *D ;
int i, m ;
T = cs_load (stdin) ; /* load triplet matrix T from stdin */
printf ("T:\n") ; cs_print (T, 0) ; /* print T */
A = cs_compress (T) ; /* A = compressed-column form of T */
printf ("A:\n") ; cs_print (A, 0) ; /* print A */
cs_spfree (T) ; /* clear T */
AT = cs_transpose (A, 1) ; /* AT = A' */
printf ("AT:\n") ; cs_print (AT, 0) ; /* print AT */
m = A ? A->m : 0 ; /* m = # of rows of A */
T = cs_spalloc (m, m, m, 1, 1) ; /* create triplet identity matrix */
for (i = 0 ; i < m ; i++) cs_entry (T, i, i, 1) ;
Eye = cs_compress (T) ; /* Eye = speye (m) */
cs_spfree (T) ;
C = cs_multiply (A, AT) ; /* C = A*A' */
D = cs_add (C, Eye, 1, cs_norm (C)) ; /* D = C + Eye*norm (C,1) */
printf ("D:\n") ; cs_print (D, 0) ; /* print D */
cs_spfree (A) ; /* clear A AT C D Eye */
cs_spfree (AT) ;
cs_spfree (C) ;
cs_spfree (D) ;
cs_spfree (Eye) ;
return (0) ;
}
The t1 file can be used as input to cs_demo1. It contains the triplet form of the
matrix used in Section 2.1:
2 2 3.0
1 0 3.1
3 3 1.0
0 2 3.2
1 1 2.9
3 0 3.5
3 1 0.4
1 3 0.9
0 0 4.5
2 1 1.7
The cs_demo1.m script below is the MATLAB equivalent for the C program
cs_demo1, except that the CSparse results are compared with the same operations
in MATLAB. The MATLAB load statement can read a triplet-form matrix in the
same format as the t1 file above, except that MATLAB expects its matrices to be
1-based. MATLAB always returns matrices with sorted columns.
%CS_DEMO1: MATLAB version of the CSparse/Demo/cs_demo1.c program.
% Uses both MATLAB functions and CSparse mexFunctions, and compares the two
% results. This demo also plots the results, which the C version does not do.
load ../../Matrix/t1
T = t1
A = sparse (T(:,1)+1, T(:,2)+1, T(:,3))
A2 = cs_sparse (T(:,1)+1, T(:,2)+1, T(:,3))
160 Chapter 9. CSparse
The output of the C program cs_demo1 is given below. Compare it with the triplet
and compressed-column matrices defined in Section 2.1. Also compare it with the
output of the MATLAB equivalent code above. The maximum number of entries
that T can hold is 16; it was doubled four times from its original size of one entry.
T:
CSparse Version 2.0.1, May 27, 2006. Copyright (c) Timothy A. Davis, 2006
triplet: 4-by-4, nzmax: 16 nnz: 10
    2 2 : 3
    1 0 : 3.1
    3 3 : 1
    0 2 : 3.2
    1 1 : 2.9
    3 0 : 3.5
    3 1 : 0.4
    1 3 : 0.9
    0 0 : 4.5
    2 1 : 1.7
A:
CSparse Version 2.0.1, May 27, 2006. Copyright (c) Timothy A. Davis, 2006
4-by-4, nzmax: 10 nnz: 10, 1-norm: 11.1
    col 0 : locations 0 to 2
      1 : 3.1
      3 : 3.5
      0 : 4.5
    col 1 : locations 3 to 5
      1 : 2.9
      3 : 0.4
      2 : 1.7
    col 2 : locations 6 to 7
      2 : 3
      0 : 3.2
    col 3 : locations 8 to 9
      3 : 1
      1 : 0.9
AT:
CSparse Version 2.0.1, May 27, 2006. Copyright (c) Timothy A. Davis, 2006
4-by-4, nzmax: 10 nnz: 10, 1-norm: 7.7
    col 0 : locations 0 to 1
      0 : 4.5
      2 : 3.2
    col 1 : locations 2 to 4
      0 : 3.1
      1 : 2.9
      3 : 0.9
    col 2 : locations 5 to 6
      1 : 1.7
      2 : 3
    col 3 : locations 7 to 9
      0 : 3.5
      1 : 0.4
      3 : 1
D:
CSparse Version 2.0.1, May 27, 2006. Copyright (c) Timothy A. Davis, 2006
4-by-4, nzmax: 16 nnz: 16, 1-norm: 139.58
    col 0 : locations 0 to 3
      1 : 13.95
      3 : 15.75
      0 : 100.28
      2 : 9.6
    col 1 : locations 4 to 7
      1 : 88.62
      3 : 12.91
      0 : 13.95
      2 : 4.93
    col 2 : locations 8 to 11
      1 : 4.93
      3 : 0.68
      2 : 81.68
      0 : 9.6
    col 3 : locations 12 to 15
      1 : 12.91
      3 : 83.2
      0 : 15.75
      2 : 0.68
The following problem structure and utility routines are shared by the cs_demo2 and
cs_demo3 programs.
#include "cs.h"
typedef struct problem_struct
{
cs *A ;
cs *C ;
int sym ;
double *x ;
double *b ;
double *resid ;
} problem ;
/* C = A + triu(A.l)' */
static cs *make_sym (cs *A)
{
cs *AT, *C ;
AT = cs_transpose (A, 1) ; /* AT = A' */
cs_fkeep (AT, &dropdiag, NULL) ; /* drop diagonal entries from AT */
C = cs_add (A, AT, 1, 1) ; /* C = A+AT */
cs_spfree (AT) ;
return (C) ;
}
/* infinity-norm of x */
static double norm (double *x, int n)
{
int i ;
double normx = 0 ;
for (i = 0 ; i < n ; i++) normx = CS_MAX (normx, fabs (x [i])) ;
return (normx) ;
}
if (tol > 0) cs_droptol (A, tol) ; /* drop tiny entries (just to test) */
Prob->C = C = sym ? make_sym (A) : A ; /* C = A + triu(A,1)', or C=A */
if (!C) return (free_problem (Prob)) ;
printf ("\n Matrix: %d-by-%d, nnz: %d (sym: %d: nnz %d), norm: %8.2e\n",
m, n, A->p [n], sym, sym ? C->p [n] : 0, cs_norm (C)) ;
if (nz1 != nz2) printf ("zero entries dropped: %d\n", nz1 - nz2) ;
if (nz2 != A->p [n]) printf ("tiny entries dropped: %d\n", nz2 - A->p [n]) ;
Prob->b = cs_malloc (mn, sizeof (double)) ;
Prob->x = cs_malloc (mn, sizeof (double)) ;
Prob->resid = cs_malloc (mn, sizeof (double)) ;
return ((!Prob->b || !Prob->x || !Prob->resid) ? free_problem (Prob) : Prob) ;
}
/* free a problem */
problem *free_problem (problem *Prob)
{
if (!Prob) return (NULL) ;
cs_spfree (Prob->A) ;
if (Prob->sym) cs_spfree (Prob->C) ;
cs_free (Prob->b) ;
cs_free (Prob->x) ;
cs_free (Prob->resid) ;
return (cs_free (Prob)) ;
}
/* solve a linear system using Cholesky, LU, and QR, with various orderings */
int demo2 (problem *Prob)
{
cs *A, *C ;
double *b, *x, *resid, t, tol ;
int k, m, n, ok, order, nb, ns, *r, *s, *rr, sprank ;
csd *D ;
if (!Prob) return (0) ;
A = Prob->A ; C = Prob->C ; b = Prob->b ; x = Prob->x ; resid = Prob->resid;
m = A->m ; n = A->n ;
tol = Prob->sym ? 0.001 : 1 ; /* partial pivoting tolerance */
D = cs_dmperm (C, 1) ; /* randomized dmperm analysis */
if (!D) return (0) ;
nb = D->nb ; r = D->r ; s = D->s ; rr = D->rr ;
sprank = rr [3] ;
for (ns = 0, k = 0 ; k < nb ; k++)
{
ns += ((r [k+1] == r [k]+1) && (s [k+1] == s [k]+1)) ;
}
printf ("blocks: %d singletons: %d structural rank: %d\n", nb, ns, sprank) ;
cs_dfree (D) ;
for (order = 0 ; order <= 3 ; order += 3) /* natural and amd(A'*A) */
{
if (!order && m > 1000) continue ;
printf ("QR ") ;
print_order (order) ;
rhs (x, b, m) ; /* compute right-hand side */
t = tic () ;
ok = cs_qrsol (order, C, x) ; /* min norm(Ax-b) with QR */
printf ("time: 7,8.2f ", toe (t)) ;
print_resid (ok, C, x, b, resid) ; /* print residual */
}
The output of cs_demo2 for the bcsstk01, fs_183_1, mbeacxc, west0067, and
lp_afiro matrices is shown below. One matrix (mbeacxc) is actually 496-by-496,
but cs_load returns a matrix of size 492-by-490 as determined by the largest row
and column index of nonzero entries in the matrix. The matrix has a numeric and
structural rank of 448, which is why the residual is nan.
Matrix: 48-by-48, nnz: 224 (sym: -1: nnz 400), norm: 3.57e+09
blocks: 1 singletons: 0 structural rank: 48
QR natural time: 0.00 resid: 2.83e-19
QR amd(A'*A) time: 0.00 resid: 5.19e-19
LU natural time: 0.00 resid: 2.63e-19
LU amd(A+A') time: 0.00 resid: 8.63e-20
LU amd(S'*S) time: 0.00 resid: 2.04e-19
LU amd(A'*A) time: 0.00 resid: 2.04e-19
Chol natural time: 0.00 resid: 1.90e-19
Chol amd(A+A') time: 0.00 resid: 2.01e-19
/* Cholesky update/downdate */
int demo3 (problem *Prob)
{
cs *A, *C, *W = NULL, *WW, *WT, *E = NULL, *W2 ;
int n, k, *Li, *Lp, *Wi, *Wp, p1, p2, *p = NULL, ok ;
double *b, *x, *resid, *y = NULL, *Lx, *Wx, s, t, t1 ;
css *S = NULL ;
csn *N = NULL ;
if (!Prob || !Prob->sym || Prob->A->n == 0) return (0) ;
A = Prob->A ; C = Prob->C ; b = Prob->b ; x = Prob->x ; resid = Prob->resid;
n = A->n ;
if (!Prob->sym || n == 0) return (1) ;
rhs (x, b, n) ; /* compute right-hand side */
printf ("\nchol then update/downdate ") ;
print_order (1) ;
y = cs_malloc (n, sizeof (double)) ;
t = tic () ;
S = cs_schol (1, C) ; /* symbolic Chol, amd(A+A') */
printf ("\nsymbolic chol time %8.2f\n", toc (t)) ;
t = tic () ;
N = cs_chol (C, S) ; /* numeric Cholesky */
printf ("numeric chol time %8.2f\n", toc (t)) ;
if (!S || !N || !y) return (done3 (0, S, N, y, W, E, p)) ;
t = tic () ;
cs_ipvec (S->pinv, b, y, n) ; /* y = P*b */
cs_lsolve (N->L, y) ; /* y = L\y */
cs_ltsolve (N->L, y) ; /* y = L'\y */
cs_pvec (S->pinv, y, x, n) ; /* x = P'*y */
printf ("solve chol time %8.2f\n", toc (t)) ;
printf ("original: ") ;
print_resid (1, C, x, b, resid) ; /* print residual */
k = n/2 ; /* construct W */
W = cs_spalloc (n, 1, n, 1, 0) ;
if (!W) return (done3 (0, S, N, y, W, E, p)) ;
Lp = N->L->p ; Li = N->L->i ; Lx = N->L->x ;
Wp = W->p ; Wi = W->i ; Wx = W->x ;
Wp [0] = 0 ;
p1 = Lp [k] ;
Wp [1] = Lp [k+1] - p1 ;
s = Lx [p1] ;
srand (1) ;
for ( ; p1 < Lp [k+1] ; p1++)
{
p2 = p1 - Lp [k] ;
Wi [p2] = Li [p1] ;
Wx [p2] = s * rand () / ((double) RAND_MAX) ;
}
t = tic () ;
ok = cs_updown (N->L, +1, W, S->parent) ; /* update: L*L'+W*W' */
t1 = toc (t) ;
printf ("update: time: %8.2f\n", t1) ;
if (!ok) return (done3 (0, S, N, y, W, E, p)) ;
t = tic () ;
cs_ipvec (S->pinv, b, y, n) ; /* y = P*b */
cs_lsolve (N->L, y) ; /* y = L\y */
cs_ltsolve (N->L, y) ; /* y = L'\y */
cs_pvec (S->pinv, y, x, n) ; /* x = P'*y */
t = toc (t) ;
p = cs_pinv (S->pinv, n) ;
W2 = cs_permute (W, p, NULL, 1) ; /* E = C + (P'W)*(P'W)' */
WT = cs_transpose (W2, 1) ;
WW = cs_multiply (W2, WT) ;
cs_spfree (WT) ;
cs_spfree (W2) ;
Matrix: 4884-by-4884, nnz: 147631 (sym: -1: nnz 290378), norm: 7.01e+09
Chapter 10
Sparse matrices in MATLAB
Almost all MATLAB operators and functions work seamlessly on both sparse and
full matrices. It is possible to write an efficient MATLAB M-file that can operate
on either full or sparse matrices with no changes to the code. In MATLAB, "sparse"
is an attribute of the data structure used to represent a matrix.
Sparsity propagates in MATLAB; if a function or operator has sparse operands,
the result is usually sparse. A fixed set of rules determines the storage class (sparse
or full) of the result. In general, unary functions and operators return a result of
the same storage class as the input. For example, chol (A) is sparse if A is sparse
and full otherwise. The result of a binary operator (A+B, for example) is sparse if
both A and B are sparse and full if both A and B are full. If the operands are mixed,
the result is usually full, unless the operation preserves sparsity ([A B], [A;B], and
A.*B are sparse if either A or B are sparse, for example). The submatrix A(i,j) has
the same type as A, unless it is a scalar (in which case A(i,j) is full). Submatrix
assignment (A(i, j) = . . .) leaves the storage class of A unchanged.
If for loops are unavoidable, at least try to create subsets of more than one entry
at a time. This function computes the same matrix as mesh2d2. Preallocating a
matrix at its final size is better than repeatedly appending new entries to it.
function A = mesh2d1 (n)
% create an n-by-n 2D mesh for the 2nd difference operator
ii = zeros (5*n^2, 1) ;  % preallocate ii, jj, and xx
jj = zeros (5*n^2, 1) ;
xx = zeros (5*n^2, 1) ;
k = 1;
for j = 0:n-l
for i = 0:n-l
s = j*n+i + 1 ;
ii (k:k+4) = [(j-1)*n+i j*n+(i-1) j*n+i j*n+(i+1) (j+1)*n+i ] + 1 ;
jj (k:k+4) = [s s s s s] ;
xx (k:k+4) = [-1 -1 4 -1 -1] ;
k = k + 5 ;
end
end
% remove entries beyond the boundary
keep = find (ii >= 1 & ii <= n^2 & jj >= 1 & jj <= n^2) ;
A = sparse (ii (keep), jj (keep), xx (keep)) ;
{
for (i = 0 ; i < s ; i++) P [i] = rand () % n ;
for (j = 0 ; j < s ; j++)
{
for (i = 0 ; i < s ; i++)
{
cs_entry (T, P [i], P [j], rand () / (double) RAND_MAX) ;
}
}
}
for (i = 0 ; i < n ; i++) cs_entry (T, i, i, 1) ;
A = cs_compress (T) ;
cs_spfree (T) ;
return (cs_dupl (A) ? A : cs_spfree (A)) ;
}
The worst way to create a sparse matrix is with a statement A(i,j) = ..., where
i and j are scalars. Below are four methods of creating the same matrix A. The
first method is an example of what not to do.
% method 1: A(i,j) = ...
rand ('state', 0) ;
A = sparse (n,n) ;
for k = l:nz
% compute some arbitrary entry and add it into the matrix
i = 1 + fix (n * rand (1)) ;
j = 1 + fix (n * rand (1)) ;
x = rand (1) ;
A (i,j) = A (i,j) + x ; % VERY slow, esp. if A(i,j) not already nonzero!
end
If the number of entries is unknown, the size of the triplet matrix can be increased
as needed, just like cs_entry.
% method 3: triplet form, one entry at a time, pretend nz is unknown
rand ('state', 0) ;
len = 16 ;
ii = zeros (len, 1) ;
jj = zeros (len, 1) ;
xx = zeros (len, 1) ;
for k = l:nz
% compute some arbitrary entry and add it into the matrix
if (k > len)
% double the size of ii,jj,xx
len = 2*len ;
ii (len) = 0 ;
jj (len) = 0 ;
xx (len) = 0 ;
end
ii (k) = 1 + fix (n * rand (1)) ;
jj (k) = 1 + fix (n * rand (1)) ;
xx (k) = rand (1) ;
end
A = sparse (ii (1:k), jj (1:k), xx (1:k)) ;
Each of the above four methods constructs the same matrix A. The first is exceed-
ingly slow, taking O(|A|^2) time. Methods 2 and 3 take about the same time, but
method 4 is much faster (methods 2, 3, and 4 all take O(|A|) time, however).
Finally, never create a full matrix A and then convert it to sparse. For example,
A=sparse(eye(n)) takes O(n^2) time and memory, but A=speye(n) takes only O(n)
time and memory. The difference is dramatic for large n.
operators (highlights only; all MATLAB operators work for sparse matrices)
\ Backslash, or mldivide. See Section 8.5.
/ Slash, or mrdivide. See Section 8.5.
' C=A' is the transpose of A (complex conjugate transpose if A is
complex).
+ C=A+B adds two matrices. C is sparse if A and B are sparse.
- C=A-B subtracts two matrices. C is sparse if A and B are sparse.
* C=A*B multiplies two matrices. C is sparse if A and B are sparse.
; C=[A;B] concatenates A and B vertically; A and B must have the
same number of columns. C is sparse if A or B are sparse.
, C=[A,B] concatenates A and B horizontally; A and B must have the
same number of rows. C is sparse if A or B are sparse.
iterative methods:
Iterative methods exploit sparsity when solving Ax = b by not factorizing A.
They typically rely on repeated matrix-vector multiplications. Methods in MAT-
LAB include bicg, bicgstab, cgs, gmres, lsqr, minres, pcg, qmr, and symmlq.
tree and graph operations
etree parent=etree(A) is the elimination tree of triu(A)+triu(A)'.
[parent post]=etree(A) also returns the elimination tree post-
ordering. etree(A,'col') finds the elimination tree of A'*A.
etreeplot etreeplot (A) plots a picture of the elimination tree of A+A'.
gplot gplot(A,xy) plots a picture of the undirected graph of A+A',
where the n-by-2 matrix xy gives the x-y coordinates of each node.
symbfact Symbolic Cholesky factorization. c=symbfact(A) is a vector of
column counts of the Cholesky factor L=chol(A)', where only
triu(A) is accessed. symbfact(A,'col') analyzes A'*A but
does not form it. Additional outputs are [c h parent post
R] = symbfact(...), where h is the height of the elimination tree,
parent is the tree, post is the postordering of the tree, and R is a
binary matrix with the same nonzero pattern as chol(A).
treelayout [x, y, h] =treelayout (parent) finds x-y coordinates for the nodes
of a tree, and the height h of the tree, for use in treeplot.
treeplot treeplot (parent) plots a picture of the elimination tree.
functions that partially work on sparse matrices or that have sparse substitutes
cholupdate Rank-1 update/downdate of a full Cholesky factorization. Use
CHOLMOD or cs_updown for the sparse case.
cond Use condest instead.
eig Use eigs instead, or d=eig(A) for sparse symmetric A.
norm Works for the 1-norm, ∞-norm, vector 2-norm, and Frobenius norm.
Use normest(A) to estimate the 2-norm of a sparse matrix A.
poly Only works if A is symmetric.
svd Use svds instead.
functions and features that do not work on sparse matrices
Functions and features of MATLAB that do not work at all for sparse matri-
ces include N-dimensional arrays for N > 2, different types (only double and com-
plex double are available), airy, bessel, besselj, bessely, besseli, besselk,
besselh, betainc, bitand, bitcmp, bitor, bitxor, bitset, bitget, bitshift,
complex, condeig, conv2, convn, deconv, erf, erfc, erfcx, fft, fft2, fftn,
filter, filter2, funm, gamma, gammaln, gsvd, hess, histc, ifft, ifftn, linsolve,
logm, lsqnonneg, null, ordeig, ordqz, ordschur, orth, qz, pinv, psi, rank,
rcond, reallog, realpow, realsqrt, residue, rsf2csf, schur, sqrtm, ss2zp,
subspace, surfnorm, and tzero.
The following accept sparse inputs but produce full outputs: ellipj, ellipke,
erfcinv, erfinv, expint, gammainc, legendre, polyeig, polyval, and polyvalm.
where A12, A23, and A34 are square with zero-free diagonals. The columns of A11
are the unmatched columns, and the rows of A44 are the unmatched rows. Any
of these blocks can be empty. In the coarse decomposition, the (i,j)th block is
C(rr(i):rr(i+1)-1,cc(j):cc(j+1)-1). In terms of a linear system, [A11 A12]
is the underdetermined part of the system (it is always rectangular and with more
columns than rows or 0-by-0), A23 is the well-determined part of the system (it is
always square), and [A34 ; A44] is the overdetermined part of the system (it is
always rectangular with more rows than columns or 0-by-0). The structural rank of
A is rr(4)-1. The A23 submatrix is further subdivided into block upper triangular
form via the fine decomposition (the strongly connected components of A23).
C(r(i):r(i+1)-1,s(j):s(j+1)-1) is the (i,j)th block of the fine decompo-
sition. The (1,1) block is the rectangular block [A11 A12], unless this block is
0-by-0. The (b,b) block is the rectangular block [A34 ; A44], unless this block is
0-by-0, where b = length(r)-1. All other diagonal blocks are submatrices of A23
and are square with a zero-free diagonal.
A second argument provides a seed for a randomized maximum matching.
See also cs_dmspy, cs_dmsol, dmperm, sprank, cs_randperm.
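The same analysis is available in C via cs_dmperm. The fragment below is an illustrative sketch (dm_summary is not a CSparse routine); it uses 0-based indices, unlike the 1-based MATLAB indices above, and seed = 0 selects the non-randomized matching.
#include <stdio.h>
#include "cs.h"
/* sketch: print the structural rank and number of fine blocks of A */
static void dm_summary (const cs *A)
{
    csd *D = cs_dmperm (A, 0) ;     /* Dulmage-Mendelsohn decomposition */
    if (!D) return ;
    printf ("structural rank %d, %d fine blocks\n", D->rr [3], D->nb) ;
    cs_dfree (D) ;
}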
cs_ltsolve: solve a sparse upper triangular system L'x = b
Usage: x = cs_ltsolve(L,b)
MATLAB equivalent: x = L'\b
Solves a sparse upper triangular system. L must be lower triangular with a
zero-free diagonal, b must be a full vector.
See also cs_lsolve, cs_usolve, cs_utsolve, mldivide.
cs_lu: sparse LU factorization
Usage: [L,U,p,q] = cs_lu(A,tol)
MATLAB equivalent: [L,U,P,Q] = lu(A,tol)
[L,U,p] = cs_lu(A) factorizes A(p,:) into L*U.
[L,U,p] = cs_lu(A,tol) factorizes A(p,:) into L*U. Entries on the diagonal
are given preference in partial pivoting.
[L,U,p,q] = cs_lu(A) factorizes A(p,q) into L*U, using a fill-reducing or-
dering q = cs_amd(A,2). Normal partial pivoting is used.
[L,U,p,q] = cs_lu(A,tol) factorizes A(p,q) into L*U, using a fill-reducing
ordering q = cs_amd(A). Entries on the diagonal are given preference in partial
pivoting. With a pivot tolerance tol, the entries in L have magnitude 1/tol or
less. tol = 1 is normal partial pivoting (but with the q = cs_amd(A) ordering),
tol = 0 ensures p = q. 0 < tol < 1 is relaxed partial pivoting; the diagonal is selected
if it is at least tol*max(abs(A(:,k))).
See also cs_amd, lu, umfpack, amd, colamd.
cs_lusol: solve Ax = b using LU factorization
Usage: x = cs_lusol(A,b,order,tol)
MATLAB equivalent: x = A\b
x = cs_lusol(A,b) computes x=A\b, where A is sparse and square, and b
is a full vector. The ordering cs_amd(A,2) is used. x = cs_lusol(A,b,1) also
computes x=A\b but uses the cs_amd(A) ordering with diagonal preference (default
tol=0.001). If order is present, cs_amd(A,order) is used.
See also cs_lu, cs_amd, cs_cholsol, cs_qrsol, mldivide.
cs_make: compiles CSparse for use in MATLAB
Usage: cs_make
See also mex. Type help cs_make in MATLAB for more information, includ-
ing instructions on how to add new mexFunctions to CSparse.
cs_multiply: sparse matrix multiply
Usage: C = cs_multiply(A,B)
MATLAB equivalent: C = A*B
See also cs_gaxpy, cs_add, mtimes.
cs_nd: generalized nested dissection
See Section 7.6 for a complete description of the cs_nd M-file.
cs_nsep: find a node separator
cs_updown: rank-1 update/downdate of a sparse Cholesky factorization
L = cs_updown(L,c,parent) computes the rank-1 update L=chol(L*L'+c*c')',
where parent is the elimination tree of L. c must be a sparse column vector, and
find(c) must be a subset of find(L(:,k)), where k=min(find(c)).
L = cs_updown(L,c,parent,'-') is the downdate L=chol(L*L'-c*c')'.
L = cs_updown(L,c,parent,'+') is the update L=chol(L*L'+c*c')'.
Updating/downdating is much faster than refactorization with cs_chol or
chol. L must not have any entries dropped due to numerical cancellation.
See also cs_etree, cs_chol, etree, cholupdate, chol.
cs_usolve: solve a sparse upper triangular system Ux = b
Usage: x = cs_usolve(U,b)
MATLAB equivalent: x = U\b
Solves a sparse upper triangular system. U must be upper triangular with a
zero-free diagonal, b can be a full or sparse vector.
See also cs_lsolve, cs_ltsolve, cs_utsolve, mldivide.
cs_utsolve: solve a sparse lower triangular system U'x = b
Usage: x = cs_utsolve(U,b)
MATLAB equivalent: x = U'\b
Solves a sparse lower triangular system. U must be upper triangular with a
zero-free diagonal, b must be a full vector.
See also cs_lsolve, cs_ltsolve, cs_usolve, mldivide.
cspy: plot a sparse matrix in color
Usage: cspy(A,res)
cspy(A) plots a sparse matrix, in color, with a default resolution of 256-by-
256. cspy(A,res) changes the resolution to res. Entries with tiny absolute value
are light tan. Entries with large magnitude are black. Entries in the midrange
(the median of the log10 of the nonzero values, plus or minus one standard deviation)
range from light green to deep blue. With no inputs, the color legend of cspy is plotted.
[s,M,H] = cspy(A) returns the scale factor s, the image M, and colormap H.
See also cs_dmspy, spy.
10.4 Examples
The cs_demo1, cs_demo2, and cs_demo3 M-files in the MATLAB/Demo directory are
MATLAB equivalents of the C demo programs of the same name. They access
CSparse via a set of mexFunctions. These demos also plot their results with cspy.
A mexFunction interfaces a C or Fortran program to MATLAB. Once com-
piled, it acts just like an M-file. Its name is always "mexFunction" and it always has
the same parameters. C- and Fortran-callable MATLAB functions with the prefix
mx provide access to MATLAB data structures, while functions with mex prefixes
operate in the MATLAB environment. Below is a sample mexFunction in CSparse
that interfaces the cs_chol function to MATLAB (the file cs_chol_mex.c). It calls
cs_mex_get_sparse to convert a MATLAB matrix A into a CSparse matrix A. The
matrix is analyzed and factorized with cs_schol and cs_chol. The drop parame-
ter determines whether or not numerically zero entries should be dropped from the
matrix (they must be kept for cs_updown to work properly). cs_mex_put_sparse
returns L to the MATLAB caller. If two output parameters have been provided,
then the permutation p is computed and returned to MATLAB via cs_mex_put_int.
^include "cs_mex.h"
/* cs_chol: sparse Cholesky factorization */
void mexFunction (int nargout, mxArray *pargout [ ], int nargin,
const mxArray *pargin [ ])
{
cs Amatrix, *A ;
int order, n, drop, *p ;
css *S ;
csn *N ;
if (nargout > 2 || nargin < 1 || nargin > 2)
mexErrMsgTxt ("Usage: [L,p] = cs_chol(A,drop)") ;
A = cs_mex_get_sparse (&Amatrix, 1, 1, pargin [0]) ; /* get A */
n = A->n ;
order = (nargout > 1) ? 1 : 0 ; /* determine ordering */
S = cs_schol (order, A) ; /* symbolic Cholesky */
N = cs_chol (A, S) ; /* numeric Cholesky */
if (!N) mexErrMsgTxt ("cs_chol failed: not positive definite\n") ;
drop = (nargin == 1) ? 1 : mxGetScalar (pargin [1]) ;
if (drop) cs_dropzeros (N->L) ; /* drop zeros if requested */
pargout [0] = cs_mex_put_sparse (&(N->L)) ; /* return L */
if (nargout > 1)
{
p = cs_pinv (S->pinv, n) ; /* p=pinv' */
pargout [1] = cs_mex_put_int (p, n, 1, 1) ; /* return p */
}
cs_nfree (N) ;
cs_sfree (S) ;
}
nargin and nargout are the number of input and output parameters, pargout and
pargin are arrays of pointers to the input and output parameters. The cs_chol
mexFunction makes use of a set of utility routines called cs_mex_* shared by all
CSparse mexFunctions, in the cs_mex. c file, listed below.
#include "cs_mex.h"
/* check MATLAB input argument */
void cs_mex_check (int nel, int m, int n, int square, int sparse, int values,
const mxArray *A)
{
int nnel, mm = mxGetM (A), nn = mxGetN (A) ;
if (values)
{
if (mxIsComplex (A)) mexErrMsgTxt ("matrix must be real") ;
if (!mxIsDouble (A)) mexErrMsgTxt ("matrix must be double") ;
}
if (sparse && ImxIsSparse (A)) mexErrMsgTxt ("matrix must be sparse") ;
if (!sparse && mxIsSparse (A)) mexErrMsgTxt ("matrix must be full") ;
if (nel)
{
/* check number of elements */
10.5 Further reading
Gilbert, Moler, and Schreiber introduced sparse matrices into MATLAB [105], in-
cluding the first implementation of the sparse backslash. Additional sparse matrix
functions of Amestoy, Davis, and Duff (AMD [1,2]), Davis and Duff (UMFPACK
[27, 28, 31, 32]), Davis, Gilbert, Larimore, and Ng (COLAMD [33, 34]), Davis,
Hager, Chen, and Rajamanickam (CHOLMOD [30]), and Lehoucq, Sorensen, and
Yang (ARPACK [144, 145, 188]) have been included. condest is based on Higham
and Tisseur's [136] method, a generalization of Hager's 1-norm estimator [123].
Penny Anderson, Bobby Cheng, and Pat Quillen have written many of the sparse
matrix methods in MATLAB. For more information on MATLAB, see Higham and
Higham [133] or Davis and Sigmon [38]. Duff [47] discusses how random matrices
(sprand, sprandn, and sprandsym) can give misleading results when factorized.
Exercises
10.1. Compare the performance (speed and accuracy) of the CSparse mexFunctions
and MATLAB, using a range of large sparse matrices from real applications.
Appendix A
Basics of the C
programming language
Variables
Six of C's basic variable types are used in CSparse:
int an integer
unsigned int an integer that is always positive
double a double-precision floating-point value
size_t an integer large enough to hold a pointer
char a character
void an object of no specific type used for pointers
In MATLAB, variables do not need to be declared (except with the rarely
used global statement). They must be declared in C. For example, the follow-
ing C statements declare integer scalars i and j and a double-precision scalar x.
Declarations can include initializations as well. C also includes a complex type.
int i ;
double x ;
int j = 0 ;
and
y = ++x ;
C has a suite of assignment operators that modify the left-hand side. The following
table shows five of these and their equivalents using the regular assignment in C.
x += 2    x = x + 2
x -= 2    x = x - 2
x /= 2    x = x / 2
x *= 2    x = x * 2
x %= 2    x = x % 2
The % operator in C is the rem function in MATLAB, except that the meaning of
a % b when either a or b are negative is machine dependent (it is used in CSparse
only for positive numbers).
Variables can be typecast into values of a different type. For example, to
convert an int to a double,
x = (double) i ;
This conversion is done automatically by the assignment x=i and when variables of
one type are passed to a function expecting another type, so it is rarely needed.
Control structures
The while loop is almost the same in C and MATLAB. These two code fragments
are the same, the first in C and the second in MATLAB:
while (x < 10)
{
x = x + i ;
}
Both C and MATLAB have a for loop, but they differ in how they work.
These two code fragments are the same in C and MATLAB, respectively.
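The C loop, equivalent to the MATLAB loop that follows, is:
for (i = 0 ; i < n ; i++)
{
    x = x + i ;
}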
for i = 0:n-1
x = x + i ;
end
A C for loop of the form for ( initialization ; condition ; post ) { body } is identical to
initialization
while ( condition )
{
body
post
}
i = 0 ;
while (i < n)
{
x = x + i ;
i = i + 1 ;
}
Any of the four components of a for loop can be empty. Thus for (; ;) ; is an
infinite loop. The continue and break statements are identical in C and MATLAB;
the former causes the next iteration of the nearest enclosing for or while loop to
begin, the latter terminates the nearest enclosing loop immediately.
Functions
Parameters to C and MATLAB functions are both passed by value; a C function
and a MATLAB M-file function cannot modify their input parameters. However,
a pointer can be passed by value to a C function, and the contents of what that
pointer points to can be modified. Most CSparse functions do not modify their
inputs. The const keyword when used in a function header or prototype declares
that the function does not modify the contents of an array (or, equivalently, what
a pointer argument points to).
A C function can return only a single value to its caller. This value can be
a pointer. For example, the cs_malloc(n,b) function returns a pointer to a newly
allocated (but uninitialized) memory space of size large enough to hold an array of
n items each of b bytes. The sizeof operator is applied to a C type and returns
the number of bytes in an object of that type (normally, sizeof (int) is four, and
sizeof (double) is eight). The C return statement is like the MATLAB return,
except that it also defines the value returned to the caller. The following C function
and MATLAB function are the same, except MATLAB always uses double-precision
floating-point values to represent its integers (called a flint in MATLAB).
int *myfunc (int n)
{
int *p, i ;
p = cs_malloc (n, sizeof (int)) ;
for (i = 0 ; i < n ; i++) p [i] = i ;
return (p) ;
}
These functions are called in the same way in C and MATLAB: p=myfunc(n) ;. C
functions that are private to a specific source code file are declared static. They
can be called only by other functions in that same file, just like a nested or private
function in MATLAB. A prototype is a statement declaring the name, parameters
(and their type), and return value of a function. These are normally placed in
an include file (described below), so that they can be incorporated into any code
that calls the function. Prototypes for functions returning an int are not strictly
required. However, if the declaration and use of the function are different and no
prototype is present, the results are unpredictable. Prototypes should always be
used in well-written C code. MATLAB has no prototypes but checks each usage of
a function as it is called at run time. The prototype for myfunc is
int *myfunc (int n) ;
Both C and MATLAB can work with pointers to functions (called function
handles in MATLAB). Consider the cs_fkeep function with prototype:
int cs_fkeep (cs *A, int (*fkeep) (int, int, double, void *), void *other) ;
Its second argument is a pointer to a function with four parameters (two int's, a
double, and a pointer to void). The function cs_f keep calls the f keep function for
each entry in the matrix. An example of the use of cs_f keep is in the cs_droptol
function. Note that cs_droptol passes a pointer to cs_tol to cs_f keep.
static int cs_tol (int i, int j, double aij, void *tol)
{
return (fabs (aij) > *((double *) tol)) ;
}
int cs_droptol (cs *A, double tol)
{
return (cs_fkeep (A, &cs_tol, fttol)) ; /* keep all large entries */
}
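Another callback in the same style (a sketch; lower_tri and keep_lower are not CSparse routines) keeps only the entries on or below the diagonal:
/* sketch: an fkeep callback that keeps only entries in the lower triangular part */
static int lower_tri (int i, int j, double aij, void *other)
{
    return (i >= j) ;   /* keep A(i,j) only if i >= j */
}
int keep_lower (cs *A)
{
    return (cs_fkeep (A, &lower_tri, NULL)) ;
}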
Data structures
Both C and MATLAB can create a compound object called a struct. In C, these
must be declared statically. They can be dynamically created in MATLAB. The
following code fragments are identical. In C, the mystuff type is first defined as a
structure containing a scalar integer and a pointer to a double. It is then used in a
declaration statement to define f of type mystuff.
typedef struct mystuffstruct
{
int i ;
double *x ;
} mystuff ;
mystuff f ;
f.i = 3 ;
f.x = cs_calloc (4, sizeof (double)) ;
Examples
Consider the following statement from cs_pvec:
for (k = 0 ; k < n ; k++) x [k] = b [p ? p [k] : k] ;
It is equivalent to
for (k = 0 ; k < n ; k++)
{
    if (p)
    {
        x [k] = b [p [k]] ;
    }
    else
    {
        x [k] = b [k] ;
    }
}
Both examples shown above compile into code that is equally fast. The former is
just more concise. The statements from cs_transpose
Ci [q = w [Ai [p]]++] = j ; /* place A(i,j) as entry C(j,i) */
if (Cx) Cx [q] = Ax [p] ;
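Written out as separate statements, these two lines perform the following operations:
q = w [Ai [p]] ;            /* q = next free position in column Ai[p] of C */
w [Ai [p]] = q + 1 ;        /* advance that column's position counter */
Ci [q] = j ;                /* place A(i,j) as entry C(j,i) */
if (Cx) Cx [q] = Ax [p] ;   /* copy the numerical value, if values are kept */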
C library functions
The following C library functions are used in CSparse:
fabs(x) absolute value of x
sqrt (x) square root of x
malloc (n) allocates a block of n bytes of memory
calloc(n,b) allocates a block of n items each of size b and sets it to zero
free(p) frees a block of memory
realloc(p,n) changes the size of a block of memory to n bytes
printf just like fprintf in MATLAB
fscanf just like fscanf in MATLAB
clock returns CPU time used (used only in the CSparse demos)
qsort sorts an array
rand random number generator
srand set random number seed
C preprocessor
The C preprocessor is a text-only preprocessing step, applied to a program before
it is compiled. Preprocessor statements start with #, usually in the first column.
The statements used in CSparse are listed below.
#include includes a file
#define defines a macro or token
#ifdef true if the token is defined
#ifndef true if the token is not defined
#else the "else" part to an ifdef or ifndef
#endif the "endif" part to an ifdef or ifndef
The #include statement has one of the forms
#include <file.h>
#include "file.h"
The only difference between the two is where the C compiler looks for the file called
file.h. The first one looks in a sequence of predetermined locations (dependent
on your compiler and operating system). The second ("file.h") looks first in the
same place as the current source file and, failing that, looks in the same place as
the <f ile.h> form. This file is copied into the source code that has the #include
statement before it is compiled.
The #define statement can define a token or a macro. A token is a single word
that has no parameters. For example, the word NULL is defined as 0, or ((void *)
0), in the <stdio.h> or <stdlib.h> file. The #define statement is used in CSparse
to define the memory management routines CSparse should use. If CSparse is
being compiled in a MATLAB mexFunction, the token MATLAB_MEX_FILE is defined,
and the MATLAB memory management routines are used (mxMalloc, mxCalloc,
mxFree, and mxRealloc) instead of their standard C counterparts (malloc, calloc,
free, and realloc). Consider three macros that are defined in the cs.h file:
#define CS_MAX(a,b) (((a) > (b)) ? (a) : (b))
#define CS_MIN(a,b) (((a) < (b)) ? (a) : (b))
#define CS_FLIP(i) (-(i)-2)
CS_MAX and CS_MIN are easier-to-read versions of the ternary ?: operator and com-
pute the maximum and minimum of a and b, respectively. The CS_FLIP(i) macro
computes the simple function -(i)-2, so named because it "flips" an integer about
the pivotal integer -1, somewhat analogous to flipping the sign-bit of a num-
ber. More precisely, CS_FLIP(-1) is -1, and for all integers (ignoring overflow)
CS_FLIP(CS_FLIP(i)) equals i.
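As a small illustration (not from cs.h, apart from the macros themselves), flipping an entry of an integer workspace marks it in place, and flipping it again restores the original value:
#include <stdio.h>
#define CS_FLIP(i) (-(i)-2)
#define CS_UNFLIP(i) (((i) < 0) ? CS_FLIP(i) : (i))
#define CS_MARKED(w,j) (w [j] < 0)
#define CS_MARK(w,j) { w [j] = CS_FLIP (w [j]) ; }
int main (void)
{
    int w [2] = { 3, 5 } ;
    CS_MARK (w, 1) ;        /* w [1] becomes CS_FLIP(5) = -7, so it is marked */
    printf ("marked: %d value: %d\n", CS_MARKED (w, 1), CS_UNFLIP (w [1])) ;
    CS_MARK (w, 1) ;        /* flipping again restores 5 */
    printf ("restored: %d\n", w [1]) ;
    return (0) ;
}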
Bibliography
[10] C. ASHCRAFT, Compressed graphs and the minimum degree algorithm, SIAM
J. Sci. Comput., 16 (1995), pp. 1404-1411. (Cited on p. 143.)
[11] C. ASHCRAFT AND R. GRIMES, SPOOLES: An object-oriented sparse matrix
library, in Proceedings of the Ninth SIAM Conference on Parallel Processing
for Scientific Computing, 1999. (Cited on p. 143.)
[12] C. ASHCRAFT, R. GRIMES, AND J. G. LEWIS, Accurate symmetric indefinite
linear equation solvers, SIAM J. Matrix Anal. Appl., 20 (1998), pp. 513-561.
(Cited on p. 143.)
[13] C. C. ASHCRAFT, R. G. GRIMES, J. G. LEWIS, B. W. PEYTON, AND H. D.
SIMON, Progress in sparse matrix methods for large linear systems on vector
supercomputers, Intl. J. Supercomp. Appl., 1 (1987), pp. 10-30. (Cited on
p. 143.)
[14] S. T. BARNARD, A. POTHEN, AND H. D. SIMON, A spectral algorithm for
envelope reduction of sparse matrices, Numer. Linear Algebra Appl., 2 (1995),
pp. 317-334. (Cited on p. 132.)
[15] R. BARRETT, M. W. BERRY, T. F. CHAN, J. DEMMEL, J. DONATO,
J. DONGARRA, V. EIJKHOUT, R. POZO, C. ROMINE, AND H. VAN DER
VORST, Templates for the Solution of Linear Systems: Building Blocks for
Iterative Methods, SIAM, Philadelphia, 1993. (Cited on p. 6.)
[16] C. H. BISCHOF, C.-T. PAN, AND P. T. P. TANG, A Cholesky up- and down-
dating algorithm for systolic and SIMD architectures, SIAM J. Sci. Comput.,
14 (1993), pp. 670-676. (Cited on p. 67.)
[17] A. BJORCK, Numerical Methods for Least Squares Problems, SIAM, Philadel-
phia, 1996. (Cited on p. 6.)
[18] J. R. BUNCH AND L. KAUFMAN, Some stable methods for calculating inertia
and solving symmetric linear systems, Math. Comp., 31 (1977), pp. 163-179.
(Cited on p. 141.)
[19] J. R. BUNCH, L. KAUFMAN, AND B. N. PARLETT, Decomposition of a
symmetric matrix, Numer. Math., 27 (1976), pp. 95-110. (Cited on p. 141.)
[20] N. A. CARLSON, Fast triangular factorization of the square root filter, AIAA
Journal, 11 (1973), pp. 1259-1265. (Cited on p. 67.)
[21] W. M. CHAN AND A. GEORGE, A linear time implementation of the reverse
Cuthill-McKee algorithm, BIT, 20 (1980), pp. 8-14. (Cited on p. 132.)
[22] T. F. COLEMAN, A. EDENBRANDT, AND J. R. GILBERT, Predicting fill for
sparse orthogonal factorization, J. Assoc. Comput. Mach., 33 (1986), pp. 517-
532. (Cited on pp. 72, 81.)
[23] T. H. CORMEN, C. E. LEISERSON, AND R. L. RIVEST, Introduction to
Algorithms, MIT Press, Cambridge, MA, 1990. (Cited on pp. 6, 35, 132.)
[42] F. DOBRIAN, G.-K. KUMFERT, AND A. POTHEN, The design of sparse direct
solvers using object-oriented techniques, in Advances in Software Tools for
Scientific Computing, Springer-Verlag, Berlin, 2000, pp. 89-131. (Cited on
p. 143.)
[44] , Sparse extensions to the Fortran basic linear algebra subprograms, ACM
Trans. Math. Software, 17 (1991), pp. 253-262. (Cited on p. 24.)
[45] J. J. DONGARRA, J. R. BUNCH, C. B. MOLER, AND G. W. STEWART,
LINPACK Users' Guide, SIAM, Philadelphia, 1979. (Cited on p. 67.)
[46] J. J. DONGARRA, J. Du CROZ, I. S. DUFF, AND S. HAMMARLING, A set
of level-3 basic linear algebra subprograms, ACM Trans. Math. Software, 16
(1990), pp. 1-17. (Cited on p. 67.)
[47] I. S. DUFF, On the number of nonzeros added when Gaussian elimination is
performed on sparse random matrices, Math. Comp., 28 (1974), pp. 219-230.
(Cited on p. 186.)
[50] , Design features of a frontal code for solving sparse unsymmetric linear
systems out-of-core, SIAM J. Sci. Statist. Comput., 5 (1984), pp. 270-280.
(Cited on p. 143.)
[51] , MA57—a code for the solution of sparse symmetric definite and indef-
inite systems, ACM Trans. Math. Software, 30 (2004), pp. 118-144. (Cited
on p. 143.)
[53] , Direct Methods for Sparse Matrices, Oxford University Press, New
York, 1986. (Cited on pp. 6, 94, 131.)
[57] I. S. DUFF AND J. KOSTER, The design and use of algorithms for permuting
large entries to the diagonal of sparse matrices, SIAM J. Matrix Anal. Appl.,
20 (1999), pp. 889-901. (Cited on pp. 94, 132, 133.)
[64] , MA47, a Fortran code for the direct solution of indefinite sparse sym-
metric linear systems. Technical report RAL-95-001, Rutherford Appleton
Laboratory, Didcot, UK, 1995. (Cited on p. 143.)
[65] , The design of MA48: A code for the direct solution of sparse unsym-
metric linear systems of equations, ACM Trans. Math. Software, 22 (1996),
pp. 187-226. (Cited on p. 143.)
[67] I. S. DUFF AND J. A. SCOTT, The design of a new frontal code for solving
sparse, unsymmetric systems, ACM Trans. Math. Software, 22 (1996), pp. 30-
45. (Cited on p. 143.)
[69] , A parallel direct solver for large sparse highly unsymmetric linear sys-
tems, ACM Trans. Math. Software, 30 (2004), pp. 95-117. (Cited on p. 143.)
[75] , The theory of elimination trees for sparse unsymmetric matrices, SIAM
J. Matrix Anal. Appl., 26 (2005), pp. 686-705. (Cited on p. 94.)
[93] , Row ordering schemes for sparse Givens transformations: II. Implicit
graph model, Linear Algebra Appl., 75 (1986), pp. 203-223. (Cited on p. 81.)
[94] , Row ordering schemes for sparse Givens transformations: III. Analyses
for a model problem, Linear Algebra Appl., 75 (1986), pp. 225-240. (Cited
on p. 81.)
[95] , A data structure for sparse QR and LU factorizations, SIAM J. Sci.
Statist. Comput., 9 (1988), pp. 100-121. (Cited on pp. 72, 80, 81.)
[96] A. GEORGE AND E. NG, On row and column orderings for sparse least square
problems, SIAM J. Numer. Anal., 20 (1983), pp. 326-344. (Cited on p. 81.)
[97] , An implementation of Gaussian elimination with partial pivoting for
sparse systems, SIAM J. Sci. Statist. Comput., 6 (1985), pp. 390-409. (Cited
on pp. 72, 83, 94.)
[98] , Symbolic factorization for sparse Gaussian elimination with partial piv-
oting, SIAM J. Sci. Statist. Comput., 8 (1987), pp. 877-898. (Cited on p. 94.)
[99] N. E. GIBBS, Algorithm 509: A hybrid profile reduction algorithm, ACM
Trans. Math. Software, 2 (1976), pp. 378-387. (Cited on p. 132.)
[100] N. E. GIBBS, W. G. POOLE, JR., AND P. K. STOCKMEYER, A comparison
of several bandwidth and reduction algorithms, ACM Trans. Math. Software,
2 (1976), pp. 322-330. (Cited on p. 132.)
[101] J. R. GILBERT, Predicting structure in sparse matrix computations, SIAM J.
Matrix Anal. Appl., 15 (1994), pp. 62-79. (Cited on pp. 6, 17, 66, 83, 84,
135.)
[102] J. R. GILBERT, X. S. Li, E. G. NG, AND B. W. PEYTON, Computing
row and column counts for sparse QR and LU factorization, BIT, 41 (2001),
pp. 693-710. (Cited on pp. 66, 81.)
[103] J. R. GILBERT AND J. W. H. LIU, Elimination structures for unsymmetric
sparse LU factors, SIAM J. Matrix Anal. Appl., 14 (1993), pp. 334-352.
(Cited on p. 94.)
[104] J. R. GILBERT, G. L. MILLER, AND S.-H. TENG, Geometric mesh parti-
tioning: Implementation and experiments, SIAM J. Sci. Comput., 19 (1998),
pp. 2091-2110. (Cited on p. 132.)
[105] J. R. GILBERT, C. MOLER, AND R. SCHREIBER, Sparse matrices in MAT-
LAB: Design and implementation, SIAM J. Matrix Anal. Appl., 13 (1992),
pp. 333-356. (Cited on pp. 24, 35, 141, 143, 186.)
[106] J. R. GILBERT AND E. G. NG, Predicting structure in nonsymmetric sparse
matrix factorizations, in Graph Theory and Sparse Matrix Computations,
A. George, J. R. Gilbert, and J. W. H. Liu, eds., vol. 56 of IMA Vol. Math.
Appl., Springer-Verlag, New York, 1993, pp. 107-139. (Cited on pp. 81, 83,
84, 94.)
[113] G. H. GOLUB, Numerical methods for solving linear least squares problems,
Numer. Math., 7 (1965), pp. 206-216. (Cited on p. 70.)
[114] G. H. GOLUB AND C. VAN LOAN, Matrix Computations, The Johns Hopkins
University Press, Baltimore, MD, 3rd ed., 1996. (Cited on pp. 6, 72, 81.)
[115] K. GOTO AND R. VAN DE GEIJN, On reducing TLB misses in matrix mul-
tiplication. Technical report TR-2002-55, University of Texas, Austin, TX,
2002. (Cited on p. 67.)
[116] N. I. M. GOULD, Y. HU, AND J. A. SCOTT, A numerical evaluation of
sparse direct solvers for the solution of large sparse, symmetric linear systems
of equations, www.numerical.rl.ac.uk/reports/reports.html. Technical report
RAL-2005-005, Rutherford Appleton Laboratory, Didcot, UK, 2005, ACM
Trans. Math. Software, to appear. (Cited on pp. 6, 67, 134.)
[117] A. GREENBAUM, Iterative Methods for Solving Linear Systems, SIAM,
Philadelphia, 1997. (Cited on p. 6.)
[118] A. GUPTA, Improved symbolic and numerical factorization algorithms for un-
symmetric sparse matrices, SIAM J. Matrix Anal. Appl., 24 (2002), pp. 529-
552. (Cited on pp. 94, 143.)
[120] F. G. GUSTAVSON, Finding the block lower triangular form of a sparse matrix,
in Sparse Matrix Computations, J. R. Bunch and D. J. Rose, eds., Academic
Press, New York, 1976, pp. 275-290. (Cited on p. 132.)
[121] , Two fast algorithms for sparse matrices: Multiplication and permuted
transposition, ACM Trans. Math. Software, 4 (1978), pp. 250-269. (Cited on
p. 24.)
[122] S. M. HADFIELD, On the LU Factorization of Sequences of Identically Struc-
tured Sparse Matrices within a Distributed Memory Environment, Ph.D. the-
sis, University of Florida, Gainesville, FL, 1994. (Cited on p. 94.)
[123] W. W. HAGER, Condition estimates, SIAM J. Sci. Statist. Comput., 5 (1984),
pp. 311-316. (Cited on pp. 96, 186.)
[124] , Updating the inverse of a matrix, SIAM Rev., 31 (1989), pp. 221-239.
(Cited on p. 67.)
[125] , Minimizing the profile of a symmetric matrix, SIAM J. Sci. Comput.,
23 (2002), pp. 1799-1816. (Cited on p. 132.)
[126] W. W. HAGER, S. C. PARK, AND T. A. DAVIS, Block exchange in graph par-
titioning, in Approximation and Complexity in Numerical Optimization: Con-
tinuous and Discrete Problems, P. M. Pardalos, ed., Kluwer Academic Pub-
lishers, Dordrecht, The Netherlands, 2000, pp. 299-307. (Cited on p. 132.)
[127] D. R. HARE, C. R. JOHNSON, D. D. OLESKY, AND P. VAN DEN DRIESSCHE,
Sparsity analysis of the QR factorization, SIAM J. Matrix Anal. Appl., 14
(1993), pp. 655-669. (Cited on p. 81.)
[128] M. T. HEATH, Numerical methods for large sparse linear least squares prob-
lems, SIAM J. Sci. Statist. Comput., 5 (1984), pp. 497-513. (Cited on p. 81.)
[129] M. T. HEATH AND P. RAGHAVAN, A Cartesian parallel nested dissection
algorithm, SIAM J. Matrix Anal. Appl., 16 (1995), pp. 235-253. (Cited on
pp. 132, 143.)
[130] , Performance of a fully parallel sparse solver, Intl. J. Supercomp. Appl.,
11 (1997), pp. 49-64. (Cited on p. 143.)
[131] B. HENDRICKSON AND R. LELAND, An improved spectral graph partition-
ing algorithm for mapping parallel computations, SIAM J. Sci. Comput., 16
(1995), pp. 452-469. (Cited on p. 132.)
[132] P. HENON, P. RAMET, AND J. ROMAN, PaStiX: A high-performance parallel
direct solver for sparse symmetric positive definite systems, Parallel Comput.,
28 (2002), pp. 301-321. (Cited on p. 143.)
[133] D. J. HIGHAM AND N. J. HIGHAM, MATLAB Guide, SIAM, Philadelphia,
2nd ed., 2005. (Cited on pp. 6, 186.)
[134] N. J. HIGHAM, Fortran codes for estimating the one-norm of a real or com-
plex matrix, with applications to condition estimation, ACM Trans. Math.
Software, 14 (1988), pp. 381-396. (Cited on p. 96.)
[136] N. J. HIGHAM AND F. TISSEUR, A block algorithm for matrix 1-norm esti-
mation, with an application to 1-norm pseudospectra, SIAM J. Matrix Anal.
Appl., 21 (2000), pp. 1185-1201. (Cited on pp. 96, 186.)
[139] G. KARYPIS AND V. KUMAR, A fast and high quality multilevel scheme for
partitioning irregular graphs, SIAM J. Sci. Comput., 20 (1998), pp. 359-392.
(Cited on p. 132.)
[142] G. KUMFERT AND A. POTHEN, Two improved algorithms for reducing the
envelope and wavefront, BIT, 37 (1997), pp. 559-590. (Cited on p. 132.)
[165] S. PARTER, The use of linear graphs in Gauss elimination, SIAM Rev., 3
(1961), pp. 119-130. (Cited on p. 39.)
[168] A. POTHEN, Predicting the structure of sparse orthogonal factors, Linear Al-
gebra Appl., 194 (1993), pp. 183-204. (Cited on p. 81.)
[169] A. POTHEN AND C. FAN, Computing the block triangular form of a sparse
matrix, ACM Trans. Math. Software, 16 (1990), pp. 303-324. (Cited on
p. 132.)
Index

acyclic graph, 4, 30, 93, 94, 119
acyclic reduction, 139
adjacency list, 4, 100, 103, 112, 119
adjacency matrix, 4, 119
adjacency set, 4
algorithm analysis, 5, 6
amalgamation, 91
AMD package, 101, 102, 112, 132, 140, 186
amortized analysis, 5, 102
ancestor, 4, 41, 47, 50, 54, 67, 68, 93
ARPACK package, 174, 186
assembly tree, 93
asymptotic notation, 5
augment, 115
augmenting path, 112, 122

backslash, see MATLAB, mldivide
backsolve, see solve, 28
bandwidth, 127
bipartite graph, 5, 112, 129
BLAS package, 62, 67, 93, 140, 141
block matrix, 2, 9, 17, 19, 27, 28, 37, 60, 62, 85, 89, 90, 119, 122, 128, 129, 136, 177
block triangular form, 72, 112, 118, 133, 138
breadth-first search, 6, 31, 114, 123, 127

C functions
    calloc, 10, 26, 186, 193, 194
    clock, 163, 193
    fabs, 17, 22, 69, 87, 191, 193
    free, 10, 186, 193, 194
    fscanf, 23, 193
    malloc, 10, 186, 193, 194
    printf, 23, 159, 163, 167, 19
    qsort, 25, 193
    rand, 118, 193
    realloc, 11, 186, 193, 194
    sqrt, 59, 65, 69, 105, 193
    srand, 118, 193
C language, 6, 187
C preprocessor, 194
cancellation, 4, 9, 17, 30, 39, 41, 56, 64, 66, 70, 72, 119, 135, 18
cheap match, 114, 140
child, 4, 39, 41, 44-46, 49, 50, 52-54, 63, 91, 93, 105, 156
chol_update, 63, 67
chol_downdate, 65, 67
Cholesky factorization, 3, 6, 37, 141
    column counts, 37, 46, 52, 61, 66, 76, 154, 175, 177
    data structure, 59
    left-looking, 60, 66, 68, 85, 174
    multifrontal, 62, 66
    normal equations, 72
    ordering, 58, 68, 127, 128, 1
    right-looking, 62, 89, 99
    row counts, 46, 54, 66
    solving Ax = b, 67, 135, 146, 17
    supernodal, 61, 66, 140, 174
    symbolic, 56, 84, 85, 152, 175
    up-looking, 37, 58, 66, 140, 15, 176
    update, 63, 153, 175, 176, 181
chol_left, 60
CHOLMOD package, 24, 42, 44, 46, 56, 58, 62, 66, 67, 132, 14, 174, 175, 186
chol_right, 62, 99
chol_super, 61, 63
maximum transversal, see maximum matching
maxtrans, 115
minimum degree, 6, 57, 58, 99, 128, 130, 131, 133, 136, 138, 140, 141, 146, 150, 173, 174, 176
    aggressive absorption, 104
    approximate, 101
    assembly tree, 103, 105
    column, 76, 131, 132, 137, 146, 173
    deficiency, 131
    element absorption, 100
    elimination graph, 100
    indistinguishable nodes, 100
    mass elimination, 101
    quotient graph, 100, 102, 103
    tie-breaking, 112

neighbor, 4, 32, 35, 100, 117, 127
nested dissection, 81, 99, 128, 130-132, 141, 179
node, 4, 30
node cover, 5, 128, 129
node separator, 5, 128-130, 132, 180
node-induced subgraph, 4, 5
nonsingular, 3, 135, 136, 147
nonzero, 3
nonzero entry, see nonzero
norm, 3, 22, 23, 69, 82, 96, 136, 147, 148, 174, 175, 180, 186
norm1est, 96

one-based, 7
orthogonal, 3, 20, 69, 70, 76, 78, 79, 85, 136, 180
orthonormal, 3
out-adjacency, 4
outer product, 2, 62, 88, 91
overdetermined, 122, 138, 139, 177, 180

parallel algorithms, 141
parent, 4, 39, 40, 42, 52, 54, 63, 73, 74, 91, 93, 105
partial pivoting, see LU factorization
path, 4, 30, 37, 39, 43, 47, 52, 64, 80, 112, 119, 122
path compression, 41
path decomposition, 47, 49, 50
permutation, 3, 20, 21, 56, 74, 84, 99, 112, 118, 123, 127, 135, 151, 153, 181
permutation vector, 20
    inverse, 20
pivoting, see LU factorization
pointer, 187
positive definite, 1, 3, 6, 37, 58, 62, 63, 66, 72, 84, 88, 94, 135, 136, 140, 144, 146, 173
positive semidefinite, 3, 127
postorder, 5, 37, 44, 46-51, 54, 58, 67, 68, 76, 82, 102, 105, 133, 154-156, 175, 178
profile, 127, 128, 130, 132, 141
proper, 5, 44
prototype, 24, 25, 61, 68, 95, 145, 158, 161, 190, 191
pseudoperipheral node, 127, 132

QR factorization, 3, 6, 81, 141
    block triangular form, 123, 138
    data structure, 59
    Givens, 69, 78-81, 140, 141, 174
    Householder, 69-71, 79, 81, 83, 93, 136, 141, 152, 180
    left-looking, 29, 70, 71, 73
    multifrontal, 71, 80-82
    ordering, 102, 112, 118, 131, 173
    right-looking, 70, 71
    row counts, 55, 74
    row-merge, 81
    solving Ax = b, 136
    symbolic, 57, 74, 81, 84, 93, 140, 152
    upper bound on LU, 83
qr_givens, 80
qr_givens_full, 80
qr_left, 71, 74
qr_right, 71

range, 3