Current Trends in Numerical Linear Algebra

SOFSEM ’94 – Milovy, The Czech Republic, 27. 11. – 9. 12.
1994
Current Trends in Numerical Linear Algebra: From

Theory to Practice
Z. Strakoš, M. Tůma
Institute of Computer Science, Academy of Sciences of the Czech Republic

Pod vodárenskou věžı́ 2, CZ–182 07 Prague, The Czech Republic
Email: strakos@uivt.cas.cz tuma@uivt.cas.cz
Abstract: Our goal is to show on several examples the great progress made in numerical analysis in
the past decades together with the principal problems and relations to other disciplines. We
restrict ourselves to numerical linear algebra, or, more specifically, to solving Ax = b where
A is a real nonsingular n by n matrix and b a real n−dimensional vector, and to computing
eigenvalues of a sparse matrix A. We discuss recent developments in both sparse direct and
iterative solvers, as well as fundamental problems in computing eigenvalues. The effects of
parallel architectures to the choice of the method and to the implementation of codes are
stressed throughout the contribution.
Keywords: numerical analysis, numerical linear algebra, linear algebraic systems, sparse direct solvers,
iterative methods, eigenvalue computations, parallel architectures and algorithms
1 Motivation modelling and extensive use of the numerical

analysis, mentioning just one important area
Our contribution deals with numerical analy- of application.
sis. What is numerical analysis? And what Nick Trefethen proposes in his unpublished
is its relation to computer science (comput- note [38] the following definition:
ing sciences)? It is hard to give a definition.
Sometimes “pure” mathematicians do not con- Numerical analysis is the study of
sider numerical analysis as a part of mathe- algorithms for mathematical prob-
matics and “pure” computer scientists do not lems involving continuous variables
consider it as a part of computer science. Nev-
ertheless, hardly anyone would argue against The keywords are algorithm and continuous,
the importance of this field. Dramatic devel- the last typically means real or complex. Since
opment in high technology in the last decades real or complex numbers cannot be represented
would not be possible without mathematical exactly on computers, the part of the business
2. Solving large sparse linear algebraic systems
of numerical analysis must be to study round- the numerical method to the efficient and reli-
ing errors. To understand finite algorithms or able code. We demonstrate that especially on
direct methods, e.g. Gaussian elimination or the current trends in developing sparse direct
Cholesky factorization for solving systems of solvers.
linear algebraic equation, one have to under-
stand the computer architectures, operation 2 Solving large sparse linear alge-
counts and the propagation of rounding errors. braic systems
This example, however, does not tell you all Although the basic scheme of the symmetric
the story. Most mathematical problems involv- Gaussian elimination is very simple and can
ing continuous variables cannot be solved (or be casted in a few lines, the effective algo-
effectively solved) by finite algorithms. A clas- rithms which can be used for the really large
sical example - there are no finite algorithms problems usually take from many hundreds
for matrix eigenvalue problems (the same con- to thousands of code statements. The differ-
clusion can be extended to almost anything ence in the timings can then be many orders
nonlinear). Therefore the deeper business of of magnitude. This reveals the real complex-
numerical analysis is approximating unknown ity of the intricate codes which are necessary
quantities that cannot be known exactly even in to cope with the large real-world problems.
principle. A part of it is, of course, estimating Subsection 2.1 is devoted to the sparse direct
the precision of the computed approximation. solvers stemming from this symmetric Gaus-
Our goal is to show on several examples the sian elimination. Iterative solvers, which are
great achievements of the numerical analysis, based on the approaching the solution step by
together with the principal problems and rela- step from some initial chosen approximation,
tions to other disciplines. We restrict ourselves are discussed in subsection 2.2. A comparison
to numerical linear algebra, or more specifi- of both approaches is given in subsection 2.3.
cally, to solving Ax = b, where A is a real
nonsingular n by n matrix and b a real n- 2.1 Sparse direct solvers
dimensional vector (for simplicity we restrict This section provides a brief description of the
ourselves to real systems; many statements ap- basic ideas concerning sparsity in solving large
ply, of course, also to the complex case), and linear systems by direct methods. Solving the
to computing eigenvalues of a square matrix A. large sparse linear systems is the bottleneck in
Much of scientific computing depends on these a wide range of engineering and scientific com-
two highly developed subjects (or on closely putations. We restrict to the symmetric and
related ones). This restriction allows us to go positive definite case where most of the impor-
deep and show things in a sharp light (well, at tant ideas about algorithms, data structures
least in principle; when judging this contribu- and computing facilities can be explained. In
tion you must take into account our - certainly this case, the solution process is inherently sta-
limited - expertise and exposition capability). ble and we can thus avoid numerical pivoting
We emphasize two main ideas which can be which would complicate the description other-
found behind our exposition and which are es- wise.
sential for the recent trends and developments Efficient solution of large sparse linear sys-
in numerical linear algebra. First, our strong tems needs a careful choice of the algorith-
belief is that any “software → solution” without mic strategy influenced by the characteristics
a deep understanding of mathematical (physi- of computer architectures (CPU speed, mem-
cal, technical, ...) background of the problem ory hierarchy, bandwidths cache/main mem-
is very dangerous and may lead to fatal er- ory and main memory/auxiliary storage). The
rors. This is illustrated, e.g., on the prob- knowledge of the most important architectural
lem of computing eigenvalues and on charac- features is necessary to make our computations
terizing the convergence of the iterative meth- really efficient.
ods. Second, there is always a long way (with First we provide a brief description of the
many unexpectably complicated problems) from Cholesky factorization method for solving of a
sparse linear system. This is the core of the the column j by the column k at the lines (3)–
symmetric Gaussian elimination. Basic nota- (5) is denoted by cmod(j, k). Vector scaling at
tion can be found, e.g., in [2], [18]. the lines (8)–(10) is denoted by cdiv(j).
We use the square root formulation of the The right-looking approach can be de-
factorization in the form scribed by the following pseudo-code:
(1) for k = 1 to n
A = LDLT ,
(2) dk = ak
where L is lower triangular matrix and D is (3) for i = k + 1 to n if aik 6= 0
diagonal matrix. Having L, solution x can be (4) lik = aik /dk
computed using two back substitutions and one (5) end i
diagonal scaling: (6) for j = k + 1 to n if akj 6= 0
Lȳ = b ; y = D −1 ȳ ; LT x = y. (7) for i = k + 1 to n if lik 6= 0
(8) aij = aij − lik akj
Two primary approaches to factorization (9) end i
are as follows (we do not mention the row- (10) end j
Cholesky approach since its algorithmic prop-
(11) end k
erties usually do no fit well with the modern
computer architectures).
The left-looking approach can be described In the right-looking approach, once a col-
by the following pseudo-code: umn k is completed, it immediately generates
(1) for j = 1 to n all contributions to the subsequent columns,
(2) for k = 1 to j − 1 if akj 6= 0 i.e., columns to the right of it in the matrix.
(3) for i = k + 1 to n if lik 6= 0 A number of approaches have been taken in
(4) aij = aij − lik akj order to solve the problem of matching nonze-
(5) end i ros from columns j and k at the lines (6)-(10)
(6) end k of the pseudo-code as will be mentioned later.
(7) dj = ajj Similarly as above, operation at the lines (3)–
(8) for i = k + 1 to n (5) we denote cdiv(k) and column modification
(9) lij = aij /ajj at the lines (7)–(9) is denoted by cmod(j, k).
(10) end i Discretized operators from most of the ap-
(11) end j plications such as structural analysis, compu-
tational fluid dynamics, device and process
In this case, a column j of L is computed by simulation and electric power network prob-
gathering all contributions from the previously lems contain only a fraction of nonzero ele-
computed columns (i.e. the columns to the left ments: these matrices are very sparse. Explicit
of the column j in the matrix) to the column consideration of sparsity leads to substantial
j. Since the loop at the lines (3)–(5) in this savings in space and computational time. Usu-
pseudo-code involves two columns, j and k, ally, the savings in time are more important,
with potentially different nonzero structures, since the time complexity grows quicker as a
the problem of matching corresponding nonze- function of the problem size than the space
ros must be resolved. Vector modification of complexity (memory requirements).
   
∗ ∗ ∗ ∗ ∗ ∗ ∗

 ∗ ∗
∗ ∗



A= ∗ ∗ Ā =  ∗ ∗
   

∗ ∗ ∗ ∗
   
 
∗ ∗ ∗ ∗ ∗ ∗ ∗
Figure 1.1: The arrow matrix and in the natural and reverse ordering.
For a dense problem we have CPU time the diagonal. Natural orderings of many appli-
proportional to n3 while the necessary mem- cation problems lead to such concentrations of
ory is proportional to n2 . For the sparse prob- nonzeros.
lems arising from the mesh discretization of 3D Define fi = min {j | aji 6= 0} for i ∈ n̂.
problems CPU time is proportional (typically) This locates the leftmost nonzero element in
4
to n2 and space proportional to n 3 . each row. Set δi = i − fi . The profile is de-
fined by ni=1 δi . The problem of concentrat-
P
We will demonstrate the time differences
using the following simple example from [30]: ing elements around the diagonal can be thus
reformulated as the problem of minimizing the
Example 2.1 Elapsed CPU time for the ma- profile using symmetric reordering of matrix
trix arising from a finite element approxima- elements.
tion of a 3D problem (design of an automo-
bile chassis). Dimension of a the matrix: n = 
∗ ∗ ∗

44609. Proportion of nonzeros: 0.1%. Time on ∗
 ∗ ∗ ∗
Cray X-MP (1 processor) when considered as 
 ∗ ∗ ∗ ∗

a full (dense) matrix: 2 days. Time on Cray 
 ∗ ∗ ∗

X-MP when considered as a sparse matrix: 60 ∗ ∗ ∗ ∗ ∗
seconds.
Figure 1.2: An matrix illustrating the pro-
Considering matrix sparsity we must care file Cholesky scheme. We have f1 = 1, f2 =
about the positions of nonzeros in the matrix 1, f3 = 2, f4 = 3, f5 = 1.
patterns of A and L. For a given vector v ∈ Rk
define Sometimes we use a rougher measure of the
quality of the ordering - we are minimizing
Struct(v) = {j ∈ k̂|vj 6= 0}. only band of the matrix - (often) defined as
Usually, nonzero elements are introduced β = max δi , for i ∈ n̂.
into the new positions outside the pattern of
∗ ∗ ∗
 
A during the decomposition. These new ele-
∗ ∗ ∗ ∗
ments are known as fill-in elements. In order 


∗ ∗ ∗ ∗ ∗
to reduce time and storage requirements, it is

 
∗ ∗ ∗ ∗
 
necessary to minimize the number of the fill- 



in elements. This can be accomplished by a  ∗ ∗ ∗ ∗
combination of a good choice of data structures ∗ ∗
used for matrix elements, matrix ordering and
an efficient implementation of the pseudo-code. Figure 1.3: An example of a band matrix
A typical example showing how the matrix with β = 2.
ordering influences time and storage is the case
of an arrow matrix in the Figure 1.1. While in More advanced variant of this principle re-
the first matrix A we do not get any fill-in dur- lies on the dynamical reordering of the matrix
ing the Cholesky factorization process, reverse to get the nonzeros as close to the diagonal
ordering of variables provides matrix Ā, which as possible during the Cholesky factorization.
completely fills after the first step of the de- Such an algorithm we call the frontal method.
composition: In this case we use in any step only the el-
ements of a certain window which is moving
2.1.1 A Special sparsity structure down the diagonal.
– Profile, band and frontal For the algorithms to reorder matrices ac-
schemes cording to this principles see [18]. Advantages
A well-known way to make use of the sparsity of the methods considered in this subsection
in the Cholesky factorization is to move the are in their simplicity. To store the nonzeros
nonzero elements of A into the area “around” we need to store only that part of the matrix
A which is covered by the elements which de- mobile chassis). Dimension of a the matrix:
termine the profile or the band of A or the n = 44609. Memory size used and number of
dynamical window of the frontal method. floating-point operations for the factor L for
Simple result (see, for instance, [18]) guar- the frontal solver: 52.2 MByte / 25 Billion.
antees that all the nonzeros of L are inside Memory size used and number of floating-point
the above-mentioned part determined from the operations for L when general sparse solver was
pattern of A. This observation justifies the used: 5.2 MByte / 1.1 Billion.
three algorithmic types mentioned in this sub-
section. The second reason is architectural. Hard-
To implement band elimination we need ware gather/scatter facilities used in modern
only to store nonzero elements in a rectangular computers (see [26]) caused that even the sim-
array of the size β × n. Some problems gen- plicity of data structures for band, profile
erate matrices with a main diagonal and with and frontal solvers are not able to guarantee
one or more nonzero subdiagonals. In this case competitive computation times. They behave
we can store also these diagonals and diagonals worse than general sparse solvers even in the
necessary for fill-in as a set of “long” vectors. case when the difference in number of fill-in el-
To implement the profile method, we usually ements (size of the factor L) is not so dramatic.
need even less nonzero positions and one addi-
tional pointer vector to point to the first nonze- 2.1.2 General Sparse Solvers
ros in the matrix rows. Frontal elimination To describe basic ideas used in today’s general
needs vectors to perform row and column per- sparse solvers we need to introduce some ter-
mutations of the system dynamically, through- minology. Undirected graphs are useful tools
out the factorization steps. More complicated in the study of symmetric matrices. A given
implementation is compensated by the advan- matrix A can be structurally represented by its
tage of smaller working space. All these pos- associated graph G(A) = (X(A), E(A)), where
sibilities can be considered in both the right- nodes in X(A) = {1, . . . , n} correspond to rows
and left-looking implementations but the dif- and columns of the matrix and edges in E(A)
ferences are not, in general, very large since all correspond to nonzero entries.
these models are based on similar principles. Filled matrix F = L + LT contains gener-
Although all these three schemes are very ally more nonzeros than A. Structure of F is
simple to implement and also the data struc- captured by the filled graph G(F ). The prob-
tures are simple (nonzero parts of rows, lem how to get structure of nonzeros of F was
columns or diagonals are stored in vectors or solved first in [33] using graph-theoretic tools
rectangular arrays as dense pieces), they are to transform G(A) to G(F ).
not used very often as in recent sparse sym- An important concept in sparse factoriza-
metric solvers. tion is the elimination tree of L. It is defined
First reason is algorithmic. General sparse by
schemes may have much less fill-in elements
than the previous schemes would implicitly parent(j) = min {i | lij 6= 0, i > j}.
suppose. We demonstrate this fact on the pre-
viously mentioned example taken from [30]: In other words, column j is a child of column
Example 2.2 Comparison of factor size and i if and only if the first subdiagonal nonzero
number of floating-point operations for the ma- of column j in L is in row i. Figure 1.4 shows
trix arising from the finite element approxi- structures of matrices A and L and the elimi-
mation of a 3D problem (design of an auto- nation tree of L.
5
✉
 
∗ ∗

 ∗ ∗ ✉ ✉ 4
∗ ∗ ∗ 2
 

∗ ∗ ∗ f
 
✉ ✉
∗ ∗ f ∗
1 3
Figure 1.4: Structures of matrices A, L and of the elimination tree of L. By stars we denote
original nonzeros of A, additional fill-in in L is denoted by the letter f .
Following few lemmas recall some impor- Moreover, it can be shown that Tr [i] is a
tant properties of elimination trees which will pruned subtree of T [i] and that its leaves can
help us to understand basic principles of sparse be easily determined.
solvers. For a deeper insight into this subject The important corollary from these consid-
we refer to [25]. Note, that elimination tree erations is that structure of rows in L can be
can be computed directly from the structure easily determined using the elimination tree.
of nonzeros of A in complexity nearly linear in The second important problem concerning
n (see [27]). the implementations - determination of the
Lemma 2.1 If lij 6= 0, then the node i is an structures of columns of L - can be also very
ancestor of j in the elimination tree. easily solved also using the elimination tree:
This observation provides a necessary con-
dition in terms of the ancestor-descendant re- Lemma 2.4 Struct(L∗j ) ≡ {i|lij 6= ∅ ∧ i ≥ j}
= k is son j Struct(L∗k ) Struct(A∗j )
S S
lation in the elimination tree for an entry to be
nonzero in the filled matrix. in T (A)
−{1, . . . , j − 1}
Lemma 2.2 Let T [i] and T [j] be two disjoint
subtrees of the elimination tree (rooted at i and This is a simple corollary of the Cholesky
j, respectively). Then for all s ∈ T [i] and decomposition step and dependency relations
t ∈ T [j], lst = 0. captured by the elimination tree. This formula
means that in order to get the structure of a
One important problem concerning the
column i in L we need only to merge struc-
Cholesky factorization is how to determine row
tures of the column i in A with structures of
structures of L. For instance, in the left-
sons of i in L. Consequently, the algorithm
looking pseudo-code, nonzeros in a row k of
to determine the column structure of L can be
L correspond to columns from {1, . . . , k − 1}
implemented in O(m) operations where m is a
which contribute to column k.
number of nonzeros in L.
Lemma 2.3 lij 6= 0 if and only if the node j is The elimination tree gathers the most im-
an ancestor of some node k in the elimination portant structural dependencies in the Cholesky
tree, where aik 6= 0. factorization scheme. Numbering of its vertices
This result can be used to characterize the determines the order in which the matrix en-
row structure of the Cholesky factor. Define tries are processed by the solver. Moreover, we
Tr [i], the structure of the i − th row of the can renumber vertices and/or even modify the
Cholesky factor as follows elimination tree while preserving the amount
of fill-in elements in the correspondingly per-
Tr [i] = {j|lij 6= 0, j ≤ i}. muted factor L. Motivations for such changes
will be described in the following subsection.
Then we have Basic structure of the general sparse left-
looking solver can then be given in the follow-
Tr [i] ⊆ T [i]. ing four steps (not taking into account the or-
dering phase): Basic approach how to accomplish that is

to use rather operations with blocks of data in-
• Form the elimination tree. stead of operations with matrix elements. This
important observation is now commonly used
• Find the structure of columns of L (sym- in linear algebra operations (see [15]). It was
bolic factorization). introduced after the spreading of vector su-
percomputer architectures. The main problem
• Allocate the static data structures for L
with achieving the supercomputer performance
based on the result of the previous step.
on these architectures was to keep the vector
functional units busy - to get enough data for
• Do numerical updates corresponding to
them. Blocking data to use matrix-vector and
the left-looking pseudo-code.
matrix-matrix operations was found to be very
Implementation of the right-looking solver successful.
can be based on the similar scheme. A comparison of number of memory refer-
One of the popular type of the right-looking ences and number of flops for the three types
algorithm is the multifrontal method (see [12], of basic operations in dense matrix computa-
[3]). In that method, the update step of the tions is in the Table 1.1. We consider α to be
pseudo-code (steps (6)–(10)) is implemented a scalar, x, y to be vectors and B, C, D to be
using dense matrix operations. Updates are (square) matrices; n represents the size of the
stored separately, usually in the separate stack vectors and matrices.
working area, but some of them can make use #m.r.
operation # m. r. # flops #flops
of the final space for L. We do not discuss
3
this method as a special case in our overview y ← αx + y 3n 2n 2
of sparse techniques, because its implementa- 1
y ← Dx + y n2 2n2 2
tions on various computers can mix the ideas
2
previously mentioned. B ← DC + B 4n2 2n3 n
Since we can effectively compute structures
of columns and rows of L, the symbolic over- Table 1.1: Comparison of number of mem-
head in the computations is rather small. But ory references (m.r.) and number of flops for
this does not mean that our computations are the three types of block hierarchy in dense ma-
effective. The problem is that the choice of trix computations.
the algorithmic strategy has to match with the
computer architecture. We will consider some A similar principle can be applied also
issues of this kind in the following subsection. for the scalar computations in general sparse
solvers. Note, that vector pipelining is not
2.1.3 Let the architecture reign decisive; much more important is keeping
Suppose we have the usual memory hierarchy the CPU unit as busy as possible while
with fast and small memory parts at its top minimizing the data transfers. Thus, the
(registers, cache) and slower and bigger parts efficient implementations of both the left-
at the bottom. Usually, it is not possible to looking and right-looking algorithms require
embed the large problems completely into the that columns of L sharing the same sparsity
cache and transfer between memory hierarchy structure are grouped together into supern-
levels takes considerable time. On the other odes. More formally, the set of continguous
side, computations are the most effective only columns {j, j + 1, . . . , j + t} constitutes a su-
if the active data are as close to the top as pernode if Struct(l∗k ) = Struct(l∗k+1 ) ∪ {k}
possible. This implies that we must in some for j ≤ k ≤ j + t − 1. Note that these columns
way maximize the proportion of the number of have a dense diagonal block and have identical
floating-point operations (flops) to number of column structure below row j +t. A supernode
memory references which enable us to keep the can be treated as a computational and storage
data in the cache and registers. unit. Following example shows the difference
in timings for the sparse Cholesky right-looking Example 2.4 Comparison of number of page
decomposition using supernodes. faults for a matrix arising from the 9-point dis-
Example 2.3 Elapsed time for the linear sys- cretization of a regular 2D 180 × 180 grid (di-
tem with the SPD matrix of the dimension n = mension n = 32400). Number of page faults
5172 arising from the 3D finite element dis- for an level-by-level ordering of the elimination
cretization. Computer: 486/33 IBM PC com- tree from the bottom to the top: 1.670.000.
patible. Virtual paging system uses the mem- Number of page faults using postordering of
ory hierarchy: registers /cache /main mem- the elimination tree: 18.000.
ory /hard disk. General sparse right-looking Another strategy to obtain the equivalent
solver. Elapsed time without block implemen- reordering that can reduce the active storage
tation: 40 min. Elapsed time with supernodal is to rearrange the sequences of children in the
implementation: 4 min. elimination tree. The situation is depicted on
the Figure 1.6. While on the left part of the
Another way how to decrease the amount figure we have a tree with some initial pos-
of communication in the general sparse solver tordering, on the right side we are numbering
is to try to perform the update operations in “large” children first. We are doing it in any
such a way that the data necessary in a step node and recursively. For instance, considering
of the decomposition are as close together as vertex 18, we have numbered “largest” subtree
possible. Elimination tree can serve as an ef- of the right side elimination tree first.
ficient tool to describe this principle. Having More precisely, the new ordering is based
an elimination tree, we can renumber some of on the structural simulation of the Cholesky
its vertices in such a way that the Cholesky decomposition. Active working space in each
decomposition with the corresponding permu- step can be determined for various renumber-
tation will provide the factor L of the same ings of children of any node. In this way,
size. Consider the renumbering of the elimina- the recursive structural simulation can deter-
tion tree from the Figure 1.4. This elimination mine a new renumbering permuting the sub-
tree is renumbered by a so-called postordering trees corresponding to the tree nodes in order
(a topological ordering numbering any subtree to minimize the active working space without
by an interval of indices). This reordering is changing the elimination tree. Consider, for in-
equivalent to the original one in the sense that stance, vertex 18 in the elimination trees. De-
it provides the same factor L. cision, in which order we will process its sons
5
✉ (and, of course, all its subtrees considering pos-
tordered elimination tree) is based on the re-
cursively computed value of the temporary ac-
✉ ✉ 4
tive working space. The natural idea is to start
1
with processing of “large” subtrees. Both the
✉ ✉
children rearrangements are postorderings of
2 3 the original elimination tree. The actual size of
the working space depends directly on whether
we are minimizing working space for the mul-
Figure 1.5: Postordering of the elimination
tifrontal method or for some left-looking or
tree of the Figure 1.4
right-looking out-of-core (out-of-cache) solver.
Since the postordering numbers indices in If we are looking for the equivalent transfor-
such a way that vertices in any subtree are mations of the elimination tree in order to min-
numbered before giving numbers to any dis- imize the active working space, we can not only
joint subtree, we can expect much less data shrink vertices into supernodes and renumber
communications than in the previous case. elimination tree. We can also to change the
The difference is described by the number of whole structure of the elimination tree. Theo-
page faults in the virtual paging system in the retical basis of this transformation is described
following example (see [31]). in [29].
✉ 19
✉ 19
17
18 ✉ 18 ✉
✉
6
✉
16 ✉ ✉ 11
5 ✉ ✉17
5 10
11 ✛ ✲ 14 ✉ ✉15 ✉ ✉
3 ✉ ✉4 ✉ 16 ✉
✉ ✉ 3 ✉ 4 ✉ ✉ ✉ 8
✉ ✉ 9 ✉ ✉ ✉ ✉ 14
12 13 9
1 2 10 15
✉ ✉ ✉ ✉
✉ ✉ ✉ ✉
7 8 12 13 1 2 6 7
Figure 1.6: Rearranging of the children sequences in the elimination tree.
Instead of the balanced elimination tree as cal parameters we are not able even to de-
provided by some matrix reorderings minimiz- cide which combination of the left-looking and
ing fill-in elements we can use unbalanced elim- right-looking techniques is optimal. There are
ination tree which can in practice decrease the many open problems in this area. Theoretical
active working space about the size up to 20%. investigation leads often to the directly appli-
Balanced and unbalanced elimination trees cable results.
are schematically depicted on the Figure 1.7.
2.2 Recent development in iterative
In practice, all these techniques deal with solvers: steps towards a black box
the supernodes rather than with the individual iterative software?
entries. A large amount of black box software in the
Balanced and effective use of the described form of mathematical libraries as LAPACK
techniques is a difficult task strongly depend- (LINPACK), NAG, EISPACK, IMSL etc. and
ing on the computer architecture for which the general sparse solvers as MA27, MA42, MA48,
solver is constructed. For computers with rel- UMFPACK, etc. have been developed and are
atively quick (possibly vectorizable) floating- widely used in many applications.
point operations and slow scalar arithmetics, Users can exploit this software with high
one can effectively merge into the supernodes confidence for general problem classes. Con-
more vertices despite the fact that the resulting cerning systems of linear algebraic equations,
structure of L would have additional nonzeros codes are based almost entirely on direct meth-
(see [5]). On the other hand, sometimes it is ods. Black box iterative solvers would be
necessary to construct smaller supernodes by highly desirable and practical - they would
breaking large blocks of vertices into pieces (see avoid most of the implementation problems re-
[35]). This is the case of workstation and also lated to exploiting the sparsity in direct meth-
of some PC implementations. Elimination tree ods. In recent years many authors devote a
rearrangements provide an a posteriori infor- lot of energy into the field of iterative methods
mation for the optimal partitioning of blocks and a tremendous progress have been achieved.
of vertices. For a recent survey we refer e.g. to [14] and [8].
Computer architecture is the prominent Sometimes the feeling is expressed that this
source of information for implementing any progress have already established a firm base
general sparse solver. Without knowledge of for developing a black box iterative software.
the basic architectural features and techni- This is, however, very far from our feeling.
✉ ✉
✉ ✉
✉ ✉ ✛ ✲ ✉ ✉
✉ ✉ ✉ ✉ ✉
✉ ♣✉ ✉ ✉ ✉ ✉ ✉ ✉ ✉
✉ ✉
✉ ✉
✉ ✉
✉ ✉
Figure 1.7: Balanced and unbalanced elimination trees.
Strictly speaking, we do not believe that about the one previous step and it is station-
any good (fast, precise, reliable and robust) ary because neither B nor c depend upon the
black box iterative software for solving systems iteration step k. Everyone knows examples as
of linear algebraic equations will be available in the Richardson method, the Jacobi method,
the near future. This section describes several the Gauss-Seidel method, the Successive Over-
good reasons supporting our opinion. Many it- relaxation method (SOR) and the Symmetric
erative methods have been developed; for the Successive Overrelaxation method (SSOR). We
excellent surveys of the classical results we re- are not going to repeat the formulas or the the-
fer to [39], [41], [7] and [22]. For historical rea- ory of these methods here, that can be found
sons we recall briefly the basic iterative meth- elsewhere. Instead, we recall the very well
ods and then turn to the state-of-the-art: the known fact, that these simple methods (espe-
Krylov space methods. cially SOR and SSOR) may show an excellent
The best known iterative methods - the lin- performance while carefully tuned to a specific
ear stationary iterative methods of the first problem, but their performance is very prob-
kind - are characterized by the simple formula lem - sensitive. This lack of robustness avoid
their general use for a wide class of problems.
x(k) = Bx(k−1) + c
Nonstationary iterative methods differ
where x(k) is the current approximation to the from stationary methods in that the param-
solution at the k-th step, (x(0) is given at the eters of the formula for computing the current
beginning of computation), the n by n matrix approximation depend on the iteration step.
B and vector c characterize the method. Any Consequently, these methods are more robust;
method of this type is linear because x(k) is in many cases they are characterized by some
given as a linear function of the previous ap- minimizing property. In the last years most of
proximations; it is of the first kind because the the effort in this field is devoted into the Krylov
iteration formula involves the information just space methods.
Krylov space methods for solving linear U ∗ U = U U ∗ = I, I is the identity matrix, into
systems start with an initial guess x(0) for the the polynomial formulation of the methods.
solution and seek the k-th approximate solu- For nonnormal matrices, hovewer, no unitary
tion in the linear variety eigendecomposition exist. Despite that, many
authors extend intuitively the feeling that the
x(k) ∈ x(0) + Kk (A, r (0) ) spectrum of the coefficient matrix plays deci-
sive role in the characterization of convergence
where r (0) = b−Ax(0) is the initial residual and even in the nonnormal case. This is actually
Kk (A, r (0) ) is the k-th Krylov space generated a very popular belief which is (at least implic-
by A, r (0) , itly) present in discussions of the experimental
results in many papers. This belief is, however,
Kk (A, r (0) ) = span {r (0) , Ar (0) , . . . , Ak−1 r (0) }.
wrong!
Then, the k-th error and the k-th residual are As an example we can take the GMRES
written in the form method. GMRES approximations are chosen
to minimize the Euclidean norm of the resid-
x − x(k) = pk (A)(x − x(0) ) ual vector r (k) = b−Ax(k) among all the Krylov
space methods, i.e.,
r (k) = pk (A)r (0) ,
where pk ∈ Pk , Pk denotes the set of polyno- k r (k) k= min k b − Au k .

u∈x(0) +Kk (A,r (0) )
mials p(λ) of the degree at most k satisfying
p(0) = 1. Based on that, Krylov space meth- Residual norms of successive GMRES approx-
ods are also referred to as polynomial methods. imations are nonincreasing, since the residuals
An enormous variety of the Krylov space meth- are being minimized over a set of expanding
ods exists, including the famous conjugate gra- subspaces. Notice that due to the minimizing
dient method (CG), the conjugate residual property, the size of GMRES residuals forms a
method (CR), SYMMLQ, the minimal resid- lower bound for the size of the residual of any
ual method (MINRES), the biconjugate gra- other Krylov space method. In other words,
dient method (BiCG), the quasiminimal resid- GMRES shows how small residual can be found
ual method (QMR), the generalized minimal in the variety
residual method (GMRES), giving just a few
names. r (0) + AKk (A, r (0) ).
In the rest of this subsection we will concen-
trate on the Krylov space methods and show If GMRES performs poorly, then any other
several principial questions which must be sat- Krylov space method performs poorly as well
isfactorily answered prior to constructing any (measured by the size of the residual). But the
good black box iterative solver. size of the residual - that is the only easy-to-
First question can be formed as: What compute convergence characteristic.
characterizes convergence of the Krylov space A key result was proven in [21] (for the
method? This is certainly a key question. other related results see also [20]). It can be
Without giving an answer one can hardly build reformulated in a following way. Given a non-
up theoretically well justified preconditioned increasing positive sequence f (0) ≥ f (1) ≥
methods. For the Hermitian and even the nor- . . . f (n − 1) ≥ 0, and a set of nonzero com-
mal systems (i.e. the systems characterized by plex numbers {λ1 , . . . λn }, there is a n by n
the Hermitian or normal matrix) the answer matrix A having eigenvalues {λ1 , . . . λn }, and
is: the rate at which a Krylov space method an initial residual r (0) with k r (0) k= f (0)
converges is determined by the eigenvalue dis- such that the GMRES algorithm applied to
tribution and the initial approximation (ini- the linear system Ax = b, with initial residual
tial error, initial residual). It can be clearly r (0) , generates approximations x(k) such that
seen by substituting the unitary eigendecom- k r (k) k= f (k), k = 1, 2, . . . , n − 1. In other
position of the the matrix A, A = U ΛU ∗ , words, any nonincreasing convergence curve
can be obtained with GMRES applied to a ma- cannot say anything about the size of the ul-
trix having any desired eigenvalues! The re- timate (or “final”) residual in practical com-
sults of [20] and [21] demonstrate clearly that putations. Consequently - we cannot predict a
eigenvalues alone are not the relevant quanti- priori the precision level on which the iteration
ties in determining the behavior of GMRES for should be stopped! For more details we refer
nonnormal matrices. It remains an open prob- to [11] and [36].
lem to determine the most appropriate set of One can form many other questions of sim-
system parameters for describing the GMRES ilar importance. As stated earlier, a good it-
behavior. erative black box software must be fast, pre-
A second question is: What is the precision cise, reliable and robust. In all these at-
level which can be achieved by iterative meth- tributes it must compete with highly effective
ods and which stopping criteria can be effec- (sparse) modern direct codes. We have dis-
tively used? The user will always ask: Where cussed here some troubles related to the first
to stop the iteration? Stopping criteria should two attributes. Even from the short character-
guarantee a small error. If error is considered ization of iterative methods given above it is
as a distance to the true solution (measured clear that the third and fourth attributes cause
e.g. in the Euclidean norm), then the question also a lot of problems which are unresolved yet
is too hard - one usually has no tools to esti- (lacking in space we are not going into details
mate directly this so called forward error. The here). Based on that, we do not believe in con-
other possibility is to consider the error in the structing competitive black box iterative solvers
backward sense, i.e., consider the approxima- in the near future.
tion x(k) to the solution x as the exact solution
of a perturbed system 2.3 Direct or iterative solvers?
In this section we will first give some considera-
(A + ∆A)x(k) = b + ∆b, tions concerning complexity of direct and itera-
tive methods for the solution of linear systems
and try to make the perturbations ∆A and ∆b arising in one special but important applica-
as small as possible. It is well known, see [24], tion (see [40]). Then we will state objections
that for a given x(k) such a minimal perturba- against the straightforward generalization of
tions, measured in the Euclidean resp. spectral this simple case.
norms, exist, and their size is given by The matrix for our simple comparison
arises from the self-adjoint elliptic boundary-
min{ν : (A + δA)x(k) = b + ∆b, value problem on the unit cube in 2D or 3D.
k ∆A k / k A k≤ ν, The domain is covered with a mesh, uniform
k ∆b k / k b k≤ ν} = and equal in all 2 or 3 dimensions with mesh-
width h. Discretizing this problem we get the
= k b − Ax(k) k /(k A kk x(k) k + k b k).
symmetric and positive definite matrix A of the
dimension n and the right-hand side vector b.
Consequently, to guarantee a small backward
Consider the iterative method of conjugate
error, it is sufficient to use the stopping criteria
gradients applied to this system. Considering
based on the value of
exact arithmetics,
√ the error reduction per it-
κ−1
k b − Ax(k) k /(k A kk x(k) k + k b k). eration is ∼ κ+1 , where κ is the condition
√
number of A defined as κ = ||A||||A−1 ||.

The problem seems to be solved, because the Relation among the condition number κ
residual is always available and the spectral and problem dimension as follows: for a 3D
2
norm of A can be approximated e.g. by the problem we have κ ∼ h−2 ≈ n 3 , for a 2D prob-
easily computable Frobenius norm. A careful lem we have κ ∼ h−2 ≈ n.
reader will, however, raise a question about Then, for the error reduction below the
rounding errors. This is a real crucial point. level ǫ, we have that
Without a careful rounding error analysis we
3. Computing eigenvalues - a principal problem!
j
1− √1κ

√2 )j
torization (see [13]) when only one right-hand
1+ √1κ
≈ (1 − ≈ exp( −2j
κ
√ ) < ǫ =⇒
κ side is present.
log ǫ √
j∼− 2 κ. Summarizing, sparse direct solvers would
Assume the number of flops per iteration to win as a general purpose codes up to the very
be ∼ f n (f is a small integer standing for the large size of the problems n. For specific ap-
average number of nonzeros per row and the plications, or extremely large n, the iteration
overhead introduced by the iterative scheme). with preconditioning might be a better or even
Then the number of flops for the convergence the only alternative. This conclusion repre-
below the level ǫ is proportional to f nj ∼ n 3
4 sents the state of knowledge in the early 90’s
3
for 3D problems and ∼ f nj ∼ n 2 for 2D prob- and are, of course, subject to change depending
lems. on the progress in the field.
It is known, that many preconditioners (see 3 Computing eigenvalues - a princi-
[10]) are able to push the condition number of pal problem!
the system down to O(h−1 ). Then the number
7 To show how hopeless and dangerous might be
of flops per reduction to ǫ is given by ∼ n 6 and a naive “software → solution” approach with-
5
∼ n 4 for 3D and 2D problem, respectively. out understanding the “nature” of the problem
Consider now a direct method. Using ef- we consider an “elementary” problem - com-
fective ordering strategies, we can have for the puting eigenvalues {λ1 , λ2 , . . . , λn } of a n by n
matrix mentioned above number of operations matrix A.
4
∼ n2 and the size of the fill-in ∼ n 3 in 3 di- We will not specify the algorithm; just sup-
mensions. For 2D problem the corresponding pose that it is backward stable, i.e., the com-
3
numbers are ∼ n 2 for the number of opera- puted approximations {µ1 , µ2 , . . . µn } are ex-
tions and ∼ n log n for the fill-in size. The act eigenvalues of the matrix A which is just a
corresponding estimates and their relation to slight perturbation of the matrix A,
practical problems can be found in [34] and
4
[1]. Back substitution can be done in ∼ n 3 A = A + E, k E k≤ δ,
operations for the 3D problem and in n log n
operations for the 2D problem. where δ is small (proportional to the ma-
chine precision). That is the best we can
If we have to solve one system at a time
hope for. A question is, however, how close
then for large ǫ (small final precision) or very
are {µ1 , . . . , µn } to {λ1 , . . . , λn }. We define
large n, iterative methods may be preferable.
the (optimal) matching distance between the
Having more complicated mesh structure or
eigenvalues of A and A as
more right-hand sides, direct methods can be
usually preferable up to very large matrix di- md(A, A) = min{max |λπ(i) − µi |}
π i
mensions. Iterative methods are usually more
susceptible to instabilities (or slowing down where π is taken over all permutations of
the convergence) in finite precision arithmetics. {1, . . . , n}. Using a naive approach, one might
Moreover, notice that the additional effort due say: eigenvalues are continuous functions of
to the computation and use of the precondi- the matrix coefficients. Therefore we can ex-
tioner are not reflected in the asymptotic for- pect that for A close to A, the corresponding
mulas. For many large problems we need so- eigenvalues will be also close to each other and
phisticated preconditioners which increase sub- md(A, A) will be small.
stantially the computational effort described The last conclusion is, of course, wrong!
for the model problem. For a general matrix A, there is no bound on
The amount of memory needed for com- the md(A, A) linear in k E k, i.e., we can
putations makes also an important defference. not guarantee anything reasonable about the
This is usually much smaller for the iterative precision of the computed eigenvalue approx-
methods. On the other side, we can use space imations based on the size of the backward
of the size O(n) for the sparse Cholesky fac- error. Even worse, for any small δ and any
4. Sparse linear solvers: parallelism in attack or in defense?
large ω, one can find matrices A, A such that counterparts. One could add that for the
k E k=k A − A k≤ δ and md(A, A) ≥ ω. Any greater complexity and irregularity, sparse ma-
small perturbation of the matrix (any small trix computations are more realistic represen-
backward error) may in principle cause an ar- tatives of typical scientific computations, and
bitrary large perturbation of eigenvalues! In therefore more useful as benchmark criteria,
other words – even the best software gives you, than the dense matrix computations, that usu-
in general, no guarantee on the precision of the ally played this role.
computed results. This is certainly a striking Despite the difficulty with sparse matrix
statement. computations on the advanced computer ar-
Fortunately, there is an important class of chitectures, some noticeable success has been
matrices for which the situation is more opti- achieved in attaining very high performance
mistic. We recall the following theorem (for (see [9]) and the needs of sparse matrix com-
details see, e.g., [37]): putations have had notable effect on computer
Theorem 3.1 Let A be normal. Then design (indirect addressing with gather/scatter
md (A, A) ≤ (2n − 1) k E k. facilities). Nevertheless, it is ironic that sparse
For Hermitian matrices even stronger results matrix computations contain more inherent
can be proven. We see that for normal ma- parallelism than the corresponding dense ma-
trices the size of the backward error essentially trix computations, yet typically show signifi-
determine the precision of the computed eigen- cantly lower efficiency on today’s parallel ar-
value approximations. chitectures.
To summarize, for normal matrices any Roughly speaking, the most widely avail-
good (i.e. backward stable) method will give able and commercially successful parallel archi-
us what we want - good approximation to the tectures fall into three classes : shared-memory
true eigenvalues. For highly nonnormal ma- MIMD computers, distributed-memory MIMD
trices, however, the computed approximation architectures and SIMD computers. Some ma-
may be very far from the true eigenvalues even chines have an additional level of parallelism in
if the best software is used. the form of vector units within each individual
In this context please notice that many processor. We will concentrate on the general
times authors just plot the computed eigen- and widely applicable principles which can be
value approximations and declare it as the true used in wide variations of these computing en-
eigenvalues without paying any attention to vironments.
the normality (or other relevant properties) of In the sparse Cholesky decomposition we
their matrices. can analyze the following levels of parallelism
(see [28]):
4 Sparse linear solvers: parallelism • Large-grain parallelism in which each
in attack or in defense? computational task is the completion of
To show the difficulties related to parallel im- all columns in a subtree of the elimina-
plementations of linear solvers, we concentrate tion tree.
here on sparse direct solvers. Description of the
parallel implementation of the iterative meth- • Medium-grain parallelism in which each
ods is much more simple and can be found else- task correspond to one simple cycle of
where. column update cmod or column scaling
Dense matrix computations are of such ba- cdiv operation in the left- and right-
sic importance in scientific computing that looking pseudo-codes.
they are usually among the first algorithms
• Fine-grain parallelism in which each task
implemented in any new computing environ-
is a single floating-point operation or a
ment. Sparse matrix computations are equally
multiply-add pair.
as important, but both their performance and
their influence on computer system design have Fine-grain parallelism can be exploited in
tended to lag those of their dense matrix two distinctly different ways:
1. Vectorization of the inner cycles on vec- processors, the algorithm of sparse Cholesky
tor processors. decomposition is scalable. (by a scalable al-
gorithm we call an algorithm that maintains
2. Parallelizing the rank-one update in the efficiency bounded away from zero as the num-
right-looking pseudo-code. ber p of processors grows and the size of the
Vectorization of the operations is one of data structures grows roughly linearly in p,
the basic tools to improve effectiveness of see [19]). Up to date, however, even this
the sparse solvers using array processors, vec- method has not achieved high efficiency on a
tor supercomputers or RISC processors with highly parallel machine. With this note we left
some pipelining. Efficient vectorization was a the realm of the highly-parallel and massively-
very strong argument to promote band, pro- parallel SIMD machines aside and we will turn
file and frontal solvers when first vector pro- to the parallelism exploited in the most pop-
cessors appeared. As noted above, except for ular parallel implementations: large-grain and
special types of discretized partial differential medium-grain left-looking algorithms and mul-
equations, they are not very important now, tifrontal codes.
and other concepts are used for the Cholesky Let us turn now to the problem of medium-
decomposition of general matrices. This is grain algorithms. Of the possible formulations
caused by enormous work which was done of the sparse Cholesky algorithms, left-looking
in the research of direct sparse solvers, by algorithm is the simplest to implement. It
gather/scatter facilities in today’s computers is shown in [16], that the algorithm can be
for scientific computing and by the high-speed adapted in a straightforward manner to run ef-
scalar arithmetics in workstations. ficiently in parallel on shared-memory MIMD
Hardware gather/scatter facilities can usu- machines.
ally reach no more than 50% of the perfor- Each column j corresponds to a task
mance of the corresponding dense vector oper-
ations. No wonder, that the use of dense vec- T col(j) = {cmod(j, k)|k ∈ Struct(l∗j )}
tors and/or matrices in the inner cycles of the ∪{cdiv(j)}.
Cholesky decomposition is still preferable. The These tasks are given to a task queue in
above-mentioned multifrontal implementation the order given by some possible rearrange-
of the right-looking algorithm widely exploits ment of columns and rows of A, i.e., given
this idea. The structure of the elimination tree by some renumbering of the elimination tree.
enables to deal only with those elements which Processors obtain column tasks from this sim-
correspond to nonzeros in the factor L. ple “pool of tasks” in this order. The basic
To obtain better performance using vec- form of the algorithm has two significant draw-
tor functional units, we usually strive to have backs. First, the number of synchronization
dense vectors and matrices of the sufficiently operations is quite high. Second, since the al-
large dimensions (we are not going into the gorithm does not exploit supernodes, it will not
careful analysis of the situation which is usu- vectorize well on vector supernodes with mul-
ally quite more complex). Thus, forming su- tiple processors. The remedy is to use the su-
pernodes is usually highly desirable since it pernodes to decrease also the synchronization
helps to increase the dimension of the subma- costs.
trices processed in the inner cycle. Algorithms for distributed-memory ma-
The problem of parallelizing rank-one up- chines are usually characterized by the a pri-
date is a difficult one, and research on this ori distribution of the data to the processors.
topic is still in its infancy (see [19]). Note, In order to keep the cost of the interproces-
that the right-looking algorithm presents for sor communication at acceptable levels, it is
SIMD machines much better alternative that essential to use local data locally as much as
the column left-looking approach. When rows possible. The distributed fan-out (see [17]),
and columns of the sparse matrix A are dis- fan-in (see [6]), fan-both (see [4]) and multi-
tributed to the rows and columns of a grid of frontal algorithm (see overview [23]) are typi-
cal examples of the implementations. All these this algorithm are usually much lower than by
algorithms are designed in the following frame- the historically older fan-out algorithm.
work: The fan-both algorithm was described as
an intermediate parametrized algorithm parti-
• They require assignment of the matrix tioning both the subtasks T col(j) and T sub(j).
columns to the processors. Processors are sending both the aggregate col-
umn updates and completed columns.
• They use the column assignment to dis- The distributed multifrontal algorithm par-
tribute the medium-grained tasks in the titions among the processors the tasks upon
outer loop of left- or right-looking sparse which the sequential multifrontal method is
Cholesky factorization. based:
The differences among these algorithms 1. Partial dense right-looking Cholesky fac-
stem from the various formulations of the torization for the vertices of independent
sparse-Cholesky algorithm. subtrees of the elimination tree.
The fan-out algorithm is based on the
right-looking Cholesky algorithm. We will de- 2. Medium-grain or large-grain subtasks of
note the k−th task performed by the outer loop the partial Cholesky factorizations of the
of the algorithm (lines (3)–(10) of the right- dense matrices.
looking pseudo-code) by T sub(k), which is de-
The first source of the parallelism is prob-
fined by
ably the most natural. Its theoretical justifi-
T sub(k) = {cdiv(k)} cation is given by the Lemma 1.2. Indepen-
∪{cmod(j, k)|j ∈ Struct(l∗k )}. dent branches of the elimination tree can be
That is, T sub(k) first forms l∗k by scal- eliminated independently. Towards the root,
ing of the k − th column, and then perform number of independent tasks corresponding to
all column modifications that use the newly these branches decreases. Then tasks cor-
formed column. The fan-out algorithm par- responding to partial updates of the right-
titions each task T sub(k) among the proces- looking pseudo-code near to the root can be
sors. It is a data-driven algorithm, where the partitioned among the processors. Combining
data sent from one processor to another repre- these two principles we obtain the core of the
sent the completed factor columns. The outer distributed-memory (but also of the shared-
loop of the algorithm for a given processor reg- memory) multifrontal method. All the parallel
ularly checks the message queue for the incom- right-looking implementations of the Cholesky
ing columns. Received columns are used to factorization are essentially based on the same
modify every column j owned by the processor principles.
for which cmod(j, k) is required. When some Large-grain parallelism present in the con-
column j is completed, it is sent immediately temporary implementations is usually very
to all the processors, by which it is eventually transparent. We do not care very much about
used to modify subsequent columns of the ma- the mapping relation column–processor. The
trix. actual architecture provides hints how to solve
The fan-in algorithm is based on the left- this problem. Using for instance hypercube
looking Cholesky pseudo-code. It partitions architectures, the subtree-subcube mapping is
each task T col(j) among the processors in a very natural and we can speak about this type
manner similar to the distribution of tasks of parallelism.
T sub(k) in the fan-out algorithm. It is a Shared-memory vector multiprocessors
demand-driven algorithm where the data re- with a limited number of processing units rep-
quired from a processor pa to complete the resent a similar case. The natural way is to
j − th column on a given processor pb are gath- map rather large subtrees of the elimination
ered in the form of results cmod(j, k) and sent tree to the processor. The drawback in this sit-
together. Communication costs incurred by uation can be the memory management, since
5. Concluding remarks
we need more working space than if purely Naive hopes that with more processors we
scalar computations are considered. could avoid in some extent the difficulties faced
So far we have concentrated on the issues in scalar Cholesky decomposition came to an
related to the numerical factorization. We left disappointment. We are still trying to find bet-
out the issues of the symbolic decomposition, ter algorithmic alternatives which make both
initial ordering and making other symbolic ma- the scalar and parallel computations more ef-
nipulations in parallel. Although in case of fective.
scalar computations the numerical phase tim-
ing is usually dominant, in the parallel environ-
ment it is not always so. The following exam-
5 Concluding remarks
ple shows the proportion of the time spent in The world of numerical linear algebra is de-
the symbolic and numeric phases of the sparse veloping very fast. We have tried to show
Cholesky factorization. It is taken from [31]. that many problems which are considered by
Example 4.1 Comparison of ordering and the numerical analyst working in this area are
numeric factorization time for the matrix com- very complicated and still unresolved despite
ing from structural mechanics. Dimension n = the fact, that most of it has been formulated a
217918, number of nonzeros in A is 5.926.567, few decades or even one or two hundred years
number of nonzeros in L is 55.187.841. Or- ago.
dering time was 38s, right-looking factorization Among the current trends in this field we
time was 200s. want to point out the strong movement to-
Optimizing parallel performance of the wards the justification of the numerical algo-
symbolic manipulations (including matrix or- rithms. A method or algorithm programmed
dering, computation of row and column struc- to a code should not only give some output,
tures, tree manipulations, ...) is an important but it should guarantee an exactly defined re-
challenge. For the overview of the classical lation of the computed approximation to the
techniques see [23]. unknown true solution or warn the user about
Number of parallel steps to compute the the possible incorrectness of the result. Our
Cholesky decomposition is determined by the intuition must be checked carefully by formal
height of the elimination tree. But, while in proofs or at least by developing theories offer-
the one processor case we preferred the unbal- ing a deep insight into a problem. Attempts to
anced form of the elimination tree (see Figure solve problems without such insight may fail
1.7), the situation is different now. Unbalanced completely. Another trend can be character-
tree induces more parallel steps. Therefore, for ized by the proclamation: There is no simple
the parallel elimination, the balanced alterna- general solution to the difficult tasks as, e.g.,
tive seems to be better. On the other side, the solving large sparse linear systems. Consider-
cumulative size of working space is in this case ing parallel computers, things are even getting
higher. worse. A combination of different techniques
Also the problem of the renumbering of the is always necessary, most of which use very ab-
elimination tree is casted into another light in stract tools (as the graph theory etc. ) to
the parallel case. The level-by-level renum- achieve very practical goals. There is where
bering and balanced children sequences of the the way from theory to practice is very short.
elimination tree are the objects of further re-
search.
References
Even in the “scalar” case, all the implemen-
tations are strongly connected with computer [1] Agrawal, A. - Klein, P. - Ravi, R.: Cut-
architectures. It is only natural that, in the ting down on fill using nested dissection:
parallel environment, where some features of provably good elimination orderings, in:
the computing facilities (e.g. , communication) George, A. - Gilbert, J.R. - Liu, J.W.H.,
provide even more variations, this dependence eds. : Graph Theory and Sparse Matrix
is also more profound. Computation, Springer, 1993, 31–55.
[2] Aho, A.V. - Hopcroft, J.E. - Ullman, J.D.: [13] Eisenstat, S.C. - Schultz, M.H. - Sherman,
Data Structures and Algorithms, Addison A.H.: Software for sparse Gaussian elimi-
- Wesley, Reading, MA, 1983. nation with limited core memory, in: Duff,
I.S. - Stewart, G.W. eds., Sparse Matrix
[3] Ashcraft, C.: A vector implementation of Proceedings, SIAM, Philadelphia, 1979,
the multifrontal method for large sparse, 135–153.
symmetric positive definite linear systems,
Tech. Report ETA-TR-51, Engineering
[14] Freund, R. - Golub, G. - Nachtigal, N:
Technology Application Division, Boeing
Iterative solution of linear systems. Act.
Computer Services, Seattle, Washington, Numerica 1, 1992, pp 1-44.
1987.
[4] Ashcraft, C.: The fan-both family of [15] Gallivan, K.A. - Plemmons, R.J. - Sameh,
column-based distributed Cholesky fac- A.H.: Parallel algorithms for dense linear
torization algorithms, in: George, A., algebra computations, SIAM Review, 32
Gilbert, J.R., Liu, J.W.H., eds. : Graph (1990), 54-135.
Theory and Sparse Matrix Computation,
Springer, 1993, 159–190. [16] George, A. - Heath, M. - Liu, J.W.H.:
Solution of sparse positive definite sys-
[5] Ashcraft, C. - Grimes, R.: The influence tems on a shared-memory multiproces-
of relaxed supernode partitions on the sor, Internat. J. Parallel Programming, 15
multifrontal method, ACM Trans. Math. (1986), 309–325.
Software, 15 (1989), 291–309.
[6] Ashcraft, C. - Eisenstat, S. - Liu, J.W.H.: [17] George, A. - Heath, M. - Liu, J.W.H.:
A fan-in algorithm for distributed sparse Sparse Cholesky factorization on a local-
numerical factorization, SIAM J. Sci. memory multiprocessor, SIAM J. Sci.
Stat. Comput., 11 (1990), 593–599. Stat. Comput., 9 (1988), 327–340.
[7] Axelsson, O.: Iterative Solution Methods. [18] George, A. - Liu, J.W.H.: Computer So-
Cambridge University Press, Cambridge, lution of Large Sparse Positive Definite
1994. Systems, Prentice-Hall, Englewood Cliffs,
N.J., 1981.
[8] Barrett, R. - et al.: Templates for the so-
lution of linear systems: Building blocks
[19] Gilbert, J. - Schreiber, R.: Highly parallel
for iterative methods. SIAM, Philadelphia,
sparse Cholesky factorization, Tech . Re-
1994
port CSL-90-7, Xerox Palo Alto Research
[9] Browne, J. - Dongarra, J. - Karp, A. - Center, 1990.
Kennedy, K. - Kuck, D.: 1988 Gordon Bell
Prize, IEEE Software, 6 (1989), 78–85. [20] Greenbaum, A. - Strakoš, Z.: Matrices
that generate the same Krylov residual
[10] Chandra, R.: Conjugate gradient meth- spaces. In: Recent Advances in Iterative
ods for partial differential equations, PhD. Methods, G. Golub et.al. eds., Springer-
Thesis, Yale University, 1978. Verlag, New York, 1994.
[11] Drkošová, J - Greenbaum, A. - Rozložnı́k,
M. - Strakoš, Z.: Numerical stability of the [21] Greenbaum, A. - Pták, V. - Strakoš, Z.:
GMRES method. Submitted to BIT, 1994. Any nonincreasing convergence curve is
possible for GMRES (in preparation)
[12] Duff, I.S. - Reid, J.: The multifrontal solu-
tion of indefinite sparse symmetric linear [22] Hageman, L. - Young, D.: Applied itera-
equations, ACM Trans. Math. Software, 9 tive methods. Academic Press, New York,
(1983), 302–325. 1981
[23] Heath, M.Y. - Ng, E. - Peyton, B.W.: Par- ization, presented at the IBM Europe In-
allel algorithms for sparse linear systems, stitute, 1990.
in: Gallivan, K.A. - Heath, M.T. - Ng,
E. - Ortega, J.M. - Peyton, B.W. - Plem- [32] Ng, E.G. - Peyton, B.W.: Block sparse
mons, R.J. - Romine, C.H. - Sameh, A.H. - Cholesky algorithms on advanced unipro-
Voigt, R.G.: Parallel Algorithms for Ma- cessor computers, SIAM J. Sci. Comput.,
trix Computations, SIAM, Philadelphia, 14 (1993), 1034–1056.
1990, 83–124.
[33] Parter, S.: The use of linear graphs in
[24] Higham, N.J. - Knight, P.A.: Component- Gaussian elimination, SIAM Review, 3
wise error analysis for stationary iterative (1961), 364–369.
methods, in preparation.
[34] Pissanetzky, S.: Sparse Matrix Technol-
[25] Liu, J.W.H.: The role of elimination trees ogy, Academic Press, 1984.
in sparse factorization, SIAM J. Matrix
Anal. Appl. 11 (1990), 134-172. [35] Rothberg, E. - Gupta, A.: Efficient sparse
matrix factorization on high-performance
[26] Lewis, J. - Simon, H.: The impact of hard- workstations - exploiting the memory hi-
ware gather/scatter on sparse Gaussian erarchy, ACM Trans. Math. Software, 17
elimination, SIAM J. Sci. Stat. Comput., (1991), 313–334.
9 (1988), 304–311.
[36] Rozložnı́k, M. - Strakoš, Z.: Variants
[27] Liu, J.W.H.: A compact row storage
of the residual minimizing Krylov space
scheme for Cholesky factors using elimi-
methods. Submitted to BIT, 1994
nation trees, ACM Trans. Math. Software,
12 (1986), 127–148. [37] Stewart, G.W. - Sun, J.: Matrix perturba-
[28] Liu, J.W.H.: Computational models tion theory. Academic Press, Boston, 1990
and task scheduling for parallel sparse
[38] Trefethen, N.: The definition of numeri-
Cholesky factorization, Parallel Comput-
cal analysis. Dept. of comp. Sc., Cornell
ing, 3 (1986), 327–342.
Univ., 1991
[29] Liu, J.W.H.: Equivalent sparse matrix
reordering by elimination tree rotations, [39] Varga, R.: Matrix iterative analysis.
SIAM J. Sci. Stat. Comput., 9 (1988), Prentice-Hall Inc., NJ, 1961
424–444.
[40] Van der Vorst, H.: Lecture notes on iter-
[30] Liu, J.W.H.: Advances in direct sparse ative methods, 1992, manuscript.
methods, manuscript, 1991.
[41] Young, D.: Iterative solutions of large lin-
[31] Liu, J.W.H.: Some practical aspects of ear systems. Academic Press, New York,
elimination trees in sparse matrix factor- 1971

Current Trends in Numerical Linear Algebra

Uploaded by

Copyright:

Available Formats

Current Trends in Numerical Linear Algebra

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Current Trends in Numerical Linear Algebra

Uploaded by

Copyright:

Available Formats

SOFSEM ’94 – Milovy, The Czech Republic, 27. 11. – 9. 12.

Current Trends in Numerical Linear Algebra: From

Institute of Computer Science, Academy of Sciences of the Czech Republic

1 Motivation modelling and extensive use of the numerical

dering phase): Basic approach how to accomplish that is

Figure 1.6: Rearranging of the children sequences in the elimination tree.

Figure 1.7: Balanced and unbalanced elimination trees.

where pk ∈ Pk , Pk denotes the set of polyno- k r (k) k= min k b − Au k .

number of A defined as κ = ||A||||A−1 ||.

You might also like