Article in Parallel Computing, July 1995. DOI: 10.1016/0167-8191(95)00011-C
PARALLEL COMPUTING
ELSEVIER                              Parallel Computing 21 (1995) 1097-1110

The parallel QR factorization algorithm for tridiagonal linear systems

Pierluigi Amodio a,*, Luigi Brugnano b

a Dipartimento di Matematica, Università di Bari, via Orabona 4, 70125 Bari, Italy
b Dipartimento di Energetica, Università di Firenze, via Lombroso 6/17, 50134 Firenze, Italy

Received 11 February 1994; revised 3 December 1994
Abstract
We describe a new parallel solver in the class of partition methods for general, nonsingular tridiagonal linear systems. Starting from an already known partitioning of the coefficient matrix among the parallel processors, we define a factorization, based on the QR factorization, which depends on the conditioning of the sub-blocks in each processor. Moreover, the reduced system, whose solution is the only scalar section of the algorithm, has a dimension which depends both on the conditioning of these sub-blocks and on the number of processors. We analyze the stability properties of the obtained parallel factorization, and report some numerical tests carried out on a net of transputers.

Keywords: Tridiagonal linear systems; Parallel solvers; QR factorization; Transputer array
1. Introduction
Much interest has been devoted in recent years to the problem of solving tridiagonal linear systems on parallel computers, and many parallel tridiagonal solvers have been devised. These solvers can be grouped into two main categories: direct solvers and iterative solvers. Among the direct methods we can mention the partition methods, which include very efficient parallel algorithms [1,2,4,5,10-15]. Among the iterative methods we mention the Jacobi method and the AGE method [9].
Concerning partition methods, a unifying approach for their derivation and study is given in [1,2], where it is also shown that these methods are stable when used to solve diagonally dominant linear systems. Diagonal dominance also implies the convergence of the above mentioned iterative solvers.

* Corresponding author. Email: 00110570@vm.csata.it

0167-8191/95/$09.50 © 1995 Elsevier Science B.V. All rights reserved
SSDI 0167-8191(95)00011-9
Nevertheless, to our knowledge, no parallel algorithms exist for solving general
nonsingular tridiagonal linear systems. In this paper, we develop a new parallel
solver for this purpose. The derivation of the method is outlined in Sections 2 and
3, while its stability properties are discussed in Section 4. In Section 5 the parallel
algorithm is examined and a simple model for the expected speedup is reported
together with some numerical tests carried out on a linear array of transputers.
2. Parallel factorizations

The derivation of the new method extends the concept of 'parallel factorization' given in [1]. Let us consider the tridiagonal matrix
$$
A = \begin{pmatrix}
a_1 & c_1 & & \\
b_1 & a_2 & \ddots & \\
& \ddots & \ddots & c_{n-1} \\
& & b_{n-1} & a_n
\end{pmatrix}, \qquad (1)
$$
where for simplicity we assume that n = kp - 1 for some integer k, p being the number of parallel processors we want to use. To obtain a parallel solver
tailored for this number of processors, we consider the matrix A partitioned as
follows:
$$
A = \begin{pmatrix}
A^{(1)} & c_{k-1}^{(1)} e_{k-1} & & & \\
b_{k-1}^{(1)} e_{k-1}^T & a_k^{(1)} & c_k^{(1)} e_1^T & & \\
& b_k^{(1)} e_1 & A^{(2)} & c_{k-1}^{(2)} e_{k-1} & \\
& & b_{k-1}^{(2)} e_{k-1}^T & a_k^{(2)} & \ddots \\
& & & \ddots & \ddots \\
& & & b_k^{(p-1)} e_1 & A^{(p)}
\end{pmatrix}, \qquad (2)
$$
where $e_1$ and $e_{k-1}$ are, respectively, the first and the last unit vectors of $\mathbb{R}^{k-1}$, the blocks $A^{(i)}$ are tridiagonal of size $k-1$, and $x_l^{(i)} = x_{(i-1)k+l}$ for $x = a, b, c$. Observe that on a distributed memory parallel computer, the upper index identifies the processor where the given quantity is to be stored. Let us then consider the following factorization of the matrix (1):

$$ A = F T G, \qquad (3) $$
where, denoting by $I_{k-1}$ the identity matrix of order $k-1$,

$$
F = \begin{pmatrix}
N^{(1)} & 0 & & & & \\
v^{(1)T} & 1 & w^{(2)T} & & & \\
& 0 & N^{(2)} & 0 & & \\
& & v^{(2)T} & 1 & w^{(3)T} & \\
& & & 0 & N^{(3)} & \ddots \\
& & & & \ddots & \ddots \\
& & & & & 0 & N^{(p)}
\end{pmatrix},
$$

$$
T = \begin{pmatrix}
I_{k-1} & 0 & & & & \\
0^T & \alpha^{(1)} & 0^T & \beta^{(1)} & & \\
& 0 & I_{k-1} & 0 & & \\
& \gamma^{(2)} & 0^T & \alpha^{(2)} & \ddots & \\
& & & \ddots & \ddots & 0 \\
& & & & 0^T & I_{k-1}
\end{pmatrix},
$$

$$
G = \begin{pmatrix}
S^{(1)} & y^{(1)} & & & & \\
0^T & 1 & 0^T & & & \\
& z^{(2)} & S^{(2)} & y^{(2)} & & \\
& & 0^T & 1 & \ddots & \\
& & & \ddots & \ddots & \\
& & & & z^{(p)} & S^{(p)}
\end{pmatrix},
$$
$N^{(i)} S^{(i)}$ is a suitable factorization of the block $A^{(i)}$, and the other unknowns of $F$, $G$ and $T$ may be easily obtained from (3) by direct identification. By considering different factorizations for $A^{(i)}$, it is then possible to derive most of the parallel methods in the class of the partition methods. For example, in [1,2], supposing $A^{(i)}$ diagonally dominant, the LU factorization, the LUD factorization, and the cyclic reduction were considered.
In the solution of the linear system

$$ A x = f, \qquad (4) $$

the two block tridiagonal systems with the matrices F and G are solved without any synchronization or data communication among the processors. Therefore, this section of the algorithm is completely parallel. The only sequential part of the algorithm consists in the solution of the tridiagonal linear system with the 'reduced matrix':
$$
T_p = \begin{pmatrix}
\alpha^{(1)} & \beta^{(1)} & & \\
\gamma^{(2)} & \alpha^{(2)} & \ddots & \\
& \ddots & \ddots & \beta^{(p-2)} \\
& & \gamma^{(p-1)} & \alpha^{(p-1)}
\end{pmatrix}. \qquad (5)
$$
a(P-l)
It must be noted that the size of Tp is independent of n, but depends only on the
number p of parallel processors.
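This can be illustrated numerically. The following sketch (our illustration, using dense NumPy linear algebra rather than the factorization (3); all names are ours) partitions a tridiagonal matrix as in (2) and checks that the Schur complement on the separator unknowns — which plays the role of the reduced matrix (5) — has size p - 1 and is itself tridiagonal:

```python
import numpy as np

# Illustration (dense, serial): for a tridiagonal A of size n = k*p - 1
# partitioned as in (2), eliminating the interior blocks A^(i) leaves a
# Schur complement on the p - 1 separator unknowns x_k, x_{2k}, ...,
# and this reduced matrix is again tridiagonal.
rng = np.random.default_rng(0)
k, p = 6, 4
n = k * p - 1
A = (np.diag(rng.standard_normal(n))
     + np.diag(rng.standard_normal(n - 1), 1)
     + np.diag(rng.standard_normal(n - 1), -1))

sep = np.arange(k - 1, n, k)            # separator indices (0-based)
blk = np.setdiff1d(np.arange(n), sep)   # interior block indices

A1 = A[np.ix_(blk, blk)]                # block diagonal of the A^(i)
B, C = A[np.ix_(sep, blk)], A[np.ix_(blk, sep)]
Tp = A[np.ix_(sep, sep)] - B @ np.linalg.solve(A1, C)

assert Tp.shape == (p - 1, p - 1)
# entries outside the three central diagonals vanish
assert np.allclose(Tp - np.tril(np.triu(Tp, -1), 1), 0.0)
```

In the actual algorithm this reduced matrix is of course assembled from the per-processor quantities, never by forming the inverse of the interior blocks explicitly.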
In [1,2] it is suggested to study the factorization (3) locally, by introducing the (n + p - 1) x n matrix $R_p$, recursively defined as follows:

$$
R_1 = I_{k-1}, \qquad
R_i = \begin{pmatrix}
R_{i-1} & & \\
& 1 & \\
& 1 & \\
& & I_{k-1}
\end{pmatrix}, \quad i = 2, \ldots, p,
$$

where the two rows containing a 1 both correspond to the same separator unknown, which is thus duplicated.
From $A = R_p^T M R_p$, we obtain the matrix $M$, which is $p \times p$ block diagonal with tridiagonal main-diagonal blocks $M^{(i)}$ defined (with some slight differences for the first and the last one (see (2))) as follows:

$$
M^{(i)} = \begin{pmatrix}
\tilde a_k^{(i-1)} & c_k^{(i-1)} e_1^T & 0 \\
b_k^{(i-1)} e_1 & A^{(i)} & c_{k-1}^{(i)} e_{k-1} \\
0 & b_{k-1}^{(i)} e_{k-1}^T & \bar a_k^{(i)}
\end{pmatrix}, \qquad (6)
$$

where the factorization of each $M^{(i)}$ provides the matrices $N^{(i)}$ and $S^{(i)}$, the vectors $v^{(i)}$, $w^{(i)}$, $y^{(i)}$, $z^{(i)}$ and the scalars $\beta^{(i)}$, $\gamma^{(i)}$ of (3), and the corner scalars satisfy $\bar a_k^{(i)} + \tilde a_k^{(i)} = a_k^{(i)}$, for $i = 1, \ldots, p-1$ (each entry $a_k^{(i)}$ being shared between $M^{(i)}$ and $M^{(i+1)}$).
3. The parallel QR factorization
The approach in Section 2 is very general and useful, and very efficient tridiagonal solvers have been obtained from it [1]. Nevertheless, it is evident that all the blocks $A^{(i)}$ need to be nonsingular for (3) to be defined. This is the case, for example, when the matrix A is diagonally dominant; in this case, the reduced matrix also turns out to be diagonally dominant [1,2].
For A generally nonsingular, it may happen that some of the blocks $A^{(i)}$ are singular or very ill-conditioned. It follows that the parallel solver may be undefined or unstable even if stable scalar solvers exist (for example, the LU factorization method with partial pivoting, or the QR factorization method). For this reason, in order to obtain a stable parallel solver, we consider a generalization of the above approach which consists in a finer structuring of the factorization itself.
Suppose that the $(k-1) \times (k-1)$ block $A^{(i)}$ is singular. Then we proceed as follows: let $A_1^{(i)}$ be the largest nonsingular principal submatrix of $A^{(i)}$, and let $j-1$ be its size. We define the following factorization of the block $M^{(i)}$:

$$ M^{(i)} = L_1^{(i)} D_1^{(i)} U_1^{(i)}, \qquad (7) $$

where $L_1^{(i)}$ and $U_1^{(i)}$ coincide with the identity matrix except for the rows and columns associated with $A_1^{(i)}$ and its two bordering scalar entries, which contain the factors of $A_1^{(i)}$ and the corresponding fill-in vectors, and $D_1^{(i)}$ coincides with the identity matrix except for the trailing tridiagonal block $A_2^{(i)}$ of size $k-j-1$. This is equivalent to introducing the variable $x_j^{(i)} = x_{(i-1)k+j}$ in the reduced system. The same procedure is then recursively applied to the tridiagonal block $A_2^{(i)}$ of $D_1^{(i)}$, and so on.
Therefore it follows that the factorization is always defined, possibly at the price of increasing the size of the reduced system, that is, of the scalar section of the corresponding parallel solver. We observe that some of the blocks $A_l^{(i)}$ may have zero size; this means that two consecutive components of the solution vector belong to the reduced system. In any case, if the matrix A is nonsingular, then the block $A^{(i)}$, which has size $k-1$, has rank at least equal to $k-3$. It follows that in exact arithmetic the size of the reduced system is still independent of the size n of the linear system.
Since the blocks $A_1^{(i)}$ are only required to be nonsingular, it is convenient to factorize them by using a suitable stable factorization. Two obvious candidates for this purpose are the LU factorization with partial pivoting and the QR factorization. We shall analyze in more detail the parallel solver obtained by using the second factorization, which we shall call the parallel QR factorization method; a similar approach can be repeated for the former. For simplicity, let us suppose that the generic block $A^{(i)}$ is nonsingular, since the generalization is straightforward. Then, we factorize $M^{(i)}$ as:

$$
M^{(i)} =
\begin{pmatrix}
1 & w^{(i)T} & 0 \\
0 & Q^{(i)} & 0 \\
0 & v^{(i)T} & 1
\end{pmatrix}
\begin{pmatrix}
\tilde a_k^{(i-1)} - w^{(i)T} z^{(i)} & 0^T & \beta^{(i-1)} \\
0 & I_{k-1} & 0 \\
\gamma^{(i)} & 0^T & \bar a_k^{(i)} - v^{(i)T} y^{(i)}
\end{pmatrix}
\begin{pmatrix}
1 & 0^T & 0 \\
z^{(i)} & R^{(i)} & y^{(i)} \\
0 & 0^T & 1
\end{pmatrix},
\qquad (8)
$$
where $Q^{(i)} R^{(i)}$ is the QR factorization of the matrix $A^{(i)}$, and

$$
\begin{aligned}
R^{(i)T} w^{(i)} &= c_k^{(i-1)} e_1, & Q^{(i)} z^{(i)} &= b_k^{(i-1)} e_1, \\
R^{(i)T} v^{(i)} &= b_{k-1}^{(i)} e_{k-1}, & Q^{(i)} y^{(i)} &= c_{k-1}^{(i)} e_{k-1}, \\
\gamma^{(i)} &= -v^{(i)T} z^{(i)}, & \beta^{(i)} &= -w^{(i+1)T} y^{(i+1)}, \\
\alpha^{(i)} &= a_k^{(i)} - v^{(i)T} y^{(i)} - w^{(i+1)T} z^{(i+1)}. & &
\end{aligned}
\qquad (9)
$$
We observe that, since $Q^{(i)}$ is upper Hessenberg and $R^{(i)}$ is upper triangular, only the vectors $w^{(i)}$ and $z^{(i)}$ are full vectors (fill-in vectors), while only the last entry of $v^{(i)}$ and the last two entries of $y^{(i)}$ may be nonzero.
To increase the stability of the factorization, we may also choose $A_1^{(i)}$ as the largest principal submatrix of $A^{(i)}$ which has a condition number smaller than a given tolerance. This choice is possible since we can monitor a very effective approximation of the condition number of the current block. In fact, the matrices $Q^{(i)}$ and $R^{(i)}$, as well as the vector $w^{(i)}$, can be formed step by step. In particular, at the generic step $j$ we have calculated $Q_j^{(i)}$ and $R_j^{(i)}$, which are the principal submatrices of size $j$ of $Q^{(i)}$ and $R^{(i)}$, and the subvector $w_j^{(i)}$, which contains the first $j$ components of the vector $w^{(i)}$. Since $Q_j^{(i)}$ is orthogonal, the condition number of $A_j^{(i)}$ is the same as the condition number of $R_j^{(i)}$. From (9) it follows that $\eta_j^{(i)} = (c_k^{(i-1)})^{-1} \|w_j^{(i)}\|_1$ is the 1-norm of the first row of $(R_j^{(i)})^{-1}$, and hence an estimate of $\|(R_j^{(i)})^{-1}\|$. Therefore, we may obtain an estimate of the condition number of $A_j^{(i)}$ at each step. As soon as this estimate is larger than a given tolerance, we increase the size of the reduced system, as previously seen, by introducing in it the variable $x_j^{(i)}$.
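As an illustration of this monitor (ours, in dense Python, not the authors' incremental Givens implementation; all names are assumptions), the 1-norm of the first row of $R_j^{-1}$ can be compared with $\|R_j^{-1}\|_\infty$ on the leading blocks of a computed R factor:

```python
import numpy as np

# Dense check of the condition monitor: after a QR factorization A = Q R,
# the 1-norm of the first row of R_j^{-1} (R_j = leading j x j block of R)
# is a cheap lower bound for ||R_j^{-1}||_inf, hence a growth indicator
# for the conditioning of the leading blocks A_j.
def condition_monitor(A):
    _, R = np.linalg.qr(A)
    estimates = []
    for j in range(1, A.shape[0] + 1):
        Rj_inv = np.linalg.inv(R[:j, :j])
        eta_j = np.abs(Rj_inv[0, :]).sum()            # first-row 1-norm
        full = np.abs(Rj_inv).sum(axis=1).max()       # ||R_j^{-1}||_inf
        estimates.append((eta_j, full))
    return estimates

# A tridiagonal test block with zero-free off-diagonals
# (not diagonally dominant).
k = 8
A = np.diag(np.ones(k - 1), 1) - np.diag(np.ones(k - 1), -1) + 0.1 * np.eye(k)
for eta, full in condition_monitor(A):
    assert eta <= full + 1e-12   # the monitor never overestimates
```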
3.1. The parallel LU factorization with partial pivoting

To complete this discussion, we now briefly sketch the parallel LU factorization with partial pivoting, which is obtained by considering the following factorization of $M^{(i)}$ (again, we shall suppose that $A^{(i)}$ is nonsingular, for simplicity):

$$
M^{(i)} =
\begin{pmatrix}
1 & w^{(i)T} & 0 \\
0 & P^{(i)T} L^{(i)} & 0 \\
0 & v^{(i)T} & 1
\end{pmatrix}
\begin{pmatrix}
\tilde a_k^{(i-1)} - w^{(i)T} z^{(i)} & 0^T & \beta^{(i-1)} \\
0 & I_{k-1} & 0 \\
\gamma^{(i)} & 0^T & \bar a_k^{(i)} - v^{(i)T} y^{(i)}
\end{pmatrix}
\begin{pmatrix}
1 & 0^T & 0 \\
z^{(i)} & U^{(i)} & y^{(i)} \\
0 & 0^T & 1
\end{pmatrix},
\qquad (10)
$$

where we factorize $A^{(i)}$ as $P^{(i)} A^{(i)} = L^{(i)} U^{(i)}$. In this case, one has

$$ \kappa(A^{(i)}) \le \kappa(L^{(i)})\, \kappa(U^{(i)}). $$

To estimate the quantities on the right hand side one proceeds as for the parallel QR factorization, while for $\|(L^{(i)})^{-1}\|$ one of the following estimates can be used ($k-1$ is the size of $L^{(i)}$):

$$ \|(L^{(i)})^{-1}\|_\infty \le 2^{k-1}, \qquad \|(L^{(i)})^{-1}\|_1 \approx (b_k^{(i-1)})^{-1} \|z^{(i)}\|_1. $$

The actual implementation of this algorithm will be examined in more detail elsewhere.
4. Stability analysis

Let us now examine the stability of the parallel QR factorization. By fixing an upper bound for the conditioning of the sub-blocks $A_l^{(i)}$, the conditioning of the whole system will depend on the reduced system. We shall therefore discuss how perturbations on the matrix A affect the reduced matrix.

For this purpose, let us suppose we have fixed a suitable tolerance $\sigma$ for the condition number $\kappa(A_l^{(i)})$ of $A_l^{(i)}$. For a given nonsingular tridiagonal matrix A, a corresponding partition is then obtained. This partition can be represented in compact form by introducing the following block odd-even permutation matrix of order n:
$$
P = \begin{pmatrix}
I & 0 & 0 & \cdots & 0 & 0 & \cdots \\
0 & 0 & I & \cdots & 0 & 0 & \cdots \\
\vdots & & & \ddots & & & \\
0 & 1 & 0 & \cdots & 0 & 0 & \cdots \\
0 & 0 & 0 & \cdots & 1 & 0 & \cdots \\
\vdots & & & & & \ddots &
\end{pmatrix},
$$

which first takes the rows corresponding to the blocks $A_l^{(i)}$, and then those associated with the reduced system. Here we have denoted by $I$ a generic identity matrix, whose size equals the size of the corresponding block $A_l^{(i)}$ in the partition. By means of P we obtain

$$
P A P^T = \begin{pmatrix} A_1 & C \\ B & A_2 \end{pmatrix}, \qquad (11)
$$

where $A_1$ and $A_2$ are block diagonal, $A_1$ having $A_l^{(i)}$ as a generic diagonal block and $A_2$ of size at least $p-1$, while B and C are block bidiagonal. The parallel QR factorization is equivalent to factoring the matrix (11) as

$$
P A P^T = \begin{pmatrix} Q_1 & 0 \\ B R_1^{-1} & I \end{pmatrix}
\begin{pmatrix} R_1 & Q_1^T C \\ 0 & T_p \end{pmatrix}, \qquad (12)
$$
where $Q_1 R_1$ is the QR factorization of the matrix $A_1$, and

$$ T_p = A_2 - B A_1^{-1} C \qquad (13) $$

is the reduced matrix. Now we examine the propagation of small perturbations on the matrix A in the factorization (12). In the following, for every given matrix X the symbol $\delta X$ will denote its corresponding perturbation (in particular, $\delta X^{-1}$ and $\delta X^T$ will represent perturbations on $X^{-1}$ and $X^T$, respectively). Then, one has:

$$
P(A + \delta A)P^T =
\begin{pmatrix} Q_1 + \delta Q_1 & 0 \\ (B + \delta B)(R_1^{-1} + \delta R_1^{-1}) & I \end{pmatrix}
\begin{pmatrix} R_1 + \delta R_1 & (Q_1 + \delta Q_1)^T (C + \delta C) \\ 0 & T_p + \delta T_p \end{pmatrix}.
$$

It is obvious that $\|\delta A_1\|, \|\delta A_2\|, \|\delta B\|, \|\delta C\| \le \|\delta A\|$. Moreover, since both $Q_1$ and $Q_1 + \delta Q_1$ are orthogonal, it follows that $\|\delta Q_1\|$ is also bounded. Then, in the above expression we only need to provide explicit bounds for $\|\delta R_1\|$, $\|\delta R_1^{-1}\|$ and $\|\delta T_p\|$.
By using the 2-norm and a first order analysis, from the relations

$$ Q_1 R_1 = A_1, \qquad (Q_1 + \delta Q_1)(R_1 + \delta R_1) = A_1 + \delta A_1, $$
$$ (R_1^{-1} + \delta R_1^{-1})(Q_1^T + \delta Q_1^T) = A_1^{-1} + \delta A_1^{-1}, $$

one easily derives that

$$ \frac{\|\delta R_1\|}{\|R_1\|} \le \frac{\|\delta A_1\|}{\|A_1\|} + \frac{\|\delta Q_1\|}{\|Q_1\|}, $$

and

$$ \frac{\|\delta R_1^{-1}\|}{\|R_1^{-1}\|} \le \frac{\|\delta A_1^{-1}\|}{\|A_1^{-1}\|} + \frac{\|\delta Q_1\|}{\|Q_1\|}. $$

Similarly, from the perturbation of (13), it can be derived that

$$ \|\delta T_p\| \le \|\delta A_2\| + \|B\|\, \|A_1^{-1}\|\, \|C\| \left( \frac{\|\delta A_1^{-1}\|}{\|A_1^{-1}\|} + \frac{\|\delta B\|}{\|B\|} + \frac{\|\delta C\|}{\|C\|} \right). $$

Here we have supposed that the relative perturbations on the matrix A are smaller than $\varepsilon$. In the above expressions, the two quantities $\kappa(A_1)$ and $\|\delta A_1^{-1}\|\, \|A_1^{-1}\|^{-1}$ can be noted. Since we use a tolerance $\sigma$ for $\kappa(A_l^{(i)})$, we have:

$$ \kappa(A_1) = \max_i \{ \kappa(A_l^{(i)}) \} < \sigma. $$
Concerning $\|\delta A_1^{-1}\|\, \|A_1^{-1}\|^{-1}$, if we suppose that $\|A_1^{-1}\|\, \|\delta A_1\| < 1$, then from the relation

$$ (A_1 + \delta A_1)^{-1} = A_1^{-1} + \delta A_1^{-1}, $$

and by using standard results of linear algebra, one obtains:

$$ \frac{\|\delta A_1^{-1}\|}{\|A_1^{-1}\|} \le \frac{\kappa(A_1)\, \frac{\|\delta A_1\|}{\|A_1\|}}{1 - \kappa(A_1)\, \frac{\|\delta A_1\|}{\|A_1\|}} \approx \kappa(A_1)\, \frac{\|\delta A_1\|}{\|A_1\|}. $$

Therefore, this term is also proportional to $\kappa(A_1)$. We conclude that, by fixing a suitable upper bound $\sigma$ for $\kappa(A_1)$ (which may depend on the size n of the problem, on the number p of processors used, and on the machine precision u), the factorization (12) is quite stable, at least when $\kappa(A)$ itself is not exaggeratedly large. Indeed, the algorithm is essentially devoted to matrices which are not diagonally dominant, but nevertheless well-conditioned or weakly well-conditioned (see [6]). In fact, the case where the condition number of the matrix A grows exponentially with its size is of no practical interest, since the problem becomes too ill-conditioned for relatively small values of n.
5. The parallel QR factorization algorithm

Let us now examine a pseudo-code which implements the parallel QR factorization algorithm; this will allow us to easily derive a simplified model for the expected speedup with respect to the scalar QR factorization algorithm. The QR factorization is obtained by using Givens' rotation matrices. The data distribution is carried out according to (2), where the upper index denotes the processor number. Moreover, in order to obtain a simpler code, we add two null equations at the beginning and at the end of the linear system, which we call rows 0 and n + 1, respectively. The pseudo-code, written in a Matlab-like language, follows.
procedure pqr(k, a, b, c, x, tol, proc, nproc)
%
% k        = size of the local block assigned to the processor;
% a,b,c,x  = coefficients of the local problem and the rhs;
%            in output x contains the computed solution;
% tol      = tolerance on the norm of the inverse of the blocks;
% proc     = index of the current processor;
% nproc    = number of parallel processors

  i = 0
  nr = 1
  break(1) = 0
  while (i < k-1)
    call qr1(k-i, a(i), b(i), c(i), x(i), z(i), tol, j)
    i = i+j
    nr = nr+1
    break(nr) = i
  end
  if (i == k-1)
    nr = nr+1
    break(nr) = k
  end

% the indexes of the variables in the reduced system are in the
% vector break(2:nr) on processor 1, in the vector break(1:nr-1)
% on processor nproc, and in the vector break(1:nr) if
% 1 < proc < nproc. The coefficients of the reduced system are
% in the vectors a, b, c, and the rhs in x.

% reduced system (solution of the reduced system; the processors
% recover from it the components of x with indexes in break)

% update of the rhs and back-substitution
  for i = 1:nr-1
    j0 = break(i)
    j = break(i+1)-1
    if (j > j0)
      x(j) = (x(j) - z(j)*x(j0) - c(j)*x(j+1))/a(j)
      for j = break(i+1)-2:-1:break(i)+1
        x(j) = (x(j) - z(j)*x(j0) - c(j)*x(j+1) - b(j)*x(j+2))/a(j)
      end
    end
  end
end procedure

procedure qr1(k, a, b, c, x, z, tol, j)
%
% Performs the QR factorization of the block defined by the
% coefficients a, b, c, updating the corresponding rhs x.
% The vector b is overwritten with the second upper diagonal
% of the matrix R; z is the fill-in vector.
% The informations concerning the reduced system are stored
% as follows:
%   a(0) = alpha, c(0) = gamma, b(0) = beta,
% while the last two components of the vector y are stored
% in b(j-2) and c(j-1), respectively.
%
% toosmall = <tolerance for the zero entries>

  w = -1     % the vector v is not actually formed; the needed
  w1 = 0     % informations are stored in w, w1, w2
  z(1) = b(0)
  b(0) = 0
  norm = 0
  toll = tol*c(0)
  for i = 1:k-2
%   computes the coefficients of the Givens matrix
%   for the vector (a(i), b(i))^T
    call givens(a(i), b(i), r, s)
    tmp = r*a(i+1) - s*c(i)
    if (abs(tmp) < toosmall & (abs(b(i+1)) < toosmall | i == k-2) ...
        | norm > toll)
      break
    end
    a(i) = r*a(i) + s*b(i)
    c(i) = r*c(i) + s*a(i+1)
    a(i+1) = tmp
    b(i) = s*c(i+1)
    c(i+1) = r*c(i+1)
    z(i+1) = -s*z(i)
    z(i) = r*z(i)
    tmp = r*x(i+1) - s*x(i)
    x(i) = r*x(i) + s*x(i+1)
    x(i+1) = tmp
    w2 = w1
    w1 = w
    w = -(w2*b(i-2) + w1*c(i-1))/a(i)
    norm = norm + abs(w)
    a(0) = a(0) - w*z(i)
    x(0) = x(0) - w*x(i)
  end
  j = i+1
  w2 = w1
  w1 = w
  w = -(w2*b(i-2) + w1*c(i-1))/a(i)
  a(0) = a(0) - w*z(i)
  x(0) = x(0) - w*x(i)
  c(0) = -(w1*b(i-1) + w*c(i))
  tmp = -b(i)/a(i)
  b(0) = tmp*z(i)
  a(j) = a(j) + tmp*c(i)
  x(j) = x(j) + tmp*x(i)
end procedure
As one can see, the required storage on each processor essentially amounts to five vectors: the coefficients (a, b and c), the right hand side x, and the fill-in vector z.
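The procedure givens called in the listing is not itself shown. A minimal Python sketch of such a routine (our reconstruction of a standard Givens rotation, not the authors' code) is:

```python
import math

def givens(a, b):
    # Returns (r, s) with r^2 + s^2 = 1 such that the rotation
    # [[r, s], [-s, r]] maps (a, b)^T to (h, 0)^T, h = sqrt(a^2 + b^2),
    # as in the pseudo-code's call givens(a(i), b(i), r, s).
    if b == 0.0:
        return 1.0, 0.0
    h = math.hypot(a, b)     # the single square root of the flop count
    return a / h, b / h

r, s = givens(3.0, 4.0)
assert abs(r * 3.0 + s * 4.0 - 5.0) < 1e-12  # first component -> norm
assert abs(-s * 3.0 + r * 4.0) < 1e-12       # second component annihilated
```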
Concerning the operation count, it is well known that a call to the procedure givens has a cost of 5 flops plus 1 square root (we count as 1 flop one of the four elementary dyadic operations). From the above pseudo-code, it is simple to derive that, apart from the solution of the reduced system, the parallel solver requires about 40n flops plus n square roots, which can be performed in parallel on p processors. The cost of the solution of the reduced system cannot be exactly predicted, since it depends on the size of the system itself. However, since we observed that the blocks $A^{(i)}$ may have a null space of dimension at most two if the matrix A is nonsingular, one may expect that the size of the reduced system grows linearly with the number p of processors, thus requiring O(p) flops for its solution. This is true at least when $\kappa(A)$ itself is not prohibitively large. In fact, if for example $\kappa(A)$ is O(n), then the condition number of the largest principal nonsingular matrix of each block $A^{(i)}$ should behave approximately as O(n/p), and the size of the reduced system can be expected to be at most 3p.
As a further remark, in [3] an approach is suggested for solving the reduced system that, for the matrix (5), requires only O(log2 p) scalar operations and log2 p - 1 data transmissions. By applying the same approach (but using the QR factorization) to the reduced system deriving from the parallel QR factorization described in Section 3, the number of transmissions is unchanged, even if the total amount of data transmitted may be larger. Therefore, if we suppose n >> p, then the actual scalar cost of this part of the algorithm is negligible. This approach has been used in the parallel code for the numerical tests.
The scalar Givens method requires about 27n flops plus n square roots. It follows that the expected asymptotic speedup on p processors is given by

$$ \hat s_p = \lim_{n \to \infty} s_p(n) = \lim_{n \to \infty} \frac{(27 + \eta)\, n}{(40 + \eta)\frac{n}{p} + O(p)} = \frac{27 + \eta}{40 + \eta}\, p, \qquad (14) $$

where $\eta$ is the ratio between the time needed to execute 1 square root and the time needed to execute 1 flop, and O(p) accounts for the scalar section of the code. Finally, it must be said that (14) is only an upper bound for the effective asymptotic speedup, since the operations in the scalar Givens algorithm are better chained and, hence, better optimized by any 'smart' compiler.
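For concreteness, (14) can be evaluated directly (a small check of ours, not part of the paper):

```python
# Expected asymptotic speedup of Eq. (14): s_p = (27 + eta)/(40 + eta) * p.
# With eta = 3 (the transputer value used in Section 6) this is 30p/43.
def expected_speedup(p, eta=3.0):
    return (27.0 + eta) / (40.0 + eta) * p

for p in (4, 8, 16):
    print(p, round(expected_speedup(p), 2))  # 2.79, 5.58, 11.16
```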
6. Numerical tests

Let us now report the measured speedup for the parallel QR factorization algorithm, obtained on a linear array of transputers with distributed memory. For this computer, the value of $\eta$ in (14) is about 3 and hence, for n large with respect to p, the expected speedup is $\hat s_p = 30p/43$. The programming language used is Fortran with the Express [16] parallel libraries. The numerical tests concern problem (15), which derives from the use of the mid-point method as a boundary value method for ODEs [7,8]. The coefficient matrix in (15) has condition number (in 1-norm or $\infty$-norm) 2n, and hence it is quite well-conditioned when $nu \ll 1$, where u is the machine precision. Nevertheless, by using the usual partition methods, the factorization (3) may be not defined, since some of the blocks $A^{(i)}$, i = 1, ..., p-1, may be singular. Conversely, by using the parallel QR factorization algorithm described in Section 3, we obtain a stable solver with a reduced system of size at most 2p - 1. The measured speedups on 4, 8 and 16 processors versus the size n of the problem are shown in Fig. 1. Observe that the obtained results clearly show the linear dependence on p of the measured asymptotic speedup, as predicted by (14).

Fig. 1. Measured speedups.
Acknowledgements

The authors thank Ms. Paulene Butts for her help in the preparation of the manuscript.
References

[1] P. Amodio and L. Brugnano, Parallel factorizations and parallel solvers for tridiagonal linear systems, Linear Algebra Appl. 172 (1992) 347-364.
[2] P. Amodio, L. Brugnano and T. Politi, Parallel factorizations for tridiagonal matrices, SIAM J. Numer. Anal. 30 (1993) 813-823.
[3] P. Amodio and N. Mastronardi, A parallel version of the cyclic reduction algorithm on a hypercube, Parallel Comput. 19 (1993) 1273-1281.
[4] S. Bondeli, Divide and conquer: a new parallel algorithm for the solution of a tridiagonal linear system of equations, Tech. Report 130, ETH Zürich, Departement Informatik, Institut für Wissenschaftliches Rechnen, Zürich, Switzerland, 1990.
[5] L. Brugnano, A parallel solver for tridiagonal linear systems for distributed memory parallel computers, Parallel Comput. 17 (1991) 1017-1023.
[6] L. Brugnano and D. Trigiante, Tridiagonal matrices: invertibility and conditioning, Linear Algebra Appl. 166 (1992) 131-150.
[7] L. Brugnano and D. Trigiante, Stability properties of some BVM methods, Appl. Numer. Math. 13 (1993) 291-304.
[8] L. Brugnano and D. Trigiante, A parallel preconditioning technique for BVM methods, Appl. Numer. Math. 13 (1993) 271-290.
[9] D.J. Evans and W.S. Yousif, The solution of two-point boundary value problems by the alternating group explicit (AGE) method, SIAM J. Sci. Statist. Comput. 9 (1988) 474-484.
[10] I.N. Hajj and S. Skelboe, A multilevel parallel solver for block tridiagonal and banded linear systems, Parallel Comput. 15 (1990) 21-45.
[11] S.L. Johnsson, Solving tridiagonal systems on ensemble architectures, SIAM J. Sci. Statist. Comput. 8 (1987) 354-392.
[12] A. Krechel, H.-J. Plum and K. Stüben, Parallelization and vectorization aspects of the solution of tridiagonal linear systems, Parallel Comput. 14 (1990) 31-49.
[13] V. Mehrmann, Divide and conquer methods for block tridiagonal systems, Parallel Comput. 19 (1993) 257-279.
[14] J.M. Ortega, Introduction to Parallel and Vector Solution of Linear Systems (Plenum Press, New York, 1988).
[15] H.H. Wang, A parallel method for tridiagonal linear systems, ACM Trans. Math. Software 7 (1981) 170-183.
[16] Express Fortran User's Guide, ParaSoft Corporation, Pasadena, CA, 1990.