The parallel QR factorization algorithm for tridiagonal linear systems

Pierluigi Amodio

The parallel QR factorization algorithm for tridiagonal linear systems

Parallel computing, 1995

ELSEVIER Parallel Computing 210995) 1097-1110 PARALLEL COMPUTING The parallel QR factorization algorithm for tridiagonal linear systems Pierluigi Amodio a, * , Luigi Brugnano b a Dipartimento di Matematica, Universitri di Bari, via Orabona 4, 70125 Bari, Italy b Dipartimento di Energetica, Universitcidi Firenze, fia Lombroso 6/ I7, 50134 Firenze, Italy Received 11 February 1994; revised 3 December 1994 zyxwvutsrqponmlkjihgfedcbaZY Abstract We describe a new parallel solver in the class of partition methods for general, nonsingular tridiagonal linear systems. Starting from an already known partitioning of the coefficient matrix among the parallel processors, we define a factorization, based on the QR factorization, which depends on the conditioning of the sub-blocks in each processor. Moreover, also the reduced system, whose solution is the only scalar section of the algorithm, has a dimension which depends both on the conditioning of these sub-blocks, and on the number of processors. We analyze the stability properties of the obtained parallel factorization, and report some numerical tests carried out on a net of transputers. Keywords: Tridiagonal linear systems; Parallel solvers; QR factorization; Transputer array 1. Introduction Much interest has been devoted to the problem of solving tridiagonal linear systems on parallel computers in recent years, and many parallel tridiagonal solvers have been devised. These solvers can be grouped into two main categories: direct solvers and iterative solvers. Among the direct methods, we can mention the partition methods, which include very efficient parallel algorithms [1,2,4,5,10-151. Among the iterative methods we mention the Jacobi method and the AGE method zyxwvutsrq [91. Concerning partition methods, a unifying approach for their derivation and study is given in [1,2], where it is also shown that these methods are stable when * Corresponding author. Email: 00110570@vm.csata.it 0167-8191/95/$09.50 0 1995 Elsevier Science B.V. All rights reserved SSDZ 0167-8191(95)00011-9

See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/220275814 The Parallel QR Factorization Algorithm for Tridiagonal Linear Systems. Article in Parallel Computing · July 1995 DOI: 10.1016/0167-8191(95)00011-C · Source: DBLP CITATIONS READS 10 22 2 authors: Pierluigi Amodio Luigi Brugnano 76 PUBLICATIONS 658 CITATIONS 118 PUBLICATIONS 1,399 Università degli Studi di Bari … SEE PROFILE University of Florence CITATIONS SEE PROFILE All content following this page was uploaded by Luigi Brugnano on 29 December 2014. The user has requested enhancement of the downloaded file. PARALLEL COMPUTING Parallel Computing 210995) 1097-1110 ELSEVIER The parallel QR factorization algorithm for tridiagonal linear systems Pierluigi Amodio a, * , Luigi Brugnano b aDipartimento di Matematica, Universitri di Bari, via Orabona 4, 70125 Bari, Italy b Dipartimento di Energetica, Universitcidi Firenze, fia Lombroso 6/ I7, 50134 Firenze, Italy Received 11 February 1994; revised 3 December 1994 zyxwvutsrqponmlkjihgfedcbaZY Abstract We describe a new parallel solver in the class of partition methods for general, nonsingular tridiagonal linear systems. Starting from an already known partitioning of the coefficient matrix among the parallel processors, we define a factorization, based on the QR factorization, which depends on the conditioning of the sub-blocks in each processor. Moreover, also the reduced system, whose solution is the only scalar section of the algorithm, has a dimension which depends both on the conditioning of these sub-blocks, and on the number of processors. We analyze the stability properties of the obtained parallel factorization, and report some numerical tests carried out on a net of transputers. Keywords: Tridiagonal linear systems; Parallel solvers; QR factorization; Transputer array 1. Introduction Much interest has been devoted to the problem of solving tridiagonal linear systems on parallel computers in recent years, and many parallel tridiagonal solvers have been devised. These solvers can be grouped into two main categories: direct solvers and iterative solvers. Among the direct methods, we can mention the partition methods, which include very efficient parallel algorithms [1,2,4,5,10-151. Among the iterative methods we mention the Jacobi method and the AGE method zyxwvutsrq [91. Concerning partition methods, a unifying approach for their derivation and study is given in [1,2], where it is also shown that these methods are stable when * Corresponding author. Email: 00110570@vm.csata.it 0167-8191/95/$09.50 0 1995 Elsevier Science B.V. All rights reserved SSDZ 0167-8191(95)00011-9 P. Amodio, L. Brugnuno /Paralkl Computing 21 (1995) 1097-1110 1098 used to solve diagonally dominant linear systems. Diagonal dominance also implies the convergence of the above mentioned iterative solvers. Nevertheless, to our knowledge, no parallel algorithms exist for solving general nonsingular tridiagonal linear systems. In this paper, we develop a new parallel solver for this purpose. The derivation of the method is outlined in Sections 2 and 3, while its stability properties are discussed in Section 4. In Section 5 the parallel algorithm is examined and a simple model for the expected speedup is reported together with some numerical tests carried out on a linear array of transputers. 2. zyxwvutsrqponmlkjihgfedcbaZYXWVUTSRQPONMLKJIHGFEDCBA Parallel factorizations The derivation of the new method extends the concept of ‘parallel factorization’ given in [l]. Let us consider the tridiagonal matrix zyxwvutsrqponmlkjihgfedcbaZYXWVU a, Cl zyxwvutsrqponmlkjihgfedcbaZYXWVUTSRQPONMLKJIHGFEDC b, zyxwvutsrqponmlkjihgfedcbaZYXWVUTSRQPONMLKJIHGFEDCBA a2 *. A= (1) 7 cn-I Cl a” where for simplicity we assume that n = kp - 1 for some integer k, if p is the number of the parallel processors we want to use. To obtain a parallel solver tailored for this number of processors, we consider the matrix A partitioned as follows: \ A(1) c&l?lek-l a(kl) c&2)ef $1 lel_ zyxwvutsrqponmlkjihgfedcbaZYXWVUTSRQPONMLKJIHGFEDCB 1 ba)e A= 1 A(2) bi2?leE_ c~2ilek-1 1 af) a(,P-‘) bp)e, @)ei * A(P) J (2) where e, and ek_r are, respectively, the first and the last unit vectors of I@‘-i, the blocks A(‘) are tridiagonal of size k - 1 and XI’)= ~~~_.i)~+~for x = a, b, c. Observe that on a distributed memory parallel computer, the upper index identifies the processor where the given quantity is to be stored. Let us then consider the following factorization of the matrix (1): A=FTG, (3) P. Amodio, L. Brugnano /Parallel Computing 21 (1995) lO !V- 1110 1099 where, by denoting with Zk_i the identity matrix of order k - 1, v(l) 0 \ ,(I)’ 1 0 0 W(2)= 0 NC*) 0 0(*)= I zyxwvutsrqponmlkjihgfedcbaZYXWVUTSRQPONMLKJI 43F 0 ‘1 1 F= 0 N3) 0 0 d3f 1 9 \ ‘1k-l 0= 1 w(Pf 0 Ncp), \ 0 &) OT 0 pm &-I OT g*) 0 ($2) OT #3) 0 zyxwvutsrqponmlkjihgfedcbaZYXWVUTSRQPONMLKJIHGFEDCBA T= zyxwvutsrqponmlkjihgfedcbaZYXWVUTSRQPONMLKJIHGFEDCBA ‘k-1 o I p(3) ()T &3) a(P-l) 0 \ 7 1) )( Ik-I, p \ o= 1 p) 0 G= OT OT 92’ OT 0 p) 1 $3) 0 OT 93’ OT 0 #3’ 9 1 zyxwvutsrqponmlkjihgfedcbaZYXWVUTSR 1 Z(P) OT S(P) N(‘)S(‘) is a suitable factorization of the block A(‘), and the other unknowns of F, G and T may be easily obtained from (3) by direct identification. By considering different factorizations for A @),it is then possible to derive most of the parallel methods in the class of the partition methods. For example in [1,2], supposing A”’ diagonally dominant, the LU factorization, the LUD factorization, and the cyclic reduction were considered. In the solution of the linear system zyxwvutsrqponmlkjihgfedcbaZYXWVUTSRQPONML h=f, (4) P. Amoa’ia,L. Brugnano/Parallel Computing21 (1995) 1097-1110 1100 the two block tridiagonal systems with the matrices F and G are solved without any synchronization or data communication among the processors. Therefore, this section is completely parallel. The only sequential part of the algorithm consists in the solution of the tridiagonal linear system with the ‘reduced matrix’: zyxwvutsrqponmlkjih &) p p(2) am * . zyxwvutsrqponmlkjihgfedcbaZYXWVUTSRQPONMLKJIHGFE Tp = (5) y(P-‘) p(p-l) a(P-l) It must be noted that the size of Tp is independent of n, but depends only on the number p of parallel processors. In [1,2] it is suggested to study the factorization (3) locally by introducing the (n +p - 1) x n matrix R, recursively defined as follows: \ %-I R, = Zk_l, Ri 1 1 = \ , i=2 ,*.*, P* Ri-ll From A = RFMRp, we obtain the matrix M which is p xp block diagonal with tridiagonal main-diagonal blocks MC’) defined (with some slight differences for the first and the last one (see (2))) as follows: I 0 f$eF (6) where the matrices NC’) and S@) the vectors &), w@),y@) z w and the scalars PC’), y(‘) are the same as in (3), and &) + (~(li+i)= &, for i =‘l,. . . , p - 1. 3. The parallel QR factorization The approach in Section 2 is very general and useful, and very efficient tridiagonal solvers have been obtained [l]. Nevertheless, it is evident that all the blocks A(‘) need to be nonsingular for (3) to be defined. This is the case, for example, when the matrix A is diagonally dominant. In this case, also the reduced matrix turns out to be diagonally dominant [1,2]. 1101 P. Amodio, L. Brugnarw /Parallel Computing 21 (1995) 1097-1110 zyxwvutsrqponmlkjihgfed For zyxwvutsrqponmlkjihgfedcbaZYXWVUTSRQPONMLKJIHGFEDCBA A generally nonsingular, it may happen that some of the blocks A(‘) are singular or very ill-conditioned. It follows that the parallel solver may be not defined or unstable even if stable scalar solvers exist (for example, the LU factorization method with partial pivoting or the QR factorization method). For this reason, in order to obtain a stable parallel solver, we consider a generalization of the above approach which consists in a finer structuring of the factorization itself. Suppose that the (k - 1) zyxwvutsrqponmlkjihgfedcbaZYXWVUTSRQPONMLKJIHGFEDC X (k - 1) block A(‘) is singular. Then we proceed as follows: let AT) be the largest nonsingular principal submatrix of A(‘), and let j - 1 be its size. We define the following factorization of the block MC’): M(i) = LOD(OuG) 1 1 (7) 7 where 11 0 L’;:‘=o w(OT 0 Aq” 0 @= 1 OT 0, 0 I&j- l 0 oT (1 $’ (J;“= 0 OT \ 0 sl” g 1 OT 0, 0 0 Ik- j- 1 0 1 0 OT o= 1 zyxwvutsrqpon \ 7 -1 / and gi is the ith unit vector of size k - j - 1. This is equivalent to introducing the variable x!‘) = x ._ in the reduced system. The same procedure is then recursivelyJappligd l:ok%e tridiagonal block A’$’ of Dy ), and so on. Therefore it follows that the factorization is always defined, eventually by increasing the size of the reduced system,, that is, the scalar section of the corresponding parallel solver. We observe that some of the blocks A$) may have zero size; this means that two consecutive components in the solution vector belong to the reduced system. Anyway, if the matrix A is nonsingular, then the block A(‘), which has size k - 1, has rank at most equal to k - 3. It follows that in exact arithmetic the size of the reduced system is still independent of the size n of the linear system. Since the blocks A? are only nonsingular, it is convenient to factorize them by using a suitable stable factorization. Two obvious candidates for this purpose are the LU factorization with partial pivoting and the QR factorization. We shall analyze in more detail the parallel solver obtained by using the second factorization, which we shall call parallel QR factorization method, but a similar approach can be repeated for the former. For simplicity, let us suppose that the generic 1102 P. Anwdio, L. Brugnano/ParaUel Computing 21 (1995) 1097-1110 block A(‘) is nonsingular, factorize MC’) as: since the generalization is straightforward. Then, we (8) where Q(%@) is the QR factorization @)rw(O = c$=., , QWz(O “ $0 p(i) = = = _ _ bpel, R(iFvv(O QWy(O w(OTz(O 3 v(OTz(O, &$I = = = b(i’ k $l_ &I _ = e 1 k lek_ u(OTyO) k yO) of the matrix A(‘), and _ 1, 1, (9) , _wU)Ty(Oe We observe that, since QCi) is upper Hessemberg and R(‘) is upper triangular, then only the vectors wCi)and zti) are full vectors (filZ-in vectors), while only the last entry of yCi) and the last two entries of y(‘) may be nonzero. To increase the stability of the factorization, we may also choose Ac;‘) as the largest principal submatrix of A G) which has a condition number smaller than a given tolerance. This choice is possible since we can monitor a very effective approximation of the condition number of the current block. In fact, the matrices Q(i) and R(i), as well as the vector rvCi)can be formed step by step. In particular, at the generic step j we have calculated Q,‘o and RI’) which are the principal submatrices of size j of Q(‘) and R(‘), and the subvector WY)which contains the first j components of the vector w(‘) . Since Q!i) is orthogonal, then the condition number of Af) is the same as the condition mrmber of RI’). From (9) it follows that q!‘) = (c$))-’ 11w!‘) 111 = ll(R!“)-’ II since T#) is the norm of the first row of (RY))“. Ther e fore, we may obtain an e:timate of the condition number of A’!) at each step. As soon as this estimate is larger than a given tolerance, we increase the size of the reduced system, as previously seen, by introducing in it the variable ~7). 3.1. The parallel LU factorization with partial pivoting To complete this discussion, we now sketch briefly the parallel LU factorization with partial pivoting, which is obtained by considering the following factorization of A-f(‘)(again, we shall suppose that A(‘) is nonsingular, for simplicity): (10) Here we factorize A(‘) as P(‘)A(‘) = L(‘)U(‘) in this case, one has K( A”‘) < K( L”‘)K( u”‘). 1103 P. Amodio, L. Brugnano /Parallel Computing 21 (1995) 1097-1110 To estimate the quantities on the right hand side one has, as for the parallel QR factorization, while for II(L(‘))-’ IIone of the following estimates can be used (k - 1 is the size of L(O): 12(k-l), zyxwvutsrqponmlkjihgfedcbaZYXWVUTSRQPONMLKJIHGFEDCBA II(L(i))-lIll=(b~))-lllz(i)lll. II(L’yIJ The actual implementation elsewhere. of this algorithm will be examined in more detail 4. zyxwvutsrqponmlkjihgfedcbaZYXWVUTSRQPONMLKJIHGFEDCBA Stability analysis Let us now examine the stability of the parallel QR factorization. By fixing an upper bound for the conditioning of the sub-blocks A:), the conditioning of the whole system will depend on the reduced system. We shall therefore discuss how perturbations on the matrix A affect the reduced matrix. For this purpose, let us suppose to have fiied a suitable tolerance u for the condition number KU:)) of A$?. For a given nonsingular tridiagonal matrix A, a corresponding partition is then obtained. This partition can be represented in compact form by introducing the following block odd-even permutation matrix of order IZ: zyxwvutsrqponmlkjihgfedcbaZYXWVUTSRQPONMLKJIHGFEDCBA 0 0 ... 0 0 0 ... 0 ... o= o= 1 0 0 or or o= I... ... ... ‘I P= or 0 z ... 0 0 ... 1 0 0 z ... 0 0 ... o= o= ... 1 o= ... which first takes the rows corresponding to the blocks A:) and then those associated to the reduced system. Here we have denoted by Z a generic identity matrix whose size is equal to the size of the corresponding block A? in the partition. By means of zyxwvutsrqponmlkjihgfedcbaZYXWVUTSRQPONMLKJIHGFE P we obtain zyxwvutsrqponmlkjihgfedcbaZYXWVUTSRQPONM (11) where A, and A, are block diagonal, A, having A:) as a generic diagonal block, and A, of size z-p - 1, while B and C are block bidiagonal. The parallel QR factorization is equivalent to factoring the matrix (11) as (12) 1104 P. Anwdio, L. Brugnano /Parallel Computing 21 (1995) 1097-1110 where Q,R, is the QR factorization of the matrix A, and T,=A,-BA;lC (13) is the reduced matrix. Now we examine the propagation of small perturbations on the matrix A in the factorization (12). In the following, for every given matrix X the symbol 6X will denote its corresponding perturbation (in particular 6X-i and 6Xr will represent perturbations respectively on X-’ and XT). Then, one has: C+iW A,+6A, P(A+GA)P== (B+tiB)(R;'+SR,')Z It is R,+aR, zyxwvutsrqponmlkjihgfedcbaZYXWVUTSRQPONMLKJIHGFEDCBA (Q,+SQ/(C+aC) Tp+c3Tp ' I i obvious that lI6A, II, II 6A,II, II 6B II, II 6C II s II 6A II. Moreover, since both Q, and Q, + SQl are orthogonal, it follows that IISQl IIis also bounded. Then, in the above expression we need only to provide an explicit bound for II 6RlII, II 6R;'II and 11ST,I(. By using the 2-norm and a first order analysis, from the relations QA =A, (Q, + SQI)(RI + 64) =A, + 4, (R;'+GR,')(Q;+6Q;)=A;'+c?A;', one easily derives that II4 II II6A, II 3i%pllAlll+ IISQ1 II IIQ,ll ’ and II 6R;’ II II SA,’ II + II SQ1 II IIR;‘II ’ llA;lll II Q, II ’ Similarly, from the perturbation of (13), it can be derived that II 6T, II I II 6A, II + II B II II A;l II ll C ll IISA,‘II II6Bll ll6CII Il + jjq +m Here we have supposed that the relative perturbations on the matrix A are smaller than E. In the above expressions, the two quantities K(A~) and II6A;'II II A;'II -' can be noted. Since we use a tolerance u for K(A(:')), we have: K(A,)=~~{K( A?)}<a. 1105 P. Anwdio, L. Brugnano / Parallel Computing 21 (1995) 1097-1110 Concerning 1]&4;’ ]I I]A;’ I]-I, if we suppose that IIA;’ I] /I&4, II -C 1, then from the relation (A, + SA,)_’ =A;l f aA;‘, and by using standard results of linear algebra, one obtains: zyxwvutsrqponmlkjihgfedcbaZYX II6.4,’ II zyxwvutsrqponmlkjihgfedcbaZYXWVUTSRQPONMLKJIHGFEDCBA II6A, zyxwvutsrqponmlkjihgfedcbaZYXWVUTSRQPONMLKJIHGFE II II 6A, II zyxwvutsrqponmlkjihgfedcbaZY 4%) IIA;‘II ’ l- IIA;lll I16Alll(IA,=K(A1) ((AllI ’ Therefore, also this term is proportional to zyxwvutsrqponmlkjihgfedcbaZYXWVUTSRQPO K(A~). We conclude that by fixing a suitable upper bound u for K(A~) (which may depend on the size 12 of the problem, on the number p of the processors used and on the machine precision U> the factorization (12) is quite stable, at least when K(A) itself is not exaggerately large. Indeed, the algorithm is essentially devoted to matrices which are not diagonally dominant; but nevertheless well-conditioned or weakly well-conditioned (see [6]). In fact, the case where the condition number of the matrix A grows exponentially with its size is of no practical interest, since the problem becomes too much ill-conditioned for relatively small values of n. 5. The parallel QR factorization algorithm Let us now examine a pseudo-code which implements the parallel QR factorization algorithm; this will allow us to easily derive a simplified model for the expected speedup with respect to the scalar QR factorization algorithm. The QR factorization is obtained by using Givens’ rotation matrices. The data distribution is carried out according to (21, where the upper index denotes the processor number. Moreover, in order to obtain a simpler code, we add two null equations at the beginning and at the end of the linear system which we call rows 0 and n + I, respectively. The pseudo-code, written by using a Matlab-like language, follows. procedure % pqr(k, % k=size % a, % b, of c, the a, b, local c, block x=coefficients assigned 3: the % tol=tolerance to the of tol, on of computed for % proc=index % nproc=number x, the the processor; solution the norm the current of the parallel proc, of nproc) processor problem and in output the inverse of the rhs x contains of the blocks processor processors i=O nr=l break(l) =O while (i <k-l) call qrl(k-i, a(i), b(i), c(i), x(i), z(i), tol, j> P. Amodio, L. Brugnuno /Parallel Computing 21 (1995) 1097-1110 1106 i=itj nr=nr+l breakcnr) = i end if (i ==k-1) nr=nr+l breakcnr) =k end %the indexes X are in %in the % vector of the the variables vector vector break(2:nr) break(l:nr-1) X indexes are % reduced 3: system (solution of to the processor the system 1, nproc. and a, b, c, in the These coefficients vectors reduced reduced l<proc<nproc, processor recover from the if on on used the in break(l:nr) of the x. system> X % update for of the rhs and back-substitution i=l:nr-1 jO=break(i) j=break(i+l)-1 if Cj>jO) x(j)=(x(j)-z(j)*x(jO)-co*x(j+l))/acj> for j=break(i+l)-2:-l:breako+l x(j)=(x(j)-z(j)*x(jO)-c(j)*x(j+l)-b(j)*x(j+2))/a(j) end end end end procedure procedure % qrl(k, X Performs % the a, the c, x, QR factorization coefficients X rhs x. X diagonal b, a, b, The vector b is of the matrix c, a(j) last ans tot, of updating the j) the the overwritten R; z is the X The informations concerning X stored as follows: c (0) =y, X a(O)=a,, 3: b(O) =p, X while the X in b(j-2) z, block defined corresponding with fill-in reduced the second vector. system by upper are =(Y*, two c(j-I), components of respectively. the vector y are stored P. Amodio, L. Brugnano /Parallel Computing 21 (1995) 1097-1110 1107 % zyxwvutsrqponmlkjihgfedcbaZYXWVUTSRQPONMLKJIHGFEDCBA toosmall=<tolerance for w=-1 X the fill-in wl=O % the needed zero entries> vector u is informations not actually are stored formed; in w, WI, w2 z(l)=b(O) b(O)=0 norm=0 toll=tol*c(0) for i=l:k-2 % computes % for call the the coefficients vector (a(i), givens (a(i), of the Givens matrix b(i)IT b(i), r, s) tmp=r*a(i+l)-s*c(i) if (abs(tmp)<toosmall (abs(b(i+l))<toosmall~i==k-2) & ... 1 norm>toll) break end a(i)=r*a(i)+s*b(i) c(i)=r*c(i)+s*a(itl) a(i+l)=tmp b(i)=s*c(i+l) c(i+l)=r*c(i+l) z(i+l)=-s*z(i) z(i)=r*z(i) tmp=r*x(i+l)-s*x(i) x(i)=r*x(i)+s*x(i+l) x(i+l)=tmp w2=wl wl=w w=-(w2*b(i-2)+wl*c(i-l))/aO norm=norm+abs(w) a(O)=a(O)-w*z(i) x(0)=x(O)-w*x(i) end zyxwvutsrqponmlkjihgfedcbaZYXWVUTSRQPONMLKJIHGFEDCBA j=itl w2=wl wl=w w=-(w2*b(i-2)+wl*c(i-l))/ao a(O)=a(O)-w*z(i) x(0)=x(O)-w*x(i) c(O)=-(wl*b(i-l)+w*c(i)) tmp=-b(i)/a(i) b(O)=tmp*z(i) a(j)=a(j)+tmp*c(i) P. Amodio, L. Brugnarw/ParaUel Computing 21 (1!995)1097-1110 zyxwvutsrqponmlkjihgf 1108 x(j)=x(j)+tmp*x(i) end procedure As one can see, the required storage on each processor essentially amounts to five vectors for the coefficients (a, b and c), the right hand side x, and the fill-in vector 2. Concerning the operation count, it is well known that a call to the procedure givens has a cost of 5 flops plus 1 square root (we count as 1 flop one of the four elementary diadic operations). From the above pseudo-code, it is simple to derive that, apart from the solution of the reduced system, the parallel solver requires = 40n flops plus n square roots that can be performed in parallel on p processors. The cost of the solution of the reduced system can not be exactly predicted since it depends on the size of the system itself. However, since we observed that the blocks A(‘) may have a null space of dimension at most two if the matrix A is nonsingular, one may expect that the size of the reduced system grows linearly with the number p of the processors, thus requiring O(p) flops for its solution. This is true, at least when zyxwvutsrqponmlkjihgfedcbaZYXWVUTSRQPONMLKJIHGFEDCBA K (A) itself is not prohibitively large. In fact, if for example K (A) is O(n), then the condition number of the largest principal nonsingular matrix of each block A(‘) should behave approximately as O(n/p) and the size of the reduced system can be expected to be at most 3p. As a further remark, in [3] it is suggested an approach for solving the reduced system that, for the matrix (5), requires only O(log, p> scalar operations and log, p - 1 data transmissions. By applying the same approach (but by using the QR factorization) to the reduced system deriving from the parallel QR factorization described in Section 3, it results that the number of transmissions is unchanged even if the total number of data transmitted may be larger. Therefore, if we suppose n %p, then the actual scalar cost of this part of the algorithm is negligible. This approach has been used in the parallel code for the numerical tests. The scalar Givens method requires = 27n flops plus n square roots. It follows that the expected asymptotic speedup on p processors is given by L?~= lims,(n) n-m (27++ = lim “‘*(40+?I);+o(p) =- 27 + q 40+qP’ (14) where r) is the ratio between the time to execute 1 square root and the time to execute 1 flop, and O(p) is the scalar section of the code. Finally, it must be said that (14) is only an upper bound for the effective asymptotic speedup, since the operations in the scalar Givens algorithm are better chained and, hence, better optimized by any ‘smart’ compiler. 6. zyxwvutsrqponmlkjihgfedcbaZYXWVUTSRQPONMLKJIHGFEDCBA Numerical tests Let us now report the measured speedup for the parallel QR factorization algorithm obtained on a linear array of transputers with distributed memory. For P. Amodio, L. Brugnano/Parallel Computing 21 (1995) 1097-1110 11 1109 zyxwvutsrqponmlkjihgfedcbaZYXWVUTS Fig. 1. M easured speedups. this computer, the value 77in (14) is about 3 and hence for n large with respect to the expected speedup is ip = 3Op/43. The programming language used is Fortran with the Express [16] parallel libraries. The numerical tests concern the following problem: p (15) which derives from the use of the mid-point method as boundary value method for ODE [7,8]. The coefficient matrix in (15) has condition number (in l-norm or m-norm) 2n and hence it is quite well-conditioned when nu K 1, where u is the machine precision. Nevertheless, by using the usual partition methods, factorization (2) may be not defined since some of the blocks A”‘, i = 1,. . . , p - 1, may be singular. Conversely, by using the parallel QR factorization algorithm described in Section 3, we obtain a stable solver with a reduced system of size at most 2p - 1. The measured speedups on 4, 8 and 16 processors versus the size n of the problem are shown in Fig. 1. Observe that the obtained results clearly show the linear dependence on p of the measured asymptotic speedup, as predicted in (14). zyxwvutsrqponmlk Acknowledgements The authors thank Ms. Paulene Butts for her help in the preparation manuscript. of the 1110 P. Amodio, L. Brugnano/Parallel Computing 21 (1995) 1097-1110 References [l] P. Amodio and L. Brugnano, Parallel factorizations and parallel solvers for tridiagonal linear systems, Linear Algebra Appl. 172 (1992) 347-364. [2] P. Amodio, L. Brugnano and T. Politi, Parallel factorizations for tridiagonal matrices, SLAM J. Numerical Anal. 30 (1993) 813-823. [3] P. Amodio and N. Mastronardi, A parallel version of the cyclic reduction algorithm on a hypercube, Parallel Comput. 19 (1993) 1273-1281. [4] S. Bondeli, Divide and Conquer: a new parallel algorithm for the solution of a tridiagonal linear system of equations, Tech. Report 130, ETH Ziirich, Department Informatik, Institut fiir Wissenschaftliches Rechnen, Zurich, Switzerland, 1990. [5] L. Brugnano, A parallel solver for tridiagonal linear systems for distributed memory parallel computers, Parallel Comput. 17 (1991) 1017-1023. [6] L. Brugnano and D. Trigiante, Tridiagonal matrices: invertibility and conditioning, Linear Algebra Appl. 166 (1992) 131-150. [7] L. Brugnano and D. Trigiante, Stability properties of some BVM methods, Applied Numer. Math. 13 (1993) 291-304. [8] L. Brugnano and D. Trigiante, A parallel preconditioning technique for BVM methods, Applied Numer. Math. 13 (1993) 271-290. [9] D.J. Evans and W.S. Yousif, The solution of two-point boundary value problems by the alternating group explicit (AGE) method, SLAM 1. Scientific and Statis. Comput. 9 (1988) 474-484. [IO] I.N. Hajj and S. Skelboe, A multilevel parallel solver for block tridiagonal and banded linear systems, Parallel Comput. 15 (1990) 21-45. [ll] S.L. Johnsson, Solving tridiagonal systems on enseble architectures, SLAMS. Scientifi and Statist. Comput. 8 (1987) 354-392. [12] A. Krenchel, H.J. Plum and K. Stiiben, Parallelization and vectorization aspects of the solution of tridiagonal linear systems, Parallel Comput. 14 (1990) 31-49. [13] V. Mehrmann, Divide and conquer methods for block tridiagonal systems, Parallel Comput. 19 (1993) 257-279. [14] J.M. Ortega, Introduction to Parallel and Vector Solution of Linear systems (Plenum Press, New York, 1988). [U] H.H. Wang, A parallel method for tridiagonal linear systems, ACM Trans. Math. Software 7 (1981) 170-183. [16] Express Fortran User’s Guide, ParaSoft Corporation, Pasadena, 1990. View publication stats

Log In

The parallel QR factorization algorithm for tridiagonal linear systems