
A method to parallelize tridiagonal solvers

1991, Parallel Computing

Parallel Computing 17 (1991) 181-188, North-Holland

A method to parallelize tridiagonal solvers *

Silvia M. Müller and Dieter Scheerer
Institute for Computer Architecture, Computer Science Department (Bau 36), University of Saarland, D-6600 Saarbrücken 11, Germany

Received 20 August 1989; revised 10 September 1990

* Research partially funded by DFG, SFB 124.

Abstract

Müller, S.M. and D. Scheerer, A method to parallelize tridiagonal solvers, Parallel Computing 17 (1991) 181-188.

A general method to parallelize any tridiagonal solver is presented. The method, based on a suitable decomposition of the numerical problem, is rather simple but efficient. The communication overhead stays small, and with respect to stability and vectorizability the parallel algorithms have nearly the same qualities as their sequential counterparts. When applied to standard sequential solvers the method yields some well-known as well as many new parallel algorithms.

Keywords. Parallel tridiagonal solvers; Wang decomposition; parallel LU factorization.

1. Introduction

When differential equations are solved numerically, tridiagonal systems with more than 10 000 unknowns may arise. There are many sequential solvers for such systems, like Gaussian Elimination, LU Factorization or Cyclic Reduction, and each solver has its advantages with respect to vectorization, speed and numerical qualities. But parallel solvers are indispensable, because the sequential solvers are not fast enough. It would be desirable to have a parallel counterpart for each sequential solver, so a simple method to parallelize all these algorithms would be very helpful.

In the following sections we give a general description of the method, two examples of its application and some measured results. In doing so we consider two numerical problems. First, a tridiagonal system with one right hand side has to be solved; second, a system has to be solved with several right hand sides which are fixed consecutively.

2. Description of the method

Suppose that the number of processors p divides the order n of the system. If one partitions the original system A·x = d into p subsystems, then each processor works on n/p consecutive equations. The k-th system has the following form:

$$
A = \begin{pmatrix}
a_1 & b_1 & & \\
c_2 & a_2 & b_2 & \\
 & \ddots & \ddots & \ddots \\
 & & c_n & a_n
\end{pmatrix}
\tag{1a}
$$

$$
\begin{pmatrix}
c_j & a_j & b_j & & & \\
 & c_{j+1} & a_{j+1} & b_{j+1} & & \\
 & & \ddots & \ddots & \ddots & \\
 & & & c_r & a_r & b_r
\end{pmatrix}
\begin{pmatrix}
x_{j-1} \\ x_j \\ x_{j+1} \\ \vdots \\ x_r \\ x_{r+1}
\end{pmatrix}
=
\begin{pmatrix}
d_j \\ d_{j+1} \\ \vdots \\ d_r
\end{pmatrix}
\tag{1b}
$$

with j = (k-1)·n/p + 1, r = k·n/p and c_1 = b_n = x_0 = x_{n+1} = 0.
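To make the block partition concrete, the following minimal sketch (not from the paper; the function name and array layout are my own choices) cuts the three diagonals and the right hand side into the p row blocks of (1b), using Python with NumPy arrays.

```python
import numpy as np

def partition_tridiagonal(c, a, b, d, p):
    """Split a tridiagonal system, given by its sub-, main- and super-diagonal
    c, a, b (with the convention c[0] = b[n-1] = 0) and right hand side d,
    into p row blocks of n/p consecutive equations as in (1b).

    Illustrative sketch only; returns one (c, a, b, d) block per processor.
    """
    n = len(a)
    assert n % p == 0, "the paper assumes that p divides n"
    m = n // p  # n/p consecutive equations per processor
    blocks = []
    for k in range(p):
        j, r = k * m, (k + 1) * m  # 0-based slice of rows for processor k
        blocks.append((c[j:r].copy(), a[j:r].copy(), b[j:r].copy(), d[j:r].copy()))
    return blocks
```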
The subsystems (1b) are neither square nor independent. Each system, except the first and the last one, has two variables more than equations. The next two steps contain the main idea of this general method. The extra matrix elements vanish when the first and the last variable of each system (1b) are introduced as parameters on the right hand side. For processor k, system (2) results:

$$
A_k x^k = d^k - x_{j-1}\, f^k - x_{r+1}\, g^k,
\tag{2}
$$

where A_k is the square tridiagonal block of A belonging to processor k, x^k = (x_j, ..., x_r)^T, d^k = (d_j, ..., d_r)^T, f^k = (c_j, 0, ..., 0)^T and g^k = (0, ..., 0, b_r)^T.

Solving system (2) can also be seen as follows: solve one tridiagonal linear system with three different right hand sides and choose x^k as a linear combination of the three partial solutions:

$$
A_k u^k = d^k, \qquad A_k y^k = f^k, \qquad A_k z^k = g^k,
\tag{3}
$$

$$
x^k = u^k - x_{j-1}\, y^k - x_{r+1}\, z^k.
\tag{4}
$$

The subsystems (3) are square and can be solved without any communication between processors. Each processor uses an arbitrary sequential solver to do this. It also suffices to convert system (3) into diagonal form and to change equation (4) slightly:

$$
a^k x^k = u^k - x_{j-1}\, y^k - x_{r+1}\, z^k.
\tag{5}
$$

Here a^k is the principal diagonal of the transformed system (3).

Before the computation of the global solution x^k, the parameters connecting the subsystems must be determined. Taking the first and the last equation of each system (4, 5) (only the last one of the first system and the first one of the last system), the tridiagonal system (6) of order 2p - 2 arises. The two equations of processor k are

$$
a_j^k x_j = u_j^k - x_{j-1}\, y_j^k - x_{r+1}\, z_j^k, \qquad
a_r^k x_r = u_r^k - x_{j-1}\, y_r^k - x_{r+1}\, z_r^k.
$$

Collecting these equations for all processors, with the unknowns x_{n/p}, x_{n/p+1}, x_{2n/p}, x_{2n/p+1}, ..., x_{n(p-1)/p}, x_{n(p-1)/p+1} (the processor superscript k is dropped, since each row index belongs to exactly one processor), gives system (6):

$$
\left.
\begin{aligned}
a_{n/p}\, x_{n/p} + z_{n/p}\, x_{n/p+1} &= u_{n/p} \\
y_j\, x_{j-1} + a_j\, x_j + z_j\, x_{r+1} &= u_j \qquad (k = 2, \dots, p-1)\\
y_r\, x_{j-1} + a_r\, x_r + z_r\, x_{r+1} &= u_r \\
y_{n(p-1)/p+1}\, x_{n(p-1)/p} + a_{n(p-1)/p+1}\, x_{n(p-1)/p+1} &= u_{n(p-1)/p+1}
\end{aligned}
\right\}
\tag{6}
$$

System (6) can be solved in a sequential or in a parallel way. In order to solve the system in parallel, the above algorithm is used recursively, but with half the number of processors. For example, Gaussian Elimination could be used as solver. Each active processor solves a system of 4 equations with 3 right hand sides. After (log p - 1) recursions only one processor is active and solves the remaining system sequentially. The data flow of this step is visualized in Fig. 1 for p = 8.

[Fig. 1. Data flow for solving system (6) in parallel, p = 8: solving system (2) (original or reduced system); solving system (6) in the last recursion on one processor; computing the linear combination (4, 5); communication.]

In order to solve system (6) sequentially, data must be collected. After that transfer of data, one processor computes the solution and sends the results to all other processors. The transfer can be executed in at most 2 log p rounds of communication. The processors collect the coefficients of equations (6) as shown in Fig. 2.

[Fig. 2. Data flow for solving system (6) sequentially: collecting the coefficients of the equations (6); solving system (6) sequentially and distributing the solution; computing the linear combination (4, 5); communication.]

The whole computation consists of three steps. First, each processor solves system (2) independently of all others, using an arbitrary sequential solver. This happens without any transfer of data. Second, all processors together determine the parameters. They only need time O(log p) for communication and a little extra computation. Third, each processor computes the solution as a linear combination of the three partial solutions. This is also possible without any transfer of data.
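The three steps can be simulated on a single machine. The following sketch is mine, not the authors' code: the "processors" are loop iterations, SciPy's banded solver stands in for the arbitrary sequential solver of step 1, and the reduced system (6) is solved directly with a dense solver instead of recursively; all function and variable names are assumptions for illustration.

```python
import numpy as np
from scipy.linalg import solve_banded

def partition_method_solve(c, a, b, d, p):
    """Single-machine simulation of the three-step method.

    c, a, b: sub-, main- and super-diagonal (c[0] = b[-1] = 0), d: right
    hand side, p: number of simulated processors (p must divide n).
    """
    n = len(a)
    m = n // p
    u, y, z = np.zeros(n), np.zeros(n), np.zeros(n)

    # Step 1: each processor solves its square system (3) for the three
    # right hand sides d^k, f^k = c_j*e_1 and g^k = b_r*e_last.
    for k in range(p):
        j, r = k * m, (k + 1) * m
        ab = np.zeros((3, m))
        ab[0, 1:] = b[j:r-1]          # superdiagonal of the local block
        ab[1, :] = a[j:r]             # main diagonal
        ab[2, :-1] = c[j+1:r]         # subdiagonal
        rhs = np.zeros((m, 3))
        rhs[:, 0] = d[j:r]
        rhs[0, 1] = c[j]              # f^k
        rhs[-1, 2] = b[r-1]           # g^k
        sol = solve_banded((1, 1), ab, rhs)
        u[j:r], y[j:r], z[j:r] = sol[:, 0], sol[:, 1], sol[:, 2]

    # Step 2: build and solve the reduced system (6) for the interface
    # unknowns x_{m-1}, x_m, x_{2m-1}, x_{2m}, ... (0-based indices).
    iface = [k * m + s for k in range(1, p) for s in (-1, 0)]
    q = len(iface)
    R, rhs6 = np.eye(q), np.empty(q)
    for t, i in enumerate(iface):
        rhs6[t] = u[i]
        k = i // m                          # processor owning row i
        left, right = k * m - 1, (k + 1) * m  # its x_{j-1} and x_{r+1}
        if left >= 0:
            R[t, iface.index(left)] += y[i]
        if right < n:
            R[t, iface.index(right)] += z[i]
    x_iface = np.linalg.solve(R, rhs6)

    # Step 3: each processor forms the linear combination (4).
    x = u.copy()
    for k in range(p):
        j, r = k * m, (k + 1) * m
        if j - 1 >= 0:
            x[j:r] -= x_iface[iface.index(j - 1)] * y[j:r]
        if r < n:
            x[j:r] -= x_iface[iface.index(r)] * z[j:r]
    return x
```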
It is easy to give a coarse lower bound for the efficiency of these parallel algorithms. Let
OM = number of all operations whose result only depends on the matrix (sequential case),
OR = number of all other operations, especially those depending on the right hand side (sequential case),
TP = number of all operations to solve system (6) and to transfer data (parallel case), and

$$
\text{Eff} = \frac{\text{sequential runtime}}{p \cdot \text{parallel runtime}}.
$$

Then

$$
\text{Eff} \;\geq\; \frac{OM + OR}{OM + 3\,OR + TP \cdot p}.
$$

OM and OR are linear in n, and TP is linear in p in the worst case. If the dimension n is much bigger than the number of processors and the time to initiate a communication, then Eff ≥ ½. In reality the efficiency will be greater, because we have not used the special sparse structure of the two new right hand sides. So far, the analysis of our method is independent of the sequential solver used.

3. Applications and results

Applying the above method to Gaussian Elimination yields the well-known Wang decomposition [1,2]. Applying it to Cyclic Reduction yields a parallel algorithm developed by the GMD [2].

Now we want to apply our method to two standard sequential solvers and compare the parallel counterparts with regard to efficiency and parallel runtime. For that purpose we need some numerical test problems; they are described below. The resulting parallel algorithms have been implemented on a PARSYTEC Megaframe with 16 processors and on the DATIS-P machine in Saarbrücken. (A short description of the second parallel computer follows below.)

3.1. Numerical test problems

First, the tridiagonal system (7) with one single right hand side has to be solved. Such systems arise, e.g., when solving ordinary differential equations by finite differences or finite elements:

$$
A x = d, \qquad A \text{ a tridiagonal matrix.}
\tag{7}
$$

Second, a system with several right hand sides is considered. The right hand sides are fixed consecutively:

$$
A x^t = d^t, \qquad t = 0, \dots, \tau.
\tag{8}
$$

Such problems arise, e.g., when solving time-dependent PDEs by the finite difference method [3]. In that case the right hand side has a special form: d^t = B x^{t-1}, where B is also a tridiagonal matrix. Then it takes additional operations of cost 4nA to compute B x^t in the sequential case and of cost (4n/p)A in the parallel case.

3.2. Model of computation

For the theoretical estimates a computational model is necessary. There are p processors with private memory, interconnected via a network, a complete crossbar or a permutation network. A more detailed analysis shows that a cube or a reconfigurable mesh would also suffice. One floating-point operation on double precision numbers has cost A, and a transfer of n data items between two processors costs S + n/B. S stands for the startup time (the time to establish a connection between the processors) and B for the bandwidth. All other operations cost nothing.

3.3. Parallel computers and their parameters

The algorithms have been implemented on two different parallel computer systems. One is the PARSYTEC Megaframe with 16 processors and a PAR_C compiler. The underlying processor is the Transputer with a floating-point coprocessor. The measured system parameters are

A = 1ε, S = 4ε, B = 2/ε, ε = 10⁻⁵ s.

The hardware would allow a better floating-point rate, but the compiler (a preliminary version) does not seem to support operations on double precision numbers.

The second one is the DATIS-P, also with a C compiler; some parts had to be hand coded. This parallel computer is not commercial, therefore we describe it below. Its system parameters are

A = 1ε, S = 0.15ε, B = 1/ε, ε = 1.32·10⁻⁵ s.
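To plug numbers into this model, one might encode it as follows. This is my sketch, not part of the paper; the symbols A, S, B follow Sections 3.2 and 3.3, and the ε values are the readings 10⁻⁵ s (Megaframe) and 1.32·10⁻⁵ s (DATIS-P) from the text.

```python
# Cost model of Section 3.2: one double-precision flop costs A seconds,
# sending n data items between two processors costs S + n/B seconds.

def make_machine(eps, a_units, s_units, b_per_eps):
    """Return a parameter dictionary from the 'x times epsilon' notation."""
    return {"A": a_units * eps, "S": s_units * eps, "B": b_per_eps / eps}

MEGAFRAME = make_machine(1e-5, 1, 4, 2)        # A = 1e, S = 4e, B = 2/e
DATIS_P   = make_machine(1.32e-5, 1, 0.15, 1)  # A = 1e, S = 0.15e, B = 1/e

def flop_cost(machine, nflops):
    """Time for nflops floating-point operations."""
    return nflops * machine["A"]

def transfer_cost(machine, nwords):
    """Time to send nwords between two processors: startup plus volume."""
    return machine["S"] + nwords / machine["B"]
```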
3.4. DATIS-P - a parallel computer

The DATIS-P is a parallel computer developed at the University of Saarland, Saarbrücken. The machine is designed for a maximum of 256 node computers. Currently a prototype consisting of 32 nodes is running.

The DATIS-P is a distributed memory machine that may be classified as a MIMD computer (Multiple Instruction, Multiple Data). It is a symmetric multiprocessor, i.e. the slaves are physically identical and logically exchangeable.

A system master, which is basically identical to the node computers, controls the other node computers (slaves) via a common bus system. This bus allows the master to manipulate the dual-ported slave RAMs, using selective or broadcast accesses. The master is also used as the host, running a commercial multi-user operating system including editors, assembler, linker and C compiler. Each node is equipped with a VME bus interface, so that RAM boards or other extensions may be added (see Fig. 3).

[Fig. 3. DATIS-P architecture: master-slave bus and interconnection network.]

Communication between the slaves is realized by an 8-bit wide interconnection network. A three-stage permutation network of the Benes type, supporting the maximum of 256 slave computers, has been developed. Synchronous circuit switching realized by a hardware protocol is implemented. The network configuration is realized by distributed control from the slaves. The time necessary both for a change of the configuration and for a data exchange is comparable to the memory access time. The slaves, the master and the network use a common system clock generated by a special clock distribution board.

The design depends on standard chips: Motorola MC68020 CPU (16 MHz), Motorola MC68881 floating-point coprocessors per node (200 Kflops each), serial interface, disc interface (SCSI), VME bus interface, 1 Mbyte SRAM, 128 Kbyte EPROM.

One of the main goals of the DATIS-P project is the consideration of fault tolerance aspects: a slave working incorrectly or an error in some parts of the network does not necessarily force a computation to abort.

3.5. Application to Gaussian Elimination

The Wang decomposition is obtained when our method is applied to Gaussian Elimination. The sequential version consists of two phases. During the first phase the entries below the principal diagonal are eliminated, and during the second phase the remaining system is solved by backward substitution. It is easy to see that this can be done with 8n - 7 floating-point operations. For each new right hand side which is fixed consecutively, the same number of operations is necessary.

Together with our method it is better to decompose the second phase of the Gaussian Elimination into two separate parts. During the first part of the second phase the entries above the principal diagonal are eliminated, and during the second part the remaining system is solved. This means that system (2) is converted into diagonal form and Eq. (5) is used to determine the parameters and the linear combination.

Solving system (2), each processor has to treat three right hand sides. Making use of the special structure of the right hand sides, processor k only needs to perform operations of cost 12(n/p - 1)A, as the following sequence illustrates. During the first phase processor k computes the values

$$
a_i^k = a_i^k - \frac{c_i^k}{a_{i-1}^k}\, b_{i-1}^k, \qquad
d_i^k = d_i^k - \frac{c_i^k}{a_{i-1}^k}\, d_{i-1}^k, \qquad
f_i^k = -\frac{c_i^k}{a_{i-1}^k}\, f_{i-1}^k, \qquad
g_i^k = g_i^k - \frac{c_i^k}{a_{i-1}^k}\, g_{i-1}^k = g_i^k
$$

for each i = j + 1, ..., r, and during the second phase

$$
d_i^k = d_i^k - \frac{b_i^k}{a_{i+1}^k}\, d_{i+1}^k, \qquad
f_i^k = f_i^k - \frac{b_i^k}{a_{i+1}^k}\, f_{i+1}^k, \qquad
g_i^k = g_i^k - \frac{b_i^k}{a_{i+1}^k}\, g_{i+1}^k = -\frac{b_i^k}{a_{i+1}^k}\, g_{i+1}^k
$$

for each i = r - 1, ..., j.
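A direct transcription of these two phases could look as follows. This is my sketch with 0-based local indexing and plain Python lists; it performs the 12(n/p - 1) floating-point operations counted above and leaves the local system in the diagonal form used by Eq. (5).

```python
def eliminate_local(c, a, b, d):
    """Transform one processor's system (2) into diagonal form.

    c, a, b, d are that processor's pieces of the sub-, main-, super-diagonal
    and right hand side (length m = n/p).  The sparse extra right hand sides
    f^k = (c_j, 0, ..., 0) and g^k = (0, ..., 0, b_r) are built here.
    After the call, row i of the local system reads
        a[i] * x_i = d[i] - x_{j-1} * f[i] - x_{r+1} * g[i],
    i.e. equation (5).  The input lists are modified in place.
    """
    m = len(a)
    f = [0.0] * m
    f[0] = c[0]
    g = [0.0] * m
    g[-1] = b[-1]

    # First phase: eliminate the entries below the principal diagonal.
    for i in range(1, m):
        l = c[i] / a[i - 1]
        a[i] -= l * b[i - 1]
        d[i] -= l * d[i - 1]
        f[i] = -l * f[i - 1]   # f[i] was zero before this step
        # g[i] is unchanged: g[i-1] is still zero during the first phase.

    # Second phase: eliminate the entries above the principal diagonal.
    for i in range(m - 2, -1, -1):
        l = b[i] / a[i + 1]
        d[i] -= l * d[i + 1]
        f[i] -= l * f[i + 1]
        g[i] = -l * g[i + 1]   # g[i] was zero before this step

    return a, d, f, g
```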
In order to solve system (6) in parallel, each processor therefore needs operations of cost (46A + 2S + 12/B) log p - 21A. To determine the solution x according to Eq. (5), each processor needs 5(n/p - 2) floating-point operations; the first and the last value of x^k have already been computed during the previous steps. Altogether the parallel runtime

T_p = (17n/p - 43)A + log p (46A + 12/B + 2S)

results. If a system is solved with several right hand sides, then the operations on the vectors f and g are the same as for the first right hand side and their computation can be dropped. Thus only

T_p' = (13n/p - 39)A + log p (46A + 12/B + 2S)

is necessary for each additional right hand side.

3.6. Application to LU Factorization

LU Factorization is the fastest direct solver for tridiagonal systems. The algorithm is well suited for systems with more than one right hand side, even if they are fixed consecutively. First the matrix is factorized, A = L·U, and then two bidiagonal systems are solved:

Ax = d,  L(Ux) = d,  Ly = d and Ux = y,

where L is a lower and U an upper bidiagonal matrix. In the sequential case 8n - 7 floating-point operations are necessary to solve a system with one right hand side, and only 5n - 4 operations for each additional right hand side. Therefore the parallel version is of some interest (it seems to be a new algorithm).

To solve system (2) now means that each processor k factorizes A_k and solves the two bidiagonal systems for its three right hand sides. The second bidiagonal system is only converted into diagonal form, and for the following steps Eq. (5) is used. For the factorization each processor needs 3(n/p - 1) floating-point operations. L has only ones on its principal diagonal, which together with the sparse structure of f^k and g^k leads to 9(n/p - 1) floating-point operations to solve the factorized system (2). The second and the third step can be copied from the Wang decomposition. When solving system (8), for each new right hand side fixed consecutively the factorization of the matrix A and the operations on the vectors f and g can be dropped. Thus the parallel runtimes

T_p = (17n/p - 43)A + log p (46A + 12/B + 2S)  and  T_p' = (10n/p - 36)A + log p (46A + 12/B + 2S)

result.
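As an illustration of how these runtime formulas can be evaluated, the following sketch (mine, not the paper's) computes T_p and T_p' for a chosen machine; it assumes the Megaframe reading ε = 10⁻⁵ s from Section 3.3 and that log means log₂.

```python
from math import log2

def parallel_runtime(n, p, A, S, B, first=True, lu=False):
    """Theoretical runtimes of Sections 3.5/3.6 (hypothetical helper).

    first=True : first right hand side,
                 T_p  = (17 n/p - 43) A + log p (46 A + 12/B + 2 S)
    first=False: each additional right hand side,
                 Wang: (13 n/p - 39) A + log-p term,
                 LU:   (10 n/p - 36) A + log-p term.
    """
    G = n // p
    comm = log2(p) * (46 * A + 12 / B + 2 * S)
    if first:
        return (17 * G - 43) * A + comm
    return ((10 * G - 36) if lu else (13 * G - 39)) * A + comm

# Megaframe parameters (Section 3.3), assuming epsilon = 1e-5 s.
eps = 1e-5
A, S, B = 1 * eps, 4 * eps, 2 / eps

# Second problem, G = n/p = 1000, p = 8, 1000 right hand sides, LU variant:
# one first solve plus 999 additional ones.
t = parallel_runtime(8000, 8, A, S, B) \
    + 999 * parallel_runtime(8000, 8, A, S, B, first=False, lu=True)
print(round(t, 1))   # about 101.5 s, close to the theoretical LU entry in Table 3
```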
3.7. Comparison of both parallel algorithms

The algorithms are compared with respect to runtime and efficiency. The number of processors p and the granularity G = n/p are significant for the quality of the algorithms.

The results given in Tables 1-3 are quite good. This is caused by the small startup time (and the comparatively long time for a floating-point operation); the influence of the communication time can be neglected. The small deviation between the theoretical and the measured results is caused by some administrative overhead, like procedure calls, loops, address computations, etc. For p ≤ 16 the computer parameters of the PARSYTEC Megaframe are used to determine the theoretical values, and for p = 32 the DATIS-P parameters are used.

Table 1
Efficiency (%) for the first problem using Wang decomposition or LU factorization

             Theoretical                        Measured (Megaframe p ≤ 16, DATIS-P p = 32)
G \ p    4       8       16      32         4       8       16      32
500      46.61   46.3    45.98   45.67      47.0    46.4    46.2    46.0
1000     46.83   46.67   46.52   46.36      47.2    46.9    46.7    46.5

Table 2
Efficiency (%) for the second problem using Wang decomposition or LU factorization, G = 1000

                  Theoretical                Measured (Megaframe p ≤ 16, DATIS-P p = 32)
       τ \ p    8       16      32         8       16      32
Wang   100      60.69   60.42   60.32      58.8    58.3    59.6
       1000     60.85   60.58   60.48      59.1    58.54   59.9
LU     100      49.24   48.96   48.59      47.1    46.4    47.1
       1000     49.28   48.99   48.89      47.8    46.6    47.7

Table 3
Run time (sec) for the second problem using Wang decomposition or LU factorization, G = 1000

                  Theoretical                Measured (Megaframe p ≤ 16, DATIS-P p = 32)
       τ \ p    8       16      32         8       16      32
Wang   100      13.18   13.24   17.5       13.24   13.3    17.69
       1000     131.5   132.1   174.5      132.1   132.8   176.8
LU     100      10.2    10.3    13.5       10.4    10.5    13.54
       1000     101.5   102.1   135.07     103.9   104.7   135.3

4. Conclusion

Our method can be applied to any solver. It is also possible to generalize the method to systems with more than three diagonals; for m diagonals, m - 1 additional right hand sides are required. If m stays small, the algorithms are reasonably efficient. Block-tridiagonal systems may also be parallelized by this method; in that case the arithmetic operations concern vectors and matrices instead of real numbers.

References

[1] D.B. Gannon and J. Van Rosendale, On the impact of communication complexity on the design of parallel numerical algorithms, IEEE Trans. Comput. C-33 (12) (1984).
[2] A. Krechel, H.J. Plum and K. Stüben, Solving tridiagonal linear systems in parallel on local memory MIMD machines, Arbeitspapier der GMD 372, Gesellschaft für Mathematik und Datenverarbeitung, St. Augustin (FRG), 1989.
[3] J. Todd, Survey of Numerical Analysis (McGraw-Hill, New York, 1962) 419.