Performance and Scalability of Preconditioned Conjugate Gradient Methods on Parallel Computers

Published: 01 May 1995

Abstract

This paper analyzes the performance and scalability of an iteration of the preconditioned conjugate gradient (PCG) algorithm on parallel architectures with a variety of interconnection networks, such as the mesh, the hypercube, and that of the CM-5 parallel computer. It is shown that for block-tridiagonal matrices resulting from two-dimensional finite difference grids, the communication overhead due to vector inner products dominates the communication overheads of the rest of the computation on a large number of processors. However, with a suitable mapping, the parallel formulation of a PCG iteration is highly scalable for such matrices on a machine like the CM-5, whose fast control network practically eliminates the overheads due to inner product computation. The use of the truncated incomplete Cholesky (IC) preconditioner can lead to a further improvement in scalability on the CM-5 by a constant factor. As a result, a parallel formulation of the PCG algorithm with the IC preconditioner may execute faster than one with a simple diagonal preconditioner, even if the latter runs faster in a serial implementation. For matrices resulting from three-dimensional finite difference grids, the scalability is quite good on a hypercube or the CM-5, but not as good on a 2-D mesh architecture. In the case of unstructured sparse matrices with a constant number of nonzero elements in each row, the parallel formulation of the PCG iteration is unscalable on any message-passing parallel architecture unless some ordering is applied to the sparse matrix. The parallel system can be made scalable either if, after reordering, the nonzero elements of the $N\times N$ matrix can be confined to a band whose width is $O(N^y)$ for any $y < 1$, or if the number of nonzero elements per row increases as $N^x$ for any $x > 0$. Scalability increases as the number of nonzero elements per row is increased and/or the width of the band containing these elements is reduced. For unstructured sparse matrices, the scalability is asymptotically the same for all architectures. Many of these analytical results are experimentally verified on the CM-5 parallel computer.
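To make the cost structure concrete, here is a minimal sketch of one PCG iteration with a diagonal (Jacobi) preconditioner, written in NumPy for clarity; the function and variable names are illustrative, not the authors' implementation. The two inner products are the operations that require global reductions on a message-passing machine, while the sparse matrix-vector product needs only boundary exchanges between neighboring processors for finite difference matrices.

```python
# A minimal PCG-iteration sketch (illustrative names, not the paper's code).
# On a message-passing machine, each np.dot() call implies a global
# reduction across all processors; A @ p needs only nearest-neighbor
# boundary exchanges for a finite difference matrix.
import numpy as np

def pcg_iteration(A, x, r, p, rho_old, M_inv_diag):
    """Advance PCG by one step; M_inv_diag is the inverse diagonal preconditioner."""
    q = A @ p                        # sparse matrix-vector product (neighbor communication)
    alpha = rho_old / np.dot(p, q)   # first inner product -> global reduction
    x = x + alpha * p                # update the solution estimate
    r = r - alpha * q                # update the residual
    z = M_inv_diag * r               # diagonal preconditioning: purely local work
    rho = np.dot(r, z)               # second inner product -> global reduction
    p = z + (rho / rho_old) * p      # new search direction
    return x, r, p, rho
```

On the CM-5, whose control network performs such reductions in hardware, the inner-product overhead that the paper shows dominating on a mesh or hypercube at large processor counts is practically eliminated.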


Reviews

Dorothy Bollman

An important problem in scientific computing is solving large sparse systems of linear equations $Ax = b$. In this fundamental research paper, the authors study the performance and scalability of parallel formulations of an iteration of the preconditioned conjugate gradient (PCG) algorithm for solving such systems. The scalability metrics used are the efficiency, defined by $E = \frac{W}{W + T_0(W, p)}$, where $W$ is the problem size and $T_0(W, p)$ is the overhead incurred using $p$ processors; and the isoefficiency function, which relates the problem size to the number of processors necessary to maintain a fixed efficiency. The latter metric, which was defined by the first two authors in previous work, allows one to predict the performance of a parallel algorithm on a large number of processors by testing its performance on just a few processors. The authors study two types of linear systems. First, they consider block-tridiagonal matrices $A$ resulting from square two-dimensional finite difference grids, and they analyze two methods for mapping data onto processors, as well as two different types of preconditioners. The second type of matrices considered is unstructured sparse matrices with a constant number of nonzero elements per row. Overhead costs for processor communication and isoefficiency analyses are studied for mesh, hypercube, and CM-5 architectures. The authors experimentally verify some of their analytic results on a CM-5. For example, for a suitable mapping, the parallel version of a PCG iteration is very scalable on a CM-5 for the block-tridiagonal case, especially with the truncated incomplete Cholesky preconditioner. The unstructured sparse case is unscalable on any message-passing architecture unless the nonzero elements can be reordered to satisfy certain constraints. It would be of interest to test the results on other architectures, such as the mesh and the hypercube.
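The two metrics the review describes can be illustrated with a toy overhead model; the model $T_0(W, p) = c\,p \log_2 p$ below is a hypothetical example, not taken from the paper, which derives the actual overhead terms per architecture and mapping.

```python
# Toy illustration of efficiency and isoefficiency (hypothetical overhead
# model, NOT from the paper). Efficiency: E = W / (W + T_0(W, p)).
# Isoefficiency: holding E fixed forces W to grow in proportion to T_0,
# i.e., W = (E / (1 - E)) * T_0(W, p).
import math

C = 10.0  # assumed per-processor overhead constant (arbitrary)

def efficiency(W, p):
    T_0 = C * p * math.log2(p)   # hypothetical overhead model
    return W / (W + T_0)

E_target = 0.8
K = E_target / (1 - E_target)    # = 4 for E = 0.8
for p in (4, 16, 64, 256):
    W = K * C * p * math.log2(p) # problem size dictated by the isoefficiency relation
    print(f"p = {p:4d}   W needed = {W:9.1f}   E = {efficiency(W, p):.3f}")
```

Running this prints a constant $E = 0.800$ down the column: growing $W$ at the rate the isoefficiency function prescribes is exactly what keeps efficiency fixed as processors are added.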



Published In

IEEE Transactions on Parallel and Distributed Systems, Volume 6, Issue 5
May 1995
117 pages
ISSN: 1045-9219

Publisher

IEEE Press


Qualifiers

  • Research-article

Cited By

  • (2015) "A Novel Method for Scaling Iterative Solvers: Avoiding Latency Overhead of Parallel Sparse-Matrix Vector Multiplies," IEEE Transactions on Parallel and Distributed Systems, vol. 26, no. 3, pp. 632-645, Mar. 2015. DOI: 10.1109/TPDS.2014.2311804
  • (2014) "Computer performance analysis and the Pi Theorem," Computer Science - Research and Development, vol. 29, no. 1, pp. 45-71, Feb. 2014. DOI: 10.1007/s00450-010-0147-8
  • (2011) "Self-similarity of parallel machines," Parallel Computing, vol. 37, no. 2, pp. 69-84, Feb. 2011. DOI: 10.1016/j.parco.2010.11.003
  • (2008) "Improving the Performance of Multiple Conjugate Gradient Solvers by Exploiting Overlap," in Proc. 14th Int. Euro-Par Conf. Parallel Processing, pp. 688-697, Aug. 2008. DOI: 10.1007/978-3-540-85451-7_74
  • (2005) "Analysis of Parallel Preconditioned Conjugate Gradient Algorithms," Informatica, vol. 16, no. 3, pp. 317-332, Aug. 2005. DOI: 10.5555/1413792.1413793
  • (2001) "The Parallel Algorithm of Conjugate Gradient Method," in Proc. NATO Advanced Research Workshop on Advanced Environments, Tools, and Applications for Cluster Computing (Revised Papers), pp. 156-165, Sep. 2001. DOI: 10.5555/645642.664193
  • (2000) "On the Influence of Start-Up Costs in Scheduling Divisible Loads on Bus Networks," IEEE Transactions on Parallel and Distributed Systems, vol. 11, no. 12, pp. 1288-1305, Dec. 2000. DOI: 10.1109/71.895794
  • (2000) "Parallel Krylov Methods for Econometric Model Simulation," Computational Economics, vol. 16, no. 1-2, pp. 173-186, Oct. 2000. DOI: 10.1023/A:1008722026136
  • (1998) "Mapping Conjugate Gradient Algorithms for Neutron Diffusion Applications onto SIMD, MIMD, and Mixed-Mode Machines," International Journal of Parallel Programming, vol. 26, no. 2, pp. 183-207, Apr. 1998. DOI: 10.1023/A:1018796903553
  • (1997) "Relationships Between Efficiency and Execution Time of Full Multigrid Methods on Parallel Computers," IEEE Transactions on Parallel and Distributed Systems, vol. 8, no. 6, pp. 562-573, Jun. 1997. DOI: 10.1109/71.595573
