DOI: 10.1145/62297.62330

A novel approach to system-level fault tolerance in hypercube multiprocessors

Published: 01 January 1988

Abstract

This paper addresses fault tolerance in hypercube architectures. Most recently proposed fault-tolerance schemes for parallel architectures focus on reconfiguring the architecture once a faulty processor has been identified; they assume the existence of an off-line diagnosis strategy that locates the faulty processor. We instead propose detecting and locating faulty processors concurrently with the execution of parallel applications on the hypercube, using a novel scheme of algorithm-based fault detection. The basic idea is to build low-cost fault detection and location schemes from high-level encodings of the data that are tailored to the algorithms being executed on the parallel machine. We have implemented system-level fault detection mechanisms for various parallel applications on a 16-processor Intel iPSC hypercube multiprocessor. In this paper we discuss the results for one such application, Gaussian elimination, a widely used algorithm for solving systems of linear equations. We have performed extensive studies of the fault coverage of our system-level fault detection schemes in the presence of finite-precision arithmetic, which affects our system-level encodings.
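To make the idea of a high-level encoding concrete, here is a minimal single-process sketch in Python. It is not the paper's implementation: the abstract does not specify the encoding, so the sketch assumes a plain row-checksum column, and the names `add_row_checksums`, `checksum_errors`, `eliminate`, and the `fault` injection parameter are hypothetical. Each row carries the sum of its entries; because Gaussian elimination only swaps rows and adds multiples of one row to another, the checksum stays valid in exact arithmetic, so a mismatch larger than a rounding tolerance points to the row where an error occurred.

```python
def add_row_checksums(a):
    """Append to each row the sum of its entries (the assumed encoding)."""
    return [row + [sum(row)] for row in a]


def checksum_errors(aug, tol):
    """Return indices of rows whose checksum disagrees with the row sum.

    The comparison uses a tolerance scaled by the largest entry in the row,
    because finite-precision arithmetic makes exact equality unreliable.
    """
    bad = []
    for i, row in enumerate(aug):
        data, check = row[:-1], row[-1]
        scale = max(1.0, max(abs(x) for x in data))
        if abs(sum(data) - check) > tol * scale:
            bad.append(i)
    return bad


def eliminate(aug, tol=1e-9, fault=None):
    """Forward elimination with partial pivoting on a checksum-encoded matrix.

    Assumes a non-singular matrix. `fault` is a hypothetical test hook:
    (step, row, col, delta) adds `delta` to one entry after the given step
    to imitate a faulty processor. Returns (matrix, report), where report
    is None or (step, list_of_inconsistent_rows).
    """
    n = len(aug)
    m = [row[:] for row in aug]
    for k in range(n):
        # Partial pivoting: a row swap keeps every row's checksum valid.
        p = max(range(k, n), key=lambda i: abs(m[i][k]))
        m[k], m[p] = m[p], m[k]
        for i in range(k + 1, n):
            f = m[i][k] / m[k][k]
            # The same row operation is applied to the checksum entry.
            m[i] = [x - f * y for x, y in zip(m[i], m[k])]
        if fault is not None and fault[0] == k:
            _, r, c, delta = fault
            m[r][c] += delta
        bad = checksum_errors(m, tol)
        if bad:
            return m, (k, bad)
    return m, None


a = [[2.0, 1.0, -1.0], [-3.0, -1.0, 2.0], [-2.0, 1.0, 2.0]]
encoded = add_row_checksums(a)
_, clean_report = eliminate(encoded)
_, faulty_report = eliminate(encoded, fault=(0, 2, 1, 0.5))
```

The tolerance `tol` is where finite precision enters: set too tight, rounding noise is flagged as a fault; set too loose, small real errors escape detection. On a hypercube, the rows would be distributed across processors and the check would be carried out by a processor other than the one that computed the row, which this sequential sketch does not model.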

Published In

C3P: Proceedings of the third conference on Hypercube concurrent computers and applications: Architecture, software, computer systems, and general issues - Volume 1
January 1988
895 pages
ISBN: 0897912780
DOI: 10.1145/62297
Editor: Geoffrey Fox
Publisher: Association for Computing Machinery, New York, NY, United States

Conference: Hypercube88

Cited By

  • (1997) Optimal Polling in Communication Networks. IEEE Transactions on Parallel and Distributed Systems 8:5 (449-461). DOI: 10.1109/71.598273. Online publication date: 1-May-1997
  • (1994) Optimal polling in communication networks. Proceedings of the 1994 6th IEEE Symposium on Parallel and Distributed Processing (224-231). DOI: 10.1109/SPDP.1994.346163. Online publication date: 26-Oct-1994
  • (1990) Tradeoffs in the Design of Efficient Algorithm-Based Error Detection Schemes for Hypercube Multiprocessors. IEEE Transactions on Software Engineering 16:2 (183-196). DOI: 10.1109/32.44381. Online publication date: 1-Feb-1990
  • (1990) Compiler-Assisted Synthesis of Algorithm-Based Checking in Multiprocessors. IEEE Transactions on Computers 39:4 (436-446). DOI: 10.1109/12.54837. Online publication date: 1-Apr-1990
  • (1989) Algorithm-based error detection for signal processing applications on a hypercube multiprocessor. [1989] Proceedings. Real-Time Systems Symposium (134-143). DOI: 10.1109/REAL.1989.63564. Online publication date: 1989
