A novel approach to system-level fault tolerance in hypercube multiprocessors

Authors:

C. B. Stunkel

C³P: Proceedings of the third conference on Hypercube concurrent computers and applications: Architecture, software, computer systems, and general issues - Volume 1

Pages 307 - 311

https://doi.org/10.1145/62297.62330

Published: 01 January 1988

Abstract

This paper addresses fault tolerance in hypercube architectures. Most recently proposed fault-tolerance schemes for parallel architectures focus on reconfiguring the architecture once a faulty processor has been identified, and they assume an off-line diagnosis strategy that locates the faulty processor. We propose instead to detect and locate faulty processors concurrently with the execution of parallel applications on the hypercube, using a novel scheme of algorithm-based fault detection. The basic idea is to obtain low-cost fault detection and location by applying high-level encodings to the data, tailored to the algorithms being executed on the parallel machine. We have implemented system-level fault detection mechanisms for various parallel applications on a 16-processor Intel iPSC hypercube multiprocessor. In this paper we discuss the results for one such application, Gaussian elimination, an extremely useful algorithm for solving systems of linear equations. We have also performed extensive studies of the fault coverage of our system-level fault detection schemes in the presence of finite-precision arithmetic, which affects our system-level encodings.
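To make the idea of a data encoding tailored to Gaussian elimination concrete, the following is a minimal single-process sketch, not the authors' implementation. It assumes a row-checksum encoding (one extra column holding each row's sum), a relative tolerance `tol` to absorb finite-precision round-off, and a hypothetical `fault` tuple that injects an error to stand in for a faulty processor. All function and parameter names are illustrative, and the input system is assumed nonsingular.

```python
def add_row_checksums(aug):
    """Append to each row the sum of its entries (assumed row-checksum encoding)."""
    return [row + [sum(row)] for row in aug]


def inconsistent_rows(enc, tol):
    """Return indices of rows whose checksum entry disagrees with the row sum.

    The comparison uses a relative tolerance because round-off makes the
    checksum only approximately equal to the recomputed sum.
    """
    bad = []
    for i, row in enumerate(enc):
        scale = max(1.0, max(abs(x) for x in row))
        if abs(sum(row[:-1]) - row[-1]) > tol * scale:
            bad.append(i)
    return bad


def eliminate_with_checks(aug, tol=1e-9, fault=None):
    """Forward elimination with partial pivoting on a checksum-encoded matrix.

    `fault` is a hypothetical (step, row, col, delta) tuple used only to
    simulate a corrupted value. Returns the encoded matrix and a list of
    (step, bad_row_indices) pairs for every step where a check failed.
    """
    enc = add_row_checksums(aug)
    n = len(enc)
    detections = []
    for k in range(n):
        # Partial pivoting: swapping whole rows keeps each checksum valid.
        p = max(range(k, n), key=lambda r: abs(enc[r][k]))
        enc[k], enc[p] = enc[p], enc[k]
        for i in range(k + 1, n):
            f = enc[i][k] / enc[k][k]
            # The row operation is applied to the checksum column as well,
            # so a fault-free row still sums to its checksum.
            enc[i] = [x - f * y for x, y in zip(enc[i], enc[k])]
        if fault is not None and fault[0] == k:
            _, r, c, delta = fault
            enc[r][c] += delta  # simulated fault
        bad = inconsistent_rows(enc, tol)
        if bad:
            detections.append((k, bad))
    return enc, detections


if __name__ == "__main__":
    system = [[2.0, 1.0, -1.0, 8.0],
              [-3.0, -1.0, 2.0, -11.0],
              [-2.0, 1.0, 2.0, -3.0]]
    print(eliminate_with_checks(system)[1])
    print(eliminate_with_checks(system, fault=(1, 2, 2, 0.5))[1])
```

In this sketch the index of an inconsistent row plays the role of fault location: on a multiprocessor where rows are distributed across nodes, it would point to the node holding that row. The choice of `tol` reflects the trade-off the abstract mentions, since a tolerance that is too tight flags round-off as faults while one that is too loose lets small errors escape detection.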

Published In

C³P: Proceedings of the third conference on Hypercube concurrent computers and applications: Architecture, software, computer systems, and general issues - Volume 1

January 1988

895 pages

ISBN:0897912780

DOI:10.1145/62297

Editor:
Geoffrey Fox
California Institute of Technology, Pasadena

Copyright © 1988 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGARCH: ACM Special Interest Group on Computer Architecture

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Article

Conference

Hypercube88

Sponsor:

SIGARCH

Hypercube88: Third Conference on Hypercube Concurrent Computers and Applications

January 19 - 20, 1988

Pasadena, California, USA

Cited By

Rescigno A (1997). Optimal Polling in Communication Networks. IEEE Transactions on Parallel and Distributed Systems 8:5 (449-461). DOI: 10.1109/71.598273. Online publication date: 1-May-1997.
https://dl.acm.org/doi/10.1109/71.598273

Rescigno (1994). Optimal polling in communication networks. Proceedings of the 1994 6th IEEE Symposium on Parallel and Distributed Processing (224-231). DOI: 10.1109/SPDP.1994.346163. Online publication date: 26-Oct-1994.
https://dl.acm.org/doi/10.1109/SPDP.1994.346163

Balasubramanian V, Banerjee P (1990). Tradeoffs in the Design of Efficient Algorithm-Based Error Detection Schemes for Hypercube Multiprocessors. IEEE Transactions on Software Engineering 16:2 (183-196). DOI: 10.1109/32.44381. Online publication date: 1-Feb-1990.
https://dl.acm.org/doi/10.1109/32.44381

Balasubramanian V, Banerjee P (1990). Compiler-Assisted Synthesis of Algorithm-Based Checking in Multiprocessors. IEEE Transactions on Computers 39:4 (436-446). DOI: 10.1109/12.54837. Online publication date: 1-Apr-1990.
https://dl.acm.org/doi/10.1109/12.54837

Balasubramanian V, Banerjee P (1989). Algorithm-based error detection for signal processing applications on a hypercube multiprocessor. [1989] Proceedings. Real-Time Systems Symposium (134-143). DOI: 10.1109/REAL.1989.63564. Online publication date: 1989.
https://doi.org/10.1109/REAL.1989.63564
