DOI: 10.1145/62297.62330

A novel approach to system-level fault tolerance in hypercube multiprocessors

Published: 01 January 1988

    Abstract

    This paper addresses fault tolerance in hypercube architectures. Most recently proposed fault-tolerance schemes for parallel architectures focus on reconfiguring the architecture once a faulty processor has been identified, and they assume an off-line diagnosis strategy that locates the faulty processor. We propose instead to detect and locate faulty processors concurrently with the execution of parallel applications on the hypercube, using a novel scheme of algorithm-based fault detection. The basic idea is to obtain low-cost fault detection and location through high-level encodings of the data, tailored to the algorithms being executed on the parallel machine. We have implemented system-level fault detection mechanisms for various parallel applications on a 16-processor Intel iPSC hypercube multiprocessor. In this paper we discuss the results for one such application, Gaussian elimination, a widely used algorithm for solving systems of linear equations. We have performed extensive studies of the fault coverage of our system-level fault detection schemes in the presence of finite-precision arithmetic, which affects our system-level encodings.
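
    The abstract does not spell out the encoding, so the following is only a minimal single-process sketch of one common form of algorithm-based fault detection, a row-checksum encoding applied to Gaussian elimination. It is not the paper's hypercube implementation. The function names, the `fault` injection hook, and the `rel_tol` threshold are all assumptions made for illustration. Each row of the augmented matrix carries an extra element equal to the sum of that row; because elimination uses only linear row operations, the checksum stays consistent unless a data element is corrupted. The tolerance is there because finite-precision arithmetic makes exact equality unusable.

```python
import numpy as np


def encode_row_checksums(a, b):
    """Return [A | b | c], where c[i] is the sum of row i of [A | b]."""
    aug = np.hstack([a, b.reshape(-1, 1)]).astype(float)
    return np.hstack([aug, aug.sum(axis=1, keepdims=True)])


def checksum_residuals(enc):
    """Absolute mismatch between each row's data sum and its checksum."""
    return np.abs(enc[:, :-1].sum(axis=1) - enc[:, -1])


def eliminate_with_checks(a, b, rel_tol=1e-9, fault=None):
    """Forward elimination with partial pivoting on checksum-encoded rows.

    fault is a hypothetical test hook: (step, row, col, delta) adds delta
    to one data element after the row updates of that step.
    Returns the encoded matrix and a list of (step, row) pairs whose
    checksum residual exceeded the tolerance.
    """
    enc = encode_row_checksums(a, b)
    n = enc.shape[0]
    flagged = []
    for k in range(n - 1):
        # Partial pivoting: swap whole encoded rows, checksum included.
        p = k + np.argmax(np.abs(enc[k:, k]))
        enc[[k, p]] = enc[[p, k]]
        # The row update is applied to the checksum column as well,
        # which is what preserves the invariant.
        for i in range(k + 1, n):
            enc[i] -= (enc[i, k] / enc[k, k]) * enc[k]
        if fault is not None and fault[0] == k:
            _, row, col, delta = fault
            enc[row, col] += delta
        # Threshold scaled to the data magnitude to absorb roundoff.
        tol = rel_tol * np.abs(enc).max()
        bad = np.nonzero(checksum_residuals(enc) > tol)[0]
        flagged.extend((k, int(i)) for i in bad)
    return enc, flagged
```

    In this sketch the first flagged pair gives the step and row where the mismatch first appears. A threshold that is too tight raises false alarms from roundoff alone, while one that is too loose lets small errors escape, which is the kind of coverage trade-off the abstract attributes to finite-precision arithmetic.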

    Published In

    C3P: Proceedings of the third conference on Hypercube concurrent computers and applications: Architecture, software, computer systems, and general issues - Volume 1
    January 1988
    895 pages
    ISBN: 0897912780
    DOI: 10.1145/62297
    Editor: Geoffrey Fox
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Conference

    Hypercube88

    Cited By

    • (1997) Optimal Polling in Communication Networks. IEEE Transactions on Parallel and Distributed Systems 8:5 (449-461). DOI: 10.1109/71.598273. Online publication date: 1-May-1997
    • (1994) Optimal polling in communication networks. Proceedings of the 1994 6th IEEE Symposium on Parallel and Distributed Processing (224-231). DOI: 10.1109/SPDP.1994.346163. Online publication date: 26-Oct-1994
    • (1990) Tradeoffs in the Design of Efficient Algorithm-Based Error Detection Schemes for Hypercube Multiprocessors. IEEE Transactions on Software Engineering 16:2 (183-196). DOI: 10.1109/32.44381. Online publication date: 1-Feb-1990
    • (1990) Compiler-Assisted Synthesis of Algorithm-Based Checking in Multiprocessors. IEEE Transactions on Computers 39:4 (436-446). DOI: 10.1109/12.54837. Online publication date: 1-Apr-1990
    • (1989) Algorithm-based error detection for signal processing applications on a hypercube multiprocessor. [1989] Proceedings. Real-Time Systems Symposium (134-143). DOI: 10.1109/REAL.1989.63564. Online publication date: 1989
