DOI: 10.1145/232973.232981

COMA: an opportunity for building fault-tolerant scalable shared memory multiprocessors

Published: 01 May 1996

Abstract

    Due to the increasing number of their components, Scalable Shared Memory Multiprocessors (SSMMs) have a very high probability of experiencing failures. Tolerating node failures therefore becomes very important for these architectures, particularly if they must be used for long-running computations. In this paper, we show that Cache Only Memory Architectures (COMAs) are good candidates for building fault-tolerant SSMMs. A backward error recovery strategy can be implemented without significant hardware modification to previously proposed COMAs by exploiting their standard replication mechanisms and extending the coherence protocol to transparently manage recovery data. Evaluation of the proposed fault-tolerant COMA is based on execution-driven simulations using some of the SPLASH applications. We show that, for the simulated architecture, the performance degradation caused by the fault-tolerance mechanisms varies from 5% in the best case to 35% in the worst case. The standard memory behavior is only slightly perturbed. Moreover, the results show that the proposed scheme preserves the scalability of the architecture and that the memory overhead remains low for parallel applications using mostly shared data.
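
    The general idea of backward error recovery over replicated data can be illustrated with a small sketch. This is not the protocol from the paper: the names `Node`, `Machine`, `checkpoint`, `fail`, and `rollback` are hypothetical, the placement of copies is an arbitrary choice, and the coherence protocol is abstracted away entirely. The only property shown is that if every recovery copy lives on two distinct nodes, the last recovery point survives the loss of any single node.

```python
class Node:
    """A processing node with local memory that holds recovery copies."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.alive = True
        self.recovery = {}  # address -> value saved at the last recovery point


class Machine:
    """Toy model (assumption): current data is location-free, recovery data is per node."""

    def __init__(self, n_nodes):
        if n_nodes < 2:
            raise ValueError("need at least two nodes to replicate recovery data")
        self.nodes = [Node(i) for i in range(n_nodes)]
        self.current = {}  # address -> current (possibly uncommitted) value

    def write(self, addr, value):
        self.current[addr] = value

    def _holders(self, live, addr):
        # Arbitrary placement rule for this sketch: two neighbouring live nodes.
        start = hash(addr) % len(live)
        return live[start], live[(start + 1) % len(live)]

    def checkpoint(self):
        """Establish a new recovery point: two copies of every item on distinct nodes."""
        live = [n for n in self.nodes if n.alive]
        if len(live) < 2:
            raise RuntimeError("fewer than two live nodes remain")
        fresh = {n.node_id: {} for n in live}
        for addr, value in self.current.items():
            for holder in self._holders(live, addr):
                fresh[holder.node_id][addr] = value
        # Swap in the new recovery data only once it is complete.
        for n in live:
            n.recovery = fresh[n.node_id]

    def fail(self, node_id):
        """Simulate a node failure: the node's memory contents are lost."""
        node = self.nodes[node_id]
        node.alive = False
        node.recovery = {}

    def rollback(self):
        """Restore the last recovery point from the surviving copies."""
        restored = {}
        for n in self.nodes:
            if n.alive:
                restored.update(n.recovery)
        self.current = restored
        self.checkpoint()  # re-replicate so every item has two copies again


m = Machine(4)
m.write("x", 1)
m.write("y", 2)
m.checkpoint()
m.write("x", 99)   # modification made after the recovery point
m.fail(0)
m.rollback()
assert m.current == {"x": 1, "y": 2}
```

    The sketch keeps recovery copies in ordinary node memory rather than in dedicated stable storage, which loosely mirrors the abstract's point about reusing existing replication mechanisms. How the real scheme places and tracks recovery data is described in the paper itself, not here.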



    Published In

    ISCA '96: Proceedings of the 23rd annual international symposium on Computer architecture
    May 1996
    318 pages
    ISBN: 0897917863
    DOI: 10.1145/232973

    ACM SIGARCH Computer Architecture News, Volume 24, Issue 2
    Special Issue: Proceedings of the 23rd annual international symposium on Computer architecture (ISCA '96)
    May 1996
    303 pages
    ISSN: 0163-5964
    DOI: 10.1145/232974

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. Scalable Shared Memory Multiprocessors
    2. backward error recovery
    3. coherence protocol
    4. fault-tolerance

    Qualifiers

    • Article

    Conference

    ISCA96: International Conference on Computer Architecture
    May 22 - 24, 1996
    Philadelphia, Pennsylvania, USA

    Acceptance Rates

    Overall Acceptance Rate 543 of 3,203 submissions, 17%


    Cited By

    • (2020) On Providing OS Support to Allow Transparent Use of Traditional Programming Models for Persistent Memory. ACM Journal on Emerging Technologies in Computing Systems 16:3 (1-24). DOI: 10.1145/3388637. Online publication date: 23-Jun-2020
    • (2020) ACR: Amnesic Checkpointing and Recovery. 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA) (30-43). DOI: 10.1109/HPCA47549.2020.00013. Online publication date: Feb-2020
    • (2015) ThyNVM. Proceedings of the 48th International Symposium on Microarchitecture (672-685). DOI: 10.1145/2830772.2830802. Online publication date: 5-Dec-2015
    • (2011) Rebound. ACM SIGARCH Computer Architecture News 39:3 (153-164). DOI: 10.1145/2024723.2000083. Online publication date: 4-Jun-2011
    • (2011) Rebound. Proceedings of the 38th annual international symposium on Computer architecture (153-164). DOI: 10.1145/2000064.2000083. Online publication date: 4-Jun-2011
    • (2006) SWICH. IEEE Micro 26:5 (28-40). DOI: 10.1109/MM.2006.100. Online publication date: 1-Sep-2006
    • (2006) ReViveI/O: Efficient Handling of I/O in Highly-Available Rollback-Recovery Servers. The Twelfth International Symposium on High-Performance Computer Architecture, 2006 (203-214). DOI: 10.1109/HPCA.2006.1598129. Online publication date: 2006
    • (2003) Modeling and evaluating the time overhead induced by BER in COMA multiprocessors. Journal of Systems Architecture: the EUROMICRO Journal 48:13-15 (377-385). DOI: 10.1016/S1383-7621(03)00024-9. Online publication date: 1-May-2003
    • (2002) ReVive. Proceedings of the 29th annual international symposium on Computer architecture (111-122). DOI: 10.5555/545215.545228. Online publication date: 25-May-2002
    • (2002) ReVive. ACM SIGARCH Computer Architecture News 30:2 (111-122). DOI: 10.1145/545214.545228. Online publication date: 1-May-2002
