Framework for Enabling Highly Available Distributed Applications for Utility Computing

Lakshmi, J.; Nandy, S. K.; Narayan, Ranjani; Varadarajan, Keshavan

doi:10.1007/11946441_52

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 4330)

Included in the following conference series:

International Symposium on Parallel and Distributed Processing and Applications

Abstract

The move towards IT outsourcing is the first step towards an environment where compute infrastructure is treated as a service. In utility computing, this IT service has to honor Service Level Agreements (SLAs) in order to meet the desired Quality of Service (QoS) guarantees. Such an environment requires reliable services in order to maximize the utilization of resources and to decrease the Total Cost of Ownership (TCO). Such reliability cannot come at the cost of resource duplication, since duplication increases the TCO of the data center and hence the cost per compute unit. In this paper, we look into projecting the impact of hardware failures on SLAs and into the techniques required to take proactive recovery steps when a failure is predicted. By maintaining health vectors of all hardware and system resources, we predict the failure probability of resources at runtime, based on observed hardware errors and failure events. This in turn influences an availability-aware middleware to take proactive action (even before the application is affected, in case the system and the application have low recoverability).

The proposed framework has been prototyped on a system running HP-UX. Our offline analysis of the prediction system on hardware error logs indicates no more than 10% false positives. This work, to the best of our knowledge, is the first of its kind to perform an end-to-end analysis of the impact of a hardware fault on application SLAs in a live system.
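
The health-vector idea described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the prediction model, so everything below is an assumption made for illustration: the `HealthVector` class, the decay factor, the exponential mapping from error score to probability, the 0.8 threshold, and the `plan_action` helper are all hypothetical and are not taken from the paper.

```python
import math
from dataclasses import dataclass, field


@dataclass
class HealthVector:
    """Hypothetical per-resource record of decayed error scores by event type."""
    decay: float = 0.5  # assumed per-tick decay factor, not from the paper
    scores: dict = field(default_factory=dict)

    def tick(self) -> None:
        # Age every score so that old errors count less than recent ones.
        for kind in self.scores:
            self.scores[kind] *= self.decay

    def record(self, kind: str, weight: float = 1.0) -> None:
        # Add an observed hardware error or failure event to the vector.
        self.scores[kind] = self.scores.get(kind, 0.0) + weight

    def failure_probability(self) -> float:
        # Illustrative mapping only: 1 - exp(-total score), which lies in [0, 1).
        return 1.0 - math.exp(-sum(self.scores.values()))


def plan_action(prob: float, threshold: float = 0.8) -> str:
    # Stand-in for the availability-aware middleware decision (assumed policy).
    return "migrate" if prob >= threshold else "monitor"


hv = HealthVector()
hv.record("memory_ecc")
hv.record("memory_ecc")
print(plan_action(hv.failure_probability()))  # 1 - e^-2 is about 0.865, so "migrate"
```

The decay step is one possible way to let a resource recover its standing once errors stop. A real system would presumably calibrate both the mapping and the threshold against observed error logs.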

Author information

Authors and Affiliations

Indian Institute of Science, Bangalore, India
J. Lakshmi, S. K. Nandy & Keshavan Varadarajan
Morphing Machines, Bangalore, India
Ranjani Narayan

Editor information

Editors and Affiliations

Department of Computer Science and Engineering, Shanghai Jiao Tong University, 200030, Shanghai, China
Minyi Guo
Department of Computer Science, St. Francis Xavier University, Antigonish, Canada
Laurence T. Yang
Dipartimento di Ingegneria dell’Informazione, Second University of Naples, Real Casa dell’Annunziata, via Roma 29, 81031 Aversa (CE), Italy
Beniamino Di Martino
Institute of Scientific Computing, University of Vienna, Nordbergstr. 15/C/3, A-1090, Vienna, Austria/JPL, Caltech, USA
Hans P. Zima
Computer Science Department, University of Tennessee, TN 37996-3450, Knoxville, USA
Jack Dongarra
Grid Computing Center, Shanghai Jiao Tong University, 800 Dongchuan Road, 200240, Shanghai, China
Feilong Tang

About this paper

Cite this paper

Lakshmi, J., Nandy, S.K., Narayan, R., Varadarajan, K. (2006). Framework for Enabling Highly Available Distributed Applications for Utility Computing. In: Guo, M., Yang, L.T., Di Martino, B., Zima, H.P., Dongarra, J., Tang, F. (eds) Parallel and Distributed Processing and Applications. ISPA 2006. Lecture Notes in Computer Science, vol 4330. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11946441_52

DOI: https://doi.org/10.1007/11946441_52
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-68067-3
Online ISBN: 978-3-540-68070-3
eBook Packages: Computer Science, Computer Science (R0)
