
Framework for Enabling Highly Available Distributed Applications for Utility Computing

  • Conference paper
Parallel and Distributed Processing and Applications (ISPA 2006)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 4330)

Abstract

The move towards IT outsourcing is the first step towards an environment where compute infrastructure is treated as a service. In utility computing, this IT service has to honor Service Level Agreements (SLAs) in order to meet the desired Quality of Service (QoS) guarantees. Such an environment requires reliable services in order to maximize resource utilization and decrease the Total Cost of Ownership (TCO). This reliability cannot come at the cost of resource duplication, since duplication increases the TCO of the data center and hence the cost per compute unit. In this paper, we examine how to project the impact of hardware failures on SLAs, and the techniques required to take proactive recovery steps when a failure is predicted. By maintaining health vectors of all hardware and system resources, we predict the failure probability of resources at runtime, based on observed hardware errors and failure events. This in turn drives an availability-aware middleware to take proactive action, even before the application is affected when the system and the application have low recoverability.

The proposed framework has been prototyped on a system running HP-UX. Our offline analysis of the prediction system on hardware error logs indicates no more than 10% false positives. To the best of our knowledge, this work is the first of its kind to perform an end-to-end analysis of the impact of a hardware fault on application SLAs in a live system.
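
The abstract does not give the structure of a health vector or the prediction model, so the following is only a minimal sketch of the general idea under stated assumptions: observed error events accumulate into a per-resource score, the score is mapped to a failure probability, and a middleware policy decides whether to act before the application is affected. The severity weights, the mapping `1 - exp(-score)`, the thresholds, and all names (`HealthVector`, `choose_action`, the action labels) are hypothetical and are not taken from the paper.

```python
import math
from dataclasses import dataclass, field

# Hypothetical severity weights per error class; not taken from the paper.
SEVERITY = {"corrected_ecc": 0.2, "uncorrected_ecc": 2.0, "io_timeout": 0.5}


@dataclass
class HealthVector:
    """Per-resource error scores (illustrative structure, not the paper's)."""

    scores: dict = field(default_factory=dict)

    def record(self, resource: str, error_class: str) -> None:
        # Accumulate a weighted score for each observed error event.
        weight = SEVERITY[error_class]
        self.scores[resource] = self.scores.get(resource, 0.0) + weight

    def failure_probability(self, resource: str) -> float:
        # Assumed mapping: p = 1 - exp(-score), which grows with the score
        # and stays within [0, 1). A resource with no events yields 0.0.
        return 1.0 - math.exp(-self.scores.get(resource, 0.0))


def choose_action(p_fail: float, recoverability: float,
                  threshold: float = 0.5) -> str:
    """Hypothetical middleware policy driven by the predicted probability."""
    if p_fail < threshold:
        return "monitor"
    # Low recoverability: act before the application is affected.
    if recoverability < 0.5:
        return "act_before_impact"
    return "prepare_recovery"


if __name__ == "__main__":
    hv = HealthVector()
    hv.record("cpu0", "uncorrected_ecc")
    p = hv.failure_probability("cpu0")
    print(f"cpu0 failure probability: {p:.3f}")
    print("action:", choose_action(p, recoverability=0.2))
```

A real predictor would presumably be fitted to hardware error logs and evaluated for false positives, as the offline analysis above describes; the fixed weights here only stand in for that step.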




Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Lakshmi, J., Nandy, S.K., Narayan, R., Varadarajan, K. (2006). Framework for Enabling Highly Available Distributed Applications for Utility Computing. In: Guo, M., Yang, L.T., Di Martino, B., Zima, H.P., Dongarra, J., Tang, F. (eds) Parallel and Distributed Processing and Applications. ISPA 2006. Lecture Notes in Computer Science, vol 4330. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11946441_52


  • DOI: https://doi.org/10.1007/11946441_52

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-68067-3

  • Online ISBN: 978-3-540-68070-3

  • eBook Packages: Computer Science, Computer Science (R0)
