Towards efficient error detection in large-scale HPC systems

WRAP_THESIS_Gurumdimma_2015.pdf - Submitted Version

Abstract

The need for computer systems to be reliable has become increasingly important as users' dependence on their accurate functioning grows. The failure of these systems could be very costly in terms of time and money. Although system designers try to design fault-free systems, it is practically impossible to build such systems, as many different factors can affect them. To achieve system reliability, fault tolerance methods are usually deployed; these methods help the system produce acceptable results even in the presence of faults. Root cause analysis, a dependability method in which the causes of failures are diagnosed for the purpose of correcting them or preventing future occurrences, is less efficient: it is reactive and would not prevent the first failure from occurring. For this reason, methods with predictive capabilities are preferred; failure prediction methods are employed to predict potential failures so that preventive measures can be applied.

Most predictive methods have been supervised, requiring accurate knowledge of the system's failures, errors and faults. However, with changing system components and system updates, supervised methods are ineffective. Error detection methods allow error patterns to be detected early so that preventive methods can be applied. Performing this detection in an unsupervised way could be more effective, as changes or updates to systems would affect such a solution less. In this thesis, we introduce an unsupervised approach to detecting error patterns in a system using its data. More specifically, the thesis investigates the use of both event logs and resource utilization data to detect error patterns. It addresses both the spatial and temporal aspects of achieving system dependability. The proposed unsupervised error detection method has been applied to real data from two different production systems. The results are positive, showing an average detection F-measure of about 75%.

Item Type: Thesis (PhD)
Subjects: Q Science > QA Mathematics > QA76 Electronic computers. Computer science. Computer software
Library of Congress Subject Headings (LCSH): High performance computing
Official Date: February 2016
Dates: February 2016 (Submitted)
Institution: University of Warwick
Theses Department: Department of Computer Science
Thesis Type: PhD
Publication Status: Unpublished
Supervisor(s)/Advisor: Jhumka, Arshad
Extent: xviii, 180 leaves : illustrations, charts
Language: eng
Persistent URL: https://wrap.warwick.ac.uk/77699/
