Gurumdimma, Nentawe (2016) Towards efficient error detection in large-scale HPC systems. PhD thesis, University of Warwick.
Abstract
The need for computer systems to be reliable has become increasingly important as users' dependence on their accurate functioning grows. The failure of these systems could be very costly in terms of time and money. Although system designers try to design fault-free systems, it is practically impossible to build such systems, as different factors could affect them. In order to achieve system reliability, fault tolerance methods are usually deployed; these methods help the system to produce acceptable results even in the presence of faults. Root cause analysis, a dependability method in which the causes of failures are diagnosed for the purpose of correction or prevention of future occurrence, is less efficient. It is reactive and would not prevent the first failure from occurring. For this reason, methods with predictive capabilities are preferred; failure prediction methods are employed to predict potential failures so that preventive measures can be applied.
Most predictive methods have been supervised, requiring accurate knowledge of the system's failures, errors and faults. However, with changing system components and system updates, supervised methods are ineffective. Error detection methods allow error patterns to be detected early so that preventive methods can be applied. Performing this detection in an unsupervised way could be more effective, as changes to systems or updates would affect such a solution less. In this thesis, we introduce an unsupervised approach to detecting error patterns in a system using its data. More specifically, the thesis investigates the use of both event logs and resource utilization data to detect error patterns. It addresses both the spatial and temporal aspects of achieving system dependability. The proposed unsupervised error detection method has been applied to real data from two different production systems. The results are positive, showing an average detection F-measure of about 75%.
Item Type: | Thesis (PhD) |
---|---|
Subjects: | Q Science > QA Mathematics > QA76 Electronic computers. Computer science. Computer software |
Library of Congress Subject Headings (LCSH): | High performance computing |
Official Date: | February 2016 |
Dates: | February 2016 (Submitted) |
Institution: | University of Warwick |
Theses Department: | Department of Computer Science |
Thesis Type: | PhD |
Publication Status: | Unpublished |
Supervisor(s)/Advisor: | Jhumka, Arshad |
Extent: | xviii, 180 leaves : illustrations, charts |
Language: | eng |
Persistent URL: | https://wrap.warwick.ac.uk/77699/ |