
Full Length Research Article

Science World Journal Vol 13(No 4) 2018


www.scienceworldjournal.org
ISSN 1597-6343
Published by Faculty of Science, Kaduna State University

UNDERSTANDING ERROR LOG EVENT SEQUENCE FOR FAILURE ANALYSIS

Nentawe Gurumdimma1, Desmond Bala Bisandu2*

1,2Department of Computer Science, University of Jos, Nigeria

*Corresponding Author Email Address: yusufn@unijos.edu.ng / bisandud@unijos.edu.ng

ABSTRACT
Because large-scale parallel systems are now commonly employed for mission-critical applications, anticipating and accommodating failure is crucial to cluster design. Failure is a commonplace feature of these large-scale systems and cannot be treated as an exception. The system state is mostly captured in logs, so a proper understanding of these error logs is extremely important for failure analysis: the logs contain the "health" information of the system. In this paper we design an approach that seeks to find similarities in the patterns of log events that lead to failures. Our experiments show that several root causes of soft lockup failures can be traced through the logs. We capture the behaviour of failure-inducing patterns and find that the log patterns of failure and non-failure sequences are dissimilar.

Keywords: Failure Sequences; Cluster; Error Logs; HPC; Similarity

INTRODUCTION
High Performance Computing (HPC) systems have become the backbone of major tasks such as weather forecasting, stock trading and simulation, and reliance on them has grown enormously (Nagarajan et al., 2007). They pose great challenges to users, not only in their high power dissipation due to size, but also in keeping the machines free of failure (Zheng et al., 2010). The servers and clusters that provide cloud and other services are prone to failures too. Almost all applications today make use of cluster computers, which remain failure-prone, and the challenge is how such failures can be detected and/or avoided, since even a small amount of downtime can be significantly costly (Zheng et al., 2010).

Cluster systems must therefore operate at optimum performance and effectiveness. However, this is not always the case: cluster nodes fail due to errors. The effect of a single node failure can be immense and, if not corrected, can propagate to more nodes and eventually bring down the whole cluster (Fu and Xu, 2007).

The challenge is to identify these failures so that the necessary steps can be taken to mitigate their effects or avoid them. Root cause analysis is one approach to this problem; research in this area studies cluster log files for failures and their probable causes (Yang, 2003). These log files contain the system's health information and a record of other activities within the cluster. However, root cause analysis only identifies failures and their probable causes after the damage is done. Since a failure can be very costly, it is better to know that a failure will occur before it actually happens (Fronza et al., 2013). This is called failure prediction. A failure can be predicted from initial symptoms or by other means before it occurs, allowing proactive or corrective measures to be taken to avoid or prevent it (Gainaru et al., 2012). This helps not only the cluster system administrators, but also the businesses and other applications that run on these systems (Cinque et al., 2014).

One of the challenges for researchers attempting root cause analysis or failure prediction is understanding the log files on which their analysis is based (Fronza et al., 2013). Failure prediction tends to use these log events, studying existing failures and their patterns in order to build models that can predict future ones (Samak et al., 2012; Bisandu et al., 2018). The log events tend to be chatty and voluminous, and may lack any standard logging format (Liang et al., 2006).

In this paper, we introduce an approach that seeks to find similarities in the patterns of log events that lead to failures. Our contributions are as follows:
i. Previous work has shown that many log events are similar events occurring or logged at different times. These events can be reduced to ease analysis. We use a distance metric to cluster the events according to their similarity and assign a unique ID to each cluster; these IDs can then be used directly by analysis algorithms.
ii. The message content of the event logs provides useful information about the errors that occur. The English-language message part of the event logs is extracted for every failure episode. We use Latent Semantic Indexing to reduce the dimensionality of the term matrix of these failure sequences, which can then be used in our similarity metric algorithm.
iii. To understand the patterns in which these events occur, we study failures and observe whether similar failure sequences contain the same pattern, and for how long these similarities remain observable across the failure sequences. Our approach employs the Jensen-Shannon divergence as a metric to capture the similarity between patterns of failure sequences.

The paper is structured as follows. Section II reviews related work. Section III describes our data and the failures. Section IV explains our data pre-processing, which includes clustering log events based on similarity. Section V presents our approach to finding similarity in failure sequence patterns. We discuss our results in Section VI and conclude in Section VII.

Related Work
Over the years, logs have been treated as human-readable text files for administrators and developers. They are one of the few mechanisms for gaining visibility into system behaviour (Hadžiosmanović et al., 2012).


For this reason, logs have been applied in many application domains over recent decades. A non-exhaustive list includes mobile devices and control systems (Leveson, 2003), operating systems (Makanju et al., 2010), large-scale applications (Liang et al., 2006), and supercomputers (Gainaru et al., 2013). Studies such as Kang and Grimshaw (2007) have yielded a significant understanding of the failure modes of high-performance systems and made it possible to improve system releases. Log-data models use many state-of-the-art software packages for manipulating the data, e.g. (Makanju et al., 2010; Barringer et al., 2010), while ad-hoc algorithms and strategies are needed to identify failure-related entries and coalesce entries belonging to the same problem (Goto et al., 2007). Achieving accurate measurements is a critical objective of work in this area, such as (Cotroneo et al., 2007) and (Oliner and Stearley, 2007).

Failure prediction approaches grounded in reliability theory and preventive maintenance have been designed over the years (Dhiman et al., 2013). Incorporating additional factors into the distribution, for example code complexity (Stearley, 2004), has been one driver of model evolution (Kalbfleisch and Prentice, 2002). These methods are tailored to long-term predictions and do not work well for online failure prediction.

More recent approaches for predicting short-term failures are based on runtime monitoring, which takes account of the current system state (Gurumdimma et al., 2016; Gurumdimma and Jhumka, 2017). The literature indicates two levels of failure prediction: component level and system level. At the component level, components such as motherboards and hard disks are observed using component-specific parameters and domain knowledge, with different approaches giving the best prediction results for each (Sullivan and Chillarege, 1991). Approaches that compare the execution of failed components with healthy ones are one example; several studies from different fields fit this category (Bolander et al., 2009; Patra et al., 2010). In the High Performance Computing (HPC) community, an example is Zheng et al. (2007), where system performance metrics are recorded at regular intervals; algorithms then detect outliers by identifying the majority of nodes and the nodes that deviate from them.

At the second level, the prediction of failures at the system level, different system parameters are observed by monitoring daemons (scheduler logs, system logs, performance metrics, etc.) and correlations between different events are investigated. A significant number of studies in the last couple of years have focused on failure prediction in HPC systems. Most predictors, however, use information extracted in a training phase for short-span predictions, which requires a new round of training. Examples are Zheng et al. (2010) and Zhang et al. (2004), where training takes up to three months and prediction half a month; the latter authors also compare two failure-prediction approaches and study the influence of the observation window on the results. The authors in Gu et al. (2008) applied a meta-learning predictor that chooses between a statistical and a rule-based method depending on which gives the best result for the current system state. Analogously, the authors in Nakka et al. (2011) proposed an approach that analyses logs by investigating both failure logs and usage.

The authors in Rouillard (2004) presented a study distinguishing application failures from system failures, using job logs and RAS logs to filter out failures that have no effect on running jobs; this allowed them to make a couple of interesting observations that could help future failure predictors. A more general approach was proposed in Rajachandrasekar et al. (2012), based on a middleware solution between various application and analysis modules; decision-making engines and failure predictors relying on distributed failure information facilitate fault-tolerance mechanisms such as preemptive job migration from within their framework. Differently, the authors in Lou et al. (2010a) and Xu et al. (2009) proposed approaches that investigate parameter correspondence among application log messages to extract dependencies between system components. Time-series analysis has also been used to implement different processing methods, such as subspace methods and spike detection, to find patterns among outliers that indicate anomalies in monitored systems. The authors in Liang et al. (2006) analysed BlueGene/L system logs by combining spatial and temporal filtering, and designed failure prediction methods shown to be effective for about 80% of network and memory failures. Logs of five supercomputer systems were analysed in Oliner and Stearley (2007), providing an optimised version of the algorithm proposed in Liang et al. (2006). However, the proposed filtering algorithm might remove alerts that are independent but coincidentally happen at the same time on different nodes.

LOG DATA
A. Error Logs
Log files are often the only source of information about the workings of a computer system; they are the system administrator's guide to diagnosing faults. As computer systems grow more complex, so do their logs, and the administrator's task becomes correspondingly hard, or even impossible, given the volume of log messages (Barringer et al., 2010; Janbeglou et al., 2010; Lou et al., 2010b). Log messages have become the main source of information for root cause analysis of system failures.

In this work we focus on event log files, which contain basic information about the state of the system, the activities going on, and the system's health. The challenge with log file data is that it is generally unstructured, often incomplete, not clearly understood, and frequently without any particular message structure. We therefore process our data to give it some structure, formatting it into a uniform representation that provides the information we need for our analysis. A careful investigation of the log messages showed that there is a pattern of occurrence of errors before a failure (Pecchia et al., 2011). We therefore analyse further to find patterns and the relationship of these events to failures.

B. Failure Identification
It is necessary to establish that our data contain failure events, and that these failures did take place on the supercomputing system where the log files were recorded.
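As a hypothetical sketch of how such failure events could be flagged, the snippet below scans syslog-style lines for the soft-lockup marker text; the marker string follows the Ranger excerpts shown later in Figure 2, and the helper name is ours, not the paper's:

```python
import re

# Marker taken from the Ranger kernel messages in Figure 2; real systems
# may phrase soft-lockup reports differently, so treat this as illustrative.
SOFT_LOCKUP = re.compile(r"BUG: soft\s*lockup", re.IGNORECASE)

def find_failure_events(lines):
    """Return (line_number, line) pairs that look like soft-lockup failures."""
    hits = []
    for i, line in enumerate(lines):
        if SOFT_LOCKUP.search(line):
            hits.append((i, line))
    return hits
```

In practice the flagged lines would be checked manually against expert knowledge, as the paper does for its March-June 2010 window.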


Soft Lockup Failures: These occur when requests are not attended to, perhaps due to resource unavailability, and when initiated reconnection attempts are also refused. Repetition of this process may lead to deadlocked processes, or the nodes or servers may hang. Eventually this leads to loss of data or of service access, and causes termination of jobs. Soft lockups can be either node soft lockups or server soft lockups. The occurrence of this failure, the soft lockup, is of interest to us: we want to know the kinds of events that are likely to lead to it.

A study of the log files, together with expert knowledge, shows that soft lockups are among the most commonly occurring failures in cluster and HPC systems. Yuan et al. (2014) showed that Machine Check Exceptions, as well as Evict/RPC events, cause soft lockups in computer systems. A Machine Check Exception (MCE) is the way a computer's hardware reports an error that the hardware cannot correct. When the kernel logs an uncorrected hardware error, the cluster software can take measures to rectify the problem, such as re-running the job on another node and/or reporting the failure to the administrator (Janbeglou et al., 2010). MCE errors therefore make early failure prediction possible. Soft lockups led by Evict/RPC events are characterised by evict and recovery events preceding the failure. Cotroneo et al. (2007) verified these hypotheses using correlation and regression techniques and obtained a high correlation between MCE and Evict/RPC events and the soft lockup failure events.

Log pre-processing
In this section we present the detailed steps involved in processing the log files into a format we can easily use for our analysis:

Log file containing events from the cluster -> Tokenization and Parsing -> Message Extraction -> Unique Event ID assignment -> Removing Similar Events

Figure 1: Log Pre-processing steps

C. Error Log Tokenization and Parsing
Inconsistency in the format of error log entries makes dealing with the logs more challenging, especially when all the information carried by the logs matters to the analysis (Liang et al., 2006). Automatic analysis and processing of log files is difficult because of differences in the formats and in the system information contained in the reports; this information can be in proprietary formats and contain many messages that are not needed. See Figure 2.

Ranger (syslog)
Mar 31 15:56:57 i149-405 kernel: [9155992.130789] Machine check events logged
Mar 31 15:57:07 i144-110 kernel: [9155996.614494] Machine check events logged
Mar 31 15:58:07 i102-406 kernel: [414646.585574] BUG: soft lockup detected on CPU#2, pid:22297, uid:0, comm:ldlm_bl_24
Mar 31 15:58:07 i102-406 kernel: [414646.689471] spurious soft lockup detection on CPU#2

Blue Gene/L
1117838702 2005.06.03 R02-M1-N0-C:J12-U11 2005-06-03-15.45.02.981210 R02-M1-N0-C:J12-U11 RAS KERNEL INFO instruction cache parity error corrected
1117838703 2005.06.03 R02-M1-N0-C:J12-U11 2005-06-03-15.45.03.145256 R02-M1-N0-C:J12-U11 RAS KERNEL INFO instruction cache parity error corrected
1117838704 2005.06.03 R02-M1-N0-C:J12-U11 2005-06-03-15.45.04.007681 R02-M1-N0-C:J12-U11 RAS KERNEL INFO instruction cache parity error corrected

Figure 2: Log events for Ranger (syslog) and Blue Gene/L

In an attempt to overcome these challenges, Fronza et al. (Fronza et al., 2013) proposed that log files should not be written only for systems administrators, who already have a good understanding of the systems, but should contain extra information in a well-structured format. In their proposal, the log file format standard must clearly define the type of information to be contained in the logs, and that information must be uniquely represented. As long as information about structure, environment and the unique features of error events is not captured in general formats, log pre-processing will remain a necessity for failure prediction (Eason et al., 1995).

We use the standard Linux syslog error events obtained from the Ranger supercomputer; this data already has some structure, as shown in Figure 2 above. The logs were recorded in 2010. In our work, some of the fields are not needed for the purposes of this prediction and can be discarded; for example, a field like protocol is not necessary. We also remove unnecessary tokens from the messages, for example tokens containing symbols. The message part, which contains the sequence of English words, explains the error; this is important to us because we can use text-mining techniques to analyse the data. The error message is broken down into the tokens and fields that we consider important to our research.

D. Message Extraction
The message is one of the key parts of the error event in our research. We observe that a message contains English words and alphanumeric tokens. The English tokens show a pattern and provide clues about what an error message means or what is going on in the system. The alphanumeric tokens, according to experts, indicate the interacting components or software functions involved. These components do not occur frequently and show little or no pattern, so they are less important in the message and are extracted out (Yuan et al., 2014).

E. Log Event Clustering and ID Assignment
Error log messages need to be labelled with a unique ID for every error event with a distinct message, based on the cluster similarity of the events. These messages, which are


basically natural language texts providing further insight into the error logs, are used for this ID assignment. Our algorithm uses the well-known Levenshtein distance, which measures the difference between message strings, to automatically assign an ID to each error event (Janbeglou et al., 2010). Messages whose distance is below a threshold are regarded as similar and assigned the same ID, as shown in Figure 3. The aim is to identify unique events and the pattern in which they occur throughout the whole error log data.

Input = log events
LD = Levenshtein Distance
for i = 1 to N {
    Sim(event(i)) = LD(event(1), event(i));
}
// All events with close values of Sim are clustered together as similar events
for j = 1 to N {
    ID = getID(EventCluster(j))
    // Assign ID to events according to their cluster
    for each event in EventCluster(j) {
        Assign ID to event.
    }
}

Figure 3: Algorithm for event clustering and ID labelling

F. Time Conversion
Our error log messages contain the date and time at which the errors were reported. This is very useful to us and for any meaningful failure prediction or analysis. However, the raw format is inconvenient to manipulate. A timestamp such as 2010 Mar 31 15:56:57 denotes an error that occurred at the 15th hour, 56th minute and 57th second of March 31, 2010. We convert this to epoch timestamp format; the equivalent epoch timestamp for the above is 1270051017, which can be manipulated easily.

G. Removing Similar Events
We observe several seemingly identical error events reported frequently in the logs. These errors are sometimes reported by the same cluster node within a small time difference; some are reported by different cluster nodes with the same error message at the same time or within a small time difference. According to the authors in Kornack and Rakic (2001) and Heien et al. (2011), occurrences of similar or identical error events within the same or a small time difference are likely caused by the same fault. Removing the redundant messages is therefore necessary. One could argue that keeping the `redundant' events would help in understanding the behaviour of a particular fault in terms of the frequency of the event logs generated over the period; however, we consider this unwise in our case, since this behaviour is not always observed in a cluster system. We therefore reduce redundant error events that have the following properties:
i. Similar error events reported in sequence by the same node within a small time threshold. This is because a node can log several similar messages triggered by the same fault.
ii. Similar error events reported by different nodes in a sequence and within a time threshold. These could be triggered by the same fault resulting in similar misbehaviour by the affected cluster nodes.

The time threshold is measured from the first similar event in a sequence. It is pertinent to note that the same error messages logged by different nodes may be caused by different faults at different times, hence the need for the time threshold. Error event similarity is obtained as explained earlier in this section. The process of identifying and grouping the error events exhibiting the above properties uses a combination of tupling and time-grouping heuristics (Salfner, 2005; Kalbfleisch and Prentice, 2002; El-Sayed and Schroeder, 2013); we define heuristics that capture the properties outlined in Section IV. The reduced data size makes manipulation easier. One aim of this research is to determine whether there is a pattern of occurrence of events leading up to a particular failure and, if so, whether these patterns have some similarity. The challenge is that we cannot work with the whole data to find patterns within the log events, which leads to the next section on error pattern and window size estimation (Fu and Xu, 2007).

Error pattern and window size estimation
Error log events are very numerous, with millions of them logged over a long period, which makes them difficult to handle. Given a series of events that occur before a failure, we want to understand the pattern of occurrence of these events and identify any signature. In attempting to obtain failure patterns, it is necessary to note that different failures with different signatures or patterns can occur while all the error events are logged together (El-Sayed and Schroeder, 2013). It therefore becomes necessary to know the time window that can adequately capture a particular failure pattern. From Figure 4, we want to know the best minimum time window tw that can be considered 'good enough' to give us a pattern that led to failure f1.

[Figure 4 shows a sequence of events e1, e2, e3, ..., ek, ek+1, ..., ek+j, ..., en on a time axis, with failures f1 and f2 and the time window marked.]
Figure 4: Error events sequence in time

We considered all messages within a time window tw of 1 - 6 hours for failure events that occurred within a period of time; we studied the events within the months of March to May 2010. In this section we explain our attempt to obtain a pattern leading to failure. The workflow is shown in Figure 5.
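The ID-assignment procedure of Figure 3 can be made concrete as follows. This is a minimal sketch, assuming a plain Levenshtein distance and a hypothetical threshold of 10 edits; the paper does not state the threshold it used:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def assign_event_ids(messages, threshold=10):
    """Give each message the ID of the first earlier message within `threshold` edits."""
    reps, ids = [], []   # cluster representatives and per-message IDs
    for msg in messages:
        for event_id, rep in enumerate(reps):
            if levenshtein(msg, rep) < threshold:
                ids.append(event_id)
                break
        else:                      # no close representative: start a new cluster
            ids.append(len(reps))
            reps.append(msg)
    return ids
```

With this sketch, near-duplicate messages such as "Machine check events logged" and "Machine check event logged" collapse to one ID, while an unrelated message receives a new one.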
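The timestamp conversion and the episode construction described above can be sketched as below. The value 1270051017 reproduces the Time Conversion example under the assumption that timestamps are interpreted as UTC; the event tuples in the usage are illustrative, not taken from the dataset:

```python
from datetime import datetime, timezone

def to_epoch(stamp):
    """Convert a 'YYYY Mon DD HH:MM:SS' log timestamp to a Unix epoch (UTC assumed)."""
    dt = datetime.strptime(stamp, "%Y %b %d %H:%M:%S")
    return int(dt.replace(tzinfo=timezone.utc).timestamp())

def failure_episode(events, failure_time, window_hours):
    """Collect event IDs of (epoch, event_id) pairs within `window_hours` before a failure."""
    start = failure_time - window_hours * 3600
    return [eid for t, eid in events if start <= t < failure_time]
```

For example, `to_epoch("2010 Mar 31 15:56:57")` gives 1270051017, matching the paper's worked example, and `failure_episode` keeps only events inside the chosen 1 - 6 hour window before each identified failure.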

Pre-processed logs -> Create failure episodes from the pre-processed event logs (failure episodes with time windows of 1 hr - 6 hrs) -> Message Extraction: messages of each failure episode are extracted -> Data Transformation to obtain the term-frequency matrix of failure episodes, and normalization -> Failure episode similarity measurement using the Jensen-Shannon divergence metric

Figure 5: Failure sequences similarity and window estimation workflow

H. Data Transformation
Since we consider only the message part of the error events in our analysis, the unstructured messages are extracted as explained in the pre-processing. We can then apply text-analysis techniques to obtain relationships between the contents of the error messages and correlate semantically related terms in the messages. However, this is not possible without transforming our data into a format that can easily be used by the analysis algorithms.

The messages of each failure sequence are transformed into a term-frequency matrix; the rows of the matrix contain the terms, while the columns are the failure sequences. A failure sequence consists of the events that precede the failure event within a given time window (Singh et al., 2012). We transform the data to matrix form because matrices are easily handled by our pattern analysis algorithm, so we consider this a sound choice.

Consider the event logs of a cluster system containing particular failures f_i, i = 1...n. We extract all the events that occurred before each failure within a time window t_w (in our experiment, t_w ranges from 1 hr to 6 hrs). For example, given n soft lockup failures with t_w = 1 hour, the matrix M will contain m rows of terms and n columns of failures, as shown in Figure 6:

M = [ t_1 f_1   t_1 f_2   ...   t_1 f_n
      t_2 f_1     .               .
        .         .               .
      t_m f_1   t_m f_2   ...   t_m f_n ]

Figure 6: Data matrix M for n failure sequences

Matrix Normalization: With many failure sequences sharing similar terms, the term-frequency matrix may contain the union of all terms in the failure episodes; with, say, 100 thousand terms, the covariance matrix would be of dimension 100,000 by 100,000. This is a very high dimension, so the data is reduced using Latent Semantic Indexing (LSI). LSI performs dimensionality reduction much like PCA; the difference is that in LSI the data pre-processing that normalizes vectors to zero mean and unit variance is not done. Normalizing the feature data to unit variance would unnecessarily scale up the weight of rarely occurring terms in the failure sequences. LSI also exposes relationships between term vectors that were not clearly known, by reducing the noisy relationships (Nagarajan et al., 2007; Goto et al., 2007; Samak et al., 2012; Pecchia et al., 2011). It does this by decomposing the raw matrix M into three reduced matrices, U, S and V, obtaining a k-dimensional reduced matrix M_k as in Equation 1:

M_k = U_k S_k V_k^T    (1)

where U is the term vector matrix, S is the computed diagonal matrix of decreasing singular values, and V is the failure sequence vector matrix.

The term weights (frequencies) across the failure sequences are normalized to values between 0 and 1 as in Equation 2:

n_t = w_t(f_i) / \sum_{i=1}^{n} w_t(f_i)    (2)

where n_t is the normalized term weight and w_t(f_i) is the weight of term t in failure episode f_i.

I. Failure Pattern Similarity Measure Using the Jensen-Shannon Divergence Metric
Having identified different soft lockup failures, this section asks how similar these failure patterns, or the error events of each failure sequence, are to one another. In Section II we established that soft lockup failures can be caused by either Machine Check Exception (MCE) events or Evict/RPC events. The similarities between these failures are obtained to establish whether MCE-led and Evict/RPC-led soft lockups contain similar failure patterns.

The Jensen-Shannon divergence (JSD) measures the divergence or similarity between two or more probability distributions (Kalbfleisch and Prentice, 2002; Rouillard, 2004). Messages from failure sequences that are superficially similar yet semantically unrelated should not be considered similar, but most metrics do not take this into consideration. JSD does, by considering the entropies of the messages, hence our choice of it (Makanju et al., 2010).

Consider failure sequences containing several log events with information regarding the likely cause of failure. We want to establish that failures led by the same events should not vary much. Given a distribution of failure sequences F = {f_1, f_2, ..., f_n}, where each f_i = {t_1, ..., t_k} contains the events' term-frequency


k
distribution, where t i
1 and 0  ti  1 for all
i 1

i  1, 2,..., k .
Let the weights of the distributions of the failure sequences be π_i; then the JSD over the failure sequences is given by Equation 3:

JSD(f_1, ..., f_n) = H( ∑_{i=1}^{n} π_i f_i ) − ∑_{i=1}^{n} π_i H(f_i)        (3)

where H(f_i) is the Shannon entropy of the distribution f_i, and ∑_i π_i = 1 with 0 ≤ π_i ≤ 1.
Now, for two failure sequences f_1 and f_2, with π_1 = π_2 = 1/2:

JSD(f_1, f_2) = H( (1/2)(f_1 + f_2) ) − (1/2)[ H(f_1) + H(f_2) ]

where H(f) = −∑_{i=1}^{k} f_i log f_i is the Shannon entropy. Hence, the similarity between the two failure sequences is given by Equation 4:

Sim(f_1, f_2) = 1 − JSD(f_1, f_2)        (4)

with similarity values ranging between 0 and 1.

Figure 7: Distributions of some events
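Equations 3 and 4 can be sketched directly in code (a minimal illustration, assuming base-2 logarithms so that the JSD of two distributions, and hence Sim, stays in [0, 1]):

```python
import math

def shannon_entropy(f):
    # H(f) = -sum_i f_i log2 f_i; zero-probability terms contribute nothing.
    return -sum(p * math.log2(p) for p in f if p > 0)

def jsd(dists, weights):
    # Equation 3: JSD = H(sum_i pi_i f_i) - sum_i pi_i H(f_i)
    mixture = [sum(w * f[j] for w, f in zip(weights, dists))
               for j in range(len(dists[0]))]
    return shannon_entropy(mixture) - sum(w * shannon_entropy(f)
                                          for w, f in zip(weights, dists))

def similarity(f1, f2):
    # Equation 4, specializing Equation 3 to pi_1 = pi_2 = 1/2.
    return 1.0 - jsd([f1, f2], [0.5, 0.5])
```

Identical distributions give Sim = 1, while distributions with no overlapping events give Sim = 0.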

RESULTS AND DISCUSSION

Our experiment with error log events was carried out on logs from the Ranger supercomputer of the Texas Advanced Computing Centre (TACC), University of Texas, Austin. The logs cover the period March 2010 to June 2010. In performing this experiment, we manually studied the logs and identified soft lockup failure events within the stipulated time (March - June 2010). Figure 7 shows the distribution of soft lockup, interrupt, and rpc/evict events within these months. These events are highly correlated with soft lockup failures and are regarded as the events that likely precede them.
There are several root causes of soft lockup failures, but we focus on soft lockups led by machine check events and by RPC/Evict events. These events formed the failure sequences. The events preceding these failures are obtained within a time window of 1 - 6 hours. From Figure 8, failures led by MCE events (f_1, f_2) and failures led by RPC/Evict events (f_4, f_6) show some variation in the similarity of their patterns. This is as expected. Within a time window of 1 - 3 hours, the similarity in pattern is clearer than for a time window of 4 - 6 hours, which suggests that for a large time window, other failures, led by other events, may also have occurred. Again, removing redundant events greatly improves the clarity of the failure patterns. This suggests that preprocessing is an important step in understanding logs for failure analysis. Some researchers argue that redundant events also constitute an integral part of the failure patterns; however, we found that this is not always the case.

Figure 8: Failure sequences similarity across time windows with no redundant events

Figure 9: Failure sequences similarity across time windows with redundant events

Conclusion
Accurate detection of failure patterns in the logs of supercomputers, and understanding the behaviour of these systems through their generated logs, is crucial. As applications scale up in size, failures tend to occur more often within a short range of time. The impact of failures on system performance becomes more pronounced, making the task of analyzing and quantifying the extent of that impact difficult. Failure traces from large-scale systems are mostly unavailable. Our approach seeks to find similarities in the patterns of the log events that lead to failures.


We use latent semantic indexing to reduce the dimension of the data before finding the similarity between the patterns, using the time, locations, and downtime of each failure. The failure traces generated by the model are used to understand the behaviour of certain failures in the system. The result of the experiment has revealed some insightful knowledge: we discovered that system failure behaviour can be traced from logs. Traditional failure correction methods, such as regular checkpoints, could be applied more effectively if failure behaviours are detected early. Finally, removing redundant event logs provides a better understanding of the sequence of event logs from which failure-inducing patterns can be traced.

REFERENCES
Barringer, H., Groce, A., Havelund, K., Smith, M., 2010. Formal Analysis of Log Files. J. Aerosp. Comput. Inf. Commun. 7, 365–390.
Bisandu, D.B., Prasad, R., Liman, M.M., 2018. Clustering news articles using efficient similarity measure and N-grams. Int. J. Knowledge Engineering and Data Mining 5(4), 333–348.
Bolander, N., Qiu, H., Eklund, N., Hindle, E., Rosenfeld, T., 2009. Physics-based remaining useful life prediction for aircraft engine bearing prognosis, in: Annual Conference of the Prognostics and Health Management Society.
Cinque, M., Cotroneo, D., Della Corte, R., Pecchia, A., 2014. Assessing direct monitoring techniques to analyze failures of critical industrial systems, in: Software Reliability Engineering (ISSRE), 2014 IEEE 25th International Symposium On. IEEE, pp. 212–222.
Cotroneo, D., Pietrantuono, R., Mariani, L., Pastore, F., 2007. Investigation of failure causes in workload-driven reliability testing, in: Fourth International Workshop on Software Quality Assurance: In Conjunction with the 6th ESEC/FSE Joint Meeting. ACM, pp. 78–85.
Dhiman, M.P., Anand, D., Singh, E., Grover, K., 2013. PC based speed control of induction motor. Int. J. Emerg. Trends Electr. Electron. (IJETEE) 2, 81–84.
Eason, G., Noble, B., Sneddon, I.N., 1995. On certain integrals of Lipschitz-Hankel type involving products of Bessel functions. Phil. Trans. R. Soc. Lond. 247, 529–551.
El-Sayed, N., Schroeder, B., 2013. Reading between the lines of failure logs: Understanding how HPC systems fail, in: Dependable Systems and Networks (DSN), 2013 43rd Annual IEEE/IFIP International Conference On. IEEE, pp. 1–12.
Fronza, I., Sillitti, A., Succi, G., Terho, M., Vlasenko, J., 2013. Failure prediction based on log files using random indexing and support vector machines. J. Syst. Softw. 86, 2–11.
Fu, S., Xu, C.-Z., 2007. Exploring event correlation for failure prediction in coalitions of clusters, in: Proceedings of the 2007 ACM/IEEE Conference on Supercomputing. ACM, p. 41.
Gainaru, A., Cappello, F., Snir, M., Kramer, W., 2013. Failure prediction for HPC systems and applications: Current situation and open issues. Int. J. High Perform. Comput. Appl. 27, 273–282. https://doi.org/10.1177/1094342013488258
Gainaru, A., Cappello, F., Snir, M., Kramer, W., 2012. Fault prediction under the microscope: A closer look into HPC systems, in: High Performance Computing, Networking, Storage and Analysis (SC), 2012 International Conference For. IEEE, pp. 1–11.
Goto, H., Hasegawa, Y., Tanaka, M., 2007. Efficient scheduling focusing on the duality of MPL representation, in: Computational Intelligence in Scheduling, 2007. SCIS'07. IEEE Symposium On. IEEE, pp. 57–64.
Gu, J., Zheng, Z., Lan, Z., White, J., Hocks, E., Park, B.-H., 2008. Dynamic meta-learning for failure prediction in large-scale systems: A case study, in: Parallel Processing, 2008. ICPP'08. 37th International Conference On. IEEE, pp. 157–164.
Gurumdimma, N., Jhumka, A., 2017. Detection of Recovery Patterns in Cluster Systems Using Resource Usage Data, in: 2017 IEEE 22nd Pacific Rim International Symposium on Dependable Computing (PRDC), pp. 58–67.
Gurumdimma, N., Jhumka, A., Liakata, M., Chuah, E., Browne, J., 2016. CRUDE: Combining Resource Usage Data and Error Logs for Accurate Error Detection in Large-Scale Distributed Systems, in: 2016 IEEE 35th Symposium on Reliable Distributed Systems (SRDS), pp. 51–60.
Hadžiosmanović, D., Bolzoni, D., Hartel, P.H., 2012. A log mining approach for process monitoring in SCADA. Int. J. Inf. Secur. 11, 231–251.
Heien, E., Kondo, D., Gainaru, A., LaPine, D., Kramer, B., Cappello, F., 2011. Modeling and tolerating heterogeneous failures in large parallel systems, in: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis. ACM, p. 45.
Janbeglou, M., Zamani, M., Ibrahim, S., 2010. Redirecting network traffic toward a fake DNS server on a LAN, in: Computer Science and Information Technology (ICCSIT), 2010 3rd IEEE International Conference On. IEEE, pp. 429–433.
Kalbfleisch, J.D., Prentice, R.L., 2002. The Statistical Analysis of Failure Time Data, 2nd ed., Wiley Series in Probability and Statistics. Wiley, Hoboken, NJ.
Kang, W., Grimshaw, A., 2007. Failure prediction in computational grids, in: Simulation Symposium, 2007. ANSS'07. 40th Annual. IEEE, pp. 275–282.
Kornack, D.R., Rakic, P., 2001. Cell proliferation without neurogenesis in adult primate neocortex. Science 294, 2127–2130.
Leveson, N., 2003. A new accident model for engineering safer systems. Saf. Sci. 42, 237–270.
Liang, Y., Zhang, Y., Sivasubramaniam, A., Jette, M., Sahoo, R., 2006. BlueGene/L failure analysis and prediction models, in: Dependable Systems and Networks, 2006. DSN 2006. International Conference On. IEEE, pp. 425–434.
Lou, J.-G., Fu, Q., Wang, Y., Li, J., 2010a. Mining dependency in distributed systems through unstructured logs analysis. ACM SIGOPS Oper. Syst. Rev. 44, 91–96.
Lou, J.-G., Fu, Q., Yang, S., Xu, Y., Li, J., 2010b. Mining invariants from console logs for system problem detection, in: USENIX Annual Technical Conference, pp. 1–14.
Makanju, A., Zincir-Heywood, A.N., Milios, E.E., 2010. An evaluation of entropy based approaches to alert detection in high performance cluster logs, in: Quantitative Evaluation of Systems (QEST), 2010 Seventh International Conference on The. IEEE, pp. 69–78.
Nagarajan, A.B., Mueller, F., Engelmann, C., Scott, S.L., 2007. Proactive fault tolerance for HPC with Xen virtualization, in: Proceedings of the 21st Annual International Conference on Supercomputing. ACM, pp. 23–32.
Nakka, N., Agrawal, A., Choudhary, A., 2011. Predicting node failure in high performance computing systems from failure and usage logs, in: 2011 IEEE International Symposium on Parallel and Distributed Processing, Workshops and PhD Forum (IPDPSW), pp. 1557–1566.
Oliner, A., Stearley, J., 2007. What supercomputers say: A study of five system logs, in: Dependable Systems and Networks, 2007. DSN'07. 37th Annual IEEE/IFIP International Conference On. IEEE, pp. 575–584.
Patra, A.P., Bidhar, S., Kumar, U., 2010. Failure prediction of rail considering rolling contact fatigue. Int. J. Reliab. Qual. Saf. Eng. 17, 167–177.
Pecchia, A., Cotroneo, D., Kalbarczyk, Z., Iyer, R.K., 2011. Improving log-based field failure data analysis of multi-node computing systems, in: Dependable Systems & Networks (DSN), 2011 IEEE/IFIP 41st International Conference On. IEEE, pp. 97–108.
Rajachandrasekar, R., Besseron, X., Panda, D.K., 2012. Monitoring and predicting hardware failures in HPC clusters with FTB-IPMI, in: Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW), 2012 IEEE 26th International. IEEE, pp. 1136–1143.
Rouillard, P.J., 2004. Real-time log file analysis using the Simple Event Correlator (SEC), in: Proceedings of LISA '04: Eighteenth Systems Administration Conference. USENIX Association, Atlanta, GA, pp. 133–150.
Salfner, F., 2005. Predicting failures with hidden Markov models, in: Proceedings of 5th European Dependable Computing Conference (EDCC-5), pp. 41–46.
Samak, T., Gunter, D., Goode, M., Deelman, E., Juve, G., Silva, F., Vahi, K., 2012. Failure analysis of distributed scientific workflows executing in the cloud, in: Proceedings of the 8th International Conference on Network and Service Management. International Federation for Information Processing, pp. 46–54.
Singh, V.P., Vaibhav, K., Chaturvedi, D.K., 2012. Solar power forecasting modeling using soft computing approach, in: Engineering (NUiCONE), 2012 Nirma University International Conference On. IEEE, pp. 1–5.
Stearley, J., 2004. Towards informatic analysis of syslogs, in: Cluster Computing, 2004 IEEE International Conference On. IEEE, pp. 309–318.
Sullivan, M., Chillarege, R., 1991. Software defects and their impact on system availability: A study of field failures in operating systems, in: FTCS, pp. 2–9.
Xu, W., Huang, L., Fox, A., Patterson, D., Jordan, M., 2009. Online system problem detection by mining patterns of console logs, in: Data Mining, 2009. ICDM'09. Ninth IEEE International Conference On. IEEE, pp. 588–597.
Yang, S.K., 2003. A condition-based failure-prediction and processing-scheme for preventive maintenance. IEEE Trans. Reliab. 52, 373–383.
Yuan, D., Luo, Y., Zhuang, X., Rodrigues, G.R., Zhao, X., Zhang, Y., Jain, P., Stumm, M., 2014. Simple testing can prevent most critical failures: An analysis of production failures in distributed data-intensive systems, in: OSDI, pp. 249–265.
Zhang, Y., Squillante, M.S., Sivasubramaniam, A., Sahoo, R.K., 2004. Performance implications of failures in large-scale cluster scheduling, in: Workshop on Job Scheduling Strategies for Parallel Processing. Springer, pp. 233–252.
Zheng, Z., Lan, Z., Gupta, R., Coghlan, S., Beckman, P., 2010. A practical failure prediction with location and lead time for Blue Gene/P, in: Dependable Systems and Networks Workshops (DSN-W), 2010 International Conference On. IEEE, pp. 15–22.
Zheng, Z., Li, Y., Lan, Z., 2007. Anomaly localization in large-scale clusters, in: Cluster Computing, 2007 IEEE International Conference On. IEEE, pp. 322–330.
