
LogCluster - A Data Clustering and Pattern Mining Algorithm for Event Logs

Risto Vaarandi and Mauno Pihelgas


TUT Centre for Digital Forensics and Cyber Security
Tallinn University of Technology
Tallinn, Estonia
firstname.lastname@ttu.ee

Abstract—Modern IT systems often produce large volumes of event logs, and event pattern discovery is an important log management task. For this purpose, data mining methods have been suggested in many previous works. In this paper, we present the LogCluster algorithm which implements data clustering and line pattern mining for textual event logs. The paper also describes an open source implementation of LogCluster.

Keywords—event log analysis; mining patterns from event logs; event log clustering; data clustering; data mining

I. INTRODUCTION

During the last decade, data centers and computer networks have grown significantly in processing power, size, and complexity. As a result, organizations commonly have to handle many gigabytes of log data on a daily basis. For example, in our recent paper we have described a security log management system which receives nearly 100 million events each day [1]. In order to ease the management of log data, many research papers have suggested the use of data mining methods for discovering event patterns from event logs [2–20]. This knowledge can be employed for many different purposes like the development of event correlation rules [12–16], detection of system faults and network anomalies [6–9, 19], visualization of relevant event patterns [17, 18], identification and reporting of network traffic patterns [4, 20], and automated building of IDS alarm classifiers [5].

In order to analyze large amounts of textual log data without well-defined structure, several data mining methods have been proposed in the past which focus on the detection of line patterns from textual event logs. Suggested algorithms have been mostly based on data clustering approaches [2, 6, 7, 8, 10, 11]. The algorithms assume that each event is described by a single line in the event log, and each line pattern represents a group of similar events.

In this paper, we propose a novel data clustering algorithm called LogCluster which discovers both frequently occurring line patterns and outlier events from textual event logs. The remainder of this paper is organized as follows – section II provides an overview of related work, section III presents the LogCluster algorithm, section IV describes the LogCluster prototype implementation and experiments for evaluating its performance, and section V concludes the paper.

This work has been supported by Estonian IT Academy (StudyITin.ee) and SEB Estonia.

978-3-901882-77-7 © 2015 IFIP

II. RELATED WORK

One of the earliest event log clustering algorithms is SLCT, which is designed for mining line patterns and outlier events from textual event logs [2]. During the clustering process, SLCT assigns event log lines that fit the same pattern (e.g., Interface * down) to the same cluster, and all detected clusters are reported to the user as line patterns. For finding clusters in log data, the user has to supply the support threshold value s to SLCT which defines the minimum number of lines in each cluster. SLCT begins the clustering with a pass over the input data set, in order to identify frequent words which occur in at least s lines (the word delimiter is customizable and defaults to whitespace). Also, each word is considered with its position in the line. For example, if s=2 and the data set contains the lines

Interface eth0 down
Interface eth1 down
Interface eth2 up

then the words (Interface,1) and (down,3) occur in three and two lines, respectively, and are thus identified as frequent words. SLCT will then make another pass over the data set and create cluster candidates. When a line is processed during the data pass, all frequent words from the line are joined into a set which will act as a candidate for this line. After the data pass, candidates generated for at least s lines are reported as clusters together with their supports (occurrence times). Outliers are identified during an optional data pass and written to a user-specified file. For example, if s=2 then two cluster candidates {(Interface,1), (down,3)} and {(Interface,1)} are detected with supports 2 and 1, respectively. Thus, {(Interface,1), (down,3)} is the only cluster and is reported to the user as the line pattern Interface * down (since there is no word associated with the second position, an asterisk is printed for denoting a wildcard). The reported cluster covers the first two lines, while the line Interface eth2 up is considered an outlier.

SLCT has several shortcomings which have been pointed out in some recent works. Firstly, it is not able to detect wildcards after the last word in a line pattern [11]. For instance, if s=3 for the three example lines above, the cluster {(Interface,1)} is reported to the user as the line pattern Interface, although most users would prefer the pattern Interface * *. Secondly, since word positions are encoded into words, the algorithm is sensitive to shifts in word positions and delimiter noise [8]. For instance, the line Interface HQ Link down would not be assigned to the cluster Interface * down, but would rather generate a separate cluster candidate. Finally, low support thresholds can lead to overfitting when larger clusters are split and the resulting patterns are too specific [2].

Reidemeister, Jiang, Munawar and Ward [6, 7, 8] developed a methodology that addresses some of the above shortcomings. The methodology uses event log mining techniques for diagnosing recurrent faults in software systems. First, a modified version of SLCT is used for mining line patterns from labeled event logs. In order to handle clustering errors caused by shifts in word positions and delimiter noise, line patterns from SLCT are clustered with a single-linkage clustering algorithm which employs a variant of the Levenshtein distance function. After that, a common line pattern description is established for each cluster of line patterns. According to [8], single-linkage clustering and post-processing its results add minimal runtime overhead to the clustering by SLCT. The final results are converted into bit vectors and used for building decision-tree classifiers, in order to identify recurrent faults in future event logs.

Another clustering algorithm that mines line patterns from event logs is IPLoM by Makanju, Zincir-Heywood and Milios [10, 11]. Unlike SLCT, IPLoM is a hierarchical clustering algorithm which starts with the entire event log as a single partition, and splits partitions iteratively during three steps. Like SLCT, IPLoM considers words with their positions in event log lines, and is therefore sensitive to shifts in word positions. During the first step, the initial partition is split by assigning lines with the same number of words to the same partition. During the second step, each partition is divided further by identifying the word position with the least number of unique words, and splitting the partition by assigning lines with the same word to the same partition. During the third step, partitions are split based on associations between word pairs. At the final stage of the algorithm, a line pattern is derived for each partition. Due to its hierarchical nature, IPLoM does not need the support threshold, but takes several other parameters (such as the partition support threshold and cluster goodness threshold) which impose fine-grained control over the splitting of partitions [11]. As argued in [11], one advantage of IPLoM over SLCT is its ability to detect line patterns with wildcard tails (e.g., Interface * *), and the authors have reported higher precision and recall for IPLoM.

III. THE LOGCLUSTER ALGORITHM

The LogCluster algorithm is designed for addressing the shortcomings of existing event log clustering algorithms that were discussed in the previous section. Let L = {l1,...,ln} be a textual event log which consists of n lines, where each line li (1 ≤ i ≤ n) is a complete representation of some event and i is a unique line identifier. We assume that each line li ∈ L is a sequence of ki words: li = (wi,1,…,wi,ki). LogCluster takes the support threshold s (1 ≤ s ≤ n) as a user given input parameter and divides event log lines into clusters C1,…,Cm, so that there are at least s lines in each cluster Cj (i.e., |Cj| ≥ s) and O is the cluster of outliers: L = C1 ∪ ... ∪ Cm ∪ O, O ∩ Cj = ∅, 1 ≤ j ≤ m. LogCluster views the log clustering problem as a pattern mining problem – each cluster Cj is uniquely identified by its line pattern pj which matches all lines in the cluster, and in order to detect clusters, LogCluster mines line patterns pj from the event log. The support of pattern pj and cluster Cj is defined as the number of lines in Cj: supp(pj) = supp(Cj) = |Cj|. Each pattern consists of words and wildcards, e.g., Interface *{1,3} down has the words Interface and down, and the wildcard *{1,3} that matches at least 1 and at most 3 words.

In order to find patterns that have the support s or higher, LogCluster relies on the following observation – all words of such patterns must occur in at least s event log lines. Therefore, LogCluster begins its work with the identification of such words. However, unlike SLCT and IPLoM, LogCluster considers each word without its position in the event log line. Formally, let Iw be the set of identifiers of lines that contain the word w: Iw = {i | li ∈ L, 1 ≤ i ≤ n, ∃j wi,j = w, 1 ≤ j ≤ ki}. The word w is frequent if |Iw| ≥ s, and the set of all frequent words is denoted by F. According to [2, 3], large event logs often contain many millions of different words, while the vast majority of them appear only a few times. In order to take advantage of this property for reducing its memory footprint, LogCluster employs a sketch of h counters c0,…,ch-1. During a preliminary pass over the event log L, each unique word of every event log line is hashed to an integer from 0 to h-1, and the corresponding sketch counter is incremented. Since the hashing function produces the output values 0…h-1 with equal probabilities, each sketch counter reflects the sum of occurrence times of approximately d / h words, where d is the number of unique words in L. However, since most words appear in only a few lines of L, most sketch counters will be smaller than the support threshold s after the data pass. Thus, the corresponding words cannot be frequent, and can be ignored during the following pass over L for finding frequent words.

After frequent words have been identified, LogCluster makes another pass over the event log L and creates cluster candidates. For each line in the event log, LogCluster extracts all frequent words from the line and arranges the words as a tuple, retaining their original order in the line. The tuple will serve as an identifier of the cluster candidate, and the line is assigned to this candidate. If the given candidate does not exist, it is initialized with the support counter set to 1, and its line pattern is created from the line. If the candidate exists, its support counter is incremented and its line pattern is adjusted to cover the current line. Note that LogCluster does not memorize individual lines assigned to a cluster candidate.

For example, if the event log line is Interface DMZ-link down at node router2, and the words Interface, down, at, and node are frequent, the line is assigned to the candidate identified by the tuple (Interface, down, at, node). If this candidate does not exist, it will be initialized by setting its line pattern to Interface *{1,1} down at node *{1,1} and its support counter to 1 (the wildcard *{1,1} matches any single word). If the next line which produces the same candidate identifier is Interface HQ link down at node router2, the candidate support counter is incremented to 2. Also, its line pattern is set to Interface *{1,2} down at node *{1,1}, making the pattern match at least one but not more than two words between Interface and down. Fig. 1 describes the candidate generation procedure in full detail.
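The two-pass frequent word detection with a word sketch described above can be rendered as a short Python sketch. This is an illustrative approximation, not the authors' Perl implementation; note that the sketch only filters out words that cannot be frequent, so the second pass never loses a truly frequent word:

```python
from collections import Counter

def frequent_words(lines, s, h=100_000):
    """Find words occurring in at least s lines, using a sketch of h
    counters to skip words that cannot possibly be frequent."""
    # Pass 1: hash each unique word of every line into h counters.
    sketch = [0] * h
    for line in lines:
        for w in set(line.split()):
            sketch[hash(w) % h] += 1
    # Pass 2: count only words whose sketch counter could reach s.
    counts = Counter()
    for line in lines:
        for w in set(line.split()):
            if sketch[hash(w) % h] >= s:
                counts[w] += 1
    return {w for w, c in counts.items() if c >= s}

log = ["Interface eth0 down", "Interface eth1 down", "Interface eth2 up"]
print(sorted(frequent_words(log, s=2)))  # ['Interface', 'down']
```

With s=2 the three example lines yield the frequent words Interface and down, mirroring the SLCT example in section II (but without word positions).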
Procedure: Generate_Candidates
Input: event log L = {l1,…,ln}
       set of frequent words F
Output: set of cluster candidates X

X := ∅
for (id = 1; id <= n; ++id) do
  tuple := ()
  vars := ()
  i := 0; v := 0
  for each w in (wid,1,…,wid,kid) do
    if (w ∈ F) then
      tuple[i] := w
      vars[i] := v
      ++i; v := 0
    else
      ++v
    fi
  done
  vars[i] := v
  k := # of elements in tuple
  if (k > 0) then
    if (∃Y ∈ X, Y.tuple == tuple) then
      ++Y.support
      for (i := 0; i < k+1; ++i) do
        if (Y.varmin[i] > vars[i]) then
          Y.varmin[i] := vars[i]
        fi
        if (Y.varmax[i] < vars[i]) then
          Y.varmax[i] := vars[i]
        fi
      done
    else
      initialize new candidate Y
      Y.tuple := tuple
      Y.support := 1
      for (i := 0; i < k+1; ++i) do
        Y.varmin[i] := vars[i]
        Y.varmax[i] := vars[i]
      done
      X := X ∪ { Y }
    fi
    Y.pattern := ()
    j := 0
    for (i := 0; i < k; ++i) do
      if (Y.varmax[i] > 0) then
        min := Y.varmin[i]
        max := Y.varmax[i]
        Y.pattern[j] := "*{min,max}"
        ++j
      fi
      Y.pattern[j] := tuple[i]
      ++j
    done
    if (Y.varmax[k] > 0) then
      min := Y.varmin[k]
      max := Y.varmax[k]
      Y.pattern[j] := "*{min,max}"
    fi
  fi
done
return X

Fig. 1. Candidate generation procedure of LogCluster.

After the data pass for generating cluster candidates is complete, LogCluster drops all candidates with a support counter value smaller than the support threshold s, and reports the remaining candidates as clusters. For each cluster, its line pattern and support are reported, while outliers are identified during an additional pass over the event log L. Due to the nature of its frequent word detection and candidate generation procedures, LogCluster is not sensitive to shifts in word positions and is able to detect patterns with wildcard tails.

When pattern mining is conducted with lower support threshold values, LogCluster is (similarly to SLCT) prone to overfitting – larger clusters might be split into smaller clusters with too specific line patterns. For example, the cluster with the pattern Interface *{1,1} down could be split into clusters with patterns Interface eth0 down, Interface eth1 down, and Interface eth2 down. Furthermore, meaningful generic patterns (e.g., Interface *{1,1} down) might disappear during cluster splitting. In order to address the overfitting problem, LogCluster employs two optional heuristics for increasing the support of more generic cluster candidates and for joining clusters. The first heuristic, called Aggregate_Supports, is applied after the candidate generation procedure has completed, immediately before clusters are selected. The heuristic involves finding candidates with more specific line patterns for each candidate, and adding the supports of such candidates to the support of the given candidate. For instance, if the candidates User bob login from 10.1.1.1, User *{1,1} login from 10.1.1.1, and User *{1,1} login from *{1,1} have supports 5, 10, and 100, respectively, the support of the candidate User *{1,1} login from *{1,1} will be increased to 115. In other words, this heuristic allows clusters to overlap.

The second heuristic, called Join_Clusters, is applied after clusters have been selected from candidates. For each frequent word w ∈ F, we define the set Cw as follows: Cw = {f | f ∈ F, Iw ∩ If ≠ ∅} (i.e., Cw contains all frequent words that co-occur with w in event log lines). If w' ∈ Cw (i.e., w' co-occurs with w), we define the dependency from w to w' as dep(w, w') = |Iw ∩ Iw'| / |Iw|. In other words, dep(w, w') reflects how frequently w' occurs in the lines which contain w. Also, note that 0 < dep(w, w') ≤ 1. If w1,…,wk are the frequent words of a line pattern (i.e., the corresponding cluster is identified by the tuple (w1,…,wk)), the weight of the word wi in this pattern is calculated as follows: weight(wi) = (∑j=1..k dep(wj, wi)) / k. Note that since dep(wi, wi) = 1, then 1/k ≤ weight(wi) ≤ 1. Intuitively, the weight of a word indicates how strongly correlated the word is with the other words in the pattern. For example, suppose the line pattern is Daemon testd killed, the words Daemon and killed always appear together, and the word testd never occurs without Daemon and killed. Thus, weight(Daemon) and weight(killed) are both 1. Also, if only 2.5% of the lines that contain both Daemon and killed also contain testd, then weight(testd) = (1 + 0.025 + 0.025) / 3 = 0.35. (We plan to implement more weight functions in future versions of the LogCluster prototype.)

The Join_Clusters heuristic takes a user supplied word weight threshold t as its input parameter (0 < t ≤ 1). For each cluster, a secondary identifier is created and initialized to the cluster's regular identifier tuple. Also, words with weights smaller than t are identified in the cluster's line pattern, and each such word is replaced with a special token in the secondary identifier. Finally, clusters with identical secondary identifiers are joined. When two or more clusters are joined, the support of the joint cluster is set to the sum of the supports of the original clusters, and the line pattern of the joint cluster is adjusted to represent the lines in all original clusters.
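The dependency and weight definitions above can be checked with a small Python sketch. This is illustrative only; line_sets stands for the sets Iw of line identifiers per word, which the prototype gathers during an extra data pass:

```python
def word_weights(tuple_words, line_sets):
    """Compute weight(wi) = (sum over j of dep(wj, wi)) / k, where
    dep(w, w') = |Iw & Iw'| / |Iw|, for the frequent words of a pattern."""
    k = len(tuple_words)
    weights = {}
    for wi in tuple_words:
        total = sum(len(line_sets[wj] & line_sets[wi]) / len(line_sets[wj])
                    for wj in tuple_words)
        weights[wi] = total / k
    return weights

# Daemon and killed co-occur in lines 0..39; testd appears only in
# line 0, i.e., in 2.5% of those 40 lines (the example above).
I = {"Daemon": set(range(40)), "killed": set(range(40)), "testd": {0}}
w = word_weights(("Daemon", "testd", "killed"), I)
# weight(Daemon) = weight(killed) = 1, weight(testd) = 0.35
```

Running this reproduces the worked numbers from the text: testd gets weight (1 + 0.025 + 0.025) / 3 = 0.35, flagging it as weakly correlated with the rest of the pattern.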
Procedure: Join_Clusters
Input: set of clusters C = {C1,…,Cp}
       word weight threshold t
       word weight function W()
Output: set of clusters C' = {C'1,…,C'm}, m ≤ p

C' := ∅
for (j = 1; j <= p; ++j) do
  tuple := Cj.tuple
  k := # of elements in tuple
  for (i := 0; i < k; ++i) do
    if (W(tuple, i) < t) then
      tuple[i] := TOKEN
    fi
  done
  if (∃Y ∈ C', Y.tuple == tuple) then
    Y.support := Y.support + Cj.support
    for (i := 0; i < k+1; ++i) do
      if (Y.varmin[i] > Cj.varmin[i]) then
        Y.varmin[i] := Cj.varmin[i]
      fi
      if (Y.varmax[i] < Cj.varmax[i]) then
        Y.varmax[i] := Cj.varmax[i]
      fi
    done
  else
    initialize new cluster Y
    Y.tuple := tuple
    Y.support := Cj.support
    for (i := 0; i < k+1; ++i) do
      Y.varmin[i] := Cj.varmin[i]
      Y.varmax[i] := Cj.varmax[i]
      if (i < k AND Y.tuple[i] == TOKEN) then
        Y.wordlist[i] := ∅
      fi
    done
    C' := C' ∪ { Y }
  fi
  Y.pattern := ()
  u := 0
  for (i := 0; i < k; ++i) do
    if (Y.varmax[i] > 0) then
      min := Y.varmin[i]
      max := Y.varmax[i]
      Y.pattern[u] := "*{min,max}"
      ++u
    fi
    if (Y.tuple[i] == TOKEN) then
      if (Cj.tuple[i] ∉ Y.wordlist[i]) then
        Y.wordlist[i] := Y.wordlist[i] ∪ { Cj.tuple[i] }
      fi
      Y.pattern[u] := "( elements of Y.wordlist[i] separated by | )"
    else
      Y.pattern[u] := Y.tuple[i]
    fi
    ++u
  done
  if (Y.varmax[k] > 0) then
    min := Y.varmin[k]
    max := Y.varmax[k]
    Y.pattern[u] := "*{min,max}"
  fi
done
return C'

Fig. 2. Cluster joining heuristic of LogCluster.

For example, if two clusters have patterns Interface *{1,1} down at node router1 and Interface *{2,3} down at node router2, and the words router1 and router2 have insufficient weights, the clusters are joined into a new cluster with the line pattern Interface *{1,3} down at node (router1|router2). Fig. 2 describes the details of the Join_Clusters heuristic. Since the line pattern of a joint cluster consists of strongly correlated words, it is less likely to suffer from overfitting. Also, words with insufficient weights are incorporated into the line pattern as lists of alternatives, representing the knowledge from the original patterns in a compact way without data loss. Finally, joining clusters will reduce their number and will thus make cluster reviewing easier for the human expert.

Fig. 3 summarizes all the techniques presented in this section and outlines the LogCluster algorithm. In the next section, we describe the LogCluster implementation and its performance.

Procedure: LogCluster
Input: event log L = {l1,…,ln}
       support threshold s
       word sketch size h (optional)
       word weight threshold t (optional)
       word weight function W() (optional)
       boolean for invoking Aggregate_Supports procedure A (optional)
       file of outliers ofile (optional)
Output: set of clusters C = {C1,…,Cm}
        the cluster of outliers O (optional)

1. if (defined(h)) then
   make a pass over L and build the word sketch of size h
   for filtering out infrequent words at step 2
2. make a pass over L and find the set of frequent words:
   F := {w | |Iw| ≥ s}
3. if (defined(t)) then
   make a pass over L and find dependencies for frequent words:
   {dep(w, w') | w ∈ F, w' ∈ Cw}
4. make a pass over L and find the set of cluster candidates X:
   X := Generate_Candidates(L, F)
5. if (defined(A) AND A == TRUE) then
   invoke Aggregate_Supports() procedure
6. find the set of clusters C:
   C := {Y ∈ X | supp(Y) ≥ s}
7. if (defined(t)) then
   join clusters: C := Join_Clusters(C, t, W)
8. report line patterns and their supports for clusters from set C
9. if (defined(ofile)) then
   make a pass over L and write outliers to ofile

Fig. 3. The LogCluster algorithm.

IV. LOGCLUSTER IMPLEMENTATION AND PERFORMANCE

For assessing the performance of the LogCluster algorithm, we have created its publicly available GNU GPLv2 licensed prototype implementation in Perl. The implementation is a UNIX command line tool that can be downloaded from http://ristov.github.io/logcluster. Apart from its clustering capabilities, the LogCluster tool supports a number of data preprocessing options which are summarized below. In order to focus on specific lines during pattern mining, a regular expression filter can be defined with the --lfilter command line option. For instance, with --lfilter='sshd\[\d+\]:' patterns are detected for sshd syslog messages (e.g., May 10 11:07:12 myhost sshd[4711]: Connection from 10.1.1.1 port 5662).
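The effect of the --lfilter option can be sketched in Python. This is an illustrative rendering of the documented behavior (the log lines below are made up; only the sshd example message comes from the text):

```python
import re

# Only lines matching the --lfilter regular expression take part in
# pattern mining; this filter selects sshd syslog messages.
lfilter = re.compile(r'sshd\[\d+\]:')
log = [
    "May 10 11:07:12 myhost sshd[4711]: Connection from 10.1.1.1 port 5662",
    "May 10 11:07:13 myhost crond[1201]: session opened for user root",
]
selected = [line for line in log if lfilter.search(line)]
print(len(selected))  # 1
```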
If a template string is given with the --template option, match variables set by the regular expression of the --lfilter option are substituted into the template string, and the resulting string replaces the original event log line during the mining. For example, with the use of the --lfilter='(sshd\[\d+\]: .+)' and --template='$1' options, timestamps and hostnames are removed from sshd syslog messages before any other processing. If a regular expression is given with the --separator option, any sequence of characters that matches this expression is treated as a word delimiter (the word delimiter defaults to whitespace).

Existing line pattern mining tools treat words as atoms during the mining process, and make no attempt to discover potential structure inside words (the only exception is SLCT which includes a simple post-processing option for detecting constant heads and tails for wildcards). In order to address this shortcoming, LogCluster implements several options for masking specific word parts and creating word classes. If a word matches the regular expression given with the --wfilter option, a word class is created for the word by searching it for substrings that match another regular expression provided with the --wsearch option. All matching substrings are then replaced with the string specified with the --wreplace option. For example, with the use of the --wfilter='=', --wsearch='=.+', and --wreplace='=VALUE' options, word classes are created for words which contain the equal sign (=) by replacing the characters after the equal sign with the string VALUE. Thus, for the words pid=12763 and user=bob, the classes pid=VALUE and user=VALUE are created. If a word is infrequent but its word class is frequent, the word class replaces the word during the mining process and will be treated like a frequent word. Since classes can represent many infrequent words, their presence in line patterns provides valuable information about regularities in word structure that would not be detected otherwise.

For evaluating the performance of LogCluster and comparing it with other algorithms, we conducted a number of experiments with larger event logs. For the sake of fair comparison, we re-implemented the public C-based version of SLCT in Perl. Since the implementations of IPLoM and the algorithm by Reidemeister et al. are not publicly available, we were unable to study their source code for creating their exact prototypes. However, because the algorithm by Reidemeister et al. uses SLCT and has a similar time complexity (see section II), its runtimes are closely approximated by the results for SLCT. During our experiments, we used 6 logs from a large institution of a national critical information infrastructure of an EU state. The logs cover a 24 hour timespan (May 8, 2015), and originate from a wide range of sources, including database systems, web proxies, mail servers, firewalls, and network devices. We also used an availability monitoring system event log from the NATO CCD COE Locked Shields 2015 cyber defense exercise which covers the entire two-day exercise and contains Nagios events. During the experiments, we clustered each log file three times with support thresholds set to 1%, 0.5% and 0.1% of the lines in the log. We also used a word sketch of 100,000 counters (parameter h in Fig. 3) for both LogCluster and SLCT, and did not employ the Aggregate_Supports and Join_Clusters heuristics. Therefore, both LogCluster and SLCT were configured to make three passes over the data set, in order to build the word sketch during the first pass, detect frequent words during the second pass, and generate cluster candidates during the third pass. All experiments were conducted on a Linux virtual server with an Intel Xeon E5-2680 CPU and 64GB of memory, and Table I outlines the results. Since the LogCluster and SLCT implementations are both single-threaded and their CPU utilization was 100% according to the Linux time utility during all 21 experiments, each runtime in Table I closely matches the consumed CPU time.
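The word class creation described earlier in this section (options --wfilter, --wsearch, and --wreplace) can be sketched in Python. This is an illustrative rendering of the documented semantics, not the tool's actual code:

```python
import re

def word_class(word, wfilter, wsearch, wreplace):
    """If the word matches wfilter, replace every substring matching
    wsearch with wreplace; otherwise return the word unchanged."""
    if re.search(wfilter, word):
        return re.sub(wsearch, wreplace, word)
    return word

# Reproduces the example above:
# --wfilter='=' --wsearch='=.+' --wreplace='=VALUE'
print(word_class("pid=12763", "=", "=.+", "=VALUE"))  # pid=VALUE
print(word_class("user=bob", "=", "=.+", "=VALUE"))   # user=VALUE
```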

TABLE I. PERFORMANCE OF LOGCLUSTER AND SLCT

Row # | Event log type | Size (MB) | Size (lines) | Support threshold | Clusters (LogCluster) | Runtime in s (LogCluster) | Clusters (SLCT) | Runtime in s (SLCT)
1 | Authorization messages | 3800.1 | 7,757,440 | 7,757 | 49 | 3146.42 | 89 | 1969.04
2 | Authorization messages | 3800.1 | 7,757,440 | 38,787 | 32 | 3070.18 | 37 | 1892.41
3 | Authorization messages | 3800.1 | 7,757,440 | 77,574 | 9 | 3050.20 | 15 | 1911.93
4 | UNIX daemon messages | 740.2 | 5,778,847 | 5,778 | 150 | 692.08 | 158 | 479.90
5 | UNIX daemon messages | 740.2 | 5,778,847 | 28,894 | 40 | 682.95 | 44 | 462.85
6 | UNIX daemon messages | 740.2 | 5,778,847 | 57,788 | 12 | 667.82 | 16 | 470.48
7 | Application messages | 9363.0 | 34,516,290 | 34,516 | 109 | 5225.32 | 114 | 3674.47
8 | Application messages | 9363.0 | 34,516,290 | 172,581 | 16 | 4891.51 | 25 | 3559.36
9 | Application messages | 9363.0 | 34,516,290 | 345,162 | 5 | 4765.09 | 8 | 3517.67
10 | Network device messages | 4705.0 | 12,522,620 | 12,522 | 193 | 3181.97 | 195 | 2015.52
11 | Network device messages | 4705.0 | 12,522,620 | 62,613 | 31 | 3083.16 | 33 | 2000.98
12 | Network device messages | 4705.0 | 12,522,620 | 125,226 | 17 | 3080.66 | 19 | 1945.69
13 | Web proxy messages | 16681.5 | 49,376,464 | 49,376 | 105 | 8487.37 | 111 | 5409.23
14 | Web proxy messages | 16681.5 | 49,376,464 | 246,882 | 14 | 8128.34 | 14 | 5277.54
15 | Web proxy messages | 16681.5 | 49,376,464 | 493,764 | 5 | 8081.30 | 5 | 5244.96
16 | Mail server messages | 246.0 | 1,230,532 | 1,230 | 129 | 144.42 | 139 | 96.34
17 | Mail server messages | 246.0 | 1,230,532 | 6,152 | 40 | 141.83 | 40 | 96.85
18 | Mail server messages | 246.0 | 1,230,532 | 12,305 | 21 | 142.34 | 23 | 94.12
19 | Nagios messages | 391.9 | 3,400,185 | 3,400 | 45 | 435.76 | 46 | 316.77
20 | Nagios messages | 391.9 | 3,400,185 | 17,000 | 39 | 412.08 | 41 | 320.26
21 | Nagios messages | 391.9 | 3,400,185 | 34,001 | 19 | 409.87 | 22 | 318.25
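As a quick arithmetic cross-check of Table I (an illustrative computation, not part of the original evaluation), the per-row ratio of LogCluster to SLCT runtimes can be computed directly from the runtime columns:

```python
# Runtime pairs (LogCluster seconds, SLCT seconds) from Table I, rows 1-21.
runtimes = [(3146.42, 1969.04), (3070.18, 1892.41), (3050.20, 1911.93),
            (692.08, 479.90), (682.95, 462.85), (667.82, 470.48),
            (5225.32, 3674.47), (4891.51, 3559.36), (4765.09, 3517.67),
            (3181.97, 2015.52), (3083.16, 2000.98), (3080.66, 1945.69),
            (8487.37, 5409.23), (8128.34, 5277.54), (8081.30, 5244.96),
            (144.42, 96.34), (141.83, 96.85), (142.34, 94.12),
            (435.76, 316.77), (412.08, 320.26), (409.87, 318.25)]
ratios = [lc / slct for lc, slct in runtimes]
print(min(ratios), max(ratios))
```

The minimum and maximum ratios fall in the 1.28-1.63 range, which is consistent with the speedup figure quoted in the discussion of the results.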
May 8 *{1,1} myserver dhcpd: DHCPREQUEST for
*{1,2} from *{1,2} via *{1,4}

May 8 *{3,3} Note: no *{1,3} sensors

May 8 *{3,3} RT_IPSEC: %USER-3-RT_IPSEC_REPLAY:
Replay packet detected on IPSec tunnel on *{1,1}
with tunnel ID *{1,1} From *{1,1} to *{1,1} ESP,
SPI *{1,1} SEQ *{1,1}

May 8 *{1,1} myserver httpd: client *{1,1} request
GET *{1,1} HTTP/1.1 referer *{1,1} User-agent
Mozilla/5.0 *{3,4} rv:37.0) Gecko/20100101
Firefox/37.0 *{0,1}

May 8 *{1,1} myserver httpd: client *{1,1} request
GET *{1,1} HTTP/1.1 referer *{1,1} User-agent
Mozilla/5.0 (Windows NT *{1,3} AppleWebKit/537.36
(KHTML, like Gecko) Chrome/42.0.2311.135
Safari/537.36

Fig. 4. Sample clusters detected by LogCluster (for reasons of privacy, sensitive data have been obfuscated).

As the results indicate, SLCT was 1.28–1.62 times faster than LogCluster. This is due to the simpler candidate generation procedure of SLCT – when processing individual event log lines, SLCT does not have to check the line patterns of candidates and adjust them if needed. However, both algorithms require a considerable amount of time for clustering very large log files. For example, for processing the largest event log of 16.3GB (rows 13-15 in Table I), SLCT needed about 1.5 hours, while for LogCluster the runtime exceeded 2 hours. In contrast, the C-based version of SLCT accomplishes the same three tasks in 18-19 minutes. Therefore, we expect a C implementation of LogCluster to be significantly faster.

According to Table I, LogCluster finds fewer clusters than SLCT in all experiments (some clusters are depicted in Fig. 4). The reviewing of detected clusters revealed that unlike SLCT, LogCluster was able to discover a single cluster for lines where frequent words were separated by a variable number of infrequent words. For example, the first cluster in Fig. 4 properly captures all DHCP request events. In contrast, SLCT discovered two clusters May 8 * myserver dhcpd: DHCPREQUEST for * from * * via and May 8 * myserver dhcpd: DHCPREQUEST for * * from * * via which still do not cover all possible event formats. Also, the last two clusters in Fig. 4 represent all HTTP requests originating from the latest stable versions of the Firefox browser on all OS platforms and the Chrome browser on all Windows platforms, respectively (all OS platform strings are matched by *{3,4} for Firefox, while Windows NT *{1,3} matches all Windows platform strings for Chrome). As in the previous case, SLCT was unable to discover two equivalent clusters that would concisely capture HTTP request events for these two browser types.

When evaluating the Join_Clusters heuristic, we found that word weight thresholds (parameter t in Fig. 3) between 0.5 and 0.8 produced the best joint clusters. Fig. 5 displays three sample joint clusters which were detected from the mail server and Nagios logs (rows 16-21 in Table I). Fig. 5 also illustrates the use of word classes: for the mail server log, a word class is created for each word that contains punctuation marks, so that all sequences of non-punctuation characters which are not followed by the equal sign (=) or an opening square bracket ([) are replaced with a single X character. For the Nagios log, word classes are employed for masking blue team numbers in host names, and trailing timestamps are removed from each event log line with the --lfilter and --template options. The first two clusters in Fig. 5 are both created by joining three clusters, while the last cluster is the union of twelve clusters which represent Nagios SSH service check events for 192 servers.

logcluster.pl --support=12305 \
  --input=mail.log --wfilter='[[:punct:]]' \
  --wsearch='[^[:punct:]]++(?![[=])' \
  --wreplace=X --wweight=0.75

May 8 X:X:X (myserver1|myserver2|myserver3)
sendmail[X]: STARTTLS=client,
(relay=relayserver1,|relay=relayserver2,
|relay=relayserver3,) version=TLSv1/SSLv3,
(verify=FAIL,|verify=OK,) (cipher=DHE-RSA-AES256-
SHA,|cipher=AES128-SHA,|cipher=RC4-SHA,)
(bits=256/256|bits=128/128)

May 8 X:X:X (myserver1|myserver2|myserver3)
sendmail[X]: X: from=<myrobot@mydomain>, size=X,
class=0, nrcpts=1, msgid=<X.X@X.X>,
bodytype=8BITMIME, proto=ESMTP, daemon=MTA,
(relay=relayserver1|relay=relayserver2)
([ipaddress1]|[ipaddress2])

logcluster.pl --support=3400 \
  --input=ls15.log --separator='["|\s]+' \
  --lfilter='^(.*)(?:\|"\d+"){2}' --template='$1' \
  --wfilter='blue\d\d' --wsearch='blue\d\d' \
  --wreplace='blueNN' --wweight=0.5

(ws4-01.lab.blueNN.ex|ws4-04.lab.blueNN.ex
|ws4-03.int.blueNN.ex|ws4-04.int.blueNN.ex
|ws4-02.int.blueNN.ex|ws4-05.lab.blueNN.ex
|ws4-05.int.blueNN.ex|dlna.lab.blueNN.ex
|ws4-01.int.blueNN.ex|ws4-02.lab.blueNN.ex
|ws4-03.lab.blueNN.ex|git.lab.blueNN.ex)
(ssh|ssh.ipv6) OK SSH OK -
(OpenSSH_6.6.1p1|OpenSSH_5.9p1|OpenSSH_6.6.1_hpn13v11)
(Ubuntu-2ubuntu2|FreeBSD-20140420|Debian-5ubuntu1
|Debian-5ubuntu1.4) (protocol 2.0)

Fig. 5. Sample joint clusters detected by LogCluster (for reasons of privacy, sensitive data have been obfuscated).

V. CONCLUSION

In this paper, we have described the LogCluster algorithm for mining patterns from event logs. For future work, we plan to explore hierarchical event log clustering techniques. We also plan to implement the LogCluster algorithm in C, and use LogCluster for automated building of user behavior profiles.

ACKNOWLEDGMENT

The authors thank NATO CCD COE for making Locked Shields 2015 event logs available for this research. The authors
data preprocessing capabilities of the LogCluster tool. For the also thank Mr. Kaido Raiend, Mr. Ants Leitmäe, Mr. Andrus
mail server log, a word class is created for each word which Tamm, Dr. Paul Leis and Mr. Ain Rasva for their support.
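To make the word class preprocessing shown in the Nagios example concrete, the following Python sketch (an illustration only, not the LogCluster implementation, which is written in Perl) mimics the effect of the --wfilter, --wsearch and --wreplace options: for every word matching the filter regular expression, each substring matching the search expression is replaced with the replacement string, so that words like the hypothetical hostnames ws4-01.lab.blue07.ex and ws4-01.lab.blue12.ex collapse into the common class ws4-01.lab.blueNN.ex.

```python
import re

def word_class(word, wfilter, wsearch, wreplace):
    """Sketch of LogCluster-style word class creation: if a word matches
    the filter regexp, rewrite every match of the search regexp with the
    replacement string; otherwise leave the word unchanged."""
    if re.search(wfilter, word):
        return re.sub(wsearch, wreplace, word)
    return word

# Hypothetical input words; 'blue07' and 'blue12' stand in for the
# obfuscated team numbers behind 'blueNN' in the sample cluster.
words = ["ws4-01.lab.blue07.ex", "ws4-01.lab.blue12.ex", "OK"]
classes = [word_class(w, r"blue\d\d", r"blue\d\d", "blueNN") for w in words]
print(classes)  # → ['ws4-01.lab.blueNN.ex', 'ws4-01.lab.blueNN.ex', 'OK']
```

With such classes in place, lines that differ only in the team number are considered to contain the same frequent word, which is what allows a single joint cluster to cover hosts from all blueNN domains.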