LogCluster is a data clustering and pattern mining algorithm for event logs that was developed to address shortcomings in previous algorithms. It discovers both frequently occurring line patterns and outlier events from textual event logs. The algorithm begins by identifying frequent words that occur in at least a minimum number of lines. It then mines line patterns from the event log by considering words without their positions. Clusters are uniquely identified by their line patterns, which can include words and wildcards.
LogCluster - A Data Clustering and Pattern Mining
Algorithm for Event Logs
Risto Vaarandi and Mauno Pihelgas
TUT Centre for Digital Forensics and Cyber Security
Tallinn University of Technology
Tallinn, Estonia
firstname.lastname@ttu.ee
Abstract—Modern IT systems often produce large volumes of event logs, and event pattern discovery is an important log management task. For this purpose, data mining methods have been suggested in many previous works. In this paper, we present the LogCluster algorithm which implements data clustering and line pattern mining for textual event logs. The paper also describes an open source implementation of LogCluster.

Keywords—event log analysis; mining patterns from event logs; event log clustering; data clustering; data mining

I. INTRODUCTION

During the last decade, data centers and computer networks have grown significantly in processing power, size, and complexity. As a result, organizations commonly have to handle many gigabytes of log data on a daily basis. For example, in our recent paper we have described a security log management system which receives nearly 100 million events each day [1]. In order to ease the management of log data, many research papers have suggested the use of data mining methods for discovering event patterns from event logs [2–20]. This knowledge can be employed for many different purposes like the development of event correlation rules [12–16], detection of system faults and network anomalies [6–9, 19], visualization of relevant event patterns [17, 18], identification and reporting of network traffic patterns [4, 20], and automated building of IDS alarm classifiers [5].

In order to analyze large amounts of textual log data without well-defined structure, several data mining methods have been proposed in the past which focus on the detection of line patterns from textual event logs. Suggested algorithms have been mostly based on data clustering approaches [2, 6, 7, 8, 10, 11]. The algorithms assume that each event is described by a single line in the event log, and each line pattern represents a group of similar events.

In this paper, we propose a novel data clustering algorithm called LogCluster which discovers both frequently occurring line patterns and outlier events from textual event logs. The remainder of this paper is organized as follows – section II provides an overview of related work, section III presents the LogCluster algorithm, section IV describes the LogCluster prototype implementation and experiments for evaluating its performance, and section V concludes the paper.

This work has been supported by Estonian IT Academy (StudyITin.ee) and SEB Estonia.

II. RELATED WORK

One of the earliest event log clustering algorithms is SLCT, which is designed for mining line patterns and outlier events from textual event logs [2]. During the clustering process, SLCT assigns event log lines that fit the same pattern (e.g., Interface * down) to the same cluster, and all detected clusters are reported to the user as line patterns. For finding clusters in log data, the user has to supply the support threshold value s to SLCT which defines the minimum number of lines in each cluster. SLCT begins the clustering with a pass over the input data set, in order to identify frequent words which occur in at least s lines (the word delimiter is customizable and defaults to whitespace). Also, each word is considered with its position in the line. For example, if s=2 and the data set contains the lines

  Interface eth0 down
  Interface eth1 down
  Interface eth2 up

then the words (Interface,1) and (down,3) occur in three and two lines, respectively, and are thus identified as frequent words. SLCT will then make another pass over the data set and create cluster candidates. When a line is processed during the data pass, all frequent words from the line are joined into a set which will act as a candidate for this line. After the data pass, candidates generated for at least s lines are reported as clusters together with their supports (occurrence times). Outliers are identified during an optional data pass and written to a user-specified file. For example, if s=2 then two cluster candidates {(Interface,1), (down,3)} and {(Interface,1)} are detected with supports 2 and 1, respectively. Thus, {(Interface,1), (down,3)} is the only cluster and is reported to the user as the line pattern Interface * down (since there is no word associated with the second position, an asterisk is printed for denoting a wildcard). The reported cluster covers the first two lines, while the line Interface eth2 up is considered an outlier.

SLCT has several shortcomings which have been pointed out in some recent works. Firstly, it is not able to detect wildcards after the last word in a line pattern [11]. For instance, if s=3 for the three example lines above, the cluster {(Interface,1)} is reported to the user as the line pattern Interface, although most users would prefer the pattern Interface * *. Secondly, since word positions are encoded into words, the algorithm is
sensitive to shifts in word positions and delimiter noise [8]. For instance, the line Interface HQ Link down would not be assigned to the cluster Interface * down, but would rather generate a separate cluster candidate. Finally, low support thresholds can lead to overfitting when larger clusters are split and the resulting patterns are too specific [2].

Reidemeister, Jiang, Munawar and Ward [6, 7, 8] developed a methodology that addresses some of the above shortcomings. The methodology uses event log mining techniques for diagnosing recurrent faults in software systems. First, a modified version of SLCT is used for mining line patterns from labeled event logs. In order to handle clustering errors caused by shifts in word positions and delimiter noise, line patterns from SLCT are clustered with a single-linkage clustering algorithm which employs a variant of the Levenshtein distance function. After that, a common line pattern description is established for each cluster of line patterns. According to [8], single-linkage clustering and post-processing its results add minimal runtime overhead to the clustering by SLCT. The final results are converted into bit vectors and used for building decision-tree classifiers, in order to identify recurrent faults in future event logs.

Another clustering algorithm that mines line patterns from event logs is IPLoM by Makanju, Zincir-Heywood and Milios [10, 11]. Unlike SLCT, IPLoM is a hierarchical clustering algorithm which starts with the entire event log as a single partition, and splits partitions iteratively during three steps. Like SLCT, IPLoM considers words with their positions in event log lines, and is therefore sensitive to shifts in word positions. During the first step, the initial partition is split by assigning lines with the same number of words to the same partition. During the second step, each partition is divided further by identifying the word position with the least number of unique words, and splitting the partition by assigning lines with the same word to the same partition. During the third step, partitions are split based on associations between word pairs. At the final stage of the algorithm, a line pattern is derived for each partition. Due to its hierarchical nature, IPLoM does not need the support threshold, but takes several other parameters (such as partition support threshold and cluster goodness threshold) which impose fine-grained control over the splitting of partitions [11]. As argued in [11], one advantage of IPLoM over SLCT is its ability to detect line patterns with wildcard tails (e.g., Interface * *), and the author has reported higher precision and recall for IPLoM.

III. THE LOGCLUSTER ALGORITHM

The LogCluster algorithm is designed for addressing the shortcomings of existing event log clustering algorithms that were discussed in the previous section. Let L = {l1,...,ln} be a textual event log which consists of n lines, where each line li (1 ≤ i ≤ n) is a complete representation of some event and i is a unique line identifier. We assume that each line li ∈ L is a sequence of ki words: li = (wi,1,...,wi,ki). LogCluster takes the support threshold s (1 ≤ s ≤ n) as a user given input parameter and divides event log lines into clusters C1,...,Cm, so that there are at least s lines in each cluster Cj (i.e., |Cj| ≥ s) and O is the cluster of outliers: L = C1 ∪ ... ∪ Cm ∪ O, O ∩ Cj = ∅, 1 ≤ j ≤ m. LogCluster views the log clustering problem as a pattern mining problem – each cluster Cj is uniquely identified by its line pattern pj which matches all lines in the cluster, and in order to detect clusters, LogCluster mines line patterns pj from the event log. The support of pattern pj and cluster Cj is defined as the number of lines in Cj: supp(pj) = supp(Cj) = |Cj|. Each pattern consists of words and wildcards, e.g., Interface *{1,3} down has the words Interface and down, and the wildcard *{1,3} that matches at least 1 and at most 3 words.

In order to find patterns that have the support s or higher, LogCluster relies on the following observation – all words of such patterns must occur in at least s event log lines. Therefore, LogCluster begins its work with the identification of such words. However, unlike SLCT and IPLoM, LogCluster considers each word without its position in the event log line. Formally, let Iw be the set of identifiers of lines that contain the word w: Iw = {i | li ∈ L, 1 ≤ i ≤ n, ∃j: wi,j = w, 1 ≤ j ≤ ki}. The word w is frequent if |Iw| ≥ s, and the set of all frequent words is denoted by F. According to [2, 3], large event logs often contain many millions of different words, while the vast majority of them appear only a few times. In order to take advantage of this property for reducing its memory footprint, LogCluster employs a sketch of h counters c0,...,ch-1. During a preliminary pass over event log L, each unique word of every event log line is hashed to an integer from 0 to h-1, and the corresponding sketch counter is incremented. Since the hashing function produces the output values 0...h-1 with equal probabilities, each sketch counter reflects the sum of occurrence times of approximately d / h words, where d is the number of unique words in L. However, since most words appear in only a few lines of L, most sketch counters will be smaller than the support threshold s after the data pass. Thus, the corresponding words cannot be frequent, and can be ignored during the following pass over L for finding frequent words.

After frequent words have been identified, LogCluster makes another pass over event log L and creates cluster candidates. For each line in the event log, LogCluster extracts all frequent words from the line and arranges the words as a tuple, retaining their original order in the line. The tuple will serve as an identifier of the cluster candidate, and the line is assigned to this candidate. If the given candidate does not exist, it is initialized with the support counter set to 1, and its line pattern is created from the line. If the candidate exists, its support counter is incremented and its line pattern is adjusted to cover the current line. Note that LogCluster does not memorize individual lines assigned to a cluster candidate.

For example, if the event log line is Interface DMZ-link down at node router2, and the words Interface, down, at, and node are frequent, the line is assigned to the candidate identified by the tuple (Interface, down, at, node). If this candidate does not exist, it will be initialized by setting its line pattern to Interface *{1,1} down at node *{1,1} and its support counter to 1 (the wildcard *{1,1} matches any single word). If the next line which produces the same candidate identifier is Interface HQ link down at node router2, the candidate support counter is incremented to 2. Also, its line pattern is set to Interface *{1,2} down at node *{1,1}, making the pattern match at least one but not more than two words between Interface and down. Fig. 1 describes the candidate generation procedure in full detail.

  Procedure: Generate_Candidates
  Input: event log L = {l1,...,ln}
         set of frequent words F
  Output: set of cluster candidates X

  X := ∅
  for (id = 1; id <= n; ++id) do
    tuple := ()
    vars := ()
    i := 0; v := 0
    for each w in (wid,1,...,wid,kid) do
      if (w ∈ F) then
        tuple[i] := w
        vars[i] := v
        ++i; v := 0
      else
        ++v
      fi
    done
    vars[i] := v
    k := # of elements in tuple
    if (k > 0) then
      if (∃Y ∈ X, Y.tuple == tuple) then
        ++Y.support
        for (i := 0; i < k+1; ++i) do
          if (Y.varmin[i] > vars[i]) then
            Y.varmin[i] := vars[i]
          fi
          if (Y.varmax[i] < vars[i]) then
            Y.varmax[i] := vars[i]
          fi
        done
      else
        initialize new candidate Y
        Y.tuple := tuple
        Y.support := 1
        for (i := 0; i < k+1; ++i) do
          Y.varmin[i] := vars[i]
          Y.varmax[i] := vars[i]
        done
        X := X ∪ { Y }
      fi
      Y.pattern := ()
      j := 0
      for (i := 0; i < k; ++i) do
        if (Y.varmax[i] > 0) then
          min := Y.varmin[i]
          max := Y.varmax[i]
          Y.pattern[j] := “*{min,max}”
          ++j
        fi
        Y.pattern[j] := tuple[i]
        ++j
      done
      if (Y.varmax[k] > 0) then
        min := Y.varmin[k]
        max := Y.varmax[k]
        Y.pattern[j] := “*{min,max}”
      fi
    fi
  done
  return X

Fig. 1. Candidate generation procedure of LogCluster.

Due to the nature of its frequent word detection and candidate generation procedures, LogCluster is not sensitive to shifts in word positions and is able to detect patterns with wildcard tails.

When pattern mining is conducted with lower support threshold values, LogCluster is (similarly to SLCT) prone to overfitting – larger clusters might be split into smaller clusters with too specific line patterns. For example, the cluster with the pattern Interface *{1,1} down could be split into clusters with the patterns Interface *{1,1} down, Interface eth1 down, and Interface eth2 down. Furthermore, meaningful generic patterns (e.g., Interface *{1,1} down) might disappear during cluster splitting. In order to address the overfitting problem, LogCluster employs two optional heuristics for increasing the support of more generic cluster candidates and for joining clusters.

The first heuristic is called Aggregate_Supports and is applied after the candidate generation procedure has been completed, immediately before clusters are selected. The heuristic involves finding candidates with more specific line patterns for each candidate, and adding the supports of such candidates to the support of the given candidate. For instance, if the candidates User bob login from 10.1.1.1, User *{1,1} login from 10.1.1.1, and User *{1,1} login from *{1,1} have supports 5, 10, and 100, respectively, the support of the candidate User *{1,1} login from *{1,1} will be increased to 115. In other words, this heuristic allows clusters to overlap.

The second heuristic is called Join_Clusters and is applied after clusters have been selected from candidates. For each frequent word w ∈ F, we define the set Cw as follows: Cw = {f | f ∈ F, Iw ∩ If ≠ ∅} (i.e., Cw contains all frequent words that co-occur with w in event log lines). If w’ ∈ Cw (i.e., w’ co-occurs with w), we define the dependency from w to w’ as dep(w, w’) = |Iw ∩ Iw’| / |Iw|. In other words, dep(w, w’) reflects how frequently w’ occurs in lines which contain w. Also, note that 0 < dep(w, w’) ≤ 1. If w1,...,wk are the frequent words of a line pattern (i.e., the corresponding cluster is identified by the tuple (w1,...,wk)), the weight of the word wi in this pattern is calculated as follows: weight(wi) = (Σj=1..k dep(wj, wi)) / k. Note that since dep(wi, wi) = 1, then 1/k ≤ weight(wi) ≤ 1. Intuitively, the weight of the word indicates how strongly correlated the word is with other words in the pattern. For example, suppose the line pattern is Daemon testd killed, the words Daemon and killed always appear together, and the word testd never occurs without Daemon and killed. Thus, weight(Daemon) and weight(killed) are both 1. Also, if only 2.5% of the lines that contain both Daemon and killed also contain testd, then weight(testd) = (1 + 0.025 + 0.025) / 3 = 0.35. (We plan to implement more weight functions in future versions of the LogCluster prototype.)

The Join_Clusters heuristic takes the user supplied word weight threshold t as its input parameter (0 < t ≤ 1). For each cluster, a secondary identifier is created and initialized to the cluster’s regular identifier tuple.
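The dependency and weight definitions above can be illustrated with a short Python sketch (an illustration only, not the authors' Perl implementation); the toy log below reproduces the Daemon testd killed example, where testd occurs in 1 of 40 lines (2.5%) that contain Daemon and killed:

```python
# Illustrative sketch of dep() and weight() from the Join_Clusters heuristic.
# Iw maps each frequent word w to the set of identifiers of lines containing w.

def dep(Iw, w1, w2):
    # dep(w1, w2) = |I_w1 ∩ I_w2| / |I_w1|: how often w2 co-occurs with w1
    return len(Iw[w1] & Iw[w2]) / len(Iw[w1])

def weight(Iw, tuple_words, i):
    # weight of the i-th word: average dependency from all pattern words to it
    wi = tuple_words[i]
    return sum(dep(Iw, wj, wi) for wj in tuple_words) / len(tuple_words)

# Toy event log: Daemon and killed always co-occur; testd appears once in 40 lines.
lines = [["Daemon", "testd", "killed"]] + [["Daemon", "killed"]] * 39
Iw = {}
for line_id, line in enumerate(lines):
    for w in set(line):
        Iw.setdefault(w, set()).add(line_id)

tup = ("Daemon", "testd", "killed")
w_daemon = weight(Iw, tup, 0)  # (1 + 1 + 1) / 3 = 1
w_testd = weight(Iw, tup, 1)   # (0.025 + 1 + 0.025) / 3 = 0.35
```

The computed values match the worked example in the text: Daemon (and likewise killed) has weight 1, while the weakly correlated testd gets weight 0.35 and would be replaced by the special token for any threshold t above that value.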
Also, words with weights smaller than t are identified in the cluster’s line pattern, and each such word is replaced with a special token in the secondary identifier. Finally, clusters with identical secondary identifiers are joined. When two or more clusters are joined, the support of the joint cluster is set to the sum of the supports of the original clusters, and the line pattern of the joint cluster is adjusted to represent the lines in all original clusters. For example, if two clusters have the patterns Interface *{1,1} down at node router1 and Interface *{2,3} down at node router2, and the words router1 and router2 have insufficient weights, the clusters are joined into a new cluster with the line pattern Interface *{1,3} down at node (router1|router2). Fig. 2 describes the details of the Join_Clusters heuristic. Since the line pattern of a joint cluster consists of strongly correlated words, it is less likely to suffer from overfitting. Also, words with insufficient weights are incorporated into the line pattern as lists of alternatives, representing the knowledge from the original patterns in a compact way without data loss. Finally, joining clusters will reduce their number and will thus make cluster reviewing easier for the human expert.

  Procedure: Join_Clusters
  Input: set of clusters C = {C1,...,Cp}
         word weight threshold t
         word weight function W()
  Output: set of clusters C’ = {C’1,...,C’m}, m ≤ p

  C’ := ∅
  for (j = 1; j <= p; ++j) do
    tuple := Cj.tuple
    k := # of elements in tuple
    for (i := 0; i < k; ++i) do
      if (W(tuple, i) < t) then
        tuple[i] := TOKEN
      fi
    done
    if (∃Y ∈ C’, Y.tuple == tuple) then
      Y.support := Y.support + Cj.support
      for (i := 0; i < k+1; ++i) do
        if (Y.varmin[i] > Cj.varmin[i]) then
          Y.varmin[i] := Cj.varmin[i]
        fi
        if (Y.varmax[i] < Cj.varmax[i]) then
          Y.varmax[i] := Cj.varmax[i]
        fi
      done
    else
      initialize new cluster Y
      Y.tuple := tuple
      Y.support := Cj.support
      for (i := 0; i < k+1; ++i) do
        Y.varmin[i] := Cj.varmin[i]
        Y.varmax[i] := Cj.varmax[i]
        if (i < k AND Y.tuple[i] == TOKEN) then
          Y.wordlist[i] := ∅
        fi
      done
      C’ := C’ ∪ { Y }
    fi
    Y.pattern := ()
    jj := 0
    for (i := 0; i < k; ++i) do
      if (Y.varmax[i] > 0) then
        min := Y.varmin[i]
        max := Y.varmax[i]
        Y.pattern[jj] := “*{min,max}”
        ++jj
      fi
      if (Y.tuple[i] == TOKEN) then
        if (Cj.tuple[i] ∉ Y.wordlist[i]) then
          Y.wordlist[i] := Y.wordlist[i] ∪ { Cj.tuple[i] }
        fi
        Y.pattern[jj] := “( elements of Y.wordlist[i] separated by | )”
      else
        Y.pattern[jj] := Y.tuple[i]
      fi
      ++jj
    done
    if (Y.varmax[k] > 0) then
      min := Y.varmin[k]
      max := Y.varmax[k]
      Y.pattern[jj] := “*{min,max}”
    fi
  done
  return C’

Fig. 2. Cluster joining heuristic of LogCluster.

After the data pass for generating cluster candidates is complete, LogCluster drops all candidates with a support counter value smaller than the support threshold s, and reports the remaining candidates as clusters. For each cluster, its line pattern and support are reported, while outliers are identified during an additional pass over event log L. Fig. 3 summarizes all techniques presented in this section and outlines the LogCluster algorithm.

  Procedure: LogCluster
  Input: event log L = {l1,...,ln}
         support threshold s
         word sketch size h (optional)
         word weight threshold t (optional)
         word weight function W() (optional)
         boolean A for invoking the Aggregate_Supports procedure (optional)
         file of outliers ofile (optional)
  Output: set of clusters C = {C1,...,Cm}
          the cluster of outliers O (optional)

  1. if (defined(h)) then
       make a pass over L and build the word sketch of size h
       for filtering out infrequent words at step 2
  2. make a pass over L and find the set of frequent words:
       F := {w | |Iw| ≥ s}
  3. if (defined(t)) then
       make a pass over L and find dependencies for frequent words:
       {dep(w, w’) | w ∈ F, w’ ∈ Cw}
  4. make a pass over L and find the set of cluster candidates X:
       X := Generate_Candidates(L, F)
  5. if (defined(A) AND A == TRUE) then
       invoke the Aggregate_Supports() procedure
  6. find the set of clusters C:
       C := {Y ∈ X | supp(Y) ≥ s}
  7. if (defined(t)) then
       join clusters: C := Join_Clusters(C, t, W)
  8. report line patterns and their supports for clusters from set C
  9. if (defined(ofile)) then
       make a pass over L and write outliers to ofile

Fig. 3. The LogCluster algorithm.

In the next section, we describe the LogCluster implementation and its performance.

IV. LOGCLUSTER IMPLEMENTATION AND PERFORMANCE

For assessing the performance of the LogCluster algorithm, we have created its publicly available GNU GPLv2 licensed prototype implementation in Perl. The implementation is a UNIX command line tool that can be downloaded from http://ristov.github.io/logcluster. Apart from its clustering capabilities, the LogCluster tool supports a number of data preprocessing options which are summarized below. In order to focus on specific lines during pattern mining, a regular expression filter can be defined with the --lfilter command line option. For instance, with --lfilter='sshd\[\d+\]:' patterns are detected for sshd syslog messages (e.g., May 10 11:07:12 myhost sshd[4711]: Connection from 10.1.1.1 port 5662).

If a template string is given with the --template option, match variables set by the regular expression of the --lfilter option are substituted into the template string, and the resulting string replaces the original event log line during the mining. For example, with the use of the --lfilter='(sshd\[\d+\]: .+)' and --template='$1' options, timestamps and hostnames are removed from sshd syslog messages before any other processing. If a regular expression is given with the --separator option, any sequence of characters that matches this expression is treated as a word delimiter (the word delimiter defaults to whitespace).

Existing line pattern mining tools treat words as atoms during the mining process, and make no attempt to discover potential structure inside words (the only exception is SLCT which includes a simple post-processing option for detecting constant heads and tails for wildcards). In order to address this shortcoming, LogCluster implements several options for masking specific word parts and creating word classes. If a word matches the regular expression given with the --wfilter option, a word class is created for the word by searching it for substrings that match another regular expression provided with the --wsearch option. All matching substrings are then replaced with the string specified with the --wreplace option. For example, with the use of the --wfilter='=', --wsearch='=.+', and --wreplace='=VALUE' options, word classes are created for words which contain the equal sign (=) by replacing the characters after the equal sign with the string VALUE. Thus, for the words pid=12763 and user=bob, the classes pid=VALUE and user=VALUE are created. If a word is infrequent but its word class is frequent, the word class replaces the word during the mining process and will be treated like a frequent word. Since classes can represent many infrequent words, their presence in line patterns provides valuable information about regularities in word structure that would not be detected otherwise.

For evaluating the performance of LogCluster and comparing it with other algorithms, we conducted a number of experiments with larger event logs. For the sake of fair comparison, we re-implemented the public C-based version of SLCT in Perl. Since the implementations of IPLoM and the algorithm by Reidemeister et al. are not publicly available, we were unable to study their source code for creating their exact prototypes. However, because the algorithm by Reidemeister et al. uses SLCT and has a similar time complexity (see section II), its runtimes are closely approximated by the results for SLCT. During our experiments, we used 6 logs from a large institution of a national critical information infrastructure of an EU state. The logs cover a 24 hour timespan (May 8, 2015), and originate from a wide range of sources, including database systems, web proxies, mail servers, firewalls, and network devices. We also used an availability monitoring system event log from the NATO CCD COE Locked Shields 2015 cyber defense exercise which covers the entire two-day exercise and contains Nagios events. During the experiments, we clustered each log file three times with support thresholds set to 1%, 0.5% and 0.1% of the lines in the log. We also used a word sketch of 100,000 counters (parameter h in Fig. 3) for both LogCluster and SLCT, and did not employ the Aggregate_Supports and Join_Clusters heuristics. Therefore, both LogCluster and SLCT were configured to make three passes over the data set, in order to build the word sketch during the first pass, detect frequent words during the second pass, and generate cluster candidates during the third pass. All experiments were conducted on a Linux virtual server with an Intel Xeon E5-2680 CPU and 64GB of memory, and Table I outlines the results. Since the LogCluster and SLCT implementations are both single-threaded and their CPU utilization was 100% according to the Linux time utility during all 21 experiments, each runtime in Table I closely matches the consumed CPU time.
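The third of these passes, candidate generation (cf. Fig. 1), can be sketched compactly in Python. This is a simplified illustration under stated assumptions, not the authors' Perl tool: the set of frequent words F is assumed to be known already, and each candidate is keyed by its tuple of frequent words while tracking the minimum and maximum number of infrequent words between them to build *{min,max} wildcards:

```python
# Simplified sketch of LogCluster's candidate generation pass (cf. Fig. 1).
# F is the set of frequent words; each line is a list of words.

def generate_candidates(lines, F):
    # candidate id (tuple of frequent words) -> [support, varmin, varmax]
    cands = {}
    for words in lines:
        tuple_, vars_, v = [], [], 0
        for w in words:
            if w in F:
                tuple_.append(w)
                vars_.append(v)  # infrequent words seen before this word
                v = 0
            else:
                v += 1
        vars_.append(v)          # trailing infrequent words (wildcard tail)
        if not tuple_:
            continue
        key = tuple(tuple_)
        if key in cands:
            cands[key][0] += 1
            for i, n in enumerate(vars_):
                cands[key][1][i] = min(cands[key][1][i], n)
                cands[key][2][i] = max(cands[key][2][i], n)
        else:
            cands[key] = [1, list(vars_), list(vars_)]
    return cands

def pattern(key, varmin, varmax):
    # Interleave words with *{min,max} wildcards where gaps were observed
    out = []
    for i, w in enumerate(key):
        if varmax[i] > 0:
            out.append("*{%d,%d}" % (varmin[i], varmax[i]))
        out.append(w)
    if varmax[len(key)] > 0:
        out.append("*{%d,%d}" % (varmin[len(key)], varmax[len(key)]))
    return " ".join(out)

log = [
    "Interface DMZ-link down at node router2".split(),
    "Interface HQ link down at node router2".split(),
]
F = {"Interface", "down", "at", "node"}
c = generate_candidates(log, F)
key = ("Interface", "down", "at", "node")
print(pattern(key, c[key][1], c[key][2]))  # Interface *{1,2} down at node *{1,1}
```

Running the sketch on the two example lines from section III yields the pattern Interface *{1,2} down at node *{1,1} with support 2, matching the worked example in the text.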
TABLE I. PERFORMANCE OF LOGCLUSTER AND SLCT

  Row  Event log type           Size (MB)  Size (lines)  Support    LogCluster  LogCluster   SLCT      SLCT
                                                         threshold  clusters    runtime (s)  clusters  runtime (s)
  1    Authorization messages   3800.1     7,757,440     7,757      49          3146.42      89        1969.04
  2    Authorization messages   3800.1     7,757,440     38,787     32          3070.18      37        1892.41
  3    Authorization messages   3800.1     7,757,440     77,574     9           3050.20      15        1911.93
  4    UNIX daemon messages     740.2      5,778,847     5,778      150         692.08       158       479.90
  5    UNIX daemon messages     740.2      5,778,847     28,894     40          682.95       44        462.85
  6    UNIX daemon messages     740.2      5,778,847     57,788     12          667.82       16        470.48
  7    Application messages     9363.0     34,516,290    34,516     109         5225.32      114       3674.47
  8    Application messages     9363.0     34,516,290    172,581    16          4891.51      25        3559.36
  9    Application messages     9363.0     34,516,290    345,162    5           4765.09      8         3517.67
  10   Network device messages  4705.0     12,522,620    12,522     193         3181.97      195       2015.52
  11   Network device messages  4705.0     12,522,620    62,613     31          3083.16      33        2000.98
  12   Network device messages  4705.0     12,522,620    125,226    17          3080.66      19        1945.69
  13   Web proxy messages       16681.5    49,376,464    49,376     105         8487.37      111       5409.23
  14   Web proxy messages       16681.5    49,376,464    246,882    14          8128.34      14        5277.54
  15   Web proxy messages       16681.5    49,376,464    493,764    5           8081.30      5         5244.96
  16   Mail server messages     246.0      1,230,532     1,230      129         144.42       139       96.34
  17   Mail server messages     246.0      1,230,532     6,152      40          141.83       40        96.85
  18   Mail server messages     246.0      1,230,532     12,305     21          142.34       23        94.12
  19   Nagios messages          391.9      3,400,185     3,400      45          435.76       46        316.77
  20   Nagios messages          391.9      3,400,185     17,000     39          412.08       41        320.26
  21   Nagios messages          391.9      3,400,185     34,001     19          409.87       22        318.25

As the results indicate, SLCT was 1.28–1.62 times faster than LogCluster. This is due to the simpler candidate generation procedure of SLCT – when processing individual event log lines, SLCT does not have to check the line patterns of candidates and adjust them if needed. However, both algorithms require a considerable amount of time for clustering very large log files. For example, for processing the largest event log of 16.3GB (rows 13-15 in Table I), SLCT needed about 1.5 hours, while for LogCluster the runtime exceeded 2 hours. In contrast, the C-based version of SLCT accomplishes the same three tasks in 18-19 minutes. Therefore, we expect a C implementation of LogCluster to be significantly faster.

According to Table I, LogCluster finds fewer clusters than SLCT during all experiments (some clusters are depicted in Fig. 4). The reviewing of detected clusters revealed that unlike SLCT, LogCluster was able to discover a single cluster for lines where frequent words were separated with a variable number of infrequent words. For example, the first cluster in Fig. 4 properly captures all DHCP request events. In contrast, SLCT discovered two clusters May 8 * myserver dhcpd: DHCPREQUEST for * from * * via and May 8 * myserver dhcpd: DHCPREQUEST for * * from * * via which still do not cover all possible event formats. Also, the last two clusters in Fig. 4 represent all HTTP requests originating from the latest stable versions of the Firefox browser on all OS platforms and the Chrome browser on all Windows platforms, respectively (all OS platform strings are matched by *{3,4} for Firefox, while Windows NT *{1,3} matches all Windows platform strings for Chrome). As in the previous case, SLCT was unable to discover two equivalent clusters that would concisely capture HTTP request events for these two browser types.

  May 8 *{1,1} myserver dhcpd: DHCPREQUEST for *{1,2} from *{1,2} via *{1,4}

  May 8 *{3,3} Note: no *{1,3} sensors

  May 8 *{3,3} RT_IPSEC: %USER-3-RT_IPSEC_REPLAY: Replay packet detected on IPSec tunnel
  on *{1,1} with tunnel ID *{1,1} From *{1,1} to *{1,1} ESP, SPI *{1,1} SEQ *{1,1}

  May 8 *{1,1} myserver httpd: client *{1,1} request GET *{1,1} HTTP/1.1 referer *{1,1}
  User-agent Mozilla/5.0 *{3,4} rv:37.0) Gecko/20100101 Firefox/37.0 *{0,1}

  May 8 *{1,1} myserver httpd: client *{1,1} request GET *{1,1} HTTP/1.1 referer *{1,1}
  User-agent Mozilla/5.0 (Windows NT *{1,3} AppleWebKit/537.36 (KHTML, like Gecko)
  Chrome/42.0.2311.135 Safari/537.36

Fig. 4. Sample clusters detected by LogCluster (for the reasons of privacy, sensitive data have been obfuscated).

When evaluating the Join_Clusters heuristic, we found that word weight thresholds (parameter t in Fig. 3) between 0.5 and 0.8 produced the best joint clusters. Fig. 5 displays three sample joint clusters which were detected from the mail server and Nagios logs (rows 16-21 in Table I). Fig. 5 also illustrates the data preprocessing capabilities of the LogCluster tool. For the mail server log, a word class is created for each word which contains punctuation marks, so that all sequences of non-punctuation characters which are not followed by the equal sign (=) or opening square bracket ([) are replaced with a single X character. For the Nagios log, word classes are employed for masking blue team numbers in host names, and also, trailing timestamps are removed from each event log line with the --lfilter and --template options. The first two clusters in Fig. 5 are both created by joining three clusters, while the last cluster is the union of twelve clusters which represent Nagios SSH service check events for 192 servers.

  logcluster.pl --support=12305 \
    --input=mail.log --wfilter='[[:punct:]]' \
    --wsearch='[^[:punct:]]++(?![[=])' \
    --wreplace=X --wweight=0.75

  May 8 X:X:X (myserver1|myserver2|myserver3) sendmail[X]: STARTTLS=client,
  (relay=relayserver1,|relay=relayserver2,|relay=relayserver3,) version=TLSv1/SSLv3,
  (verify=FAIL,|verify=OK,) (cipher=DHE-RSA-AES256-SHA,|cipher=AES128-SHA,|cipher=RC4-SHA,)
  (bits=256/256|bits=128/128)

  May 8 X:X:X (myserver1|myserver2|myserver3) sendmail[X]: X: from=<myrobot@mydomain>,
  size=X, class=0, nrcpts=1, msgid=<X.X@X.X>, bodytype=8BITMIME, proto=ESMTP, daemon=MTA,
  (relay=relayserver1|relay=relayserver2) ([ipaddress1]|[ipaddress2])

  logcluster.pl --support=3400 \
    --input=ls15.log --separator='["|\s]+' \
    --lfilter='^(.*)(?:\|"\d+"){2}' --template='$1' \
    --wfilter='blue\d\d' --wsearch='blue\d\d' \
    --wreplace='blueNN' --wweight=0.5

  (ws4-01.lab.blueNN.ex|ws4-04.lab.blueNN.ex|ws4-03.int.blueNN.ex|ws4-04.int.blueNN.ex
  |ws4-02.int.blueNN.ex|ws4-05.lab.blueNN.ex|ws4-05.int.blueNN.ex|dlna.lab.blueNN.ex
  |ws4-01.int.blueNN.ex|ws4-02.lab.blueNN.ex|ws4-03.lab.blueNN.ex|git.lab.blueNN.ex)
  (ssh|ssh.ipv6) OK SSH OK - (OpenSSH_6.6.1p1|OpenSSH_5.9p1|OpenSSH_6.6.1_hpn13v11)
  (Ubuntu-2ubuntu2|FreeBSD-20140420|Debian-5ubuntu1|Debian-5ubuntu1.4) (protocol 2.0)

Fig. 5. Sample joint clusters detected by LogCluster (for the reasons of privacy, sensitive data have been obfuscated).

V. CONCLUSION

In this paper, we have described the LogCluster algorithm for mining patterns from event logs. For future work, we plan to explore hierarchical event log clustering techniques. We also plan to implement the LogCluster algorithm in C, and use LogCluster for automated building of user behavior profiles.

ACKNOWLEDGMENT

The authors thank NATO CCD COE for making Locked Shields 2015 event logs available for this research. The authors also thank Mr. Kaido Raiend, Mr. Ants Leitmäe, Mr. Andrus Tamm, Dr. Paul Leis and Mr. Ain Rasva for their support.

REFERENCES

[1] Risto Vaarandi and Mauno Pihelgas, “Using Security Logs for Collecting and Reporting Technical Security Metrics,” in Proceedings of the 2014 IEEE Military Communications Conference, pp. 294-299.
[2] Risto Vaarandi, “A Data Clustering Algorithm for Mining Patterns From Event Logs,” in Proceedings of the 2003 IEEE Workshop on IP Operations and Management, pp. 119-126.
[3] Risto Vaarandi, “A Breadth-First Algorithm for Mining Frequent Patterns from Event Logs,” in Proceedings of the 2004 IFIP International Conference on Intelligence in Communication Systems, LNCS Vol. 3283, Springer, pp. 293-308.
[4] Risto Vaarandi, “Mining Event Logs with SLCT and LogHound,” in Proceedings of the 2008 IEEE/IFIP Network Operations and Management Symposium, pp. 1071-1074.
[5] Risto Vaarandi and Kārlis Podiņš, “Network IDS Alert Classification with Frequent Itemset Mining and Data Clustering,” in Proceedings of the 2010 International Conference on Network and Service Management, pp. 451-456.
[6] Thomas Reidemeister, Mohammad A. Munawar and Paul A.S. Ward, “Identifying Symptoms of Recurrent Faults in Log Files of Distributed Information Systems,” in Proceedings of the 2010 IEEE/IFIP Network Operations and Management Symposium, pp. 187-194.
[7] Thomas Reidemeister, Miao Jiang and Paul A.S. Ward, “Mining Unstructured Log Files for Recurrent Fault Diagnosis,” in Proceedings of the 2011 IEEE/IFIP International Symposium on Integrated Network Management, pp. 377-384.
[8] Thomas Reidemeister, “Fault Diagnosis in Enterprise Software Systems Using Discrete Monitoring Data,” PhD Thesis, University of Waterloo, 2012.
[11] Adetokunbo Makanju, “Exploring Event Log Analysis With Minimum Apriori Information,” PhD Thesis, Dalhousie University, 2012.
[12] Mika Klemettinen, “A Knowledge Discovery Methodology for Telecommunication Network Alarm Databases,” PhD Thesis, University of Helsinki, 1999.
[13] Qingguo Zheng, Ke Xu, Weifeng Lv and Shilong Ma, “Intelligent Search of Correlated Alarms from Database Containing Noise Data,” in Proceedings of the 2002 IEEE/IFIP Network Operations and Management Symposium, pp. 405-419.
[14] Sheng Ma and Joseph L. Hellerstein, “Mining Partially Periodic Event Patterns with Unknown Periods,” in Proceedings of the 17th International Conference on Data Engineering, pp. 205-214, 2001.
[15] James J. Treinen and Ramakrishna Thurimella, “A Framework for the Application of Association Rule Mining in Large Intrusion Detection Infrastructures,” in Proceedings of the 2006 Symposium on Recent Advances in Intrusion Detection, LNCS Vol. 4219, Springer, pp. 1-18.
[16] Chris Clifton and Gary Gengo, “Developing Custom Intrusion Detection Filters Using Data Mining,” in Proceedings of the 2000 IEEE Military Communications Conference, pp. 440-443.
[17] Jon Stearley, “Towards Informatic Analysis of Syslogs,” in Proceedings of the 2004 IEEE International Conference on Cluster Computing, pp. 309-318.
[18] Adetokunbo Makanju, Stephen Brooks, A. Nur Zincir-Heywood and Evangelos E. Milios, “LogView: Visualizing Event Log Clusters,” in Proceedings of the 6th Annual Conference on Privacy, Security and Trust, pp. 99-108, 2008.
[19] Daniela Brauckhoff, Xenofontas Dimitropoulos, Arno Wagner and Kavè Salamatian, “Anomaly Extraction in Backbone Networks using Association Rules,” in Proceedings of the 2009 ACM SIGCOMM Internet Measurement Conference, pp. 28-34.
[9] Wei Xu, Ling Huang, Armando Fox, David Patterson and Michael Jordan, “Mining Console Logs for Large-Scale System Problem [20] Eduard Glatz, Stelios Mavromatidis, Bernhard Ager and Xenofontas Detection,” in Proceedings of the 3rd Workshop on Tackling Computer Dimitropoulos, “Visualizing big network traffic data using frequent Systems Problems with Machine Learning Techniques, 2008. pattern mining and hypergraphs,” Computing Vol. 96(1), Springer, pp. 27-38, 2014. [10] Adetokunbo Makanju, A. Nur Zincir-Heywood and Evangelos E. Milios, “Clustering Event Logs using Iterative Partitioning,” in Proceedings of the 2009 ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1255-1264.
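As an illustration of the wildcard notation discussed in the experiments above, where a bare * in an SLCT pattern stands for exactly one word and a LogCluster pattern element *{m,n} matches a variable run of m to n words, the following Python sketch translates such a pattern into a regular expression and matches it against log lines. This is an illustrative helper of ours, not code from either tool; the function name, the whitespace word delimiter, and the assumption m >= 1 follow the examples quoted in the text.

```python
import re

def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a line pattern with wildcards into a regular expression.

    Illustrative sketch only (not code from SLCT or LogCluster):
    '*' stands for exactly one word, '*{m,n}' for m to n words (m >= 1),
    and words are separated by whitespace, as in the paper's examples.
    """
    parts = []
    for token in pattern.split():
        rng = re.fullmatch(r"\*\{(\d+),(\d+)\}", token)
        if rng:  # '*{m,n}': one word plus m-1 to n-1 further words
            lo, hi = int(rng.group(1)), int(rng.group(2))
            parts.append(r"\S+(?:\s+\S+){%d,%d}" % (lo - 1, hi - 1))
        elif token == "*":  # single-word wildcard, as in SLCT patterns
            parts.append(r"\S+")
        else:  # a frequent word: match it literally
            parts.append(re.escape(token))
    return re.compile(r"\s+".join(parts))

# One of the SLCT patterns quoted in the text, matched against a
# hypothetical DHCP log line (host name and addresses are made up):
p = pattern_to_regex("May 8 * myserver dhcpd: DHCPREQUEST for * from * * via")
line = ("May 8 12:01:02 myserver dhcpd: DHCPREQUEST for 10.0.0.5 "
        "from 00:11:22:33:44:55 (host1) via eth0")
assert p.match(line) is not None
```

The range wildcard is what lets a single LogCluster pattern cover lines that SLCT, with its fixed word positions, had to split across several patterns: for example, *{1,3} absorbs one, two, or three infrequent words at a single point in the pattern.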