Adaptive Intrusion Detection: a Data Mining Approach
Wenke Lee (wenke@cs.columbia.edu) and Salvatore J. Stolfo (sal@cs.columbia.edu)
Computer Science Department, Columbia University, 500 West 120th Street, New York, NY 10027
Kui W. Mok† (mokk@ms.com)
Morgan Stanley Dean Witter & Co., 750 7th Avenue, New York, NY 10019
Abstract. In this paper we describe a data mining framework for constructing intrusion detection models. The first key idea is to
mine system audit data for consistent and useful patterns of program and user behavior. The other is to use the set of relevant system
features presented in the patterns to compute inductively learned classifiers that can recognize anomalies and known intrusions. In
order for the classifiers to be effective intrusion detection models, we need to have sufficient audit data for training and also select
a set of predictive system features. We propose to use the association rules and frequent episodes computed from audit data as
the basis for guiding the audit data gathering and feature selection processes. We modify these two basic algorithms to use axis
attribute(s) and reference attribute(s) as forms of item constraints to compute only the relevant patterns. In addition, we use an
iterative level-wise approximate mining procedure to uncover the low frequency but important patterns. We use meta-learning as
a mechanism to make intrusion detection models more effective and adaptive. We report our extensive experiments in using our
framework on real-world audit data.
Keywords: intrusion detection, audit data, classification, association rules, frequent episodes, feature construction
This research is supported in part by grants from DARPA (F30602-96-1-0311) and NSF (IRI-96-32225 and CDA-96-25374).
† This author’s contributions were made while at Columbia University.
© 2000 Kluwer Academic Publishers. Printed in the Netherlands.
1. Introduction
As network-based computer systems play increasingly vital roles in modern society, they have become
the target of our enemies and criminals. Therefore, we need to find the best ways possible to protect our
systems. The security of a computer system is compromised when an intrusion takes place. An intrusion can
be defined (Heady et al., 1990) as “any set of actions that attempt to compromise the integrity, confidentiality
or availability of a resource”, for example, illegally gaining superuser privileges, attacking and rendering a
system out of service (i.e., denial-of-service), etc.
Intrusion prevention techniques, such as user authentication (e.g. using passwords or biometrics),
avoiding programming errors, and information protection (e.g., encryption) have been used to protect computer systems as a first line of defense. Intrusion prevention alone is not sufficient because as systems become
ever more complex, there are always exploitable weaknesses in the systems due to design and programming
errors. For example, there are known design flaws in TCP/IP protocols and Unix systems that have led to
security holes (Bellovin, 1989; Grampp and Morris, 1984); and after it was first reported many years ago,
exploitable “buffer overflow” bugs still exist in some recent system software due to programming errors. The
policies that balance convenience versus strict control of a system and information access, as well as critical
“insider” problems, also make it impossible for an operational system to be completely secure. Intrusion
detection is therefore needed as another wall to protect computer systems.
Many intrusion detection systems have been constructed by manual and ad hoc means. These systems
have been designed and implemented based on the system builders’ knowledge of a computer system and
their understanding of known intrusions. As a result, the effectiveness and adaptability of the intrusion
detection systems are limited in the face of new computing environments or newly invented attack methods.
This paper discusses a systematic framework for analyzing audit data and constructing intrusion detection models. Under this framework, a large amount of audit data is first analyzed using data mining
algorithms in order to obtain the frequent activity patterns. These patterns are then used to guide the selection
of system features as well as the construction of additional temporal and statistical features for another phase
of automated learning. Classifiers based on these selected features are then inductively learned using the
appropriately formatted audit data. These classifiers can be used as intrusion detection models since they
can classify (i.e., decide) whether an observed system activity is “legitimate” or “intrusive”. This approach
is substantially different from the traditional signature-based approaches. We believe inductively learned
models may provide generalized detection capabilities that may not be possible with a static set of hand
crafted signatures.
This paper is organized as follows. We first give a brief overview of current intrusion detection techniques and their limitations. We then outline the key elements of our framework. Next, we summarize the
lessons we learned from our early experiments on building classification models for intrusion detection,
namely we find that tools are needed for feature selection and audit data gathering. We then discuss how to
exploit the schema level information (i.e., attribute definitions) of connection records so that the association
rules (Agrawal et al., 1993) and frequent episodes (Mannila et al., 1995) algorithms can be directed to
compute relevant patterns from audit data efficiently. We report in detail our various experiments on using
these patterns as a guide for audit data gathering, and as a basis for feature selection. Finally, we present
our experimental results on combining multiple classification models for collaborative intrusion detection.
In conclusion, we compare our framework with related research efforts and discuss issues for our future
research.
2. Intrusion detection techniques
There are mainly two types of intrusion detection techniques: anomaly detection and misuse detection.¹
Most current intrusion detection systems use one or both of these two approaches.
¹ Misuse detection is oftentimes associated with detecting “insider attacks”, while intrusion detection generally refers to
“outsider attacks”. We make no such distinction here.
2.1. Anomaly detection
Anomaly detection, for example IDES (Lunt et al., 1992), tries to determine whether deviation from an
established normal behavior profile can be flagged as an intrusion. A profile typically consists of a number of
statistical measures on system activities, for example, the CPU usage and the frequency of system commands
during a user login session. Deviation from a profile can be computed as the weighted sum of the deviations
of the constituent statistical measures. Profiles can be aged so that shifts of normal behavior are accounted
for. The key advantage of anomaly detection systems is that they can detect unknown intrusions since
they require no a priori knowledge about specific intrusions. However, defining and maintaining “normal”
profiles is a nontrivial and error-prone task, leading to sometimes unacceptable levels of false alarms.
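The weighted-sum deviation score described above can be sketched in a few lines; this is a minimal illustration, and the measure names, weights, profile statistics, and threshold below are assumptions for the example, not values from IDES:

```python
# Sketch of a profile-based anomaly score: each statistical measure's deviation
# from its profile mean is normalized by its standard deviation and combined as
# a weighted sum. Measure names, weights, and sample values are illustrative.

def anomaly_score(observed, profile, weights):
    """observed: dict measure -> value; profile: dict measure -> (mean, std)."""
    score = 0.0
    for measure, (mean, std) in profile.items():
        deviation = abs(observed[measure] - mean) / std if std > 0 else 0.0
        score += weights[measure] * deviation
    return score

profile = {"cpu_usage": (0.20, 0.05), "commands_per_min": (3.0, 1.0)}
weights = {"cpu_usage": 0.6, "commands_per_min": 0.4}

normal = {"cpu_usage": 0.22, "commands_per_min": 3.5}
odd    = {"cpu_usage": 0.90, "commands_per_min": 40.0}

print(anomaly_score(normal, profile, weights))  # small score
print(anomaly_score(odd, profile, weights))     # much larger score
```

Aging the profile, as mentioned above, would amount to periodically re-estimating the means and standard deviations from recent data.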
2.2. Misuse detection
Misuse detection refers to techniques that use patterns of known intrusions (e.g., more than three consecutive
failed logins within 2 minutes is a penetration attempt) or weak spots of a system (e.g., system utilities that
have the “buffer overflow” vulnerabilities) to match and identify intrusions. The sequence of attack actions,
the conditions that compromise a system’s security, as well as the evidence (e.g., damage) left behind by
intrusions can be represented by a number of general pattern matching models. For example, NIDES (Lunt,
1993) uses rules to describe attack actions, STAT (Ilgun et al., 1995) uses state transition diagrams to model
general states of the system and access control violations, and IDIOT (Kumar and Spafford, 1995) uses
Colored Petri nets to represent intrusion signatures as sequences of events on the target system. The key
advantage of misuse detection systems is that once the patterns of known intrusions are stored, future
instances of these intrusions can be detected effectively and efficiently. However, newly invented attacks
will likely go undetected, leading to unacceptable false negative error rates.
2.3. Difficulties in building intrusion detection systems
Currently there is no systematic approach for building intrusion detection systems, nor any generally agreed
upon evaluation metric for measuring their effectiveness. System builders’ intuition and experience guide
the selection of the statistical measures for anomaly detection (Lunt, 1993). Experts need to analyze and
categorize attack scenarios and system vulnerabilities, and hand-code the corresponding rules and patterns
for misuse detection.
In today’s network computing environment, there are multiple “penetration points” for intrusions to
take place. For example, at the network level, intruders can crash a host by simply sending “malicious”
packets to the victim; at the host level, intruders can first login to a system and then illegally obtain a root
(i.e., superuser) shell. System activities occurring at these different “penetration points” are captured in
different audit data sources. However, many intrusion detection systems handle only one particular audit
data source because of the knowledge and labor-intensive nature of understanding the system audit data,
selecting statistical measures, and modeling intrusion patterns.
The large traffic volume on security-related mailing lists and Web sites of industry groups as well as
“underground societies” suggests that new system security holes and intrusion methods are continuously
being discovered. Therefore, it is imperative that intrusion detection systems be updated rapidly and upgraded frequently. However, such maintenance and updates are expensive and difficult because of the
manual, ad hoc approaches generally employed.
2.4. A systematic framework
A basic premise for intrusion detection is that when audit mechanisms are enabled to record system events,
distinct evidence of legitimate and intrusive user and program activities will be manifested in the audit data.
For example, from network traffic audit data, connection failures are normally infrequent. However, certain
types of intrusions will result in a large number of consecutive failures that may be easily detected. Taking
a data-centric point of view, intrusion detection can therefore be considered as a data analysis process.
Anomaly detection is about finding the normal usage patterns from the audit data, and misuse detection is
about encoding and matching intrusion patterns using the audit data.
We are developing a framework, first described in (Lee and Stolfo, 1998), for applying data mining
techniques to build intrusion detection models. This framework consists of classification and meta-classification (Chan and Stolfo, 1993), association rules and frequent episodes programs, as well as a support
environment that enables system builders to interactively and iteratively drive the process of constructing
and evaluating detection models. The end products are concise and intuitive classification rules
that can detect intrusions, and that can be easily inspected and edited by security experts when needed.
We try to eliminate the manual and ad hoc elements as much as possible. Here inductively learned
classification rules replace the manually encoded intrusion patterns and profiles, and system features and
measures are selected by considering the statistical patterns computed from the audit data. Further, meta-learning can be used to learn the correlation of intrusion evidence from multiple audit data sources, making
it feasible to detect a full range of intrusions.
Since the process of using pattern mining and classification programs is generic and mechanical, it is
much easier to update the intrusion detection models using our framework. Migration to new computing
environments or dealing with new intrusion methods means applying the data mining and machine learning
process to the relevant audit data.² Meta-learning can be used to combine the new models (e.g., those that
detect new intrusions) with the existing ones (e.g., those that detect “old” intrusions). This meta-learning
mechanism facilitates incremental updates of the overall combined models, making adaptive intrusion
detection feasible.
² However, exemplar attacks need to be executed, observed and labeled in the new audit data stream in order to learn new
detection models. Naturally, this is also required for hand-crafted signatures.
3. Classification models for intrusion detection
In this section we briefly summarize our early experiments on constructing classification models for anomaly
detection (detailed in (Lee and Stolfo, 1998)). These experiments showed the effectiveness of classification
models computed by machine learning programs.
System programs, such as sendmail, are frequent targets of intruders. The attacks normally cause the
programs to perform malicious activities. The sequence of run-time operating system calls (e.g., open, read,
...) of a program provides very detailed information about its execution. We obtained, from Stephanie Forrest
of the University of New Mexico, system call data of hundreds of normal sendmail runs and a dozen different
simulated attacks. Based on the findings that the short sequences of system calls of a program are very
consistent (Forrest et al., 1996), we pre-processed the data by using a length 11 sliding window to scan the
system call sequence and create a list of records, each of which has 11 consecutive system calls. The 11th
system call of each record is the class label (i.e., the target of the learning task), and the first 10 system calls
are the positional features (attributes).³ RIPPER (Cohen, 1995), a classification rule learner that computes
the most distinguishing and concise feature value tests for each class label, was then applied to the data. The
resultant rule set, consisting of 252 rules each with 2 or 3 feature value tests, is the normal execution profile
since it specifies the normal co-occurrences of the system calls. When it is used to analyze an execution trace,
a large number of misclassifications (i.e., deviations) suggests an “anomaly”. Our experiments (detailed
in (Lee and Stolfo, 1998)) showed that we need to use data from as many as 80% of the hundreds of
normal runs of sendmail for training in order for the classifier to clearly distinguish the attacks from the
normal runs. Therefore, in general we need to collect a sufficiently large amount of training data to learn
and establish a normal profile for use as an anomaly detector.
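The sliding-window pre-processing described above can be sketched as follows; the call trace here is illustrative, not taken from the sendmail data:

```python
# Sketch of the sliding-window pre-processing: each record holds 11
# consecutive system calls; the first 10 are positional features and
# the 11th is the class label for the learning task.

def make_records(trace, width=11):
    records = []
    for i in range(len(trace) - width + 1):
        window = trace[i:i + width]
        features, label = window[:-1], window[-1]
        records.append((features, label))
    return records

# Illustrative trace of call names (not from the actual sendmail data).
trace = ["open", "read", "mmap", "mmap", "open", "read", "mmap",
         "close", "open", "read", "write", "close"]
records = make_records(trace)
print(len(records))   # a 12-call trace yields 2 records
print(records[0][1])  # label of the first record
```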
Recently, attacking the weaknesses in the TCP/IP protocols has become a very “cool” trend for intruders. These intrusions are aimed at disrupting network services. “Packet sniffing” programs, such as
³ Features and attributes both refer to the fields in a database record. In this paper, we use attributes in the context of mining
frequent patterns, and features in the context of building classifiers.
tcpdump (Jacobson et al., 1989), can capture traffic data for analysis and monitoring. We obtained a set
of tcpdump data on network traffic, available via http://iris.cs.uml.edu:8080/network.html, that
is part of an Information Exploration Shootout (see http://iris.cs.uml.edu:8080). The data set has
traces of normal traffic and a number of intrusions. We followed the TCP/IP protocol specification to first
process the raw packet-level data into a time series of connection-level records using a manually written
preprocessing script. For each record, we used the destination network service (e.g. http) as the class label,
and included other intrinsic connection information, for example, duration, number of bytes transferred in
each direction, and the flag specifying whether there is an error according to the protocol as the features.
We then applied RIPPER to the normal data. The resultant rule set, about 20 rules, can be considered as the
normal profile of the network since it characterizes the normal traffic patterns for each network service.
The initial results were not good: the misclassification rates on the normal testing data and the intrusions
were very close. This is because the temporal nature of network traffic is not captured in the classification
model. For each connection we then added several temporal and statistical features calculated from the
connections of the prior n seconds. These include the average duration and number of bytes, and the total
number of errors, etc. The new classification models showed significant improvement in detecting intrusions.
Complete details of these experiments and our findings appear in (Lee and Stolfo, 1998).
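The construction of such temporal and statistical features can be sketched as follows; the field names, the 2-second window, and the sample records are illustrative assumptions, not the actual feature set:

```python
# Sketch of adding temporal/statistical features to connection records:
# for each connection, summarize the connections seen in the prior n seconds.
# Field names, the window size, and the sample data are illustrative.

def add_temporal_features(conns, n=2.0):
    out = []
    for i, c in enumerate(conns):
        prior = [p for p in conns[:i] if c["ts"] - p["ts"] <= n]
        count = len(prior)
        avg_dur = sum(p["duration"] for p in prior) / count if count else 0.0
        n_err = sum(1 for p in prior if p["flag"] != "SF")
        out.append({**c, "avg_duration": avg_dur, "error_count": n_err})
    return out

conns = [
    {"ts": 1.0, "duration": 10, "flag": "SF"},
    {"ts": 2.0, "duration": 2,  "flag": "REJ"},
    {"ts": 2.5, "duration": 4,  "flag": "SF"},
]
enriched = add_temporal_features(conns)
print(enriched[2]["avg_duration"], enriched[2]["error_count"])  # 6.0 1
```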
3.1. T HE CHALLENGES
The above experiments revealed that we need to solve some very challenging problems for the classification
models to be effective.
Formulating the classification tasks, i.e., determining the class labels and the set of features, from audit
data is a very difficult and time-consuming task. Since security is usually an after-thought of computer
system design, there is no standard auditing mechanism and data format specifically designed for intrusion
analysis purposes. A considerable amount of data pre-processing, which involves domain knowledge, is
required to transform raw “action-level” audit data into higher-level “session/event” records with the set of intrinsic system features. The temporal nature of event sequences in network-based computer systems suggests
that temporal and statistical measures over the features can be very valuable and hence additional features
(e.g., the total number of connection failures in the past 10 seconds) should be used. Traditional feature
selection techniques, as discussed in the machine learning literature, can not be directly applied here since
prior work typically does not consider sequential correlation of features across record boundaries. Fawcett
and Provost (Fawcett and Provost, 1997) presented some very interesting ideas on automatic selection of
features for a cellular fraud detector. Their method is very effective in detecting “superimposition fraud”
in which fraudulent activities are conducted using a legitimate account. Many intrusions can not be easily
categorized as “superimposition”, for example, those that attack network protocols.
A critical requirement for using classification rules as an anomaly detector is that we need to have
“sufficient” training data that covers as much variation of the normal behavior as possible, so that the false
positive rate is kept low (i.e., we wish to minimize detected “abnormal normal” behavior). It is not always
possible to formulate a classification model to learn the anomaly detector with “insufficient” training data,
and then incrementally update the classifier using on-line learning algorithms. This is because the limited
training data may not have covered all the class labels, and on-line algorithms, for example, ITI (Incremental
Tree Induction) (Utgoff et al., 1997), can’t deal with new data with new (i.e., unseen) class labels. For
example, in modeling daily network traffic, we use the services, e.g., http, telnet etc., of the connections as
the class labels in training models. We may not have connection records of the infrequently used services
with, say, only one week’s traffic data. A formal audit data gathering process therefore needs to take place
first. As we collect audit data, we need an indicator that can tell us whether the new audit data exhibits any
“new” normal behavior, so that we can stop the process when there is no more variation evident during our
collection process. This indicator ideally should be simple to compute and must be incrementally updated.
4. Algorithms for mining audit data
We aim to develop general rather than intrusion-specific tools in response to the challenges discussed in
the previous section. The idea is to first compute the association rules and frequent episodes from audit
data, which capture the intra- and inter- audit record patterns. These patterns are then utilized, with user
participation, to guide the data gathering and feature selection processes. Here we use the term “audit data”
to refer to general data streams that can be processed for detection purposes. Examples of such data streams
are the connection records extracted from the raw tcpdump output, and the Web site visit records processed
using Web site logs. We assume that audit data records are timestamped and hence ordered.
As described in (Lee et al., 1998), the main challenge in developing these data mining algorithms is
to provide support mechanisms for domain knowledge so that “useful” patterns are computed. We next
describe these basic data mining algorithms and our proposed extensions that allow the introduction of
domain knowledge in a convenient manner.
4.1. The basic algorithms
From (Agrawal et al., 1993), let A be a set of attributes, and I be a set of values on A, called items.
Any subset of I is called an itemset. The number of items in an itemset is called its length. Let D be a
database with n attributes (columns). Define support(X) as the percentage of transactions (records) in D
that contain itemset X. An association rule is the expression X → Y, [c, s]. Here X and Y are itemsets, and
X ∩ Y = ∅. s = support(X ∪ Y) is the support of the rule, and c = support(X ∪ Y)/support(X) is the confidence. For example,
an association rule from the shell command history file of a user, which is a stream of commands and their
arguments, is trn → rec.humor, [0.3, 0.1], which indicates that 30% of the time when the user invokes trn,
he or she is reading the news in rec.humor, and reading this newsgroup accounts for 10% of the activities
recorded in his or her command history file.
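Support and confidence as defined above can be computed with a short sketch; the command records below are illustrative, not the user’s actual history:

```python
# Sketch of support and confidence for an association rule X -> Y over a
# table of records; each itemset is a set of (attribute, value) pairs.
# The sample records are illustrative.

def support(records, itemset):
    n = sum(1 for r in records if itemset <= set(r.items()))
    return n / len(records)

def rule_stats(records, x, y):
    s = support(records, x | y)
    c = s / support(records, x)
    return c, s

records = [
    {"command": "trn", "arg": "rec.humor"},
    {"command": "trn", "arg": "rec.humor"},
    {"command": "trn", "arg": "comp.lang"},
    {"command": "ls",  "arg": "-l"},
]
c, s = rule_stats(records, {("command", "trn")}, {("arg", "rec.humor")})
print(c, s)  # confidence 2/3, support 0.5
```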
We implemented the association rules algorithm following the ideas of Apriori (Agrawal and Srikant,
1994). Briefly, an itemset X is frequent if support(X) ≥ min_support. The Apriori algorithm starts by
finding all length 1 frequent itemsets, then iteratively generates length k+1 frequent itemsets from the
length k frequent itemsets. At each iteration, the algorithm first uses a join step to generate length k+1
candidate itemsets, each from 2 length k frequent itemsets; then a prune step to filter out a candidate itemset
if one of its length k subsets is not frequent; finally the support values for the remaining candidate itemsets
are counted to form the set of frequent length k+1 itemsets.
Since we look for correlation among values of different attributes, and the pre-processed audit data
usually has multiple columns of attributes, each with a large number of possible values, we do not convert
the data into a binary database. In our implementation we trade memory for speed. The data structure for
a frequent itemset has a row vector that records the transactions in which the itemset is contained. When a
length k+1 candidate itemset c_{k+1} is generated by joining two length k frequent itemsets l_k1 and l_k2, the row
vector of c_{k+1} is simply the bitwise AND product of the row vectors of l_k1 and l_k2. The support of c_{k+1} can
be calculated easily by counting the 1s in its row vector. There is also no need to perform the prune step in
the candidate generation function. We minimize the memory consumption by freeing up the row vectors of
length k itemsets after they are used to generate the length k+1 itemsets. Notice that the database needs to be
scanned only once to generate the list of frequent itemsets of length 1. Since most pre-processed audit data
files are small enough to fit into main memory, and the size of any list of length k itemsets cannot exceed
the size of the database, this implementation works well in this particular application domain.
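The row-vector idea can be sketched in a few lines; here Python integers serve as bit vectors, the join is a bitwise AND, and support is the popcount of the result. The records and threshold are illustrative, and this sketch omits the memory-freeing step:

```python
# Sketch of Apriori with row bit vectors: each frequent itemset keeps a bit
# vector over transactions; joining two length-k itemsets ANDs their vectors,
# and support is the count of set bits. Sample data is illustrative.
from itertools import combinations

def mine(records, min_support):
    n = len(records)
    vectors = {}
    # Bit i of an item's vector is set if record i contains the item.
    for i, rec in enumerate(records):
        for item in rec.items():
            key = frozenset([item])
            vectors[key] = vectors.get(key, 0) | (1 << i)
    freq = {k: v for k, v in vectors.items() if bin(v).count("1") / n >= min_support}
    result = dict(freq)
    while freq:
        k = len(next(iter(freq)))
        nxt = {}
        for a, b in combinations(freq, 2):
            cand = a | b
            if len(cand) == k + 1:
                vec = freq[a] & freq[b]  # bitwise AND of row vectors
                if bin(vec).count("1") / n >= min_support:
                    nxt[cand] = vec
        freq = nxt
        result.update(freq)
    return {tuple(sorted(k)): bin(v).count("1") / n for k, v in result.items()}

records = [
    {"service": "smtp", "flag": "SF"},
    {"service": "smtp", "flag": "SF"},
    {"service": "http", "flag": "REJ"},
    {"service": "smtp", "flag": "REJ"},
]
patterns = mine(records, min_support=0.5)
print(patterns[(("flag", "SF"), ("service", "smtp"))])  # 0.5
```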
The problem of finding frequent episodes based on minimal occurrences was introduced in (Mannila
and Toivonen, 1996). Briefly, given an event database D where each transaction is associated with a timestamp, an interval [t1, t2] is the sequence of transactions that starts from timestamp t1 and ends at t2. The
width of the interval is defined as t2 − t1. Given an itemset A in D, an interval is a minimal occurrence
of A if it contains A and none of its proper sub-intervals contains A. Define support(X) as the ratio
between the number of minimal occurrences that contain itemset X and the number of records in D.
A frequent episode rule is the expression X, Y → Z, [c, s, window]. Here X, Y and Z are itemsets in D.
s = support(X ∪ Y ∪ Z) is the support of the rule, and c = support(X ∪ Y ∪ Z)/support(X ∪ Y) is the confidence. Here the width of
each of the occurrences must be less than window. A serial episode rule has the additional constraint that X,
Y and Z must occur in transactions in partial time order, i.e., Z follows Y and Y follows X. The description
here differs from (Mannila and Toivonen, 1996) in that we don’t consider a separate window constraint on
the LHS (left hand side) of the rule. The frequent episode algorithm finds patterns in a single sequence of
event stream data. The problem of finding frequent sequential patterns that appear in many different data
sequences was introduced in (Agrawal and Srikant, 1995). This related algorithm is not used in our study
since the frequent network or system activity patterns can only be found in the single audit data stream from
the network or the operating system.
Our implementation of the frequent episodes algorithm utilized the data structures and library functions
of the association rules algorithm. Here instead of finding correlations of attribute values across columns,
we look for correlations across rows. The row vector is now used as the interval vector where each pair
of adjacent 1s is the pair of boundaries of an interval. A temporal join function, which considers only minimal
occurrences that are non-overlapping, is used to create the interval vector of a candidate length k itemset
from the interval vectors of two length k−1 frequent itemsets. The support of an itemset can be easily
counted as the number of pairs in the interval vector.
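Minimal occurrences can be illustrated with a small sketch; this version scans the timestamped stream directly rather than using interval vectors, and the event stream below is an illustrative assumption:

```python
# Sketch of minimal occurrences: an interval [t1, t2] is a minimal occurrence
# of an itemset if it contains every item and no proper sub-interval does.
# The event stream is illustrative.

def minimal_occurrences(stream, itemset, window):
    """stream: list of (timestamp, event) in time order; itemset: set of events."""
    cands = []
    for i, (t1, e1) in enumerate(stream):
        if e1 not in itemset:
            continue
        seen = set()
        for t2, e2 in stream[i:]:
            if t2 - t1 > window:
                break
            if e2 in itemset:
                seen.add(e2)
            if seen == itemset:
                cands.append((t1, t2))
                break
    # A candidate is minimal iff no later start reaches the same end;
    # keep only the latest start for each end time.
    minimal = {}
    for t1, t2 in cands:
        minimal[t2] = max(minimal.get(t2, t1), t1)
    return sorted((t1, t2) for t2, t1 in minimal.items())

stream = [(1, "a"), (2, "a"), (3, "b"), (5, "b"), (6, "a")]
print(minimal_occurrences(stream, {"a", "b"}, 3))  # [(2, 3), (5, 6)]
```

Note that [1, 3] is not minimal above because its proper sub-interval [2, 3] already contains both items.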
4.2. Using the axis attribute(s)
These basic algorithms do not consider any domain knowledge and as a result they can generate many
“irrelevant” rules. Consider the audit data of network connections shown in Table I. Here we have already
discretized the continuous attribute values, except the timestamps, into proper buckets. The basic association
rules algorithm may generate rules such as src_bytes=200 → flag=SF. These rules are not useful and to
Table I. Network Connection Records

timestamp   duration   service   src_bytes   dst_bytes   flag   ...
1.1         10         telnet    100         2000        SF     ...
2.0         2          ftp       200         300         SF     ...
2.3         1          smtp      250         300         SF     ...
3.4         60         telnet    200         12100       SF     ...
3.7         1          smtp      200         300         SF     ...
3.8         1          smtp      200         300         SF     ...
5.2         1          http      200         0           REJ    ...
5.7         2          smtp      300         200         SF     ...
...
some degree misleading. There is no intuition for the association between the number of bytes from the
source, src_bytes, and the normal status (i.e., flag=SF) of the connection.
In (Klemettinen et al., 1994), rule templates specifying the allowable attribute values are used to postprocess discovered rules. In (Srikant et al., 1997), boolean expressions over the attribute values are used as
item constraints during rule discovery. A drawback of these approaches is that one has to know a priori what
rules and patterns are interesting. We can’t assume such strong prior knowledge on all audit data.
We instead utilize the schema-level information about audit records to direct the pattern mining process.
Observe that certain attributes are essential in describing the data, while others provide only “auxiliary information”. For example, a network connection can be uniquely identified by the combination of its timestamp
(start time), src_host (source host), src_port (source port), dst_host (destination host), and service (destination port). These are the essential attributes when describing network data. We argue that the “relevant”
association rules should describe patterns related to the essential attributes.
We call the essential attribute(s) axis attribute(s) when they are used as a form of item constraints in
the association rules algorithm. During candidate generation, an item set must contain value(s) of the axis
attribute(s). That is, we consider the correlations among non-axis attributes as not interesting. In practice,
we need not designate all essential attributes as the axis attributes. For example, some network analysis tasks
require statistics about various network services while others may require the patterns related to the hosts.
We can use service as the axis attribute to compute the association rules that describe the patterns related to
the services of the connections.
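The axis-attribute constraint can be sketched as a simple filter applied during candidate generation; the candidate itemsets below are illustrative:

```python
# Sketch of the axis-attribute item constraint: during candidate generation,
# an itemset is kept only if it contains a value of some axis attribute.
# Attribute names follow Table I; the candidate itemsets are illustrative.

def satisfies_axis(itemset, axis_attributes):
    return any(attr in axis_attributes for attr, _ in itemset)

candidates = [
    {("service", "smtp"), ("src_bytes", "200")},
    {("src_bytes", "200"), ("flag", "SF")},      # no axis value: pruned
    {("service", "http"), ("flag", "REJ")},
]
kept = [c for c in candidates if satisfies_axis(c, {"service"})]
print(len(kept))  # 2
```

In a full implementation this test would be applied inside the join step, so candidates without an axis value are never counted at all.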
It is even more important to use the axis attribute(s) to constrain the item generation for frequent
episodes. The basic algorithm can generate serial episode rules that contain “non-essential” attribute values,
for example, src_bytes=200, src_bytes=200 → dst_bytes=300, src_bytes=200. Note that here each attribute
value, for example, src_bytes=200, is from a separate connection record. Compared with the association
rules, the total number of serial rules is larger and so is the number of such useless rules. Observe that the
number of iterations for growing the frequent itemsets (i.e., the length of an itemset) is bounded here by the
number of rows instead of the number of columns as in association rules, and the number of rows tends to be large.
Further, if the support of an association rule on non-axis attributes, A → B, is high, then there will be a large
number of “useless” serial episode rules of the form (A, B), (A, B) → (A, B), (A, B). To see this, assume that
there are a total of m records (rows) in the database, the time difference between the last and the first record
is t seconds, and the support of (A, B) is s. Then the number of minimal and non-overlapping intervals that
have k records with (A, B) is sm/k. Note that each of these intervals contains a length k serial episode on (A, B).
Assume that the records with (A, B) are evenly distributed; then the width of each interval is w = kt/(sm). There
can be a large number of serial patterns on (A, B) if s is high and window (i.e., the interval width threshold)
is large, since kmax, the maximal length of the patterns, can be large while w ≤ window still holds.
Instead of using the basic algorithm, here we first find the frequent associations using the axis attribute(s) and then generate the frequent serial patterns from these associations. An example of a rule
is

(service=smtp, src_bytes=200, dst_bytes=300, flag=SF), (service=telnet, flag=SF) → (service=http, src_bytes=200), [0.2, 0.1, 2s]

Note that each itemset of the episode rule, e.g., (service=smtp, src_bytes=200, dst_bytes=300, flag=SF),
is an association. We in effect have combined the associations among attributes and the sequential patterns
Table II. Web Log Records

timestamp   remote host                action   request                    ...
1           his.moc.kw                 GET      /images                    ...
1.1         his.moc.kw                 GET      /images                    ...
1.3         his.moc.kw                 GET      /shuttle/missions/sts-71   ...
...
3.1         taka10.taka.is.uec.ac.jp   GET      /images                    ...
3.2         taka10.taka.is.uec.ac.jp   GET      /images                    ...
3.5         taka10.taka.is.uec.ac.jp   GET      /shuttle/missions/sts-71   ...
...
8           rjenkin.hip.cam.org        GET      /images                    ...
8.2         rjenkin.hip.cam.org        GET      /images                    ...
9           rjenkin.hip.cam.org        GET      /shuttle/missions/sts-71   ...
...
among the records into a single rule. This rule formalism provides rich and useful information about the
audit data.
4.3. Using the reference attribute(s)
Another piece of interesting schema-level information about audit records is that some essential attributes
can be the references of other attributes. These reference attributes normally carry information about some
“subject”, and other attributes describe the “actions” that refer to the same “subject”. Consider the log of
visits to a Web site, as shown in Table II. Here action and request are the “actions” taken by the “subject”,
remote host. We see that for a number of remote hosts, each of them makes the same sequence of requests:
“/images”, “/images” and “/shuttle/missions/sts-71”. It is important to use the “subject” as a reference
when finding such frequent sequential “action” patterns because the “actions” from different “subjects”
are normally irrelevant. This kind of sequential pattern can be represented as
(subject = X, action = a1), (subject = X, action = a2) → (subject = X, action = a3), [c, s, window]
Input: the terminating minimum support s0, the initial minimum support si, and the axis attribute(s)
Output: frequent episode rules Rules
Begin
(1)  Rrestricted = ∅;
(2)  scan database to form L = {large 1-itemsets that meet s0};
(3)  s = si;
(4)  while s ≥ s0 do begin
(5)    find serial episodes from L: each pattern must contain at least one axis attribute value
       that is not in Rrestricted;
(6)    append new axis attribute values to Rrestricted;
(7)    append episode rules to the output rule set Rules;
(8)    s = s/2;
     end while
end

Figure 1. Level-wise Approximate Mining of Frequent Episodes
Note that within each occurrence of the pattern, the action values refer to the same subject, yet the actual
subject value may not be given in the rule since any particular subject value may not be frequent with regard
to the entire dataset. In other words, subject is simply a reference or a variable.
The basic frequent episodes algorithm can be extended to consider reference attribute(s). Briefly, when
forming an episode, an additional condition is that, within its minimal occurrences, the records covered by
its constituent itemsets have the same value(s) of the reference attribute(s).
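This extra condition can be sketched as follows (a minimal illustration with hypothetical record and occurrence representations, not the authors' implementation):

```python
def same_reference(records, itemset_rows, ref_attrs):
    """Check the reference-attribute condition for one minimal occurrence.

    records: time-ordered list of audit records, each a dict
    itemset_rows: indices of the records covered by the episode's
                  constituent itemsets within this occurrence
    ref_attrs: names of the reference attribute(s), e.g. ["subject"]
    """
    # every covered record must agree on all reference attribute values
    first = records[itemset_rows[0]]
    return all(
        records[i][a] == first[a]
        for i in itemset_rows
        for a in ref_attrs
    )

records = [
    {"subject": "X", "action": "a1"},
    {"subject": "X", "action": "a2"},
    {"subject": "Y", "action": "a2"},
    {"subject": "X", "action": "a3"},
]
print(same_reference(records, [0, 1, 3], ["subject"]))  # True
print(same_reference(records, [0, 2, 3], ["subject"]))  # False
```

An occurrence whose covered records mix subjects, as in the second call, is simply not counted toward the episode's support.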
In summary, axis attributes and reference attributes provide a convenient and easy-to-use means of specifying domain knowledge to constrain the outcome of the association rules and frequent episodes algorithms so that they produce "relevant" output.
4.4. Level-wise approximate mining
Sometimes it is important to discover low frequency patterns. For example, in daily network traffic some services, such as gopher, account for a very low percentage of activity. Yet we still need to include their patterns in the network traffic profile so that we have representative patterns for each supported service. If we use a very low support value for the data mining algorithms, we will get an unnecessarily large number of patterns related to the high frequency services, for example, smtp.
We propose a level-wise approximate mining procedure, as outlined in Figure 1, for finding the frequent
episodes. Here the idea is to first find the episodes related to high frequency axis attribute values, for
example,
(service = smtp, src_bytes = 200), (service = smtp, src_bytes = 200) → (service = smtp, dst_bytes = 300), [0.3, 0.3, 2s]
We then iteratively lower the support threshold to find the episodes related to the low frequency axis values
by restricting the participation of the “old” axis values that already have output episodes. More specifically,
when a candidate itemset is generated, it must contain at least one “new” (i.e., low frequency) axis value.
The episodes from each iteration are those consisting of either all new axis values or a combination of new
and old axis values. For example, in the second iteration, where smtp now is an old axis value, we get an
episode rule with http as a new axis value
(service = smtp, src_bytes = 200), (service = http, src_bytes = 200) → (service = smtp, src_bytes = 300), [0.4, 0.1, 2s]
The procedure terminates when a very low support value is reached. In practice, this can be the lowest
frequency of all axis values.
Note that for a high frequency axis value, we in effect omit its very low frequency episodes generated in
the runs with low support value because they are not as interesting and representative. Hence our procedure
is “approximate” mining. We still include all the old (i.e., high frequency) axis values to form episodes with
the new axis values because it is important to capture the sequential context of the new axis values. For
example, although used infrequently, auth normally co-occurs with other services such as smtp and login. It
is therefore imperative to include these high frequency services when presenting episode rules about auth.
Our approach here is different from the algorithms in (Han and Fu, 1995) since we do not have and cannot assume multiple concept levels; rather, we deal with multiple frequency levels of a single concept, e.g., the network service.
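The control loop of Figure 1 can be sketched as follows (the episode miner itself is abstracted behind a hypothetical find_serial_episodes function; this is an illustration, not the implementation used in our experiments):

```python
def level_wise_mine(find_serial_episodes, axis_values_of, s_i, s_0):
    """Level-wise approximate mining of frequent episodes (Figure 1).

    find_serial_episodes(s, restricted): hypothetical miner returning
        episode rules at minimum support s whose patterns contain at
        least one axis attribute value not in `restricted`.
    axis_values_of(rule): the axis attribute values a rule uses.
    """
    restricted = set()   # axis values that already have output episodes
    rules = []
    s = s_i
    while s >= s_0:
        new_rules = find_serial_episodes(s, restricted)
        for r in new_rules:
            restricted |= set(axis_values_of(r))  # now "old" axis values
        rules.extend(new_rules)
        s = s / 2        # iteratively lower the support threshold
    return rules

# toy miner: one rule per service whose support clears the threshold
def toy_miner(s, restricted):
    supports = {"smtp": 0.4, "http": 0.2, "gopher": 0.05}
    return [{"axis": svc} for svc, sup in supports.items()
            if sup >= s and svc not in restricted]

print([r["axis"] for r in
       level_wise_mine(toy_miner, lambda r: [r["axis"]], 0.4, 0.05)])
# ['smtp', 'http', 'gopher']  -- low frequency services surface later
```

Note that the toy miner is deliberately simplified: in the real procedure, old axis values still participate in candidate itemsets so that new axis values keep their sequential context.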
5. Using the mined patterns
In this section we report upon our experience in mining the audit data and using the discovered patterns both
as the indicator for gathering data and as the basis for selecting appropriate temporal and statistical features
for learning classification models.
5.1. Audit data gathering
We posit that the patterns discovered from the audit data on a protected target (e.g., a network, system program, or user) correspond to the target's behavior. When we gather audit data to learn the target's
normal profile, we can compute the patterns from each new audit data set, and merge the new rules into the
existing aggregate rule set. These added new rules represent new variations of the normal behavior. When
the aggregate rule set stabilizes, i.e., no new rules from the new audit data can be added, we can stop the
data gathering since the aggregate audit data set has covered sufficient variations of the normal behavior 4 .
We merge two rules, r1 and r2, into one rule r if 1) their right and left hand sides are exactly the same, or their RHSs can be combined and LHSs can also be combined; and 2) the support values and the confidence values are close, i.e., within an ε. The concept of combining here is similar to clustering in (Lent et al., 1997) in that we also combine rules that are "similar" or "adjacent".
To simplify our discussion, consider combining the LHSs and assume that the LHS of r1 has just one itemset, (ax1 = vx1, a1 = v1). Here ax1 is an axis attribute. The LHS of r2 must also have only one itemset, (ax2 = vx2, a2 = v2). Further, ax1 = ax2, vx1 = vx2, and a1 = a2 must hold (i.e., the LHSs must cover the same set of attributes, and their axis values must be the same). For the LHSs to be combined, v1 and v2 must be adjacent values or adjacent bins of values. The LHS of the merged rule r is (ax1 = vx1, v1 ≤ a1 ≤ v2), assuming that v2 is the larger value. For example, (service = smtp, src_bytes = 200) and (service = smtp, src_bytes = 300) can be combined into (service = smtp, 200 ≤ src_bytes ≤ 300). To compute the statistically relevant support and confidence values of the merged rule r, we record support_lhs and db_size of r1 and r2 when mining the rules from the audit data. Here support_lhs is the support of a LHS and db_size is the number of records in the audit data.

4 Presently, we do not know of any formal means of determining when sufficient normal behavior has indeed been observed. Installing new services or programs on a system clearly requires new data gathering to establish new norms.
Our approach of merging rules is based on the intuition that even the same behavior will have slight
differences across audit data sets. Therefore we should not expect exact matches of the mined patterns.
Instead we need to combine similar patterns into more generalized ones.
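The merge step for the single-itemset case can be sketched as follows (a hypothetical rule representation; `eps` stands in for the adjacency test on values or bins, and the merged support and confidence would be recomputed from the recorded support_lhs and db_size values):

```python
def merge_lhs(lhs1, lhs2, axis_attr, eps=100):
    """Try to combine two single-itemset LHSs; return merged LHS or None.

    Each LHS is a dict of attribute -> value, e.g.
    {"service": "smtp", "src_bytes": 200}. Non-axis attributes are
    assumed numeric; differing values merge into a range when adjacent.
    """
    if set(lhs1) != set(lhs2):              # must cover the same attributes
        return None
    if lhs1[axis_attr] != lhs2[axis_attr]:  # axis values must be equal
        return None
    merged = {axis_attr: lhs1[axis_attr]}
    for attr in lhs1:
        if attr == axis_attr:
            continue
        v1, v2 = sorted((lhs1[attr], lhs2[attr]))
        if v1 == v2:
            merged[attr] = v1
        elif v2 - v1 <= eps:                # adjacent values or bins
            merged[attr] = (v1, v2)         # meaning v1 <= attr <= v2
        else:
            return None
    return merged

print(merge_lhs({"service": "smtp", "src_bytes": 200},
                {"service": "smtp", "src_bytes": 300}, "service"))
# {'service': 'smtp', 'src_bytes': (200, 300)}
```

This mirrors the (service = smtp, 200 ≤ src_bytes ≤ 300) example above: same attributes, same axis value, adjacent non-axis values.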
5.1.1. Experiments
Here we test our hypothesis that the merged rule set can indicate whether the audit data has covered sufficient
variations of normal behavior. We obtained one month of TCP/IP network traffic data from http://ita.ee.lbl.gov/html/contrib/LBL-CONN-7.html. We hereafter refer to it as the LBL dataset. We segmented the
data by day. With the data of each day, we again segmented the data into four partitions: morning, afternoon,
evening and night. This partitioning scheme allowed us to cross evaluate anomaly detection models of
different time segments that have different traffic patterns. It is often the case that very little and sometimes
no intrusion data is available when building an anomaly detector. A common practice is to use audit data of
legitimate activities that is known to have different behavior patterns for testing and evaluation. We followed
this approach here.
We first describe the experiments and results on the “weekday morning” traffic data on connections
originated from LBL to the outside world. We decided to compute the frequent episodes using the network
service as the axis attribute. Recall from section 4.2 that this formalism captures both association and
sequential patterns of network services. We also used the level-wise approximate mining procedure so that
even the infrequent services are included in the mined patterns. For the first three weeks, we mined the
patterns from the audit data of each weekday morning, and merged them into an aggregate rule set. For each rule in the rule set we record merge_count, the number of merges on the rule. Note that if two rules r1
[Figure 2. The Number of Rules vs. The Number of Audit Data Sets — plots, over 17 days, the number of rules in the aggregate rule set for: all services, frequent all services, frequent http, frequent smtp, and frequent telnet.]
and r2 are merged into r, its merge_count is the sum from the two parent rules. merge_count indicates how frequently the behavior represented by the merged rule is encountered across a period of time. The rules with merge_count ≥ min_frequency are regarded as frequent rules. Frequent rules are more reliable in describing behavior since they are "verified" by multiple audit data sets. The freshness field of a rule indicates the last time the rule was merged. It can be used to age out "old" frequent rules.
Figure 2 plots how the rule set changes as we merge patterns from each new audit data set. We see that
the total number of rules monotonically increases. We visually inspected the new patterns from each new
data set. In the first two weeks, the majority are related to “new” network services that have no prior patterns
in the aggregate rule set, and for the last week, the majority are just new patterns of the existing services.
Figure 2 shows that the rate of change indeed slows down during the last week. We used min_frequency = 2 to filter out the "one-time" patterns so that the remaining patterns are considered "frequent". We can see in the plots that the frequent rule sets of all services as well as of the individual services grow at a much slower rate and tend to stabilize.
[Figure 7. Misclassification Rates of Classifier Trained on First 8, 10, 15 and 17 Weekdays — four panels, (a) 8 days, (b) 10 days, (c) 15 days, (d) 17 days, each plotting the misclassification rate on weekday morning, weekend morning, and weekday night test data.]
We used the set of frequent rules of all services as the indicator of whether the audit data is sufficient to model normal behavior. We tested the quality of this indicator by constructing four classifiers, using audit data from the first 8, 10, 15, and 17 weekday mornings, respectively, for training anomaly detectors. We used the services of the connections as the class labels, and included a number of temporal and statistical features (we discuss the feature selection and construction process in Section 5.3). The classifiers were tested using the audit data from the mornings and nights of the last 5 weekdays of the month, as well as the last 5 weekend mornings. Figure 7 shows the performance of these four classifiers respectively. On each
plot, we show the misclassification rate on the test data. Since the classifiers model the weekday morning traffic, we wish to see this rate be low on the weekday morning test data, but high on the weekend morning data as well as the weekday night data. The figures show that the classifiers with more training data perform better. Further, the performance of the last two classifiers is very close. This is not surprising at all because
from the plots in Figure 2, the set of frequent rules, that is, our indicator on audit data, grows on weekdays 8 and 10, but stabilizes from day 15 to 17.
This experiment provides evidence that indeed the rate of growth of the rule set is a plausible indicator
on whether sufficient training data has been collected for building useful models for anomaly detection.
However, this study involved only one specific externally supplied data set. Further study is required to
determine how generally applicable this might be to other contexts.
5.2. Off-line analysis
Here we discuss some ideas and experiments on using the mined patterns as a means for understanding and
comparing the activity behavior exhibited in the audit data. The fact that the merged rule set can be used to
model “new” behavior when gathering audit data suggests that we can use the final rule set directly to detect
anomalies. Also, by mining and comparing specific patterns from audit data, we can discover whether certain
known intrusion patterns exist. In other words, the patterns might be used for misuse intrusion detection
directly.
5.2.1. Anomaly detection: experiments on LBL dataset
We used the final set of frequent rules to distinguish the traffic data of the last week’s weekday mornings
from all the weekends’ mornings, and from the last week’s weekday nights. We use a similarity measure.
Assume that the merged rule set has n rules, the size of the new rule set from an audit data set is m, and the number of matches (i.e., the number of rules that can be merged) between the merged rule set and the new rule set is p; then we have similarity = (p/n) × (p/m). Here p/n represents the percentage of known behavior (from the merged rule set) exhibited in the new audit data, and p/m represents the proportion of behavior in the new audit data that conforms to the known behavior. Figure 8 shows that in general the similarity of the weekday mornings is much larger than that of the weekend mornings and the weekday nights.
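Reading the similarity score as the product (p/n)·(p/m), it can be computed directly once the matches are counted (`mergeable` is a hypothetical predicate wrapping the rule-merge test of Section 5.1):

```python
def similarity(merged_rules, new_rules, mergeable):
    """similarity = (p/n) * (p/m): p = number of new rules that match
    (can be merged with) some rule in the merged rule set,
    n = |merged_rules|, m = |new_rules|."""
    n, m = len(merged_rules), len(new_rules)
    p = sum(
        any(mergeable(r_new, r_old) for r_old in merged_rules)
        for r_new in new_rules
    )
    return (p / n) * (p / m)

# toy example: rules matched by exact equality
old = ["r1", "r2", "r3", "r4"]
new = ["r1", "r3", "r9"]
print(similarity(old, new, lambda a, b: a == b))  # (2/4) * (2/3) = 1/3
```

A high score thus requires both that the known profile is exercised (p/n) and that little of the new behavior falls outside it (p/m).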
[Figure 8. Similarity Measures Against the Merged Rule Set of Weekday Mornings — plots the similarity (on a 0.00 to 0.25 scale) of the weekday morning, weekend morning, and weekday night data sets over 5 days.]
Note that in general these mined patterns cannot be used directly to classify the records. That is, they cannot tell which records are anomalous. However, they are very valuable in off-line analysis. We built a
graphical tool to visually display the differences of two rule sets. When comparing the weekday morning
rule set with a weekend morning's, we see the latter rule set has a large number of telnet patterns from 2 or 3 particular machines. This was because only a few people showed up to work on the weekend mornings, so each machine was relatively frequent, whereas on weekdays many machines used telnet and thus no single machine alone was frequent enough.
5.2.2. Misuse detection: experiments on InfoWorld IWSS16 dataset
We report our recent experiments on a set of network intrusion data from InfoWorld, which contains the attacks of the "InfoWorld Security Suite 16" that was used to evaluate several leading commercial intrusion detection products (McClure et al., 1998). We hereafter refer to this dataset as the IWSS16 dataset.
We were given two traces of tcpdump data. One contains 4 hours of normal network traffic, and the
other contains 2 hours of network traffic where 16 different types of attacks were simulated. We first summarized the raw packet level tcpdump data into connection records using the preprocessing script described in
Section 3. For each record, we included only the “intrinsic” connection attributes such as service, duration,
flag, etc., that is, no temporal and statistical features were added.
According to their attack methods5 , several intrusions in IWSS16 would leave distinct evidence in
the short sequence of time ordered connection records. Below we demonstrate how our frequent episodes
program can be used to find or test the patterns of these attacks. The main idea is to mine frequent episodes
from both the normal data and attack data, and compare the patterns to identify those that are from the attack
data only. The procedure of comparing patterns and identifying intrusion patterns is discussed in Section 5.3.
We used a time window of 2 seconds when computing the following frequent episodes.
- Port Scan and Netsonar Scan: The attacker systematically makes connections to each port (i.e., service) of a target host (i.e., the destination host) in order to find out which ports are accessible. In the connection records, there should be a host (or hosts) that receives many connections to its "different" ports in a short period of time. Further, a lot of these connections have the "REJ" flag since many ports are normally unavailable (hence the connections are rejected).

  Data mining strategy: use dst_host as both the axis attribute and the reference attribute to find the "same destination host" frequent sequential "destination host" patterns;

  Evidence in intrusion data: there are several patterns that suggest the attack, for example,

  (dst_host = hostv, flag = REJ), (dst_host = hostv, flag = REJ) → (dst_host = hostv, flag = REJ), [0.65, 0.2, 2s]

  But no patterns with flag = REJ are found when using service as the axis attribute and dst_host as the reference attribute, since a large number of different services are attempted in a short period of time, and as a result, for each service the "same destination host" sequential patterns are not frequent;

  Contrast with normal data: patterns related to flag = REJ indicate that the "same" service is involved.
5 Scripts and descriptions of many intrusions can be found using the search engine at http://www.rootshell.com
- Syn Flood: The attacker makes a lot of "half-open" connections, by sending only a "syn request" but not establishing the connection, to a port of a target host in order to fill up the victim's connection-request buffer. As a result, the victim will not be able to handle new incoming requests. This is a form of "denial-of-service" attack. In the connection records, there should exist a host where one of its ports receives a lot of connections with flag "S0" (i.e., only the "syn request" packet is seen) in a short period of time.

  Data mining strategy: use service as the axis attribute and dst_host as the reference attribute to find the "same destination host" frequent sequential "service" patterns;

  Evidence in intrusion data: there is very strong evidence of the attack, for example,

  (service = http, flag = S0), (service = http, flag = S0) → (service = http, flag = S0), [0.6, 0.15, 2s]

  Contrast with normal data: no patterns with flag = S0.
- ICMP Flood: The attacker sends a very large number of icmp echo messages to a target host in an attempt to use up its resources, since the kernel will be busy processing the incoming messages. This is another form of "denial-of-service" attack. In the connection records, there should exist a host that receives a lot of icmp echo connections in a short period of time.

  Data mining strategy: use service as the axis attribute and dst_host as the reference attribute to find the "same connection host pair" frequent sequential "service" patterns;

  Evidence in intrusion data: there is very strong evidence of the attack, for example,

  (service = icmp_echo, src_host = hosth, dst_host = hostv), (service = icmp_echo, src_host = hosth, dst_host = hostv) → (service = icmp_echo, src_host = hosth, dst_host = hostv), [0.45, 0.1, 2s]
  Contrast with normal data: no such patterns.
Other types of intrusions in IWSS16 do not show evidence in connection record sequences since their
attack actions can be embedded in a single connection. Section 5.3.1 discusses how to mine patterns from
the packet level data and improve the detection models.
5.3. Feature selection and construction
We use the mined frequent episodes from network connection records as guidelines to construct temporal and statistical features for building classification models. An iterative procedure that involves pattern mining and comparison, feature construction from patterns, and model building and evaluation is needed in order to find the most effective classification model. In each iteration, we choose a different data mining strategy regarding the choices of axis attributes and reference attributes. These choices are limited to the "essential" attributes (see Section 4.2), that is, service, dst_host, src_dst, or src_port. Note that the exact timestamp is never frequent and is thus omitted. Since intrusions are generally targeted at some victim host(s) in the network, we start with service and dst_host.
For building misuse detection models, we apply the frequent episodes program to both normal connection data and intrusion data, then compare the resulting patterns to find the "intrusion only" patterns. The details of the pattern comparison algorithm are described in (Lee et al., 1999). Briefly, since the number of patterns may be very large and there are rarely exactly matched patterns from two data sets, this algorithm uses a heuristic, in the form of a domain-specific "order of importance" among attributes6, and considers two episodes that involve two different sets of "important" attributes (e.g., service, dst_host, and flag) as more different, and episodes with the same set of "important" attributes but different sets of "unimportant" attributes (e.g., src_bytes) as more alike. This algorithm then outputs a user-specified top percent of the most "intrusion only" patterns.
6 From the data schema perspective, “order of importance” specifies which attribute is more important, compared with others,
in describing the data.
Input: a frequent episode, and the set of existing features in connection records, F
Output: the updated F
Begin
(1)  Let F0 (e.g., dst_host) be the reference attribute used to mine the episode;
(2)  Let w, in seconds, be the minimum width of the episode;
     /* all the following features consider only the connections in the past w
      * seconds that share the same value in F0 as the current connection */
(3)  Let count_sameF0 be the number of these connections;
(4)  F = F ∪ {count_sameF0};
(5)  for each "essential attribute" F1 other than F0 do begin
(6)    if the same F1 value is in all the itemsets then begin
(7)      Let count_sameF1 be the number of connections that share the same F1 value
         as the current connection;
(8)      F = F ∪ {count_sameF1};
       end else begin  /* there are different F1 or no F1 values at all */
(9)      Let count_diffF1 be the number of different F1 values in the connections;
(10)     F = F ∪ {count_diffF1};
       end
     end
(11) for each value V2 of a non-essential attribute F2 do begin
(12)   if V2 is in all the itemsets then begin
(13)     Let count_sameV2 be the number of connections that share the same V2 value
         as the current connection;
(14)     F = F ∪ {count_sameV2};
(15)   end else if F2 is a numerical attribute then begin
(16)     Let averageF2 be the average of the F2 values of the connections;
(17)     F = F ∪ {averageF2};
       end
     end
End

Figure 9. Constructing Features from a Frequent Episode
Each of the intrusion patterns, for example, the “syn flood” pattern
(service = http, flag = S0), (service = http, flag = S0) → (service = http, flag = S0)
is used for constructing additional features into the connection records according to the algorithm in Figure 9.
This "syn flood" pattern results in the following additional features: a count of connections to the same dst_host in the past 2 seconds, and among these connections, a count of those that have the same service, and a count of those that have the "S0" flag.
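As an illustration, these three features can be computed over a sliding window as follows (a sketch with a hypothetical record layout; the paper's actual features are added by the preprocessing scripts):

```python
def syn_flood_features(connections, i, w=2.0):
    """Features for connection i, per the "syn flood" pattern: over the
    past w seconds, the count of connections to the same dst_host, and
    among those, counts with the same service and with the S0 flag.
    Each connection is a dict with 'time', 'dst_host', 'service', 'flag'.
    """
    cur = connections[i]
    window = [
        c for c in connections[:i]
        if cur["time"] - c["time"] <= w and c["dst_host"] == cur["dst_host"]
    ]
    return {
        "dst_count": len(window),
        "srv_count": sum(c["service"] == cur["service"] for c in window),
        "S0_count": sum(c["flag"] == "S0" for c in window),
    }

conns = [
    {"time": 0.1, "dst_host": "v", "service": "http", "flag": "S0"},
    {"time": 0.5, "dst_host": "v", "service": "http", "flag": "S0"},
    {"time": 0.9, "dst_host": "w", "service": "smtp", "flag": "SF"},
    {"time": 1.2, "dst_host": "v", "service": "http", "flag": "S0"},
]
print(syn_flood_features(conns, 3))
# {'dst_count': 2, 'srv_count': 2, 'S0_count': 2}
```

During a real syn flood, these counts climb quickly, which is exactly what the RIPPER rules in Table IV test for.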
An open problem here is how to decide the right time window value w. We mined sequential patterns
using different w values, from 0.5 to 20 with an increment of 0.5, and plotted the number of patterns generated at each run. Our experience showed that this plot tends to stabilize after the initial sharp jump. We call
the smallest w in the stable region w0 . In (Lee and Stolfo, 1998), we reported experiments of using different
w values to calculate temporal and statistical features for classification models. Our results showed the plot
of accuracy of the classifier also stabilizes after w ≥ w0 and tends to taper off. Intuitively, a requirement for
a good window size is that its set of sequential patterns is stable, that is, sufficient patterns are captured and
noise is small. We therefore use w0 for adding temporal and statistical features.
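One way to pick w0 automatically from the pattern-count plot (a heuristic sketch; in our experiments w0 was chosen by inspecting the plot):

```python
def stable_w0(counts, tol=0.05):
    """counts: list of (w, num_patterns) pairs, w increasing.
    Return the smallest w from which the pattern count changes by at
    most `tol` (relative) between all later consecutive runs."""
    for i in range(len(counts) - 1):
        w, c = counts[i]
        stable = all(
            abs(c2 - c1) <= tol * max(c1, 1)
            for (_, c1), (_, c2) in zip(counts[i:], counts[i + 1:])
        )
        if stable:
            return w
    return counts[-1][0]

# hypothetical runs: sharp jump, then a stable region starting at w = 2.0
runs = [(0.5, 10), (1.0, 40), (1.5, 90), (2.0, 100), (2.5, 101), (3.0, 102)]
print(stable_w0(runs))  # 2.0
```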
For building anomaly detection models when there is no intrusion data available, we don’t need the
pattern comparison step. We simply use the feature construction procedure outlined in Figure 9 on all the
frequent episodes mined from the normal data. The classifiers briefly described in Section 5.1.1 include these temporal and statistical features: for the past 140 seconds, the number of connections with the same service as the current connection, and among these connections, the count of the "SF" flag, the average duration, the average src_bytes, and the average dst_bytes.
5.3.1. Experiments on IWSS16 dataset
We report our experiments of building misuse detection models for the IWSS16 attacks. We applied the
feature construction process discussed above, and for each connection record we added these temporal and
statistical measures on the connections computed over the prior 2 seconds:
- For connections that have the same destination host as the current connection: the total number of such connections, the number of connections with the "S0" or "S1" flags (connection establishment errors), the number of connections with the "REJ" flag, and the number of different services;
Table III. Labeled Connection Records of IWSS16 Dataset

protocol    flag  dst_count  src_count  srv_S01_count  ...  label
telnet      SF    1          1          0              ...  normal
icmp_echo   SF    1          1          0              ...  normal
icmp_echo   SF    350        350        0              ...  icmp_flood
telnet      S0    1          1          0              ...  normal
http        S0    25         20         20             ...  syn_flood
...         ...   ...        ...        ...            ...  ...
- For connections that have the same service as the current connection: the total number of such connections, the number of connections with the "S0" or "S1" flags, the number of connections with the "REJ" flag, the number of distinct source hosts, and the number of distinct destination hosts.
We used the intrusion types (i.e., "normal" or one of the IWSS16) as the class labels. Table III shows examples of these labeled connections. The data set has a total of 12,027 connection records. We used 80% of the connection records from the intrusion trace as the training dataset and the remaining 20% as the testing dataset. Note that since each of the attack types was simulated more than once, we were able to have all intrusion types represented in both the training and testing datasets. The resultant RIPPER classifier had an overall accuracy of 99.1%, a false positive (i.e., "normal" classified as "intrusion") rate of 1% and a false negative (i.e., "intrusion" classified as "normal") rate of 0.6% on the testing dataset. Note that the false positive rate is calculated as the percentage of negatives incorrectly classified, and the false negative rate is calculated as the percentage of positives incorrectly classified. There were 62 rules in the resultant RIPPER classifier, and all were very intuitive. Table IV shows some example RIPPER rules.
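The two error rates follow directly from these definitions; a minimal sketch (toy labels, not the IWSS16 counts):

```python
def error_rates(y_true, y_pred):
    """False positive rate: fraction of 'normal' records classified as an
    intrusion. False negative rate: fraction of intrusion records
    classified as 'normal'."""
    negatives = [p for t, p in zip(y_true, y_pred) if t == "normal"]
    positives = [p for t, p in zip(y_true, y_pred) if t != "normal"]
    fp_rate = sum(p != "normal" for p in negatives) / len(negatives)
    fn_rate = sum(p == "normal" for p in positives) / len(positives)
    return fp_rate, fn_rate

# toy data: 99 normal records (one misflagged) and 1 missed intrusion
y_true = ["normal"] * 99 + ["syn_flood"]
y_pred = ["normal"] * 98 + ["syn_flood"] + ["normal"]
print(error_rates(y_true, y_pred))  # fp = 1/99, fn = 1.0
```

The toy numbers also illustrate the caveat discussed next: with very few intrusion connections, a "low" false negative rate can coexist with entirely missed attack types.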
Although the false negative rate is very low, the classifier actually detected only the 5 "scan and flood" intrusions described in Section 5.2.2. Each of these intrusions generates a large number of connections, each of which is labeled by the intrusion type. The other intrusions, for example "teardrop", can each generate just one connection that is sufficient to bring down the victim host. Therefore we cannot take comfort in the low false negative rate. However, because of the features we had chosen, the connection records of these intrusions
Table IV. RIPPER Rules for IWSS16

RIPPER rule: syn_flood :- flag = S0, srv_S01_count >= 4.
Meaning: if the current connection's flag is S0, and, among connections that have the same service as the current one in the past 2 seconds, the number with an S01 (i.e., S0 or S1) flag is at least 4, then this connection is a syn flood attack.

RIPPER rule: icmp_flood :- protocol = icmp_echo, dst_count >= 5, srv_count >= 5.
Meaning: if the connection protocol is icmp_echo, and, in the past 2 seconds, there are at least 5 connections that have the same destination host as the current connection, and at least 5 connections that have the same service as the current one, then this connection is an icmp flood attack.

...

RIPPER rule: normal :- true.
Meaning: if none of the above, it is a normal connection.
match the characteristics of the “normal” connections. Therefore we need to modify our preprocessing script
to include the “missing” system features that are distinctive for the intrusions.
We did a minimal amount of preprocessing on the raw packet-level data. Each packet-level record
has the following features: timestamp, connection_id (i.e., the connection to which the packet data belongs),
head, and tail. Here for TCP protocols, head is the flag in the TCP header (e.g., “S” for a syn packet,
“P” for a data packet, etc.); for other protocols, it is the name of the protocols, for example, “icmp echo”.
tail is the remaining information in the packet header, for example “ack”, “win” etc. We ran our frequent
episodes program with head as the axis attribute and connection_id as the reference attribute since packets
of different connections are not related. The patterns from the normal data present no surprises since they
merely indicate that the packets are following the most common features of the protocols. For example,
(head = P, tail = ack_win) → (head = P, tail = ack_win)
A Data Mining Framework for Adaptive Intrusion Detection
31
means that a data packet is followed by another, each with acknowledgment and window advertisement
information. However, there are several distinct patterns from the intrusion data.
(head = frag), (head = frag) → (head = frag)

and

(tail = frag), (tail = frag) → (tail = frag)

suggest that there are connections with many fragmented packets.
(head = P, tail = ack_win_urg) → (head = P, tail = ack_win_urg)

indicates that in some connections, the “urgent” flag is used when data packets are transferred.
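This mining step groups packet records by the reference attribute connection_id and looks for frequent serial episodes of the axis attribute head. A much-simplified sketch, assuming packet records of the form (timestamp, connection_id, head); real frequent-episode mining uses sliding time windows and episodes of arbitrary length (Mannila et al., 1995), whereas this toy version only counts adjacent same-connection pairs:

```python
from collections import Counter, defaultdict

def frequent_head_episodes(packets, min_support):
    """Count length-2 serial episodes of the axis attribute `head`,
    restricted to consecutive packets of the same connection (the
    reference attribute), and keep those meeting min_support.

    `packets` is a list of (timestamp, connection_id, head) tuples.
    """
    by_conn = defaultdict(list)
    for ts, conn, head in sorted(packets):   # order packets by timestamp
        by_conn[conn].append(head)
    counts = Counter()
    for heads in by_conn.values():
        for pair in zip(heads, heads[1:]):   # adjacent packet pairs
            counts[pair] += 1
    return {ep: n for ep, n in counts.items() if n >= min_support}
```

On normal traffic this surfaces pairs such as (P, P), matching the common per-protocol behavior described above; on intrusion traffic, pairs involving fragments stand out.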
We modified our preprocessing scripts according to these “new” patterns: we include a flag in the connection record to indicate the presence of “urgent” data packets; we examine the packet fragments to check
for violations of the TCP/IP protocols, and include a flag that indicates the existence of such violations. The
resultant new RIPPER classifier, learned from the new connection-level training data, detected 4 additional
intrusions, hereafter referred to as the “funny packet” intrusions. As a result, the false negative rate dropped
to 0.2%.7
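The flag derivation could look like the following sketch. The (connection_id, head, tail) record layout and the overlap test are our simplifications: fragments carry an (offset, length) pair, and overlapping fragments (as in teardrop) are treated as the TCP/IP violation, though real checking would follow full reassembly rules:

```python
from collections import defaultdict

def connection_flags(packets):
    """Derive per-connection flags for "urgent" data and fragmentation
    violations. `packets` is a list of (connection_id, head, tail)
    tuples, where tail is an (offset, length) pair for fragments and a
    header-info string otherwise. This layout is our simplification.
    """
    urgent = defaultdict(int)
    bad_frag = defaultdict(int)
    frags = defaultdict(list)
    for conn, head, tail in packets:
        if head == "frag":
            frags[conn].append(tail)          # collect (offset, length)
        elif "urg" in tail:
            urgent[conn] = 1                  # urgent flag on a data packet
    for conn, pieces in frags.items():
        pieces.sort()
        for (o1, l1), (o2, _) in zip(pieces, pieces[1:]):
            if o2 < o1 + l1:                  # overlapping fragments
                bad_frag[conn] = 1
    return {c: (urgent[c], bad_frag[c])
            for c in set(urgent) | set(bad_frag) | set(frags)}
```

The two resulting flags are then appended to each connection record before the classifier is retrained.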
There is a limit to any automated approach. Our method of mining the packet-level audit data failed to
find patterns of the three remaining undetected network-based intrusions. Both “land” and “latierra” spoof
a connection by faking the original host and port so that the connection appears to be originated from and
destined for the same host and port. Since only a single “syn request” packet with the same host and port
is seen, it is rather difficult for our frequency-based mining algorithms to discover this pattern. The “zone
transfer” attack uses the TCP version of dns to transfer a large number of data packets. Its packet-level
7
Among these 4 intrusions, “teardrop”, “ping of death”, and “bonk” all have violations on fragmentation, with each attack using
a different network service; “oob” (or “WinNuke”) exploits the fact that some Windows systems would simply crash when receiving
any “urgent” packets.
patterns are normal since many normal TCP connections of other network services also transfer a lot of data
packets. We incorporated this “expert” knowledge into our connection records for the sake of completeness.
The last remaining four intrusions of the IWSS16 dataset leave no evidence in any packet header since
they are host-based attacks rather than network-based. Section 6 discusses how to incorporate evidence from
the multiple data sources to construct a “complete” detection model that considers both network-level and
host-level audit data.
6. Combining multiple detection models
There are several important reasons for combining multiple detection models. First, in order to avoid becoming a performance bottleneck and an easy target of “subversion”, an intrusion detection system should
consist of multiple cooperative lightweight subsystems that each monitors separate parts (i.e., access points)
of the entire network environment. Some current network intrusion detection systems run on the gateway
that separates the protected LAN and the outside network. In order for these centralized systems to detect
the full range of intrusions (e.g., guessing password to gain access) they must examine not only the header of
each packet, but the data portion (e.g., how many times a password string is sent) as well. Since every packet
has to go through this single complex checkpoint, network throughput will be affected. Alternatively we can
have one relatively lightweight system on the gateway that checks only the packet headers, and several
host-based systems that monitor the activities on the “important” hosts. A “global” detection system can
then combine the evidence from these subsystems and take appropriate action. For example, upon receiving
signals from a host-based subsystem that a connection involves illegal activities, the global system can
instruct the gateway subsystem to intervene and terminate the offending connection.
An effective intrusion detection system must have both misuse and anomaly detection components. We
trained an anomaly detection model using only the normal traffic data of the IWSS16 data set. We used the
services of the connections as the class labels. The resultant RIPPER rules, with no prior knowledge of any
intrusion, were able to detect the intrusions described in Section 5.2.2 because these intrusions generate a
large number of connections that are out of normal characteristics according to the temporal and statistical
measures. However, even after adding the flags on fragmentation and “urgent” packets, the new rule sets
still could not detect the 4 “funny packet” intrusions as the misuse detection model did. The reason is that the
“normal” rules include only the characteristics that are unique to each service (the class label), yet for every
normal connection, regardless of its service, these flags are always “off”. On the other hand, we cannot write
off the anomaly detection models, since misuse detection can detect only known intrusions.
We believe that the best way to make intrusion detection models adaptive is by combining existing
models with new models trained on new intrusion data or new normal data, e.g., from a new component of
the network, instead of restarting the entire model building process using the aggregate of the archived audit
data and new data.
Meta-learning can be used to inductively learn a meta-classifier that combines a number of pre-existing
base classifiers (i.e., component detection models). Here, in order to obtain the combined model (i.e., the meta-classifier), meta-level connection records are constructed where the features are predictions from the base
detection models, and the class labels are the original connection labels (i.e., “normal” or an intrusion). A
machine learning algorithm can then be applied to the meta-level records to produce the combined detection
model.
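The construction of meta-level records is straightforward; in this sketch we assume the base detection models can be treated as functions from a connection record to a predicted label:

```python
def meta_records(old_model, new_model, holdout):
    """Build meta-level connection records from two base detection models.

    `old_model` and `new_model` are classifiers (record -> label);
    `holdout` is a list of (record, true_label) pairs not used to train
    either base model. The result is the training data for the
    meta-classifier.
    """
    return [{"old_model_prediction": old_model(r),
             "new_model_prediction": new_model(r),
             "label": y}
            for r, y in holdout]
```

A rule learner such as RIPPER is then applied to these records to produce the combined detection model.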
We have previously performed a number of meta-learning experiments for credit card fraud detection (Stolfo et al., 1997). We next describe similar experiments in intrusion detection.
6.1. Experiments: combining for adaptability

Table V. Meta-level Connection Records

old_model_prediction   new_model_prediction   label
normal                 normal                 normal
normal                 teardrop               teardrop
syn_flood              normal                 syn_flood
syn_flood              syn_flood              syn_flood
normal                 syn_flood              normal
normal                 bonk                   bonk
...                    ...                    ...
In Section 5.3.1 we retrained a new misuse detection model after learning characteristics of the 4 “funny
packet” intrusions. Here we describe our experiment on combining the old model with the new model that
was trained on these 4 new intrusions.
In the training data for the old model, we removed the connection records of the 4 “funny packet”
intrusions to simulate the condition that these intrusions are “new” (i.e., unobserved) and thus are absent from
the old model data. We want to test the idea that new models can be specialized in detecting new intrusions.
Each connection record for the new model includes the flags for fragmentation and “urgent” packets, but not
any of the temporal and statistical features included in the old model, since they are irrelevant for these 4
intrusions. We still include all the attack types in the training data of the new model because we need to see
whether the new features can be useful in detecting the old intrusions (i.e., the “scan and flood” attacks).
A meta-learning process is used to generate the combined detection model. Both the old model and
new model are applied to a common set of “hold-out” connection records that are not used in training the
models. Each meta-level connection record has the outputs (i.e., predictions) from the old model and the
new model, plus the correct class label, i.e., normal or an intrusion name. Table V shows examples of the
meta-level records. The resultant RIPPER rules, as expected, use the predictions from the new model for the
new intrusions and the predictions from the old model for the old intrusions. Table VI shows some example
meta-level RIPPER rules.
Table VI. Example Meta Rules

RIPPER rule: teardrop :- new_model_prediction = teardrop.
Meaning: if the prediction from the new model is teardrop, then this connection is a teardrop attack.

RIPPER rule: syn_flood :- old_model_prediction = syn_flood.
Meaning: if the prediction from the old model is syn flood, then this connection is a syn flood attack.

...

RIPPER rule: normal :- true.
Meaning: if none of the above, it is a normal connection.
Note that although the new model can also detect the “scan and flood” intrusions, e.g., syn flood, the
rules, e.g., “syn_flood :- flag = S0”, have high false positive rates because the new model doesn’t have the
temporal and statistical features that are essential in detecting the “scan and flood” intrusions. The combined
model uses the predictions from the old model to detect these intrusions because the old model rules are more
accurate.
The combined detection model has slightly better performance than the updated model re-trained using
the entire dataset in Section 5.3.1. Here the false negative rate dropped to 0.18% (compared to 0.2%) while
the false positive rate remained the same. This experiment shows that when new intrusions can be detected
using a new set of features, we can train a specialized new detection model. The meta-classifier that combines
the existing detection model and the new model can be just as effective as an updated model that is retrained
using the updated audit data set and feature set.
But what about new intrusions that require no new features? We conducted similar meta-learning
experiments on the 5 “scan and flood” intrusions. The “updated” detection model on all 5 intrusions has
better performance than the meta-classifier that combines the “new” model specialized on “syn flood” and
the “existing” model on the other 4 “scan and flood” intrusions. The meta-classifier has a false negative rate
of 1.1%, compared with 0.6% for the detection model on the 5 intrusions, while the false positive rates are
the same. We inspected the “existing” model and the “updated” model and compared the rules regarding the
the same. We inspected the “existing” model and the “updated” model and compared the rules regarding the
4 intrusions that are common to both models. The rules in the “updated” model include more distinguishing
features useful for more accurate classification. This is because with an additional class (i.e., intrusion type)
of records in the training data, the machine learning algorithm has more information to make better choices
on the features that can set the intrusions and normal records apart. Therefore, when the accuracy of a
detection model is critical, it is more desirable to retrain the model for the new intrusions that are detectable
using the existing features.
We have only begun studying this issue, and an extensive set of additional experiments is needed to
elucidate the tradeoffs in this approach.
7. Related work
Recently there have been quite a few new research efforts in intrusion detection. In (Forrest et al., 1996)
it was shown that the short sequences of run-time system calls of a system program can be used as its
“identity”. To analyze a program, its short sequences of system calls are looked up in the “normal” database
that contains the exhaustively gathered short sequences from its normal runs. A high degree of mismatch
suggests that the program is under attack (i.e., its weakness is exploited). Here the “normal” database is used
as an anomaly detection model. While this approach seems simple, it is only suited for system programs.
The number of possible execution paths of a program is perhaps limited, and therefore one can in theory
build a “normal” database that exhausts all possible sequences. However network traffic patterns and user
behavior are in general unlimited (i.e., they can shift and evolve). The approach of using association rules
and frequent episodes and measuring their growth might provide a sufficient indicator to detect when the
“normal” has changed.
In (Lane and Brodley, 1997), algorithms for modeling user behavior were discussed. The idea is to
match the user command sequences with established profiles. Various heuristics for partial string matching
were introduced. This approach is also very limited. For example, the algorithms assume that user commands
(i.e., (command, options, arguments) tuples) are first collapsed into a single-dimension string of “command
options arguments”. Converting the multi-attribute connection or session records into a single attribute
string is not appropriate since each attribute carries distinct and yet important information. Section 5.2.2
demonstrates that it is important to be able to analyze the data using different combinations of the attributes.
In DC-1 (Detector Constructor) (Fawcett and Provost, 1997), a rule learning step is first used to obtain
each customer’s fraudulent patterns, and rule selection is then used to obtain a set of general fraudulent
patterns for the entire population. A monitor construction step is used to obtain the sensitivity measures
of different users to these general patterns. The final detector, a classifier, is then trained using records
where the features are the outputs of monitors (i.e., sensitivity measures). We face here greater challenges
in feature selection and construction. DC-1 assumes call data with a set of base-level attributes. Here we
need to carefully preprocess the raw audit data into meaningful connection or session records. DC-1 does
not need to consider the temporal and statistical features from the base-level attributes in call data, which
are critical for an effective network intrusion detection model.
8. Conclusion
In this paper we outlined our approach for building intrusion detection models. We proposed that association rules and frequent episodes from the audit data can be used to guide audit data gathering and feature
selection, the critical steps in building effective classification models. We incorporated domain knowledge
into these basic algorithms using the axis and reference attribute(s), and a level-wise approximate mining
procedure. Our experiments on real-world audit data showed that the algorithms are effective. Intrusion
detection models need to be adaptive to changing environments and new intrusion methods. We show
preliminary but promising results in using meta-learning as a mechanism to incrementally incorporate new
models into the overall intrusion detector.
Several tools and steps in our framework, for example, using frequent episode programs to find specific
patterns, are not fully automatic (e.g., we need to manually inspect the patterns). We need to provide support
for rule templates (Klemettinen et al., 1994) so that patterns can be post-processed and presented as query
results to users.
It is important to include users in the knowledge discovery tasks. We are implementing a support
environment that integrates the iterative processes of selecting features, and building and evaluating classification models. This environment graphically presents the patterns along with the list of features and the
time windowing information to the user, and allows him/her to formulate a classification task, build and test
the model using a classification engine such as JAM (Stolfo et al., 1997).
9. Acknowledgments
We are very grateful to Stephanie Forrest of University of New Mexico for providing us with the sendmail
system call data, and Stuart McClure of InfoWorld for supplying us with the IWSS16 data set.
Our work has benefited from in-depth discussions with Alexander Tuzhilin of New York University,
and suggestions from Charles Elkan of UC San Diego and Foster Provost of Bell Atlantic Science and
Technology. We also wish to thank Phil Chan of Florida Institute of Technology, Andreas Prodromidis and
Wei Fan, both of Columbia University, for their help and encouragement.
References
Agrawal, R., T. Imielinski, and A. Swami: 1993, ‘Mining Association Rules between Sets of Items in Large Databases’. In:
Proceedings of the ACM SIGMOD Conference on Management of Data. pp. 207–216.
Agrawal, R. and R. Srikant: 1994, ‘Fast Algorithms for Mining Association Rules’. In: Proceedings of the 20th VLDB Conference.
Santiago, Chile.
Agrawal, R. and R. Srikant: 1995, ‘Mining Sequential Patterns’. In: Proceedings of the 11th International Conference on Data
Engineering. Taipei, Taiwan.
Bellovin, S. M.: 1989, ‘Security Problems in the TCP/IP Protocol Suite’. Computer Communication Review 19(2), 32–48.
Chan, P. K. and S. J. Stolfo: 1993, ‘Toward Parallel and Distributed Learning by Meta-Learning’. In: AAAI Workshop in Knowledge
Discovery in Databases. pp. 227–240.
Cohen, W. W.: 1995, ‘Fast Effective Rule Induction’. In: Machine Learning: the 12th International Conference. Lake Tahoe, CA.
Fawcett, T. and F. Provost: 1997, ‘Adaptive Fraud Detection’. Data Mining and Knowledge Discovery 1, 291–316.
Forrest, S., S. A. Hofmeyr, A. Somayaji, and T. A. Longstaff: 1996, ‘A Sense of Self for Unix Processes’. In: Proceedings of the
1996 IEEE Symposium on Security and Privacy. Los Alamitos, CA, pp. 120–128.
Grampp, F. T. and R. H. Morris: 1984, ‘Unix System Security’. AT&T Bell Laboratories Technical Journal 63(8), 1649–1672.
Han, J. and Y. Fu: 1995, ‘Discovery of Multiple-Level Association Rules from Large Databases’. In: Proceedings of the 21st VLDB
Conference. Zurich, Switzerland.
Heady, R., G. Luger, A. Maccabe, and M. Servilla: 1990, ‘The Architecture of a Network Level Intrusion Detection System’.
Technical report, Computer Science Department, University of New Mexico.
Ilgun, K., R. A. Kemmerer, and P. A. Porras: 1995, ‘State Transition Analysis: A Rule-Based Intrusion Detection Approach’. IEEE
Transactions on Software Engineering 21(3), 181–199.
Jacobson, V., C. Leres, and S. McCanne: 1989, ‘tcpdump’. Available via anonymous ftp to ftp.ee.lbl.gov.
Klemettinen, M., H. Mannila, P. Ronkainen, H. Toivonen, and A. I. Verkamo: 1994, ‘Finding Interesting Rules from Large
Sets of Discovered Association Rules’. In: Proceedings of the 3rd International Conference on Information and Knowledge
Management (CIKM’94). Gaithersburg, MD, pp. 401–407.
Kumar, S. and E. H. Spafford: 1995, ‘A Software Architecture to Support Misuse Intrusion Detection’. In: Proceedings of the 18th
National Information Security Conference. pp. 194–204.
Lane, T. and C. E. Brodley: 1997, ‘Sequence Matching and Learning in Anomaly Detection for Computer Security’. In: AAAI
Workshop: AI Approaches to Fraud Detection and Risk Management. pp. 43–49.
Lee, W. and S. J. Stolfo: 1998, ‘Data Mining Approaches for Intrusion Detection’. In: Proceedings of the 7th USENIX Security
Symposium. San Antonio, TX.
Lee, W., S. J. Stolfo, and K. W. Mok: 1998, ‘Mining Audit Data to Build Intrusion Detection Models’. In: Proceedings of the 4th
International Conference on Knowledge Discovery and Data Mining. New York, NY.
Lee, W., S. J. Stolfo, and K. W. Mok: 1999, ‘Mining in a Data-flow Environment: Experience in Network Intrusion Detection’. In:
Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD-99).
Lent, B., A. Swami, and J. Widom: 1997, ‘Clustering Association Rules’. In: Proceedings of the 13th International Conference on
Data Engineering. Birmingham, UK.
Lunt, T.: 1993, ‘Detecting Intruders in Computer Systems’. In: Proceedings of the 1993 Conference on Auditing and Computer
Technology.
Lunt, T., A. Tamaru, F. Gilham, R. Jagannathan, P. Neumann, H. Javitz, A. Valdes, and T. Garvey: 1992, ‘A Real-time Intrusion
Detection Expert System (IDES) - final technical report’. Technical report, Computer Science Laboratory, SRI International,
Menlo Park, California.
Mannila, H. and H. Toivonen: 1996, ‘Discovering Generalized Episodes Using Minimal Occurrences’. In: Proceedings of the 2nd
International Conference on Knowledge Discovery in Databases and Data Mining. Portland, Oregon.
Mannila, H., H. Toivonen, and A. I. Verkamo: 1995, ‘Discovering Frequent Episodes in Sequences’. In: Proceedings of the 1st
International Conference on Knowledge Discovery in Databases and Data Mining. Montreal, Canada.
McClure, S., J. Scambray, and J. Broderick: 1998, ‘Test Center Comparison: Network Intrusion-detection Solutions’. In:
InfoWorld, May 4, 1998.
Srikant, R., Q. Vu, and R. Agrawal: 1997, ‘Mining Association Rules with Item Constraints’.
In: Proceedings of the 3rd
International Conference on Knowledge Discovery and Data Mining. Newport Beach, California, pp. 67–73.
Stolfo, S. J., A. L. Prodromidis, S. Tselepis, W. Lee, D. W. Fan, and P. K. Chan: 1997, ‘JAM: Java Agents for Meta-Learning
over Distributed Databases’. In: Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining.
Newport Beach, CA, pp. 74–81.
Utgoff, P. E., N. C. Berkman, and J. A. Clouse: 1997, ‘Decision Tree Induction Based on Efficient Tree Restructuring’. Machine
Learning 29, 5–44.