Detection of Network Faults and Performance Problems
Hassan Hajji, B. H. Far, and Jingde Cheng
Department of Information and Computer Science
Saitama University,
Saitama 338-8570, JAPAN
e-mail: hajji@aise.ics.saitama-u.ac.jp
far@enel.ucalgary.ca
cheng@ics.saitama-u.ac.jp
Abstract — Network normal operation baselining for automatic detection of anomalies is addressed. A model for
network traffic is presented in which studied variables are modeled as a finite mixture model. Based on stochastic
approximation of the maximum likelihood function, we propose a baseline of network normal operation as the
asymptotic distribution of the difference between successive estimates of model parameters. The baseline multivariate random variable is shown to be stationary, with mean zero under normal operation. Performance problems
are characterized by sudden jumps in the mean. Detection is formulated as an online change point problem, where
the task is to process residuals and raise alarms as soon as anomalies occur. An analytical expression of the false alarm rate allows us to choose the threshold automatically. Extensive experimental results on a real network showed that the monitoring agent is able to detect even slight changes in the characteristics of the network, and adapt to traffic
patterns, while maintaining a low alarm rate. Despite large fluctuations in network traffic, this work shows that traffic modeling tailored to a specific goal can be achieved efficiently.
1 Introduction
Networks and distributed processing systems have become an important substrate of modern information technology. The rapid growth of these systems throughout the workplace has opened a gap between the systems to be managed and the expertise of the human operators who manage them. There is a need to automate management functions in order to reduce the cost of network operations management.
Detection of network problems is a crucial step in
automating network management. It has a direct impact on the accuracy of fault, performance and security management functions. From a control viewpoint,
well-designed fault and performance problem detection algorithms enhance network control capability by providing timely indication of incipient network problems. The possibility of early detection of performance degradation can alleviate the constant fire-fighting of network managers. Early warnings from the monitoring agent can trigger preventive actions, so that serious and expensive outages can be avoided. In addition, network monitoring agents can be designed to interface with network protocols in order to tune their operations. For example, routing metrics can be adjusted based on management agents' alarms.
A large amount of work has gone into developing
mechanisms and protocols for collecting traffic statistics. Indeed, currently most of the work in the simple
network management architecture focuses on defining
detailed system and network traffic objects. Comparatively little work has been done to support user analysis of collected statistics. Most of the interpretation is left to the common sense of network operators. Unless control mechanisms are driven by objective measures based on well-tested network traffic models, the benefits and results of network traffic analysis will remain biased by the common sense of human operators. On the other hand, existing work on network fault and performance management assumes that the alarm generating mechanism is accurate, and that network problems are given a priori [12, 11, 26]. Current practice in network management relies on user-defined thresholds for detection. Alarms are generated when some variable of interest crosses a predefined threshold. Generally, the predefined value of the threshold is no more than an estimate of the normal range within which the measured feature is believed to operate. Not only is there little objective insight into how to choose these thresholds, but there is also a risk of missing subtle changes in the network state [10]. In addition, the complexity and size of current network systems make them vulnerable to novel faults and performance degradation patterns.
The main difficulty of network anomaly detection is the lack of a generally accepted definition of what constitutes normal behavior [15]. The dynamics of normal network operation need to be identified from routine operation data. Earlier work reported in [18] characterizes the normal behavior by different templates, obtained by taking the standard deviations of observations (typically Ethernet load and packet counts) at different operating times. An observation is declared abnormal if it exceeds the upper bound of the envelope. Given the bursty nature of network traffic, the standard deviation estimates are likely to be distorted, making subtle changes in the network state go undetected. To mitigate the effect of the non-stationary nature of network traffic, [10] considered the model formed by segmenting time series obtained from Management Information Base (MIB) objects. Observations are declared abnormal if they do not fit an auto-regressive model of the traffic inside segments. In [22], observations are declared abnormal after a statistical test against the mean of a 24-hour period sample. In these approaches, the assumption of piece-wise constancy of the traffic is questionable, since traffic volume is generally not sustained at a given level long enough to allow accurate estimation.
In this paper, we address the problem of faults and
performance problems detection in local area networks.
No knowledge about the problems to be detected is required. The emphasis is on fast detection – an important requirement for reducing potential impact of
problems on network services users. We parameterize
network traffic variables using finite Gaussian mixtures.
Based on this parametric model, we propose a baseline
of network normal operation as the asymptotic distribution of the difference between successive estimates of
model parameters. This difference is shown to be approximately multivariate Normal, with mean zero under normal operations, and sudden jumps in this mean
are characteristics of abnormal conditions. The detection problem is formulated as a change point problem.
A real-time online change detection algorithm is designed to process the residuals sequentially and raise an alarm as soon as an anomaly occurs. We motivate this formulation through a real problem scenario that occurred in the Saitama University network. The proposed approach requires neither the set of faults and performance degradations nor the thresholds to be supplied by the user. Experimental results on a real network showed the effectiveness of our approach. A very low false alarm rate and high detection accuracy have been demonstrated.
This paper is arranged as follows: Section 2 introduces our proposed parametric model of the network traffic, and how the baseline of network normal operation is derived. Section 3 shows the characteristics of the baseline model under abnormal conditions, and introduces the formulation of network problem detection. In Section 4, we present results of our experiments on a real network. We conclude in Section 5.
2 Normal Operations Baselining
The goal of this section is to characterize network
normal behavior. We first present a parametric model
of traffic variables, and then show how this model can
be used to build a baseline of normal operations.
A Traffic Variables Parametric Model
Our approach to network model parameterization is
to view each variable as switching between different
regimes, where each regime is a Gaussian distribution.
This is a form of what is referred to in the literature as a finite mixture model [20]. The observations x_i are generated by one of K Gaussian distributions, as shown in the following equations:
$$x_i = m_k + \epsilon_{ki}, \qquad k = 1, \ldots, K \tag{1}$$

$$p(x_i) = \sum_{k=1}^{K} \frac{\pi_k}{\sqrt{2\pi}\,\sigma_k} \exp\left( \frac{-(x_i - m_k)^2}{2\sigma_k^2} \right) \tag{2}$$
The errors ε_k are assumed to be Gaussian, with mean 0 and variance σ_k^2. The integer K denotes the number of regimes (components). Each regime k has a mixing probability, denoted by π_k. Strictly speaking, the errors ε_k should be truncated Gaussians, since negative and extremely large values do not appear in the network traffic data sets. As it turns out, this issue does not affect the accuracy of the detection significantly.
The parametric finite mixture model has an attractive interpretation. It is useful in the analysis of data
that are believed to come from a finite number of distinct subpopulations, indicated by a latent variable. In
our case, the latent variable has the natural interpretation of time of the day, given the known fact that
network traffic changes as a function of the time.
Recent work on characterizing operation of network
traffic has resulted in analytical models of many important network statistics. For instance, mixes of lognormal distributions have been found to model very
well the call holding time in telephony [3, 4]. In [14], it was found that Telnet originator and responder bytes, data transmitted in a given FTP connection, and FTP session bytes can be modeled by a lognormal distribution. The lognormal distribution also fits well the distribution of message length in Public Access Mobile Radio (PAMR), and a mixture of two lognormal distributions gives the best
fit for transmission length [1]. Given the fact that if a random variable X is lognormal, then log(X) follows a Normal distribution, the random variable whose realizations are the logarithms of the original data can be modeled by the finite mixture model of Equation (2). This model thus provides a good parametric characterization for many important traffic characteristics, with the advantage of accounting for the non-stationary nature of these statistics, due to hourly changes in network traffic. Note that, in general, finite Gaussian mixture models are general enough to approximate any continuous function with a finite number of discontinuities [24], providing a general first approximation to other network traffic variables.
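To make the parametric model concrete, the mixture density of Equation (2) can be evaluated directly. The following sketch uses illustrative three-regime parameters; these are hypothetical values, not estimates from the paper's data:

```python
import math

def mixture_pdf(x, pis, means, sigmas):
    """Evaluate the finite Gaussian mixture density of Equation (2):
    a weighted sum of K Normal densities, with mixing probabilities
    pis, component means, and standard deviations sigmas."""
    return sum(
        pi / (math.sqrt(2 * math.pi) * s) * math.exp(-(x - m) ** 2 / (2 * s ** 2))
        for pi, m, s in zip(pis, means, sigmas)
    )

# Hypothetical 3-regime model of a (log) traffic variable:
# quiet hours, normal hours, busy hours.
pis    = [0.3, 0.5, 0.2]
means  = [2.0, 4.0, 6.0]
sigmas = [0.5, 0.8, 0.6]

density = mixture_pdf(4.0, pis, means, sigmas)
```

Each observation is attributed to a regime through the latent variable; here the latent "time of day" simply selects which component dominates the density at a given point.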
B Normal Operation Baselining
Our approach to operation baselining starts by recognizing that any parametric model of network traffic is, at best, an approximation to reality. Approximation errors and model mis-specification become very pronounced in any inference that uses the estimated parameters as if they were the "true" ones. Given that our ultimate goal is anomaly detection, formulated as a change detection in baseline model parameters, we claim that this task can be achieved without using these "true" parameters. The goal is to avoid solving the more general problem of parameter estimation as an intermediate step to change detection.
Our approach to realizing this idea is shown pictorially in Figure 1. Observations are passed through the learning algorithm to produce the point estimate θ_{n-1}. As new data points are sequentially added, the learning algorithm outputs a refined new estimate θ_n. The idea is to characterize network normal operations using the difference (θ_n − θ_{n-1}).
There are two major advantages to the residuals generated this way. First, the difference (θ_n − θ_{n-1}) does not depend on the "true" value θ_0 of the parameter θ. This is very important since, in practice, we do not know this "true" value, and the only available information is the value θ̂ estimated from the data. Approximating the true parameter θ_0 with θ̂, and studying the difference (θ_n − θ̂), is possible, but our experiments showed that this approach is inefficient, as shown later.
Second, the learning algorithm can be designed to adaptively track local changes in model parameters. It is
unrealistic to assume that the model parameters will
remain exactly the same over all the operating times of
the network.
Intuitively, under normal conditions, the difference
between the successive values θn and θn−1 is expected
to fluctuate around zero. This difference should not
drift constantly in a fixed direction. On the other
hand, if this difference drifts systematically over long
duration, then the new observations are generated by
a different model, induced by a pattern not present in
the training data. The learning algorithm will alter
the parameter θ to its new value. The idea, then, is
to generate the residuals based on the random variable
(θn − θn−1 ). The mean value of this difference is a good
indicator of the health of the network.
To realize this principle, two issues need to be addressed. First, there is the issue of how to design an adaptive learning algorithm. In addition, since detection is required to be online, the learning of Figure 1 should be as fast as possible. Second, we have to work out the distribution of the difference (θ_n − θ_{n-1}). The remainder of this section addresses these two issues.
C Learning Algorithm and Residual Distribution
A well-known algorithm for parameter identification in finite mixture models is the Expectation Maximization (EM) algorithm [7]. The EM is, however, a batch-oriented algorithm. It requires the whole data set to be available in memory before a new refined estimate is produced. If we had to run this algorithm for each new observation, too much time and memory would be consumed. Worse yet, the time and memory consumed keep growing as monitoring goes on. We do not follow this approach; instead, a stochastic approximation of the problem of maximizing the likelihood function is used to turn the EM into an online algorithm.
Let the vectors x_1, ..., x_n be a sequence of observations whose joint probability distribution f_x(θ) depends on the unknown parameter θ. The goal is to derive an online algorithm for estimating the parameter θ. We define the recursive likelihood function L_n(θ) as follows:

$$L_n(\theta) = E_{\theta_n}\left[ \log f(y_n \mid x_1 \ldots x_{n-1}) \right] + L_{n-1}(\theta) \tag{3}$$

where E_{θ_n}(.) denotes the expectation with respect to the parameter θ_n, and y is the latent, unobservable variable. This is basically the same recursive likelihood as in [23], except that the expectation is taken conditionally on the whole set of observations x_1 ... x_{n-1}. It can be shown that the solution θ_n of the problem of maximizing L_n(θ) is given by:

$$\theta_n = \theta_{n-1} + I_c^{-1} S(x_n, \theta_n) \tag{4}$$

$$S(x_n, \theta_n) = D \log f(x_n \mid y_1 \ldots y_{n-1}) \tag{5}$$
where I_c denotes the Fisher information matrix for the complete data, in which the separation variable is known. Similar results are obtained in [25] by minimizing the Kullback-Leibler divergence instead of maximizing the recursive likelihood. Working out the scores S(x_n, θ_n) and the matrix I_c leads to the following recursive formulas for updating the model parameters:
$$m_k^{n} = m_k^{n-1} + \frac{w_{kn}}{\sum_{i=1}^{n} w_{ki}} \left( x_n - m_k^{n-1} \right) \tag{6}$$

$$\sigma_k^{2(n)} = \sigma_k^{2(n-1)} + \frac{w_{kn}}{\sum_{i=1}^{n} w_{ki}} \left( (x_n - m_k^{n-1})^2 - \sigma_k^{2(n-1)} \right) \tag{7}$$

where

$$w_{ki} = \frac{\pi_k^{n-1} f_{ik}}{\sum_{k=1}^{K} \pi_k^{n-1} f_{ik}}, \qquad k = 1 \ldots K \tag{8}$$

$$f_{ik} = \frac{1}{\sqrt{2\pi \sigma_k^{2(n-1)}}} \exp\left( \frac{-(x_i - m_k^{n-1})^2}{2\sigma_k^{2(n-1)}} \right) \tag{9}$$
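As a concrete illustration, the recursive updates can be sketched in a few lines. This is a minimal reading of Equations (6)-(9), not the authors' implementation: the mixing probabilities π_k are held fixed, and the exponential forgetting factor is omitted.

```python
import math

def online_em_step(x, pis, means, vars_, wsums):
    """One recursive update of the mixture parameters, sketching
    Equations (6)-(9). pis/means/vars_ hold the current K-component
    parameters; wsums accumulates sum_i w_ki per component. The
    mixing probabilities are left fixed in this sketch."""
    K = len(pis)
    # Component densities f_ik (Equation 9) and responsibilities w_k (Equation 8).
    f = [pis[k] / math.sqrt(2 * math.pi * vars_[k])
         * math.exp(-(x - means[k]) ** 2 / (2 * vars_[k])) for k in range(K)]
    total = sum(f)
    w = [fk / total for fk in f]
    for k in range(K):
        wsums[k] += w[k]
        step = w[k] / wsums[k]              # w_kn / sum_i w_ki
        delta = x - means[k]                # uses the previous mean m_k^{n-1}
        means[k] += step * delta            # Equation (6)
        vars_[k] += step * (delta * delta - vars_[k])  # Equation (7)
    return means, vars_, wsums
```

Each new observation adjusts only the running sums and the current parameter vector, which is what makes the per-sample cost constant in time and memory.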
Figure 1: Normal operation baselining based on repeated identification of model parameters
Now let us verify our design goals. First, note that the parameter updating is fast enough to be implemented in real time. Sequentially acquired data points are merged with the existing processed data, and do not require re-computation over all collected data. Time and memory requirements are, hence, kept minimal. To allow the learning algorithm to track slight changes in model parameters, we introduce an exponential forgetting factor 0 < ζ ≤ 1 that reduces the effect of old observations: the sum Σ_{i=1}^{n} w_{ki} is replaced by Σ_{i=1}^{n} ζ^{n-i} w_{ki}. In the sequel, we shall be interested only in changes in the K-dimensional mean m = (m_1, ..., m_K).
As stated earlier, approximating θ_0 with θ̂, and studying the difference (θ_n − θ̂), is possible, but our experiments showed that this approach is inefficient. Figure 2-a compares both differences for a duration of one hour under the same network conditions. Results are shown only for the second component of the mixture model of the number-of-broadcast-packets traffic variable. It can be concluded that (m_k^n − m̂_k) is not symmetric around zero, while the difference (m_k^n − m_k^{n-1}) is both symmetric and very close to zero under normal conditions.
For the distribution of the residuals, we showed empirically [9] that the K-variate residuals e_n given by:

$$e_n = (m^n - m^{n-1})^T \Lambda^{-1} (m^n - m^{n-1}) \tag{11}$$

$$\Lambda = \mathrm{diag}\left( \frac{\sqrt{w_{kn}}\, \hat{\sigma}_k^{n-1}}{\sum_{i=1}^{n} w_{ki}} \right) \tag{12}$$
are approximately Normal, with mean zero under normal network conditions. Note that e_n given in Equation (11) is simply the difference (m^n − m^{n-1}), scaled such that its variance-covariance matrix becomes the identity. Figure 2-b shows the residuals e_n for the second component of the mixture model of the number-of-broadcast-packets traffic variable. It can be seen that these residuals are stable, and their mean is very close to 0.
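Since Λ is diagonal, the scaling of Equations (11)-(12) can be read component-wise. The sketch below divides each successive-mean difference by the corresponding diagonal entry of Λ; this is an illustrative reading of the residual generation, not the authors' code.

```python
import math

def scaled_residuals(m_new, m_old, w_n, sigma_old, wsums):
    """Element-wise reading of Equations (11)-(12): each component of
    the difference (m^n - m^{n-1}) is divided by the corresponding
    diagonal entry sqrt(w_kn) * sigma_k^{n-1} / sum_i w_ki of Lambda,
    so that under normal operation the result is roughly N(0, I)."""
    return [
        (mn - mo) / (math.sqrt(wk) * s / ws)
        for mn, mo, wk, s, ws in zip(m_new, m_old, w_n, sigma_old, wsums)
    ]
```

Under normal operation the successive means barely move, so the scaled residuals hover around zero; a systematic drift of the means shows up directly as a nonzero residual mean.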
To summarize results of this section, we showed how
the learning algorithm transforms the raw data, to stationary multi-variate residuals en . The residuals en
have the desirable property of being Normal with mean
zero and variance Identity matrix, under normal network operations. The mean of the random variable en
serves as the baseline for normal operation. The next
section shows the behavior of these residuals under abnormal conditions, and how we formulate and solve the
detection problem.
3 Anomaly Detection
Anomaly detection determines the discrepancy between the baseline normal behavior and the observed behavior. Figure 3 shows the behavior of the residuals generated by the model under a real abnormal condition that affected the Saitama University network, due to badly formatted packets. As shown in Figure 3-a, this abnormal condition causes a sudden jump in the mean of the residuals. Figure 3-b shows the behavior of the residuals just before the sudden jump in the mean. Interestingly, we notice that the sudden jump is preceded by a slight change in the mean of the residuals. If the detection approach is designed to be sensitive to slight changes in the operating characteristics of the network, we could have predicted the problem of Figure 3 before it became serious. The problem could have been avoided, or at least addressed immediately after its occurrence. In general, however, not all problems present signs that allow their prediction. In this case, we require our detection method to raise an alarm as soon as a change in the mean occurs.
Consider the sample E_c^n obtained by observing sequentially the residuals e_i from time point c to n. Under normal operation of the network, the sample of e_n follows a K-variate Normal distribution with mean zero and identity variance-covariance matrix (Section B). At some unknown time point c, a change happens in the model, and the newly generated residuals shift to a new distribution with a different mean, denoted by θ_1. The goal is to find a decision function and a stopping rule that detect this change and raise an alarm as soon as possible, under a controlled false alarm rate. This formulation is known in the sequential analysis literature as the disruption problem. The main difference with
Figure 2: Comparison of the drifts (m_2^n − m_2^{n-1}) and (m_2^n − m̂_2)

Figure 3: Behavior of the residuals under abnormal network conditions
classical hypothesis testing is that the sample size is a function of the observations made so far (i.e., not fixed a priori), and the distribution of the residuals is known when the process being monitored is in control. The goal is to achieve fast detection of the change, by using no more than the sufficient sample size to decide whether an alarm is to be raised or not.
It is well known that for a known probability distribution after the change, the Page-Lorden cumulative sum (CUSUM) test [2] is optimal, in the sense that it minimizes the delay to detection among all tests with a given false alarm rate. However, in the present case of network anomaly detection, we do not have a priori knowledge about the probability distribution P_{θ_1} after the change, or the change point c. The common extension of the Page-Lorden CUSUM test consists of estimating the post-change distribution mean and the change point from the data. This approach is known as the Generalized Likelihood Ratio (GLR) test [2]. That is, the unknown parameter θ_1 of the post-change distribution P_{θ_1}(e_i) and the change point c are estimated from the data, using the maximum likelihood estimator.
The resulting decision function is given by:

$$R_n = \sup_{1 \le c \le n} \sup_{\theta_1} \ln \frac{P(E_c^n \mid \theta_1, c)}{P(E_c^n \mid \theta_0)} \tag{13}$$

$$T_n = \inf\{ n : R_n > \lambda \} \tag{14}$$
In our case, where pre-change and post-change distributions are Normal, the maximization problem of Equation (13) can be worked out explicitly. It has a simple
form, given by:
$$S_0 = (0, \ldots, 0)^T, \qquad S_n = \sum_{i=1}^{n} e_i \tag{15}$$

$$T_n = \inf\left\{ n : \max_{0 \le c < n} \frac{\| S_n - S_c \|}{\sqrt{n - c}} > \lambda \right\} \tag{16}$$
The equation assumes that after the change, the distribution of the residuals is still Normal, but with a different mean. In the abnormal case, it is hard to obtain an unbiased fit of the post-change distribution P_{θ_1}(e_i). Fortunately, such accurate estimation is not crucial. What is needed is that, when an anomaly occurs, the closest Normal distribution, obtained by maximum likelihood estimation, has a mean significantly different from zero.
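The decision function of Equations (15)-(16) is simple to compute over a window of residuals. The following sketch is a direct transcription (with the change point maximized by brute force, not the authors' optimized implementation):

```python
import math

def glr_statistic(residuals):
    """Equations (15)-(16): cumulative sums S_n of the vector
    residuals, then the maximum over candidate change points c of
    ||S_n - S_c|| / sqrt(n - c). An alarm is raised when this
    statistic exceeds the threshold lambda."""
    K = len(residuals[0])
    S = [[0.0] * K]                        # S_0 = (0, ..., 0)^T
    for e in residuals:
        S.append([s + ek for s, ek in zip(S[-1], e)])
    n = len(residuals)
    best = 0.0
    for c in range(n):                     # 0 <= c < n
        diff = [S[n][k] - S[c][k] for k in range(K)]
        best = max(best, math.sqrt(sum(d * d for d in diff)) / math.sqrt(n - c))
    return best

# Zero-mean residuals keep the statistic small; a sustained
# shift in the residual mean inflates it.
quiet = glr_statistic([[0.1], [-0.2], [0.05], [-0.1]])
burst = glr_statistic([[0.1], [-0.2], [2.0], [2.1], [1.9]])
```

In an online setting the prefix sums S_c are kept incrementally, so only the maximization over c is redone as each new residual arrives.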
A Tuning the Threshold λ
So far we have introduced the decision function and
the stopping rule used for online detection of network
faults and performance degradation. The remainder of
our problem set-up concerns the choice of the design
threshold λ.
It can be shown that the expectation of the stopping rule under no change, denoted by E_∞(T), is given by [21]:
$$E_\infty(T) \sim \frac{\Gamma(K/2)\, 2^{K/2} \exp(\lambda^2/2)}{\lambda^K \int_0^{\lambda} x\, \nu^2(x)\, dx} \quad \text{as } \lambda \to \infty \tag{17}$$

$$\nu(x) = \frac{2}{x^2} \exp\left( -2 \sum_{n=1}^{\infty} n^{-1}\, \Phi\!\left( \frac{-x\, n^{1/2}}{2} \right) \right), \qquad x > 0 \tag{18}$$
where Φ denotes the standard Normal distribution function. For calculation, see [21] for an approximation of ν(x). Not surprisingly, Equation (17) turns out to be the mean time between false alarms. It follows that, given a desired false alarm rate, we can recover the design threshold λ by solving Equation (17).
4 Evaluation and Results
The network monitoring algorithm described earlier has been implemented in a real network. This section discusses how the data is collected, and the results that validate the agent's capabilities.
A Data Collection
The implementation of the monitoring software consists of two modules: a statistics collection module and a monitoring module. The statistics collection module interfaces with the network to gather protocol operation statistics. The monitoring module monitors management objects for online anomaly detection.

Statistics collection is implemented as a Remote Monitoring (RMON) agent, running as a user-level process. This solution is particularly appealing in the sense that dependency on the operating system kernel is reduced to the minimal interface needed to access the data link layer. This way, we have full control over all aspects of network statistics, as opposed to depending on whether the operating system kernel keeps track of traffic statistics. Our earlier experience implementing an SNMP agent [9] revealed that some kernels do not have entries for all management objects defined in MIB-II.
The network monitoring module operates on top of RMON. It accesses raw measurements through the RMON management information base. All monitoring computation is done locally. In contrast to Network Management Station (NMS) based polling of network statistics, the bandwidth and computation time consumed during transfer and processing of raw measurements is kept minimal. Distributed management organizational models [8, 17] addressed these issues, but failed to address the critical issue of how to use the collected statistics. In this sense, our approach complements distributed management by providing the details of the monitoring tasks to be carried out by the distributed agents. Currently, our monitoring agent software runs on the Linux operating system.
B Experimental Setup
To illustrate each of the capabilities of our proposed monitoring approach, Table 1 lists the variables studied. The first three variables are modeled using a finite Gaussian mixture model. The last variable is modeled using a finite mixture of lognormal distributions. For each of these variables, Table 2 shows its experimental configuration. The number of components is determined in an ad-hoc manner, based on the observed fluctuation of traffic in one week of training data. We are now studying how to choose the number of components automatically from the data. It is also assumed that the training data is "pure", that is, no anomaly occurred during its collection.
C Detection Accuracy
Figure 4-a shows how the decision function reacts to excessive broadcasts, created by injecting two additional broadcast packets every second. As shown in the figure, the decision function shows a sharp increase, crossing the threshold after a delay of approximately 16 minutes. Figure 4-b shows how the test statistic reacts to a sustained rate of TCP passive opens, created by injecting 10 additional packets every second. It can be seen that the problem is detected with a delay of approximately 17 minutes.
Our RMON implementation allows us to study per-protocol statistics. Here we focus on the Address Resolution Protocol (ARP) as an illustrative example, given both the lack of counters for ARP packet operations and the range of problems that manifest themselves as changes in the statistical characteristics of this protocol's traffic. Figure 4-c shows the results of monitoring ARP operation for 24 hours, and then perturbing network operations by injecting two additional ARP request packets per second. It can be seen clearly that the anomaly is detected. The delay to detection is approximately 16 minutes.
Variable                   Definition
etherStatsBroadcastPkts    The total number of good packets received that were directed to the broadcast address
etherStatsUnknownProts     The total number of Ethernet packets with an unknown protocol type
arpStatsPkts               The total number of good Address Resolution Protocol (ARP) packets on the segment
tcpPassiveOpens            The number of times TCP connections have made a direct transition to the SYN-RCVD state from the LISTEN state

Table 1: Definition of MIB variables used in experimental results
Variable                   Number of Components    Forgetting Factor    Horizon Length
etherStatsBroadcastPkts    4                       0.8                  180
etherStatsUnknownProts     3                       0.8                  180
arpStatsPkts               3                       0.8                  180
tcpPassiveOpens            3                       0.7                  180

Table 2: Configuration of each of the variables studied
Figure 4-d shows how the agent reacts to excessive unknown protocols. This variable is obtained by simply counting packets that do not conform to the Ethernet packet format. Most of this traffic is caused by packets with a protocol type field less than 1500. In this case, it took 50 seconds for this anomaly to be detected.

In summary, we note that in all cases tested the detection is accurate, even with slight changes in network traffic. Recall that we are not modeling particular faults and performance degradation patterns. The agent contrasts the baseline normal behavior with the observed traffic, making it possible to detect novel network problems. It follows that, in principle, our approach can be easily deployed across different networks. It is also important to notice that in all the above cases, analyzing the captured packets after an alarm is raised immediately reveals the problem. This is to be contrasted with methods as in [26], which take alarms as given, and yet have to match fault signatures with pre-stored patterns. Those approaches are both knowledge-extensive and lack the capability to learn new, unseen faults.
D Adaptability to Normal Traffic Fluctuations

Network traffic exhibits clear diurnal patterns. Night hours and less busy days of the week show a decrease in network traffic. Traffic picks up again during working days. The purpose of this section is to show that the monitoring agent learns these patterns, and does not take them for anomalies.

Figure 5-(a) shows an increase in ARP volume that is part of the normal operation of the network. The ARP packet count increases as the transition is made from night hours to day hours. Figure 5-(b) shows that the decision function remains within the normal range for this pattern. Similarly, Figure 5-(c) shows a decrease in ARP traffic volume as the network becomes less busy in the late evening hours. Figure 5-(d) plots the reaction of the decision function around the same time this transition took place. Here also, the decision function remains within its normal range. What actually happens is that the ARP traffic model is a mixture with three components, each modeling a given level of ARP traffic. Depending on the time of the day (the latent variable), observations are assigned to the corresponding component, making it possible to adapt to these traffic fluctuations. One can conclude, then, that if the training data is large enough to contain all the possible regimes of operation, the monitoring agent can adapt to these patterns, and they will not be taken for anomalies.

E Alarm Rate
Ideally, we would like to estimate the false alarm
rate, given that we know for sure that the network is
operating normally. Unfortunately, it is difficult to gain
perfect knowledge about all the subtle changes in the
network behavior. Instead, Table 3 and Table 4 show the average alarm rate per hour, evaluated after the agent is set to run for one week and for one month, respectively.

Figure 4: Behavior of the test statistic corresponding to (a) excessive broadcasts, (b) a sustained rate of TCP passive opens, (c) excessive ARP packets, and (d) excessive unknown packets
The duration of the testing is long enough to conclude that our monitoring technique adapts to different traffic patterns. The results show a very low alarm rate, yet a high detection accuracy, as evidenced by the results of Section C. In addition, it should be noted that most of the alarms generated by the variables tcpPassiveOpens and etherStatsBroadcasts in Table 4 are caused by one particular anomalous activity. Of the 43 total alarms raised by the variable tcpPassiveOpens during the one-month experiment, 35 were generated around one anomaly that manifested itself as an almost fixed rate of passive opens for 14 hours. In the remaining 29 days, only 8 alarms were generated. The same also applies to broadcast packets: of the 12 alarms generated, 11 were caused by a failure of the file server in the neighboring subnet, and only one alarm was generated over the remaining 29 days.
5 Conclusion
In this paper, we developed an online technique for
real-time detection of anomalies in IP-Networks. We
showed that the parametric characterization of studied
variables is amenable to a finite mixture model. Model
parameters are identified from routine operation data,
using the expectation maximization algorithm. A new
method for residual generation, based on successive parameter identification, is introduced. The residuals are
shown to be approximately Normal, with mean zero under normal operations, and sudden jumps in this mean
are characteristics of abnormal conditions. A real-time
online change detection algorithm is designed to processes, sequentially, the residuals and raise an alarm as
soon as the anomaly occurs. The proposed approach requires neither the set of faults and performance degradation nor the thresholds to be supplied by the user.
Experimental results showed the effectiveness of the method on real data. A low false alarm rate and a high detection accuracy have been demonstrated. The key innovation that allowed efficient detection of network problems was to avoid solving the more general problem of accurate parameter estimation of the traffic model as an intermediate step for change detection.
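The detection step, a sequential test on approximately zero-mean residuals with alarms raised on sudden mean jumps, can be sketched with a standard two-sided CUSUM detector. This is a generic illustration, not the paper's exact test statistic; `sigma` would in practice be estimated from normal-operation data, and `drift` and `h` are illustrative tuning parameters:

```python
import random

def cusum_detect(residuals, sigma=1.0, drift=0.5, h=8.0):
    """Two-sided CUSUM test on approximately zero-mean residuals.

    Returns the index of the first alarm, or None if no alarm is
    raised. `drift` and `h` are expressed in units of the residual
    standard deviation `sigma`.
    """
    g_pos = g_neg = 0.0
    for i, r in enumerate(residuals):
        z = r / sigma
        g_pos = max(0.0, g_pos + z - drift)  # accumulates upward mean shifts
        g_neg = max(0.0, g_neg - z - drift)  # accumulates downward mean shifts
        if g_pos > h or g_neg > h:
            return i
    return None

# Zero-mean noise for 200 samples, then a sudden jump in the mean.
random.seed(0)
residuals = ([random.gauss(0.0, 1.0) for _ in range(200)]
             + [random.gauss(3.0, 1.0) for _ in range(50)])
alarm = cusum_detect(residuals)
print("first alarm at sample:", alarm)
```

As in the paper's setting, the detector accumulates evidence over time, so it reacts within a few samples of the mean jump while remaining quiet over long stretches of in-control noise.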
Figure 5: Behavior of the test statistic corresponding to excessive ARP packets, excessive inbound broadcasts, excessive outbound broadcasts, and IP packet loss problems: (a) IP packet discards, (b) outbound broadcast packets, (c) inbound broadcast packets, (d) inbound ARP packets
Variable                  Number of Alarms   Average Alarm Rate per Hour
etherStatsBroadcastPkts   12                 0.077
etherStatsUnknownProts    9                  0.053
arpStatsPkts              2                  0.011
tcpInSyn                  2                  0.011

Table 3: Average alarm rate per hour for a duration of one week (168 hours)
Variable                  Number of Alarms   Average Alarm Rate per Hour
etherStatsBroadcastPkts   13                 0.018
etherStatsUnknownProts    89                 0.123
arpStatsPkts              3                  0.004
tcpInSyn                  43                 0.059

Table 4: Average alarm rate per hour for a duration of one month (720 hours)