FluxInfer: Automatic Diagnosis of Performance
Anomaly for Online Database System
Ping Liu
§‡ ,
Shenglin Zhang † , Yongqian Sun∗ † , Yuan Meng
§‡ ,
Jiahai Yang
k‡¶ ,
Dan Pei
§‡
§ Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
k Institute for Network Sciences and Cyberspace, Tsinghua University, Beijing 100084, China
† College of Software, Nankai University, Tianjin 300071, China
¶ Cyberspace Security Research Center, Peng Cheng Laboratory, Shenzhen 518000, China
‡ Beijing National Research Center for Information Science and Technology (BNRist)
Abstract—The root cause diagnosis of performance anomaly
for online database anomalies is challenging due to diverse types
of database engines, different operational modes, and variable
anomaly patterns. To relieve database operators from manual
anomaly diagnosis and alarm storm, we propose FluxInfer, a
framework to accurately and rapidly localize root cause related
KPIs for database performance anomaly. It first constructs a
Weighted Undirected Dependency Graph (WUDG) to represent
the dependency relationships of anomalous KPIs accurately, and
then applies a weighted PageRank algorithm to localize root cause
related KPIs. The testbed evaluation experiments show that the
AC@3, AC@5, and Avg@5 of FluxInfer are 0.90, 0.95, and 0.77,
outperforming nine baselines by 64%, 60%, and 53% on average,
respectively.
Index Terms—performance anomaly, root cause localization,
KPI
I. I NTRODUCTION
The performance of today’s database system is vitally
important to Internet services, since a performance issue of
database can degrade service performance and in turn impact
user experience. Therefore, the performance monitoring and
anomaly root cause analysis of database systems are crucial to
these services. However, they are challenging due to diverse
types of database engines, different operational modes, and
variable anomaly patterns.
To rapidly diagnose anomalies and trigger mitigation,
database operators monitor hundreds of KPIs (Key Performance Indicator) of each database instance. Usually, when
a database anomaly (e.g., the database response times are
too slowly) arises, some of these KPIs manifest anomalous
patterns (say, sudden changes). Operators thus can proactively
detect database anomalies and conduct root cause analysis
according to these KPIs. We take a real-world example
to elaborate how operators manually analyze anomaly root
causes. Figure 1 shows some anomalous KPIs when a database
becomes unavailable. 78 KPIs becomes anomalous in this
database anomaly. Some anomalous KPIs indicate the root
cause of this database anomaly and thus is closely related to
the root cause (called root cause related KPI hereafter), while
other anomalous KPIs are not directly related to the root cause
∗ Yongqian
Sun is the correspondence author.
978-1-7281-9829-3/20/$31.00 ©2020 IEEE
Fig. 1: Some anomalous KPIs during a real-world database
anomaly.
(called symptom KPI hereafter). In this case, for example, a
sudden increase in workload leads to this database anomaly.
Therefore, QPS (a KPI), which represents the workload of the
database, is a root cause related KPI. On the other hand, the
increase in workload also leads to a sudden increase in CPU
utilization, which is represented by the KPI of CPU USAGE.
Consequently, CPU USAGE is a symptom KPI.
Manual anomaly diagnosis. When an anomaly occurs,
operators usually take two steps to manually localize the root
cause related KPIs from such a large number of anomalous
KPIs: (1) operators manually infer the dependency relationships of these KPIs, which is highly dependent on the domain knowledge and experience; (2) they then localize the
root cause related KPIs based on the dependency relationships. However, it is time-consuming to manually diagnose
a database anomaly even for a highly experienced operator,
causing the services to suffer from performance degradation
for a long period. In addition, hundreds of database anomalies
happen in a large online database cluster [1], and manually
diagnosing such a large number of anomalies is tedious, errorprone and unscalable.
Alarm storm. Typically, an alarm is generated and sent to
operators when a KPI is detected anomalous. Since a single
database anomaly can trigger up to hundreds of these alarms,
a large online database cluster housing tens of thousands to
millions of database systems can produce tens of thousands
of such alarms per day, which is called alarm storm. Alarm
storm has always been a challenge for operators because such
a large number of alarms can degrade diagnosis efficiency.
Although operators can avoid alarm storm by adjusting alarm
filtering rules (e.g., tuning the thresholds of anomaly detection
algorithms) for different database instances, it is still a great
deal of work to find the best rule for each database instance.
To relieve operators from manual anomaly diagnosis and
alarm storm, we try to automate anomaly diagnosis of performance anomaly for online database systems. In this way,
we can automatically localize the root cause related KPIs
soon after a database anomaly arises to quickly mitigate the
loss. Moreover, operators can focus on these KPIs rather
than symptom KPIs, successfully addressing the burden of
alarm storm. Although DBSherlock [2] was proposed as a
tool to assist operators in explaining performance anomalies of
database systems, it did not directly localize the root cause or
root cause related KPIs. Operators have to manually combine
DBSherlock’s output with domain knowledge to localize the
root cause KPIs. Accordingly, we propose FluxInfer to automatically, accurately, and rapidly localize root cause KPIs to
diagnose performance anomaly for online database systems.
However, FluxInfer faces the following two challenges.
Challenge 1: Accurately represent the dependency relationships of KPIs. Motivated by the manual diagnosis
process of operators, we find that inferring the dependency
relationships of KPIs is crucial to automatically diagnose
anomalies. Some state-of-the-art works [3]–[7] tried to automatically construct a directed acyclic graph (DAG) using
PC algorithm [8] to represent the dependency relationships of
KPIs. However, due to the unknown latent variables which
are hardly observed and inferred, the constructed DAG can
generate incorrect dependency relationships, which tends to
cause incorrect localization results (details are discussed in
§II). To address this challenge, we propose an algorithm to
automatically construct a Weighted Undirected Dependency
Graph (WUDG) to accurately represent the dependency relationships of anomalous KPIs. To the best of our knowledge,
this is the first time that WUDG is used for anomaly diagnosis
of computer systems. We believe that WUDG can be used to
accurately infer the dependency relationships of KPIs for more
anomaly diagnosis scenarios beyond online database systems.
This is the first contribution of this paper.
Challenge 2: Automatically localize the root cause
related KPIs based on a WUDG. Some state-of-the-art
Conditional Variable Set 𝑆
Observable KPI Set
KPI_3
KPI_4
KPI_1
…
Latent Variable Set
Latent
Variable 1
…
KPI_2
Fig. 2: Explanation of I(KPI 1, KPI 2|S).
works [3]–[7] proposed to use Deep-First Search, Random
Walk, etc., to traverse DAG for localizing root cause. However,
these localization algorithms can only be used in the directed
dependency graphs, and they are inapplicable to the undirected
WUDG. To address this challenge, we propose to use a
weighted PageRank algorithm to traverse WUDG [9], which
can accurately localize root cause related KPIs. This is the
second contribution of this work.
Detailed evaluation experiments on our testbed show that the
AC@3, AC@5, and Avg@5 of FluxInfer are 0.90, 0.95, and
0.77, outperforming nine baselines by 64%, 60%, and 53%
on average, respectively. Additionally, the average diagnosis
time of FluxInfer is 53 seconds, which is approriate for online
diagnosis. This is the third contribution of this work.
II. R EPRESENTATION CHALLENGE
Inferring the dependency relationships among KPIs is crucial to automatically diagnose anomalies. Most of the stateof-the-art works [3]–[7] tried to automatically construct a
Directed Acyclic Graph (DAG) by PC algorithm [8] to represent the dependency relationships among KPIs. Then the
root causes are localized by traversing the DAG with different
algorithms. However, PC algorithm may infer some incorrect
dependency relationships due to the unknown latent variables,
which will lead to incorrect localization results. In this section,
we first introduce the details of the PC algorithm. Then we
explain the representation challenge of the constructed DAG.
A. A common method: PC algorithm
PC algorithm [8] is the most popular algorithm to construct
a causal graph from observational data. It assumes faithfulness,
which means that there is a directed acyclic graph, G, such that
the causal relationships among random variables are exactly
those represented by G. Its input is the observed values of
N random variables. PC algorithm will output a DAG with N
nodes, where each node represents one random variable. There
are four steps in the PC algorithm:
1) Construct a fully connected undirected graph of the N
random variables (all nodes are connected).
2) Perform a conditional independence test on each adjacent variables (X and Y ) conditional on a variable set
S, denoted by I(X, Y |S). If a conditional independence
exists, the edge between the two variables is removed.
In this step, the size of the conditional variable set S
increases step by step until there are no more variables
that can be added into S.
3) Determine the directions of some edges on the basis of
V-structure [10]. A V-structure is a condition to decide
2
FluxInfer
Services
Step-1:
Anomaly Detection
Performance
Alert
Databases
…
DB
Mitigation
KPIs
Time Series
Database
Anomalous KPIs
•
•
•
•
Mitigation Method
SQL flow control
SQL optimization
Auto-scale
…
Step-2:
Dependency Graph
Construction
WUDG
Response Time
Anomaly
Diagnosis
Result
operator
Step-3:
Root Cause Localization
Fig. 3: The overview of FluxInfer
the direction of an edge. Suppose three nodes X, Y,
Z be a part of a graph. X and Y are connected, and
Y and Z are connected (X-Y-Z). One obtains a causal
relationship X → Y ← Z if X and Z are not conditionally
independent for Y.
4) Determine the direction of the rest of the edges with
orientation rules.
B. Representation Challenge
As described in the details of the PC algorithm, if the
set of independencies (step-2 and step-3 of PC algorithm)
is faithful to a graph and we have a perfect way of determining whether I(X, Y |S), then the PC algorithm has a
guarantee of producing a graph equivalent to the original
one. However, none of these conditions is verified. Firstly,
the conditional independence test is a statistical test that may
have errors. Secondly, the conditional variable set S cannot
be completely observed. In practice, the conditional variable
set S may include some latent variables that are hard to
be observed or inferred, which will produce some incorrect
relationships in the DAG. As shown in Figure 2, the latent
variable set could affect the conditional independence test of
I(KPI 1, KPI 2|S). During evaluation, we found that PCalgorithm inferred some incorrect dependency relationship.
For example, NET SEND→ QPS. QPS indicates the workload
and NET SEND indicates the number of bytes sent by the
database. Obviously, QPS is not dependent on NET SEND.
III. S YSTEM OVERVIEW AND C URRENT L IMITATION
Figure 3 shows an overview of FluxInfer. The services
need databases to support their mission-critical and realtime applications. FluxInfer is triggered by the performance
anomaly alert, for example, the response time of a database
suddenly increases. FluxInfer contains three steps: Anomaly
Detection, Dependency Graph Construction and Root Cause
Localization. Here we will give a cursory introduction of them.
Step-1: Anomaly Detection. All KPIs of an anomalous
database are detected in this step. The KPI values around the
alert time are obtained from the time series database which
stores historical KPI values of databases. Due to the real
world’s KPI values are noisy and often fluctuate regardless
of system failure, a jumble of noises, normal fluctuations and
anomalous changes will affect the performance of anomaly
detection. Therefore, we designed a cluster-based anomaly
detection algorithm that can detect anomalies robustly.
Step-2: Dependency Graph Construction. The input of this
step is the anomalous KPIs detected by Step-1. A weighted
undirected dependency graph is constructed in this step, which
can accurately represent the dependency relationships among
the anomalous KPIs.
Step-3: Root Cause Localization. The purpose of this step
is to localize the root cause related KPIs based on the dependency graph constructed in step-2. We designed a localization
algorithm based on the weighted PageRank [9] algorithm to
analyze the dependency graph, and finally recommends the
ranking list of possible root cause related KPIs.
After localizing the root cause related KPIs, operators can
quickly take actions to mitigate the database anomalies based
on the diagnosis results of FluxInfer, like SQL flow control,
SQL optimization, auto-scale, etc.. Next, we will introduce the
limitation of FluxInfer.
FluxInfer diagnoses the anomalies whose root cause can
affect at least one of the KPIs. It is straightforward to see the
reason: if the root cause does not manifest itself in any of the
KPIs, FluxInfer has no means of distinguishing between root
cause related KPIs and symptom KPIs. For example, mostly
SELECT statements are executed slowly if the proper indexes
are not created on tables. In this scenario, although some KPIs
will also show anomalous patterns, the root cause (poor table
design) can not directly manifest itself in any of the KPIs.
IV. S YSTEM D ESIGN
A. Robust Anomaly Detection
The real world’s KPI values are noisy and often fluctuate
regardless of system anomaly. So the KPI values needs to be
smoothed before anomaly detection. However, the commonly
used data smoothing algorithms (e.g., Kalman Filter, Moving
Average, etc.) cannot be applied in our scenario. When a
system is anomalous, the anomalous changes could be wrongly
smoothed without distinction among a jumble of noises, normal fluctuations, and anomalous changes by these algorithms.
Therefore, we designed a cluster-based robust anomaly detection algorithm. The core idea is that the clustering algorithm
can divide KPI values into different segments, then the noisy
segments, anomalous segments, and normal segments can be
3
Algorithm 1 Smoothing
𝑠"
Input:
The values of a KPI, {xi };
Output:
Smoothed data, {x̀i }
Segments, {sj };
1: {x̀i } ⇐ {xi }
2: while True do
3:
Cluster {x̀i } into two clusters by Gaussian Mixture
Model;
4:
{sj } ← Split {x̀i } into k segments by the clustering
results;
5:
if k <= 2 then
6:
return {x̀i }, {sj };
7:
end if
8:
for j = 1; j < k − 1; j + + do
9:
if |sj | < |sj+1 | then
10:
Replace sj with samples which are randomly sampled from sj+1 ;
11:
end if
12:
end for
13:
if {x̀i } =
6 {xi } then
14:
{xi } ← {x̀i }
15:
else
16:
return {x̀i }, {sj };
17:
end if
18: end while
𝑠&%'
𝑠…
…
Cluster 1
Replace 𝑠" with samples
randomly sampled from 𝑠"%$
…
𝑠"#$
𝑠"%$
Cluster 2
𝑠…
Fig. 4: An example of smoothing segment. The values of a KPI
are clustered into two clusters. Then the clustering results are
used to split the KPI values into different segments.
Therefore, we focus on the last change: between sk−2 and
sk−1 . In our scenario, all KPIs (include normal KPIs and
anomalous KPIs) are smoothed by the smoothing algorithm
(§IV-A1). Therefore, the last change may be normal. So
we need to find the anomalous changes by measuring the
abnormality of the last change.
We use z-score to measure the abnormality of a change.
The z-score of each data xi in segment sk−1 is calculated
by the mean and standard deviation (std) of data in segment
. Then the mean of z-scores of data in segment
sk−2 : xi −mean
std
sk−1 can be used to represent the z-score of segment sk−1 ,
denoted by z̄. Finally, the 3-sigma rule is used to detect
anomalous changes. if |z̄| > 3 ∗ std, the change is anomalous.
After anomaly detection, all anomalous KPIs will be used to
construct the dependency graph (WUDG).
B. Dependency Graph Construction
To localize the root cause related KPIs, the dependency
relationships among KPIs need to be constructed. Due to
that the PC algorithm may infer some incorrect dependency
relationships (§II), we proposed a Weighted Undirected Dependency graph (WUDG) which can accurately represent
the dependency relationships among KPIs. The core idea
of WUDG is: if two KPIs have a dependency relationship,
then the two KPIs are not independent. Therefore, the design
of WUDG based on whether the dependency relationships
exist among KPIs (undirected dependency graph), instead of
inferring the dependency directions among KPIs (directed
dependency graph). It is more accurate to determine the
dependency’s existence than the direction of the dependency.
Next, we will introduce the details of WUDG construction.
Firstly, we construct a complete graph for all anomalous
KPIs. Secondly, the independence of all pairs on the complete graph is tested. Two algorithms are commonly used for
the independence test [12]: G-square Test [13] and FisherZ Test [13]. G-square Test works with discrete values, and
Fisher-Z works with continuous values. Due to all KPIs in
our scenario are continuous values, we use Fisher-Z Test to
test the independence among anomalous KPIs.
The Fisher-Z test evaluates independence on the basis of
Pearson’s correlation coefficient. This test is a combination of
two statistical techniques: Fisher-Z transformation to estimate
a population correlation coefficient, and a partial correlation to
evaluate the effect of other nodes. We use X and Y to denote
two KPIs, the statistic Zs between X and Y is defined as:
√
m−3
1+r
log
,
Zs =
2
1−r
precisely divided. Next, we first introduce how to smooth the
KPIs based on the clustering algorithm. Then we show how
to detect anomalous KPIs based on the smoothing results.
1) Data Smoothing: Algorithm 1 shows the details of
smoothing. It is a looping algorithm. During each loop, some
noises are smoothed. The algorithm will loop multiple times
until no data need to be smoothed. The input is the values of
a KPI, denoted by {xi }, and the smoothed values are denoted
by {x̀i }. In each loop, firstly the KPI values are clustered
into two clusters by Gaussian Mixture Model [11]. The reason
for setting two clusters is: The KPI values can be considered
as normal and abnormal. The abnormal cluster may include
noisy values and anomalous values. Then the clustering results
are used to split the KPI values into k segments (from 0
to k − 1). Figure 4 shows an example of the segments. We
use {sj } to denote the segment set, where sj denotes the
segment j. We also use |sj | to denote the length of sj , which
represents the number of data points in sj . Due to the length
of a noisy segment is smaller than its adjacent segments, sj
needs to be smoothed if the length of sj is smaller than sj+1 :
|sj | < |sj+1 |. A segment sj is smoothed by replacing its
values with samples that are randomly sampled from sj+1 ,
as shown in Figure 4. The first segment s0 is ignored as an
invalid segment because it may only be part of a segment. Due
to the last segment sk−1 is in the anomaly period, sk−1 does
not need to be smoothed.
2) Anomaly Detection : After smoothing, some segments
are obtained. Due to the last segment is in the database
anomaly period, the last change may be related to the anomaly.
4
1
KPI_1
1
𝑝_𝑣𝑎𝑙𝑢𝑒)*
𝑝_𝑣𝑎𝑙𝑢𝑒,*
1
OLTPBench
queries
KPI_3
𝑝_𝑣𝑎𝑙𝑢𝑒-*
1
1
𝑝_𝑣𝑎𝑙𝑢𝑒-,
𝑝_𝑣𝑎𝑙𝑢𝑒,)
KPI_2
KPI_4
1
MySQL
MySQL
MySQL
Docker
Docker
Docker
Fig. 6: The overview of Testbed.
𝑝_𝑣𝑎𝑙𝑢𝑒-)
MySQL database instance was deployed on each Docker container. We used the OLTPBenchmark framework [15] to simulate clients. The OLTPBenchmark framework was deployed
on another server. We used stress-ng [16] to inject anomalies
of CPU, IO, and memory. Each experiment consisted of 30
minutes of the normal state and different durations of abnormal
state. We ran our experiments using TPC-C [17]. The default
setting used in our TPC-C workload was a scale factor of 500
with 150 terminals. We also experimented with different scale
factors (from 50 to 500) and the number of terminals (from
10 to 150). The results were consistent across these different
settings. Table I shows the KPIs used for evaluation. These
KPIs are standard KPIs from Linux OS, Docker, and MySQL.
All the KPI data are collected at one-second intervals.
B. Evaluation cases
valuenm
Fig. 5: An example of a WUDG. p
denotes the
p value of Fisher-Z Test between KPI n and m.
where m is the size of KPIs, and r is the partial correlation
of X and Y .
The strength of the dependency between two KPIs can be
measured by the p value of the null hypothesis of the FisherZ Test. The smaller the p value, the stronger the dependency
1
relationship, and vice versa. Therefore, we set p value
to the
weight of the corresponding edge. Finally, we construct a
weighted undirected dependency graph (WUDG) to represent
the dependency relationships among the anomalous KPIs.
Figure 5 shows an example of a WUDG among four KPIs.
C. Root Cause Localization
The anomalous KPIs in the WUDG contain root cause
related KPIs and symptom KPIs. Due to that the root cause of a
database system can quickly spread and lead to more and more
anomalous KPIs, the root cause has the largest influence on the
anomaly spread network (the dependency graph). Therefore,
we suppose that the root cause related KPI is the KPI who has
the largest influence on the WUDG. The weighted PageRank
algorithm have been used to measure the influences of nodes
in a weighted undirected graph [9], [14]. For a KPI node u, we
designed a weighted PageRank algorithm to compute a score
P R(u) for u:
X
2
P R(u) = (1 − d) + d ∗
P R(v) ∗ W(u,v)
,
To evaluate FluxInfer, we injected five different types of
anomalies to represent some of the important types of realworld problems that may deteriorate the performance of online
databases. After the 30 minutes run of the normal workload
in each case, we invoked the actual cause of an anomaly
with different durations. Due to that our testbed needs to
be rebooted after each anomaly injection for restoring the
system to normal, we only collected 30 different cases for each
type of anomaly by varying the duration. The duration of the
anomalies ranged from 1 to 5 minutes with an increment of 10
seconds, yielding 30 cases for each type of anomaly (a total
of 150 cases). Table II lists the types and descriptions of the
different types of anomalies in our evaluation. These anomalies
are designed to reflect a wide range of realistic scenarios that
can negatively impact the performance of online databases.
C. Evaluation Metric
v∈B(u)
where B(u) denotes a set of KPI nodes that directly connect to
u, W(u,v) denotes the weight of edge (u, v), and d is set to 0.85
which is a commonly used value [9]. Finally, all anomalous
KPIs are ranked by their scores, and the possible root cause
related KPIs are the KPIs ranked at the top.
V. E VALUATION
In this section, we will introduce the details of the evaluation. The experiment setup, evaluation cases, and evaluation
metrics are presented in §V-A, §V-B, and §V-C, respectively.
FluxInfer is evaluated against nine state-of-the-art baseline
approaches in §V-D. The design of robust anomaly detection is
evaluated in §V-E. The diagnosis time of FluxInfer is discussed
in §V-F. Finally, an interesting case is shown in §V-G.
A. Experiment Setup
We constructed a testbed to generate accurately labeled
anomalies of database performance for evaluation. Figure 6
shows an overview of our testbed. We installed three Docker
containers on a server that has 130G memory and 256 cores.
To evaluate the performance of each algorithm on a set of
anomaly cases A, we use two metrics: AC@k and Avg@k.
These two metrics are the most commonly used metrics
in the evaluation of the ranking results of the root cause
localization algorithms in recent works [4], [5], [18], [19].
AC@k represents the accuracy that top k results include the
root causes related KPIs for all anomaly cases. A higher
AC@k score, especially for small values of k, indicates the
algorithm correctly localizes the actual root cause. Given the
anomaly cases set A, AC@k is calculated as follows:
P
a
1 X i<k Ra [i] ∈ Vrc
,
AC@k =
a |)
|A|
min(k, |Vrc
a∈A
a
a
where R [i] is the rank of KPIs for anomaly case a. Vrc
is the root cause related KPI set for anomaly case a. |A|
represents the number of elements in the set A. The overall
5
TABLE I: All KPIs used in our research. These KPIs are standard KPIs of MySQL database, Docker and Linux OS.
Type
Com (8)
Workload (12)
Rows (4)
Physical IO (10)
Storage Engine IO (18)
Logic IO (8)
CPU (2)
Instance Resource (16)
Memory (5)
Network Card (2)
Disk IO (2)
Storage (2)
Session (3)
CPU (8)
Memory (10)
Network Card (4)
Host Resource (33)
Disk IO (11)
Business (3)
Response Time (3)
KPIs
mysql.qps; mysql.tps; mysql.insert ps; mysql.update ps; mysql.delete ps;
mysql.commit ps; mysql.insert select ps; mysql.replace select ps; mysql.replace ps
mysql.innodb rows inserted; mysql.innodb rows read;
mysql.innodb rows deleted; mysql.innodb rows updated
mysql.bytes received; mysql.bytes sent; mysql.innodb data read;
mysql.innodb data reads;mysql.innodb data writes; mysql.innodb data written;
mysql.io bytes read; mysql.io bytes write; mysql.io bytes;mysql.innodb data written
mysql.innodb buffer pool reads requests; mysql.innodb bp usage pct;
mysql.Innodb buffer pool pages flushed; mysql.innodb buffer pool reads;
mysql.innodb buffer pool write requests; mysql.innodb bp data mbytes;
mysql.innodb log writes; mysql.innodb data fsyncs
mysql.cpu usage; docker.cpu usage;
mysql.mem used; docker.mem uasge percent; docker.mem used;
docker.mem cache; docker.mem buffer
docker.net send; docker.net recv
docker.io read; docker.io write
mysql.storage data; mysql.storage log
mysql.active session; mysql.threads connected; mysql.total session
cpu.hi usage; cpu.usage; cpu.usage percent; cpu.user usage;
cpu.sys usage; cpu.si usage; cpu.iowait usage; minion.cpu usage
mem.usage; mem.used; mem.swap used; mem.cached; mem.buffers; mem.pgpgout;
mem.pgpgin; mem.cache; mem.buffer; minion.mem used
net.recv; net.send; net.recv usage; net.send usage
disk.reads; disk.writes; disk.rkB ps; disk.wkB ps; disk.read rt; disk.write rt;
disk.queue; disk.io rt; disk.iops; disk.svctm; disk.util
rt; rt avg; max rt
TABLE II: Details of injected 150 database anomalies for evaluation.
Type of anomaly
CPU Saturation
Number
30
Network Congestion
30
IO Saturation
30
Memory Saturation
30
Anomalous workload
30
Description
Invoke stress-ng, which starts N workers that perform various matrix operations on floating point values.
Simulate network congestion by adding an artificial 500-milliseconds delay to every traffic over the
network via Linux’s tc (Traffic Control) command.
Invoke stress-ng, which starts N workers that perform a mix of sequential, random and memory mapped
read/write operations as well as forced sync’ing and cache dropping.
Invoke stress-ng, which starts N workers that grow their heaps by reallocating memory.
Greatly increase the rate of transactions and the number of clients simulated by OLTPBenchmark
(150 additional terminals with transaction rate of 50,000).
performance of each algorithm is evaluated by computing the
average Avg@k:
1 X
Avg@k =
AC@j.
k
of anomalous changes. This is the reason why PAL [20]
cannot achieve good performance. Due to the unknown latent
variables that are hardly observed and inferred, there are
some incorrect dependency relationships in the directed acyclic
graphs constructed by PC-algorithm [8]. Then because of
traversing the inaccurate dependency graphs, CauseInfer [3],
CloudRanger [4], Microscope [6], MicroRCA [21], MonitorRank [18], and TON18 [19] cannot achieve good performance.
It is also observed that the performance of MicroCause is
better than other PC-algorithm based baselines. The reason
is that MicroCause used pre-defined rules for root cause
localization. These rules are more suitable for diagnosing the
type of anomalous workload database anomaly (Table II).
Therefore, the performances of these PC-algorithm based
baselines demonstrate the value of our weighted undirected
dependency graph design.
It is essential to highlight that these root cause inference
algorithms (Deep First Search, Second-order Random Walk,
Traversing+Pearson Correlation, Personalized PageRank, Random Walk, and TCORW) are all designed for the directed
acyclic dependency graph, which can not be applied in the
undirected dependency graph (e.g., WUDG) of FluxInfer.
1≤j≤k
D. Baseline Comparison
FluxInfer was evaluated against nine baseline approaches, which also localize the root cause by analyzing KPIs: PAL [20], CauseInfer [3], CloudRanger [4], MSRank [7], Microscope [6], MicroRCA [21], MonitorRank [18],
TON18 [19], and MicroCause [5]. Table III summarizes the
approaches of dependency relationship learning and root cause
inference of these works. In our scenario, we can not obtain the
dependency graph based on system tools in MicroRCA [21],
MonitorRank [18] and TON18 [19], therefore we use the PCalgorithm [8] to generate the graph instead. The details of them
are introduced in the related work section (§VI).
Table III shows the evaluation results. The AC@3, AC@5,
and Avg@5 of FluxInfer are 0.9, 0.95, and 0.77, outperforming the other nine baselines. Due to that the database
anomalous KPIs almost change together, it is difficult to
confirm which data point is the starting point of the anomalous change. Further, the noises and fluctuations could affect
the change start time detection. Therefore, the root cause
related KPIs cannot be directly localized by the start times
E. Robust Anomaly detection
To evaluate the design of the robust anomaly detection
algorithm, we designed two comparison algorithms: FluxInferwith-CUSUM and FluxInfer-without-AD. In FluxInfer-with-
6
TABLE III: The evaluation results of different algorithms
Algorithm
PAL [20]
CauseInfer [3]
CloudRanger [4], MS-Rank [7]
Microscope [6]
MicroRCA [21]
MonitorRank [18], TON18 [19]
MicroCause [5]
FluxInfer
FluxInfer-with-CUSUM
FluxInfer-without-AD
Relationships Learning
N/A
PC Algorithm
PC Algorithm
PC Algorithm
PC Algorithm
PC Algorithm
PCTS
WUDG
WUDG
WUDG
Root Cause Inference
Anomaly Time Order
Deep First Search
Second-order Random Walk
Traversing+Pearson Correlation
Personalized PageRank
Random Walk
TCORW
Weighted PageRank
Weighted PageRank
Weighted PageRank
CUSUM, we replaced the robust anomaly detection algorithm of FluxInfer with the CUSUM [22] algorithm, a commonly used robust anomaly detection algorithm. In FluxInferwithout-AD, we removed the robust anomaly detection algorithm of FluxInfer. All KPIs (normal KPIs and anomalous
KPIs) are used to construct the dependency graph and localize the root cause. Table III shows the evaluation results.
Due to that the essence of CUSUM is the cumulative sum,
the irregular noises and fluctuations in KPIs will affect the
cumulative sum. Then the anomalous changes cannot be detected effectively. Therefore, FluxInfer-with-CUSUM cannot
achieve good performance. FluxInfer-without-AD uses normal
KPIs and anomalous KPIs to construct the dependency graph
and localize the root cause. However, the dependency graph
contains many dependencies among normal KPIs. FluxInferwithout-AD could recommend some root cause unrelated
normal KPIs, which leads to poor performance. Therefore, the
results of FluxInfer-with-CUSUM and FluxInfer-withoutAD demonstrate the effectiveness of our robust anomaly
detection design. Further, we can see that the performance
of FluxInfer-with-CUSUM is better than that of FluxInferwithout-AD, which demonstrates that anomaly detection is
necessary for the design of FluxInfer.
F. Diagnosis Time
Diagnosing online anomalies must be as quickly as possible
because the diagnosis time is always related to the duration
of the recovery time. Therefore, the running time of FluxInfer
should be as shorter as possible so that operators can quickly
take mitigation measures. The running time of FluxInfer
consists of three parts: anomaly detection, dependency graph
construction, and root cause localization. We used ten processes of our server to run anomaly detection for KPIs of each
case concurrently. After evaluation, the average running time
of FluxInfer is 53 seconds. We interviewed five experienced
database operators from two cloud service providers. They all
confirmed that the average online diagnosis time must be less
than one minutes. Therefore, the diagnosis time of FluxInfer
can satisfy the online diagnosis.
G. Case Study
CPU saturation is a type of common online database
anomaly. Figure 7 shows the diagnosis result of FluxInfer
and MicroCause [5] for a CPU saturation case. There are 71
anomalous KPIs in this case, and the root cause related KPIs
are cpu.usage and cpu.usage percent. The diagnosis result of
FluxInfer shows that the root cause related KPIs are ranked
top 1 and top 2, respectively. However, the root cause related
KPI is ranked top 7 in the diagnosis result of MicroCause,
AC@1
0.09
0.12
0.08
0.06
0.08
0.08
0.23
0.43
0.23
0.13
AC@2
0.12
0.20
0.19
0.11
0.17
0.16
0.38
0.69
0.40
0.20
AC@3
0.14
0.22
0.27
0.16
0.30
0.28
0.47
0.90
0.62
0.32
FluxInfer
1
2
3
4
5
6
7
8
…
cpu.usage
cpu.usage_percent
mysql.mem_used
docker.net_recv
docker.net_send
docker.io_write
mysql.io_bytes_write
mem.usage_percent
AC@5
0.20
0.28
0.36
0.24
0.38
0.39
0.60
0.95
0.73
0.50
Avg@5
0.14
0.21
0.24
0.15
0.25
0.24
0.44
0.77
0.53
0.38
MicroCause
1
2
3
4
5
6
7
8
…
mysql.qps
mysql.innodb_rows_read
mysql.active_session
docker.io_write
mem.usage_percent
mysql.io_bytes_write
cpu.usage
mysql.bytes_sent
Fig. 7: The diagnosis results of FluxInfer and MicroCause [5]
for a CPU saturation case. The bold red font represents the
root cause related KPI.
and the other eight baselines even cannot rank the root cause
related KPIs in top 10. In this scenario, the CPU saturation
anomaly caused the workload KPIs to drop suddenly. Due to
that MicroCause gives the workload KPIs higher priority to be
considered as the root cause, the workload KPIs (mysql.qps
and mysql.innodb rows read) are ranked top 1 and top 2.
The diagnosis result of MicroCause demonstrates that the
root cause inference rules designed for microservices are not
applicable in the database system. Further, the DAG’s incorrect
dependency relationships also affect the ranking of the root
cause related KPIs.
VI. R ELATED W ORK
It is vitally important to quickly localize the root causes of
system failures for assuring the quality of users’ experience.
Many solutions have been proposed to localize root causes
KPIs in databases, clouds, and microservices. Next, we will
introduce the details of these works.
Dependency graph based methods: CloudRanger [4],
Microscope [6], MS-Rank [7], and MicroCause [5] localized
the root cause of anomalies in clouds or microservices by automatically constructing the dependency graphs among KPIs.
They used PC-algorithm [8] to construct the directed acyclic
dependency graphs. Then, they applied different algorithms
to traverse the graph for root cause localization. CauseInfer
used DFS (Depth First Search) algorithm, CloudRanger and
MS-Rank used a second-order random walk algorithm. Microscope traversed the graph by some defined rules to obtain
root cause candidates. These candidates were then ranked
by the Pearson correlation coefficients between the front end
and the abnormal service instances. MicroCause designed a
Path Condition Time Series (PCTS) algorithm (an improved
algorithm based on PC algorithm) to learn the dependency
graph of KPIs, and a Temporal Cause Oriented Random Walk
(TCORW) approach to localize the root cause related KPIs.
7
MonitorRank [18], TON18 [19], and MicroRCA [21] also
focused on the anomaly diagnosis of clouds or microservices.
They used the pre-defined calling typologies to construct the
dependency graphs among KPIs. MonitorRank constructed the
call graph of the system’s APIs (Application Programming
Interface), which was easily obtained by batch processing
systems (e.g., Hadoop), to learn APIs’ dependencies. TON18
used the APIs of OpenStack to obtain physical dependencies,
and a popular trace analysis tool, PreciseTracer, to capture
call dependencies. MicroRCA constructed an attributed graph
to represent the dependencies in microservices environments.
After obtaining the dependencies, MonitorRank and TON18
applied a random walk algorithm for root cause localization.
MicroRCA used the Personalized PageRank algorithm for root
cause localization.
Other methods: DBSherlock [2] explains the anomaly in
the form of predicates and possible causes produced by causal
models. These explanations only assist operators in diagnosing
anomalies. The models need operators to explain by their
domain knowledge and experience manually, and the root
cause localization process needs the feedback of operators.
PerfXplain [23] used decision trees to explain the performance
of map-reduce jobs automatically. PerfAugur [24] used robust statistics for automatic performance explanation of cloud
services. However, these approaches are more likely to find
secondary symptoms when the root cause of the anomaly is
outside the database and not directly captured by the collected
KPIs. Scorpion [25] used sensitivity-analysis-based techniques
to find the individual tuples most responsible for extreme
aggregate values in scientific computations, which are not
applicable for database anomaly diagnosis. The reason is that
databases often avoid prohibitive data collecting overheads by
maintaining aggregate statistics rather than detailed statistics
for individual transactions. Works [26], [27] proposed tools
to an automatic diagnosis of commercial database anomalies.
However, work [26] needs DBAs to provide a set of manual
rules, and work [27] needs detailed internal performance
measurements from the database system. PAL [20] localized
the root cause by sorting the change start times of KPIs’
abnormal changes, which supposes that the root cause related
KPIs change firstly.
VII. C ONCLUSIONS
Province (2019B010136001), the Science and Technology
Planning Project of Guangdong Province LZC0023, and the
Beijing National Research Center for Information Science and
Technology (BNRist) key projects.
R EFERENCES
[1] M. Ma, Z. Yin, S. Zhang et al., “Diagnosing root causes of intermittent
slow queries in cloud databases,” Proceedings of the VLDB Endowment,
vol. 13, no. 8, pp. 1176–1189, 2020.
[2] D. Y. Yoon, N. Niu et al., “Dbsherlock: A performance diagnostic tool
for transactional databases,” in Proceedings of the 2016 International
Conference on Management of Data. ACM, 2016, pp. 1599–1614.
[3] P. Chen, Y. Qi, P. Zheng, and D. Hou, “Causeinfer: Automatic and
distributed performance diagnosis with hierarchical causality graph in
large distributed systems,” in IEEE INFOCOM 2014-IEEE Conference
on Computer Communications. IEEE, 2014, pp. 1887–1895.
[4] P. Wang, J. Xu, M. Ma, W. Lin, D. Pan, Y. Wang, and P. Chen,
“Cloudranger: root cause identification for cloud native systems,” in
Proceedings of the 18th IEEE/ACM International Symposium on Cluster,
Cloud and Grid Computing. IEEE Press, 2018, pp. 492–502.
[5] Y. S. Yuan Meng, Shenglin Zhang et al., “Localizing failure root causes
in a microservice through causality inference,” in IWQoS 2020.
[6] J. Lin, P. Chen, and Z. Zheng, “Microscope: Pinpoint performance issues
with causal graphs in micro-service environments,” in ICSOC 2018.
[7] M. Ma, W. Lin et al., “Ms-rank: Multi-metric and self-adaptive root
cause diagnosis for microservice applications,” in ICWS 2019.
[8] M. Kalisch and P. Bühlmann, “Estimating high-dimensional directed
acyclic graphs with the pc-algorithm,” Journal of Machine Learning
Research, vol. 8, no. Mar, pp. 613–636, 2007.
[9] W. Xing and A. Ghorbani, “Weighted pagerank algorithm,” in Proceedings. Second Annual Conference on Communication Networks and
Services Research, 2004. IEEE, 2004, pp. 305–314.
[10] L. G. Neuberg, “Causality: Models, reasoning, and inference , by judea
pearl, cambridge university press, 2000,” Econometric Theory, vol. 19,
no. 04, pp. 675–685, 2003.
[11] D. Reynolds, “Gaussian mixture models,” Encyclopedia of biometrics,
pp. 827–832, 2015.
[12] S. Kobayashi, K. Otomo, K. Fukuda, and H. Esaki, “Mining causality of
network events in log data,” IEEE Transactions on Network and Service
Management, vol. 15, no. 1, pp. 53–67, 2017.
[13] R. E. Neapolitan et al., Learning bayesian networks. Pearson Prentice
Hall Upper Saddle River, NJ, 2004, vol. 38.
[14] E. Yan and Y. Ding, “Discovering author impact: A pagerank perspective,” Information processing & management, 2011.
[15] D. E. Difallah, A. Pavlo et al., “Oltp-bench: an extensible testbed for
benchmarking relational databases,” in VLDB 2013.
[16] “Stress-ng.” https://kernel.ubuntu.com/ cking/stress-ng/.
[17] “Tpc-c.” http://www.tpc.org/tpcc/.
[18] M. Kim, R. Sumbaly, and S. Shah, “Root cause detection in a serviceoriented architecture,” ACM SIGMETRICS Performance Evaluation Review, vol. 41, no. 1, pp. 93–104, 2013.
[19] J. Weng, J. H. Wang, J. Yang, and Y. Yang, “Root cause analysis of
anomalies of multitier services in public clouds,” IEEE ACM Transactions on Networking, vol. 26, no. 4, pp. 1646–1659, 2018.
[20] H. Nguyen, Y. Tan, and X. Gu, “Pal: P ropagation-aware a nomaly
l ocalization for cloud hosted distributed applications,” in Managing
Large-scale Systems via the Analysis of System Logs and the Application
of Machine Learning Techniques. ACM, 2011, p. 1.
[21] L. Wu, J. Tordsson, E. Elmroth, and O. Kao, “Microrca: Root cause
localization of performance issues in microservices,” in NOMA 2020.
[22] F. Gustafsson and F. Gustafsson, Adaptive filtering and change detection.
Citeseer, 2000, vol. 1.
[23] N. Khoussainova, M. Balazinska, and D. Suciu, “Perfxplain: debugging
mapreduce job performance,” Proceedings of the VLDB Endowment,
vol. 5, no. 7, pp. 598–609, 2012.
[24] S. Roy, A. C. König et al., “Perfaugur: Robust diagnostics for performance anomalies in cloud services,” in 2015 IEEE 31st International
Conference on Data Engineering. IEEE, 2015, pp. 1167–1178.
[25] H. L. Stahnke, “Scorpion nomenclature and mensuration,” Entomological news, vol. 81, no. 12, pp. 297–316, 1970.
[26] D. G. Benoit, “Automatic diagnosis of performance problems in database
management systems,” in Second International Conference on Autonomic Computing (ICAC’05). IEEE, 2005, pp. 326–327.
[27] K. Dias, M. Ramacher et al., “Automatic performance diagnosis and
tuning in oracle.” in CIDR, 2005, pp. 84–94.
To relieve database operators from the tedious and timeconsuming manual diagnosis work, we propose FluxInfer,
which can automatically and accurately localize the root cause
related KPIs of online database performance anomalies. Our
evaluation on testbed shows that the AC@3, AC@5, and
Avg@5 of FluxInfer are 0.90, 0.95, and 0.77, outperforming
nine baselines by 64%, 60%, and 53% on average. The
diagnosis time of FluxInfer can satisfy the online diagnosis.
ACKNOWLEDGMENT
The authors thank Ruru Zhang for the helpful work on
baseline comparison. This work has been supported by the
National Key R&D Program of China (2019YFB1802504), the
National Natural Science Foundation of China (61902200), the
China Postdoctoral Science Foundation (2019M651015), the
Key-Area Research and Development Program of Guangdong
8