
2011 IEEE International Symposium on Network Computing and Applications

Adaptive Profiling for Root-cause Analysis of Performance Anomalies in Web-based Applications

João Paulo Magalhães
CIICESI, ESTGF-Porto Polytechnic Institute
Felgueiras, Portugal 4610-156
Email: jpm@estgf.ipp.pt

Luis Moura Silva
CISUC, University of Coimbra
Coimbra, Portugal 3030-290
Email: luis@dei.uc.pt

Abstract—The most important factor in the assessment of the availability of a system is the mean-time to repair (MTTR). The lower the MTTR, the higher the availability. A significant portion of the MTTR is spent in the detection and localization of the cause of the failure. One possible method that may provide good results in the root-cause analysis of application failures is run-time profiling. The major drawback of run-time profiling is its performance impact.

In this paper we describe two algorithms for selective and adaptive profiling of web-based applications. The algorithms make use of a dynamic profiling interval and are mainly triggered when some of the transactions start presenting symptoms of a performance anomaly. The algorithms were tested under different types of degradation scenarios and compared to static sampling strategies. We observed through experimentation that the pinpointing of performance anomalies, supported by the data collected using the adaptive profiling algorithms, remains as timely as with full profiling, while the response time overhead is reduced by almost 60%. When compared to a non-profiled version, the response time overhead is less than 1.5%. These results show the viability of using run-time profiling to support quick detection and pinpointing of performance anomalies and enable timely recovery.

Keywords-application profiling; monitoring; root-cause analysis; performance anomalies; dependability

978-0-7695-4489-2/11 $26.00 © 2011 IEEE. DOI 10.1109/NCA.2011.30

I. INTRODUCTION

Response time is a crucial aspect for companies that depend on web applications for most of their revenue. Recently, Bojan Simic presented in [1] the results of his latest research. He found that website slowdowns can have twice the revenue impact on an organization as an outage. According to him, the average revenue loss for one hour of website downtime is $21,000, while the average revenue loss for an hour of website slowdown is estimated at $4,100; however, website slowdowns may occur 10 times more frequently than website outages. Likewise, according to a recent report provided by the Aberdeen Group [2], a delay of just 1 second in page load time can represent a loss of $2.5 million in sales per year for a site that typically earns $100,000 a day.

Developers are aware of these issues, and as part of the development cycle they adopt application profiling to identify where system resources are being overwhelmed and to suppress those burdens. While essential to improve application performance, such off-line analysis does not capture run-time performance anomalies where, according to the Fail-Stutter fault model [3], some of the application components can start performing differently, leading to performance-faulty scenarios.

In [4] the authors estimate that 75% of the time to recover from application-level failures is spent just detecting and localizing them. Quick detection and localization is therefore a main contribution to reduce the MTTR (mean-time-to-recovery) and so improve service reliability.

In this context, run-time application profiling is extremely important to provide timely detection of abnormal execution patterns, pinpoint the faulty components and allow quick recovery. It is common sense that the more specific the profiling is, the more precise the analysis it allows. However, collecting detailed data at run-time from across the entire application can introduce an overhead incompatible with the performance level required for the application. In past work [5] we developed some techniques for root-cause failure analysis and failure prediction that make use of Aspect-Oriented Programming (AOP) to do run-time monitoring of the application components and system values. The results were very sound, but to avoid the AOP-based profiling overhead (around 60%) we adopted a static profiling sampling strategy. Such an approach might not optimize the time required for localization, so we need to work further on the profiling algorithms to improve the time required to pinpoint the faulty components as well as to minimize the profiling overhead.

In this paper we propose two adaptive and selective algorithms to profile web-based or component-based applications. The usefulness of such adaptive algorithms for application profiling encompasses several challenges. In this paper we focus on algorithms suitable to:
• reduce the performance impact;
• allow to timely pinpoint the root-cause of performance anomalies;
• minimize the number of end-users suffering from the effects of performance anomalies;
• guarantee that application profiling is not itself contributing to slow down the end-users' response time.

The adaptive and selective application profiling is supported by a closed-loop architecture that takes into account the correlation degree between the workload and the user-transactions response time. As the correlation degree deteriorates (a symptom of performance anomaly) the sampling interval is adjusted on-the-fly. The adjustment applies only to user-transactions with symptoms of performance anomaly and establishes a tradeoff between large sampling intervals (which may delay the root-cause analysis, affecting more users during the anomaly period) and low sampling intervals (which provide timely analysis when a performance anomaly is observed, at the cost of a higher number of user-transactions affected by the profiler).

The rest of the paper is organized as follows: section 2 describes the Root-cause Failure Analysis approach, the adaptation algorithms are described in section 3, section 4 presents some experimental results, section 5 describes some related work, and section 6 concludes the paper.

II. MONITORING AND ROOT-CAUSE FAILURE ANALYSIS

The root-cause failure analysis is integrated into a framework described in [5] and [6]. The framework is targeted at the detection, localization and recovery of performance anomalies in web-based and component-based applications. There, a set of system, application and application server parameters are collected on-line by the Monitoring module. The data is then sent to the Performance Analyzer module to distinguish workload variations from performance anomalies. The Anomaly Detector and the Root-cause Failure Analysis modules complement the performance anomaly analysis. They verify whether there is a system or application server parameter change correlated with the performance anomaly and look for changes in the components' response time.

The ability to timely pinpoint the root-cause of performance anomalies depends on how much fine-grain data is provided by the Monitoring module and the time necessary to carry out the analysis.

A. Application profiling

The Monitoring module makes use of Aspect-Oriented Programming (AOP). AOP allows intercepting specific points in the execution of a program and injecting new code to run in place of the original. It is implemented in Java (AspectJ [7]) and is easily attached to the application server thanks to the Load-Time Weaving (LTW) [8] support.

Figure 1 gives a logical representation of the Monitoring module. It intercepts the user-transactions in two different ways. In a lightweight mode it only intercepts the user-transactions at the beginning and at the end, evaluates their processing time and collects a set of different system and application server parameters. In a fine-grain mode (profiling mode) it intercepts and measures all the internal calls (methods execution time) belonging to each one of the user-transactions.

[Figure 1 depicts requests and responses for user-transactions t1, t2, ..., tn along a timeline crossing the application server, with parameters such as CPU (usr/sys) time, available heap and number of running threads collected along the way.]
Figure 1. Logical representation of the lightweight and fine-grain monitoring

The data collected by the lightweight monitoring is used to detect slow user-transactions. According to a previous analysis, it introduces a performance penalty of less than 0.5%.

The fine-grain monitoring is useful to support the identification and localization of components associated with a given performance anomaly. Its major drawback is the 60% performance penalty that it introduces. A possible approach to reduce the performance penalty is sampling.

With this in mind, we conducted some experiments combining lightweight monitoring and fixed sampling intervals for fine-grain monitoring. From the results we observed that adopting a sampling interval of 60 user-transactions, i.e. the user-transactions are profiled only every 60th time they execute (Mod60), brings a performance penalty of about 0.5%. This value is close to the performance overhead introduced by the lightweight monitoring. Profiling the user-transactions every 5th time they execute (Mod5) introduces an overhead of about 2.4%. Sampling interval values lower than 5 resulted in a high performance overhead, meaning that other approaches should be explored if it is necessary to reduce the fine-grain monitoring performance impact even further.

Rather than using a static sampling period, our goal is to devise adaptive and selective algorithms able to comprehend the degree of performance anomaly, for each of the user-transactions, and adjust the sampling interval on-the-fly.

B. Data analysis

The data analysis can be described as a top-down process.

1st step: Workload change or performance anomaly?

The 1st step of our analysis is to detect whether a performance variation is due to a workload change or is a performance anomaly. This is accomplished by the Performance Analyzer module. It makes use of the lightweight monitoring data to measure the correlation between the response time and the number of user-transactions processed. The correlation is determined by the Pearson correlation coefficient [9], also called Pearson's r. It is given by Eq. 1, and compares how much a sample of paired data (X, Y) changes together (covariance) relatively to its dispersion (standard deviations).

r = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n}(X_i - \bar{X})^2}\,\sqrt{\sum_{i=1}^{n}(Y_i - \bar{Y})^2}}   (1)

According to Jacob Cohen [10], the output r can be interpreted as follows: values in [−1.0, −0.5] or [0.5, 1.0] stand for a large correlation; in [−0.49, −0.3] or [0.3, 0.49] for a medium correlation; in [−0.29, −0.1] or [0.1, 0.29] for a small correlation; in [−0.09, 0.09] there is no correlation.

Considering that workload and response time might not be fully aligned, we perform an additional analysis based on the Dynamic Time Warping (DTW) algorithm [11]. This algorithm aligns the time-series and keeps track of the distance necessary to keep them aligned. Both a sudden decrease of Pearson's r and an increase in the distance needed to keep the workload and response time aligned are interpreted as a performance anomaly.

2nd step: System or application server change?

The 2nd step is performed by the Anomaly Detector module. Pearson's r is also used, this time to measure the relationship between the aggregated workload and the set of system and application server parameters collected by the AOP Monitoring module (lightweight monitoring data). A decrease in the correlation degree highlights the parameter associated with the performance anomaly. This analysis provides indication of causes related to system or application server changes, but we still do not know whether the root-cause is motivated by some application or remote service change.

3rd step: Internal or remote service change?

This step is performed if there is a symptom of performance anomaly. It makes use of the fine-grain monitoring data to track changes in the execution time of each one of the calls belonging to a user-transaction. By localizing the faulty component we are able to detect whether the extra response time of a given transaction is caused by some application or remote service change (e.g., longer DB responses, longer Web-Service responses).

The Root-cause Failure Analysis is backed by the ANalysis Of VAriance (ANOVA) [12], which explains the source of variation between a dependent variable Y (total response time) and an independent variable X (component response time). To describe the effect-size of estimations we adopt the coefficient of determination (R²). R² is given by Eq. 2 and it tracks the proportion of variability of the components' response time relatively to the user-transaction response time.

R^2 = \frac{\sum_{i=1}^{n}(f_i - \bar{y})^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}   (2)

An increased variance highlights the components that have changed their behavior according to a performance anomaly already identified by the previous steps. Some experimental results describing the Root-cause Failure Analysis ability can be seen in [5]. There, the Root-cause Failure Analysis module was extremely accurate in pinpointing the set of components involved in a performance anomaly and particularly helpful in identifying whether the performance anomaly was motivated by some remote service change, a system change or an application change.

III. THE ALGORITHMS FOR ADAPTIVE AND SELECTIVE MONITORING

The ability to timely detect and pinpoint the root-cause of a given performance anomaly is determined by the amount of fine-grain data collected by the Monitoring module. More data provides better analysis, but the user-transactions response time is affected. Achieving an equilibrium between timely pinpointing and a low performance penalty is the motivation behind the adaptive and selective algorithms described here.

The algorithms are said to be adaptive in the sense that the sampling interval is adjusted on-the-fly, and selective since the adaptation only applies to user-transactions reporting symptoms of performance anomaly.

The adaptive algorithms compute the sampling interval to be used by the Monitoring module. The computation is based on the correlation degree evaluated by the Performance Analyzer module.

Two adaptive algorithms will be presented and evaluated: a linear adaptive algorithm versus a truncated exponential adaptive algorithm.

A. Linear adaptation algorithm

In the linear adaptation algorithm the sampling interval adjustment per user-transaction is proportional to the decrease or increase in the correlation degree provided by the 1st step of the data analysis.

We use the notation M to denote the maximum sampling interval and m to denote the minimum sampling interval. The average correlation degree per user-transaction, x̄, is derived from the historical correlation degrees, and r refers to the last correlation degree observed. K defines the value used to count the number of variations observed in the Pearson correlation coefficient, and I is the amount of adjustment to apply considering the number N of variations observed. K and I must be chosen such that after a considerable decrease in the correlation degree (α) the sampling interval is equal to m. After an increase it should approximate M. S takes a value between M and m and corresponds to the sampling interval to be used by the Monitoring module.
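As an illustration of the definitions above, the linear adjustment can be sketched in Python. This is an illustrative sketch, not the paper's implementation: the function name is ours, and the constants M = 60, m = 5 and α = 0.2 are the values adopted later in the experimental setup.

```python
import math

# Constants taken from the experimental setup of the paper:
# maximum/minimum sampling interval and the significant-decrease threshold.
M, m, ALPHA = 60, 5, 0.2

def linear_sampling_interval(x_bar, r):
    """Linear adaptation: map the drop (x_bar - r) in the correlation
    degree to a sampling interval S clamped to [m, M]."""
    K = (x_bar - (x_bar - ALPHA)) / m   # step width, i.e. ALPHA / m
    I = M / (K * 100)                   # adjustment applied per variation
    N = math.floor((x_bar - r) / K)     # number of K-sized variations seen
    S = M - N * I
    return max(m, min(M, S))            # if S > M then M; if S < m then m
```

With x̄ = 0.9, an unchanged correlation (r = x̄) keeps S at the maximum of 60, while a drop of α (r = 0.7) drives S down to the minimum of 5, matching the requirement that a considerable decrease yields S = m.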

Specifically,
1) K := (x̄ − (x̄ − α))/m;
2) I := M/(K ∗ 100);
3) N := floor((x̄ − r)/K);
4) S := M − (N ∗ I);
5) if (S > M) then do {S := M;}
6) if (S < m) then do {S := m;}

B. Truncated exponential adaptation algorithm

The truncated exponential adaptation algorithm is inspired by the exponential back-off algorithm used to space out repeated retransmissions of the same block of data, often as part of network congestion avoidance. "Truncated" simply means that after a certain number of adjustments, the exponentiation stops.

This algorithm improves on the linear adaptation algorithm by adjusting the sampling interval S using an exponentiation factor (Eq. 3) that depends on N, where N accounts for the number of times that the Pearson correlation coefficient has varied by more than K.

E = \frac{2^N - 1}{2}   (3)

Specifically,
1) f := 0;
2) K := (x̄ − (x̄ − α))/m;
3) I := M/(K ∗ 100);
4) N := floor((x̄ − r)/K);
5) if (N < 0) then do {N := N ∗ (−1); f := 1;}
6) E := floor((2^N − 1)/2);
7) T := r ∗ (E ∗ I/x̄);
8) if (f == 0) then do {S := M − T;} else do {S := M + T;}
9) if (S > M) then do {S := M;}
10) if (S < m) then do {S := m;}

With this algorithm, the sampling interval will remain large when r deviates by only a few multiples of K from x̄, reducing the deluge of data collected and so minimizing the performance impact. For significant deviations the sampling interval will become small (i.e. sampling becomes more frequent), accelerating the fine-grain monitoring and providing more data useful to localize the root-cause of a performance failure more quickly.

IV. EXPERIMENTAL SETUP

In this section we present some experimental tests using the algorithms described in section III. The goal is to observe whether the algorithms are able to timely pinpoint the root-cause of a given performance anomaly and simultaneously reduce the performance penalty introduced by the application profiling. The test environment is illustrated in Figure 2.

[Figure 2 shows remote emulated browsers (REBs) sending requests to the System Under Test: an AOP-monitored Java application running on an application server/JVM, backed by a DB server. A separate analysis machine (2 CPUs Intel Xeon 2.66GHz, 2GB Mem) processes the monitoring data. The hosts range from a 1-CPU Intel Pentium IV 3GHz with 1GB of memory to 2- and 4-CPU Intel Xeon 2.66GHz machines with 2-4GB of memory.]
Figure 2. Test environment

The System Under Test consists of an application server (Glassfish) running a benchmark application (TPC-W [13]). The user requests are simulated using several remote emulated browsers (REBs). The TPC-W benchmark [13] simulates the activities of a retail store website. It defines 14 user-transactions, which are classified as either browsing or ordering types. The benchmark allows execution with different numbers of concurrent users, three different traffic mixes (browsing, shopping and ordering), different session lengths, user think-times and ramp-up/ramp-down periods. According to the specification, the number of emulated browsers is kept constant during the experiment. However, since real e-commerce applications are characterized by dynamic workloads, we created workloads that run for 8400 seconds where the type of traffic mix, the number of emulated users and the duration of each period change randomly and periodically. A MySQL database is used as the storage backend of TPC-W.

The Monitoring module intercepts each one of the 14 user-transactions and captures all the internal calls according to the defined sampling interval. The data is stored in a remote database and aggregated into equal intervals of 5 seconds, called epochs. The analysis for detection and pinpointing of performance anomalies is performed on a separate machine by an R program, provided with the Pearson's r implementation, a DTW synchronous algorithm and the one-way ANOVA procedures.

All the machines run Linux with a 2.6 kernel and are interconnected through a 1Gbps Ethernet LAN.

A. Experiments

We consider three different degradation scenarios to evaluate how timely the root-cause analysis performs while backed by the linear and exponential adaptation algorithms.
• Scenario A - Slow degradation: In this scenario the performance degradation is very slow. The response time of a given user-transaction increases 1 millisecond per minute.
• Scenario B - Fast degradation: In this scenario the performance degradation is abrupt. After a certain period of time the response time of a given user-transaction increases 200 milliseconds every 5 minutes.
• Scenario C - Momentary degradation: In this scenario the response time increases abruptly from 3ms (on average) to 53ms, remains high for 90 seconds and then recovers. This scenario is congruent with some sporadic events that may occur at different levels of the system, affecting the user-transactions response time for a short period of time.

We manually changed the source-code of one of the most frequent user-transactions (TPCW_home_interaction) to start sleeping for a specified amount of time, delaying its response. The ability of the Root-cause Failure Analysis module to pinpoint such types of application changes is presented in [5].

Considering the historical knowledge about the correlation degree and the performance impact observed by using different sampling strategies, the constants declared in the algorithms were defined as follows:
• maximum sampling interval M = 60 user-transactions;
• minimum sampling interval m = 5 user-transactions;
• α = 0.2 was considered a significant decrease in the correlation degree.

A comparison between static and dynamic sampling intervals computed by the adaptive algorithms is also provided. Each experiment was repeated five times, performing a total of 80 experiments. A simple average is used to summarize the achieved results.

Scenario A - Slow degradation

In this scenario the response time of the user-transaction TPCW_home_interaction increased very slowly: 1 millisecond per minute. In such a scenario all of the sampling strategies used may have time to provide sufficient data for a quick root-cause analysis. Between the proposed algorithms, the exponential adaptation might perform better. It is more conservative for small decreases in the correlation degree than the linear algorithm, reducing the performance penalty introduced by the profiler. At the same time, when a significant correlation decrease is observed, it quickly adjusts the sampling frequency, providing enough data for an analysis as timely as the linear algorithm's.

Figure 3 illustrates the percentage of end-users affected by a slowdown larger than 100 milliseconds when requesting the TPCW_home_interaction transaction. As noticeable in Figure 3, from the beginning of the experimentation till the pinpointing moment ([B, P]) all the sampling strategies, with the exception of "Mod5", have a low percentage of affected end-users. This confirms the advantages of using sampling to do fine-grain monitoring without compromising the ability to pinpoint the performance anomaly root-cause. Between the exponential and the linear adaptation algorithms, the exponential provided better results. Since it is more conservative when adjusting the sampling interval for small correlation degree variations, the final number of user-transactions affected by the performance impact introduced by fine-grain monitoring is lower.

Figure 3. Slow degradation: percentage of user-transactions experiencing a slowdown larger than 100ms

The collateral effects analysis for each one of the sampling strategies is presented in Figure 4. For the same reason described above, the exponential adaptation algorithm provided the lowest performance impact on the user-transactions processed by the application server.

Figure 4. Slow degradation: percentage of indirectly involved user-transactions experiencing response times larger than 100ms

Scenario B - Fast degradation

In this scenario the response time of the user-transaction (TPCW_home_interaction) was incremented by 200 milliseconds every 5 minutes. Figure 5 contains three graphs, representing the percentage of end-users experiencing slow response times: (top) between 100 milliseconds and 500 milliseconds; (middle) between 500 milliseconds and 2000 milliseconds; (bottom) larger than 2000 milliseconds. To evaluate the percentage of slow requests we took, as reference, the user-transactions response time without any type of fault injection.

Each graph contains four intervals: [B, FI] - from the Beginning of the experiment till the Fault Injection moment; ]FI, D] - from the Fault Injection point till the Detection of the performance anomaly; ]D, P] - from the Detection till the Pinpointing phase; and finally the interval ]P, F] - from the Pinpointing point till the experiment Finalization. To observe whether the adaptation provokes some collateral effects, the deliberately affected user-transaction (TPCW_home_interaction) is presented separately from the others.

Figure 5. Fast degradation: percentage of user-transactions experiencing a slowdown between [100, 500[ ms, [500, 2000[ ms, and larger than 2000 ms

The results illustrated in Figure 5 provide at least two important conclusions regarding the adoption of adaptive profiling. First, by summing the percentages from the beginning of the experiment till the moment the performance anomaly was pinpointed ([B, P]), the advantage of the linear and the exponential algorithms (with 0.47% and 0.56% of end-users experiencing slowdowns, respectively) is clear. Second, combining the analysis illustrated in Figure 5 with the collateral impacts, i.e. the performance penalty introduced by the sampling strategies, illustrated in Figure 6, reveals that the dynamic adaptation of the sampling intervals minimizes the percentage of end-users affected by performance issues.

Figure 6. Fast degradation: percentage of indirectly involved user-transactions experiencing response times larger than 100ms

Scenario C - Momentary degradation

In this scenario the response time of the user-transaction TPCW_home_interaction was increased by 50ms during 90 seconds. With this we expect to observe a sudden decrease in the correlation degree between the workload and the user-transactions response time, followed by a progressive recovery. This scenario is helpful to understand the adaptability of both algorithms and assess their potential to deal with momentary degradations. During this experiment a common workload was executed in order to allow the comparison between the results provided by the algorithms. The percentage of end-users experiencing a slowdown larger than 100 milliseconds was evaluated. The results are presented considering three distinct intervals: [B, FI] - from the Beginning of the experiment till the Fault Injection moment; ]FI, SI] - from the Fault Injection point till the Stop-Injection point; ]SI, F] - from the Stop-Injection point till the experiment Finalization.

From Figure 7 it is clear that both algorithms reacted to the decrease in the correlation degree (around epoch 350). Considering the drop in the correlation degree, it is noticeable that the exponential algorithm initially adjusted the sampling interval to a lower value, when compared with the linear algorithm. As the correlation degree gradually recovered, the exponential algorithm increased the sampling interval (initially to 47 and then to 60) well before the linear algorithm did.

[Figure 7 plots the correlation degree (r × 100) and the sampling interval (SI) over epochs 1 to 1101 for the exponential (r-Exp, SI-Exp) and linear (r-Linear, SI-Lin) algorithms.]
Figure 7. Adaptability of the sampling interval considering the exponential and linear algorithms

As a consequence, the percentage of end-users experiencing a slowdown larger than 100ms (illustrated in Figure 8) was lower when the exponential adaptation algorithm was used.

Figure 8. Momentary degradation: percentage of user-transactions experiencing a slowdown larger than 100ms

From the collateral effects, illustrated in Figure 9, the impact caused by the sudden decrease computed by the linear adaptation algorithm is perceptible. The amount of time that the sampling interval remained low interfered with the user-transactions' processing time, causing a higher impact on their response time.

Figure 9. Momentary degradation: percentage of indirectly involved user-transactions experiencing response times larger than 100ms

B. Discussion

From the experimentation it becomes clear that larger sampling intervals reduce the performance penalty introduced by fine-grain monitoring; however, in the presence of a performance anomaly they require more time for root-cause analysis, hence leaving more users exposed to the anomaly. Short sampling intervals improve the time-to-pinpoint, but such a high frequency increases the response time even when the application is not experiencing performance issues.

The advantages of using selective and adaptive algorithms to do application profiling and support the root-cause analysis are also evident in our analysis. The dynamic adaptation of the sampling interval reduces the performance penalty as well as the number of end-users experiencing slowdowns, whether motivated by some performance anomaly or due to the fine-grain monitoring.

According to our results, the performance penalty introduced by fine-grain monitoring varies from 0.5% to 2.4%, i.e. 1.45% on average. A comparison between the proposed adaptation algorithms reveals that the exponential adaptation algorithm has some advantages. It is more robust to slight correlation variations, and in scenarios of abrupt degradation it quickly adjusts the sampling interval, providing enough fine-grain data to support the accurate and timely pinpointing of the root-causes behind a performance anomaly.

Taking into account the low performance penalty introduced by fine-grain monitoring and its usefulness, in particular to support the root-cause analysis, it is possible to do run-time profiling in production environments.

V. RELATED WORK

Profiling is commonly applied to run-time anomaly detection and to application optimization. Both share the same challenge: provide detailed analysis with the lowest impact of the monitoring logic on system performance. Although related, in this paper we are particularly interested in application profiling to support the run-time pinpointing of performance anomalies.

A set of commercial tools, as well as research projects, are available in the field of application profiling. Symantec i3 for J2EE [14] is a commercial tool that features the ability to adaptively instrument Java applications based on the application response time. Examples include instrumenting methods of standard Java components, instrumenting methods that take the longest time to complete and their execution paths, or instrumenting the longest running active execution path(s). These methods make it easier to identify behavior deviations; however, the time-consuming tasks of problem localization and solving are still left to the system operator.

Rish et al. in [15] describe a technique called active probing. It combines probabilistic inference with active probing to yield a diagnostic engine able to "ask the right questions at the right time", i.e. dynamically select the probes that provide maximum information gain about the current system state. Over several simulated problems and in practical applications, the authors observed that active probing requires on average up to 75% fewer probes than pre-planned probing. In [16] the authors also apply transformations to the instrumentation code to reduce the number of instrumentation points executed as well as the cost of instrumentation probes and payload. With these transformations and optimizations the authors improved the profiling performance by 1.26x to 2.63x.

A technique to switch between instrumented and non-instrumented code is described in [17]. The authors combine code duplication with compiler-inserted counter-based sampling to activate instrumentation and collect data for a small, and bounded, amount of time. According to the experimentation, the overhead of their sampling framework was reduced from 75% (on average) to just 4.9% without affecting its final accuracy. In [18] the authors argue that a monitoring system should continuously assess current conditions by observing the most relevant data, promptly detect anomalies and help to identify the root-causes of problems. Adaptation is used to reduce the adverse effect of measurement on system performance and the overhead associated with storing, transmitting, analyzing, and reporting information. Under normal execution only a set of healthy metrics is collected. When an anomaly is found, the adaptive monitoring logic starts retrieving more metrics until a faulty component is identified. Unfortunately the authors only describe the prototype and no performance analysis is presented.

Magpie [19] and Pinpoint [20] are two well-known projects in the field. Magpie [19] collects fine-grained traces from all software components; combines these traces across multiple machines; attributes trace events and resource usage to the initiating request; uses machine learning to build a probabilistic model of request behavior; and compares individual requests against this model to

177
169

Authorized licensed use limited to: Universita della Svizzera Italiana. Downloaded on January 07,2021 at 14:03:24 UTC from IEEE Xplore. Restrictions apply.
detect anomalies. The overhead introduced with profiling is around 4%. The Pinpoint [20] project relies on a modified version of the application server to build a distribution of the normal execution paths. At run-time, detected changes are regarded as failures and recovery procedures are initiated. According to the results, Pinpoint brings an overhead of 8.4%.

Like [18], [19] and [20], our approach keeps application profiling always turned on. Rather than collecting only a subset of metrics, as done in [18], we profile the user-transactions using adaptive sampling. According to our experiments, a performance penalty of less than 1.5% is achieved by keeping the sampling intervals large, while still gathering some data for posterior analysis. Accurate and timely pinpointing of the faulty components is achieved by decreasing the sampling interval when the first symptoms of a performance anomaly are detected.

VI. CONCLUSIONS AND FUTURE WORK

This paper presented two selective and adaptive algorithms used to collect the fine-grain data that is crucial to support the root-cause analysis of performance anomalies in web-based applications. The experimental results achieved so far attest to the advantages of doing selective and adaptive profiling, and also let us conclude the following:

• the adoption of selective and adaptive algorithms minimizes the performance impact of application profiling. According to our analysis, the performance impact is almost 60% lower than with full profiling. When compared to a non-profiled version, the response time increase is on the order of 0.5% to 2.4%;

• considering the different sampling strategies and the corresponding percentage of end-users affected, the advantage that comes from the adaptability provided by the algorithms is noticeable. The pinpointing of the performance anomaly was done in a timely manner, as shown by the low number of end-users affected;

• since the sampling interval adjustment only affects the user-transactions reporting a decrease in correlation degree, the response time of the overall user-transactions is not significantly affected, as it would be with a static sampling interval.

Considering the root-cause analysis supported by the selective and adaptive algorithms presented here, we plan to devise and measure the effectiveness of recovery actions considering different types of fault injection. Attending to the applications' reliability requirements, devising convenient recovery actions depending on the type and localization of the faults is definitely a topic of great relevance.

REFERENCES

[1] B. Simic. (2010, December) Ten areas that are changing market dynamics in web performance management. [Online]. Available: http://www.trac-research.com/web-performance/ten-areas-that-are-changing-market-dynamics-in-web-performance-management
[2] Web performance impacts. [Online]. Available: http://www.webperformancetoday.com/2010/06/15/everything-you-wanted-to-know-about-web-performance/
[3] R. H. Arpaci-Dusseau and A. C. Arpaci-Dusseau, “Fail-stutter fault tolerance,” in HOTOS ’01: Proceedings of the Eighth Workshop on Hot Topics in Operating Systems. Washington, DC, USA: IEEE Computer Society, 2001, p. 33.
[4] E. Kiciman and A. Fox, “Detecting application-level failures in component-based internet services,” Neural Networks, IEEE Transactions on, vol. 16, no. 5, pp. 1027–1041, 2005.
[5] J. P. Magalhaes and L. M. Silva, “Root-cause analysis of performance anomalies in web-based applications,” in SAC ’11: Proceedings of the 26th Symposium On Applied Computing - Dependable and Adaptive Distributed Systems, March 2011.
[6] J. P. Magalhaes and L. M. Silva, “Detection of performance anomalies in web-based applications,” in NCA ’10: Proceedings of the Ninth IEEE International Symposium on Network Computing and Applications, July 2010, pp. 60–67.
[7] R. Laddad, AspectJ in Action: Practical Aspect-Oriented Programming. Greenwich, CT, USA: Manning Publications Co., ISBN 1-930110-93-6, July 2003.
[8] LTW: Load Time Weaving. [Online]. Available: http://eclipse.org/aspectj/doc/released/devguide/ltw.html
[9] J. L. Rodgers and A. W. Nicewander, “Thirteen ways to look at the correlation coefficient,” The American Statistician, vol. 42, no. 1, pp. 59–66, 1988.
[10] J. Cohen, Statistical Power Analysis for the Behavioral Sciences, 2nd ed. Lawrence Erlbaum, ISBN 0-8058-0283-5, Jan 1988.
[11] T. Giorgino, “Computing and visualizing dynamic time warping alignments in R: The DTW package,” Journal of Statistical Software, vol. 31, no. 7, 2009.
[12] R. A. Fisher, Statistical Methods for Research Workers: Study Guides, 1st ed. Cosmo Publications, ISBN 8130701332, 2006.
[13] W. D. Smith, TPC-W: Benchmarking an Ecommerce Solution, Transaction Processing Performance Council Std. [Online]. Available: http://www.tpc.org/
[14] Symantec Corporation, “Symantec i3 for J2EE - performance management for the J2EE platform,” White paper.
[15] I. Rish, M. Brodie, S. Ma, N. Odintsova, A. Beygelzimer, G. Grabarnik, and K. Hernandez, “Adaptive diagnosis in distributed systems,” Neural Networks, IEEE Transactions on, vol. 16, no. 5, pp. 1088–1109, 2005.
[16] N. Kumar, B. R. Childers, and M. L. Soffa, “Low overhead program monitoring and profiling,” SIGSOFT Softw. Eng. Notes, vol. 31, pp. 28–34, September 2005.
[17] M. Arnold and B. G. Ryder, “A framework for reducing the cost of instrumented code,” SIGPLAN Not., vol. 36, pp. 168–179, May 2001.
[18] M. A. Munawar and P. A. S. Ward, “Adaptive monitoring in enterprise software systems,” in SysML ’06: Proceedings of the First Workshop on Tackling Computer Systems Problems with Machine Learning Techniques, June 2006.
[19] P. Barham, A. Donnelly, R. Isaacs, and R. Mortier, “Using Magpie for request extraction and workload modelling,” in OSDI ’04: Proceedings of the 6th Conference on Symposium on Operating Systems Design & Implementation. Berkeley, CA, USA: USENIX Association, 2004, pp. 18–18.
[20] M. Y. Chen, E. Kiciman, E. Fratkin, A. Fox, and E. Brewer, “Pinpoint: Problem determination in large, dynamic internet services,” in DSN ’02: Proceedings of the 2002 International Conference on Dependable Systems and Networks. Washington, DC, USA: IEEE Computer Society, 2002, pp. 595–604.
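As an illustrative aside, the exponential sampling-interval adaptation discussed in this paper can be sketched in a few lines of code. The sketch below is a hypothetical reconstruction, not the implementation the paper evaluates: the function name `adapt_interval`, the 0.9 correlation threshold, the interval bounds, and the halve/double step factor are all assumptions made for illustration.

```python
# Hypothetical sketch of exponential sampling-interval adaptation.
# When the correlation degree of a user-transaction drops (a symptom of
# a performance anomaly), the profiling interval is halved so that more
# requests are profiled; once the transaction looks healthy again, the
# interval is doubled back towards its coarse-grained maximum.

MIN_INTERVAL = 1    # finest sampling: profile every request
MAX_INTERVAL = 64   # coarsest sampling: profile 1 in every 64 requests

def adapt_interval(interval: int, correlation: float,
                   threshold: float = 0.9) -> int:
    """Return the next sampling interval given the observed correlation."""
    if correlation < threshold:               # symptoms of an anomaly
        return max(MIN_INTERVAL, interval // 2)
    return min(MAX_INTERVAL, interval * 2)    # healthy: back off quickly

# Abrupt degradation drives the interval down exponentially fast,
# and recovery raises it again:
interval = MAX_INTERVAL
for corr in (0.95, 0.70, 0.65, 0.60, 0.92):
    interval = adapt_interval(interval, corr)
    print(interval)   # prints 64, 32, 16, 8, 16
```

The exponential step is what gives the robustness noted in the discussion: a single noisy correlation reading moves the interval by at most one factor of two, while a sustained degradation reaches the finest granularity in a logarithmic number of adjustments.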