Adaptive Profiling for Root-Cause Analysis of Performance Anomalies in Web-Based Applications
Abstract—The most important factor in the assessment of the availability of a system is the mean time to repair (MTTR): the lower the MTTR, the higher the availability. A significant portion of the MTTR is spent detecting and localizing the cause of the failure. One method that may provide good results in the root-cause analysis of application failures is run-time profiling; its major drawback is the performance impact.
In this paper we describe two algorithms for selective and adaptive profiling of web-based applications. The algorithms make use of a dynamic profiling interval and are mainly triggered when some of the transactions start presenting symptoms of a performance anomaly. The algorithms were tested under different types of degradation scenarios and compared to static sampling strategies. We observed through experimentation that the pinpointing of performance anomalies, supported by the data collected using the adaptive profiling algorithms, remains as timely as with full profiling, while the response time overhead is reduced by almost 60%. When compared to a non-profiled version, the response time overhead is less than 1.5%. These results show the viability of using run-time profiling to support quick detection and pinpointing of performance anomalies and to enable timely recovery.

Keywords—application profiling; monitoring; root-cause analysis; performance anomalies; dependability

I. INTRODUCTION

Response time is a crucial aspect for companies that depend on web applications for most of their revenue. Bojan Simic recently presented in [1] the results of his latest research: he found that website slowdowns can have twice the revenue impact on an organization as an outage. According to him, the average revenue loss for one hour of website downtime is $21,000, while the average revenue loss for an hour of website slowdown is estimated at $4,100; however, website slowdowns may occur ten times more frequently than website outages. Likewise, according to a recent report by the Aberdeen Group [2], a delay of just one second in page load time can represent a loss of $2.5 million in sales per year for a site that typically earns $100,000 a day.

Developers are aware of these issues and, as part of the development cycle, they adopt application profiling to identify where system resources are being overwhelmed and to suppress those burdens. While essential to improve the application's performance, such off-line analysis does not capture run-time performance anomalies in which, according to the Fail-Stutter fault model [3], some of the application components start performing differently, leading to performance-faulty scenarios.

In [4] the authors estimate that 75% of the time to recover from application-level failures is spent just detecting and localizing them. Quick detection and localization is therefore a main contribution to reducing the MTTR (mean time to recovery) and so improving service reliability.

In this context, run-time application profiling is extremely important to provide timely detection of abnormal execution patterns, pinpoint the faulty components and allow quick recovery. It is common sense that the more specific the profiling, the more precise the analysis it allows. However, collecting detailed data at run-time from across the entire application can introduce an overhead incompatible with the performance level required for the application. In past work [5] we developed techniques for root-cause failure analysis and failure prediction that make use of Aspect-Oriented Programming (AOP) to perform run-time monitoring of the application components and system values. The results were very sound, but to avoid the AOP-based profiling overhead (around 60%) we adopted a static profiling sampling strategy. Such an approach might not optimize the time required for localization, so we need to work further on the profiling algorithms to improve the time required to pinpoint the faulty components as well as to minimize the profiling overhead.

In this paper we propose two adaptive and selective algorithms to profile web-based or component-based applications. Devising such adaptive algorithms for application profiling encompasses several challenges. In this paper we focus on algorithms suitable to:
• reduce the performance impact;
• allow the timely pinpointing of the root-cause of performance anomalies;
• minimize the number of end-users suffering from the effects of performance anomalies;
• guarantee that application profiling is not itself contributing to slow down the end-users' response time.
The adaptive and selective application profiling is supported by a closed-loop architecture that takes into account the correlation degree between the workload and the user-transactions' response time. As the correlation degree de- [...]

[Figure: monitored parameters — CPU (usr/sys), available heap, number of running threads, sampled over times t1, t2, t3, t4, …, tn]

[...] (methods' execution time) belonging to each one of the user-transactions.
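The Introduction states that run-time monitoring of the application components is AOP-based (AspectJ and load-time weaving appear among the paper's references [7], [8]), and the fragment above refers to collecting the methods' execution time for each user-transaction. As a minimal, purely illustrative sketch of such collection — the class name, pointcut expression, package and sampling bookkeeping are assumptions, not the authors' implementation — an annotation-style AspectJ aspect could look like this:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import org.aspectj.lang.ProceedingJoinPoint;
import org.aspectj.lang.annotation.Around;
import org.aspectj.lang.annotation.Aspect;

// Illustrative sketch only (names, pointcut and sampling bookkeeping are
// assumptions): records per-method execution times, but only for calls that
// fall inside a sampled transaction window.
@Aspect
public class ProfilingAspect {

    // Current sampling interval (e.g., profile 1 out of every S transactions);
    // 60 mirrors the maximum interval M reported later in the evaluation.
    public static volatile int samplingInterval = 60;
    private static long counter = 0;

    private final Map<String, Long> lastExecutionTimeMs = new ConcurrentHashMap<>();

    @Around("execution(* com.example.app..*.*(..))") // assumed application package
    public Object profile(ProceedingJoinPoint pjp) throws Throwable {
        if (!sampled()) {
            return pjp.proceed();                      // skip measurement entirely
        }
        long start = System.nanoTime();
        try {
            return pjp.proceed();
        } finally {
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            lastExecutionTimeMs.put(pjp.getSignature().toShortString(), elapsedMs);
        }
    }

    // Simplified: in the paper sampling is decided per user-transaction;
    // here every samplingInterval-th advised call is measured.
    private static synchronized boolean sampled() {
        return (counter++ % samplingInterval) == 0;
    }
}
```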
[...] determined by the Pearson correlation coefficient [9], also called Pearson's r. It is given by Eq. 1 and compares how much a sample of paired data (X, Y) changes together (covariance) relative to its dispersion (standard deviations).

r = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2 \, \sum_{i=1}^{n} (Y_i - \bar{Y})^2}}   (1)

According to Jacob Cohen [10], the output r can be interpreted as follows: values in [−1.0, −0.5] or [0.5, 1.0] stand for a large correlation; in [−0.49, −0.3] or [0.3, 0.49] for a medium correlation; in [−0.29, −0.1] or [0.1, 0.29] for a small correlation; in [−0.09, 0.09] there is no correlation.

Considering that workload and response time might not be fully aligned, we perform an additional analysis based on the Dynamic Time Warping (DTW) algorithm [11]. This algorithm aligns the time series and keeps track of the distance necessary to keep them aligned. Both a sudden decrease of Pearson's r and an increase in the distance required to keep the workload and response time aligned are interpreted as a performance anomaly.

2nd step: System or application server change?

The 2nd step is performed by the Anomaly Detector module. Pearson's r is also used, this time to measure the relationship between the aggregated workload and the set of system and application server parameters collected by the AOP Monitoring module (lightweight monitoring data). A decrease in the correlation degree highlights the parameter associated with the performance anomaly. This analysis provides an indication of causes related to system or application server changes, but we still do not know whether the root-cause is motivated by an application or remote service change.

3rd step: Internal or remote service change?

This step is performed if there is a symptom of a performance anomaly. It makes use of the fine-grain monitoring data to track changes in the execution time of each of the calls belonging to a user-transaction. By localizing the faulty component we are able to detect whether the extra response time of a given transaction is caused by application or remote service changes (e.g., longer DB responses, longer Web-Service responses).

The Root-cause Failure Analysis is backed by the ANalysis Of VAriance (ANOVA) [12], which explains the source of variation between a dependent variable Y (total response time) and an independent variable X (component response time). To describe the effect size of the estimations we adopt the coefficient of determination (R²). R² is given by Eq. 2 and tracks the proportion of variability of the components' response time relative to the user-transaction response time.

R^2 = \frac{\sum_{i=1}^{n} (f_i - \bar{y})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}   (2)

An increased variance highlights the components that have changed their behavior in line with a performance anomaly already identified in the previous steps. Some experimental results describing the Root-cause Failure Analysis ability can be seen in [5]. There, the Root-cause Failure Analysis module was extremely accurate in pinpointing the set of components involved in a performance anomaly, and particularly helpful in identifying whether the performance anomaly was motivated by a remote service change, a system change or an application change.

III. THE ALGORITHMS FOR ADAPTIVE AND SELECTIVE MONITORING

The ability to timely detect and pinpoint the root-cause of a given performance anomaly is determined by the amount of fine-grain data collected by the Monitoring module. More data provides better analysis, but the user-transactions' response time is affected. Achieving an equilibrium between timely pinpointing and a low performance penalty is the motivation behind the adaptive and selective algorithms described here.

The algorithms are said to be adaptive in the sense that the sampling interval is adjusted on-the-fly, and selective since the adaptation only applies to user-transactions reporting symptoms of a performance anomaly.

The adaptive algorithms compute the sampling interval to be used by the Monitoring module. The computation is based on the correlation degree evaluated by the Performance Analyzer module.

Two adaptive algorithms will be presented and evaluated: a linear adaptation algorithm versus a truncated exponential adaptation algorithm.

A. Linear adaptation algorithm

In the linear adaptation algorithm the sampling interval adjustment per user-transaction is proportional to the decrease or increase in the correlation degree provided by the 1st step of the data analysis.

We use the notation M to denote the maximum sampling interval and m to denote the minimum sampling interval. The average correlation degree per user-transaction, x̄, is derived from the historical correlation degrees, and r refers to the last correlation degree observed. K defines the value used to count the number of variations observed in the Pearson correlation coefficient, and I is the amount of adjustment to apply considering the number N of variations observed. K and I must be chosen such that, after a considerable decrease in the correlation degree (α), the sampling interval equals m; after an increase it should approach M. S takes a value between m and M and corresponds to the sampling interval to be used by the Monitoring module.
Specifically,
1) K := (x̄ − (x̄ − α)) / m;
2) I := M / (K ∗ 100);
3) N := floor((x̄ − r) / K);
4) S := M − (N ∗ I);
5) if (S > M) then do {S := M;}
6) if (S < m) then do {S := m;}

[...] performance penalty introduced by the application profiling. The test environment is illustrated in Figure 2.

[Figure 2: test environment — REBs send requests to the System Under Test (AOP Monitor, Java Application, JVM)]
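To make the steps above concrete, the following Java sketch computes the sampling interval S from x̄ and r, together with a helper that computes Pearson's r (Eq. 1) from paired workload/response-time samples. Class and method names are illustrative assumptions, not the authors' implementation; the constants used in the demo are the ones reported later in the evaluation (M = 60, m = 5, α = 0.2).

```java
// Illustrative sketch of the linear adaptation rule (steps 1-6) and of
// Pearson's r (Eq. 1); names and structure are assumptions.
public final class LinearAdaptation {

    private final int maxInterval;   // M, in user-transactions
    private final int minInterval;   // m, in user-transactions
    private final double alpha;      // considerable decrease in correlation degree

    public LinearAdaptation(int maxInterval, int minInterval, double alpha) {
        this.maxInterval = maxInterval;
        this.minInterval = minInterval;
        this.alpha = alpha;
    }

    /** Sampling interval S from the historical average correlation degree
     *  xBar and the last observed correlation degree r. */
    public int samplingInterval(double xBar, double r) {
        double k = (xBar - (xBar - alpha)) / minInterval;  // 1) K = alpha / m
        double adj = maxInterval / (k * 100);              // 2) I = M / (K * 100)
        double n = Math.floor((xBar - r) / k);             // 3) N = floor((xBar - r) / K)
        double s = maxInterval - (n * adj);                // 4) S = M - N * I
        if (s > maxInterval) s = maxInterval;              // 5) S may not exceed M
        if (s < minInterval) s = minInterval;              // 6) S may not fall below m
        return (int) s;
    }

    /** Pearson's correlation coefficient (Eq. 1) between paired samples. */
    public static double pearson(double[] x, double[] y) {
        int n = x.length;
        double xMean = 0, yMean = 0;
        for (int i = 0; i < n; i++) { xMean += x[i]; yMean += y[i]; }
        xMean /= n;
        yMean /= n;
        double cov = 0, varX = 0, varY = 0;
        for (int i = 0; i < n; i++) {
            cov  += (x[i] - xMean) * (y[i] - yMean);
            varX += (x[i] - xMean) * (x[i] - xMean);
            varY += (y[i] - yMean) * (y[i] - yMean);
        }
        return cov / Math.sqrt(varX * varY);
    }

    public static void main(String[] args) {
        LinearAdaptation la = new LinearAdaptation(60, 5, 0.2);
        // With xBar = 0.9: a small drop keeps a large interval,
        // a drop of alpha pushes the interval down to m.
        System.out.println(la.samplingInterval(0.9, 0.85)); // 60 - 1*15 = 45
        System.out.println(la.samplingInterval(0.9, 0.70)); // clamped to 5
    }
}
```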
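In the same hedged spirit, a small sketch of the effect-size computation used in the 3rd analysis step (Eq. 2): the fitted values f_i are obtained here from a simple least-squares regression of the user-transaction response time y on a component's response time x — the fitting procedure is an assumption, since the text names only ANOVA and R².

```java
// Illustrative sketch of Eq. 2: R^2 between a component's response time (x)
// and the user-transaction response time (y), with fitted values f_i taken
// from a least-squares fit y = a + b*x (an assumption about the procedure).
public final class EffectSize {

    public static double rSquared(double[] x, double[] y) {
        int n = x.length;
        double xMean = 0, yMean = 0;
        for (int i = 0; i < n; i++) { xMean += x[i]; yMean += y[i]; }
        xMean /= n;
        yMean /= n;

        double sxy = 0, sxx = 0;
        for (int i = 0; i < n; i++) {
            sxy += (x[i] - xMean) * (y[i] - yMean);
            sxx += (x[i] - xMean) * (x[i] - xMean);
        }
        double slope = sxy / sxx;
        double intercept = yMean - slope * xMean;

        double explained = 0, total = 0;   // Eq. 2: sum(f_i - yMean)^2 / sum(y_i - yMean)^2
        for (int i = 0; i < n; i++) {
            double fitted = intercept + slope * x[i];
            explained += (fitted - yMean) * (fitted - yMean);
            total     += (y[i] - yMean) * (y[i] - yMean);
        }
        return explained / total;
    }
}
```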
• Scenario B - Fast degradation: In this scenario the performance degradation is abrupt. After a certain period of time, the response time of a given user-transaction increases by 200 milliseconds every 5 minutes.
• Scenario C - Momentary degradation: In this scenario the response time increases abruptly from 3 ms (on average) to 53 ms, remains high for 90 seconds and then recovers. This scenario is congruent with sporadic events that may occur at different levels of the system, affecting the user-transactions' response time for a short period of time.

We manually changed the source code of one of the most frequent user-transactions (TPCW_home_interaction) to start sleeping for a specified amount of time, delaying its response. The ability of the Root-cause Failure Analysis module to pinpoint such types of application changes is presented in [5].

Considering the historical knowledge about the correlation degree and the performance impact observed when using different sampling strategies, the constants declared in the algorithms were defined as follows (a worked example with these values is given below):
• maximum sampling interval M = 60 user-transactions;
• minimum sampling interval m = 5 user-transactions;
• α = 0.2, considered a significant decrease in the correlation degree.

A comparison between static sampling intervals and the dynamic intervals computed by the adaptive algorithms is also provided. Each experiment was repeated five times, for a total of 80 experiments. A simple average is used to summarize the achieved results.

Figure 3. Slow degradation: percentage of user-transactions experiencing a slowdown larger than 100 ms

[...] without compromising the ability to pinpoint the performance anomaly root-cause. Between the exponential and the linear adaptation algorithms, the exponential provided better results. Since it is more conservative when adjusting the sampling interval for small correlation degree variations, the final number of user-transactions affected by the performance impact introduced by fine-grain monitoring is lower. The collateral-effects analysis for each of the sampling strategies is presented in Figure 4. For the reason described above, the exponential adaptation algorithm provided the lowest performance impact on the user-transactions processed by the application server.
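As a worked illustration of these constants applied to the linear adaptation rule of Section III-A (the values of x̄ and r below are hypothetical): K = (x̄ − (x̄ − α))/m = 0.2/5 = 0.04 and I = M/(K ∗ 100) = 60/4 = 15, so each decrease of 0.04 of r below x̄ shortens the sampling interval by 15 user-transactions. With, for instance, x̄ = 0.9 and r = 0.7 (a decrease of α), N = floor((0.9 − 0.7)/0.04) = 5 and S = 60 − 5 ∗ 15 = −15, which is clamped to S := m = 5. That is, a decrease of α in the correlation degree drives the sampling interval down to the minimum m, as required by the design of the algorithm, while smaller variations of r produce proportionally smaller adjustments.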
[...] some collateral effects, the user-transaction deliberately affected (TPCW_home_interaction) is presented separately from the others.

Figure 5. Fast degradation: percentage of user-transactions experiencing a slowdown between [100, 500[ ms, [500, 2000[ ms, and larger than 2000 ms

The results illustrated in Figure 5 provide at least two important conclusions regarding the adoption of adaptive [...] and the exponential algorithms (with 0.47% and 0.56% of end-users experiencing slowdowns, respectively) is clear. Second, combining the analysis illustrated in Figure 5 with the collateral impacts, i.e. the performance penalty introduced by the sampling strategies, illustrated in Figure 6, reveals that the dynamic adaptation of the sampling intervals minimizes the percentage of end-users affected by performance issues.

Scenario C - Momentary degradation

In this scenario the response time of the user-transaction TPCW_home_interaction has been increased by 50 ms for 90 seconds. With this we expect to observe a sudden decrease in the correlation degree between the workload and the user-transactions' response time, followed by a progressive recovery. This scenario is helpful to understand the adaptability of both algorithms and to assess their potential to deal with momentary degradations. During this experiment a common workload was executed in order to allow the comparison between the results provided by the algorithms. The percentage of end-users experiencing a slowdown larger than 100 milliseconds was evaluated. The results are presented considering three distinct intervals: [B, FI] - from the Beginning of the experiment until the Fault-Injection moment; ]FI, SI] - from the Fault-Injection point until the Stop-Injection point; ]SI, F] - from the Stop-Injection point until the experiment Finalization.

From Figure 7 it is clear that both algorithms reacted to the decrease in the correlation degree (around epoch 350). Considering the drop in the correlation degree, it is noticeable that the exponential algorithm initially adjusted the sampling interval to a lower value than the linear algorithm did. As the correlation degree gradually recovered, the exponential algorithm increased the sampling interval (initially to 47 and then to 60) much earlier than the linear algorithm did.

Figure 7. Adaptability of the sampling interval considering the exponential and linear algorithms (axes: r — correlation degree × 100; SI — sampling interval; x-axis: epochs; series: r-Exp, r-Linear)

As a consequence, the percentage of end-users experiencing a slowdown larger than 100 ms (illustrated in Figure 8) was lower when the exponential adaptation algorithm was used.
[...] perceptible the impact caused by the sudden decrease computed by the linear adaptation algorithm. The amount of time during which the sampling interval remained low interfered with the user-transactions' processing time, causing a higher impact on their response time.

Figure 9. Momentary degradation: percentage of indirectly involved user-transactions experiencing response times larger than 100 ms

B. Discussion

From the experimentation it becomes clear that larger sampling intervals reduce the performance penalty introduced by fine-grain monitoring; however, in the presence of a performance anomaly they require more time for root-cause analysis, hence leaving more users exposed to the anomaly. Short sampling intervals improve the time-to-pinpoint, but such a high frequency increases the response time even when the application is not experiencing performance issues.

The advantages of using selective and adaptive algorithms to do application profiling and support root-cause analysis are also evident in our analysis. The dynamic adaptation of the sampling interval reduces the performance penalty as well as the number of end-users experiencing slowdowns motivated by a performance anomaly or by the fine-grain monitoring itself.

According to our results, the performance penalty introduced by fine-grain monitoring varies from 0.5% to 2.4%, i.e. 1.45% on average. A comparison between the proposed adaptation algorithms reveals that the exponential adaptation algorithm has some advantages. It is more robust to slight correlation variations, and in scenarios of abrupt degradation it quickly adjusts the sampling interval, providing enough fine-grain data to support the accurate and timely pinpointing of the root-causes behind a performance anomaly.

Taking into account the low performance penalty introduced by fine-grain monitoring and its usefulness, in particular to support root-cause analysis, it is possible to do run-time profiling in production environments.

V. RELATED WORK

Profiling is commonly applied to run-time anomaly detection and to application optimization. Both share the same challenge: provide detailed analysis with the lowest possible impact of the monitoring logic on system performance. Even though they are related, in this paper we are particularly interested in application profiling to support the run-time pinpointing of performance anomalies.

A set of commercial tools as well as research projects is available in the field of application profiling. Symantec i3 for J2EE [14] is a commercial tool that features the ability to adaptively instrument Java applications based on the application response time. Examples include instrumenting methods of standard Java components, instrumenting the methods that take the longest time to complete and their execution paths, or instrumenting the longest-running active execution path(s). These methods make it easier to identify behavior deviations; however, the time-consuming tasks of problem localization and solving are still left to the system operator.

Rish et al. [15] describe a technique called active probing. It combines probabilistic inference with active probing to yield a diagnostic engine able to "ask the right questions at the right time", i.e. to dynamically select the probes that provide the maximum information gain about the current system state. From several simulated problems and practical applications the authors observed that active probing requires on average up to 75% fewer probes than pre-planned probing. In [16] the authors apply transformations to the instrumentation code to reduce the number of instrumentation points executed as well as the cost of instrumentation probes and payload. With these transformations and optimizations the authors improved the profiling performance by 1.26x to 2.63x.

A technique to switch between instrumented and non-instrumented code is described in [17]. The authors combine code duplication with compiler-inserted counter-based sampling to activate instrumentation and collect data for a small and bounded amount of time. According to the experimentation, the overhead introduced by their sampling framework was reduced from 75% (on average) to just 4.9% without affecting its final accuracy. In [18] the authors argue that a monitoring system should continuously assess current conditions by observing the most relevant data, promptly detect anomalies, and help to identify the root-causes of problems. Adaptation is used to reduce the adverse effect of measurement on system performance and the overhead associated with storing, transmitting, analyzing, and reporting information. Under normal execution only a set of healthy metrics is collected; when an anomaly is found, the adaptive monitoring logic starts retrieving more metrics until a faulty component is identified. Unfortunately, the authors only describe the prototype and no performance analysis is presented.

Magpie [19] and Pinpoint [20] are two well-known projects in the field. Magpie [19] collects fine-grained traces from all software components; combines these traces across multiple machines; attributes trace events and resource usage to the initiating request; uses machine learning to build a probabilistic model of request behavior; and compares individual requests against this model to detect anomalies.
The overhead introduced with profiling is around 4%. The Pinpoint [20] project relies on a modified version of the application server to build a distribution of normal execution paths; at run-time, detected changes are regarded as failures and recovery procedures are initiated. According to the results, Pinpoint brings an overhead of 8.4%.

Like [18], [19] and [20], our approach keeps application profiling always turned on. Rather than collecting only a subset of metrics, as done by [18], we profile the user-transactions using adaptive sampling. According to our experimentation, a performance penalty of less than 1.5% is achieved by keeping the sampling intervals large enough while still gathering some data for posterior analysis. Accurate and timely pinpointing of faulty components is achieved by decreasing the sampling interval when the first symptoms of a performance anomaly are detected.

VI. CONCLUSIONS AND FUTURE WORK

This paper presented two selective and adaptive algorithms used to collect the fine-grain data that is crucial to support the root-cause analysis of performance anomalies in web-based applications. The experimental results achieved so far attest to the advantages of doing selective and adaptive profiling and also let us conclude the following:
• the adoption of selective and adaptive algorithms minimizes the performance impact of application profiling. According to our analysis, the performance impact is almost 60% lower than with full profiling; when compared to a non-profiled version, the response time increase is on the order of 0.5% to 2.4%;
• considering the different sampling strategies and the corresponding percentage of end-users affected, the advantage that comes from the adaptability provided by the algorithms is perceptible. The pinpointing of the performance anomaly was done in a timely manner, as observable from the low number of end-users affected;
• since the sampling interval adjustment only affects the user-transactions reporting a decrease in the correlation degree, the response time of the overall set of user-transactions is not significantly affected, as it would be if a static sampling interval were adopted.

Considering the root-cause analysis supported by the selective and adaptive algorithms presented here, we plan to devise and measure the effectiveness of recovery actions considering different types of fault injection. Devising convenient recovery actions that attend to the applications' reliability requirements, depending on the type and localization of the faults, is definitely a topic of great relevance.

REFERENCES

[1] B. Simic. (2010, December) Ten areas that are changing market dynamics in web performance management. [Online]. Available: http://www.trac-research.com/web-performance/ten-areas-that-are-changing-market-dynamics-in-web-performance-management
[2] Web performance impacts. [Online]. Available: http://www.webperformancetoday.com/2010/06/15/everything-you-wanted-to-know-about-web-performance/
[3] R. H. Arpaci-Dusseau and A. C. Arpaci-Dusseau, "Fail-stutter fault tolerance," in HOTOS '01: Proceedings of the Eighth Workshop on Hot Topics in Operating Systems. Washington, DC, USA: IEEE Computer Society, 2001, p. 33.
[4] E. Kiciman and A. Fox, "Detecting application-level failures in component-based internet services," Neural Networks, IEEE Transactions on, vol. 16, no. 5, pp. 1027–1041, 2005.
[5] J. P. Magalhaes and L. M. Silva, "Root-cause analysis of performance anomalies in web-based applications," in SAC '11: Proceedings of the 26th Symposium On Applied Computing - Dependable and Adaptive Distributed Systems, March 2011.
[6] J. P. Magalhaes and L. M. Silva, "Detection of performance anomalies in web-based applications," in NCA '10: Proceedings of the Ninth IEEE International Symposium on Network Computing and Applications, July 2010, pp. 60–67.
[7] R. Laddad, AspectJ in Action: Practical Aspect-Oriented Programming. Greenwich, CT, USA: Manning Publications Co., ISBN 1-930110-93-6, July 2003.
[8] LTW: Load Time Weaving. [Online]. Available: http://eclipse.org/aspectj/doc/released/devguide/ltw.html
[9] J. L. Rodgers and A. W. Nicewander, "Thirteen ways to look at the correlation coefficient," The American Statistician, vol. 42, no. 1, pp. 59–66, 1988.
[10] J. Cohen, Statistical Power Analysis for the Behavioral Sciences, 2nd ed. Lawrence Erlbaum, ISBN 0-8058-0283-5, Jan 1988.
[11] T. Giorgino, "Computing and visualizing dynamic time warping alignments in R: The dtw package," Journal of Statistical Software, vol. 31, no. 7, 2009.
[12] R. A. Fisher, Statistical Methods for Research Workers: Study Guides, 1st ed. Cosmo Publications, ISBN 8130701332, 2006.
[13] W. D. Smith, TPC-W: Benchmarking an Ecommerce Solution, Transaction Processing Performance Council Std. [Online]. Available: http://www.tpc.org/
[14] Symantec Corporation, "Symantec i3 for J2EE - performance management for the J2EE platform," White paper.
[15] I. Rish, M. Brodie, S. Ma, N. Odintsova, A. Beygelzimer, G. Grabarnik, and K. Hernandez, "Adaptive diagnosis in distributed systems," Neural Networks, IEEE Transactions on, vol. 16, no. 5, pp. 1088–1109, 2005.
[16] N. Kumar, B. R. Childers, and M. L. Soffa, "Low overhead program monitoring and profiling," SIGSOFT Softw. Eng. Notes, vol. 31, pp. 28–34, September 2005.
[17] M. Arnold and B. G. Ryder, "A framework for reducing the cost of instrumented code," SIGPLAN Not., vol. 36, pp. 168–179, May 2001.
[18] M. A. Munawar and P. A. S. Ward, "Adaptive monitoring in enterprise software systems," in SysML'06: Proceedings of the First Workshop on Tackling Computer Systems Problems with Machine Learning Techniques, June 2006.
[19] P. Barham, A. Donnelly, R. Isaacs, and R. Mortier, "Using magpie for request extraction and workload modelling," in OSDI'04: Proceedings of the 6th Conference on Symposium on Operating Systems Design & Implementation. Berkeley, CA, USA: USENIX Association, 2004, pp. 18–18.
[20] M. Y. Chen, E. Kiciman, E. Fratkin, A. Fox, and E. Brewer, "Pinpoint: Problem determination in large, dynamic internet services," in DSN '02: Proceedings of the 2002 International Conference on Dependable Systems and Networks. Washington, DC, USA: IEEE Computer Society, 2002, pp. 595–604.