Production profiling: What, Why and How

Richard Warburton (@richardwarburto)
Sadiq Jaffer (@sadiqj)
https://www.opsian.com

Why Performance Matters
Development isn’t Production
Profiling vs Monitoring
Production Profiling
Conclusion

Amazon: 100ms of latency costs 1% of sales
Google: an extra 500ms in search page generation time drops traffic by 20%
Responsive Applications make more Money

Performance testing in development can be easier
May not have access to production
Tooling often desktop-based
Not representative of production

The JVM may have very different behaviour in production
Hotspot does adaptive optimisation
Production may optimise differently

Ambient/Passive/System Metrics
Preconfigured numerical measure about the system
CPU Time Usage / Page-load Times
Cheap and sometimes effective
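
As a rough illustration of how cheap ambient metrics are to collect, the sketch below reads a few of them from the JVM's standard management beans. The class name `AmbientMetrics` and the choice of metrics are assumptions made for this example, not something from the talk.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.OperatingSystemMXBean;

// Hypothetical helper: polls a few preconfigured numerical measures.
class AmbientMetrics {
    /** Bytes of heap currently in use, as reported by the JVM. */
    static long heapUsedBytes() {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        return memory.getHeapMemoryUsage().getUsed();
    }

    /** Recent system load average; negative if the platform cannot report it. */
    static double systemLoadAverage() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        return os.getSystemLoadAverage();
    }

    /** Number of processors visible to the JVM. */
    static int processors() {
        return Runtime.getRuntime().availableProcessors();
    }

    public static void main(String[] args) {
        System.out.printf("heapUsed=%d load=%.2f cpus=%d%n",
                heapUsedBytes(), systemLoadAverage(), processors());
    }
}
```

Numbers like these say that something is wrong (for example, load is high) but not which code is responsible.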

Logging
Records arbitrary events emitted by the system being monitored
log4j/slf4j/logback
Logs of GC events
Often manual, aids system understanding, expensive

Coarse Grained Instrumentation
Measures time within some instrumented section of the code
Time spent inside the controller layer of your web-app or performing SQL queries
More detailed and actionable though expensive
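
A minimal sketch of coarse grained instrumentation, assuming a hand-rolled timer wrapped around a named section such as a controller call or an SQL query. `SectionTimer` and its methods are hypothetical names for illustration; real systems would usually rely on a metrics library instead.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;
import java.util.function.Supplier;

// Hypothetical helper: accumulates wall-clock time per instrumented section.
class SectionTimer {
    private static final Map<String, LongAdder> TOTAL_NANOS = new ConcurrentHashMap<>();
    private static final Map<String, LongAdder> CALLS = new ConcurrentHashMap<>();

    /** Runs body, recording how long it took under the given section name. */
    static <T> T time(String section, Supplier<T> body) {
        long start = System.nanoTime();
        try {
            return body.get();
        } finally {
            long elapsed = System.nanoTime() - start;
            TOTAL_NANOS.computeIfAbsent(section, k -> new LongAdder()).add(elapsed);
            CALLS.computeIfAbsent(section, k -> new LongAdder()).increment();
        }
    }

    /** Number of times the section has been executed. */
    static long calls(String section) {
        LongAdder adder = CALLS.get(section);
        return adder == null ? 0 : adder.sum();
    }

    /** Total nanoseconds spent inside the section. */
    static long totalNanos(String section) {
        LongAdder adder = TOTAL_NANOS.get(section);
        return adder == null ? 0 : adder.sum();
    }
}
```

Usage would look like `SectionTimer.time("sql", () -> repo.readPerson(id))`. Only code that someone chose to wrap is measured, so time spent anywhere else stays invisible.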

Production Profiling
What methods use up CPU time?
What lines of code allocate the most objects?
Where are your CPU Cache misses coming from?
Automatic, can be cheap but often isn’t

Where Instrumentation can be blind in the Real World
Problem: Every 5 seconds an HTTP endpoint would be really slow.
Instrumentation: on the servlet request, didn’t even show the pause!
Cause: Tomcat expired its resources cache every 5 seconds; on load, one resource scanned the entire classpath

Surely a better way?
Not just Metrics - Actionable Insights
Diagnostics aren’t Diagnosis
What about Profiling?

How to use Production Profilers
1) Extract relevant time period and apps/machines
2) Choose a type of profile: CPU Time/Wallclock Time/Memory
3) View results to tell you what the dominant consumer of a resource is
4) Fix biggest bottleneck
5) Deploy / Iterate
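
Step 3 amounts to ranking whatever the profiler sampled. The sketch below assumes each sample has already been reduced to the name of its top stack frame, and counts how often each frame appears. `HotMethods` is a hypothetical name, and real profilers usually aggregate whole stacks rather than just the top frame.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper: ranks sampled top frames by how often they were seen.
class HotMethods {
    /** Returns frames ordered by sample count, highest first (ties broken by name). */
    static List<Map.Entry<String, Long>> rank(List<String> topFrames) {
        Map<String, Long> counts = topFrames.stream()
                .collect(Collectors.groupingBy(frame -> frame, Collectors.counting()));
        return counts.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed()
                        .thenComparing(Map.Entry.<String, Long>comparingByKey()))
                .collect(Collectors.toList());
    }
}
```

The first entry of the returned list is the dominant consumer, which is the candidate for step 4.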

Instrumenting Profilers
Add instructions to collect timings (Eg: JVisualVM Profiler)
Inaccurate - modifies the behaviour of the program
High Overhead - more than 2x slower

Sampling/Statistical Profilers
WebServerThread.run()
Controller.doSomething() Controller.next()
Repo.readPerson()
new Person()
View.printHtml() ??? ???

Safepoint Bias after Inlining
WebServerThread.run()
Controller.doSomething() Controller.next()
Repo.readPerson()
new Person()
View.printHtml() ???
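
For illustration, a naive sampling profiler can be sketched in plain Java with `Thread.getAllStackTraces()`. On HotSpot this call only observes threads at safepoints, so a sampler like this is subject to exactly the bias described above. `NaiveSampler` and its parameters are assumptions for this example, not a production design.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, safepoint-biased sampler: counts the top frame of runnable threads.
class NaiveSampler {
    /** Takes `rounds` samples, `intervalMillis` apart, and returns hits per top frame. */
    static Map<String, Integer> sample(int rounds, long intervalMillis) {
        Map<String, Integer> hits = new HashMap<>();
        for (int i = 0; i < rounds; i++) {
            for (Map.Entry<Thread, StackTraceElement[]> entry
                    : Thread.getAllStackTraces().entrySet()) {
                Thread thread = entry.getKey();
                StackTraceElement[] stack = entry.getValue();
                // Skip the sampler itself, idle threads and threads with no frames.
                if (thread == Thread.currentThread() || stack.length == 0) continue;
                if (thread.getState() != Thread.State.RUNNABLE) continue;
                StackTraceElement top = stack[0];
                hits.merge(top.getClassName() + "." + top.getMethodName(), 1, Integer::sum);
            }
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt and stop early
                break;
            }
        }
        return hits;
    }
}
```

Because every sample lands on a safepoint poll, hot code that was inlined or has no poll nearby gets attributed to the wrong frame.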

Time to Safepoint
-XX:+PrintSafepointStatistics
Threads
Safepoint poll
VMOperation

Advanced Statistical Profiling in Java
OS Signals to interrupt threads on resource consumption threshold
JVM’s signal handler-safe AsyncGetCallTrace to walk the stack

People are put off by practical as much as technical issues

Barriers to Ad-Hoc Production Profiling
Generally requires access to production
Process involves manual work - hard to automate
Low-overhead open source profilers unsupported

What if we profiled all the time?

Historical Data
Allows for post-hoc incident analysis
Enables correlation with other data/metrics
Performance regression analysis

Putting Samples in Context
Application version
Environment parameters (machine type, CPU, location, etc.)
With ad-hoc profiling we can’t do this
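
One way to picture samples in context: each sample carries the application version and environment it came from, so the share of time in a method can be compared across releases. The sketch below is a hypothetical data model (`ProfileContext`, `Sample` and `share` are invented names), not Opsian's actual format.

```java
import java.util.List;

// Hypothetical model: profiler samples tagged with deployment context.
class ProfileContext {
    static final class Sample {
        final String appVersion;
        final String machineType;
        final String topFrame;

        Sample(String appVersion, String machineType, String topFrame) {
            this.appVersion = appVersion;
            this.machineType = machineType;
            this.topFrame = topFrame;
        }
    }

    /** Fraction of one version's samples whose top frame is `frame` (0.0 if no samples). */
    static double share(List<Sample> samples, String appVersion, String frame) {
        long total = 0;
        long matching = 0;
        for (Sample s : samples) {
            if (!s.appVersion.equals(appVersion)) continue;
            total++;
            if (s.topFrame.equals(frame)) matching++;
        }
        return total == 0 ? 0.0 : (double) matching / total;
    }
}
```

Comparing `share(samples, "1.0", frame)` with `share(samples, "1.1", frame)` is a crude form of the performance regression analysis that historical data enables.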

Opsian - Continuous Profiling
JVM Agents
Opsian Aggregation service
Web Reports

Summary
We can profile in production with low overhead
To overcome practical issues we can profile production all the time
Profiling all the time opens up new capabilities

Performance Matters
Metrics can be unactionable
Instrumentation has high overhead
Continuous Profiling provides insight

We need an attitude shift on profiling + monitoring

Continuous
Proactive not Reactive
Systematic not Ad Hoc

Please do Production Profiling.
All the time.

Any Questions?
https://www.opsian.com/
