Slides for JavaOne 2015 talk by Brendan Gregg, Netflix (video/audio, of some sort, hopefully pending: follow @brendangregg on twitter for updates). Description: "At Netflix we dreamed of one visualization to show all CPU consumers: Java methods, GC, JVM internals, system libraries, and the kernel. With the help of Oracle this is now possible on x86 systems using system profilers (eg, Linux perf_events) and the new JDK option -XX:+PreserveFramePointer. This lets us create Java mixed-mode CPU flame graphs, exposing all CPU consumers. We can also use system profilers to analyze memory page faults, TCP events, storage I/O, and scheduler events, also with Java method context. This talk describes the background for this work, instructions generating Java mixed-mode flame graphs, and examples from our use at Netflix where Java on x86 is the primary platform for the Netflix cloud."
3. Quickly
• Via SSH and open source tools (covered in this talk)
• Or using Netflix Vector GUI (also open source):
1. Observe high CPU usage
2. Generate a flame graph
5. Messy House Fallacy
• Don't overlook system code: kernel, libraries, etc.
Fallacy:
my
code
is
a
mess,
I
bet
yours
is
immaculate,
therefore
the
bug
must
be
mine
Reality:
everyone's
code
is
terrible
and
buggy
7. • Over 60 million subscribers
– Just launched in Spain!
• AWS EC2 Linux cloud
• FreeBSD CDN
• Awesome place to work
8. Cloud
• Tens of thousands of AWS EC2 instances
• Mostly running Java applications (Oracle JVM)
Linux
(usually
Ubuntu)
Java
(JDK
8)
Tomcat
GC
and
thread
dump
logging
hystrix,
metrics
(Servo),
health
check
OpMonal
Apache,
memcached,
Node.js,
…
Atlas,
S3
log
rotaMon,
sar,
Trace,
perf,
stap,
perf-‐tools
Vector,
pcp
ApplicaMon
war
files,
plaYorm,
base
servlet
9. Why we need CPU profiling
• Improving performance
– Identify tuning targets
– Incident response
– Non-regression testing
– Software evaluations
– CPU workload
characterization
• Cost savings
– ASGs often scale on load
average (CPU), so CPU
usage is proportional to cost
Instance
Instance
Instance
Scaling
Policy
loadavg,
latency,
…
CloudWatch,
Servo
Auto
Scaling
Group
14. System Profilers
• Visibility
– JVM (C++)
– GC (C++)
– libraries (C)
– kernel (C)
• Typical problems (x86):
– Stacks missing for Java
– Symbols missing for Java methods
• Other architectures (e.g., SPARC) have fared better
• Profile everything except Java
15. Workaround
• Capture both Java and system profiles, and examine
side by side
• An improvement, but Java context is often crucial for
interpreting system profiles
Java System
17. Solution
• Fix system profiling
– Only way to see it all
• Visibility is everything:
– Java methods
– JVM (C++)
– GC (C++)
– libraries (C)
– kernel (C)
• Minor Problems:
– 0-3% CPU overhead to enable frame pointers (usually <1%).
– Symbol dumps can consume a burst of CPU
• Complete and accurate (asynchronous) profiling
Java
JVM
Kernel
GC
18. Simple Production Example
1. Poor performance,
and one CPU at 100%
2. perf_events flame
graph shows JVM
stuck compiling
21. Exonerating The System
• From last week:
- Frequent thread creation/
destruction assumed to be
consuming CPU resources.
Recode application?
- A flame graph quantified this
CPU time: near zero
- Time mostly other Java methods
24. CPU Profiling
A
B
block interrupt
on-CPU off-CPU
A
B
A A
B
A
syscall
time
• Record stacks at a timed interval: simple and effective
– Pros: Low (deterministic) overhead
– Cons: Coarse accuracy, but usually sufficient
stack
samples: A
25. Stack Traces
• A code path snapshot. e.g., from jstack(1):
$ jstack 1819
[…]
"main" prio=10 tid=0x00007ff304009000
nid=0x7361 runnable [0x00007ff30d4f9000]
java.lang.Thread.State: RUNNABLE
at Func_abc.func_c(Func_abc.java:6)
at Func_abc.func_b(Func_abc.java:16)
at Func_abc.func_a(Func_abc.java:23)
at Func_abc.main(Func_abc.java:27)
running
parent
g.parent
g.g.paren
running
codepath
start
26. System Profilers
• Linux
– perf_events (aka "perf")
• Oracle Solaris
– DTrace
• OS X
– Instruments
• Windows
– XPerf
• And many others…
27. Linux perf_events
• Standard Linux profiler
– Provides the perf command (multi-tool)
– Usually pkg added by linux-tools-common, etc.
• Features:
– Timer-based sampling
– Hardware events
– Tracepoints
– Dynamic tracing
• Can sample stacks of (almost) everything on CPU
– Can miss hard interrupt ISRs, but these should be near-zero. They can
be measured if needed (I wrote my own tools)
28. perf record Profiling
• Stack profiling on all CPUs at 99 Hertz, then dump:
# perf record -F 99 -ag -- sleep 30
[ perf record: Woken up 9 times to write data ]
[ perf record: Captured and wrote 2.745 MB perf.data (~119930 samples) ]
# perf script
[…]
bash 13204 cpu-clock:
459c4c dequote_string (/root/bash-4.3/bash)
465c80 glob_expand_word_list (/root/bash-4.3/bash)
466569 expand_word_list_internal (/root/bash-4.3/bash)
465a13 expand_words (/root/bash-4.3/bash)
43bbf7 execute_simple_command (/root/bash-4.3/bash)
435f16 execute_command_internal (/root/bash-4.3/bash)
435580 execute_command (/root/bash-4.3/bash)
43a771 execute_while_or_until (/root/bash-4.3/bash)
43a636 execute_while_command (/root/bash-4.3/bash)
436129 execute_command_internal (/root/bash-4.3/bash)
435580 execute_command (/root/bash-4.3/bash)
420cd5 reader_loop (/root/bash-4.3/bash)
41ea58 main (/root/bash-4.3/bash)
7ff2294edec5 __libc_start_main (/lib/x86_64-linux-gnu/libc-2.19.so)
[… ~47,000 lines truncated …]
one
stack
sample
34. Flame Graphs
• Flame Graphs:
– x-axis: alphabetical stack sort, to maximize merging
– y-axis: stack depth
– color: random (default), or a dimension
• Currently made from Perl + SVG + JavaScript
– Multiple d3 versions are being developed
• Easy to get working
– http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html
– Above commands are Linux; see URL for other OSes
git clone --depth 1 https://github.com/brendangregg/FlameGraph
cd FlameGraph
perf record -F 99 -a –g -- sleep 30
perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > perf.svg
35. Linux perf_events Workflow
perf stat perf record
perf report perf script
count events capture stacks
text UI dump profile
stackcollapse-perf.pl
flamegraph.pl
perf.data
flame graph
visualization
perf list
list events
Typical
Workflow
39. Flame Graph Interpretation (3/3)
a()
b() h()
c()
d()
e() f()
g()
i()
Widths are proportional to presence in samples
e.g., comparing b() to h() (incl. children)
40. Flame Graph Colors
• Randomized by default
• Can be used as a dimension. e.g.:
– Mixed-mode flame graphs
– Differential flame graphs
– Search
41. Mixed-Mode Flame Graphs
• Hues:
– green == Java
– red == system
– yellow == C++
• Intensity randomized
to differentiate frames
– Or hashed based on
function name
Java JVM
Kernel
Mixed-Mode
42. Differential Flame Graphs
• Hues:
– red == more samples
– blue == less samples
• Intensity shows the
degree of difference
• Used for comparing
two profiles
• Also used for showing
other metrics: e.g., CPI
Differential
more less
44. Flame Charts
• Flame charts: x-axis is time
• Flame graphs: x-axis is population (maximize merging)
• Final note: these are useful, but are not flame graphs
46. System Profiling Java on x86
• For example,
using Linux perf
• The stacks are
1 or 2 levels
deep, and have
junk values
# perf record –F 99 –a –g – sleep 30
# perf script
[…]
java 4579 cpu-clock:
ffffffff8172adff tracesys ([kernel.kallsyms])
7f4183bad7ce pthread_cond_timedwait@@GLIBC_2…
java 4579 cpu-clock:
7f417908c10b [unknown] (/tmp/perf-4458.map)
java 4579 cpu-clock:
7f4179101c97 [unknown] (/tmp/perf-4458.map)
java 4579 cpu-clock:
7f41792fc65f [unknown] (/tmp/perf-4458.map)
a2d53351ff7da603 [unknown] ([unknown])
java 4579 cpu-clock:
7f4179349aec [unknown] (/tmp/perf-4458.map)
java 4579 cpu-clock:
7f4179101d0f [unknown] (/tmp/perf-4458.map)
[…]
47. … as a Flame Graph
Broken Java stacks
(missing frame pointer)
48. Why Stacks are Broken
• On x86 (x86_64), hotspot uses
the frame pointer register (RBP)
as general purpose
• This "compiler optimization"
breaks (simple) stack walking
• Once upon a time, x86 had fewer
registers, and this made much more sense
• gcc provides -fno-omit-frame-pointer to avoid
doing this, but the JVM had no such option…
49. Fixing Stack Walking
Possibilities:
A. Fix frame pointer-based stack walking (the default)
– Pros: simple, supported by many tools
– Cons: might cost a little extra CPU
B. Use a custom walker (likely needing kernel support)
– Pros: full stack walking (incl. inlining) & arguments
– Cons: custom kernel code, can cost more CPU when in use
C. Try libunwind and DWARF
– Even feasible with JIT?
Our current preference is (A)
50. Hacking OpenJDK (1/2)
• As a proof of concept, I hacked hotspot to support an
x86_64 frame pointer
--- openjdk8clean/hotspot/src/cpu/x86/vm/x86_64.ad 2014-03-04 …
+++ openjdk8/hotspot/src/cpu/x86/vm/x86_64.ad 2014-11-08 …
@@ -166,10 +166,9 @@
// 3) reg_class stack_slots( /* one chunk of stack-based "registers" */ )
//
-// Class for all pointer registers (including RSP)
+// Class for all pointer registers (including RSP, excluding RBP)
reg_class any_reg(RAX, RAX_H,
RDX, RDX_H,
- RBP, RBP_H,
RDI, RDI_H,
RSI, RSI_H,
RCX, RCX_H,
[...]
Remove RBP from
register pools
51. Hacking OpenJDK (2/2)
• We used this patched version successfully for some limited
(and urgent) performance analysis
--- openjdk8clean/hotspot/src/cpu/x86/vm/macroAssembler_x86.cpp 2014-03-04…
+++ openjdk8/hotspot/src/cpu/x86/vm/macroAssembler_x86.cpp 2014-11-07 …
@@ -5236,6 +5236,7 @@
// We always push rbp, so that on return to interpreter rbp, will be
// restored correctly and we can correct the stack.
push(rbp);
+ mov(rbp, rsp);
// Remove word for ebp
framesize -= wordSize;
--- openjdk8clean/hotspot/src/cpu/x86/vm/c1_MacroAssembler_x86.cpp …
+++ openjdk8/hotspot/src/cpu/x86/vm/c1_MacroAssembler_x86.cpp …
[...]
Fix x86_64 function
prologues
52. -XX:+PreserveFramePointer
• We shared our patch publicly
– See "A hotspot patch for stack profiling (frame pointer)" on the
hotspot complier dev mailing list
– It became JDK-8068945 for JDK 9 and JDK-8072465 for JDK 8,
and the -XX:+PreserveFramePointer option
• Zoltán Majó (Oracle) took this on, rewrote it, and it is now:
– In JDK 9
– In JDK 8 update 60 build 19
– Thanks to Zoltán, Oracle, and the other hotspot engineers for
helping get this done!
• It might cost 0 – 3% CPU, depending on workload
56. Stacks & Inlining
• Frames may be missing (inlined)
• Disabling inlining:
– -XX:-Inline
– Many more Java frames
– Can be 80% slower!
• May not be necessary
– Inlined flame graphs often make
enough sense
– Or tune -XX:MaxInlineSize and
-XX:InlineSmallCode a little to reveal more frames
• Can even improve performance!
• perf-map-agent (next) has experimental un-inline support
No inlining
58. Missing Symbols
12.06% 62 sed sed [.] re_search_internal
|
--- re_search_internal
|
|--96.78%-- re_search_stub
| rpl_re_search
| match_regex
| do_subst
| execute_program
| process_files
| main
| __libc_start_main
71.79% 334 sed sed [.] 0x000000000001afc1
|
|--11.65%-- 0x40a447
| 0x40659a
| 0x408dd8
| 0x408ed1
| 0x402689
| 0x7fa1cd08aec5
broken
not broken
• Missing symbols may show up as hex; e.g., Linux perf:
59. Fixing Symbols
• For JIT'd code, Linux perf already looks for an
externally provided symbol file: /tmp/perf-PID.map, and
warns if it doesn't exist
• This file can be created by a Java agent
# perf script
Failed to open /tmp/perf-8131.map, continuing without symbols
[…]
java 8131 cpu-clock:
7fff76f2dce1 [unknown] ([vdso])
7fd3173f7a93 os::javaTimeMillis() (/usr/lib/jvm…
7fd301861e46 [unknown] (/tmp/perf-8131.map)
[…]
60. Java Symbols for perf
• perf-map-agent
– https://github.com/jrudolph/perf-map-agent
– Agent attaches and writes the /tmp file on demand (previous
versions attached on Java start, wrote continually)
– Thanks Johannes Rudolph!
• Use of a /tmp symbol file
– Pros: simple, can be low overhead (snapshot on demand)
– Cons: stale symbols
• Using a symbol logger with perf instead
– Patch by Stephane Eranian currently being discussed on
lkml; see "perf: add support for profiling jitted code"
64. Instructions
1. Check Java version
2. Install perf-map-agent
3. Set -XX:+PreserveFramePointer
4. Profile Java
5. Dump symbols
6. Generate Mixed-Mode Flame Graph
Note these are unsupported: use at your own risk
Reference: http://techblog.netflix.com/2015/07/java-in-flames.html
65. 1. Check Java Version
• Need JDK8u60 or better
– for -XX:+PreserveFramePointer
• Upgrade Java if necessary
$ java -version
java version "1.8.0_60"
Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode)
66. 2. Install perf-map-agent
• Check https://github.com/jrudolph/perf-map-agent for the
latest instructions. e.g.:
$ sudo bash
# apt-get install -y cmake
# export JAVA_HOME=/usr/lib/jvm/java-8-oracle
# cd /usr/lib/jvm
# git clone --depth=1 https://github.com/jrudolph/perf-map-agent
# cd perf-map-agent
# cmake .
# make
67. 3. Set -XX:+PreserveFramePointer
• Needs to be set on Java startup
• Check it is enabled (on Linux):
$ ps wwp `pgrep –n java` | grep PreserveFramePointer
68. 4. Profile Java
• Using Linux perf_events to profile all processes, at 99
Hertz, for 30 seconds (as root):
• Just profile one PID (broken on some older kernels):
• These create a perf.data file
# perf record -F 99 -a -g -- sleep 30
# perf record -F 99 -p PID -g -- sleep 30
69. 5. Dump Symbols
• See perf-map-agent docs for updated usage
• e.g., as the same user as java:
• perf-map-agent contains helper scripts. I wrote my own:
– https://github.com/brendangregg/Misc/blob/master/java/jmaps
• Dump symbols quickly after perf record to minimize stale
symbols. How I do it:
$ cd /usr/lib/jvm/perf-map-agent/out
$ java -cp attach-main.jar:$JAVA_HOME/lib/tools.jar
net.virtualvoid.perf.AttachOnce PID
# perf record -F 99 -a -g -- sleep 30; jmaps
70. 6. Generate a Mixed-Mode Flame Graph
• Using my FlameGraph software:
– perf script reads perf.data with /tmp/*.map
– out.stacks01 is an intermediate file; can be handy to keep
• Finally open flame01.svg in a browser
• Check for newer flame graph implementations (e.g., d3)
# perf script > out.stacks01
# git clone --depth=1 https://github.com/brendangregg/FlameGraph
# cat out.stacks01 | ./FlameGraph/stackcollapse-perf.pl |
./FlameGraph/flamegraph.pl --color=java --hash > flame01.svg
74. Netflix Vector
• Open source, on-demand, instance analysis tool
– https://github.com/netflix/vector
• Shows various real-time metrics
• Flame graph support currently in development
– Automating previous steps
– Using it internally already
– Also developing a new d3 front end
78. Advanced Flame Graphs
• Examples:
– Page faults
– Context switches
– Disk I/O requests
– TCP events
– CPU cache misses
– CPI
• Any event issued in synchronous Java context
79. Synchronous Java Context
• Java thread still on-CPU, and event is directly triggered
• Examples:
– Disk I/O requests issued directly by Java à yes
• direct reads, sync writes, page faults
– Disk I/O completion interrupts à no*
– Disk I/O requests triggered async, e.g., readahead à no*
* can be made yes by tracing and associating context
80. Page Faults
• Show what triggered main memory (resident) to grow:
• "fault" as (physical) main memory is allocated on-
demand, when a virtual page is first populated
• Low overhead tool to solve some types of memory leak
# perf record -e page-faults -p PID -g -- sleep 120
RES column in top(1) grows
because
82. Context Switches
• Show why Java blocked and stopped running on-CPU:
• Identifies locks, I/O, sleeps
– If code path shouldn't block and looks random, it's an involuntary context switch. I
could filter these, but you should have solved them beforehand (CPU load).
• e.g., was used to understand framework differences:
# perf record -e context-switches -p PID -g -- sleep 5
vs
rxNetty Tomcat
85. Disk I/O Requests
• Shows who issued disk I/O (sync reads & writes):
• e.g.: page faults in GC? This JVM has swapped out!:
# perf record -e block:block_rq_insert -a -g -- sleep 60
GC
86. TCP Events
• TCP transmit, using dynamic tracing:
• Note: can be high overhead for high packet rates
– For the current perf trace, dump, post-process cycle
• Can also trace TCP connect & accept (lower overhead)
• TCP receive is async
– Could trace via socket read
# perf probe tcp_sendmsg
# perf record -e probe:tcp_sendmsg -a -g -- sleep 1; jmaps
# perf script -f comm,pid,tid,cpu,time,event,ip,sym,dso,trace > out.stacks
# perf probe --del tcp_sendmsg
87. TCP Send Flame Graph
kernel
Java
JVM
Only one code-path
taken in this example
ab (client process)
88. CPU Cache Misses
• In this example, sampling via Last Level Cache loads:
• -c is the count (samples
once per count)
• Use other CPU counters to
sample hits, misses, stalls
# perf record -e LLC-loads -c 10000 -a -g -- sleep 5; jmaps
# perf script -f comm,pid,tid,cpu,time,event,ip,sym,dso > out.stacks
89. One Last Example
• Back to a
mixed-mode
CPU flame graph
• What else can we
show with color?
90. CPI Flame Graph
• Cycles Per
Instruction!
– red == instruction
heavy
– blue == cycle
heavy (likely mem
stall cycles)
zoomed: