InfoQ - Java Performance PDF
Java Performance
eMag Issue 7 - December 2013

Visualizing Java GC
Ben Evans discusses garbage collection in Java along with some tooling for understanding and visualizing how it works.
Java Performance / Issue 7 - Dec 2013
Visualizing Java GC
By Ben Evans
Garbage Collection, like Backgammon, takes minutes to learn and a lifetime to master.

In his talk Visualizing Garbage Collection, Master trainer/consultant Ben Evans discusses GC from the ground up. A brief summary of his talk follows.

Basics
GC has largely replaced earlier techniques, such as manual memory management and reference counting.

This is a good thing, as memory management is boring, pedantic bookkeeping that computers excel at whereas people do not. Language runtimes are better at this than humans are.

Modern GC is highly efficient, far more so than manual allocation typical in earlier languages. People from other language backgrounds often focus on GC pauses without fully understanding the context that automatic memory management operates in.

Mark & Sweep is the fundamental algorithm used for GC by Java (and other runtimes). In the Mark & Sweep algorithm you have references pointing from the frames of each thread's stack, which point into the program heap. So we start in the stack, follow pointers to all possible references, and then follow those references, recursively. When you're done, you have all live objects, and everything else is garbage.

Note that one point people often miss is that the runtime itself also has a list of pointers to every object, called the "allocation list", that is maintained by the garbage collector and helps the garbage collector to clean up. So the runtime can always find an object that it created but has not yet collected.
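A minimal sketch of the Mark & Sweep cycle described above, including the runtime's allocation list; the class and method names here are illustrative, not HotSpot internals:

```java
import java.util.*;

// Toy model of Mark & Sweep: objects, stack roots, and the runtime's
// "allocation list" of every object allocated and not yet collected.
final class HeapObject {
    final String name;
    final List<HeapObject> references = new ArrayList<>();
    boolean marked;
    HeapObject(String name) { this.name = name; }
}

final class ToyCollector {
    final List<HeapObject> allocationList = new ArrayList<>();

    HeapObject allocate(String name) {
        HeapObject obj = new HeapObject(name);
        allocationList.add(obj);          // the runtime can always find this object
        return obj;
    }

    // Mark phase: start from the stack roots and follow references
    // recursively (here with an explicit work list); everything reached is live.
    void mark(List<HeapObject> roots) {
        Deque<HeapObject> work = new ArrayDeque<>(roots);
        while (!work.isEmpty()) {
            HeapObject obj = work.pop();
            if (!obj.marked) {
                obj.marked = true;
                work.addAll(obj.references);
            }
        }
    }

    // Sweep phase: anything on the allocation list that was not marked
    // is garbage and gets reclaimed.
    List<String> sweep() {
        List<String> reclaimed = new ArrayList<>();
        for (Iterator<HeapObject> it = allocationList.iterator(); it.hasNext(); ) {
            HeapObject obj = it.next();
            if (obj.marked) {
                obj.marked = false;       // reset for the next cycle
            } else {
                reclaimed.add(obj.name);
                it.remove();
            }
        }
        return reclaimed;
    }
}
```

Everything unreachable from the roots, even if still on the allocation list, is swept away; the allocation list is what lets the sweep find it at all.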
The stack depicted in the illustration above is just the stack associated with a single application thread; there is a similar stack for each and every application thread, with its own set of pointers into the heap.

… is not doing garbage collection - because all of our GC memory bookkeeping is done in user space.

Memory Pools
GarbageCat (http://code.google.com/a/eclipselabs.org/p/garbagecat/) - Best name

In Summary
- You need to understand some basic GC theory
- You want most objects to die young, in the young gen
- Turn on GC logging! Reading raw log files is hard - use a tool
- Use tools to help you tweak - measure, don't guess

WATCH THE FULL PRESENTATION ONLINE ON INFOQ
http://www.infoq.com/presentations/Visualizing-Java-GC
Many articles describe how a poorly tuned garbage collector (GC) can bring an application's service-level-agreement (SLA) commitments to its knees. For example, an unpredictably protracted garbage-collection pause can easily exceed the response-time requirements of an otherwise performant application. Moreover, the irregularity increases when you have a non-compacting GC such as Concurrent Mark and Sweep (CMS) that tries to reclaim its fragmented heap with a serial (single-threaded) full garbage collection that is stop-the-world (STW).

Suppose an allocation failure in the young generation triggers a young collection, leading to promotions to the old generation. Further, suppose that the fragmented old generation has insufficient space for the newly promoted objects. Such conditions would trigger a full garbage-collection cycle, which will compact the heap.

With CMS GC, the full collection is serial and STW, so your application threads are stopped for the entire duration while the heap space is reclaimed and compacted. The duration of the STW pause depends on your heap size and the surviving objects. This is a common scenario with Parallel Old GC. Parallel Old reclaims the old generation with a parallel STW full garbage-collection pause. This full garbage collection is not incremental; it is one big STW pause and does not interleave with the application execution.

(Note: You can read more about HotSpot GCs here.)

… First (G1) collector, HotSpot's latest GC (introduced in JDK7 update 4). G1 is an incremental, parallel, compacting GC that provides more predictable pause times than CMS and Parallel Old. By introducing a parallel, concurrent, and multi-phased marking cycle, G1 GC can work with much larger heaps while providing reasonable worst-case pause times. The basic idea with G1 GC is to set your heap ranges (using -Xms for min heap size and -Xmx for the max size) and a realistic (soft real time) pause-time goal (using -XX:MaxGCPauseMillis) and then let the GC do its job.

With G1 GC, HotSpot moves away from its conventional GC layout where a contiguous Java heap splits into (contiguous) young and old generations. In G1 GC, HotSpot introduces the concept of "regions". A single, large, contiguous Java heap space divides into multiple fixed-sized heap regions. A list of "free" regions maintains these regions. As the need arises, the free regions are assigned to either the young or the old generation.

These regions can range from 1 MB to 32 MB in size depending on your total Java heap. G1 aims for around 2048 regions for the total heap. Once a region frees up, it goes back to the free regions list.

The principle of G1 GC is to reclaim the Java heap as much as possible (while trying its best to meet the pause-time goal) by collecting the regions with the least amount of live data, i.e. the ones with most garbage, first - hence the name Garbage First.
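The region-sizing rule described above - aim for roughly 2048 regions, each a power of two between 1 MB and 32 MB - can be sketched as follows. This is an illustration of the stated heuristic only, not HotSpot's exact calculation:

```java
// Sketch of G1's region-sizing heuristic as described in the text:
// divide the heap by a target of ~2048 regions, round down to a power
// of two, and clamp to the 1 MB - 32 MB range.
final class RegionSizing {
    static final long MIN_REGION = 1L << 20;        // 1 MB
    static final long MAX_REGION = 1L << 25;        // 32 MB
    static final long TARGET_REGION_COUNT = 2048;

    static long regionSize(long heapBytes) {
        // highestOneBit rounds down to a power of two
        long size = Long.highestOneBit(heapBytes / TARGET_REGION_COUNT);
        return Math.max(MIN_REGION, Math.min(MAX_REGION, size));
    }
}
```

For example, an 8 GB heap works out to 4 MB regions, while a 1 GB heap clamps to the 1 MB minimum and very large heaps clamp to 32 MB.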
… regions will be either copied to the to-space survivor regions or, based on the object's age and the tenuring threshold, will be promoted to region(s) from the old-generation space.

Every young collection involves parallel worker time and sequential/serial time. To explain this further, I will use a log output from the latest Java 7 update release, which at the time of publication is 7u25. (We also have an Early Access (EA) for 7u40. Please feel free to try out the EA bundles for your platform. With 7u40 EA, you may see a difference in the log format, but the basic premise remains the same.)

The following command-line options generated this GC log output:

java -Xmx1G -Xms1G -XX:+UseG1GC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps GCTestBench

Note: I went with the default pause-time goal of 200 ms.

The indentation demarcates the parallel and the sequential work groups. The parallel worker time is further split into:

1. … GC worker threads in scanning the external roots, such as registers, thread stacks, etc., that point into the collection set.
2. Update remembered sets (RSets): RSets aid G1 GC in tracking references that point into a region. The time shown here is the amount of time the parallel worker threads spent in updating the RSets.
3. Processed buffers: The count shows how many update buffers were processed by the worker threads.
4. Scan RSets: The time spent in scanning the RSets for references into a region. This time will depend on the "coarseness" of the RSet data structures.
5. Object copy: During every young collection, the GC copies all live data from the Eden and from-space survivor, either to the regions in the to-space survivor or to the old-generation regions. The amount of time it takes the worker threads to complete this task is listed here.
6. Termination: After completing their particular work (e.g. object scan and copy), each worker thread enters its termination protocol. Prior to terminating, the worker thread looks for work from other threads to steal and terminates when there is none. The time listed here indicates the time spent by the worker threads offering to terminate.
7. Parallel worker other time: Time spent by the worker threads that was not accounted for in any of the parallel activities listed above.
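The termination protocol in item 6 - drain your own queue, try to steal from other workers, and terminate only when there is nothing left to steal - can be modelled with a simplified sketch. Names and structure here are illustrative, not HotSpot's actual work-stealing implementation:

```java
import java.util.concurrent.ConcurrentLinkedDeque;
import java.util.concurrent.atomic.AtomicInteger;

// Simplified model of GC worker termination: each worker drains its own
// queue, then tries to steal from the back of other workers' queues,
// and terminates only when it finds no work anywhere.
final class StealingWorkers {
    @SuppressWarnings("unchecked")
    static int process(int workers, int tasks) {
        ConcurrentLinkedDeque<Integer>[] queues = new ConcurrentLinkedDeque[workers];
        for (int i = 0; i < workers; i++) queues[i] = new ConcurrentLinkedDeque<>();
        for (int t = 0; t < tasks; t++) queues[t % workers].add(t);

        AtomicInteger processed = new AtomicInteger();
        Thread[] threads = new Thread[workers];
        for (int i = 0; i < workers; i++) {
            final int id = i;
            threads[i] = new Thread(() -> {
                while (true) {
                    Integer task = queues[id].pollFirst();      // own work first
                    for (int j = 0; j < workers && task == null; j++)
                        task = queues[j].pollLast();            // then try to steal
                    if (task == null) return;                   // nothing to steal: terminate
                    processed.incrementAndGet();                // "do" the task
                }
            });
            threads[i].start();
        }
        try {
            for (Thread t : threads) t.join();
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return processed.get();
    }
}
```

Because a worker only terminates after a full scan of every queue comes up empty, no task is ever left behind even when the initial distribution is uneven.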
I have just skimmed the surface with respect to many things like the RSets, its coarsening, the update buffers, and the CSet. The next few paragraphs will add a few more things like the Snapshot-at-the-Beginning (SATB) algorithm and barriers, etc. However, in order to learn more about them, we would have to dive deeply into the internals of G1 GC, an interesting topic that is outside the scope of this article.

Now that we understand how the young collections start filling up the old generation, we need to introduce (and understand) the concept of a marking threshold. When the occupancy of the total heap crosses this threshold, G1 GC will trigger a multi-phased concurrent marking cycle. The command-line option that sets the threshold is -XX:InitiatingHeapOccupancyPercent and it defaults to 45% of the total Java heap size. G1 GC uses SATB, a marking algorithm that takes a logical snapshot of the set of live objects in the heap at the beginning of the marking cycle.

This algorithm uses a pre-write barrier to record and mark the objects that are a part of the logical snapshot. Now let us spend some time discussing the individual phases of the multi-phased concurrent marking, and first a look at the output from the GC log:

• … the initial-mark phase, which the first line of output is telling us. The initial-mark phase is piggybacked (done at the same time) on a normal (STW) young-garbage collection, so the output is similar to what you see during a young-evacuation pause.
• Root-region scanning phase – During this phase, G1 GC scans survivor regions of the initial-mark phase for references into the old generation and marks the referenced objects. This phase runs concurrently (not STW) with the application. It is important that this phase complete before the next young-garbage collection happens.
• Concurrent-marking phase – During this phase, G1 GC looks for reachable (live) objects across the entire Java heap. This phase happens concurrently with the application, and a young-garbage collection can interrupt the concurrent-marking phase (shown above).
• Remark phase – The remark phase helps the completion of marking. During this STW phase, G1 GC drains any remaining SATB buffers and traces any as-yet-unvisited live objects. G1 GC also does reference processing during the remark phase.
• Cleanup phase – This is the final phase of the multi-phase marking cycle. It is partly STW, when G1 GC does live-ness accounting (to identify completely free regions and mixed garbage-collection candidate regions) and when G1 GC scrubs the RSets. It is partly concurrent, when G1 GC resets and returns the empty regions to the free list.
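The pre-write barrier behind SATB can be sketched with a toy model: while marking is active, overwriting a reference first logs the old value so the logical snapshot taken at the start of marking stays complete. All names here are illustrative, not HotSpot's barrier implementation:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy model of the SATB pre-write barrier: before a reference field is
// overwritten during an active marking cycle, the old value is recorded
// in an SATB buffer so the marker can still trace it.
final class SatbSketch {
    static boolean markingActive;                        // set when a marking cycle starts
    static final Deque<Object> satbBuffer = new ArrayDeque<>();

    static class Node {
        Object ref;                                      // a reference field of some object
    }

    // The mutator writes reference fields through this barrier.
    static void writeRef(Node holder, Object newValue) {
        if (markingActive && holder.ref != null) {
            satbBuffer.push(holder.ref);                 // pre-write: log the value being overwritten
        }
        holder.ref = newValue;                           // the actual write
    }
}
```

During the remark phase the collector drains this buffer and traces whatever was recorded, which is what keeps objects that were live at the snapshot from being missed even if the application overwrote the last reference to them mid-cycle.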
… of old regions added to the CSets:

-XX:G1MixedGCLiveThresholdPercent: The occupancy threshold of live objects in the old region to be included in the mixed collection.
-XX:G1HeapWastePercent: The threshold of garbage that you can tolerate in the heap.
-XX:G1MixedGCCountTarget: The target number of mixed garbage collections within which the regions with at most G1MixedGCLiveThresholdPercent live data should be collected.
-XX:G1OldCSetRegionThresholdPercent: A limit on the max number of old regions that can be collected during a mixed collection.

… cleanup phase (of the multi-phased marking cycle), when the completely free (i.e. full of garbage) regions are reclaimed and returned to the free list. The next level happens during the incremental mixed garbage collections. If all else fails, the entire Java heap is garbage collected. This is the well-known fail-safe of full-garbage collection. All of the above makes the reclamation of the old generation a lot easier and, in a way, tiered.

About the Author
Monica Beckwith is the performance lead for the Garbage First garbage collector. She has worked in the performance and architecture industry for over 10 years. Prior to Oracle and Sun Microsystems, Monica led the performance effort at Spansion Inc. Monica has worked with many industry-standard Java-based benchmarks with a constant goal of finding opportunities for improvement in Oracle's HotSpot VM.
… cards. Both Region 1 and Region 3 happen to be referencing objects in Region 2. Therefore, the RSet for Region 2 tracks the two references to Region 2, the "owning region".

There are two concepts that help maintain RSets:
1. Post-write barriers
2. Concurrent refinement threads

The barrier code steps in after a write (hence the name "post-write barrier") and helps track cross-region updates. Update log buffers are responsible for logging the cards that contain the updated reference field. Once these buffers are full, they are retired. Concurrent refinement threads process these full buffers.

Note that the concurrent refinement threads help maintain RSets by updating them concurrently (while the application is also running). The deployment of the concurrent refinement threads is tiered: initially only a small number of threads are deployed, and more are eventually added depending on the amount of filled update buffers to be processed.

The max number of concurrent refinement threads can be controlled by -XX:G1ConcRefinementThreads or even -XX:ParallelGCThreads. If the concurrent refinement threads cannot keep up with the amount of filled buffers, then the mutator threads own and handle the processing of the buffers - usually something that you should strive to avoid.

There is one RSet per region. There are three levels of granularity for RSets - sparse, fine, and coarse. A per-region table (PRT) is an abstraction that houses the granularity level for the RSet. A sparse PRT is a hash table that contains card indices. G1 GC internally maintains these cards. The card may contain references from the region that spans the address associated with the card to the owning region. A fine-grain PRT is an open hash table where each entry represents a region with a reference into the owning region. The card indices, within the region, are held in a bitmap. When reaching the max capacity of the fine-grain PRT, a corresponding coarse-grained bit is set in the coarse-grain bitmap and the corresponding entry is deleted from the fine-grain PRT. A coarse bitmap has one bit for each region. A set bit in the coarse-grain map means that the associated region may contain references to the owning region.

A collection set (CSet) is a set of regions to be collected during a garbage collection. For a young collection, the CSet only includes young regions. For a mixed collection, the CSet includes young and old regions.

If the CSet includes many regions with coarsened RSets (note that "coarsening of RSets" is defined as the transitioning of RSets through different levels of granularity), you will see an increase in scan time for the RSets. These scan times are represented in the GC pause as "Scan RS (ms)" in the GC logs. If the Scan RS times seem high relative to the overall GC pause time, or they appear high for your application, then please look for the text string "Did xyz coarsenings" in your GC log output when using the diagnostic option -XX:+G1SummarizeRSetStats (you can also specify the reporting frequency period in number of GCs by setting -XX:G1SummarizeRSetStatsPeriod=period).

Recall from the previous article that "Update RS (ms)" in the GC-pause output shows the time spent updating RSets, and the "Processed Buffers" show the count of the update buffers processed during the GC pause. If you spot issues in these in your GC logs, then use the above-mentioned options to dive even further into the issues.

Those options can also help identify potential issues with the update log buffers and the concurrent refinement threads.

A sample output of -XX:+G1SummarizeRSetStats with the period set to one (-XX:G1SummarizeRSetStatsPeriod=1):

Concurrent RS processed 784125 cards
Of 4870 completed buffers:
4870 (100.0%) by concurrent RS threads.
0 ( 0.0%) by mutator threads.
Concurrent RS threads times (s)
0.64 0.30 0.26 0.18 0.17 0.16 0.17 0.15 0.15 0.12 0.13 0.08 0.13 0.13 0.12 0.13 0.12 0.11 0.12 0.11 0.12 0.13 0.11
Concurrent sampling threads times (s)
0.00
Total heap region rem set sizes = 199140K. Max = 661K.
Static structures = 660K, free_lists = 15052K.
1009422114 occupied cards represented.
Max size region = 313:(O)[0x000000054e400000,0x000000054e800000,0x000000054e800000], size = 662K, occupied = 1214K.
Did 2759 coarsenings.

The above output shows the count of processed cards and completed buffers. It also shows that the concurrent refinement threads did 100% of the work and the mutator threads did none (which, as I said, is a good sign!). It then lists the concurrent refinement thread times for each thread involved in the work.

The segment in brown shows the cumulative stats since the start of the HotSpot VM. The cumulative stats include the total RSet sizes and max RSet size, the total number of occupied cards, and max region size information. It also shows the total number of coarsenings done since the start of the VM.

At this point, I think it is safe to introduce another option flag: -XX:G1RSetUpdatingPauseTimePercent=10. This flag sets a target percentage (defaults to 10% of the pause-time goal) that G1 GC should spend updating RSets during a GC evacuation pause.

During an evacuation pause, the reference objects are discovered during the object scanning and copying phase and are processed after that. In the GC log, you can see the reference-processing (Ref Proc) time clubbed under the sequential work group called "Other":

[Other: 0.2 ms]
[Choose CSet: 0.0 ms]
[Ref Proc: 0.2 ms]
[Ref Enq: 0.0 ms]
[Free CSet: 0.0 ms]

Note that references with dead referents are added to the pending list, and that time is shown in the GC log as reference-enqueuing time (Ref Enq).

During a remark pause, the discovery happens during the earlier phase of concurrent marking. (Both are a part of the multi-phase concurrent marking cycle. Please refer to the previous article for more information.) The remark phase deals with the processing of the discovered references. In the GC log, you can see the reference-processing (GC ref-proc) time shown in the GC remark section:

0.094: [GC remark 0.094: [GC ref-proc, 0.0000033 secs], 0.0004374 secs]
[Times: user=0.00 sys=0.00, real=0.00 secs]

If you see lengthy times during reference processing, then please turn on parallel reference processing by enabling the following option on the command line: -XX:+ParallelRefProcEnabled.
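The sparse/fine/coarse RSet coarsening discussed above (and totalled by the "Did 2759 coarsenings" line) can be modelled with a toy per-region table. The overflow limit and data structures here are illustrative, not G1's actual thresholds:

```java
import java.util.*;

// Toy per-region table (PRT): the RSet of one region tracks which cards
// in other regions reference it. When a referencing region's fine-grain
// card set overflows, the whole region collapses to a single coarse bit
// and per-card precision is lost - a "coarsening".
final class PerRegionTableSketch {
    static final int FINE_LIMIT = 16;   // illustrative overflow limit

    final Map<Integer, Set<Integer>> fineCards = new HashMap<>(); // per-region card sets
    final Set<Integer> coarseRegions = new HashSet<>();           // one "bit" per region

    void addReference(int fromRegion, int card) {
        if (coarseRegions.contains(fromRegion)) return;  // coarse: per-card detail already lost
        Set<Integer> cards = fineCards.computeIfAbsent(fromRegion, r -> new HashSet<>());
        cards.add(card);
        if (cards.size() > FINE_LIMIT) {   // fine-grain PRT overflow
            fineCards.remove(fromRegion);  // drop the card-level entry...
            coarseRegions.add(fromRegion); // ...and set the coarse-grain bit
        }
    }
}
```

Once a region goes coarse, a later "Scan RS" must scan that entire referencing region rather than just the recorded cards, which is exactly why heavy coarsening shows up as longer Scan RS times.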
1. Find out if the failures are a side effect of overtuning - Get a simple baseline with min and max heap and a realistic pause-time goal. Remove any additional heap sizing such as -Xmn, -XX:NewSize, -XX:MaxNewSize, -XX:SurvivorRatio, etc. Use only -Xms, -Xmx and a pause-time goal -XX:MaxGCPauseMillis.

2. If the problem persists in the baseline run and if humongous allocations (see next section below) are not the issue, the corrective action is to increase your Java heap size - if you can, of course.

3. If increasing the heap size is not an option and if you notice that the marking cycle is not starting early enough for G1 GC to be able to reclaim the old generation, then drop your -XX:InitiatingHeapOccupancyPercent. The default for this is 45% of your total Java heap. Dropping the value will start the marking cycle earlier. Conversely, if the marking cycle is starting early and not reclaiming much, you should increase the threshold above the default value to make sure that you are accommodating the live data set for your application.

4. Concurrent marking cycles can start on time, but take so much time to finish that they delay the mixed garbage-collection cycles, eventually leading to an evacuation failure since the old generation is not reclaimed in a timely manner. To avoid this, increase the number of concurrent marking threads using the command-line option -XX:ConcGCThreads.

5. If to-space survivor is the issue, then increase -XX:G1ReservePercent. The default is 10% of the Java heap. G1 GC creates a false ceiling and reserves the memory, in case there is a need for more to-space. Of course, G1 GC caps it off at 50%, since we do not want the end user to set it to a very large value.

To help explain the cause of evacuation failure, I want to introduce a useful option: -XX:+PrintAdaptiveSizePolicy. This will provide many ergonomic details that are purposefully kept out of the -XX:+PrintGCDetails option.

6062.121: [GC pause (G1 Evacuation Pause) (mixed) 6062.121: [G1Ergonomics (CSet Construction) start choosing CSet, _pending_cards: 129059, predicted base time: 52.34 ms, remaining time: 147.66 ms, target pause time: 200.00 ms]
6062.121: [G1Ergonomics (CSet Construction) add young regions to CSet, eden: 912 regions, survivors: 112 regions, predicted young region time: 256.16 ms]
6062.122: [G1Ergonomics (CSet Construction) finish adding old regions to CSet, reason: old CSet region num reached min, old: 149 regions, min: 149 regions]
6062.122: [G1Ergonomics (CSet Construction) finish choosing CSet, eden: 912 regions, survivors: 112 regions, old: 149 regions, predicted pause time: 344.87 ms, target pause time: 200.00 ms]
6062.281: [G1Ergonomics (Heap Sizing) attempt heap expansion, reason: region allocation request failed, allocation request: 2097152 bytes]
6062.281: [G1Ergonomics (Heap Sizing) expand the heap, requested expansion amount: 2097152 bytes, attempted expansion amount: 4194304 bytes]
6062.281: [G1Ergonomics (Heap Sizing) did not expand the heap, reason: heap expansion operation failed]
6062.902: [G1Ergonomics (Heap Sizing) attempt heap expansion, reason: recent GC overhead higher than threshold after GC, recent GC overhead: 20.30 %, threshold: 10.00 %, uncommitted: 0 bytes, calculated expansion amount: 0 bytes (20.00 %)]
6062.902: [G1Ergonomics (Concurrent Cycles) do not request concurrent cycle initiation, reason: still doing mixed collections, occupancy: 9596567552 bytes, allocation
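The CSet-construction lines in the log above can be checked with a little arithmetic: on a simplified reading, the predicted pause time is the predicted base time plus the predicted young-region and old-region times. The 36.37 ms old-region figure below is inferred from the printed totals, not printed in the log itself:

```java
// Simplified reading of G1's CSet-construction prediction: total
// predicted pause = predicted base time + young-region time + old-region
// time. Values come from the ergonomics log above; the old-region time
// is inferred (344.87 - 52.34 - 256.16).
final class CSetArithmetic {
    static double predictedPauseMs(double baseMs, double youngMs, double oldMs) {
        return baseMs + youngMs + oldMs;
    }
}
```

Note that the predicted young-region time alone (256.16 ms) already exceeds the 147.66 ms of remaining time, yet all 1024 young regions (912 eden + 112 survivors) stay in the CSet: every young region must be evacuated in every pause, which is why the 344.87 ms prediction overshoots the 200 ms target here.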
By allocating H-objs directly in the old generation, G1 avoids including them in any of the evacuation pauses, and thus they are never moved. During a full garbage-collection cycle, G1 GC compacts around the live H-objs. Outside of a full GC, dead H-objs are reclaimed during the cleanup phase of the multi-phased, concurrent marking cycle. In other words, H-objs are collected either during the cleanup phase or during a full GC.

Before allocating an H-obj, G1 GC will check if the allocation will cross the initiating heap-occupancy percentage (the marking threshold). If so, G1 GC will initiate a G1 concurrent marking cycle. This is done in this manner since we want to avoid evacuation failures and full garbage-collection cycles as much as possible. As a result, we check as early as possible so as to give the G1 concurrent cycle as much time as possible to complete before there are no more available regions for live-object evacuations.

G1 GC's basic premise is that there are not too many H-objs and that they are long-lived. However, since G1 GC's region size is dependent on your minimum heap size, your "normal" allocation may look humongous to G1 GC. This would lead to lots of H-obj allocations taking regions from the old generation, which would eventually lead to an evacuation failure since G1 would not be able to keep up with those humongous allocations.

This information is helpful, since you not only can tell how many humongous allocations your application made (and whether they were excessive or not), but also the sizes of the allocations. Moreover, if you deem that there were excessive humongous allocations, all you have to do is increase the G1 region size to fit the H-objs as regular ones.

Recall from my last article that G1 regions can span from 1 MB to 32 MB in powers of two. The size of the allocation request in this example is just above 4 MB, so an 8-MB region size is not quite large enough to avoid the humongous allocations. We need to size up to the next power of two, 16 MB. You set that explicitly on the command line: -XX:G1HeapRegionSize=16M

About the Author
Monica Beckwith is the performance architect at a cool startup called Servergy. She has worked in the performance and architecture industry for over 10 years. Prior to Servergy, Monica led the Garbage First Garbage Collector performance at Oracle. You can follow Monica on twitter at @mon_beck.

READ THIS ARTICLE ONLINE ON InfoQ
http://www.infoq.com/articles/tuning-tips-G1-GC
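The region-size arithmetic from the example above - an allocation is humongous when it is at least half a region, so a just-over-4-MB allocation needs a 16-MB region to become a regular allocation - can be sketched as:

```java
// Sketch of the humongous-allocation rule described in the text: an
// allocation is humongous when it is at least half the region size, so
// avoiding humongous treatment means picking the smallest power-of-two
// region (1-32 MB) that is more than twice the allocation.
final class HumongousSizing {
    static boolean isHumongous(long allocBytes, long regionBytes) {
        return allocBytes >= regionBytes / 2;
    }

    // Smallest region size in [1 MB, 32 MB] for which the allocation is
    // no longer humongous, or -1 if the object stays humongous regardless.
    static long regionSizeToAvoid(long allocBytes) {
        for (long region = 1L << 20; region <= 32L << 20; region <<= 1)
            if (!isHumongous(allocBytes, region)) return region;
        return -1;
    }
}
```

For the 4.1 MB allocation in the example, an 8 MB region still treats it as humongous (4.1 MB is more than half of 8 MB), while 16 MB does not, matching the -XX:G1HeapRegionSize=16M recommendation.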
Serial, Parallel, concurrent, CMS, G1, young gen, new gen, old gen, PermGen, Eden, tenured, survivor spaces, safepoints, and the hundreds of JVM start-up flags.... Does all this baffle you when trying to tune the garbage collector to get the required throughput and latency from your Java application? If it does, then don't worry, you are not alone. Documentation describing garbage collection feels like man pages for an aircraft. Every knob and dial is detailed and explained, but nowhere can you find a guide on how to fly.

This article will attempt to explain the tradeoffs when choosing and tuning garbage-collection (GC) algorithms for a particular workload. The focus will be on the Oracle HotSpot JVM and OpenJDK collectors as those are most commonly used. Towards the end, other commercial JVMs will be discussed to illustrate alternatives.

The Tradeoffs
Wise folk keep telling us, "You don't get something for nothing." When we get something, we usually have to give up something in return. When it comes to garbage collection, we play with three major variables that set targets for the collectors:

1. Throughput: The amount of work done by an application as a ratio of time spent in GC. Target throughput with -XX:GCTimeRatio=99; 99 is the default, equating to 1% GC time.

2. Latency: The time taken by systems in responding to events, which is impacted by pauses introduced by garbage collection. Target latency for GC pauses with -XX:MaxGCPauseMillis=<n>.

3. Memory: The amount of memory our systems use to store state, which is often copied and moved around while being managed. The set of active objects retained by the application at any point in time is known as the live set. Maximum heap size -Xmx<n> is a tuning parameter for setting the heap size available to an application.

Note: Often HotSpot cannot achieve these targets and will silently continue without warning, having missed its target by a great margin.

Latency is a distribution across events. It may be acceptable to have an increased average latency to reduce the worst-case latency or make it less frequent. We should not interpret "real time" to mean the lowest possible latency; rather, it refers to having deterministic latency regardless of throughput.

For some application workloads, throughput is the most important target. An example would be a long-running batch-processing job. It does not matter if a batch job occasionally pauses for a few seconds while GC takes place, as long as the overall job can be completed sooner.

For virtually all other workloads, from human-facing interactive applications to financial-trading systems, …
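The throughput knob above can be made concrete: -XX:GCTimeRatio=<n> asks for an application-time to GC-time ratio of n:1, i.e. a GC-time fraction of 1/(1+n), so the default of 99 equates to the 1% GC time mentioned:

```java
// The arithmetic behind -XX:GCTimeRatio=<n>: a ratio of n:1 application
// time to GC time means GC gets a fraction of 1/(1+n) of total time.
final class GcTimeRatio {
    static double gcTimeFraction(int gcTimeRatio) {
        return 1.0 / (1 + gcTimeRatio);
    }
}
```

So GCTimeRatio=99 targets 1% GC time, while GCTimeRatio=19 would tolerate 5%.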
For example, loop iterators are mostly short-lived whereas static strings are effectively immortal. Experimentation has shown that generational garbage collectors can usually support an order-of-magnitude greater throughput than non-generational collectors do, and thus are almost ubiquitously used in server JVMs. When separating the generations of objects, a region of newly allocated objects is likely to be sparse for live objects. A collector that scavenges for the few live objects in this new region and copies them to another region for older objects can be very efficient. HotSpot garbage collectors record the age of an object in terms of the number of GC cycles survived.

Safepoint checks are normally performed on method returns and loop back edges, but can be optimized in some places, making them more dynamically rare. For example, if a thread is copying a large array, cloning a large object, or executing a monotonic counted loop with a finite bound, it may be many milliseconds before a safepoint is reached. Time to safepoint is an important consideration in low-latency applications. This time can be surfaced by enabling the -XX:+PrintGCApplicationStoppedTime flag in addition to the other GC flags.

In applications with a large number of running threads, when a stop-the-world event occurs, the system will
undergo significant scheduling pressure as the threads resume when released from safepoints. Therefore, algorithms with less reliance on stop-the-world events can potentially be more efficient.

Heap Organization in HotSpot
To understand how the different collectors operate, it is best to explore how the Java heap is organized to support generational collectors.

Eden is the region where most objects are initially allocated. The survivor spaces are a temporary store for objects that have survived a collection of the Eden space. Survivor-space usage will be described when minor collections are discussed. Collectively, Eden and the survivor spaces are known as the "young" or "new" generation.

Objects that live long enough are eventually promoted to the tenured space.

The perm generation is where the runtime stores objects it "knows" to be effectively immortal, such as classes and static strings. Unfortunately, the common use of class-loading on an ongoing basis in many applications makes the motivating assumption behind the perm generation (that classes are immortal) wrong. In Java 7, interned strings were moved from permgen to tenured, and Java 8 did away with the perm generation, so it will not be discussed in this article. Most other commercial collectors do not use a separate perm space and tend to treat all long-living objects as tenured.

Note: The virtual spaces allow the collectors to adjust the size of regions to meet throughput and latency targets. Collectors keep statistics for each collection phase and adjust the region sizes accordingly in an attempt to reach the targets.

Object Allocation
To avoid contention, each thread is assigned a thread-local allocation buffer (TLAB) from which it allocates objects. Using TLABs allows object allocation to scale with the number of threads by avoiding contention on a single memory resource. Object allocation via a TLAB is inexpensive; it simply bumps a pointer for the object size, which takes roughly 10 instructions on most platforms. Heap memory allocation for Java is even cheaper than using malloc from the C runtime.

Note: Whereas individual object allocation is inexpensive, the rate at which minor collection must occur is directly proportional to the rate of object allocation.
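The bump-the-pointer TLAB allocation described above can be sketched as follows; this is an illustrative model, not HotSpot's allocator:

```java
// Toy model of TLAB bump-the-pointer allocation: each thread owns a
// buffer and an allocation is just advancing an offset - no locking,
// no contention with other threads.
final class TlabSketch {
    static final class Tlab {
        final long size;
        long top;                          // bump pointer: next free offset
        Tlab(long size) { this.size = size; }

        // Returns the offset of the new object, or -1 if the TLAB is
        // exhausted and the thread must request a fresh one.
        long allocate(long objectBytes) {
            if (top + objectBytes > size) return -1;
            long offset = top;
            top += objectBytes;            // the entire "allocation"
            return offset;
        }
    }

    // One buffer per thread is what removes the contention.
    static final ThreadLocal<Tlab> tlab =
        ThreadLocal.withInitial(() -> new Tlab(256 * 1024)); // illustrative TLAB size
}
```

When allocate returns -1 the thread would hand the exhausted buffer back and take a fresh TLAB from the shared heap, which is the only point at which any synchronization is needed.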
The old-generation collector will try to predict when it needs to collect to avoid a promotion failure from the young generation. The collectors track a fill threshold for the old generation and begin collection when this threshold is passed. If this threshold is not sufficient to meet promotion requirements then a FullGC is triggered. A FullGC involves promoting all live objects from the young generations followed by a collection and compaction of the old generation. Promotion failure is a very expensive operation, as state and promoted objects from this cycle must be unwound so the FullGC can occur.

Note: To avoid promotion failure, you will need to tune the padding that the old generation allows to accommodate promotions (-XX:PromotedPadding=<n>).

Note: When the heap needs to grow, a FullGC is triggered. These heap-resizing FullGCs can be avoided by setting -Xms and -Xmx to the same value.

Other than a FullGC, a compaction of the old generation is likely to be the largest stop-the-world pause an application will experience. The time for this compaction tends to grow linearly with the number of live objects in the tenured space.

The rate at which the tenured space fills up can sometimes be reduced by increasing the size of the survivor spaces and raising the age limit of objects to be promoted to the tenured generation. However, increasing the size of the survivor spaces and the object promotion age in minor collections (-XX:MaxTenuringThreshold) can also increase the cost and pause times of the minor collections, due to the increased copy cost between survivor spaces on each minor collection.

Serial Collector
The Serial collector (-XX:+UseSerialGC) is the simplest collector and is a good option for single-processor systems. It also has the smallest footprint of any collector. It uses a single thread for both minor and major collections. Objects are allocated in the tenured space using a simple bump-the-pointer algorithm. Major collections are triggered when the tenured space is full.

Parallel Collector
The Parallel collector comes in two forms. The Parallel collector (-XX:+UseParallelGC) uses multiple threads to perform minor collections of the young generation and a single thread for major collections on the old generation. The Parallel Old collector (-XX:+UseParallelOldGC), the default since Java 7u4, uses multiple threads for minor collections and for major collections. Objects are allocated in the tenured space using a simple bump-the-pointer algorithm. Major collections are triggered when the tenured space is full.

On multiprocessor systems, the Parallel Old collector will give the greatest throughput of any collector. It has no impact on a running application until a collection occurs, and then collects in parallel using multiple threads and the most efficient algorithm. This makes the Parallel Old collector suitable for batch applications.

The cost of collecting the old generations is affected more by the number of objects to retain than by the size of the heap. The efficiency of the Parallel Old collector can be increased to achieve greater throughput by providing more memory and accepting larger but fewer collection pauses.

Expect the fastest minor collections with this collector, because promotion to the tenured space is a simple bump-the-pointer and copy operation.

For server applications, the Parallel Old collector should be the first port of call. However, if the major-collection pauses are more than your application can tolerate, then you need to consider employing a concurrent collector that collects the tenured objects concurrently while the application is running.

Note: Expect pauses in the order of one to five seconds per GB of live data on modern hardware while the old generation is compacted.

Concurrent Mark Sweep (CMS) Collector
The CMS (-XX:+UseConcMarkSweepGC) collector runs in the old generation, collecting tenured objects that are no longer reachable during a major collection. It runs concurrently with the application with the goal of keeping sufficient free space in the old generation so that a promotion failure from the young generation does not occur.
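The tenuring policy described above (-XX:MaxTenuringThreshold) can be sketched as a small simulation. The class, method, and threshold names here are illustrative, not JVM internals:

```java
import java.util.List;
import java.util.ListIterator;

final class TenuringSim {
    // Analogous to -XX:MaxTenuringThreshold: survivors reaching this age
    // are promoted to the tenured generation.
    static final int MAX_TENURING_THRESHOLD = 4;

    /**
     * One simulated minor collection over the ages of surviving objects.
     * Each survivor's age is bumped; objects reaching the threshold are
     * moved to the tenured list. Returns the number of objects promoted.
     */
    static int minorGc(List<Integer> survivorAges, List<Integer> tenured) {
        int promoted = 0;
        for (ListIterator<Integer> it = survivorAges.listIterator(); it.hasNext(); ) {
            int age = it.next() + 1;      // survived one more minor GC
            if (age >= MAX_TENURING_THRESHOLD) {
                it.remove();              // promoted out of the survivor space
                tenured.add(age);
                promoted++;
            } else {
                it.set(age);              // copied to the other survivor space
            }
        }
        return promoted;
    }
}
```

The copy in the `else` branch is the cost referred to earlier: the longer objects are held back from promotion, the more times they are copied between survivor spaces.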
space. These regions are then collected in stop-the-world pause increments by evacuating the live objects to an empty region, thus compacting in the process. Objects larger than 50% of a region are allocated in humongous regions that are a multiple of the region size. Allocation and collection of humongous objects can be very costly under G1, and to date little or no optimization effort has been applied to them.

The challenge with any compacting collector is not the moving of objects but the updating of references to those objects. If an object is referenced from many regions, updating those references can take significantly longer than moving the object itself.

G1 suits large heaps that tend to become fragmented, when an application can tolerate pauses in the 0.5-1.0 second range for incremental compactions. G1 tends to reduce the frequency of the worst-case pauses seen with CMS because of fragmentation, at the cost of extended minor collections and incremental compactions of the old generation. Most pauses end up constrained to regional rather than full-heap compactions.

Like CMS, G1 can also fail to keep up with promotion rates, and will fall back to a stop-the-world FullGC. Just as CMS has "concurrent mode failure", G1 can suffer an evacuation failure, seen in the logs as "to-space overflow". This occurs when there are no free regions into which objects can be evacuated, which is similar to a promotion failure. If this occurs, try using a larger heap and more marking threads, but in some cases application changes may be necessary to reduce allocation rates.
Note: All the concurrent collectors tend to function more efficiently when sufficient space is allocated. As a starting-point rule of thumb, you should budget a heap of at least two to three times the size of the live set for efficient operation. However, space requirements for maintaining concurrent operation grow with application throughput and the associated allocation and promotion rates, so higher-throughput applications may warrant a higher ratio of heap size to live set. Given the huge memory spaces available to today's systems, footprint is seldom an issue on the server side.

Garbage-Collection Monitoring and Tuning
To understand how your application and garbage collector are behaving, start your JVM with at least the following settings:

-verbose:gc

Once you understand how your application and collector work together, run your load tests with different configurations as experiments until you reach your throughput and latency targets. It is important to measure latency from the end-user perspective. This can be achieved by capturing the response time of every test request in a histogram. If you have latency spikes that are outside your acceptable range, try to correlate them with the GC logs to determine whether GC is the issue; other issues may also be causing latency spikes. Another useful tool to consider is jHiccup, which can be used to track pauses within the JVM and across a system as a whole.

If latency spikes are due to GC, then invest in tuning CMS or G1 to see if your latency targets can be met. Sometimes this may not be possible because of high allocation and promotion rates combined with very low latency requirements. GC tuning can become a highly skilled exercise that often requires application changes to reduce object allocation rates or object lifetimes.
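Beyond the GC logs, collector behavior can also be observed from inside the application using the standard java.lang.management API. This small sketch reports, for each collector the JVM is running, how many collections have occurred and the total time spent in them:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcStats {
    public static void main(String[] args) {
        // One MXBean per collector, e.g. a young-generation and an
        // old-generation collector; the names depend on the collector chosen.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: %d collections, %d ms total%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
    }
}
```

Polling these counters periodically gives a cheap, always-on view of collection frequency and accumulated pause time that can be correlated with your latency histograms.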