
FACILITATING THE SPREAD OF KNOWLEDGE AND INNOVATION IN ENTERPRISE SOFTWARE DEVELOPMENT

JAVA PERFORMANCE
eMag Issue 7 - December 2013

Visualizing Java GC
Ben Evans discusses garbage collection in Java along with some tooling for understanding and visualizing how it works.
PAGE 4

G1: One Garbage Collector To Rule Them All
Oracle's new G1 Collector in HotSpot moves away from the conventional GC model, where a Java heap splits into (contiguous) young and old generations, and instead introduces the concept of "regions", for a generally more performant and manageable GC.
PAGE 8

Tips for Tuning the Garbage First Garbage Collector
In this second installment, Monica delves into more practical aspects and provides guidance for tuning.
PAGE 13

Java Garbage Collection Distilled
CMS, G1, Young Gen, New Gen, Old Gen, Eden, and the hundreds of JVM start-up flags... does this all baffle you when trying to tune the garbage collector to get the required throughput and latency from your Java application?
PAGE 19


Contents
Java Performance / Issue 7 - Dec 2013

Visualizing Java Garbage Collection  Page 4
Ben Evans discusses garbage collection in Java along with some tooling for understanding and visualizing how it works.

G1: One Garbage Collector To Rule Them All  Page 8
Oracle's new G1 Collector in HotSpot moves away from the conventional GC model, where a Java heap splits into (contiguous) young and old generations, and instead introduces the concept of "regions", for a generally more performant and manageable GC.

Tips for Tuning the Garbage First Garbage Collector  Page 13
In this second installment, Monica delves into more practical aspects and provides guidance for tuning.

Java Garbage Collection Distilled  Page 19
CMS, G1, Young Gen, New Gen, Old Gen, Eden, and the hundreds of JVM start-up flags... does this all baffle you when trying to tune the garbage collector to get the required throughput and latency from your Java application? Don't worry, you are not alone. This article will attempt to explain the tradeoffs when choosing and tuning garbage collection algorithms for a particular workload.


Visualizing Java GC
By Ben Evans

Garbage Collection, like Backgammon, takes minutes to learn and a lifetime to master.

In his talk Visualizing Garbage Collection, master trainer/consultant Ben Evans discusses GC from the ground up. A brief summary of his talk follows.

Basics
GC has largely replaced earlier techniques, such as manual memory management and reference counting.

This is a good thing, as memory management is boring, pedantic bookkeeping that computers excel at whereas people do not. Language runtimes are better at this than humans are.

Modern GC is highly efficient, far more so than the manual allocation typical in earlier languages. People from other language backgrounds often focus on GC pauses without fully understanding the context in which automatic memory management operates.

Mark & Sweep is the fundamental algorithm used for GC by Java (and other runtimes).

In the Mark & Sweep algorithm you have references pointing from the frames of each thread's stack into the program heap. So we start in the stack, follow pointers to all possible references, and then follow those references, recursively. When you're done, you have all live objects, and everything else is garbage.

Note that one point people often miss is that the runtime itself also has a list of pointers to every object, called the "allocation list", which is maintained by the garbage collector and helps the garbage collector to clean up. So the runtime can always find an object that it created but has not yet collected.


The stack depicted in the illustration above is just the stack associated with a single application thread; there is a similar stack for each and every application thread, with its own set of pointers into the heap.

If the garbage collector were to attempt to get a snapshot of what's living while the application was running, then it would be chasing a moving target; it could easily miss some badly timed object allocations and could not get an accurate snapshot. Therefore it is necessary to "Stop the World", i.e. stop the application threads long enough to capture the live-object snapshot.

There are two golden rules that the garbage collector must abide by:
1. The garbage collector must collect all of the garbage.
2. The garbage collector must never collect any live object.

But these rules are not created equal; if rule 2 were violated, you would end up with corrupted data.

On the other hand, if rule 1 were violated, and instead we had a system that did not collect all of the garbage all of the time, but rather only collected it eventually, then that could be tolerated, and in fact could be the basis of a garbage collector.

HotSpot
Now let's talk about HotSpot, which is actually a conglomeration of C and C++ as well as a lot of platform-specific assembler.

When people think of an interpreter, they think of a big while-loop with a large switch statement. But the HotSpot interpreter is much more sophisticated than that (for performance reasons). When you start looking at the JDK source code, you realize just how much assembler code is in HotSpot.

Object Creation
In Java we allocate a large contiguous amount of space up front, which is what we know as "the heap". This memory is then managed, purely in user space, by HotSpot.

If you see a Java process that is using a huge amount of system (or kernel) time, then you can rest assured it is not doing garbage collection - because all of our GC memory bookkeeping is done in user space.

Memory Pools
PermGen is the storage area for things like class metadata that need to remain alive for the life of the program. However, with the advent of application servers that have their own classloaders and need to reload class metadata, PermGen starts looking like a bad optimization decision, one which fortunately is going away in Java 8.

A new concept called "Metaspace" will be used, which is not exactly the same thing as PermGen. Metaspace is outside the heap and is managed by the operating system. That means class metadata will be going not into the Java heap but rather into native memory. Currently this is not such good news, because there aren't many tools that allow you to look easily into native memory. So it is good that PermGen is going away, but it is going to take some time until the tooling can catch up.

Java Heap Layout
Now let's take a look at the Java heap. Notice the virtual spaces between the heap spaces. These provide a little wiggle room to allow some amount of resizing of the pools without suffering the expense of moving everything.

Weak Generational Hypothesis
Now, why do we actually separate the heap into all of these memory pools?

There are runtime facts that cannot be deduced by static analysis. The graph above illustrates that there are two groups of objects: those that die young and those that live for a long time - so it makes sense to do extra bookkeeping to take advantage of that fact. The Java platform is littered with similar facts that have been codified into the platform as optimizations.

Demos
Ben goes on to perform a series of animated demos (download from https://github.com/kittylyst/jfx-mem). The first demo, in Flash, illustrates the movement between Eden and one of the young-gen survivor spaces, and finally into tenured. Next he presents a JavaFX rendition of the same.

Runtime Switches

"Mandatory" Flags
-verbose:gc
– Get me some GC output
-Xloggc:<pathtofile>
– Path to the log output; make sure you've got disk space
-XX:+PrintGCDetails
– Minimum information for tools to help
– Replace -verbose:gc with this
-XX:+PrintTenuringDistribution
– Premature promotion information

Basic Heap Sizing Flags
-Xms<size>
– Set the minimum size reserved for the heap
-Xmx<size>
– Set the maximum size reserved for the heap
-XX:MaxPermSize=<size>
– Set the maximum size of your perm gen – good for Spring apps and app servers

In the old days, we were taught to set -Xms to the same value as -Xmx. However, this has changed. Now you can set -Xms to something reasonably small, or just not set it at all, because the heap adaptiveness is now very good.

Other Flags
-XX:NewRatio=N
-XX:NewSize=N
-XX:MaxNewSize=N
-XX:MaxHeapFreeRatio
-XX:MinHeapFreeRatio
-XX:SurvivorRatio=N
-XX:MaxTenuringThreshold=N
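The heap bounds that -Xms/-Xmx control, and the memory pools discussed earlier, can both be observed from inside the JVM via the standard java.lang.management API. A minimal sketch (the exact pool names vary by collector and JVM version):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;

// Prints the configured heap ceiling and the JVM's memory pools
// (e.g. Eden, survivor, and old-gen spaces; names differ per collector).
public class HeapInfo {
    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        System.out.println("max heap (-Xmx ceiling): " + rt.maxMemory());
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            System.out.println(pool.getType() + "\t" + pool.getName());
        }
    }
}
```

Running this under different -Xmx settings is an easy way to confirm what the flags actually gave you.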


Why Log Files
Log files have the advantage of being usable in forensic analysis, which can save you from having to run the code a second time to reproduce the issue (not easy if it's a rare production bug). They also have more information than the JMX MXBeans for memory, not to mention that polling JMX can introduce its own set of GC problems.

Tooling
HP JMeter (Google it)
Free, reasonably solid, but no longer supported/enhanced

GCViewer (http://www.tagtraum.com/gcviewer.html)
Free, OSS, but a bit ugly

GarbageCat (http://code.google.com/a/eclipselabs.org/p/garbagecat/)
Best name

IBM GCMV (http://www.ibm.com/developerworks/java/jdk/tools/gcmv/)
J9 support

jClarity Censum (http://www.jclarity.com/products/censum)
The prettiest and most useful – but we're biased!

In Summary
You need to understand some basic GC theory.
You want most objects to die young, in the young gen.
Turn on GC logging!
– Reading raw log files is hard – use a tool.
Use tools to help you tweak – measure, don't guess.

About the speaker
Ben Evans is the CEO of jClarity, a Java/JVM performance-analysis startup. In his spare time he is one of the leaders of the London Java Community and holds a seat on the Java Community Process Executive Committee. His previous projects include performance testing the Google IPO, financial trading systems, writing award-winning websites for some of the biggest films of the 90s, and others.

Presentation summary by Victor Grazi.

WATCH THE FULL PRESENTATION ONLINE ON INFOQ
http://www.infoq.com/presentations/Visualizing-Java-GC
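The point that raw log files are hard to read can be made concrete: even a minimal scraper for pause durations takes some care. The sketch below assumes a simplified line shape like `0.230: [GC pause (young), 0.0123456 secs]`; real -XX:+PrintGCDetails output is far messier, which is exactly why the tools above exist:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Extracts pause durations (in seconds) from GC-log-style lines of the
// simplified form "... [GC pause (young), 0.0123456 secs] ...".
public class PauseScraper {
    private static final Pattern PAUSE =
            Pattern.compile("GC pause[^,\\]]*, (\\d+\\.\\d+) secs");

    public static List<Double> pauses(List<String> lines) {
        List<Double> out = new ArrayList<>();
        for (String line : lines) {
            Matcher m = PAUSE.matcher(line);
            while (m.find()) {
                out.add(Double.parseDouble(m.group(1)));
            }
        }
        return out;
    }
}
```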


G1: One Garbage Collector to Rule Them All
By Monica Beckwith

Many articles describe how a poorly tuned garbage collector (GC) can bring an application's service-level-agreement (SLA) commitments to its knees. For example, an unpredictably protracted garbage-collection pause can easily exceed the response-time requirements of an otherwise performant application. Moreover, the irregularity increases when you have a non-compacting GC such as Concurrent Mark and Sweep (CMS) that tries to reclaim its fragmented heap with a serial (single-threaded) full garbage collection that is stop-the-world (STW).

Suppose an allocation failure in the young generation triggers a young collection, leading to promotions to the old generation. Further, suppose that the fragmented old generation has insufficient space for the newly promoted objects. Such conditions would trigger a full garbage-collection cycle, which will compact the heap.

With CMS GC, the full collection is serial and STW, so your application threads are stopped for the entire duration while the heap space is reclaimed and compacted. The duration of the STW pause depends on your heap size and the surviving objects. This is a common scenario with Parallel Old GC. Parallel Old reclaims the old generation with a parallel STW full garbage-collection pause. This full garbage collection is not incremental; it is one big STW pause and does not interleave with the application execution.

(Note: You can read more about HotSpot GCs here.)

With the above information, we would like to consider one solution in the form of the Garbage First (G1) collector, HotSpot's latest GC (introduced in JDK7 update 4).

G1 is an incremental, parallel, compacting GC that provides more predictable pause times than CMS and Parallel Old. By introducing a parallel, concurrent, and multi-phased marking cycle, G1 GC can work with much larger heaps while providing reasonable worst-case pause times. The basic idea with G1 GC is to set your heap range (using -Xms for the min heap size and -Xmx for the max size) and a realistic (soft real-time) pause-time goal (using -XX:MaxGCPauseMillis) and then let the GC do its job.

With G1 GC, HotSpot moves away from its conventional GC layout, where a contiguous Java heap splits into (contiguous) young and old generations. In G1 GC, HotSpot introduces the concept of "regions". A single, large, contiguous Java heap space divides into multiple fixed-sized heap regions. A list of "free" regions maintains these regions. As the need arises, the free regions are assigned to either the young or the old generation.

These regions can range from 1 MB to 32 MB in size depending on your total Java heap. G1 aims for around 2048 regions for the total heap. Once a region frees up, it goes back to the free-region list.

The principle of G1 GC is to reclaim the Java heap as much as possible (while trying its best to meet the pause-time goal) by collecting the regions with the least amount of live data, i.e. the ones with the most garbage, first - hence the name Garbage First.
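The sizing rule quoted for G1 regions (1 MB to 32 MB, aiming for roughly 2048 regions) can be sketched as a back-of-the-envelope calculation. This illustrates the stated heuristic only; it is not HotSpot's actual ergonomics code:

```java
// Rough illustration of deriving a region size from total heap size,
// following the figures quoted in the article: aim for ~2048 regions,
// power-of-two sizes, clamped between 1 MB and 32 MB.
public class RegionSize {
    static final long MB = 1024 * 1024;

    static long regionSizeBytes(long heapBytes) {
        long target = heapBytes / 2048;          // aim for ~2048 regions
        long size = MB;                          // minimum region size: 1 MB
        while (size < target && size < 32 * MB) {
            size <<= 1;                          // round up to a power of two
        }
        return size;                             // never exceeds 32 MB
    }
}
```

For a 1 GB heap this yields 1 MB regions; for 8 GB, 4 MB regions; very large heaps are clamped to 32 MB regions.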


Fig. 1: Conventional GC Layout

One thing to note is that for G1 GC, neither the young nor the old generation has to be contiguous. This is a handy feature, since the sizing of the generations is now more dynamic.

Adaptive-sized GC algorithms like Parallel Old GC end up reserving extra space that each generation may require so that they can fit in their contiguous-space constraint. In the case of CMS, a full garbage collection is required to resize the Java heap and the generations. In contrast, G1 GC uses logical generations (a collection of non-contiguous regions of the young generation and a remainder in the old generation), so there is not much waste in space or time.

To be sure, the G1 GC algorithm does utilize some of HotSpot's basic concepts. For example, the concepts of allocation, copying to survivor space, and promotion to the old generation are similar to previous HotSpot GC implementations. Eden regions and survivor regions still make up the young generation. Most allocations happen in Eden, except for "humongous" allocations. G1 GC considers objects that span more than half a region size to be humongous objects and directly allocates them into "humongous" regions out of the old generation. G1 GC selects an adaptive young-generation size based on your pause-time goal.

The young generation can range anywhere from the preset min to the preset max sizes, which are functions of the Java heap size. When Eden reaches capacity, a "young garbage collection", also known as an "evacuation pause", will occur. This is a STW pause that copies (evacuates) the live objects from the regions that make up Eden to the to-space survivor regions.

Fig. 2: Garbage First GC Layout

In addition, live objects from the from-space survivor


regions will be either copied to the to-space survivor regions or, based on the object's age and the tenuring threshold, will be promoted to region(s) from the old-generation space.

Every young collection involves parallel worker time and sequential/serial time. To explain this further, I will use a log output from the latest Java 7 update release, which at the time of publication is 7u25. (We also have an Early Access (EA) for 7u40. Please feel free to try out the EA bundles for your platform. With 7u40 EA, you may see a difference in the log format, but the basic premise remains the same.)

The following command-line options generated this GC log output:
java -Xmx1G -Xms1G -XX:+UseG1GC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps GCTestBench

Note: I went with the default pause-time goal of 200 ms.

The indentation demarcates the parallel and the sequential work groups. The parallel worker time is further split into:

1. External root scanning: The time spent by the parallel GC worker threads in scanning the external roots, such as registers, thread stacks, etc., that point into the collection set.
2. Update remembered sets (RSets): RSets aid G1 GC in tracking references that point into a region. The time shown here is the amount of time the parallel worker threads spent in updating the RSets.
3. Processed buffers: The count shows how many update buffers were processed by the worker threads.
4. Scan RSets: The time spent in scanning the RSets for references into a region. This time will depend on the "coarseness" of the RSet data structures.
5. Object copy: During every young collection, the GC copies all live data from Eden and the from-space survivor, either to the regions in the to-space survivor or to the old-generation regions. The amount of time it takes the worker threads to complete this task is listed here.
6. Termination: After completing its particular work (e.g. object scan and copy), each worker thread enters its termination protocol. Prior to terminating, the worker thread looks for work to steal from other threads and terminates when there is none. The time listed here indicates the time spent by the worker threads offering to terminate.
7. Parallel worker other time: Time spent by the worker threads that was not accounted for in any of the parallel activities listed above.

The sequential work (which could be parallelized, individually) is divided into:

1. Clear CT: Time spent by the GC worker threads in clearing the card table of RSet-scanning metadata.
2. A few others clubbed under the "Other" time, comprising:
- Choose collection set (CSet): A garbage-collection cycle collects the set of regions in the CSet. The collection pause collects/evacuates all the live data in a particular CSet. The time listed here is the time spent finalizing the set of regions added to the CSet.
- Reference processing: The time spent processing the deferred references (soft, weak, final, and phantom) from the prior garbage-collection phases.
- Reference enqueuing: The time spent placing the references on the pending list.
- Free CSet: Time spent freeing the just-collected set of regions. This includes the time spent freeing their RSets as well.

I have just skimmed the surface with respect to many things like the RSets, their coarsening, the update buffers, and the CSet. The next few paragraphs will add a few more things like the Snapshot-at-the-Beginning (SATB) algorithm, barriers, etc. However, in order to learn more about them, we would have to dive deeply into the internals of G1 GC, an interesting topic that is outside the scope of this article.

Now that we understand how the young collections start filling up the old generation, we need to introduce (and understand) the concept of a marking threshold. When the occupancy of the total heap crosses this threshold, G1 GC will trigger a multi-phased concurrent marking cycle. The command-line option that sets the threshold is -XX:InitiatingHeapOccupancyPercent, and it defaults to 45% of the total Java heap size. G1 GC uses SATB, a marking algorithm that takes a logical snapshot of the set of live objects in the heap at the beginning of the marking cycle.

This algorithm uses a pre-write barrier to record and mark the objects that are a part of the logical snapshot. Now let us spend some time discussing the individual phases of the multi-phased concurrent marking, starting with a look at the output from the GC log:

In addition, here are the details:
• Initial-mark phase – G1 GC marks the roots during the initial-mark phase, which the first line of output is telling us. The initial-mark phase is piggybacked (done at the same time) on a normal (STW) young-garbage collection, so the output is similar to what you see during a young-evacuation pause.
• Root-region scanning phase – During this phase, G1 GC scans survivor regions of the initial-mark phase for references into the old generation and marks the referenced objects. This phase runs concurrently (not STW) with the application. It is important that this phase complete before the next young-garbage collection happens.
• Concurrent-marking phase – During this phase, G1 GC looks for reachable (live) objects across the entire Java heap. This phase happens concurrently with the application, and a young-garbage collection can interrupt the concurrent-marking phase (shown above).
• Remark phase – The remark phase helps the completion of marking. During this STW phase, G1 GC drains any remaining SATB buffers and traces any as-yet-unvisited live objects. G1 GC also does reference processing during the remark phase.
• Cleanup phase – This is the final phase of the multi-phase marking cycle. It is partly STW, when G1 GC does liveness accounting (to identify completely free regions and mixed garbage-collection candidate regions) and when G1 GC scrubs the RSets. It is partly concurrent, when G1 GC resets and returns the empty regions to the free list.

Once G1 GC successfully completes the concurrent-marking cycle, it has the information that it needs to start the old-generation collection. Until now, the collection of the old regions was not possible, since G1 GC did not have any marking information associated with those regions. A collection that facilitates the compaction and evacuation of the old generation is appropriately called a mixed collection, since G1 GC not only collects the Eden and survivor regions but also (optionally) adds old regions to the mix. Let us now discuss some details that are important to understand a mixed collection.

A mixed collection can (and usually does) happen over multiple mixed garbage-collection cycles. When a sufficient number of old regions are collected, G1 GC reverts to performing the young-garbage collections until the next marking cycle completes. A number of flags, listed and defined here, control the exact number

of old regions added to the CSets:

-XX:G1MixedGCLiveThresholdPercent: The occupancy threshold of live objects in an old region for it to be included in the mixed collection.
-XX:G1HeapWastePercent: The threshold of garbage that you can tolerate in the heap.
-XX:G1MixedGCCountTarget: The target number of mixed garbage collections within which the regions with at most G1MixedGCLiveThresholdPercent live data should be collected.
-XX:G1OldCSetRegionThresholdPercent: A limit on the max number of old regions that can be collected during a mixed collection.

Let us look at a mixed-collection cycle output from a G1 GC log:

The first level of reclamation happens during the cleanup phase (of the multi-phased marking cycle), when the completely free (i.e. full of garbage) regions are reclaimed and returned to the free list. The next level happens during the incremental mixed garbage collections. If all else fails, the entire Java heap is collected. This is the well-known fail-safe of full garbage collection. All of the above makes the reclamation of the old generation a lot easier and, in a way, tiered.

In summary, G1 improves upon its predecessor GCs by introducing the concept of regions that make up a logical generation. The regions help provide finer granularity for an incremental collection of the old generation. G1 does most of its reclamation through copying of the live data, thus achieving compaction. This is definitely a step up from in-place de-allocation without compaction, which leaves the old generation looking like Swiss cheese!

About the Author
Monica Beckwith is the performance lead for the Garbage First garbage collector. She has worked in the performance and architecture industry for over 10 years. Prior to Oracle and Sun Microsystems, Monica led the performance effort at Spansion Inc. Monica has worked with many industry-standard Java-based benchmarks with a constant goal of finding opportunities for improvement in Oracle's HotSpot VM.

READ THIS ARTICLE ONLINE ON InfoQ
http://www.infoq.com/articles/G1-One-Garbage-Collector-To-Rule-Them-All
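The two percentage thresholds driving this tiered reclamation, the marking trigger (-XX:InitiatingHeapOccupancyPercent) and the per-region live-data cutoff (-XX:G1MixedGCLiveThresholdPercent), both boil down to simple ratio checks. The helpers below are invented for illustration; HotSpot's real logic is considerably more involved:

```java
// Illustrates the two percentage checks described in the article.
public class G1Thresholds {
    // Start a concurrent marking cycle once total heap occupancy
    // crosses InitiatingHeapOccupancyPercent (default 45).
    static boolean shouldStartMarking(long usedBytes, long heapBytes, int ihopPercent) {
        return usedBytes * 100 >= (long) ihopPercent * heapBytes;
    }

    // An old region is a mixed-collection candidate only if its live
    // data is at or below G1MixedGCLiveThresholdPercent of the region.
    static boolean mixedCandidate(long liveBytes, long regionBytes, int livePercent) {
        return liveBytes * 100 <= (long) livePercent * regionBytes;
    }
}
```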


Tips for Tuning the Garbage First Garbage Collector
By Monica Beckwith

This second article in a two-part series on the G1 garbage collector follows the first one above, which originally appeared on InfoQ on July 15, 2013: G1: One Garbage Collector To Rule Them All.

Before we can understand how to tune the Garbage First garbage collector (G1 GC), we must first understand the key concepts that define G1. In this article, I will first introduce each concept and then talk about how to tune G1 (where appropriate) with respect to that concept.

Remembered Sets
Recall from the previous article that remembered sets (RSets) are per-region entries that aid G1 GC in tracking outside references that point into a heap region. Instead of scanning the entire heap for references into a region, G1 just needs to scan its RSet.

Figure 1: Remembered Sets (below)

Let us look at an example. We show three regions (in gray): Region 1, Region 2, and Region 3, and their associated RSets (in pink), which represent a set of

cards. Both Region 1 and Region 3 happen to be referencing objects in Region 2. Therefore, the RSet for Region 2 tracks the two references to Region 2, the "owning region".

There are two concepts that help maintain RSets:
1. Post-write barriers
2. Concurrent refinement threads

The barrier code steps in after a write (hence the name "post-write barrier") and helps track cross-region updates. Update log buffers are responsible for logging the cards that contain the updated reference field. Once these buffers are full, they are retired. Concurrent refinement threads process these full buffers.

Note that the concurrent refinement threads help maintain RSets by updating them concurrently (while the application is also running). The deployment of the concurrent refinement threads is tiered: initially only a small number of threads are deployed, and more are eventually added depending on the amount of filled update buffers to be processed.

The max number of concurrent refinement threads can be controlled by -XX:G1ConcRefinementThreads or even -XX:ParallelGCThreads. If the concurrent refinement threads cannot keep up with the amount of filled buffers, then the mutator threads take over and handle the processing of the buffers - usually something that you should strive to avoid.

There is one RSet per region. There are three levels of granularity for RSets: sparse, fine, and coarse. A per-region table (PRT) is an abstraction that houses the granularity level of an RSet. A sparse PRT is a hash table that contains card indices. G1 GC internally maintains these cards. A card may contain references from the region that spans the address associated with the card to the owning region. A fine-grain PRT is an open hash table where each entry represents a region with a reference into the owning region. The card indices within that region are held in a bitmap. When the fine-grain PRT reaches its max capacity, a corresponding coarse-grained bit is set in the coarse-grain bitmap and the corresponding entry is deleted from the fine-grain PRT. A coarse bitmap has one bit for each region. A set bit in the coarse-grain map means that the associated region may contain references to the owning region.

A collection set (CSet) is a set of regions to be collected during a garbage collection. For a young collection, the CSet only includes young regions. For a mixed collection, the CSet includes young and old regions.

If the CSet includes many regions with coarsened RSets (note that "coarsening of RSets" is defined as the transitioning of RSets through different levels of granularity), you will see an increase in scan time for the RSets. These scan times are represented in the GC pause as "Scan RS (ms)" in the GC logs. If the Scan RS times seem high relative to the overall GC pause time, or they appear high for your application, then please look for the text string "Did xyz coarsenings" in your GC log output when using the diagnostic option -XX:+G1SummarizeRSetStats (you can also specify the reporting frequency period, in number of GCs, by setting -XX:G1SummarizeRSetStatsPeriod=period).

Recall from the previous article that "Update RS (ms)" in the GC-pause output shows the time spent updating RSets, and the "Processed Buffers" count shows the number of update buffers processed during the GC pause. If you spot issues in these in your GC logs, then use the above-mentioned options to dive even further into the issues.

Those options can also help identify potential issues with the update log buffers and the concurrent refinement threads.

A sample output of -XX:+G1SummarizeRSetStats with the period set to one (-XX:G1SummarizeRSetStatsPeriod=1):

Concurrent RS processed 784125 cards
Of 4870 completed buffers:
4870 (100.0%) by concurrent RS threads.
0 ( 0.0%) by mutator threads.
Concurrent RS threads times (s)
0.64 0.30 0.26 0.18 0.17 0.16 0.17 0.15 0.15 0.12 0.13
0.08 0.13 0.13 0.12 0.13 0.12 0.11 0.12 0.11 0.12 0.13
0.11
Concurrent sampling threads times (s)
0.00


Total heap region rem set sizes = 199140K. Max = 661K.
Static structures = 660K, free_lists = 15052K.
1009422114 occupied cards represented.
Max size region = 313:(O)[0x000000054e400000,0x000000054e800000,0x000000054e800000], size = 662K, occupied = 1214K.
Did 2759 coarsenings.

The above output shows the count of processed cards and completed buffers. It also shows that the concurrent refinement threads did 100% of the work and the mutator threads did none (which, as I said, is a good sign!). It then lists the concurrent refinement thread times for each thread involved in the work.

The segment in brown shows the cumulative stats since the start of the HotSpot VM. The cumulative stats include the total RSet sizes and max RSet size, the total number of occupied cards, and max-region-size information. It also shows the total number of coarsenings done since the start of the VM.

At this point, I think it is safe to introduce another option flag: -XX:G1RSetUpdatingPauseTimePercent=10. This flag sets a target percentage (defaults to 10% of the pause-time goal) that G1 GC should spend updating RSets during a GC evacuation pause.

You can increase or decrease the percent value so as to spend respectively more or less time updating the RSets during the stop-the-world GC pause, and let the concurrent refinement threads deal with the update buffers accordingly.

Keep in mind that by decreasing the percent value, you are pushing off the work to the concurrent refinement threads, so you will see an increase in concurrent work.

Reference Processing
G1 GC processes references during an evacuation pause and during a remark pause (a part of the multi-phased concurrent marking).

During an evacuation pause, the reference objects are discovered during the object scanning and copying phase and are processed after that. In the GC log, you can see the reference-processing (Ref Proc) time clubbed under the sequential work group called "Other":

[Other: 0.2 ms]
[Choose CSet: 0.0 ms]
[Ref Proc: 0.2 ms]
[Ref Enq: 0.0 ms]
[Free CSet: 0.0 ms]

Note that references with dead referents are added to the pending list, and that time is shown in the GC log as reference-enqueuing time (Ref Enq).

During a remark pause, the discovery happens during the earlier phase of concurrent marking. (Both are a part of the multi-phase concurrent marking cycle. Please refer to the previous article for more information.) The remark phase deals with the processing of the discovered references. In the GC log, you can see the reference-processing (GC ref-proc) time shown in the GC remark section:

0.094: [GC remark 0.094: [GC ref-proc, 0.0000033 secs], 0.0004374 secs]
[Times: user=0.00 sys=0.00, real=0.00 secs]

If you see lengthy times during reference processing, then please turn on parallel reference processing by enabling the following option on the command line: -XX:+ParallelRefProcEnabled.

Evacuation Failure
You might have come across the terms "evacuation failure", "to-space exhausted", "to-space overflow", or maybe something like "promotion failure" during your GC travels. These terms all refer to the same thing, and the concept is similar in G1 GC as well. When there are no more free regions to promote to the old generation or to copy to the survivor space, and the heap cannot expand since it is already at its maximum, an evacuation failure occurs. For G1 GC, an evacuation failure is expensive:

1. For successfully copied objects, G1 needs to update the references and the regions have to be tenured.

2. For unsuccessfully copied objects, G1 will self-forward them and tenure the regions in place.
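A quick first step when you suspect evacuation failures is simply to search the logs for these markers. A minimal sketch of that check (the helper class is made up for illustration; `grep "to-space exhausted" gc.log` does the same job from a shell):

```java
// Count evacuation-failure markers in G1 GC log text. Illustrative helper,
// not a JDK tool; the marker strings are the ones quoted above.
public class EvacFailureScan {
    static long countEvacuationFailures(String log) {
        long count = 0;
        for (String line : log.split("\n")) {
            if (line.contains("to-space exhausted")
                    || line.contains("to-space overflow")) {
                count++;
            }
        }
        return count;
    }
}
```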


So what should you do when you encounter an evacuation failure in your G1 GC logs?

1. Find out if the failures are a side effect of overtuning. Get a simple baseline with min and max heap and a realistic pause-time goal: remove any additional heap sizing such as -Xmn, -XX:NewSize, -XX:MaxNewSize, -XX:SurvivorRatio, etc., and use only -Xms, -Xmx, and a pause-time goal -XX:MaxGCPauseMillis.

2. If the problem persists in the baseline run and if humongous allocations (see next section below) are not the issue, the corrective action is to increase your Java heap size, if you can, of course.

3. If increasing the heap size is not an option and if you notice that the marking cycle is not starting early enough for G1 GC to be able to reclaim the old generation, then drop your -XX:InitiatingHeapOccupancyPercent. The default for this is 45% of your total Java heap. Dropping the value will start the marking cycle earlier. Conversely, if the marking cycle is starting early and not reclaiming much, you should increase the threshold above the default value to make sure that you are accommodating the live data set for your application.

4. Concurrent marking cycles can start on time but take so much time to finish that they delay the mixed garbage-collection cycles, eventually leading to an evacuation failure since the old generation is not reclaimed in time. To avoid this, increase the number of concurrent marking threads using the command-line option -XX:ConcGCThreads.

5. If to-space survivor is the issue, then increase -XX:G1ReservePercent. The default is 10% of the Java heap. G1 GC creates a false ceiling and reserves the memory, in case there is a need for more to-space. Of course, G1 GC caps it off at 50% since we do not want the end user to set it to a very large value.

To help explain the cause of evacuation failure, I want to introduce a useful option: -XX:+PrintAdaptiveSizePolicy. This will provide many ergonomic details that are purposefully kept out of the -XX:+PrintGCDetails option.

Let us look at a log snippet with -XX:+PrintAdaptiveSizePolicy enabled:

6062.121: [GC pause (G1 Evacuation Pause) (mixed) 6062.121: [G1Ergonomics (CSet Construction) start choosing CSet, _pending_cards: 129059, predicted base time: 52.34 ms, remaining time: 147.66 ms, target pause time: 200.00 ms]
6062.121: [G1Ergonomics (CSet Construction) add young regions to CSet, eden: 912 regions, survivors: 112 regions, predicted young region time: 256.16 ms]
6062.122: [G1Ergonomics (CSet Construction) finish adding old regions to CSet, reason: old CSet region num reached min, old: 149 regions, min: 149 regions]
6062.122: [G1Ergonomics (CSet Construction) finish choosing CSet, eden: 912 regions, survivors: 112 regions, old: 149 regions, predicted pause time: 344.87 ms, target pause time: 200.00 ms]
6062.281: [G1Ergonomics (Heap Sizing) attempt heap expansion, reason: region allocation request failed, allocation request: 2097152 bytes]
6062.281: [G1Ergonomics (Heap Sizing) expand the heap, requested expansion amount: 2097152 bytes, attempted expansion amount: 4194304 bytes]
6062.281: [G1Ergonomics (Heap Sizing) did not expand the heap, reason: heap expansion operation failed]
6062.902: [G1Ergonomics (Heap Sizing) attempt heap expansion, reason: recent GC overhead higher than threshold after GC, recent GC overhead: 20.30 %, threshold: 10.00 %, uncommitted: 0 bytes, calculated expansion amount: 0 bytes (20.00 %)]
6062.902: [G1Ergonomics (Concurrent Cycles) do not request concurrent cycle initiation, reason: still doing mixed collections, occupancy: 9596567552 bytes, allocation


request: 0 bytes, threshold: 5798205810 bytes (45.00 %), source: end of GC]
6062.902: [G1Ergonomics (Mixed GCs) continue mixed GCs, reason: candidate old regions available, candidate old regions: 1038 regions, reclaimable: 2612574984 bytes (20.28 %), threshold: 10.00 %]
(to-space exhausted), 0.7805160 secs]

The above snippet carries a lot of information. First, let me highlight a few things from the set of command-line options that were used for the above GC log: -server -Xms12g -Xmx12g -XX:+UseG1GC -XX:NewSize=4g -XX:MaxNewSize=5g

The switches in red show that the user has restricted the nursery to the 4 GB to 5 GB range, thus restricting the adaptability of G1 GC. If G1 needs to drop the nursery to a smaller value, it cannot; if G1 needs to increase the nursery spaces beyond its distribution, it cannot! This is evident from the heap-utilization information printed at the end of this evacuation pause:

[Eden: 3648.0M(3648.0M)->0.0B(3696.0M) Survivors: 448.0M->400.0M Heap: 11.3G(12.0G)->9537.9M(12.0G)]

After the cycle, G1 has to adhere to 4096M as the minimum nursery (-XX:NewSize=4g), out of which, based on G1's calculations, 3696M should be for Eden space and 400M for the survivor space. However, post-collection, the data in the Java heap is already at 9537.9M. So, G1 ran out of to-space! The next two evacuation pauses also result in evacuation failures with the following heap information:

Next mixed evacuation pause 1:

[Eden: 2736.0M(3696.0M)->0.0B(4096.0M) Survivors: 400.0M->0.0B Heap: 12.0G(12.0G)->12.0G(12.0G)]

Next mixed evacuation pause 2:

[Eden: 0.0B(4096.0M)->0.0B(4096.0M) Survivors: 0.0B->0.0B Heap: 12.0G(12.0G)->12.0G(12.0G)]

Finally leading to a full GC:

6086.564: [Full GC (Allocation Failure) 11G->3795M(12G), 15.0980440 secs]
[Eden: 0.0B(4096.0M)->0.0B(4096.0M) Survivors: 0.0B->0.0B Heap: 12.0G(12.0G)->3795.2M(12.0G)]

The full GC could have been avoided by letting the nursery/young gen shrink to the default minimum (5% of the total Java heap). As you may be able to tell, the old generation was big enough to accommodate the live data set (LDS) of 3795M.

However, the LDS coupled with the explicitly set minimum young generation of 4 GB pushed the occupancy to above 7891M. Since the marking threshold was at its default value of 45% of the heap (i.e. around 5529M), the marking cycle started earlier and reclaimed very little during the mixed collections. The heap occupancy kept increasing and another marking cycle started; by the time that marking cycle was done, the mixed GCs kicked in.

The occupancy was already at 11.3G (as seen in the first heap-utilization information). This collection also encountered evacuation failures. Therefore, this issue falls into the overtuning and "starting the marking cycle too early" categories.

Humongous Allocations
One last thing that I would like to cover in this article is a concept that may be new to many end users: humongous objects (H-objs) and G1 GC's handling of them.

Why do we need a different allocation path for H-objs?

For G1 GC, objects are considered humongous if they span 50% or more of G1's region size. A humongous allocation needs a contiguous set of regions. As you can imagine, if G1 were to allocate H-objs in the young generation and they stayed alive long enough, there would be a lot of unnecessary and expensive copying of these H-objs to survivor space (remember, H-objs need contiguous regions), and eventually these H-objs would be promoted to the old generation. To avoid this overhead, H-objs are directly allocated out of the old generation and then categorized or mapped as humongous regions.
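The 50%-of-region-size rule and the power-of-two region sizes make it easy to compute, for a given allocation size, the region size that would keep the allocation out of the humongous path. A sketch of that arithmetic (illustrative only, not JVM code):

```java
// Sketch of the 50%-of-region-size rule discussed above. The rule and the
// 1 MB..32 MB power-of-two region sizes follow the article's description.
public class HumongousCheck {
    static final long MB = 1024 * 1024;

    static boolean isHumongous(long objectBytes, long regionBytes) {
        // humongous when the object spans 50% or more of the region size
        return objectBytes >= regionBytes / 2;
    }

    // Smallest power-of-two region size (1 MB..32 MB) for which the given
    // allocation is no longer humongous, or -1 if no region size is enough.
    static long regionSizeToAvoidHumongous(long objectBytes) {
        for (long region = 1 * MB; region <= 32 * MB; region *= 2) {
            if (!isHumongous(objectBytes, region)) {
                return region;
            }
        }
        return -1;
    }
}
```

For the 4194320-byte allocation request shown later in this article, an 8 MB region is not enough (its half is 4194304 bytes), and the calculation lands on 16 MB, matching the -XX:G1HeapRegionSize=16M advice.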


By allocating H-objs directly in the old generation, G1 avoids including them in any of the evacuation pauses, and thus they are never moved. During a full garbage-collection cycle, G1 GC compacts around the live H-objs. Outside of a full GC, dead H-objs are reclaimed during the cleanup phase of the multi-phased concurrent marking cycle. In other words, H-objs are collected either during the cleanup phase or during a full GC.

Before allocating an H-obj, G1 GC will check if the allocation will cross the initiating heap-occupancy percentage (the marking threshold). If so, G1 GC will initiate a concurrent marking cycle. This is done in this manner since we want to avoid evacuation failures and full garbage-collection cycles as much as possible. As a result, we check as early as possible so as to give the G1 concurrent cycle as much time as possible to complete before there are no more available regions for live-object evacuations.

G1 GC's basic premise is that there are not too many H-objs and that they are long-lived. However, since G1 GC's region size is dependent on your minimum heap size, your "normal" allocation may look humongous to G1 GC. This would lead to lots of H-obj allocations taking regions from the old generation, which would eventually lead to an evacuation failure since G1 would not be able to keep up with those humongous allocations.

Now, you may be wondering how to find out if humongous allocations are leading to evacuation failures. Here, once again, -XX:+PrintAdaptiveSizePolicy will come to your rescue. In your GC log, you will see something like this:

1361.680: [G1Ergonomics (Concurrent Cycles) request concurrent cycle initiation, reason: occupancy higher than threshold, occupancy: 1459617792 bytes, allocation request: 4194320 bytes, threshold: 1449551430 bytes (45.00 %), source: concurrent humongous allocation]

You can tell that a concurrent cycle was requested because there was a humongous allocation request for 4194320 bytes.

This information is helpful, since you can tell not only how many humongous allocations your application made (and whether they were excessive or not), but also the sizes of the allocations. Moreover, if you deem that there were excessive humongous allocations, all you have to do is increase the G1 region size to fit the H-objs as regular ones.

Recall from my last article that G1 regions can span from 1 MB to 32 MB in powers of two. The size of the allocation request in this example is just above 4 MB, so an 8-MB region size is not quite large enough to avoid the humongous allocation. We need to size up to the next power of two, 16 MB. You set that explicitly on the command line: -XX:G1HeapRegionSize=16M

About the Author
Monica Beckwith is the performance architect at a cool startup called Servergy. She has worked in the performance and architecture industry for over 10 years. Prior to Servergy, Monica led Garbage First Garbage Collector performance at Oracle. You can follow Monica on Twitter at @mon_beck.

READ THIS ARTICLE ONLINE ON InfoQ
http://www.infoq.com/articles/tuning-tips-G1-GC

Java Garbage Collection Distilled


By Martin Thompson

Serial, Parallel, concurrent, CMS, G1, young gen, new gen, old gen, PermGen, Eden, tenured, survivor spaces, safepoints, and the hundreds of JVM start-up flags... Does all this baffle you when trying to tune the garbage collector to get the required throughput and latency from your Java application? If it does, then don't worry, you are not alone. Documentation describing garbage collection feels like man pages for an aircraft. Every knob and dial is detailed and explained, but nowhere can you find a guide on how to fly.

This article will attempt to explain the tradeoffs when choosing and tuning garbage-collection (GC) algorithms for a particular workload. The focus will be on the Oracle HotSpot JVM and OpenJDK collectors as those are most commonly used. Towards the end, other commercial JVMs will be discussed to illustrate alternatives.

The Tradeoffs
Wise folk keep telling us, "You don't get something for nothing." When we get something, we usually have to give up something in return. When it comes to garbage collection, we play with three major variables that set targets for the collectors:

1. Throughput: The amount of work done by an application as a ratio of time spent in GC. Target throughput with -XX:GCTimeRatio=99; 99 is the default, equating to 1% GC time.

2. Latency: The time taken by systems in responding to events, which is impacted by pauses introduced by garbage collection. Target latency for GC pauses with -XX:MaxGCPauseMillis=<n>.

3. Memory: The amount of memory our systems use to store state, which is often copied and moved around while being managed. The set of active objects retained by the application at any point in time is known as the live set. Maximum heap size -Xmx<n> is a tuning parameter for setting the heap size available to an application.

Note: Often HotSpot cannot achieve these targets and will silently continue without warning, having missed its target by a great margin.

Latency is a distribution across events. It may be acceptable to have an increased average latency to reduce the worst-case latency or make it less frequent. We should not interpret "real time" to mean the lowest possible latency; rather, it refers to having a deterministic latency regardless of throughput.

For some application workloads, throughput is the most important target. An example would be a long-running batch-processing job. It does not matter if a batch job occasionally pauses for a few seconds while GC takes place, as long as the overall job can be completed sooner.

For virtually all other workloads, from human-facing interactive applications to financial-trading systems,


a system that goes unresponsive for anything more than a few seconds or milliseconds can spell disaster. In financial trading, it is often worthwhile to trade some throughput in return for consistent latency.

We may also have applications that are limited by the amount of physical memory available and have to maintain a footprint, in which case we have to give up performance on both latency and throughput fronts.

Tradeoffs often play out as follows:

• To a large extent, providing the GC algorithms with more memory can reduce the cost of garbage collection as an amortized cost.
• Containing the live set and keeping the heap size small can reduce the observed worst-case latency-inducing pauses due to GC.
• Managing the heap and generation sizes and controlling the application's object-allocation rate can reduce the frequency of pauses.
• Concurrently running the GC with the application, sometimes at the expense of throughput, can reduce the frequency of large pauses.

Object Lifetimes
GC algorithms are often optimized with the expectation that most objects live for a short period of time while relatively few live for long. In most applications, objects that live for a significant period of time tend to constitute a small percentage of objects allocated over time. In GC theory, this observed behavior is often known as "infant mortality" or the "weak generational hypothesis".

For example, loop iterators are mostly short-lived whereas static strings are effectively immortal. Experimentation has shown that generational garbage collectors can usually support an order-of-magnitude greater throughput than non-generational collectors do, and thus are almost ubiquitously used in server JVMs. When separating the generations of objects, a region of newly allocated objects is likely to be sparse for live objects. A collector that scavenges for the few live objects in this new region and copies them to another region for older objects can be very efficient. HotSpot garbage collectors record the age of an object in terms of the number of GC cycles survived.

Note: If your application consistently generates a lot of objects that live for a fairly long time, expect your application to spend a significant portion of its time garbage collecting, and expect to spend a significant portion of your time tuning the HotSpot garbage collectors. The less effective generational filter reduces GC efficiency, and the cost of collecting the longer-living generations increases. Older generations are less sparse, and as a result the efficiency of older-generation-collection algorithms tends to be much lower. Generational garbage collectors tend to operate in two distinct collection cycles: minor collections that collect short-lived objects, and less frequent major collections that collect the older regions.

Stop-the-World Events
The pauses that applications suffer during GC are due to stop-the-world events. For garbage collectors to operate, it is necessary for practical engineering reasons to periodically stop the running application so that memory can be managed. Depending on the algorithms, different collectors will "stop the world" at specific points of execution for varying durations of time.

To bring an application to a total stop, it is necessary to pause all running threads. Garbage collectors do this by signaling the threads to stop when they come to a safepoint, which is a point during program execution at which all GC roots are known and all heap-object contents are consistent. Depending on what a thread is doing, it may take some time to reach a safepoint.

Safepoint checks are normally performed on method returns and loop back edges, but can be optimized in some places, making them more dynamically rare. For example, if a thread is copying a large array, cloning a large object, or executing a monotonic counted loop with a finite bound, it may be many milliseconds before a safepoint is reached. Time to safepoint is an important consideration in low-latency applications. This time can be surfaced by enabling the -XX:+PrintGCApplicationStoppedTime flag in addition to the other GC flags.

When a stop-the-world event occurs, applications with a large number of running threads will


undergo significant scheduling pressure as the threads resume when released from safepoints. Therefore, algorithms with less reliance on stop-the-world events can potentially be more efficient.

Heap Organization in HotSpot
To understand how the different collectors operate, it is best to explore how the Java heap is organized to support generational collectors.

Eden is the region where most objects are initially allocated. The survivor spaces are a temporary store for objects that have survived a collection of the Eden space. Survivor-space usage will be described when minor collections are discussed. Collectively, Eden and the survivor spaces are known as the "young" or "new" generation.

Objects that live long enough are eventually promoted to the tenured space.

The perm generation is where the runtime stores objects it "knows" to be effectively immortal, such as classes and static strings. Unfortunately, the common use of class-loading on an ongoing basis in many applications makes the motivating assumption behind the perm generation (that classes are immortal) wrong. In Java 7, interned strings were moved from permgen to tenured, and Java 8 did away with the perm generation, so it will not be discussed in this article. Most other commercial collectors do not use a separate perm space and tend to treat all long-living objects as tenured.

Note: The virtual spaces allow the collectors to adjust the size of regions to meet throughput and latency targets. Collectors keep statistics for each collection phase and adjust the region sizes accordingly in an attempt to reach the targets.

Object Allocation
To avoid contention, each thread is assigned a thread-local allocation buffer (TLAB) from which it allocates objects. Using TLABs allows object allocation to scale with the number of threads by avoiding contention on a single memory resource. Object allocation via a TLAB is inexpensive; it simply bumps a pointer for the object size, which takes roughly 10 instructions on most platforms. Heap memory allocation for Java is even cheaper than using malloc from the C runtime.

Note: Whereas individual object allocation is inexpensive, the rate at which minor collection must occur is directly proportional to the rate of object allocation.
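The bump-the-pointer idea can be sketched in a few lines. This toy ignores alignment, object headers, and the real TLAB machinery; it only shows why the fast path is so cheap:

```java
// Toy bump-the-pointer allocator in the spirit of a TLAB: allocation is a
// bounds check plus a pointer increment. Sizes are plain byte counts and
// the "pointer" is just an offset into the buffer.
public class ToyTlab {
    final long size;
    long top; // next free offset within the buffer

    ToyTlab(long size) {
        this.size = size;
    }

    // Returns the offset of the allocated block, or -1 when the TLAB is
    // exhausted (a real thread would then request a new TLAB from Eden).
    long allocate(long bytes) {
        if (top + bytes > size) {
            return -1;
        }
        long result = top;
        top += bytes;
        return result;
    }
}
```

Because each thread owns its buffer, no lock or atomic operation is needed on this path; contention only arises when a fresh TLAB must be carved out of Eden.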


When a TLAB is exhausted, a thread simply requests a new one from the Eden space. When Eden has been filled, a minor collection commences.

Large objects (-XX:PretenureSizeThreshold=<n>) may fail to be accommodated in the young generation and thus have to be allocated in the old generation, e.g. a large array. If the threshold is set below TLAB size, then objects that fit in the TLAB will not be created in the old generation. The new G1 collector handles large objects differently and will be discussed later in its own section.

Minor Collections
A minor collection is triggered when Eden becomes full. This is done by copying all the live objects in the new generation to either a survivor space or the tenured space as appropriate. Copying to the tenured space is known as promotion or tenuring. Promotion occurs for objects that are sufficiently old (-XX:MaxTenuringThreshold), or when the survivor space overflows.

Live objects are objects that are reachable by the application; any other objects cannot be reached and can therefore be considered dead. In a minor collection, the copying of live objects is performed by first following what are known as GC roots, and iteratively copying anything reachable to the survivor space. GC roots normally include references from application and JVM-internal static fields, and from thread stack frames, all of which effectively point to the application's reachable object graphs.

In generational collection, the GC roots for the new generation's reachable object graph also include any references from the old generation to the new generation. These references must also be processed to make sure all reachable objects in the new generation survive the minor collection. Identifying these cross-generational references is achieved by use of a card table. The HotSpot card table is an array of bytes in which each byte is used to track the potential existence of cross-generational references in a corresponding 512-byte region of the old generation. As references are stored to the heap, store-barrier code will mark cards to indicate that a potential reference from the old generation to the new generation may exist in the associated 512-byte heap region. At collection time, the card table is used to scan for such cross-generational references, which effectively represent additional GC roots into the new generation. A significant fixed cost of minor collections is directly proportional to the size of the old generation.

There are two survivor spaces in the HotSpot new generation, which alternate in their to-space and from-space roles. At the beginning of a minor collection, the to-space survivor space is always empty and acts as a target copy area for the minor collection. The previous minor collection's target survivor space is part of the from-space, which also includes Eden, where live objects that need to be copied may be found.

The cost of a minor GC is usually dominated by the cost of copying objects to the survivor and tenured spaces. Objects that do not survive a minor collection are effectively free to be dealt with. The work done during a minor collection is directly proportional to the number of live objects found, and not to the size of the new generation. The total time spent doing minor collections can be almost halved each time the Eden size is doubled. Memory can therefore be traded for throughput. A doubling of Eden size will result in an increase in collection time per collection cycle, but this is relatively small if both the number of objects being promoted and the size of the old generation are constant.

Note: In HotSpot, minor collections are stop-the-world events. This is rapidly becoming a major issue as our heaps get larger with more live objects. We are already starting to see the need for concurrent collection of the young generation to reach pause-time targets.

Major Collections
Major collections collect the old generation so that objects can be promoted from the young generation. In most applications, the vast majority of program state ends up in the old generation. The greatest variety of GC algorithms exists for the old generation. Some will compact the whole space when it fills, whereas others will collect concurrently with the application to try and prevent it from filling up.
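The card-marking store barrier described above amounts to a shift and a byte store. A toy version of it, using plain offsets in place of real heap addresses (not HotSpot's barrier code):

```java
// Sketch of a card table: one byte tracks each 512-byte region of the old
// generation, and a reference store marks the covering card so the minor
// collection can later scan only dirty cards for cross-generational roots.
public class ToyCardTable {
    static final int CARD_SHIFT = 9; // 2^9 = 512-byte cards, as in the text
    final byte[] cards;

    ToyCardTable(long oldGenBytes) {
        cards = new byte[(int) (oldGenBytes >>> CARD_SHIFT)];
    }

    // store barrier: mark the card covering the address being written to
    void onReferenceStore(long oldGenOffset) {
        cards[(int) (oldGenOffset >>> CARD_SHIFT)] = 1;
    }

    boolean isDirty(long oldGenOffset) {
        return cards[(int) (oldGenOffset >>> CARD_SHIFT)] != 0;
    }
}
```

This also makes the imprecision concrete: a single store dirties the whole 512-byte card, so the collector must scan every object in that card's span at collection time.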


The old-generation collector will try to predict when it needs to collect to avoid a promotion failure from the young generation. The collectors track a fill threshold for the old generation and begin collection when this threshold is passed. If this threshold is not sufficient to meet promotion requirements, then a FullGC is triggered. A FullGC involves promoting all live objects from the young generations followed by a collection and compaction of the old generation. Promotion failure is a very expensive operation as state and promoted objects from this cycle must be unwound so the FullGC can occur.

Note: To avoid promotion failure, you will need to tune the padding that the old generation allows to accommodate promotions (-XX:PromotedPadding=<n>).

Note: When the heap needs to grow, a FullGC is triggered. These heap-resizing FullGCs can be avoided by setting -Xms and -Xmx to the same value.

Other than a FullGC, a compaction of the old generation is likely to be the largest stop-the-world pause an application will experience. The time for this compaction tends to grow linearly with the number of live objects in the tenured space.

The rate at which the tenured space fills up can sometimes be reduced by increasing the size of the survivor spaces and raising the age limit of objects to be promoted to the tenured generation. However, increasing the size of the survivor spaces and the object promotion age in minor collections (-XX:MaxTenuringThreshold) can also increase the cost and pause times in the minor collections due to the increased copy cost between survivor spaces on minor collections.

Serial Collector
The Serial collector (-XX:+UseSerialGC) is the simplest collector and is a good option for single-processor systems. It also has the smallest footprint of any collector. It uses a single thread for both minor and major collections. Objects are allocated in the tenured space using a simple bump-the-pointer algorithm. Major collections are triggered when the tenured space is full.

Parallel Collector
The Parallel collector comes in two forms. The Parallel collector (-XX:+UseParallelGC) uses multiple threads to perform minor collections of the young generation and a single thread for major collections on the old generation. The Parallel Old collector (-XX:+UseParallelOldGC), the default since Java 7u4, uses multiple threads for minor collections and for major collections. Objects are allocated in the tenured space using a simple bump-the-pointer algorithm. Major collections are triggered when the tenured space is full.

On multiprocessor systems, the Parallel Old collector will give the greatest throughput of any collector. It has no impact on a running application until a collection occurs, and then will collect in parallel using multiple threads using the most efficient algorithm. This makes the Parallel Old collector suitable for batch applications.

The cost of collecting the old generations is affected by the number of objects to retain more than by the size of the heap. The efficiency of the Parallel Old collector can be increased to achieve greater throughput by providing more memory and accepting larger but fewer collection pauses.

Expect the fastest minor collections with this collector because the promotion to tenured space is a simple bump-the-pointer and copy operation.

For server applications, the Parallel Old collector should be the first port of call. However, if the major-collection pauses are more than your application can tolerate, then you need to consider employing a concurrent collector that collects the tenured objects concurrently while the application is running.

Note: Expect pauses in the order of one to five seconds per GB of live data on modern hardware while the old generation is compacted.

Concurrent Mark Sweep (CMS) Collector
The CMS (-XX:+UseConcMarkSweepGC) collector runs in the old generation, collecting tenured objects that are no longer reachable during a major collection. It runs concurrently with the application with the goal of keeping sufficient free space in the old generation so that a promotion failure from the young generation does not occur.
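One consequence of CMS's design, described below, is that promotion must search free lists rather than bump a pointer. A toy first-fit search shows where the extra cost and the fragmentation risk come from (structures invented for illustration, not HotSpot code):

```java
import java.util.ArrayList;
import java.util.List;

// Toy free-list allocator contrasting with bump-the-pointer promotion:
// each promotion searches for a suitably sized hole in the old generation.
public class ToyFreeList {
    static class Hole {
        long offset;
        long size;
        Hole(long offset, long size) {
            this.offset = offset;
            this.size = size;
        }
    }

    final List<Hole> holes = new ArrayList<>();

    void free(long offset, long size) {
        holes.add(new Hole(offset, size));
    }

    // First-fit search; returns the offset for the promoted object, or -1
    // when no single hole fits, i.e. the fragmentation case that forces a
    // compacting FullGC.
    long promote(long size) {
        for (Hole h : holes) {
            if (h.size >= size) {
                long result = h.offset;
                h.offset += size;
                h.size -= size;
                return result;
            }
        }
        return -1;
    }
}
```

Even when plenty of total free space remains, a promotion can fail if no individual hole is large enough, which is exactly why a non-compacting collector degrades as the old generation fragments.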


Promotion failure will trigger a FullGC. CMS follows a multistep process:

1. Initial mark <stop-the-world>: Find GC roots.
2. Concurrent mark: Mark all reachable objects from the GC roots.
3. Concurrent pre-clean: Check for object references that have been updated and objects that have been promoted during the concurrent mark phase by remarking.
4. Re-mark <stop-the-world>: Capture object references that have been updated since the pre-clean stage.
5. Concurrent sweep: Update the free lists by reclaiming memory occupied by dead objects.
6. Concurrent reset: Reset data structures for the next run.

As tenured objects become unreachable, the space is reclaimed by CMS and put on free lists. When promotion occurs, the free lists must be searched for a suitably sized hole for the promoted object. This increases the cost of promotion and thus the cost of minor collections compared to the Parallel collector.

Note: CMS is not a compacting collector, which over time can result in old-generation fragmentation. Object promotion can fail because a large object may not fit in the available holes in the old generation. When this happens a "promotion failed" message is logged and a FullGC is triggered to compact the live tenured objects. For such compaction-driven FullGCs, expect pauses to be worse than major collections using the Parallel Old collector because CMS uses a single thread for compaction.

CMS is mostly concurrent with the application, which has a number of implications. First, CPU time is taken by the collector, thus reducing the CPU available to the application. The amount of time required by CMS grows linearly with the amount of object promotion to the tenured space. Second, for some phases of the concurrent GC cycle, all application threads have to be brought to a safepoint for marking GC roots and performing a parallel re-mark to check for mutation.

Note: If an application sees significant mutation of tenured objects then the re-mark phase can be significant. At the extremes it may take longer than a full compaction with the Parallel Old collector.

CMS makes FullGC a less frequent event at the expense of reduced throughput, more expensive minor collections, and greater footprint. The reduction in throughput can be anything from 10%-40% compared to the Parallel collector, depending on promotion rate. CMS also requires a 20% greater footprint to accommodate additional data structures and "floating garbage" that can be missed during the concurrent marking and that gets carried over to the next cycle.

High promotion rates and resulting fragmentation can sometimes be reduced by increasing the size of both the young and old generation spaces.

Note: CMS can suffer "concurrent mode failures", which can be seen in the logs, when it fails to collect at a sufficient rate to keep up with promotion. This can be caused when the collection commences too late, which can be addressed by tuning. But it can also occur when the collection rate cannot keep up with the high promotion rate or with the high object-mutation rate of some applications.

If the promotion rate or mutation rate of the application is too high then your application might require some changes to reduce the promotion pressure. Adding more memory to such a system can sometimes make the situation worse, as CMS would then have more memory to scan.

Garbage First (G1) Collector
G1 (-XX:+UseG1GC) is a new collector introduced in Java 6 and now officially supported in Java 7. It is a partially concurrent collecting algorithm that also tries to compact the tenured space in smaller incremental stop-the-world pauses, to try and minimize the FullGC events that plague CMS because of fragmentation. G1 is a generational collector that organizes the heap differently from the other collectors by dividing it into fixed-size regions of variable purpose, rather than contiguous regions for the same purpose.

G1 takes the approach of concurrently marking regions to track references between regions, and to focus collection on the regions with the most free space. These regions are then collected in stop-the-world pause increments by evacuating the live objects to an empty region, thus compacting in the process. Objects larger than 50% of a region are allocated in humongous regions that are a multiple of region size. Allocation and collection of humongous objects can be very costly under G1, and to date have had little or no optimization effort applied.

The challenge with any compacting collector is not the moving of objects but the updating of references to those objects. If an object is referenced from many regions then updating those references can take significantly longer than moving the object. G1 tracks which objects in a region have references from other regions via the "remembered sets". If the remembered sets become large, G1 can significantly slow down. When evacuating objects from one region to another, the length of the associated stop-the-world event tends to be proportional to the number of regions with references that need to be scanned and potentially patched.

Maintaining the remembered sets increases the cost of minor collections, resulting in pauses greater than those seen with the Parallel Old or CMS collectors for minor collections.

G1 is target-driven on latency: -XX:MaxGCPauseMillis=<n>, default value = 200ms. The target will influence the amount of work done on each cycle on a best-efforts-only basis. Setting targets in tens of milliseconds is mostly futile, and as of this writing targeting tens of milliseconds has not been a focus of G1.

G1 is a good general-purpose collector for larger heaps that tend to become fragmented, when an application can tolerate pauses in the 0.5-1.0 second range for incremental compactions. G1 tends to reduce the frequency of the worst-case pauses seen by CMS because of fragmentation, at the cost of extended minor collections and incremental compactions of the old generation. Most pauses end up constrained to regional rather than full heap compactions.

Like CMS, G1 can also fail to keep up with promotion rates, and will fall back to a stop-the-world FullGC. Just as CMS has "concurrent mode failure", G1 can suffer an evacuation failure, seen in the logs as "to-space overflow". This occurs when there are no free regions into which objects can be evacuated, which is similar to a promotion failure. If this occurs, try using a larger heap and more marking threads, but in some cases application changes may be necessary to reduce allocation rates.

A challenging problem for G1 is dealing with popular objects and regions. Incremental stop-the-world compaction works well when regions have live objects that are not heavily referenced from other regions. If an object or region is popular then the remembered set will be large, and G1 will try to avoid collecting those objects. Eventually it can have no choice, which results in very frequent mid-length pauses as the heap gets compacted.

Alternative Concurrent Collectors
CMS and G1 are often called mostly concurrent collectors. When you look at the total work performed, it is clear that the young generation, promotion, and even much of the old generation work is not concurrent at all. CMS is mostly concurrent for the old generation; G1 is much more of a stop-the-world incremental collector. Both CMS and G1 have significant and regular stop-the-world events, and their worst-case scenarios often make them unsuitable for strict low-latency applications, such as financial trading or reactive user interfaces.

Alternative collectors are available such as Oracle JRockit Real Time, IBM Websphere Real Time, and Azul Zing. The JRockit and Websphere collectors have latency advantages in most cases over CMS and G1 but often see throughput limitations and still suffer significant stop-the-world events. Zing is the


only Java collector known to this author that can be truly concurrent for collection and compaction while maintaining a high throughput rate for all generations. Zing does have some sub-millisecond stop-the-world events but these are for phase shifts in the collection cycle that are not related to live-set size. JRockit RT can achieve typical pause times in the tens of milliseconds for high allocation rates at contained heap sizes but occasionally has to fall back to full compaction pauses. Websphere RT can achieve single-digit-millisecond pause times via constrained allocation rates and live-set sizes. Zing can achieve sub-millisecond pauses with high allocation rates by being concurrent for all phases, including during minor collections. Zing is able to maintain this consistent behavior regardless of heap size, allowing the user to apply large heap sizes as needed for keeping up with application throughput or object-model-state needs, without fear of increased pause times.

All the concurrent collectors targeting latency force you to give up some throughput and gain footprint. Depending on the efficiency of the concurrent collector, you may give up a little throughput but you are always adding significant footprint. If truly concurrent, with few stop-the-world events, then more CPU cores are needed to enable the concurrent operation and maintain throughput.

Note: All the concurrent collectors tend to function more efficiently when sufficient space is allocated. As a starting-point rule of thumb, you should budget a heap of at least two to three times the size of the live set for efficient operation. However, space requirements for maintaining concurrent operation grow with application throughput and the associated allocation and promotion rates. So higher-throughput applications may warrant a higher heap-size to live-set ratio. Given the huge memory spaces available to today's systems, footprint is seldom an issue on the server side.

Garbage-Collection Monitoring and Tuning
To understand how your application and garbage collector are behaving, start your JVM with at least the following settings:

-verbose:gc
-Xloggc:<filename>
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-XX:+PrintTenuringDistribution
-XX:+PrintGCApplicationConcurrentTime
-XX:+PrintGCApplicationStoppedTime

Then load the logs into a tool like Chewiebug for analysis.

To see the dynamic nature of GC, launch JVisualVM and install the Visual GC plugin. This will enable you to see the GC in action for your application. To understand your application's GC needs, you need representative load tests that can be executed repeatedly. As you get to grips with how each collector works, run your load tests with different configurations as experiments until you reach your throughput and latency targets. It is important to measure latency from the end-user perspective. This can be achieved by capturing the response time of every test request in a histogram, and you can read more about that here. If you have latency spikes that are outside your acceptable range, try to correlate these with the GC logs to determine if GC is the issue. It is possible other issues may be causing latency spikes. Another useful tool to consider is jHiccup, which can be used to track pauses within the JVM and across a system as a whole.

If latency spikes are due to GC then invest in tuning CMS or G1 to see if your latency targets can be met. Sometimes this may not be possible because of high allocation and promotion rates combined with very-low-latency requirements. GC tuning can become a highly skilled exercise that often requires application changes to reduce object-allocation rates or object lifetimes. If this is the case then a commercial trade-off between time and resources spent on GC tuning and application changes versus purchasing one of the commercial concurrent-compacting JVMs, such as JRockit Real Time or Azul Zing, may be required.
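Capturing the response time of every test request in a histogram, as recommended above, need not be elaborate to start with. A minimal sketch with fixed 1 ms buckets (class and method names are illustrative; a real harness would use a proper histogram library such as HdrHistogram, which handles range and precision far better):

```java
public class LatencyHistogram {
    private final long[] buckets;               // one bucket per millisecond
    private static final long BUCKET_NANOS = 1_000_000L;

    public LatencyHistogram(int maxMillis) {
        buckets = new long[maxMillis + 1];
    }

    // Record one response time; anything beyond the range lands in the last bucket.
    public void record(long nanos) {
        int idx = (int) Math.min(nanos / BUCKET_NANOS, buckets.length - 1);
        buckets[idx]++;
    }

    // Smallest bucket (in ms) at which the given fraction of samples is covered.
    public int percentileMillis(double fraction) {
        long total = 0;
        for (long b : buckets) total += b;
        long target = (long) Math.ceil(total * fraction);
        long seen = 0;
        for (int i = 0; i < buckets.length; i++) {
            seen += buckets[i];
            if (seen >= target) return i;
        }
        return buckets.length - 1;
    }

    public static void main(String[] args) {
        LatencyHistogram hist = new LatencyHistogram(1000);
        for (int i = 0; i < 10_000; i++) {
            long t0 = System.nanoTime();
            // doRequest() would go here in a real load test
            hist.record(System.nanoTime() - t0);
        }
        System.out.println("p99 <= " + hist.percentileMillis(0.99) + " ms");
    }
}
```

Outliers in the upper percentiles can then be correlated against the timestamps in the GC log to decide whether GC is the cause of the spikes.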

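When a log-analysis tool is not at hand, GC logs can also be post-processed directly. A rough sketch; the regex below matches the trailing ", 0.0045678 secs]" shape that many HotSpot GC log lines end with, but log formats differ across JVM versions, collectors, and flags, so treat the pattern as an assumption to adapt:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PauseScanner {
    // Matches the pause duration many HotSpot GC log lines report in seconds.
    private static final Pattern PAUSE = Pattern.compile("([0-9]+\\.[0-9]+) secs\\]");

    // Extract every reported pause, converted to milliseconds.
    public static List<Double> pausesMillis(List<String> logLines) {
        List<Double> pauses = new ArrayList<>();
        for (String line : logLines) {
            Matcher m = PAUSE.matcher(line);
            while (m.find()) {
                pauses.add(Double.parseDouble(m.group(1)) * 1000.0);
            }
        }
        return pauses;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
                "[GC (Allocation Failure)  65536K->1234K(251392K), 0.0045678 secs]");
        System.out.println(pausesMillis(sample));
    }
}
```

Sorting the extracted pauses, or feeding them into the histogram from a load test, makes it easy to see whether the worst GC pauses line up with the observed latency spikes.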
About the Author

Martin Thompson is a high-performance and low-latency specialist, with experience gained over two decades working on large-scale transactional and big-data systems. He believes in mechanical sympathy, i.e. applying an understanding of the hardware to the creation of software, as being fundamental to delivering elegant high-performance solutions. The Disruptor framework is just one example of what his mechanical sympathy has created. Martin was the co-founder and CTO of LMAX. He blogs here, and can be found giving training courses on performance and concurrency, or hacking code to make systems better.

READ THIS ARTICLE ONLINE ON InfoQ


http://www.infoq.com/articles/Java_Garbage_Collection_Distilled
