Recently, I unfairly blamed promotion local allocation buffers (PLABs) for
fragmentation of old space under the concurrent mark sweep garbage collector. I was
very wrong. In this article, I'm going to explain in detail how PLABs really
work.
PLABs
PLAB stands for promotion local allocation buffer. PLABs are
used during young
collection. Young collection in CMS (and in all other garbage collectors in
the HotSpot JVM) is a stop-the-world copy collection. CMS may use multiple threads
for young collection; each of these threads may need to allocate space for
objects being copied, either in survivor or in old space. PLABs are required to
avoid contention between threads over the shared data structures managing free memory. Each
thread has one PLAB for survivor space and one for old space. Free memory in
survivor space is contiguous, and so are survivor PLABs, which are simply contiguous
blocks of memory. On the other hand, free memory in old space (under the CMS collector) is
fragmented and managed via a sophisticated dictionary of free chunks.
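The role of a survivor PLAB can be illustrated with a toy bump-pointer allocator. This is a simplified sketch of the idea only (the class and method names are my own invention, not HotSpot code): each GC thread owns a contiguous buffer and allocates by bumping an index, so copying objects requires no synchronization until the buffer runs out.

```java
// Toy model of a survivor-space PLAB: a contiguous block owned by a single
// GC thread, so object copies need no synchronization. Hypothetical sketch,
// not the actual HotSpot implementation.
class SurvivorPlab {
    private final long[] buffer; // stand-in for a block of heap words
    private int top = 0;         // bump pointer

    SurvivorPlab(int sizeInWords) {
        buffer = new long[sizeInWords];
    }

    /**
     * Returns the offset of the allocated chunk, or -1 if the PLAB is
     * exhausted (the thread must then claim a fresh PLAB from the shared pool).
     */
    int allocate(int words) {
        if (top + words > buffer.length) return -1;
        int offset = top;
        top += words;
        return offset;
    }

    int wordsUsed() { return top; }
}
```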
Free list space (FLS)
The CMS collector cannot compact old space (actually it can, but
compaction involves a long stop-the-world pause, often referred to as a GC freeze). The memory
manager operates with lists of free chunks to manage fragmented free space. As
a countermeasure against fragmentation, chunks of free space are grouped by size. If available, a free chunk of exactly the required
size will be used to serve an allocation request. If chunks of a given size are
exhausted, the memory manager will split a larger chunk into several smaller ones to
satisfy demand. Consecutive free chunks can also be coalesced to create larger
ones (coalescing is done along with sweeping during the concurrent GC cycle). This
split/coalesce logic is controlled by complex heuristics and per-size chunk
demand statistics.
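The exact-fit-then-split behavior described above can be sketched with a toy size-indexed free list. This is a made-up illustration of the idea, not CMS's actual dictionary; coalescing is omitted for brevity, and chunk "addresses" are fake:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy model of a size-indexed free list: free chunks are grouped by size,
// allocation prefers an exact fit, and a larger chunk is split on demand.
class IndexedFreeLists {
    private final Deque<Integer>[] listsBySize; // value = fake chunk address
    private int nextFakeAddress = 0;

    @SuppressWarnings("unchecked")
    IndexedFreeLists(int maxSize) {
        listsBySize = new Deque[maxSize + 1];
        for (int i = 0; i <= maxSize; i++) listsBySize[i] = new ArrayDeque<>();
    }

    void addFreeChunk(int size) { listsBySize[size].push(nextFakeAddress++); }

    int countFree(int size) { return listsBySize[size].size(); }

    /** Exact fit if possible, otherwise split the next larger chunk found. */
    boolean allocate(int size) {
        if (!listsBySize[size].isEmpty()) {
            listsBySize[size].pop();          // exact fit
            return true;
        }
        for (int larger = size + 1; larger < listsBySize.length; larger++) {
            if (!listsBySize[larger].isEmpty()) {
                listsBySize[larger].pop();    // split: take what we need,
                addFreeChunk(larger - size);  // put the remainder back
                return true;
            }
        }
        return false; // nothing large enough is free
    }
}
```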
Old space PLABs
Naturally old space PLABs mimic structure of indexed free
list space. Each thread preallocates certain number of chunk of each size below
257 heap words (large chunk allocated from global space). Number of chunks of
each size to be preallocated is controlled by statistics. Following JVM flag
will enabled verbose reporting of old space PLAB sizing (too verbose for
production though).
-XX:+PrintOldPLAB
At the beginning of each young collection, lines like the
following appear in the GC log:
6.347: [ParNew ...
...
0[10]: 722/5239/897
0[12]: 846/5922/987
0[14]: 666/5100/850
...
1[12]: 229/3296/987
1[14]: 2/2621/850
1[16]: 69/1812/564
1[18]: 247/1160/290
...
[10]: 905
[12]: 1002
[14]: 865
[16]: 567
...
The first lines are statistics from each scavenger (young collection) thread,
in the following format:
<tid>[<chunk size>]: <num_retire>/<num_blocks>/<blocks_to_claim>
tid - GC thread ID,
chunk size - chunk size in heap words,
num_retire - number of free chunks left in the PLAB at the end of the young GC,
num_blocks - number of chunks allocated from the FLS into the PLAB
during the young GC,
blocks_to_claim - desired number of blocks to refill the PLAB.
The next few lines show the estimated number of chunks (per size) to be
preallocated (per GC thread) at the beginning of the next young collection:
[<chunk size>]: <blocks_to_claim>
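For experimentation, a per-thread statistics line in the format above can be parsed with a small helper like the one below. This is a hypothetical utility of my own, not part of any JVM tooling; only the log format itself comes from the article.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Parses a per-thread -XX:+PrintOldPLAB statistics line of the form
// "<tid>[<size>]: <num_retire>/<num_blocks>/<blocks_to_claim>".
class OldPlabLine {
    private static final Pattern FORMAT =
        Pattern.compile("(\\d+)\\[(\\d+)\\]:\\s*(\\d+)/(\\d+)/(\\d+)");

    final int tid, chunkSize, numRetire, numBlocks, blocksToClaim;

    private OldPlabLine(int tid, int chunkSize, int numRetire,
                        int numBlocks, int blocksToClaim) {
        this.tid = tid;
        this.chunkSize = chunkSize;
        this.numRetire = numRetire;
        this.numBlocks = numBlocks;
        this.blocksToClaim = blocksToClaim;
    }

    /** Returns null for lines that are not per-thread PLAB statistics. */
    static OldPlabLine parse(String line) {
        Matcher m = FORMAT.matcher(line.trim());
        if (!m.matches()) return null;
        return new OldPlabLine(Integer.parseInt(m.group(1)),
                               Integer.parseInt(m.group(2)),
                               Integer.parseInt(m.group(3)),
                               Integer.parseInt(m.group(4)),
                               Integer.parseInt(m.group(5)));
    }
}
```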
Calculating desired blocks to claim
The initial number of blocks (chunks) per chunk size is configured via the -XX:CMSParPromoteBlocksToClaim=<n> JVM command line option
(-XX:OldPLABSize is an alias for this
option when the CMS collector is used). Unless resizing of old PLABs is disabled via the -XX:-ResizeOldPLAB
option, the
desired PLAB size is adjusted after each young GC.
The ideal desired number per chunk size is calculated by the following formula:
blocks_to_claim_ideal = MIN(CMSOldPLABMax, MAX(CMSOldPLABMin, num_blocks / (ParallelGCThreads * CMSOldPLABNumRefills)))
but the effective value is exponentially smoothed over time:
blocks_to_claim_next = (1 - w) * blocks_to_claim_prev + w * blocks_to_claim_ideal
where w is configured via -XX:OldPLABWeight (0.5 by default).
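These two formulas can be turned into a small worked example. The sketch below assumes the default flag values (CMSOldPLABMin=16, CMSOldPLABMax=1024, CMSOldPLABNumRefills=4, OldPLABWeight=0.5); the class itself is illustrative, not JVM code.

```java
// Worked example of the old PLAB sizing formulas, using HotSpot's default
// flag values. Illustrative sketch only, not actual JVM code.
class OldPlabSizing {
    static final long MIN = 16;        // -XX:CMSOldPLABMin default
    static final long MAX = 1024;      // -XX:CMSOldPLABMax default
    static final long NUM_REFILLS = 4; // -XX:CMSOldPLABNumRefills default
    static final double WEIGHT = 0.5;  // -XX:OldPLABWeight default

    /** blocks_to_claim_ideal, clamped between CMSOldPLABMin and CMSOldPLABMax. */
    static long idealBlocksToClaim(long numBlocks, long parallelGcThreads) {
        long raw = numBlocks / (parallelGcThreads * NUM_REFILLS);
        return Math.min(MAX, Math.max(MIN, raw));
    }

    /** Exponential smoothing of the desired value across collections. */
    static double smoothedBlocksToClaim(double prev, long ideal) {
        return (1 - WEIGHT) * prev + WEIGHT * ideal;
    }
}
```

For instance, with num_blocks = 5922 and 2 parallel GC threads, the ideal value is 5922 / (2 * 4) = 740; smoothing it against a previous desired value of 987 gives (987 + 740) / 2 = 863.5.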
On-the-fly PLAB resizing
During a young collection, if the chunk list of a certain size gets
exhausted, the thread refills it from the global free space pool (allocating the same
number of chunks as at the beginning of the collection). Normally a thread will have to
refill each chunk list a few times during a collection (-XX:CMSOldPLABNumRefills
sets the desired number of refills). However, if the initial estimate was too small, the GC
thread will refill its chunk lists too often (a refill requires the global lock of the
memory manager, so it may be slow). If on-the-fly PLAB resizing is enabled, the JVM
will try to detect such conditions and resize the PLAB in the middle of a young
collection.
-XX:+CMSOldPLABResizeQuicker
enables on-the-fly PLAB resizing (disabled by default).
A few more options offer additional tuning:
-XX:CMSOldPLABToleranceFactor=4 –
tolerance of the phase-change detector for on-the-fly PLAB resizing during a
scavenge.
-XX:CMSOldPLABReactivityFactor=2 –
gain in the feedback loop for on-the-fly PLAB resizing В during a scavenge.
-XX:CMSOldPLABReactivityCeiling=10 –
clamping of the gain in the feedback loop for on-the-fly PLAB resizing during a
scavenge.
Conclusion
I have spent some time digging through the OpenJDK code to make
sure that I really understand this mechanism now. It was educational. This article has
brought up and explained a few more arcane JVM options, though I doubt that I will ever use them in
practice. The problem with heap fragmentation is that you have to run an application
for a really long time before fragmentation manifests itself. Most of the options
above require a trial and error approach (even though -XX:+PrintOldPLAB
might give you some insight into your application). It is much easier just to
give the damn JVM a little more memory (hey, RAM is cheap nowadays) than to spend days
tuning arcane options.
Anyway, I hope this was as educational for you as it was for me.
Comments
Comment: The equations are all mangled. Display chars are incorrect.
Comment: When you said "Young collection in CMS (and all other garbage collectors in HotSpot JVM) is a stop-the-world copy collection", didn't you mean ParNewGC's parallel copy collection?
Author: Correct, ParNewGC is a parallel stop-the-world copy collection. In GC land "parallel" means stop-the-world + multithreaded. (Thanks for pointing out the mangled chars.)
Comment: Hello, thank you for this informative article. I have a question: in your previous article, "Java GC, HotSpot's CMS and heap fragmentation", you said that "Concurrent Mark Sweep is used only to collect old space." Then what do you mean by "young collection in CMS"? Do you mean the ParNew or Serial GC algorithm used together with CMS?
Author: Old space is collected concurrently. Young space is collected by a stop-the-world copy collector, either in parallel (ParNew) or in a single thread (DefNew). See http://aragozin.blogspot.com/2011/09/hotspot-jvm-garbage-collection-options.html for a list of possible combinations of young and old space collectors.
Comment: Thanks for sharing, really arcane. This article helped me understand PLABs.
Comment: "I have unfairly blamed promotion local allocation buffers (PLAB) for fragmentation of old space using concurrent mark sweep garbage collector. I was very wrong." Hmmm... can you elaborate on what you mean by being wrong? I think I missed that aspect of the article.
Author: Follow the link just before that phrase. The linked article contains a few paragraphs explaining how PLABs defeat the effect of the FLS. That turned out to be wrong, and the comments to that article explain why.
Comment: I want to ask about the difference between these two parameters, num_blocks and blocks_to_claim. I think one is the actual number of chunks and the other is just an expected number.
Author: If you want an accurate answer, the only way is to consult the OpenJDK source. It has been some time since I investigated this code and I'm afraid of being wrong.