Recently, I unfairly blamed promotion local allocation buffers (PLABs) for fragmentation of the old space under the concurrent mark sweep (CMS) garbage collector. I was very wrong. In this article, I'm going to explain in detail how PLABs really work.
PLAB stands for promotion local allocation buffer. PLABs are used during young collection. Young collection in CMS (and in all other garbage collectors in the HotSpot JVM) is a stop-the-world copy collection. CMS may use multiple threads for young collection, and each of these threads may need to allocate space for objects being copied, either in survivor space or in old space. PLABs are required to avoid contention between threads over the shared data structures managing free memory. Each thread has one PLAB for survivor space and one for old space. Free memory in survivor space is contiguous, and so are survivor PLABs, which are simply contiguous blocks. On the other hand, free memory in old space (under the CMS collector) is fragmented and managed via a sophisticated dictionary of free chunks.
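To make the survivor-space side concrete, here is a toy sketch (class and method names are my own, not HotSpot's) of a survivor PLAB: a contiguous buffer owned by a single GC thread, so copies are served by simple bump-pointer allocation with no locking.

```java
// Toy model of a survivor-space PLAB: one contiguous block per GC thread.
// Because the buffer is thread-local and contiguous, allocation is just a
// pointer bump; no shared free-memory data structure is touched.
public class ToySurvivorPlab {
    private final int capacityWords;
    private int top = 0; // next free word within this thread's buffer

    public ToySurvivorPlab(int capacityWords) {
        this.capacityWords = capacityWords;
    }

    // Returns the offset of the allocated space, or -1 if the PLAB is full
    // (the thread would then grab a fresh PLAB from the shared survivor space).
    public int allocate(int words) {
        if (top + words > capacityWords) return -1;
        int offset = top;
        top += words;
        return offset;
    }
}
```

The old-space PLAB cannot be this simple, because old-space free memory is not one contiguous run; that is what the rest of the article is about.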
Free list space (FLS)
The CMS collector cannot compact old space (actually it can, but compaction involves a long stop-the-world pause, often referred to as a GC freeze). Instead, the memory manager operates with lists of free chunks to manage fragmented free space. As a countermeasure against fragmentation, chunks of free space are grouped by size. If available, a free chunk of exactly the required size is used to serve an allocation request. If chunks of a given size are exhausted, the memory manager splits a larger chunk into several smaller ones to satisfy demand. Consecutive free chunks can also be coalesced to create larger ones (coalescing is done along with sweeping during the concurrent GC cycle). This splitting/coalescing logic is controlled by complex heuristics and per-size chunk demand statistics.
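The exact-fit-then-split idea can be illustrated with a toy size-indexed free list. This is a deliberately simplified model (chunks are represented just by their size in words), not the real CMS data structure:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.TreeMap;

// Toy model of a size-indexed free list: chunks are grouped by size,
// allocation prefers an exact fit, and a larger chunk is split when the
// requested size is exhausted. Illustration only, not HotSpot's code.
public class ToyFreeListSpace {
    private final TreeMap<Integer, Deque<Integer>> bySize = new TreeMap<>();

    public void addFreeChunk(int words) {
        bySize.computeIfAbsent(words, k -> new ArrayDeque<>()).push(words);
    }

    // Returns the size of the chunk handed out, or -1 if nothing fits.
    public int allocate(int words) {
        Deque<Integer> exact = bySize.get(words);
        if (exact != null && !exact.isEmpty()) {
            exact.pop();
            return words;                       // exact fit, the cheap path
        }
        // No exact fit: find the smallest non-empty list of larger chunks.
        Integer larger = bySize.ceilingKey(words + 1);
        while (larger != null && bySize.get(larger).isEmpty()) {
            larger = bySize.ceilingKey(larger + 1);
        }
        if (larger == null) return -1;
        bySize.get(larger).pop();
        int remainder = larger - words;
        if (remainder > 0) addFreeChunk(remainder); // remainder re-indexed
        return words;
    }
}
```

The real implementation adds the reverse operation as well: coalescing adjacent free chunks back into larger ones during the concurrent sweep.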
Old space PLABs
Naturally, old space PLABs mimic the structure of the indexed free list space. Each thread preallocates a certain number of chunks of each size below 257 heap words (larger chunks are allocated from the global space). The number of chunks of each size to preallocate is controlled by statistics. The -XX:+PrintOldPLAB flag enables verbose reporting of old space PLAB sizing (too verbose for production, though).
At the beginning of each young collection, the following lines appear in the GC log.
The first lines are statistics from each scavenger (young collector) thread, in the following format:
tid - GC thread ID,
chunk size - chunk size in heap words,
num_retire - number of free chunks in PLAB at the end of young GC,
num_blocks - number of chunks allocated from FLS to PLAB during young GC,
blocks_to_claim - desired number of blocks to refill PLAB.
The next few lines show the estimated number of chunks (per size) to be preallocated (per GC thread) at the beginning of the next young collection.
Calculating desired blocks to claim
The initial number of blocks (chunks) per chunk size is configured via the -XX:CMSParPromoteBlocksToClaim=<n> JVM command line option (-XX:OldPLABSize is an alias for this option when the CMS GC is used). Unless resizing of old PLABs is disabled with -XX:-ResizeOldPLAB, the desired PLAB size is adjusted after each young GC.
The ideal desired number of blocks per chunk size is calculated by the following formula:
blocks_to_claim_ideal = MIN(-XX:CMSOldPLABMax, MAX(-XX:CMSOldPLABMin, num_blocks / (-XX:ParallelGCThreads × -XX:CMSOldPLABNumRefills)))
but the effective value is exponentially smoothed over time:

blocks_to_claim_next = (1 - w) × blocks_to_claim_prev + w × blocks_to_claim_ideal

where w is configured via -XX:OldPLABWeight (0.5 by default).
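The two formulas above can be worked through numerically. The sketch below uses the documented default of OldPLABWeight = 0.5; the other flag values are hypothetical settings chosen for the example, so check the defaults of your JVM version:

```java
// Worked example of old PLAB "blocks to claim" sizing.
// Flag values below are example settings, not authoritative defaults
// (except OLD_PLAB_WEIGHT = 0.5, which the article states is the default).
public class OldPlabSizing {
    static final int CMS_OLD_PLAB_MIN = 16;       // -XX:CMSOldPLABMin (assumed)
    static final int CMS_OLD_PLAB_MAX = 1024;     // -XX:CMSOldPLABMax (assumed)
    static final int PARALLEL_GC_THREADS = 4;     // -XX:ParallelGCThreads (example)
    static final int CMS_OLD_PLAB_NUM_REFILLS = 4;// -XX:CMSOldPLABNumRefills (example)
    static final double OLD_PLAB_WEIGHT = 0.5;    // -XX:OldPLABWeight default

    // blocks_to_claim_ideal = MIN(max, MAX(min, num_blocks / (threads * refills)))
    static double idealBlocksToClaim(long numBlocks) {
        double raw = (double) numBlocks
                / (PARALLEL_GC_THREADS * CMS_OLD_PLAB_NUM_REFILLS);
        return Math.min(CMS_OLD_PLAB_MAX, Math.max(CMS_OLD_PLAB_MIN, raw));
    }

    // blocks_to_claim_next = (1 - w) * prev + w * ideal
    static double nextBlocksToClaim(double prev, long numBlocks) {
        return (1 - OLD_PLAB_WEIGHT) * prev
                + OLD_PLAB_WEIGHT * idealBlocksToClaim(numBlocks);
    }

    public static void main(String[] args) {
        // 1600 chunks of this size were allocated during the last young GC:
        // ideal = 1600 / (4 * 4) = 100; smoothed with prev = 50 gives 75.0.
        System.out.println(nextBlocksToClaim(50.0, 1600)); // prints 75.0
    }
}
```

The smoothing means a single unusual collection only moves the estimate halfway toward its ideal value, which damps oscillation in the per-size statistics.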
On-the-fly PLAB resizing
During young collection, if the chunk list for a certain size gets exhausted, the thread refills it from the global free space pool (allocating the same number of chunks as at the beginning of the collection). Normally a thread has to refill each chunk list a few times during a collection (-XX:CMSOldPLABNumRefills sets the desired number of refills). However, if the initial estimate was too small, a GC thread will refill its chunk list too often (a refill requires the global memory manager lock, so it may be slow). If on-the-fly PLAB resizing is enabled, the JVM will try to detect such conditions and resize the PLAB in the middle of a young collection.
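The shape of that feedback loop can be sketched as follows. This is a simplified caricature of the heuristic, with my own names and a deliberately crude detection condition; the real HotSpot logic is more involved:

```java
// Simplified sketch of on-the-fly PLAB resizing: if a GC thread has already
// refilled a per-size chunk list far more often than planned, grow the
// refill size by a gain factor, clamped by a ceiling. Names and the exact
// condition are illustrative, not copied from HotSpot.
public class OnTheFlyResizeSketch {
    static final int NUM_REFILLS = 4; // -XX:CMSOldPLABNumRefills (planned refills)
    static final int TOLERANCE   = 4; // -XX:CMSOldPLABToleranceFactor
    static final int REACTIVITY  = 2; // -XX:CMSOldPLABReactivityFactor (gain)
    static final int CEILING     = 10;// -XX:CMSOldPLABReactivityCeiling (clamp)

    // Called on each mid-scavenge refill of a per-size chunk list.
    static int maybeResize(int refillsSoFar, int blocksToClaim) {
        if (refillsSoFar > TOLERANCE * NUM_REFILLS) {
            int gain = Math.min(REACTIVITY, CEILING); // clamp the gain
            return blocksToClaim * gain;              // grow the next refill
        }
        return blocksToClaim; // within tolerance: keep the current size
    }
}
```

With the example values, a thread that has refilled a list more than 16 times (4 planned refills × tolerance 4) doubles its refill size, cutting the number of trips to the global lock for the rest of the scavenge.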
-XX:+CMSOldPLABResizeQuicker enables on-the-fly PLAB resizing (disabled by default).
A few more options offer additional tuning:
-XX:CMSOldPLABToleranceFactor=4 – tolerance of the phase-change detector for on-the-fly PLAB resizing during a scavenge.
-XX:CMSOldPLABReactivityFactor=2 – gain in the feedback loop for on-the-fly PLAB resizing during a scavenge.
-XX:CMSOldPLABReactivityCeiling=10 – clamping of the gain in the feedback loop for on-the-fly PLAB resizing during a scavenge.
I have spent some time digging through the OpenJDK code to make sure that I'm getting it right now. It was educating. This article has brought up and explained a few more arcane JVM options, though I doubt that I will ever use them in practice. The problem with heap fragmentation is that you have to run an application for a really long time before fragmentation manifests itself. Most of the options above require a trial and error path (even though -XX:+PrintOldPLAB might give you some insight into your application). It is much easier to just give the damn JVM a little more memory (hey, RAM is cheap nowadays) than to spend days tuning arcane options.
Anyway, I hope it was as educational for you as it was for me.