Wednesday, December 14, 2011

Data Grid Pattern - Canary key set

Most modern in-memory-data-grid products have grown out of distributed caching. For a traditional cache, loss of data is not a big deal: missing data can always be recovered from backend storage via read-through. The main focus of a distributed cache is data coherence between nodes (preventing stale reads, etc).
Advanced patterns like all-in-memory and proactive caching may provide considerable benefits over a traditional read-through cache. But a simple fallback to read-through as a data recovery strategy may not be an option for these advanced cache usages. Read-through has two prerequisites:
  • data should be accessed by primary key,
  • for a given primary key, you should know its master data source.
Both of these prerequisites may be broken in an advanced solution. Having all data in memory allows you to execute queries in the cache layer instead of the backend database. Products like Oracle Coherence, GemStone GemFire and GigaSpaces have advanced querying capabilities (including support for non-primary indexes). Offloading queries from the database is a huge win, but the price is that you cannot rely on read-through any more. If some data is missing from the cache, queries will produce incomplete results without warning.
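As an illustration, here is a minimal sketch of such a cache-layer query using the Coherence filter API (the "trades" cache and the getCurrency accessor are made up for this example):

import com.tangosol.net.CacheFactory;
import com.tangosol.net.NamedCache;
import com.tangosol.util.filter.EqualsFilter;
import java.util.Set;

public class CacheQueryExample {
    public static void main(String[] args) {
        // hypothetical "trades" cache holding objects exposing getCurrency()
        NamedCache trades = CacheFactory.getCache("trades");
        // the query is executed by the grid nodes, the backend database is not touched;
        // if part of the data set has been lost, the result is silently incomplete
        Set entries = trades.entrySet(new EqualsFilter("getCurrency", "USD"));
        System.out.println("Matched " + entries.size() + " trades");
    }
}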
The second read-through prerequisite may also be sacrificed, e.g. by using multiple backends (the cache acting as an aggregator for a data set scattered across multiple databases). You can find more details in my previous article.

Data loss imminent

Please note that loss of data in a modern in-memory-data-grid is an exceptional event. Data is usually protected by multiple replicas, and the grid can tolerate server failures. But it is still possible, and you cannot ignore this aspect, just as you cannot ignore e.g. backing up your database.

Despite all the reliability provided by data grid technology, data may be lost, and that means it will be lost eventually. The next question is: what is your desired strategy for coping with an incomplete data set?
It depends on the type of application.
  • For some applications, it is ok to return incomplete results during the recovery window.
  • For other applications, an incomplete response is worse than no response. The application should guarantee that every response is complete, and if it cannot provide a complete response (e.g. part of the data set is missing from the cache), it should raise an error.
The first strategy is rather simple: monitor your grid and automatically trigger a recovery procedure if a disastrous event is detected.
The second strategy is more tricky to implement. Monitoring alone is not an option: there will be a gap between the data loss event and the reaction of the monitoring system (which could e.g. switch the service to offline mode for the duration of the recovery process). Some applications are totally intolerant to inconsistent data. The data, for example, could be used in a complex batch of financial risk calculations (running for a few hours on a large HPC grid), where a single inconsistent piece of input data can invalidate the whole batch of work.
For such applications we need a solution better than monitoring.

Canary keys to detect missing data

We must guarantee that the result of each query is consistent (i.e. all data that has to be processed has been processed; often this means the whole data set). Here we have a paradox on our hands: we must check the presence of certain key/value pairs in the cache (a data grid is a key/value storage), but we cannot know the keys of these pairs, or even their total number.
The solution to this paradox lies in the basic approach to data distribution used by data grids. I have used the technique described in this article with an Oracle Coherence grid, but the idea can also be applied to other products using a similar type of DHT. Oracle Coherence uses a partitioned distributed hash table (DHT). In practice this means that key/value pairs are not distributed individually; instead, whole partitions are assigned to particular nodes. It also means that you cannot lose an individual key/value pair, only a whole partition at once (if you lose all replicas of that partition at the same time). The number of partitions is fixed for the lifetime of the grid (changing that number requires a rebuild of the DHT).
How can this help ensure data completeness?
Instead of caring about the presence of individual key/value pairs, we can check the presence of all partitions (and we know their exact IDs and total number). But how can we check the presence of a partition (technically a partition cannot be missing, it will just be empty)? Also, we should join the data completeness check with the querying of data in the same operation, otherwise we will always have a gap of uncertainty.
Canary keys are a trick to solve this problem. Canary keys are synthetic keys; you put exactly one such key into each partition. Every partition should have a canary key, so if your grid is configured to have N partitions, it should contain exactly N canary keys. If the number of canary keys is less than N, a portion of the data has been lost (a poor canary has perished) and has not been recovered yet. Of course, your data loading/recovery procedure should put the canary keys back into the cache once the data is restored.
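A minimal sketch of such a check (it assumes the set of N canary keys was prepared upfront; routing exactly one key to each partition is product specific, e.g. via key association in Coherence):

import java.util.Map;
import java.util.Set;

public class CanaryCheck {
    // cacheView is the result of a bulk read of all canary keys,
    // e.g. NamedCache.getAll(canaryKeys) in Coherence
    public static boolean isDataComplete(Map<Object, Object> cacheView,
                                         Set<Object> canaryKeys) {
        // a missing canary means its whole partition was lost
        // and has not been recovered yet
        return cacheView.keySet().containsAll(canaryKeys);
    }
}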
It is also possible (though quite awkward, to be honest) to integrate the canary key check with data querying in a single request. In each query you should select canary keys along with the actual data you need to retrieve. Once the query is executed, you should check the presence of all canaries and then strip them from the result set. If all of them are there, you can be sure that your result set is complete and no SLA will be broken. If some canaries are missing, you should trigger data recovery and/or raise an error indicating that the request cannot be completed until recovery is done.
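A sketch of that combined check is below (all names are hypothetical; the raw result is assumed to come from a single grid query whose filter was widened to also match canary entries):

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class CompleteQuery {

    public static class IncompleteDataException extends RuntimeException {}

    public static List<Object> extractVerified(Set<Map.Entry<Object, Object>> rawResult,
                                               int partitionCount) {
        Set<Object> canariesSeen = new HashSet<Object>();
        List<Object> data = new ArrayList<Object>();
        for (Map.Entry<Object, Object> e : rawResult) {
            if (isCanary(e.getKey())) {
                canariesSeen.add(e.getKey()); // strip canaries from the result set
            } else {
                data.add(e.getValue());
            }
        }
        if (canariesSeen.size() < partitionCount) {
            // some partition has lost its canary - the result cannot be trusted
            throw new IncompleteDataException();
        }
        return data;
    }

    private static boolean isCanary(Object key) {
        // assumes canary keys are marked with a recognizable prefix
        return key instanceof String && ((String) key).startsWith("CANARY:");
    }
}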

Conclusion

The technique described in this article is quite advanced. Most applications do not require such rigorous consistency checks for every operation. But a few do, and for them this approach may be useful (though the implementation may have its quirks).
At a high level, the canary keys technique demonstrates how an understanding of the core principles of a DHT can help solve a challenging task. Understanding low level grid operations, their guarantees and limitations, is a cornerstone of engineering complex data processing solutions that use a distributed, ACID-less data grid as storage.

See also

An open source implementation of the canary keys framework: http://code.google.com/p/gridkit/wiki/DataLossListener.

Sunday, December 4, 2011

Garbage collection in HotSpot JVM


In this blog you may find a few in-depth articles related to garbage collection in Oracle's HotSpot and other JVMs. But if you are a novice in the field of garbage collection, you may feel lost in the level of detail. So I decided that a high level overview of GC modes in HotSpot would complement the existing, fairly detailed content nicely.

Foreword

Java (and the vast majority of other modern languages) features automatic memory management, aka garbage collection. Once an object instance becomes unreachable from the executing program, it is classified as garbage and, eventually, its memory should be returned to the free pool.
One of the simplest approaches to automatic memory management is reference counting. It has a serious limitation though: an inability to handle cyclic links in the object graph.
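A trivial illustration of such a cycle (a pure reference counting collector would never reclaim these two objects, since each keeps the other's count above zero):

public class CycleExample {
    static class Node {
        Node other;
    }

    public static void main(String[] args) {
        Node a = new Node();
        Node b = new Node();
        a.other = b; // a references b
        b.other = a; // b references a - a cycle
        a = null;
        b = null;
        // both objects are now unreachable, yet their reference counts
        // would still be 1 each; a tracing collector (like HotSpot's)
        // reclaims them anyway
    }
}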
If we want to find all effectively unreachable objects, the only way is to find all reachable ones via recursive object graph traversal. It may sound like a complicated task, and it really is.
Garbage collection in modern JVMs, including Oracle's HotSpot, is the result of a lengthy evolution. A huge amount of hard work was put into them to make them as efficient as they are now. Of course they may have limitations and deficiencies, but most of the time these are due to tradeoffs which were made for a reason. So, please think twice before blaming the JVM garbage collector for being slow or stupid. Most likely you are just underestimating the complexity of its work.
Ok, now, let me get straight to garbage collection in the HotSpot JVM circa 2011.

HotSpot JVM circa 2011

HotSpot has several GC modes, controlled with JVM command line options. The default GC mode depends on the JVM version, the client/server mode of the JVM, and the hardware you are running on (the JVM distinguishes server and desktop grade hardware by a set of heuristics).

Serial GC

JVM switch: -XX:+UseSerialGC
Serial GC is a generational garbage collection algorithm (if you wonder what a generation is, read this article).
Actually, all GC modes in HotSpot use the generational approach, so I won't repeat this for every GC mode.
Young collection is a copy collection. Old space is collected using an implementation of the mark/sweep/compact (MSC) algorithm. Both young and old collections require a stop-the-world (STW) pause and, as the name suggests, are executed by a single thread. During old space collection, all live objects are moved to the beginning of the space. This allows the JVM to return unused memory to the OS.

If you enable GC logging with -XX:+PrintGCDetails, you will see the following indicators of GC pauses in the log:
Young collection
41.614 [GC 41.614: [DefNew: 130716K->7953K(138240K), 0.0525908 secs] 890546K->771614K(906240K), 0.0527947 secs] [Times: user=0.05 sys=0.00, real=0.05 secs]
Full (young + old + perm) collection
41.908 [GC 41.908: [DefNew: 130833K->130833K(138240K), 0.0000257 secs]41.909: [Tenured: 763660K->648667K(768000K), 1.4323505 secs] 894494K->648667K(906240K), [Perm : 1850K->1850K(12288K)], 1.4326801 secs] [Times: user=1.42 sys=0.00, real=1.43 secs]
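Log lines like the ones above can be reproduced with a trivial allocation loop (a sketch; run it with e.g. -XX:+UseSerialGC -XX:+PrintGCDetails and a small heap like -Xmx64m to see frequent collections):

import java.util.ArrayList;
import java.util.List;

public class AllocationChurn {
    public static void main(String[] args) {
        List<byte[]> retained = new ArrayList<byte[]>();
        for (int i = 0; i < 10000000; i++) {
            byte[] chunk = new byte[64 * 1024]; // mostly short-lived garbage
            if (i % 50 == 0) {
                retained.add(chunk); // a fraction of objects survives and gets promoted
            }
            if (retained.size() > 500) {
                retained.clear(); // drop old space content, provoking full collections
            }
        }
    }
}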

Parallel scavenge

JVM switch: -XX:+UseParallelGC
Some phases of garbage collection can naturally be parallelized between multiple threads. Parallel processing can reduce the time required for GC, and thus the STW pause duration, by keeping multiple physical CPU cores busy. The adoption of multiprocessor/multicore hardware has made parallelization of GC a must for a modern VM.
The parallel scavenge GC mode uses a parallel implementation of the young collection algorithm. Old space is still collected by one thread. Thus, using this mode may shorten young collection pauses (which are more frequent), but it still suffers from long full collection freezes.

Log output samples for parallel scavenge.
Young collection
59.821: [GC [PSYoungGen: 147904K->4783K(148288K)] 907842K->769258K(916288K), 0.2382801 secs] [Times: user=0.31 sys=0.00, real=0.24 secs]
Full collection
60.060: [Full GC [PSYoungGen: 4783K->0K(148288K)] [PSOldGen: 764475K->660316K(768000K)] 769258K->660316K(916288K) [PSPermGen: 1850K->1850K(12288K)], 1.2817061 secs] [Times: user=1.26 sys=0.00, real=1.28 secs]

Parallel old GC

JVM switch: -XX:+UseParallelOldGC
This mode is an incremental improvement over the parallel scavenge mode. It adds parallel processing (a parallel mark-sweep-compact (MSC) algorithm) for old space collection. Young space uses the same algorithm as the mode above. Old space collection still requires a fairly long STW pause, but now multiple cores can be employed to make it shorter. Unlike serial MSC, the parallel version does not create a single contiguous free memory region at the end of the heap, so the JVM cannot return memory to the OS after a full GC in this mode.

Log output samples for parallel old GC.
Young collection
65.411: [GC [PSYoungGen: 147878K->5434K(144576K)] 908129K->770314K(912576K), 0.2734699 secs] [Times: user=0.41 sys=0.00, real=0.27 secs]
Full collection
65.685: [Full GC [PSYoungGen: 5434K->0K(144576K)] [ParOldGen: 764879K->623094K(768000K)] 770314K->623094K(912576K) [PSPermGen: 1850K->1849K(12288K)], 2.5954844 secs] [Times: user=3.95 sys=0.03, real=2.60 secs]

Adaptive size policy

JVM switch: -XX:+UseAdaptiveSizePolicy
This is a special mode of the parallel scavenge collector in which it can dynamically adjust the configuration of young space to adapt to the application. IMHO it does not bring much benefit; I have never seriously tried to use this option.

Concurrent mark sweep

JVM switch: -XX:+UseConcMarkSweepGC
While the collectors above are generally called throughput collectors, concurrent mark sweep (CMS) is a low pause collector. It is designed to minimize stop-the-world JVM pauses and thus keep the application responsive. For young space collection it may use either a serial copy collector or a parallel one (the parallel algorithm is similar to the one in parallel scavenge mode, but they are two totally different code bases and they may use slightly different configuration options, e.g. adaptive size policy is not implemented for CMS).
Old (and, if enabled, permanent) space is collected mostly concurrently. As the name suggests, CMS is a mark/sweep algorithm (notice the lack of compact in its name). CMS requires only two short pauses during each old space collection cycle. But unlike its stop-the-world counterparts, CMS cannot do compaction (relocate objects in memory), and this makes it prone to fragmentation. CMS uses some tricks to fight fragmentation, but it is still a threat.
If the concurrent collector fails to reclaim memory fast enough to keep up with application needs, the JVM will fall back to the serial stop-the-world mark-sweep-compact algorithm to defragment (and compact) old space (notice the word serial: usually such a pause is 50-500 times longer than a normal CMS pause).

Log output samples for CMS.
Young collection
13.154: [GC 13.154: [DefNew: 130821K->8230K(138240K), 0.0507238 secs] 507428K->388797K(906240K), 0.0509611 secs] [Times: user=0.06 sys=0.00, real=0.05 secs]
Concurrent old space collection
13.433: [GC [1 CMS-initial-mark: 384529K(768000K)] 395044K(906240K), 0.0045952 secs] [Times: user=0.02 sys=0.00, real=0.01 secs]
13.438: [CMS-concurrent-mark-start]
...
14.345: [CMS-concurrent-mark: 0.412/0.907 secs] [Times: user=1.20 sys=0.00, real=0.91 secs]
14.345: [CMS-concurrent-preclean-start]
14.366: [CMS-concurrent-preclean: 0.020/0.021 secs] [Times: user=0.03 sys=0.00, real=0.02 secs]
14.366: [CMS-concurrent-abortable-preclean-start]
...
14.707: [CMS-concurrent-abortable-preclean: 0.064/0.340 secs] [Times: user=0.36 sys=0.02, real=0.34 secs]
14.707: [GC[YG occupancy: 77441 K (138240 K)]14.708: [Rescan (non-parallel) 14.708: [grey object rescan, 0.0058016 secs]14.714: [root rescan, 0.0424011 secs], 0.0485593 secs]14.756: [weak refs processing, 0.0000109 secs] [1 CMS-remark: 404346K(768000K)] 481787K(906240K), 0.0487607 secs] [Times: user=0.05 sys=0.00, real=0.05 secs]
14.756: [CMS-concurrent-sweep-start]
...
14.927: [CMS-concurrent-sweep: 0.116/0.171 secs] [Times: user=0.23 sys=0.02, real=0.17 secs]
14.927: [CMS-concurrent-reset-start]
14.953: [CMS-concurrent-reset: 0.026/0.026 secs] [Times: user=0.05 sys=0.00, real=0.03 secs]
The CMS-concurrent-* lines above are the concurrent phases – CMS does this work in parallel with the application. You can find out more about CMS pauses here.
CMS failure and fallback to mark-sweep-compact
557.079: [GC 557.079: [DefNew557.097: [CMS-concurrent-abortable-preclean: 0.010/0.109 secs] [Times: user=0.12 sys=0.00, real=0.11 secs]
 (promotion failed) : 130817K->130813K(138240K), 0.1401674 secs]557.219: [CMS (concurrent mode failure): 731771K->584338K(768000K), 2.4659665 secs] 858916K->584338K(906240K), [CMS Perm : 1841K->1835K(12288K)], 2.6065527 secs] [Times: user=2.48 sys=0.03, real=2.61 secs]
You can read more about failures here.

CMS incremental mode

JVM switch: -XX:+CMSIncrementalMode
CMS uses one or more background threads to do GC in parallel with the application. These threads compete with application threads for CPU cores. Incremental mode limits the amount of CPU time consumed by the background GC threads. This helps to improve application responsiveness if you have just 1 or 2 physical cores. Of course, old space collection cycles will be longer, and the risk of a full collection fallback higher.

G1 garbage collector

JVM switch: -XX:+UseG1GC
G1 (garbage first) is a new garbage collection mode in the HotSpot JVM, introduced in late versions of JDK 6. G1 is a low pause collector implementing an incremental version of the mark-sweep-compact algorithm. G1 breaks the heap into regions of fixed size and can collect only a subset of them (a partial collection) during a stop-the-world (STW) pause (unlike CMS, G1 has to do most of its work during STW). The incremental approach allows G1 to employ a larger number of shorter pauses instead of a smaller number of longer JVM freezes (the cumulative amount of pause time will still be much higher compared to a concurrent collector like CMS). To be accurate, G1 also employs background threads to do heap marking concurrently with the application (similar to CMS), but most of the work is still done during STW.
G1 uses a copy collection algorithm for its partial collections. Thus each collection produces several completely empty regions, which can be returned to the OS.
G1 also exploits the generational principle: a set of regions is considered young space and treated accordingly.
G1 has a lot of hype as a garbage collection silver bullet, but I'm personally quite skeptical about it. Here are a few reasons:
  • G1 has to maintain a few additional data structures to make partial GC possible, which taxes performance.
  • It still does most of the heavy lifting during STW pauses, unlike CMS which is mostly concurrent. IMHO that will hinder G1's ability to scale as well as CMS does with growing heap sizes.
  • Large objects (comparable in size to a region) are problematic for G1 (due to fragmentation).
G1 collections in logs:
G1 young collection
[GC pause (young), 0.00176242 secs]
   [Parallel Time:   1.6 ms]
      [GC Worker Start Time (ms):  15751.4  15751.4]
      [Update RS (ms):  0.1  0.3
       Avg:   0.2, Min:   0.1, Max:   0.3]
         [Processed Buffers : 2 1
          Sum: 3, Avg: 1, Min: 1, Max: 2]
      [Ext Root Scanning (ms):  1.0  0.9
       Avg:   0.9, Min:   0.9, Max:   1.0]
      [Mark Stack Scanning (ms):  0.0  0.0
       Avg:   0.0, Min:   0.0, Max:   0.0]
      [Scan RS (ms):  0.0  0.0
       Avg:   0.0, Min:   0.0, Max:   0.0]
      [Object Copy (ms):  0.3  0.3
       Avg:   0.3, Min:   0.3, Max:   0.3]
      [Termination (ms):  0.0  0.0
       Avg:   0.0, Min:   0.0, Max:   0.0]
         [Termination Attempts : 1 1
          Sum: 2, Avg: 1, Min: 1, Max: 1]
      [GC Worker End Time (ms):  15752.9  15752.9]
      [Other:   0.1 ms]
   [Clear CT:   0.0 ms]
   [Other:   0.1 ms]
      [Choose CSet:   0.0 ms]
   [ 18M->12M(26M)]
 [Times: user=0.00 sys=0.02, real=0.00 secs]  
G1 partial collection
[GC pause (partial), 0.01589707 secs]
   [Parallel Time:  15.6 ms]
      [GC Worker Start Time (ms):  15774.1  15774.2]
      [Update RS (ms):  0.0  0.0
       Avg:   0.0, Min:   0.0, Max:   0.0]
         [Processed Buffers : 0 3
          Sum: 3, Avg: 1, Min: 0, Max: 3]
      [Ext Root Scanning (ms):  1.0  0.7
       Avg:   0.8, Min:   0.7, Max:   1.0]
      [Mark Stack Scanning (ms):  0.0  0.0
       Avg:   0.0, Min:   0.0, Max:   0.0]
      [Scan RS (ms):  0.0  0.1
       Avg:   0.0, Min:   0.0, Max:   0.1]
      [Object Copy (ms):  14.3  14.5
       Avg:  14.4, Min:  14.3, Max:  14.5]
      [Termination (ms):  0.0  0.0
       Avg:   0.0, Min:   0.0, Max:   0.0]
         [Termination Attempts : 3 3
          Sum: 6, Avg: 3, Min: 3, Max: 3]
      [GC Worker End Time (ms):  15789.5  15789.5]
      [Other:   0.4 ms]
   [Clear CT:   0.0 ms]
   [Other:   0.2 ms]
      [Choose CSet:   0.0 ms]
   [ 13M->12M(26M)]
 [Times: user=0.03 sys=0.00, real=0.02 secs]
G1 full collection (incremental mode failure)
32.940: [Full GC 772M->578M(900M), 1.9597901 secs]
 [Times: user=2.29 sys=0.08, real=1.96 secs]

Train GC

Train GC was removed from the HotSpot JVM a long time ago. But because most articles about GC are fairly outdated, you may sometimes find references to it. It is gone, period.

Permanent space

In case you are wondering: permanent space is a part of old space used by the JVM for internal data structures (mostly related to class loading and JIT). Permanent space is not necessarily cleaned on every old space collection iteration, and sometimes you may need additional switches (e.g. -XX:+CMSClassUnloadingEnabled for CMS) to make the JVM collect unused data in PermGen. Normally data in permanent space is, well..., immortal, but the JVM's ability to unload classes makes things complicated.

More reading

The intent of this article was to give you an introduction to garbage collection in the HotSpot JVM. Other articles of mine on the topic that I would recommend:



Monday, November 21, 2011

Using co-segments to dynamically adapt Lucene for frequent queries

Apache Lucene is a magnificent library for working with inverted indexes. While full text search is its primary use case, you may often find applications for Lucene in other areas too. In particular, in data mining, it can be useful for categorizing large volumes of textual information (think of finding trends in news feeds). Data mining is a traditional domain for map/reduce style distributed processing, but if you want to mine data interactively, changing the rules on the fly, map/reduce is not an option. Inverted indexes are better suited for the interactive use case, and Lucene is excellent in that role.

Query complexity problem

Typically "patterns" to be mined (e.g. positive words about certain movie or description matching certain camera model) are represented by fairly complex Lucene queries having large numbers of subqueries.  Another problem is having terms with low selectivity in index which may be participate as a part of complex pattern  (for full text search you usually can just drop them out of index).
These two factors make mining job very hungry for CPU.

Reducing CPU load by adding "synthetic" terms and query rewriting

Different patterns usually share a lot of common subqueries. These subqueries may themselves be quite computationally expensive (e.g. phrase and span queries). Instead of evaluating each subquery every time for every pattern query, it is possible to precalculate them. First, "synthetic" terms are added to the index (I call them hints for short). These hints mark documents matching a particular subquery. Then all queries to be executed should be rewritten with the "synthetic" terms (hints) to let Lucene use that additional information.
Such an optimization may increase query throughput per CPU by a substantial factor. In my practice, for complex queries, CPU utilization can be reduced by a factor of 3-10.
Applying this technique is not so easy, though. There are a few problems to be solved:
  • matching documents against a set of queries to generate synthetic terms,
  • adding synthetic terms to the existing index,
  • rewriting queries to benefit from the precalculated information.
All of them are tricky. Normally, to add a new attribute to a document you have to rebuild the whole Lucene index, which is a heavy and resource intensive task. That is a very bad solution: it is critical to be able to add new subqueries interactively, as users explore new patterns. This technique does not make sense unless additional synthetic terms can be added to the index relatively cheaply, on the fly.

Generating co-segment

A straightforward approach to generating a co-segment for a number of subquery terms would be to test these subqueries against each document in the index. Lucene has MemoryIndex, which is quite handy for that: it allows you to create a single-document index in memory and test any query against it. So far so good, but then you realize that you have to load and parse each document, and it turns out to be prohibitively slow (just 2-5 times faster than rebuilding the whole index).
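For reference, a minimal sketch of testing a single document with MemoryIndex (the field name and analyzer choice are arbitrary; API as of Lucene 3.x):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.memory.MemoryIndex;
import org.apache.lucene.search.Query;
import org.apache.lucene.util.Version;

public class SubqueryTester {
    // returns true if the given document text matches the subquery
    public static boolean matches(String text, Query subquery) {
        MemoryIndex index = new MemoryIndex();
        index.addField("text", text, new StandardAnalyzer(Version.LUCENE_35));
        return index.search(subquery) > 0.0f; // non-zero score means a hit
    }
}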
There is a much, much better way. We can query the main index, get all document IDs matching the subquery, and encode this information into the co-segment.
Querying the index is blazing fast (compared to scanning through documents).
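A sketch of collecting the matching document IDs into a bitset via the Lucene 3.x Collector API (how the bitset is then encoded into the co-segment is left out):

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Scorer;
import org.apache.lucene.util.OpenBitSet;

public class SubqueryDocIdCollector {
    public static OpenBitSet collectDocIds(IndexSearcher searcher, Query subquery,
                                           int maxDoc) throws IOException {
        final OpenBitSet bits = new OpenBitSet(maxDoc);
        searcher.search(subquery, new Collector() {
            private int docBase;
            public void setScorer(Scorer scorer) { /* scores are irrelevant here */ }
            public void collect(int doc) {
                bits.set(docBase + doc); // remember the global document ID
            }
            public void setNextReader(IndexReader reader, int docBase) {
                this.docBase = docBase; // doc IDs are per segment, rebase them
            }
            public boolean acceptsDocsOutOfOrder() {
                return true; // order does not matter for a bitset
            }
        });
        return bits;
    }
}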

Adding co-segments to existing index

A Lucene index is organized into segments, which are stored in the file system. Each segment (a specific group of files) is an inverted index for a subset of documents; the whole index may be divided into several segments. Inverted means that the data in a file is sorted by term. This means that if all our synthetic terms were "greater" than normal terms, we could just append them at the end of an existing index file (instead of rebuilding the whole file). On second thought, we do not need to do anything with the real files at all. Making Lucene think that there are a few more terms in a segment is enough (read: implement your own Directory which merges data behind the scenes, making Lucene "see" the segment plus all its co-segments as a single file).

Rewriting queries

Once we have "synthetic" terms, or hints, in the index, we have to rewrite queries somehow. Queries can be very sophisticated (and they are, otherwise we wouldn't need all these tricks).
A simple idea is just to add a hint clause as an additional top level MUST condition. E.g. query Q will be rewritten to H & Q, where H is the condition over "synthetic" terms. This way we do not need to care about the internals of Q to ensure correctness of the rewrite.
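In Lucene 3.x terms, the naive form of this rewrite would look like the sketch below (the hint field and ID are made up); the next paragraph explains why this alone does not help:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class NaiveHintRewrite {
    // wraps the original pattern query Q into H & Q
    public static Query rewrite(Query original, String hintId) {
        Query hint = new TermQuery(new Term("hint", hintId));
        BooleanQuery rewritten = new BooleanQuery();
        rewritten.add(hint, BooleanClause.Occur.MUST);     // H - precalculated subquery
        rewritten.add(original, BooleanClause.Occur.MUST); // Q - untouched original query
        return rewritten;
    }
}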
Unfortunately, using BooleanQuery to join the original query with a hint clause will not produce any speed improvement. Actually, the way queries are executed in Lucene makes the whole idea of "hinting" very non-trivial to implement.
The solution was to write my own HintAwareQuery to wrap around the "pattern" query (a vanilla Lucene query). This wrapper does all the dirty work. First, it analyzes existing hints and chooses the ones to be used. Second, it optimizes execution of the nested query by making the parts of the search index masked out by the chosen set of hints "invisible" to query execution.

Conclusion

So, what was achieved?
  • Hints can be added to / removed from a "live" index in a matter of seconds,
  • It is transparent to the application - just wrap everything in HintAwareQuery,
  • An order of magnitude speedup for complex queries.
Thanks to Lucene's flexibility, which made such an optimization possible at all!

More ideas

So far, hints are created manually using request statistics from the application search service. An interesting idea would be to automate this process: let the search service itself profile requests and create hints using its own statistics.
Another idea is to use the index masking technique for optimizing queries without hints - e.g. MUST clauses of a top level BooleanQuery could be used instead of "synthetic" hints if they are simple enough (e.g. as simple as a TermQuery). Such a trick could bring a comparable boost for vanilla BooleanQuery without any need for precalculation at all.

Friday, November 4, 2011

Coherence SIG: Advanced usage of indexes

Another presentation of mine from a Coherence SIG, this time in London.
The main theme of the presentation was the internal mechanics of indexes in Coherence. How indexes are stored, how queries are executed, how to create custom filters and indexes - all these topics were covered.

Wednesday, November 2, 2011

Java GC, HotSpot's CMS promotion buffers

Recently, I unfairly blamed promotion local allocation buffers (PLABs) for fragmentation of old space under the concurrent mark sweep garbage collector. I was very wrong. In this article, I'm going to explain in detail how PLABs really work.

PLABs

PLAB stands for promotion local allocation buffer. PLABs are used during young collection. Young collection in CMS (and in all other garbage collectors in the HotSpot JVM) is a stop-the-world copy collection. CMS may use multiple threads for young collection; each of these threads may need to allocate space for objects being copied to either survivor or old space. PLABs are required to avoid competition between threads for the shared data structures managing free memory. Each thread has one PLAB for survivor space and one for old space. Free memory in survivor space is contiguous, and so are survivor PLABs, which are simply contiguous blocks. On the other hand, free memory in old space (under the CMS collector) is fragmented and managed via a sophisticated dictionary of free chunks ...

Free list space (FLS)

The CMS collector cannot compact old space (actually it can, but compaction involves a long stop-the-world pause, often referred to as a GC freeze). The memory manager operates with lists of free chunks to manage fragmented free space. As a countermeasure against fragmentation, chunks of free space are grouped by size. If available, a free chunk of the exact required size will be used to serve an allocation request. If chunks of a given size are exhausted, the memory manager will split a larger chunk into several smaller ones to satisfy demand. Consecutive free chunks can also be coalesced to create larger ones (coalescing is done along with sweeping during the concurrent GC cycle). This splitting/coalescing logic is controlled by complex heuristics and per-size chunk demand statistics.

Old space PLABs

Naturally, old space PLABs mimic the structure of the indexed free list space. Each thread preallocates a certain number of chunks of each size below 257 heap words (larger chunks are allocated from the global space). The number of chunks of each size to be preallocated is controlled by statistics. The following JVM flag enables verbose reporting of old space PLAB sizing (too verbose for production though):
-XX:+PrintOldPLAB
At the beginning of each young collection, you will see the following lines in the GC log:
6.347: [ParNew ...
...
0[10]: 722/5239/897
0[12]: 846/5922/987
0[14]: 666/5100/850
...
1[12]: 229/3296/987
1[14]: 2/2621/850
1[16]: 69/1812/564
1[18]: 247/1160/290
...
[10]: 905
[12]: 1002
[14]: 865
[16]: 567
...
The first lines are statistics from each scavenger (young collector) thread, in the following format:
<tid>[<chunk size>]: <num_retire>/<num_blocks>/<blocks_to_claim>
tid - GC thread ID,
chunk size - chunk size in heap words,
num_retire - number of free chunks left in the PLAB at the end of the young GC,
num_blocks - number of chunks allocated from the FLS to the PLAB during the young GC,
blocks_to_claim - desired number of blocks to refill the PLAB.
For example, the line 0[10]: 722/5239/897 above reads: GC thread 0, for chunks of 10 heap words, had 722 free chunks left at the end of the collection, allocated 5239 chunks from the FLS during the collection, and wants 897 blocks per refill.
The next few lines show the estimated number of chunks (per size) to be preallocated (per GC thread) at the beginning of the next young collection:
[<chunk size>]: <blocks_to_claim>

Calculating desired blocks to claim

The initial number of blocks (chunks) per chunk size is configured via the -XX:CMSParPromoteBlocksToClaim=<n> JVM command line option (-XX:OldPLABSize=<n> is an alias for this option if the CMS GC is used). Unless resizing of old PLABs is disabled with the -XX:-ResizeOldPLAB option, the desired PLAB size will be adjusted after each young GC.
The ideal desired number of blocks per chunk size is calculated by the following formula:
blocks_to_claim(ideal) = MIN(-XX:CMSOldPLABMax, MAX(-XX:CMSOldPLABMin, num_blocks / (-XX:ParallelGCThreads * -XX:CMSOldPLABNumRefills)))
but the effective value is exponentially smoothed over time:
blocks_to_claim(next) = (1 - w) * blocks_to_claim(prev) + w * blocks_to_claim(ideal)
where w is configured via -XX:OldPLABWeight (0.5 by default).
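As a sanity check, the smoothing logic can be expressed in a few lines of Java (a sketch mirroring the formulas above, not the actual HotSpot code; the min/max defaults are assumed):

public class OldPlabSizing {
    // assumed defaults for -XX:CMSOldPLABMin / -XX:CMSOldPLABMax;
    // the weight matches the documented -XX:OldPLABWeight default of 0.5
    static final long MIN_BLOCKS = 16;
    static final long MAX_BLOCKS = 1024;
    static final double WEIGHT = 0.5;

    static double nextBlocksToClaim(double prev, long numBlocks,
                                    int gcThreads, int numRefills) {
        double ideal = Math.min(MAX_BLOCKS,
                Math.max(MIN_BLOCKS, numBlocks / (double) (gcThreads * numRefills)));
        // exponential smoothing over successive young collections
        return (1 - WEIGHT) * prev + WEIGHT * ideal;
    }
}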

On-the-fly PLAB resizing

During young collection, if the chunk list of a certain size gets exhausted, the thread will refill it from the global free space pool (allocating the same number of chunks as at the beginning of the collection). Normally a thread will have to refill a chunk list a few times during a collection (-XX:CMSOldPLABNumRefills sets the desired number of refills). However, if the initial estimate was too small, a GC thread will refill its chunk list too often (a refill requires a global lock on the memory manager, so it may be slow). If on-the-fly PLAB resizing is enabled, the JVM will try to detect such conditions and resize the PLAB in the middle of a young collection.
-XX:+CMSOldPLABResizeQuicker will enable on-the-fly PLAB resizing (disabled by default).
A few more options offer additional tuning:
-XX:CMSOldPLABToleranceFactor=4 - tolerance of the phase-change detector for on-the-fly PLAB resizing during a scavenge.
-XX:CMSOldPLABReactivityFactor=2 - gain in the feedback loop for on-the-fly PLAB resizing during a scavenge.
-XX:CMSOldPLABReactivityCeiling=10 - clamping of the gain in the feedback loop for on-the-fly PLAB resizing during a scavenge.

Conclusion

I have spent some time digging through the OpenJDK code to make sure that I really get this thing now. It was educational. This article has brought up and explained a few more arcane JVM options, though I doubt that I will ever use them in practice. The problem with heap fragmentation is that you have to run an application for a really long time before fragmentation manifests itself. Most of the options above require a trial and error path (even though -XX:+PrintOldPLAB might give you some insight into your application). It is much easier to just give the damn JVM a little more memory (hey, RAM is cheap nowadays) than to spend days tuning arcane options.
Anyway, I hope it was as educational for you as it was for me.

See also