Tuesday, August 4, 2015

SJK - missing link in Java profiling tool chain

Sometimes it just happens. You have a bloated Java application at your hand and it does not perform well. You may have built this application yourself or just got it as it is now. It doesn't matter, thing is - you do not have a slightest idea what is wrong here.

Java ecosystem have abundance of diagnostic tools (thank for interfaces exposed at JVM itself), but they are mostly focused on some specific narrow kinds of problems. Despite calling themselves intuitive, they assume you have a lot of background knowledge about JVM and profiling techniques. Honestly, even seasoned Java (I'm speaking for myself here) developer can feel lost first time looking at JProfiler, YourKit of Mission Control.

If you have a performance problem at your hand, first you need is to classify problem: is it in Java or database or somewhere else? is CPU or memory kind of problem? Once you know what kind of problem you have, you can choose next diagnostic approach consciously.

Are we CPU bound?

One of first thing you would naturally do is to check CPU usage of your process. OS can show you process CPU usage. Which is useful, but the next question is which threads are consuming it. OS can show you threads usage too, you can even get OS IDs for your Java threads using jstack and correlate them ... manually (sick).

A simple tool showing CPU usage per Java thread is the thing I wanted badly for the years.

Surprisingly, all information is already in JMX Threading MBean. All is left is to do trivial math and report per thread CPU usage. So I just did it and ttop command become first in SJK tool set.

Besides CPU usage JMX have another invaluable metric - per thread allocation counter.

Collecting information from JMX is safe and can be done on live application instance (in case if you do not have JMX port open, SJK can connect using process ID).

Below is example of ttop command output.

2014-10-01T19:27:22.825+0400 Process summary
  process cpu=101.80%
  application cpu=100.50% (user=86.21% sys=14.29%)
  other: cpu=1.30%
  GC cpu=0.00% (young=0.00%, old=0.00%)
  heap allocation rate 123mb/s
  safe point rate: 1.5 (events/s) avg. safe point pause: 0.14ms
  safe point sync time: 0.00% processing time: 0.02% (wallclock time)
[000037] user=83.66% sys=14.02% alloc=  121mb/s - Proxy:ExtendTcpProxyService1:TcpAcceptor:TcpProcessor
[000075] user= 0.97% sys= 0.08% alloc=  411kb/s - RMI TCP Connection(35)-10.139.200.51
[000029] user= 0.61% sys=-0.00% alloc=  697kb/s - Invocation:Management
[000073] user= 0.49% sys=-0.01% alloc=  343kb/s - RMI TCP Connection(33)-10.128.46.114
[000023] user= 0.24% sys=-0.01% alloc=   10kb/s - PacketPublisher
[000022] user= 0.00% sys= 0.10% alloc=   11kb/s - PacketReceiver
[000072] user= 0.00% sys= 0.07% alloc=   22kb/s - RMI TCP Connection(31)-10.139.207.76
[000056] user= 0.00% sys= 0.05% alloc=   20kb/s - RMI TCP Connection(25)-10.139.207.76
[000026] user= 0.12% sys=-0.07% alloc=  2217b/s - Cluster|Member(Id=18, Timestamp=2014-10-01 15:58:3 ...
[000076] user= 0.00% sys= 0.04% alloc=  6657b/s - JMX server connection timeout 76
[000021] user= 0.00% sys= 0.03% alloc=   526b/s - PacketListener1P
[000034] user= 0.00% sys= 0.02% alloc=  1537b/s - Proxy:ExtendTcpProxyService1
[000049] user= 0.00% sys= 0.02% alloc=  6011b/s - JMX server connection timeout 49
[000032] user= 0.00% sys= 0.01% alloc=     0b/s - DistributedCache

Besides CPU and allocation, it also collect "true" GC usage and safe point statistics. Later two metrics are not available via JMX so they are available only for process ID connections.

CPU usage picture will give you good insight what to do next: should you profile your Java hot spots or all time is spent waiting result from DB.

Garbage analysis

Another common class of Java problems is related to garbage collection. If this is a case GC logs is first place to look at.

Do you have them enabled? If not, that is not a big deal, you can enable GC logging on running JVM process using jinfo command. You can also use SJK's gc command to peek GC activity for your java process (it is not as full as GC logs tough).

If GC logs confirm what GC is causing you problems, next step is to identify where that garbage comes from.

Commercial profilers are good at memory profiling, but this kind of analysis slows down target application dramatically.

Mission Control stands out of pack here, it can profile by sampling TLAB allocation failures. This technique is cheap and generally produce good results, though it is inherently biased and may mislead you sometimes.

For long time jmap and class histogram were main memory profiling instrument for me. Class histogram is simple and accurate.

In SJK] toolset, I have augmented vanila jmap command a little to make it more useful (SJK's [hh command).

Beware that jmap (and thus hh command) required Stop the World pause on target JVM while heap is being walked, so it may not be a good idea to execute it against live application under load.

Dead heap histogram is calculated as difference between object population before and after forced GC (using jmap class histogram command under hood).

Dead young heap histogram enforces full GC then wait 10 seconds (by default) then produce dead object histogram by technique describe above. Thus you see a summary freshly allocated garbage.

This methods cannot not tell you where in your code that garbage was allocated (this is job for Mission Control et al ). Though, if you know that is your top garbage objects, you may already know there they are allocated.

SJK have a few more tools but these two ttop and hh are always in front lines when I need to tackle another performance related problem.

Monday, February 23, 2015

So, you have dumped 150 GiB of JVM heap, now what?

150 GiB worth of JVM heap dump is laying on hard drive and I need analyze specific problem detected in that process.

This is a dump of proprietary hybrid of in-memory RDBMS and CEP system, I'm responsible for. All data are stored in Java heap, so heap size of some installation is huge (400 GiB heap is largest to the date).

Problem of analyzing huge heap dumps were on my radar for some time, so I wasn't unprepared.

To be honest, I haven't tried to open this file in Eclipse Memory Analyzer, but I doubt it could handle it.

For me, for some time, most useful tool in heap analyzers was JavaScript based queries. Clicking through millions objects is not fun. It is much better to walk object graph with code, not with mouse.

Heap dump is just a serialized graph of objects, my goal is to extract specific information from this graph. I do not really need a fancy UI, API to heap graph would be even better.

How I can analyze heap dump programmatically?

I have started my research with NetBeans profiler (it was a year ago). NetBeans is open source and have visual heap dump analyzer (same component is also used in JVisualVM). It turns out, what heap dump processing code is separate module and API it provides is suitable for custom analysis logic.

NetBeans heap analyzer has a critical limitation, though. It is using temporary file to keep internal index of heap dump. This file is typically around 25% of heap dump itself. But most important it takes a time to build this file, before any query to heap graph is possible.

After taking better look, I decided, I could remove this temporary file. I have forked library (my fork is available at GitHub). Some functions was lost together with temporary file (e.g. backward reference traversing), but they are not need for my kind of tasks.

Another important change to original library, was implementing HeapPath.
HeapPath is an expression language for object graph. It is useful both as generic predicate language in graph traversal algorithms and as simple tool to extract data from object dump. HeapPath automatically converts strings, primitives and few other simple types from heap dump structures to normal objects.

This library proved itself very useful in our daily job. One of its application was memory reporting tool for our database/CEP system which automatically report actual memory consumption of every relational transformation node (there could be few hundred nodes in single instance).

For interactive exploring API + Java is not best set of tools, tough. But it lets me do my job (and 150 GiB of dump leave me no alternatives).

Should I be adding some JVM scripting language to the mix ...

BTW: Single pass through 150 GiB is taking about 5 minutes. Meaning full analysis usually employ multiple iterations, but processing times are fairly reasonable even for that heap size.

Wednesday, February 11, 2015

Binary search - is it still most optimal?

If you have a sorted collection of elements, how would you find index of specific value?
"Binary search" is likely to be your answer.
Algorithms theory is teaching us what binary search is most optimal algorithm for this task with log(N) complexity.
Well, hash table can do better, if you need to find key by exact match. In many cases, though, you have reasons to have your collection sorted, not hashed.

On my job, I'm working on sophisticated in-memory database tailored for streaming data processing. We have a lot of places where we deal with sorted collection of integers (data row references, etc).

Algorithms theory is good, but in reality there are things like cache hierarchy, branch prediction, super scalar execution which may skew performance at edge cases.

Question is - where lie borders between reality ruled by CPU quirks and lawful space of classic algorithms theory?

If you have a doubt - do an experiment.

Experiment is simple: I'm generating a large number of sorted arrays of 32 bit integers. When I search random key in random array multiple times. In each experiment average size of array is fixed. Large number of arrays used to ensure cold memory access. Average time search time is measured.

All code written in Java and measured using JMH tool.

Participants are

  • Binary search - java.util.Arrays.binarySearch()
  • Linear search - simple loop over array until key is found
  • Linear search 2 - looping over every second element in array, if greater key is found, check i - 1 index too

X axis is average array length
Y axis is average time of single search in microseconds
Measurments have been done on 3 different types CPU.

Results speak for themselves.

I was surprised a little, as I were expecting binary search to outperform linear at length of 32 or 64, but it seems that modern processors are very good at optimizing linear memory access.

Provided that 8 - 128 is a practical range for BTree like structures, I will likely to reconsider some of data structures used in our database.