Alexey Ragozin

Curse of the JMX

2023-09-21T09:55:00.004+01:00

JMX stands for Java Management Extensions. It was introduced as part of Java Enterprise Edition (JEE) and later became an integral part of the JVM.

The JVM exposes a lot of useful information for diagnostic tooling through the JMX interface.

Many popular tools such as Visual VM and Mission Control are heavily based on JMX. Even Java Flight Recorder is exposed for remote connection via JMX.

Middleware and libraries are also exploiting JMX to expose custom MBeans with helpful information.

So if you are in the business of JVM monitoring or diagnostic tooling you cannot avoid dealing with JMX.

JMX is a remote access protocol. It uses TCP sockets and requires some upfront configuration for the JVM to start listening for network connections (though tools such as VisualVM can enable JMX at runtime, provided they have access to the JVM process).

You can find details about JMX agent configuration in the official documentation, but a minimal configuration is shown below (add the snippet to the JVM start command).

-Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.port=5555

The JVM will start listening on port 5555. You can then use this port in Visual VM and other tools.

The configuration above is minimal: access control and TLS encryption are disabled. You should consult the documentation mentioned above to add security (which would typically be required in a real environment).
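For reference, here is a minimal sketch of how a client could reach an agent configured this way using the standard javax.management.remote API. The class and method names (JmxClientSketch, serviceUrl, readVmName) are made up for illustration, and the sketch assumes the unauthenticated, non-TLS setup shown above.

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Hypothetical helper, not part of any tool mentioned in this article.
final class JmxClientSketch {

    /** Builds the usual RMI-based JMX service URL for a registry at host:port. */
    static String serviceUrl(String host, int port) {
        return "service:jmx:rmi:///jndi/rmi://" + host + ":" + port + "/jmxrmi";
    }

    /** Connects to the agent and reads the VmName attribute of the Runtime MBean. */
    static String readVmName(String host, int port) throws Exception {
        JMXServiceURL url = new JMXServiceURL(serviceUrl(host, port));
        // Assumes authentication and TLS are disabled, as in the minimal configuration.
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbeans = connector.getMBeanServerConnection();
            ObjectName runtime = new ObjectName("java.lang:type=Runtime");
            return (String) mbeans.getAttribute(runtime, "VmName");
        }
    }
}

Calling JmxClientSketch.readVmName("localhost", 5555) against a JVM started with the options above should return the name of that JVM.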

JMX is a capable protocol, but it has some idiosyncrasies due to its JEE lineage. In particular, it has specific requirements for network topology.

JMX is based on the Java RMI protocol. Access to the JMX agent involves a two step handshake.

On the first step, the client makes a request to the RMI registry and receives a serialized remote interface stub. The JMX agent has a built-in single object registry, which is exposed on port 5555 in our example.

On the second step, the client accesses the remote interface via the network address embedded in the stub object received on the first step.

In a trivial network, this is not an issue, but if there is any form of NAT or proxy between the JMX client and the JVM, things are likely to break.

So we have two issues here:

1. Stub could be exposed on different port number, which is not whitelisted

2. Stub may provide some kind of internal IP, not routable for client host

First issue is easily solvable with com.sun.management.jmxremote.rmi.port property, which can be set to the same value as registry port (5555 in our example).

The second issue is much trickier, as the JVM may be totally unaware of the IP visible from outside. Even worse, such an IP could be dynamic, so it cannot be configured via the JVM command line.
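To check which address an agent actually advertises, you can perform only the first step of the handshake and look at the stub. The sketch below does that with plain java.rmi calls. It assumes a non-TLS registry, and it relies on the "endpoint:[host:port]" fragment that the JDK's RMI stubs usually print in toString(), which is an implementation detail rather than a documented format. JmxStubInspector and its methods are hypothetical names.

import java.rmi.Remote;
import java.rmi.registry.LocateRegistry;
import java.rmi.registry.Registry;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only.
final class JmxStubInspector {

    // Assumed toString() fragment of a JDK RMI stub: endpoint:[host:port]
    private static final Pattern ENDPOINT = Pattern.compile("endpoint:\\[([^\\]]+)\\]");

    /** Extracts "host:port" from a stub description, or null if the fragment is absent. */
    static String extractEndpoint(String stubDescription) {
        Matcher m = ENDPOINT.matcher(stubDescription);
        return m.find() ? m.group(1) : null;
    }

    /** Step one of the handshake only: fetch the stub from the registry without calling it. */
    static String advertisedEndpoint(String registryHost, int registryPort) throws Exception {
        Registry registry = LocateRegistry.getRegistry(registryHost, registryPort);
        Remote stub = registry.lookup("jmxrmi"); // name under which the JMX agent binds its stub
        return extractEndpoint(stub.toString());
    }
}

If advertisedEndpoint("localhost", 5555) reports an address or port that your client host cannot reach, the second step of the handshake is the one that will fail.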

In this article, I will describe a few recipes for dealing with JMX in the modern container/cloud world. None of them is ideal, but I hope at least one will be useful for you.

Configuring JMX for known external IP address

If you know a routable IP address, the solution is to configure the JVM to provide that specific IP inside the remote interface stub. An example of this situation would be running a JVM in a local Docker container.

JVM parameter -Djava.rmi.server.hostname=<MyHost> can be used to override the IP in remote stubs provided by the JMX agent. This parameter affects all RMI communication, but RMI is rarely used nowadays besides the JMX protocol.

Resulting communication scheme is outlined on the diagram below.

JVM options

-Djava.rmi.server.hostname=1.2.3.4
-Dcom.sun.management.jmxremote
-Dcom.sun.management.jmxremote.authenticate=false
-Dcom.sun.management.jmxremote.ssl=false
-Dcom.sun.management.jmxremote.port=5555
-Dcom.sun.management.jmxremote.rmi.port=5555

Communication diagram

Configuring JMX for tunneling

In some situations, the IP address of the JVM host may not even be reachable from the JMX client host. Here are a couple of typical examples:

● You are using SSH to access the internal network through a bastion host.

● JVM is in Kubernetes POD.

In both situations, you can use port forwarding to establish network connectivity between the JMX client and the JVM.

Again, you would need to override the IP in the remote service stub, but now you will have to set it to 127.0.0.1.

Communication diagram is shown below.

JVM options

-Djava.rmi.server.hostname=127.0.0.1
-Dcom.sun.management.jmxremote
-Dcom.sun.management.jmxremote.authenticate=false
-Dcom.sun.management.jmxremote.ssl=false
-Dcom.sun.management.jmxremote.port=5555
-Dcom.sun.management.jmxremote.rmi.port=5555

Communication diagram

In the case of SSH, you can use port forwarding.
In Kubernetes, there is also a handy kubectl port-forward command which allows you to communicate with a POD directly.

You can even chain port-forwarding multiple times.

This approach has its own limitations, though.

● JMX will no longer be available to remote hosts without port forwarding, so this configuration may interfere with monitoring agents running in your cluster and collecting JMX metrics.

● You cannot connect to multiple JVMs using the same JMX port (e.g. PODs from a single deployment), as the port on your client host is bound to a particular remote destination. Remapping ports will break JMX, because the stub still advertises the original port.

Using HTTP JMX connector

The root of the problem is the RMI protocol, which is archaic and has not evolved to support modern network topologies. JMX is flexible enough to use alternative transport layers, and one of them is HTTP (using the Jolokia open source project).

This implementation doesn’t come out of the box, though. You will have to ship a Jolokia agent jar with your application and introduce it via the JVM command line as a Java agent (see details here).
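As an illustration of what the HTTP transport looks like on the wire, below is a minimal sketch that reads one MBean attribute through Jolokia's REST-style "read" operation using the JDK 11+ HTTP client. The base URL (port 8778 and the /jolokia context are the defaults I would expect from the Jolokia JVM agent, so treat them as an assumption) and the class name JolokiaReadSketch are illustrative. The sketch does not escape special characters in MBean names.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical helper for illustration only.
final class JolokiaReadSketch {

    /** Builds a Jolokia "read" URL for one MBean attribute (no escaping is applied). */
    static String readUrl(String baseUrl, String mbean, String attribute) {
        String base = baseUrl.endsWith("/") ? baseUrl.substring(0, baseUrl.length() - 1) : baseUrl;
        return base + "/read/" + mbean + "/" + attribute;
    }

    /** Performs the HTTP GET and returns the raw JSON response body. */
    static String read(String baseUrl, String mbean, String attribute) throws Exception {
        HttpRequest request = HttpRequest
                .newBuilder(URI.create(readUrl(baseUrl, mbean, attribute)))
                .GET()
                .build();
        HttpResponse<String> response =
                HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }
}

For example, read("http://localhost:8778/jolokia", "java.lang:type=Memory", "HeapMemoryUsage") would return a JSON document describing heap usage. Since this is plain HTTP, it passes through proxies and port forwarding without any of the RMI stub issues described earlier.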

The good news is that nowadays tools such as VisualVM and Mission Control fully support the Jolokia JMX transport. Below are a few demo videos from the Jolokia project:

● Jolokia from JMC

● Connect Visual VM to a JVM running in Kubernetes using Jolokia

● Connect Java Mission Control to a JVM in Kubernetes

In addition to classic tools, the Jolokia HTTP endpoint is accessible from client side JavaScript, so a web client is also possible. See the Hawt.IO project, which implements a diagnostic web console for Java on top of Jolokia.

Using SJK JMX proxy

Having dealt with JMX over the years, at some point I decided to make a diagnostic tool specifically for JMX connectivity troubleshooting.

It is part of SJK, my jack-of-all-knives solution for dealing with JVM diagnostics. The mxping command can help identify which part of the JMX handshake is broken.

While implementing mxping, I realized that I can solve the root cause of RMI network sensitivity by messing with JMX client code. As I am not eager to patch all JMX tools around, I have introduced JMX Proxy (mxprx), which can be used between a JMX client and a remote JVM.

Using the JMX proxy may eliminate the issues with the port forwarding scenario mentioned above, as it

● Does not require -Djava.rmi.server.hostname=127.0.0.1 on the JVM side.

● Allows you to remap ports and thus keep multiple ports forwarded at the same time.

Below is a communication diagram using JMX proxy from SJK.

In addition, with the JMX proxy, ad hoc configuration of a JMX endpoint without a JVM restart becomes possible.

The JMX agent can be started and configured at runtime via jcmd, but java.rmi.server.hostname can only be set on the command line of the JVM. With the JMX proxy, however, we do not rely on java.rmi.server.hostname anymore!

Below are steps to connect to the JVM in the Kubernetes POD even if JMX was not configured upfront.

1. Enter the container shell using the kubectl exec command.

2. In the container, use jcmd ManagementAgent.start to start JMX agent (see more details here).

3. Forward port from container to your local host.

4. Start the JMX proxy on your host, pointing it at localhost:<port forwarded from container>, and provide some outbound port (see more details here).

5. Now you can connect with any JMX aware tool via locally running JMX proxy.

Conclusion

I have listed four alternative approaches to JMX setup. Unfortunately, none of them is universal, and you have to pick the one which is most suitable for your case.

While JMX is kind of archaic, it is still essential for JVM monitoring, and you are likely to have to deal with it in any serious Java based system.

I hope someday HTTP will become the built-in default transport for the JVM and all this trickery will become a horror story from the old days.

Lies, darn lies and sampling bias

2019-03-11T09:08:00.002+00:00

Sampling profiling is a very powerful technique widely used across various platforms for identifying hot code (execution bottlenecks).

In the Java world, sampling profiling (thread stack sampling, to be precise) is supported by every serious profiler.

While powerful and very handy in practice, sampling has a well-known weakness: sampling bias. It is a real problem, though its practical impact is often exaggerated.

A picture is worth a thousand words, so let me jump-start with an example.

Case 1

Below is a simple snippet of code. This snippet is doing cryptographic hash calculation over a bunch of random strings.

import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.concurrent.TimeUnit;

public class CryptoBench {

    private static final boolean trackTime = Boolean.getBoolean("trackTime");

    public static void main(String[] args) {
        CryptoBench test = new CryptoBench();
        while (true) {
            test.execute();
        }
    }

    public void execute() {
        long N = 5 * 1000 * 1000;
        RandomStringUtils randomStringUtils = new RandomStringUtils();
        long ts = 0, tf = 0;
        long timer1 = 0;
        long timer2 = 0;
        long bs = System.nanoTime();
        for (long i = 0; i < N; i++) {
            ts = trackTime ? System.nanoTime() : 0;
            String text = randomStringUtils.generate();
            tf = trackTime ? System.nanoTime() : 0;
            timer1 += tf - ts;
            ts = tf;
            crypt(text);
            tf = trackTime ? System.nanoTime() : 0;
            timer2 += tf - ts;
            ts = tf;
        }
        long bt = System.nanoTime() - bs;
        System.out.print(String.format("Hash rate: %.2f Mm/s", 0.01 * (N * TimeUnit.SECONDS.toNanos(1) / bt / 10000)));
        if (trackTime) {
            System.out.print(String.format(" | Generation: %.1f %%", 0.1 * (1000 * timer1 / (timer1 + timer2))));
            System.out.print(String.format(" | Hashing: %.1f %%", 0.1 * (1000 * timer2 / (timer1 + timer2))));
        }
        System.out.println();
    }

    public String crypt(String str) {
        if (str == null || str.length() == 0) {
            throw new IllegalArgumentException("String to encrypt cannot be null or zero length");
        }
        StringBuilder hexString = new StringBuilder();
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            md.update(str.getBytes());
            byte[] hash = md.digest();
            for (byte aHash : hash) {
                if ((0xff & aHash) < 0x10) {
                    hexString.append("0" + Integer.toHexString((0xFF & aHash)));
                } else {
                    hexString.append(Integer.toHexString(0xFF & aHash));
                }
            }
        } catch (NoSuchAlgorithmException e) {
            e.printStackTrace();
        }
        return hexString.toString();
    }
}

Code is available on GitHub.

Now let’s use Visual VM (a profiler bundled with Java 8) and look at how much time is actually spent in the CryptoBench.crypt() method.

Something is definitely off in the screenshot above!

CryptoBench.crypt(), the method doing actual cryptography, is attributed only 33% of execution time.
At the same time, CryptoBench.execute() has 67% of self time, and that method does nothing besides calling other methods.

Probably I just need a cooler profiler here. /s

Let’s use Java Flight Recorder for the very same case.
Below is a screenshot from Mission Control.

That looks much better!

CryptoBench.crypt() is now 86% of our time budget. The rest of the time is spent in random string generation.
These numbers look more believable to me.

Wait, wait, wait!

Integer.toHexString() is taking as much time as the actual MD5 calculation. I cannot believe that.

The numbers are better than the ones produced by VisualVM, but they are still fishy enough.

Flight Recorder is not cool enough for that task! We need a really cool profiler! /s

Ok, let me bring some sense into this discrepancy between tools.

We were using thread stack sampling in both tools (Visual VM and Flight Recorder). However, these tools capture stack traces differently.

Visual VM actually samples thread dumps (via thread dump support in the JVM). Thread dumps include stack traces for every application thread in the JVM, regardless of the thread's state (blocked, sleeping or actually executing code), and the dump is taken atomically. It reflects the instantaneous execution state of the whole JVM (which is important for deadlock/contention analysis). In practice, that implies a short Stop the World pause for each dump. A Stop the World pause means a safepoint in the HotSpot JVM. And safepoints bring some nuances.

When Visual VM requests a thread dump, the JVM notifies threads to suspend execution, but a thread executing Java code wouldn’t stop immediately (unless it is interpreted). The thread would continue to run until the next safepoint check, where it can suspend itself. Checks cost CPU cycles, so they are sparse in JIT generated code.

Checks are placed inside of loops and after method returns. However, checks are omitted for loops considered “fast” by the JIT compiler (typically integer indexed loops). Small methods are aggressively inlined too, hence omitting the safepoint check at return. As a consequence, hot and calculation intensive code may be optimized by JIT into a single chunk of machine code which is mostly free of safepoint checks.

If you are lucky, the thread dump would show you a line invoking the method containing hot code. With less luck, the result would be even more misleading.
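To make the mechanism concrete, here is a rough sketch of thread dump style sampling using only Thread.getStackTrace(). It is not how Visual VM is implemented, just an illustration of the same idea: the sampled thread can only be observed where the JVM is able to stop it, so on HotSpot the top frame you see is subject to the safepoint effects described above. TopFrameSampler is a hypothetical name.

import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified sampler for illustration only.
final class TopFrameSampler {

    /**
     * Samples the target thread's stack a fixed number of times and counts
     * which method was on top of the stack in each non-empty sample.
     */
    static Map<String, Integer> sample(Thread target, int samples, long intervalMillis) {
        Map<String, Integer> histogram = new HashMap<>();
        for (int i = 0; i < samples; i++) {
            // The JVM decides where the target thread gets stopped for this call.
            StackTraceElement[] stack = target.getStackTrace();
            if (stack.length > 0) {
                String top = stack[0].getClassName() + "." + stack[0].getMethodName();
                histogram.merge(top, 1, Integer::sum);
            }
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        return histogram;
    }
}

Pointing such a sampler at the thread running CryptoBench.execute() would let you reproduce a histogram similar in spirit to the Visual VM view, with samples attributed to whatever frame the thread was stopped in.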

So in the Visual VM call tree we see method CryptoBench.execute() at the top of the stack for 66% of samples. If we were able to see the call tree at line number granularity, it would be a line calling the CryptoBench.crypt() method.

Bad, ugly safepoint bias I’ve caught you red handed! /s

So, how does Flight Recorder sample stacks, and why are the numbers different?

Flight Recorder sampling doesn’t involve full thread dumps. Instead, it freezes threads one by one using OS provided facilities. Once a thread is frozen, we can get the address of the next instruction to be executed out of the stack memory area. The address of the instruction is converted into a line number of Java source code via a bytecode to machine code symbol map. The map is generated during JIT compilation. This is how the stack trace is reconstructed.

In the case of Flight Recorder, safepoint bias does not apply. Still, the results look inaccurate. Why?

Below is another session with Flight Recorder for the very same code.

The picture is different now.

Integer.toHexString() is just 2.25% of our execution budget, which is more trustworthy in my eyes.

Flight Recorder has to resolve memory addresses back to a reference to a bytecode instruction (which is further translated into a Java source line). The mapping generated by the JIT compiler is used for that purpose.

The compiler, however, assumes that a thread stack trace can be observed only at safepoints. By default, only safepoint checks are mapped to bytecode instruction indexes. Flight Recorder takes the execution address from the stack, then finds the next address mapped to Java code in the symbol table. In case of aggressive inlining, Flight Recorder can map the address to a wholly wrong point in the code.

Though sampling itself is not biased by safepoints, the symbol map generated by the JIT compiler is.

In the second example, I’ve used two JVM options to force more detailed symbol maps to be generated by the JIT compiler. The options are below.

-XX:+UnlockDiagnosticVMOptions -XX:+DebugNonSafepoints

A more accurate symbol map, free of safepoint bias, allows Flight Recorder to produce more accurate stack traces.

In our mental model, code is executed line by line (bytecode instruction by instruction). But the compiler lumps a bunch of methods together and generates a single blob of machine code, aggressively reordering operations in the process to make the code faster.
Our mental model of line by line execution is totally broken by compiler optimization.

In practice, though, artifacts of operation reordering are not as striking as safepoint bias.

So Java Flight Recorder is cool, Visual VM is not. Should I make this conclusion?

Let me present a counter example.

Case 2

Below are profiling reports from a different case.

Now I’m using flame graphs generated from data captured by Visual VM and Flight Recorder (with -XX:+DebugNonSafepoints).

Visual VM report

Flight Recorder report

Both graphs show InflaterInputStream to be a bottleneck. But Visual VM assesses the time spent as 98%, while in Flight Recorder it is just 47%.

Who is right?

The correct answer is 92% (approximated using differential analysis).

My heart is broken! Flight Recorder is not a silver bullet. /s

What has gone wrong?

In this example, the hot spot was related to the JNI overhead involved in calling native code in zlib. It seems that Flight Recorder was unable to reconstruct the stack trace for certain samples outside of Java code and dropped these samples. The sample population was biased by native code execution. That bias has played against Flight Recorder in this case.

Conclusion

Both profilers are doing what they are intended to do. Some sort of bias is natural for almost any kind of sampling.

Each sampling profiler can be characterized by three aspects.

Blind spots bias – which samples are excluded from the data set collected by the profiler.
Attractor bias – how samples are attracted to specific discrete points (e.g. safepoints).
Resolution – the unit of code to which profiling data is aggregated (e.g. method, line number etc).

Below is summary table for sampling methods mentioned in this article.

	Blind spot	Attractor	Resolution
JVM Thread Dump Sampling	non-java threads	safepoint bias	java frames only
Java Flight Recorder	non-java code execution	CPU pipeline bias + code to source mapping skew	java frames only
Java Flight Recorder + `DebugNonSafepoints`	non-java code execution	CPU pipeline bias + code to source mapping skew	java frames only

SJK is learning new tricks

2018-05-30T00:57:00.000+01:00

SJK (Swiss Java Knife) has been my secret weapon for firefighting various types of performance problems for a long time.

A new version of SJK was released not too long ago, and it contains a bunch of new and powerful features I would like to highlight.

ttop contention monitoring

SJK lives up to its name by bundling a number of tools into a single executable jar. Still, ttop is likely the single most commonly used tool under the SJK roof.

ttop is a kind of top for the threads of a JVM process. Besides the CPU usage counter (provided by the OS) and allocation rate (tracked by the JVM), new thread contention metrics were introduced in the recent SJK release.

Thread contention metrics are calculated by the JVM, which counts and times each occasion when a Java thread enters the BLOCKED or WAITING state.

If enabled, SJK uses these metrics to display rates and the percentage of time spent in either state.

2018-05-29T14:20:03.382+0300 Process summary 
  process cpu=231.09%
  application cpu=212.78% (user=195.86% sys=16.92%)
  other: cpu=18.31% 
  thread count: 157
  GC time=4.72% (young=4.72%, old=0.00%)
  heap allocation rate 976mb/s
  safe point rate: 6.3 (events/s) avg. safe point pause: 8.24ms
  safe point sync time: 0.07% processing time: 5.09% (wallclock time)
[000180] user=19.40% sys= 0.31% wait=183.6/s(75.77%) block=    0/s( 0.00%) alloc=  110mb/s - hz._hzInstance_2_dev.cached.thread-8
[000094] user=16.92% sys= 0.16% wait=58.50/s(81.54%) block=    0/s( 0.00%) alloc=   94mb/s - hz._hzInstance_3_dev.generic-operation.thread-0
[000057] user=15.05% sys= 0.62% wait=56.91/s(82.35%) block= 0.20/s( 0.01%) alloc=   91mb/s - hz._hzInstance_2_dev.generic-operation.thread-0
[000095] user=15.21% sys= 0.00% wait=55.61/s(82.32%) block= 0.30/s( 0.04%) alloc=   87mb/s - hz._hzInstance_3_dev.generic-operation.thread-1
[000022] user=14.59% sys= 0.00% wait=56.01/s(83.42%) block= 0.30/s( 0.08%) alloc=   86mb/s - hz._hzInstance_1_dev.generic-operation.thread-1
[000058] user=13.97% sys= 0.16% wait=56.91/s(84.13%) block= 0.10/s( 0.02%) alloc=   81mb/s - hz._hzInstance_2_dev.generic-operation.thread-1

An important fact about these metrics is that CPU time + WAITING + BLOCKED should add up to 100% in an ideal world.

In reality, you are likely to see a gap. A few reasons why the equation above does not hold:

GC pauses freeze thread execution, but are not accounted for by thread contention monitoring,
a thread may be waiting for an IO operation, but this is not accounted as BLOCKED or WAITING state by the JVM,
the system may be starved of CPU resources, so the thread is waiting for a CPU core at the OS level (which is also not accounted for by the JVM).

Contention monitoring is not enabled by default; use the -c flag with the ttop command to enable it.
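The raw numbers behind such metrics are exposed by the standard java.lang.management API. The sketch below is not SJK's code; it only shows, under that assumption, which ThreadMXBean and ThreadInfo calls provide blocked/waited counts and times once contention monitoring is switched on. ContentionProbe is a hypothetical name.

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical helper for illustration only.
final class ContentionProbe {

    private static final ThreadMXBean THREADS = ManagementFactory.getThreadMXBean();

    /** Turns on contention timing if the JVM supports it; returns whether it is on. */
    static boolean enable() {
        if (!THREADS.isThreadContentionMonitoringSupported()) {
            return false;
        }
        THREADS.setThreadContentionMonitoringEnabled(true);
        return true;
    }

    /** Returns {blockedCount, blockedMillis, waitedCount, waitedMillis} for a thread. */
    static long[] stats(Thread thread) {
        ThreadInfo info = THREADS.getThreadInfo(thread.getId());
        if (info == null) {
            return new long[4]; // thread is not alive
        }
        // Times are reported as -1 while contention monitoring is disabled.
        return new long[] {
            info.getBlockedCount(), info.getBlockedTime(),
            info.getWaitedCount(), info.getWaitedTime()
        };
    }
}

A ttop-like tool would poll these values periodically and divide the deltas by the elapsed wall clock time to get rates and percentages.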

HTML5 based flame graph

SJK has been able to produce flame graphs for some time already. However, the old flame graphs were generated as svg with limited interactivity.

The new version offers a new type of flame graph, which is interactive and based on HTML5. Right in the browser it allows:

filtering data by threads,
zooming into specific paths or by presence of a specific frame,
filtering data by thread state (if state information is available).

The HTML5 report is a 100% self-contained file with no dependencies; you can send it by email and open it on any machine. Here is an example of the new flame graph you can play with right now.

The new flame command is used to generate HTML5 flame graphs.

`jstack` dump support

SJK accepts a number of input data formats for thread sampling data, which is used for flame graphs and other types of performance analysis.

A new format added in version 0.10 is the text thread dump format produced by jstack. The full list of input formats is now:

SJK native thread sampling format
JVisualVM sampling snapshots (.nps)
Java Flight Recorder recording (.jfr)
jstack produced text thread dumps

HeapUnit - Test your Java heap content

2017-06-15T19:01:00.000+01:00

There are usually a number of tests which you would like to run for each build to make sure that what your code does makes sense. Typically, such tests focus on the business function of your code.

On rare occasions, though, you would really like to test certain non-functional aspects. A memory/resource leak would be a good example.

How would you test memory leak?

This is quite a challenge, right?

You can use a debugger or profiler to inspect the internal state of your system. That approach assumes manual testing, though.

You can write a test which stresses your system, provoking an OutOfMemoryError that fails the test if the code has a defect. That generally works, though adding a stress test to a mostly functional automatic test pack may not be the best idea. That approach also may not work for other kinds of resource leaks.

You can exploit weak or phantom references to trace the garbage collector's work. This approach makes the test more lightweight compared to fully fledged stress testing, but it is not applicable in many cases. E.g. you may not have a reference to the objects suspected of leaking.
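A minimal sketch of the weak reference approach is shown below, with the caveat that System.gc() is only a hint and the JVM is free to ignore it, so a test built this way can be flaky. WeakRefLeakCheck is a hypothetical helper name.

import java.lang.ref.WeakReference;

// Hypothetical helper for illustration only.
final class WeakRefLeakCheck {

    /** Requests GC a few times and reports whether the referent has been cleared. */
    static boolean isCollected(WeakReference<?> ref, int attempts) {
        for (int i = 0; i < attempts && ref.get() != null; i++) {
            System.gc(); // only a hint; the JVM may ignore it
            try {
                Thread.sleep(20);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        return ref.get() == null;
    }
}

In a test, you would wrap the suspect object in a WeakReference, drop all your own strong references to it, run the code that is supposed to release it, and then assert that isCollected(ref, 10) returns true. If something still holds the object, the reference never gets cleared.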

For some time I was actively practising automated inspection of JVM heap dumps for diagnostic purposes. JVM could easily produce its own heap dump (using JVM attach interface) and that dump can be inspected via API to assert certain invariants (e.g. number of live instances of particular type). Why not use it for resource leak testing and similar cases?

Resurrecting object from dump

The heap dump API allows you to inspect fields of dumped objects; there is also a heap path notation for writing sophisticated selectors. However, you cannot invoke methods, not even toString() or equals(), on objects from the dump. For quantitative analysis, this is ok. But for asserting complex conditions typical of a test scenario, dealing with Java objects may be much more convenient.

A heap dump doesn't contain full class information. But if the dump is produced from the JVM we are running in, we can rely on the class metadata available through reflection.

The Objenesis library and Java reflection are used to convert instance data from the heap dump back to normal Java objects.

In the end, usage of HeapUnit is fairly simple. Using the API you can

take a heap dump
select certain types of instances from the dump by class or heap path notation
inspect an instance's fields using symbolic names
or rehydrate an instance into a Java object

Example

Below is a simple example listing Socket objects in JVM

@Test
public void printSockets() throws IOException {

    ServerSocket ss = new ServerSocket();
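    // sock(port) is a helper not shown in this snippet (presumably it builds a socket address for the given local port)
    ss.bind(sock(5000));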

    Socket s1 = new Socket();
    Socket s2 = new Socket();

    s1.connect(sock(5000));
    s2.connect(sock(5000));

    ss.close();
    s1.close();
    // s2 remains unclosed

    HeapImage hi = HeapUnit.captureHeap();

    for(HeapInstance i: hi.instances(SocketImpl.class)) {
        // fd field in SocketImpl class is nullified when socket gets closed
        boolean open = i.value("fd") != null;
        System.out.println(i.rehydrate() + (open ? " - open" : " - closed"));
    }
}

The HeapUnit library is available in the Maven Central repo. You can bring it into your project using the Maven coordinates below.

<dependency>
    <groupId>org.gridkit.heapunit</groupId>
    <artifactId>heapunit</artifactId>
    <version>0.2</version>
</dependency>

HotSpot JVM garbage collection options cheat sheet (v4)

2016-10-25T04:04:00.000+01:00

After three years, I have decided to update my GC cheat sheet.

The new version finally includes G1 options; thankfully, there are not very many of them. There are also a few useful options introduced to CMS, including parallel initial mark and initiating concurrent cycles by timer.

Finally, I made separate cheat sheet versions for Java 7 and Java 8.

Below are links to PDF versions

Java 8 GC cheat sheet

Java 7 GC cheat sheet

How to measure object size in Java?

2016-09-16T23:27:00.000+01:00

You define fields, their names and types, in the source of a Java class, but it is the JVM that decides how they will be stored in physical memory.

Sometimes you want to know exactly how much a Java object weighs in memory. Answering this question is surprisingly complicated.

Challenge

Pointer size and Java object header size vary.
The JVM could be built for a 32 or 64 bit architecture. On 64 bit architectures, the JVM may or may not use compressed pointers (-XX:+UseCompressedOops).
Object padding may be different (-XX:ObjectAlignmentInBytes=X).
Different field types may have different alignment rules.
The JVM may reorder fields in the object layout as it likes.

The figure below illustrates how the JVM may rearrange fields in memory.

Guessing object layout

You can scrape class fields via reflection and try to guess the layout chosen by the JVM, taking into account platform pointer size and other factors.

... at least you can try.
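To show what such guessing looks like, here is a deliberately naive sketch. It hard-codes a 12 byte header, 4 byte references and 8 byte alignment, which are typical for a 64-bit HotSpot JVM with compressed pointers but are assumptions, not facts about your JVM. It also ignores padding between fields and between superclass and subclass parts, so the real size can be larger. LayoutGuess is a hypothetical name.

import java.lang.reflect.Field;
import java.lang.reflect.Modifier;

// Hypothetical, naive estimator for illustration only.
final class LayoutGuess {

    // Assumed values for a 64-bit HotSpot JVM with compressed oops (not guaranteed).
    static final int HEADER_BYTES = 12;
    static final int REFERENCE_BYTES = 4;
    static final int ALIGNMENT = 8;

    static int fieldSize(Class<?> type) {
        if (type == long.class || type == double.class) return 8;
        if (type == int.class || type == float.class) return 4;
        if (type == short.class || type == char.class) return 2;
        if (type == byte.class || type == boolean.class) return 1;
        return REFERENCE_BYTES; // any object or array reference
    }

    /** Sums header and instance field sizes over the class hierarchy, then pads. */
    static long guessShallowSize(Class<?> type) {
        long size = HEADER_BYTES;
        for (Class<?> c = type; c != null; c = c.getSuperclass()) {
            for (Field f : c.getDeclaredFields()) {
                if (!Modifier.isStatic(f.getModifiers())) {
                    size += fieldSize(f.getType());
                }
            }
        }
        // Round up to the assumed object alignment.
        return (size + ALIGNMENT - 1) / ALIGNMENT * ALIGNMENT;
    }
}

Under these assumptions, the guess for java.lang.Long is 24 bytes (12 byte header plus 8 byte value, padded to a multiple of 8). Change any of the three constants and the answer changes, which is exactly why this approach is fragile.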

Using the Unsafe

sun.misc.Unsafe is an internal helper class used by JVM code. You should not use it, but you can (with some help from reflection). Unsafe is popular among people doing weird things with the JVM.

Unsafe lets you query information about the physical layout of a Java object. However, it would not directly tell you the real size of the object in memory. You would still have to do some error-prone math to calculate the object's size.

Here is an example of such code.

Instrumentation agent

java.lang.instrument.Instrumentation is an API for profilers and other performance tools. You need to install an agent into the JVM to get an instance of this class. This class has a handy getObjectSize(...) method which tells you the real object size.

There is a library, jamm, which exploits this option. You have to use special JVM start options (a -javaagent argument), though.
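For illustration, an agent that does nothing but capture the Instrumentation instance can be very small. The sketch below is not jamm's code; SizeAgent is a hypothetical class which would have to be packaged into a jar whose manifest names it in the Premain-Class attribute and then passed to the JVM with -javaagent.

import java.lang.instrument.Instrumentation;

// Hypothetical agent class; a real agent class is normally declared public.
final class SizeAgent {

    private static volatile Instrumentation instrumentation;

    /** Called by the JVM before main() when started with -javaagent pointing at the agent jar. */
    public static void premain(String args, Instrumentation inst) {
        instrumentation = inst;
    }

    /** Shallow size of one object as reported by the JVM. */
    static long sizeOf(Object obj) {
        Instrumentation inst = instrumentation;
        if (inst == null) {
            throw new IllegalStateException("agent not installed; start the JVM with -javaagent");
        }
        return inst.getObjectSize(obj);
    }
}

Note that getObjectSize(...) reports the size of the object itself, not of the objects it references.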

Threading MBean

The Threading MBean in the JVM has a handy allocation counter. Using this counter, you can easily measure object size by allocating a new instance and checking the delta of the counter. The snippet below does just that.

import java.lang.management.ManagementFactory;

public class MemMeter {

    private static long OFFSET = measure(new Runnable() {
        @Override
        public void run() {
        }
    });

    /**
     * @return amount of memory allocated while executing provided {@link Runnable}
     */
    public static long measure(Runnable x) {
       long now = getCurrentThreadAllocatedBytes();
       x.run();
       long diff = getCurrentThreadAllocatedBytes() - now;
       return diff - OFFSET;
    }

    @SuppressWarnings("restriction")
    private static long getCurrentThreadAllocatedBytes() {
        return ((com.sun.management.ThreadMXBean)ManagementFactory.getThreadMXBean()).getThreadAllocatedBytes(Thread.currentThread().getId());
    }
}

Below is simple usage example

System.out.println("size of java.lang.Object is " 
+ MemMeter.measure(new Runnable() {

    Object x;

    @Override
    public void run() {
        x = new Object();
    }
}));

This approach requires you to create a new instance of the object to measure its size, though. That may be an obstacle.

jmap

jmap is one of the JDK tools. With the jmap -histo PID command you can print a histogram of your heap objects.

num     #instances         #bytes  class name
---------------------------------------------
  1:       1413317      111961288  [C
  2:        272969       39059504  <constMethodKlass>
  3:       1013137       24315288  java.lang.String
  4:        245685       22715744  [I
  5:        272969       19670848  <methodKlass>
  6:        206682       17868464  [B
  7:         29355       17722320  <constantPoolKlass>
  8:        659710       15833040  java.util.HashMap$Entry
  9:         29355       12580904  <instanceKlassKlass>
 10:        105637       12545112  [Ljava.util.HashMap$Entry;
 11:        170894       11797400  [Ljava.lang.Object;

For objects, you can divide byte size by instance count to get the individual instance size for a class. This would not work for arrays, though, since their size depends on length.
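As a worked example, the java.lang.String row above gives 24315288 / 1013137 = 24 bytes per instance. The same arithmetic as a tiny sketch, which assumes the four column layout shown in the histogram above (HistoLine is a hypothetical name):

// Hypothetical helper for illustration only.
final class HistoLine {

    /** Parses one "jmap -histo" row and returns bytes per instance. */
    static long averageInstanceSize(String row) {
        // Columns: rank, #instances, #bytes, class name
        String[] cols = row.trim().split("\\s+");
        long instances = Long.parseLong(cols[1]);
        long bytes = Long.parseLong(cols[2]);
        return bytes / instances;
    }
}

The java.util.HashMap$Entry row works out the same way: 15833040 / 659710 = 24 bytes.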

Java Object Layout tool

The Java Object Layout tool uses a number of different approaches for introspecting the physical layout of a Java object in memory.

Rust, JNI, Java

2016-07-21T22:53:00.000+01:00

Recently, I needed to make some calls to kernel32.dll from my Java code. Just a few system calls on the Windows platform, as simple as it sounds. Plus, I wanted to keep the resulting size of the binary as small as possible.

The latter requirement added a fair challenge to that task.

How to call platform code from Java?

JNI - Java Native Interface

JNI is built into the JVM and is part of the Java standard. Sounds good, but there is a catch. To call native code from Java via JNI, you have to write native code (e.g. using the C language). That is, JNI requires some glue code (aka bindings) between native calls and Java methods.
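On the Java side, the binding is just a method declared native plus a call that loads the library. The sketch below uses made-up names (Kernel32Glue, currentProcessId, library name myjni) purely for illustration. By JNI naming convention, the native library would have to export a matching function called Java_Kernel32Glue_currentProcessId for a class in the default package.

// Hypothetical binding class for illustration only.
final class Kernel32Glue {

    /** Implemented in native glue code; name and signature are illustrative. */
    static native int currentProcessId();

    /** Loads myjni.dll (libmyjni.so on Linux) from java.library.path; call once before use. */
    static void load() {
        System.loadLibrary("myjni");
    }
}

The glue code on the native side is what the rest of this post is about.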

Hm... do we have other alternatives?

JNA - Java Native Access

JNA is an alternative to JNI. You can call native code from Java with no glue code. Cool, what is the cost?

The JNA jar has a size of 1.1 MiB. An extra megabyte just to do a couple of simple calls to the Windows kernel: no deal.

Back to JNI

Ok, I need to write some glue code for JNI. What language to choose?

C/C++ - no, just no. The C/C++ tool chain (compiler, headers, build tools) is an abomination, especially on Windows. Please, I just need literally half a screen of code compiled to a dll binary. I do not want 10 GiB worth of Visual Studio polluting my desktop.

Die hard Java guy is speaking :)

Free Pascal

Pascal is an ancient language. It was the programming language of my youth. MS DOS, Turbo Pascal ... colors were so bright in those days.

Twenty years later, I was surprised to find Pascal in pretty good shape. Free Pascal has an impressive list of supported platforms. The Pascal compiler is lightning fast. Produced binaries have no dependency on libc / msvcrt.

Using Free Pascal, I got my kernel32-to-JNI dll with a size of 33 KiB. That sounds much, much better.

Can we do better?

Rust

Rust is the new kid on the language block. It has a strong ambition to replace C/C++ as a system level language. It gives you all the power of C plus memory safety, a modernized build system, and language level modules (crates).

Sounds promising, let's try Rust for a little JNI glue dll.

Calling

rustc -C debuginfo=0 --crate-type dylib myjni.rs

results in a disappointing 2.5 MiB binary.

A Rust dylib is a dll which can be used by other Rust code, so it exposes a lot of language specific metadata. cdylib is a new crate type introduced in Rust 1.10, which is more suitable for JNI bindings.

Command line

rustc -C lto -C debuginfo=0 --crate-type cdylib myjni.rs

has produced a 1.6 MiB binary. The -C lto option instructs the compiler to do "link time optimization". For some reason, cdylib would not compile without the lto option for me.

Ok, the direction is right, but we need to go much further. Let's try more compiler options.

Command line

rustc -C opt-level=3 -C lto -C debuginfo=0 --crate-type cdylib myjni.rs

has produced a 200 KiB binary. Optimization allows the compiler to throw away a big portion of the standard library which I will never need for my simple JNI binding.

Though, a large portion of the standard library is still there.

In Rust, you can turn off the standard library entirely (e.g. to run on bare metal).

Normally you would need at least memory management, but for a simple JNI binding you can get away with stack allocation only.

At the moment, using Rust with the no_std option requires a nightly build of the compiler. I also had to rewrite some portion of the kernel32 and JNI declarations to avoid a dependency on libc types.

rustc -C opt-level=3 -C panic=abort -C lto -C debuginfo=0 --crate-type cdylib myjni.rs

Binary size is 22.5 KiB.

Cool, we have beaten Free Pascal.

One more tweak: run strip -s on the resulting dll and the final binary size is 16.9 KiB.

Honestly, 16.9 KiB for a couple of calls is still overkill. But I'm not desperate enough to try assembly for a JNI binding, at least not today.

Conclusion

Free Pascal. IMHO, Free Pascal is a good choice if you need simple JNI bindings. As a bonus, Free Pascal on Linux has no dependency on the platform's dynamic libraries, so you can build cross-Linux-distro binaries.

Rust. I believe Rust has great potential. Rust has a unique memory safety model, yet it lets you get as close to bare metal as C does. Besides other features, Rust has really promising cross-compiling capabilities, which gives it a very strong position in the embedded / IoT space.

Yet, Rust needs to get more stable. The no_std feature is not available in the latest (1.10) stable release. cdylib is not supported by the latest stable cargo tool. The Rust tool chain on Windows depends on either MS Visual Studio or MSys. The resulting binaries are slightly incompatible with each other (Oracle JVM is built with Visual Studio, so using MSys-built JNI bindings leads to a process crash in certain cases).

Finalizers and References in Java

2016-03-16T21:49:00.000+00:00

Automatic memory management (garbage collection) is one of the essential aspects of the Java platform. Garbage collection relieves developers from the pain of memory management and protects them from a whole range of memory related issues. However, working with external resources (e.g. files and sockets) from Java becomes tricky, because the garbage collector alone is not enough to manage such resources.

Originally, Java had the finalizer facility. Later, special reference classes were added to deal with the same problem.

If we have some external resource which should be deallocated explicitly (a common case with native libraries), this task can be solved using either a finalizer or a phantom reference. What is the difference?

Finalizer approach

The code below implements resource housekeeping using a Java finalizer.

public class Resource implements ResourceFacade {

    public static AtomicLong GLOBAL_ALLOCATED = new AtomicLong(); 
    public static AtomicLong GLOBAL_RELEASED = new AtomicLong(); 

    int[] data = new int[1 << 10];
    protected boolean disposed;

    public Resource() {
        GLOBAL_ALLOCATED.incrementAndGet();
    }

    public synchronized void dispose() {
        if (!disposed) {
            disposed = true;
            releaseResources();
        }
    }

    protected void releaseResources() {
        GLOBAL_RELEASED.incrementAndGet();
    }    
}

public class FinalizerHandle extends Resource {

    protected void finalize() {
        dispose();
    }
}

public class FinalizedResourceFactory {

    public static ResourceFacade newResource() {
        return new FinalizerHandle();
    }    
}

Phantom reference approach

public class PhantomHandle implements ResourceFacade {

    private final Resource resource;

    public PhantomHandle(Resource resource) {
        this.resource = resource;
    }

    public void dispose() {
        resource.dispose();
    }    

    Resource getResource() {
        return resource;
    }
}

public class PhantomResourceRef extends PhantomReference<PhantomHandle> {

    private Resource resource;

    public PhantomResourceRef(PhantomHandle referent, ReferenceQueue<? super PhantomHandle> q) {
        super(referent, q);
        this.resource = referent.getResource();
    }

    public void dispose() {
        Resource r = resource;
        if (r != null) {
            r.dispose();
        }        
    }    
}

public class PhantomResourceFactory {

    private static Set<Resource> GLOBAL_RESOURCES = Collections.synchronizedSet(new HashSet<Resource>());
    private static ResourceDisposalQueue REF_QUEUE = new ResourceDisposalQueue();
    private static ResourceDisposalThread REF_THREAD = new ResourceDisposalThread(REF_QUEUE);

    public static ResourceFacade newResource() {
        ReferedResource resource = new ReferedResource();
        GLOBAL_RESOURCES.add(resource);
        PhantomHandle handle = new PhantomHandle(resource);
        PhantomResourceRef ref = new PhantomResourceRef(handle, REF_QUEUE);
        resource.setPhantomReference(ref);
        return handle;
    }

    private static class ReferedResource extends Resource {

        @SuppressWarnings("unused")
        private PhantomResourceRef handle;

        void setPhantomReference(PhantomResourceRef ref) {
            this.handle = ref;
        }

        @Override
        public synchronized void dispose() {
            handle = null;
            GLOBAL_RESOURCES.remove(this);
            super.dispose();
        }
    }

    private static class ResourceDisposalQueue extends ReferenceQueue<PhantomHandle> {

    }

    private static class ResourceDisposalThread extends Thread {

        private ResourceDisposalQueue queue;

        public ResourceDisposalThread(ResourceDisposalQueue queue) {
            this.queue = queue;
            setDaemon(true);
            setName("ReferenceDisposalThread");
            start();
        }

        @Override
        public void run() {
            while(true) {
                try {
                    PhantomResourceRef ref = (PhantomResourceRef) queue.remove();
                    ref.dispose();
                    ref.clear();
                } catch (InterruptedException e) {
                    // ignore
                }
            }
        }
    }
}

Implementing the same task using phantom references requires more boilerplate. We need a separate thread to handle the reference queue; in addition, we need to keep strong references to the allocated reference objects.

How finalizers work in Java

Under the hood, finalizers work very similarly to our phantom reference implementation, though the JVM hides the boilerplate from us.

Each time an instance of a class with a finalizer is created, the JVM creates an instance of the FinalReference class to track it. Once the object becomes unreachable, the FinalReference is triggered and added to the global final reference queue, which is processed by the system finalizer thread.

So the finalizer and phantom reference approaches work very similarly. Why should you bother with phantom references?

Comparing GC impact

Let's run a simple test: a resource object is allocated and then added to a queue; once the queue size hits its limit, the oldest reference is evicted and thrown away. For this test, we will monitor reference processing via GC logs.

Running finalizer based implementation.

[GC [ParNew[ ... [FinalReference, 5718 refs, 0.0063374 secs] ... 
Released: 6937 In use: 59498

Running phantom based implementation.

[GC [ParNew[ ... [PhantomReference, 5532 refs, 0.0037622 secs] ... 
Released: 5468 In use: 38897

As you can see, once an object becomes unreachable, it needs to be handled in the GC reference processing phase. Reference processing is a part of the Stop-the-World pause. If too many references become eligible for processing between collections, it may prolong the Stop-the-World pause significantly.

In the case above, there is not much difference between finalizers and phantom references. But let's change the workflow a little. Now we will explicitly dispose 99% of handles and rely on GC only for 1% of references (i.e. semiautomatic resource management).
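The two workloads can be sketched roughly as follows. This is not the test code from the article's repository, just a minimal illustration under my own assumptions: Handle stands in for ResourceFacade, and the class name, method name and parameters are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Supplier;

// Sketch of the eviction workload: handles are allocated into a bounded
// window and the oldest one is evicted once the window is full.
final class EvictionWorkload {

    // Stand-in for the article's ResourceFacade.
    interface Handle {
        void dispose();
    }

    // Runs the workload and returns the number of explicitly disposed handles.
    // disposePercent = 0 leaves every evicted handle to GC (first test),
    // disposePercent = 99 models the semiautomatic scenario (second test).
    static long run(Supplier<Handle> factory, int total, int windowSize, int disposePercent) {
        Deque<Handle> window = new ArrayDeque<>();
        long disposed = 0;
        for (int i = 0; i < total; i++) {
            window.addLast(factory.get());
            if (window.size() > windowSize) {
                Handle oldest = window.removeFirst();
                // Deterministic stand-in for "dispose N% of evicted handles".
                if (i % 100 < disposePercent) {
                    oldest.dispose();
                    disposed++;
                }
                // Otherwise the handle is simply dropped and left to GC.
            }
        }
        return disposed;
    }

    public static void main(String[] args) {
        Supplier<Handle> noop = () -> () -> { };
        System.out.println("GC only:       disposed=" + run(noop, 1_000_000, 50_000, 0));
        System.out.println("semiautomatic: disposed=" + run(noop, 1_000_000, 50_000, 99));
    }
}
```

To reproduce the measurements, the no-op factory would be replaced by a factory producing finalizer based or phantom based handles, with GC logging enabled.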

Running finalizer based implementation.

[GC [ParNew[ ... [FinalReference, 6295 refs, 0.0070033 secs] ...
Released: 6707 In use: 1457

Running phantom based implementation.

[GC [ParNew[ ... [PhantomReference, 625 refs, 0.0001551 secs] ... 
Released: 21682 In use: 1217

For the finalizer based implementation there is no difference. Explicit resource disposal doesn't help reduce GC overhead. But with phantoms, we can see that GC does not need to handle explicitly disposed references (so the number of references processed by GC is reduced by an order of magnitude).

Why is this happening? When a resource handle is disposed, we drop the reference to the phantom reference object. Once the phantom reference is unreachable, it will never be queued for processing by GC, thus saving time in the reference processing phase. It is quite the opposite with final references: once created, a final reference is strongly referenced by the JVM until it is processed by the finalizer thread.
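This effect can be observed in isolation. The sketch below is a hypothetical stand-alone demo (class and method names are mine, and it needs Java 9+ for Reference.reachabilityFence). System.gc() is only a hint, so the "kept" case is not guaranteed to enqueue on every JVM; the "dropped" case should never enqueue.

```java
import java.lang.ref.PhantomReference;
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;

final class PhantomQueueDemo {

    // Allocates a referent, makes it unreachable, and reports whether its
    // phantom reference shows up in the queue within the timeout.
    // keepReference = false models explicit disposal: the reference object
    // itself becomes garbage, so GC has nothing to enqueue.
    // timeoutMillis must be positive (0 would wait forever).
    static boolean enqueuedAfterGc(boolean keepReference, long timeoutMillis) {
        ReferenceQueue<Object> queue = new ReferenceQueue<>();
        Object referent = new Object();
        PhantomReference<Object> ref = new PhantomReference<>(referent, queue);

        referent = null;          // the referent is now unreachable
        if (!keepReference) {
            ref = null;           // the reference object is unreachable too
        }

        System.gc();              // a hint only; the JVM may ignore it
        try {
            return queue.remove(timeoutMillis) != null;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            // Keep 'ref' reachable until here so it cannot be collected early.
            Reference.reachabilityFence(ref);
        }
    }

    public static void main(String[] args) {
        System.out.println("reference kept:    enqueued=" + enqueuedAfterGc(true, 2000));
        System.out.println("reference dropped: enqueued=" + enqueuedAfterGc(false, 2000));
    }
}
```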

Conclusion

Using phantom references for resource housekeeping requires more work compared to the plain finalizer approach. But with phantom references you have far more granular control over the whole process and can implement a number of optimizations such as hybrid (manual + automatic) resource management.

Full source code used for this article is available at https://github.com/aragozin/example-finalization.

Flame Graphs Vs. Cold Numbers

2016-01-24T18:40:00.000+00:00

Stack trace sampling is a very powerful technique for performance troubleshooting. The advantages of stack trace sampling are

it doesn't require upfront configuration
cost added by sampling is small and controllable
it is easy to compare analysis result from different experiments

Unfortunately, the tools offered for stack trace analysis by widespread Java profilers are very limited.

Solving performance problems in complex applications (a lot of business logic etc) is one of my regular challenges. Let's assume I have another misbehaving application on my hands. The first step would be to localize the bottleneck to a specific part of the stack.

Meet call tree

A call tree is built by digesting a large number of stack traces. Each node in the tree has a frequency - the number of traces passing through this node.
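A minimal sketch of such a structure is below. The class is hypothetical and only illustrates the idea; it assumes each trace is given as a list of frame names ordered from the thread entry point down to the innermost call.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Call tree built from sampled stack traces (illustrative sketch).
final class CallTree {

    static final class Node {
        final String frame;
        int count; // number of traces passing through this node
        final Map<String, Node> children = new TreeMap<>();

        Node(String frame) {
            this.frame = frame;
        }
    }

    final Node root = new Node("<root>");

    // Adds one trace; frames are ordered from outermost to innermost.
    void add(List<String> trace) {
        Node node = root;
        node.count++;
        for (String frame : trace) {
            node = node.children.computeIfAbsent(frame, Node::new);
            node.count++;
        }
    }

    // Frequency of the node reached by following the given path from the root,
    // or 0 if no trace went that way. An empty path returns the total count.
    int frequency(String... path) {
        Node node = root;
        for (String frame : path) {
            node = node.children.get(frame);
            if (node == null) {
                return 0;
            }
        }
        return node.count;
    }

    public static void main(String[] args) {
        CallTree tree = new CallTree();
        tree.add(List.of("main", "service", "query"));
        tree.add(List.of("main", "service", "render"));
        tree.add(List.of("main", "idle"));
        System.out.println("main/service: " + tree.frequency("main", "service"));
    }
}
```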

Usually, tools allow you to navigate through a call tree reconstructed from the stack trace population.

There is also the flame graph visualization (shown at right top of page), which is fancier but is just the same tree.

Looking at these visualizations, what can I see? - Not too much.

Why? Business logic somewhere in the middle of the call tree produces too many branches. The tree beneath the business logic is blurred beyond the point of usability.

Dissecting call tree

The application is built using frameworks. For the sake of this article, I'm using an example based on JBoss, JSF, Seam, Hibernate.

Now, if 13% of traces in our dump contain JDBC, we can conclude that 13% of time is spent in JDBC / database calls.
13% is a reasonable number, so the database is not to blame here.

Let's go down the stack; Hibernate is the next layer. Now we need to count all traces containing Hibernate classes, excluding the ones containing JDBC. This way we can attribute traces to a particular framework and quickly get a picture of where time is spent at runtime.
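The attribution rule can be sketched in a few lines of Java: rules are checked in order and a trace goes to the first category that matches, so a trace with both Hibernate and JDBC frames is counted as JDBC only. This is an illustration of the idea with hypothetical names, not the implementation of any particular tool.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Attributes each stack trace to the first matching category (sketch).
final class TraceCategorizer {

    // Insertion order matters: earlier rules take precedence.
    private final Map<String, Predicate<List<String>>> rules = new LinkedHashMap<>();

    TraceCategorizer rule(String name, Predicate<List<String>> matcher) {
        rules.put(name, matcher);
        return this;
    }

    // Matches a trace if any of its frames contains the given fragment.
    static Predicate<List<String>> anyFrameContains(String fragment) {
        return trace -> trace.stream().anyMatch(frame -> frame.contains(fragment));
    }

    // Returns category -> number of traces; unmatched traces go to "Other".
    Map<String, Integer> categorize(List<List<String>> traces) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        rules.keySet().forEach(name -> counts.put(name, 0));
        counts.put("Other", 0);
        for (List<String> trace : traces) {
            String category = "Other";
            for (Map.Entry<String, Predicate<List<String>>> rule : rules.entrySet()) {
                if (rule.getValue().test(trace)) {
                    category = rule.getKey();
                    break;
                }
            }
            counts.merge(category, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        TraceCategorizer categorizer = new TraceCategorizer()
                .rule("JDBC", anyFrameContains(".jdbc."))
                .rule("Hibernate", anyFrameContains("org.hibernate."));
        System.out.println(categorizer.categorize(List.of(
                List.of("app.Main.run", "org.hibernate.Session.list", "oracle.jdbc.Statement.execute"),
                List.of("app.Main.run", "org.hibernate.Session.flush"),
                List.of("app.Main.run"))));
    }
}
```

Because every trace lands in exactly one category, the percentages always add up to 100%.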

I didn't find any tool that could do this kind of analysis for me, so I built one for myself a few years ago. SJK is my universal Java troubleshooting toolkit.

Below is the command doing the analysis explained above.

sjk ssa -f tracedump.std  --categorize -tf **.CoyoteAdapter.service -nc
JDBC=**.jdbc 
Hibernate=org.hibernate
"Facelets compile=com.sun.faces.facelets.compiler.Compiler.compile"
"Seam bijection=org.jboss.seam.**.aroundInvoke/!**.proceed"
JSF.execute=com.sun.faces.lifecycle.LifecycleImpl.execute
JSF.render=com.sun.faces.lifecycle.LifecycleImpl.render
Other=**

Below is output of this command.

Total samples    2732050 100.00%
JDBC              405439  14.84%
Hibernate         802932  29.39%
Facelets compile  395784  14.49%
Seam bijection    385491  14.11%
JSF.execute       290355  10.63%
JSF.render        297868  10.90%
Other             154181   5.64%

Well, we clearly see a large amount of time spent in Hibernate. This is very wrong, so it is the first candidate for investigation. We also see that a lot of CPU is spent on JSF compilation, though pages should be compiled just once and cached (it turned out to be a configuration issue). Actual application logic falls into JSF life cycle calls (execute(), render()). It would be possible to introduce an additional category to isolate pure application logic execution time, but looking at the numbers, I would say it is not necessary until the other problems are solved.

Hibernate is our primary suspect; how do we look inside? Let's look at a method histogram for traces attributed to Hibernate, trimming away all frames up to the first Hibernate method call.

Below is the command to do this.

sjk ssa -f tracedump.std --histo -tf **!**.jdbc -tt org.hibernate

Here is top of histogram produced by command

Trc     (%)  Frm  N  Term    (%)  Frame                                                                                                                                                                                  
699506  87%  699506       0   0%  org.hibernate.internal.SessionImpl.autoFlushIfRequired(SessionImpl.java:1204)                                                                                                          
689370  85%  689370      10   0%  org.hibernate.internal.QueryImpl.list(QueryImpl.java:101)                                                                                                                              
676524  84%  676524       0   0%  org.hibernate.event.internal.DefaultAutoFlushEventListener.onAutoFlush(DefaultAutoFlushEventListener.java:58)                                                                          
675136  84%  675136       0   0%  org.hibernate.internal.SessionImpl.list(SessionImpl.java:1261)                                                                                                                         
573836  71%  573836       4   0%  org.hibernate.ejb.QueryImpl.getResultList(QueryImpl.java:264)                                                                                                                          
550968  68%  550968       1   0%  org.hibernate.event.internal.AbstractFlushingEventListener.flushEverythingToExecutions(AbstractFlushingEventListener.java:99)                                                          
533892  66%  533892     132   0%  org.hibernate.event.internal.AbstractFlushingEventListener.flushEntities(AbstractFlushingEventListener.java:227)                                                                       
381514  47%  381514     882   0%  org.hibernate.event.internal.AbstractVisitor.processEntityPropertyValues(AbstractVisitor.java:76)                                                                                      
271018  33%  271018       0   0%  org.hibernate.event.internal.DefaultFlushEntityEventListener.onFlushEntity(DefaultFlushEntityEventListener.java:161)

Here is our suspect. We spend 87% of Hibernate time in the autoFlushIfRequired() call (and JDBC time is already excluded).

Using a few commands, we have narrowed down one performance bottleneck. Fixing it is another topic, though.

In the case I'm using as an example, CPU usage of the application was reduced by a factor of 10. A few problems found and addressed during that case were

Hibernate usage was optimized
facelets compilation caching was properly configured
a workaround for a performance bug in the Seam framework was implemented
JSF layouts were optimized to reduce the number of Seam injections / outjections

Limitations of this approach

During statistical analysis of stack traces you deal with wallclock time; you cannot guess real CPU time using this method. If the CPU on the host is saturated, your numbers will be skewed by thread idle time due to CPU starvation.

Normally you can get a stack trace only at JVM safepoints. So if some methods are inlined by the JIT compiler, they may never appear in a trace even if they are really busy. In other words, the tip of the stack trace may be skewed by JIT effects. In practice, it was never an obstacle for me, but you should keep the possibility of such an effect in mind.

What about flame graphs?

Well, despite being not so useful, they look good in presentations. Support for flame graphs was added to SJK recently.

Update

After some time, I've found myself using flame graphs very actively. Yes, in certain situations this type of visualization doesn't make sense, but as a first bird's-eye look at the problem flame graphs are indispensable.

Does Linux hate Java?

2015-10-12T20:53:00.000+01:00

Recently, I discovered a fancy bug affecting a few versions of the Linux kernel. Without any warning, the JVM just hangs in a GC pause forever. The root cause is an improper memory access in kernel code. This post by Gil Tene gives a good technical explanation with deep emotional coloring.

While this bug is not JVM specific, there are few other multithreaded processes you can find on a typical Linux box.

This recent bug reminded me of a few other cases where Linux screws Java badly.

Transparent huge pages

The transparent huge pages feature was introduced in version 2.6.38 of the kernel. While it was intended to improve performance, a lot of people report negative effects related to this feature, especially for memory intensive processes such as the JVM and some database engines.

Oracle - Performance Issues with Transparent Huge Pages

Transparent Huge Pages and Hadoop workloads

Why TokuDB Hates Transparent Huge Pages

Leap seconds bug

The famous leap second bug in Linux produced a whole plague across data centers in 2012. Java and MySQL were affected most badly. What do Java and MySQL have in common? Both use threads extensively.

So, Linux, could you be a little more gentle with Java, please ;)

SJK - missing link in Java profiling tool chain

2015-08-04T05:52:00.000+01:00

Sometimes it just happens. You have a bloated Java application on your hands and it does not perform well. You may have built this application yourself or just got it as it is now. It doesn't matter; the thing is - you do not have the slightest idea what is wrong here.

The Java ecosystem has an abundance of diagnostic tools (thanks to the interfaces exposed by the JVM itself), but they are mostly focused on specific, narrow kinds of problems. Despite calling themselves intuitive, they assume you have a lot of background knowledge about the JVM and profiling techniques. Honestly, even a seasoned Java developer (I'm speaking for myself here) can feel lost the first time looking at JProfiler, YourKit or Mission Control.

If you have a performance problem on your hands, the first thing you need to do is classify the problem: is it in Java, in the database or somewhere else? Is it a CPU or a memory kind of problem? Once you know what kind of problem you have, you can choose the next diagnostic approach consciously.

Are we CPU bound?

One of the first things you would naturally do is check the CPU usage of your process. The OS can show you process CPU usage. That is useful, but the next question is which threads are consuming it. The OS can show you thread usage too; you can even get OS IDs for your Java threads using jstack and correlate them ... manually (sick).

A simple tool showing CPU usage per Java thread is the thing I wanted badly for years.

Surprisingly, all the information is already in the JMX Threading MBean. All that is left is to do some trivial math and report per thread CPU usage. So I just did it, and the ttop command became the first in the SJK tool set.

Besides CPU usage, JMX has another invaluable metric - the per thread allocation counter.

Collecting information from JMX is safe and can be done on a live application instance (if you do not have a JMX port open, SJK can connect using the process ID).
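The "trivial math" is just the difference between two readings of the cumulative per thread CPU counter, divided by the wallclock time between them. Below is a minimal in-process sketch using the standard java.lang.management.ThreadMXBean; the class name and structure are mine, and SJK itself reads the same MBean over a JMX connection rather than locally.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.HashMap;
import java.util.Map;

// Per thread CPU usage from two snapshots of the Threading MBean (sketch).
final class ThreadCpuSampler {

    private final ThreadMXBean threads = ManagementFactory.getThreadMXBean();

    // Returns thread ID -> cumulative CPU time in nanoseconds.
    Map<Long, Long> snapshot() {
        Map<Long, Long> result = new HashMap<>();
        if (!threads.isThreadCpuTimeSupported() || !threads.isThreadCpuTimeEnabled()) {
            return result; // this JVM does not measure per thread CPU time
        }
        for (long id : threads.getAllThreadIds()) {
            long cpu = threads.getThreadCpuTime(id); // -1 if the thread has died
            if (cpu >= 0) {
                result.put(id, cpu);
            }
        }
        return result;
    }

    // CPU usage in percent of one core for threads present in both snapshots.
    static Map<Long, Double> usagePercent(Map<Long, Long> before, Map<Long, Long> after, long wallNanos) {
        Map<Long, Double> usage = new HashMap<>();
        for (Map.Entry<Long, Long> entry : after.entrySet()) {
            Long previous = before.get(entry.getKey());
            if (previous != null) {
                usage.put(entry.getKey(), 100.0 * (entry.getValue() - previous) / wallNanos);
            }
        }
        return usage;
    }

    public static void main(String[] args) throws InterruptedException {
        ThreadCpuSampler sampler = new ThreadCpuSampler();
        Map<Long, Long> before = sampler.snapshot();
        long start = System.nanoTime();
        Thread.sleep(1000);
        Map<Long, Long> after = sampler.snapshot();
        usagePercent(before, after, System.nanoTime() - start)
                .forEach((id, cpu) -> System.out.printf("[%06d] cpu=%5.2f%%%n", id, cpu));
    }
}
```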

Below is example of ttop command output.

2014-10-01T19:27:22.825+0400 Process summary
  process cpu=101.80%
  application cpu=100.50% (user=86.21% sys=14.29%)
  other: cpu=1.30%
  GC cpu=0.00% (young=0.00%, old=0.00%)
  heap allocation rate 123mb/s
  safe point rate: 1.5 (events/s) avg. safe point pause: 0.14ms
  safe point sync time: 0.00% processing time: 0.02% (wallclock time)
[000037] user=83.66% sys=14.02% alloc=  121mb/s - Proxy:ExtendTcpProxyService1:TcpAcceptor:TcpProcessor
[000075] user= 0.97% sys= 0.08% alloc=  411kb/s - RMI TCP Connection(35)-10.139.200.51
[000029] user= 0.61% sys=-0.00% alloc=  697kb/s - Invocation:Management
[000073] user= 0.49% sys=-0.01% alloc=  343kb/s - RMI TCP Connection(33)-10.128.46.114
[000023] user= 0.24% sys=-0.01% alloc=   10kb/s - PacketPublisher
[000022] user= 0.00% sys= 0.10% alloc=   11kb/s - PacketReceiver
[000072] user= 0.00% sys= 0.07% alloc=   22kb/s - RMI TCP Connection(31)-10.139.207.76
[000056] user= 0.00% sys= 0.05% alloc=   20kb/s - RMI TCP Connection(25)-10.139.207.76
[000026] user= 0.12% sys=-0.07% alloc=  2217b/s - Cluster|Member(Id=18, Timestamp=2014-10-01 15:58:3 ...
[000076] user= 0.00% sys= 0.04% alloc=  6657b/s - JMX server connection timeout 76
[000021] user= 0.00% sys= 0.03% alloc=   526b/s - PacketListener1P
[000034] user= 0.00% sys= 0.02% alloc=  1537b/s - Proxy:ExtendTcpProxyService1
[000049] user= 0.00% sys= 0.02% alloc=  6011b/s - JMX server connection timeout 49
[000032] user= 0.00% sys= 0.01% alloc=     0b/s - DistributedCache

Besides CPU and allocation, it also collects "true" GC usage and safe point statistics. The latter two metrics are not available via JMX, so they are available only for process ID connections.

The CPU usage picture will give you good insight into what to do next: should you profile your Java hot spots, or is all the time spent waiting for results from the DB?

Garbage analysis

Another common class of Java problems is related to garbage collection. If this is the case, GC logs are the first place to look.

Do you have them enabled? If not, that is not a big deal; you can enable GC logging on a running JVM process using the jinfo command. You can also use SJK's gc command to peek at GC activity for your Java process (it is not as complete as GC logs, though).
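For a rough idea of what a GC peek can look like, here is a small in-process sketch based on the standard GarbageCollectorMXBean. It is an assumption-laden analogue for illustration only (the class name is made up), not a description of how SJK's gc command is implemented.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Prints cumulative GC counters exposed by the JVM (sketch).
final class GcPeek {

    // Total number of collections across all collectors. Collectors that do
    // not report a count return -1 and are ignored.
    static long totalCollections() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            total += Math.max(0, gc.getCollectionCount());
        }
        return total;
    }

    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: collections=%d, total time=%d ms%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
    }
}
```

Polling these counters periodically and printing the deltas gives a coarse view of GC frequency and cost without touching JVM flags.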

If GC logs confirm that GC is causing you problems, the next step is to identify where that garbage comes from.

Commercial profilers are good at memory profiling, but this kind of analysis slows down the target application dramatically.

Mission Control stands out from the pack here; it can profile by sampling TLAB allocation failures. This technique is cheap and generally produces good results, though it is inherently biased and may mislead you sometimes.

For a long time, jmap and the class histogram were my main memory profiling instruments. The class histogram is simple and accurate.

In the SJK toolset, I have augmented the vanilla jmap command a little to make it more useful (SJK's hh command).

Beware that jmap (and thus the hh command) requires a Stop-the-World pause on the target JVM while the heap is being walked, so it may not be a good idea to execute it against a live application under load.

The dead heap histogram is calculated as the difference between the object population before and after a forced GC (using the jmap class histogram command under the hood).

The dead young heap histogram forces a full GC, then waits 10 seconds (by default), then produces a dead object histogram using the technique described above. Thus you see a summary of freshly allocated garbage.
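The subtraction itself is simple. The sketch below assumes both histograms have already been parsed into maps of class name to instance count; the class and method names are hypothetical, and obtaining the histograms (e.g. via jmap) is left out.

```java
import java.util.Map;
import java.util.TreeMap;

// Difference between class histograms taken before and after a forced GC.
final class HistogramDiff {

    // Returns class name -> number of instances that disappeared during GC.
    // Classes whose population did not shrink are omitted.
    static Map<String, Long> deadObjects(Map<String, Long> before, Map<String, Long> after) {
        Map<String, Long> dead = new TreeMap<>();
        for (Map.Entry<String, Long> entry : before.entrySet()) {
            long delta = entry.getValue() - after.getOrDefault(entry.getKey(), 0L);
            if (delta > 0) {
                dead.put(entry.getKey(), delta);
            }
        }
        return dead;
    }

    public static void main(String[] args) {
        Map<String, Long> before = Map.of("byte[]", 100L, "java.lang.String", 50L);
        Map<String, Long> after = Map.of("byte[]", 40L, "java.lang.String", 50L);
        System.out.println(deadObjects(before, after)); // only byte[] shrank
    }
}
```

The same subtraction works for byte sizes instead of instance counts.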

These methods cannot tell you where in your code that garbage was allocated (this is a job for Mission Control et al). Though, if you know what your top garbage objects are, you may already know where they are allocated.

SJK has a few more tools, but these two, ttop and hh, are always on the front lines when I need to tackle another performance related problem.

So, you have dumped 150 GiB of JVM heap, now what?

2015-02-23T12:14:00.001+00:00

A 150 GiB JVM heap dump is lying on my hard drive and I need to analyze a specific problem detected in that process.

This is a dump of a proprietary hybrid of an in-memory RDBMS and a CEP system that I'm responsible for. All data is stored in the Java heap, so the heap size of some installations is huge (a 400 GiB heap is the largest to date).

The problem of analyzing huge heap dumps had been on my radar for some time, so I wasn't unprepared.

To be honest, I haven't tried to open this file in Eclipse Memory Analyzer, but I doubt it could handle it.

For some time, the most useful feature of heap analyzers for me has been JavaScript based queries. Clicking through millions of objects is not fun. It is much better to walk the object graph with code, not with a mouse.

A heap dump is just a serialized graph of objects; my goal is to extract specific information from this graph. I do not really need a fancy UI; an API to the heap graph would be even better.

How can I analyze a heap dump programmatically?

I started my research with the NetBeans profiler (it was a year ago). NetBeans is open source and has a visual heap dump analyzer (the same component is also used in JVisualVM). It turns out that the heap dump processing code is a separate module, and the API it provides is suitable for custom analysis logic.

The NetBeans heap analyzer has a critical limitation, though. It uses a temporary file to keep an internal index of the heap dump. This file is typically around 25% of the size of the heap dump itself. But most importantly, it takes time to build this file before any query to the heap graph is possible.

After taking a better look, I decided I could remove this temporary file. I have forked the library (my fork is available at GitHub). Some functions were lost together with the temporary file (e.g. backward reference traversing), but they are not needed for my kind of tasks.

Another important change to the original library was implementing HeapPath.
HeapPath is an expression language for the object graph. It is useful both as a generic predicate language in graph traversal algorithms and as a simple tool to extract data from an object dump. HeapPath automatically converts strings, primitives and a few other simple types from heap dump structures to normal objects.

This library has proved very useful in our daily job. One of its applications was a memory reporting tool for our database/CEP system, which automatically reports the actual memory consumption of every relational transformation node (there could be a few hundred nodes in a single instance).

For interactive exploration, API + Java is not the best set of tools, though. But it lets me do my job (and 150 GiB of dump leaves me no alternatives).

Should I be adding some JVM scripting language to the mix ...

BTW: A single pass through 150 GiB takes about 5 minutes. Meaningful analysis usually employs multiple iterations, but processing times are fairly reasonable even for that heap size.

Binary search - is it still most optimal?

2015-02-11T21:19:00.001+00:00

If you have a sorted collection of elements, how would you find the index of a specific value?
"Binary search" is likely to be your answer.
Algorithm theory teaches us that binary search is the optimal algorithm for this task, with log(N) complexity.
Well, a hash table can do better if you need to find a key by exact match. In many cases, though, you have reasons to have your collection sorted, not hashed.

At my job, I'm working on a sophisticated in-memory database tailored for streaming data processing. We have a lot of places where we deal with sorted collections of integers (data row references, etc).

Algorithm theory is good, but in reality there are things like cache hierarchy, branch prediction and superscalar execution which may skew performance in edge cases.

The question is - where does the border lie between reality ruled by CPU quirks and the lawful space of classic algorithm theory?

If in doubt - do an experiment.

The experiment is simple: I generate a large number of sorted arrays of 32 bit integers. Then I search for a random key in a random array multiple times. In each experiment, the average size of the arrays is fixed. A large number of arrays is used to ensure cold memory access. The average search time is measured.

All code is written in Java and measured using the JMH tool.

Participants are

Binary search - java.util.Arrays.binarySearch()
Linear search - simple loop over array until key is found
Linear search 2 - looping over every second element in array, if greater key is found, check i - 1 index too
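The two linear variants could look roughly like the sketch below. This is my reading of the one-line descriptions above, not the benchmarked code: both methods return the index of the key or -1, and both stop early once a greater element is seen, which is an assumption on my part for the first variant.

```java
import java.util.Arrays;

// Linear search variants over a sorted int array (illustrative sketch).
final class SortedSearch {

    // Plain linear scan; stops as soon as an element >= key is reached.
    static int linearSearch(int[] a, int key) {
        for (int i = 0; i < a.length; i++) {
            if (a[i] >= key) {
                return a[i] == key ? i : -1;
            }
        }
        return -1;
    }

    // Scans every second element; when an element >= key is found,
    // the skipped neighbour at i - 1 is checked as well.
    static int linearSearch2(int[] a, int key) {
        int i = 1;
        for (; i < a.length; i += 2) {
            if (a[i] >= key) {
                if (a[i] == key) {
                    return i;
                }
                return a[i - 1] == key ? i - 1 : -1;
            }
        }
        // For odd lengths, the last element was skipped by the loop.
        if (i - 1 < a.length && a[i - 1] == key) {
            return i - 1;
        }
        return -1;
    }

    public static void main(String[] args) {
        int[] data = {1, 3, 5, 7, 9};
        System.out.println(Arrays.binarySearch(data, 7)); // 3
        System.out.println(linearSearch(data, 7));        // 3
        System.out.println(linearSearch2(data, 7));       // 3
    }
}
```

Unlike Arrays.binarySearch(), these sketches do not encode the insertion point for missing keys; they simply return -1.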

X axis is the average array length
Y axis is the average time of a single search in microseconds
Measurements have been done on 3 different types of CPU.

Results speak for themselves.

I was a little surprised, as I was expecting binary search to outperform linear search at a length of 32 or 64, but it seems that modern processors are very good at optimizing linear memory access.

Given that 8 - 128 is a practical range for BTree-like structures, I am likely to reconsider some of the data structures used in our database.

Tech Talk: "Casual" mass parallel data processing in Java

2014-03-01T13:00:00.000+00:00

On March 1st, I was speaking at the NoSQL day meetup in Minsk, Belarus.

"Casual" mass parallel data processing in Java may sound like a weird topic. Nevertheless, sometimes you have to get the job done, and setting up a computation grid infrastructure may not be the shortest path.

Below is the slide deck from the event.

TechTalk: Java Garbage Collection - Theory and Practice

2013-12-17T19:00:00.000+00:00

Below are slide decks for open event held in Moscow Technology Center of Deutsche Bank.

Topic of event was garbage collection in JVM.

Part 1 by Alexey Ragozin

Part 2 by Alexander Ashitkin

TechTalk: Virtualizing Java in Java

2013-12-12T15:00:00.000+00:00

On 12th December, I was speaking at JUG in Saint-Petersburg, Russia.

It was a long talk about using NanoCloud.

Below is video

and slide deck from event

Coherence SIG - Filtering 100M objects in cache

2013-12-05T21:10:00.000+00:00

Today I was speaking at the Coherence SIG event in London.

My topic was "Filtering 100M objects. What can go wrong?". It was a story of solving a particular problem and the obstacles we encountered. One noticeable thing about this project - our team was using a Performance Test Driven Development approach.

We started with the simplest solution, then focused on the problems identified by testing.

The slide deck from the presentation is below.

Coherence 101 - Soothing the Guardian

2013-11-14T22:01:00.000+00:00

Guardian was introduced in Oracle Coherence 3.5 as a uniform and reliable means to detect and report various stalls and hangs on data grid members. In addition to monitoring internal components of Coherence, Guardian has an API accessible to application developers.

While the out-of-box Guardian does its job pretty well, there are a few aspects you can improve.

There are 3 techniques for working with the Coherence Guardian. You can choose to employ all of them or just a few.

Guardian heartbeats

Guardian is using heartbeat mechanics to detect thread stalls. Internally Coherence code explicitly heartbeat in appropriate points in code. Application code could use similar technique if long execution time is expected. CacheStores are good example of this.

GuardSupport.heartbeat() – sends normal heartbeat
GuardSupport.heartbeat(long) – allows you to pass expected time till next heartbeat (e.i. if you expect that SQL query to take several minutes, you could prevent log warning by passing reasonably long timeout before execution SQL statement)

Implementing guardable

Normally the guardian would try to "recover" a thread if no heartbeats were received during the timeout (either specified in configuration or by the last heartbeat(...) call).
This behavior can be overridden though. An application can register its own Guardable and temporarily disable monitoring of the current thread. Below is a code snippet which wraps cache loader operations in a Guardable, preventing thread interruption (the default way to "recover" a worker thread).

public static class GuardianAwareCacheLoader implements CacheLoader {

    private CacheLoader loader;

    public GuardianAwareCacheLoader(CacheLoader loader) {
        this.loader = loader;
    }

    @Override
    public Object load(Object key) {
        GuardContext ctx = GuardSupport.getThreadContext();
        if (ctx != null) {
            KeyLoaderGuard guard = new KeyLoaderGuard(Collections.singleton(key));
            GuardContext klg = ctx.getGuardian().guard(guard); 
            GuardSupport.setThreadContext(klg);
        }
        try {
            return loader.load(key);
        }
        finally {
            if (ctx != null) {
                GuardContext klg = GuardSupport.getThreadContext();
                GuardSupport.setThreadContext(ctx);
                klg.release();
            }
        }
    }

    @Override
    @SuppressWarnings({ "rawtypes", "unchecked" })
    public Map loadAll(Collection keys) {
        GuardContext ctx = GuardSupport.getThreadContext();
        if (ctx != null) {
            KeyLoaderGuard guard = new KeyLoaderGuard(keys);
            GuardContext klg = ctx.getGuardian().guard(guard); 
            GuardSupport.setThreadContext(klg);
            // disable current context
            ctx.heartbeat(TimeUnit.DAYS.toMillis(365));
        }
        try {
            return loader.loadAll(keys);
        }
        finally {
            if (ctx != null) {
                GuardContext klg = GuardSupport.getThreadContext();
                GuardSupport.setThreadContext(ctx);
                klg.release();
                // reenable current context
                ctx.heartbeat();
            }
        }
    }
}

public static class KeyLoaderGuard implements Guardable {

    Collection<Object> keys;
    GuardContext context;

    public KeyLoaderGuard(Collection<Object> keys) {
        this.keys = keys;
    }

    @Override
    public GuardContext getContext() {
        return context;
    }

    @Override
    public void setContext(GuardContext context) {
        this.context = context;
    }

    @Override
    public void recover() {
        System.out.println("got RECOVER signal");
        context.heartbeat();
    }

    @Override
    public void terminate() {
        System.out.println("got TERMINATE signal");
    }

    @Override
    public String toString() {
        return "KeyLoaderGuard:" + keys;
    }
}

Using a custom Guardable provides the following advantages:

Additional context information is available and is logged for the custom Guardable (e.g. the SQL statement causing problems).
Custom code can choose how to react to a timeout. You can choose to continue or try to cancel the request somehow (e.g. by closing the JDBC connection).

Custom service failure policy

Service failure policy is responsible for reacting to guardian timeouts and critical service failures. The reaction is configurable, but for standalone Coherence processes I prefer to override this policy.

Below is an example of a service failure policy which I find more reasonable for dedicated Coherence nodes.

public class ServiceFailureHandler implements ServiceFailurePolicy {

    private final static Logger LOGGER = LogManager.getLogger(ServiceFailureHandler.class);

    @Override
    public void onGuardableRecovery(Guardable guardable, Service service) {
        LOGGER.warn("Soft timeout detected. Service: " + service.getInfo().getServiceName() + " Task: " + guardable);
        guardable.recover();
    }

    @Override
    public void onGuardableTerminate(Guardable guardable, Service service) {
        LOGGER.error("Hard timeout detected. Service: " + service.getInfo().getServiceName()
                     + " Task: " + guardable + ". Node will be terminated.");
        halt();
    }

    @Override
    public void onServiceFailed(Cluster cluster) {
        LOGGER.error("Service failure detected. Node will be terminated.");
        halt();
    }

    private static void halt() {
        try {
            ThreadUtil.logThreadDump(LOGGER);
            LogManager.shutdown();
            System.out.flush();
            System.err.flush();
        } finally {
            Runtime.getRuntime().halt(1);
        }
    }
}

Compared to the standard policy, it has the following advantages:

In case of service failure, the process is terminated quickly (without waiting for shutdown hooks etc.). In my case, the process is then immediately restarted by an external watchdog.
"Soft timeouts" will not pollute the log with thread dumps. The only thread dump is logged just before termination of the process (which is especially important when implementing a custom Guardable).

Conclusion

Integrating your application with Coherence Guardian doesn't require much code, but it can make your logs clearer and troubleshooting less painful. While it will not make your application work faster, it could save hours of digging through logs.

HotSpot JVM garbage collection options cheat sheet (v3)

2013-11-06T20:38:00.001+00:00

Updated version is available!

Two years ago I published a cheat sheet for garbage collection options in HotSpot JVM.

Recently I decided to refresh that work, and today I'm publishing the first HotSpot JVM options ref card, covering generic GC options and CMS tuning. (G1 has gained plenty of tuning options during the last two years, so it will have a dedicated ref card.)

Content-wise, GC log rotation options have been added and a few esoteric CMS diagnostic options have been removed.

Two page PDF version

Single page PDF version

JVM deep dive at HighLoad++ 2013 (Moscow)

2013-10-29T20:00:00.000+00:00

Today I was speaking at HighLoad++ 2013 in Moscow. I had two presentations covering deep internals of the JVM: one about JIT compilation and the other about pauseless garbage collection algorithms.

Slide decks are below (in Russian)

JIT-компиляция в виртуальной машине Java (HighLoad++ 2013)

Cборка мусора в Java без пауз (HighLoad++ 2013)

Performance Test Driven Development (CEE SECR 2013 Moscow)

2013-10-25T20:00:00.000+01:00

Today I was speaking at CEE SECR 2013 in Moscow.

Below is the slide deck from the presentation (in Russian)

Performance Test Driven Development

Coherence 101 - EntryProcessor traffic amplification

2013-09-10T19:00:00.000+01:00

Oracle Coherence data grid has a powerful tool for in-place data manipulation: EntryProcessor. Using an entry processor you can get reasonable atomicity guarantees without locks or transactions (and without the drastic performance penalty associated with them).

One good example of an entry processor is the built-in ConditionalPut processor, which verifies a certain condition before overwriting a value. This, in turn, can be used for implementing optimistic locking and other patterns.

ConditionalPut accepts only one value, but a ConditionalPutAll processor is also available. ConditionalPutAll accepts a map of keys to values. Using it, we can update multiple cache entries with a single call to the NamedCache API.

But there is one caveat.

We have placed values for all keys in a single map instance inside the entry processor object. On the other hand, in a distributed cache, keys are distributed across different processes.
How would the right values be transferred to the right keys?

The answer is simple: every node owning at least one of the keys to be updated will receive a copy of the whole map of values.
In other words, in a mid-size cluster (e.g. 20 nodes) you may actually transfer 20 times more data over the network than is really needed.
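
The arithmetic behind this claim can be written down as a back-of-the-envelope sketch. The class and method names below are hypothetical, the 10 MiB payload is an arbitrary illustration, and the model deliberately ignores serialization overhead and backup traffic; it assumes only what is stated above, namely that each owning node receives the whole map.

```java
/** Back-of-the-envelope model of entry processor traffic amplification (hypothetical helper). */
final class TrafficAmplification {

    /** Bytes sent when every owning node receives the whole map of values. */
    static long broadcastBytes(long payloadBytes, int owningNodes) {
        return payloadBytes * owningNodes;
    }

    /** Amplification relative to the ideal case where each value travels only to its owner. */
    static double amplification(long payloadBytes, int owningNodes) {
        return (double) broadcastBytes(payloadBytes, owningNodes) / payloadBytes;
    }

    public static void main(String[] args) {
        long payload = 10L * 1024 * 1024; // illustrative 10 MiB map of values
        int nodes = 20;                   // every node owns at least one key
        System.out.println("ideal bytes:  " + payload);
        System.out.println("actual bytes: " + broadcastBytes(payload, nodes));
        System.out.println("factor:       " + amplification(payload, nodes));
    }
}
```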

Modern networks are quite good, and you may not notice this traffic amplification effect for some time (as long as your network bandwidth can handle it). But once traffic reaches the network limit, things start to fall apart.

Coherence TCMP protocol is very aggressive at grabbing as much network bandwidth as it can, so other communication protocols will likely perish first.
JDBC connections are a likely victim of bandwidth shortage.
Coherence*Extend connections may also suffer (they use TCP), and proxy nodes may start to fail in unusual ways (e.g. with OutOfMemoryError due to transmission backlog overflow).

This problem may be hard to diagnose. TCP is much more vulnerable to bandwidth shortage, so you will be distracted by TCP communication problems while the root cause is excessive TCMP cluster traffic.

Monitoring TCMP statistics (available via MBean) can give you insight into TCMP's network bandwidth consumption and network health, and help find the root cause.

Isolating TCMP on a separate switch is also good practice, by the way.

But how to fix it?

Manual data splitting

A simple solution is to split the key set by owning node and then invoke the entry processor for each subset individually. Coherence API allows you to find the node owning a particular key.
This approach is far from ideal though:

it will not work for Extend clients,
you either have to process all subsets sequentially or use threads to make several parallel calls to the Coherence API,
splitting of the key set complicates application logic.
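
The splitting step itself is plain grouping. The sketch below is self-contained and does not touch Coherence: the ownerOf function is a stand-in for whatever key-to-node lookup your cluster provides (in Coherence I would expect this to come from the partitioned cache service, but treat that as an assumption), and the class name KeySplitter is hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/** Groups a map of values by the node owning each key (hypothetical helper). */
final class KeySplitter {

    static <K, V, N> Map<N, Map<K, V>> splitByOwner(Map<K, V> values,
                                                    Function<? super K, ? extends N> ownerOf) {
        Map<N, Map<K, V>> perNode = new LinkedHashMap<>();
        for (Map.Entry<K, V> e : values.entrySet()) {
            // ownerOf is an assumed lookup: key -> identifier of the owning node
            N owner = ownerOf.apply(e.getKey());
            perNode.computeIfAbsent(owner, n -> new LinkedHashMap<>())
                   .put(e.getKey(), e.getValue());
        }
        return perNode;
    }
}
```

Each resulting sub-map can then be sent with its own entry processor call restricted to that sub-map's keys, so every node receives only the values it owns.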

Triggers

Another option is relocating your logic from the entry processor to a trigger and replacing invokeAll() with putAll() (putAll() does not suffer from traffic amplification). This solution is fairly good and fast, but has certain drawbacks too:

it is less transparent (put() is not just put() now),
trigger is configured once for all cache operations (not just one putAll() call),
you can only have one trigger and it should handle all your data update needs.

Synthetic data keys

Finally, you can use DataSplittingProcessor from the CohKit project. This utility class uses virtual cache keys to transfer the data associated with keys, then uses the backing map API to access the real entries.

This solution has its PROs and CONs too:

good drop-in replacement for ConditionalPutAll and alike,
prone to deadlocks if running concurrently with other bulk updates (it is partially mitigated by sorting keys before locking).

Choosing right solution

In practice I have used all three techniques listed above.

Sometimes triggers fit the overall cache design quite well.
Sometimes manual data splitting has its advantages.
And sometimes DataSplittingProcessor is just the right remedy for existing entry processors.

SJK (JVM diagnostic/troubleshooting tools) is learning new tricks.

2013-09-09T04:30:00.000+01:00

SJK is a small command line tool implementing a number of helpful commands for JVM troubleshooting. Internally SJK uses the same diagnostic APIs as standard JDK tools (e.g. jps, jstack, jmap, jconsole).

Recently I've made a few noteworthy additions to the SJK package and would like to announce them here.

Memory allocation rates for Java threads

The ttop command now displays memory allocation per thread and cumulative memory allocation for the whole JVM process.
Memory allocation rate is key information for GC tuning; in the past I was using GC logs to derive these numbers. In contrast, per-thread allocation counters give you more precise information in real time.
Process allocation rate is calculated by aggregating thread allocation rates.
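
As an illustration of where such numbers can come from, here is a minimal sketch built on the HotSpot-specific com.sun.management.ThreadMXBean, which exposes per-thread allocated byte counters. This is not SJK's code; the class AllocationRate and its methods are hypothetical, and the sketch assumes a HotSpot-compatible JVM with allocation tracking enabled (otherwise the snapshot is simply empty).

```java
import java.lang.management.ManagementFactory;
import java.util.HashMap;
import java.util.Map;

/** Derives a process allocation rate from per-thread counters (hypothetical helper, not SJK code). */
final class AllocationRate {

    /** Thread ID -> bytes allocated so far; empty if the JVM does not expose the counters. */
    static Map<Long, Long> snapshot() {
        Map<Long, Long> result = new HashMap<>();
        java.lang.management.ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        if (bean instanceof com.sun.management.ThreadMXBean) {
            com.sun.management.ThreadMXBean hotspot = (com.sun.management.ThreadMXBean) bean;
            if (hotspot.isThreadAllocatedMemorySupported() && hotspot.isThreadAllocatedMemoryEnabled()) {
                long[] ids = hotspot.getAllThreadIds();
                long[] bytes = hotspot.getThreadAllocatedBytes(ids);
                for (int i = 0; i < ids.length; i++) {
                    if (bytes[i] >= 0) { // -1 means the thread is no longer alive
                        result.put(ids[i], bytes[i]);
                    }
                }
            }
        }
        return result;
    }

    /** Sums per-thread deltas between two snapshots and converts them to bytes per second. */
    static double processRate(Map<Long, Long> before, Map<Long, Long> after, long elapsedNanos) {
        long delta = 0;
        for (Map.Entry<Long, Long> e : after.entrySet()) {
            Long previous = before.get(e.getKey());
            delta += e.getValue() - (previous == null ? 0L : previous); // new threads count from zero
        }
        return delta * 1e9 / elapsedNanos;
    }
}
```

Taking two snapshots a second or so apart and passing them to processRate gives an approximate process-wide allocation rate; threads that terminate between snapshots are not counted in this simplified version.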

more details about ttop

Support for remote JMX connections

Historically SJK used a PID to connect to the JVM's MBean server. Using a PID does not require you to explicitly enable JMX on the JVM's command line and offers you OS-level security.
Sometimes you already have a JMX port up and running (e.g. for other monitoring tools), and connecting using host and port is more convenient.
Now all JVM based commands (ttop, gcrep, mx, mxdump) support socket-based JMX connections (with optional user/password security).
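
For reference, a host and port based JMX connection with optional credentials looks roughly like this when done with the standard javax.management.remote API. This is a sketch of the general technique, not SJK's implementation; the class RemoteJmx is hypothetical, and it assumes the target JVM exposes the default RMI connector under the usual /jmxrmi name.

```java
import java.io.IOException;
import java.net.MalformedURLException;
import java.util.HashMap;
import java.util.Map;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

/** Opens a socket based JMX connection (hypothetical helper, not SJK code). */
final class RemoteJmx {

    /** Builds the conventional RMI connector URL for the given host and port. */
    static JMXServiceURL serviceUrl(String host, int port) {
        try {
            return new JMXServiceURL("service:jmx:rmi:///jndi/rmi://" + host + ":" + port + "/jmxrmi");
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Bad JMX address " + host + ":" + port, e);
        }
    }

    /** Connection environment; credentials are added only when a user is given. */
    static Map<String, Object> environment(String user, String password) {
        Map<String, Object> env = new HashMap<>();
        if (user != null) {
            env.put(JMXConnector.CREDENTIALS, new String[] {user, password});
        }
        return env;
    }

    static JMXConnector connect(String host, int port, String user, String password) throws IOException {
        return JMXConnectorFactory.connect(serviceUrl(host, port), environment(user, password));
    }
}
```

The returned JMXConnector is closeable, so it fits a try-with-resources block; getMBeanServerConnection() on it gives access to the remote MBeans.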

Invoking arbitrary MBean operation

The new command (mx) allows you to get/set arbitrary MBean attributes and call arbitrary MBean operations.
This one is particularly useful for scripting (I didn't find a way to invoke an operation on a custom MBean from the command line, so I added it to SJK).
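
Under the hood, such a command boils down to three calls on MBeanServerConnection. The sketch below shows them against the local platform MBean server, using the standard java.lang:type=Memory bean as a target; it is an illustration of the JMX API, not the source of the mx command, and the class MBeanCalls is hypothetical.

```java
import java.lang.management.ManagementFactory;
import javax.management.Attribute;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;

/** Generic attribute access and operation invocation over JMX (hypothetical helper). */
final class MBeanCalls {

    static Object readAttribute(MBeanServerConnection conn, String bean, String name) {
        try {
            return conn.getAttribute(new ObjectName(bean), name);
        } catch (Exception e) {
            throw new IllegalStateException("Cannot read " + bean + "#" + name, e);
        }
    }

    static void writeAttribute(MBeanServerConnection conn, String bean, String name, Object value) {
        try {
            conn.setAttribute(new ObjectName(bean), new Attribute(name, value));
        } catch (Exception e) {
            throw new IllegalStateException("Cannot write " + bean + "#" + name, e);
        }
    }

    /** The signature lists parameter type names, e.g. {"long"}; both arrays may be null for no-arg operations. */
    static Object call(MBeanServerConnection conn, String bean, String operation,
                       Object[] args, String[] signature) {
        try {
            return conn.invoke(new ObjectName(bean), operation, args, signature);
        } catch (Exception e) {
            throw new IllegalStateException("Cannot invoke " + bean + "#" + operation, e);
        }
    }

    public static void main(String[] args) {
        MBeanServerConnection conn = ManagementFactory.getPlatformMBeanServer();
        System.out.println("Verbose GC: " + readAttribute(conn, "java.lang:type=Memory", "Verbose"));
        call(conn, "java.lang:type=Memory", "gc", null, null); // requests a GC via the Memory MBean
    }
}
```

The same methods work unchanged with a connection obtained from a remote JMX connector.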

more details about mx

Code and binaries are available at GitHub
https://github.com/aragozin/jvm-tools

Java GC in Numbers - Compressed OOPs

2013-07-28T18:00:00.000+01:00

Compressed OOPs (OOP stands for ordinary object pointer) is a technique for reducing the size of Java objects in 64-bit environments. HotSpot wiki has a good article explaining the details. The downside of this technique is that address uncompressing is required before accessing memory referenced by compressed OOPs. The instruction set (e.g. x86) may support such an addressing mode directly, but still, the additional arithmetic would affect the CPU's processing pipeline.

Young GC involves a lot of reference walking, so its time is expected to be affected by OOPs compression.

In this article, I'm comparing young GC pause times for 64-bit HotSpot JVM with and without OOPs compression. The methodology from the previous article is used, and the benchmark code is available on GitHub. There is one caveat though. With compressed OOPs, objects are smaller and the same amount of heap can accommodate more objects. The benchmark autoscales the number of entries to fill the heap based on entry footprint and old space size; thus, with a fixed old space size, experiments with compression enabled have to deal with a slightly larger number of objects (entry footprints are 288 uncompressed and 246 compressed).

Chart below shows absolute young GC pause times.

As you can see, the compressed case is consistently slower, which is not a surprise.

Another chart shows the relative difference between the two cases (compressed GC pause mean / uncompressed GC pause mean for the same case).

The fluctuating line suggests that I should probably increase the number of runs for each data point. But let's try to draw some conclusions from what we have.

For heaps below 4GiB the JVM uses a special strategy (32-bit addresses can be used without uncompressing in this case). This difference is visible on the chart (please note that the point with 4GiB of old space means that total heap size is above 4GiB, so this optimization is inapplicable).
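
The 4GiB boundary follows from simple pointer arithmetic. The sketch below models the decoding of a 32-bit compressed reference as base + (value << shift); the class NarrowOop is hypothetical and only illustrates the arithmetic, assuming the default 8-byte object alignment (shift of 3) for the scaled case. With a zero base and zero shift the 32-bit value is the address itself, which is only possible while the whole heap fits below 4GiB.

```java
/** Arithmetic model of compressed reference decoding (illustrative, not JVM code). */
final class NarrowOop {

    /** address = heapBase + (unsigned 32-bit value << shift) */
    static long decode(int narrow, long heapBase, int shift) {
        return heapBase + ((narrow & 0xFFFFFFFFL) << shift);
    }

    /** Size of the address range reachable by 32-bit references with the given shift. */
    static long addressableBytes(int shift) {
        return (1L << 32) << shift;
    }

    public static void main(String[] args) {
        long gib = 1L << 30;
        System.out.println("shift 0: " + addressableBytes(0) / gib + " GiB, no decoding needed");
        System.out.println("shift 3: " + addressableBytes(3) / gib + " GiB, shift on every access");
    }
}
```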

Above 4GiB we see a 10-30% increase in pause times. You should also not forget that the compressed case has to deal with 17% more data.
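
The 17% figure follows directly from the two entry footprints quoted earlier (288 uncompressed, 246 compressed). The small sketch below reproduces the calculation; the class name is hypothetical, and dividing old space size by footprint is only an approximation of how the benchmark scales its entry count.

```java
/** Reproduces the "17% more data" estimate from the entry footprints (illustrative helper). */
final class FootprintRatio {

    /** Approximate number of entries that fit into a given old space. */
    static long entriesThatFit(long oldSpaceBytes, int entryFootprint) {
        return oldSpaceBytes / entryFootprint;
    }

    /** How many more entries (in percent) fit when the footprint shrinks. */
    static double extraEntriesPercent(int uncompressedFootprint, int compressedFootprint) {
        return ((double) uncompressedFootprint / compressedFootprint - 1.0) * 100.0;
    }

    public static void main(String[] args) {
        System.out.println("extra entries with compression, %: " + extraEntriesPercent(288, 246));
    }
}
```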

Conclusions

Using compressed OOPs affects young GC pause time, which is not a surprise (especially taking into account the increased amount of data). Using compression for heaps below 4GiB seems to be a total win; for larger heaps it seems to be a reasonable price for increased capacity.

But the main conclusion is that the experiment has not revealed any surprises, neither bad nor good ones. This may not be very exciting, but it is useful information anyway.

Coherence SIG: Performance Test Driven Development

2013-07-18T21:03:00.000+01:00

Today I was speaking at the Oracle Coherence SIG in London.

Below you can find slide deck from my presentation.

http://www.slideshare.net/aragozin/coherence-sig-performance-test-driven-development