Monday, March 11, 2019

Lies, darn lies and sampling bias

Sampling profiling is a very powerful technique, widely used across various platforms for identifying hot code (execution bottlenecks).

In the Java world, sampling profiling (thread stack sampling, to be precise) is supported by every serious profiler.

While being powerful and very handy in practice, sampling has a well-known weakness – sampling bias. It is a real problem, though its practical impact is often exaggerated.

A picture is worth a thousand words, so let me jump straight into an example.

Case 1

Below is a simple snippet of code. It calculates a cryptographic hash over a bunch of random strings.

import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.concurrent.TimeUnit;

public class CryptoBench {

    private static final boolean trackTime = Boolean.getBoolean("trackTime");

    public static void main(String[] args) {
        CryptoBench test = new CryptoBench();
        while (true) {
            test.execute();
        }
    }

    public void execute() {
        long N = 5 * 1000 * 1000;
        // RandomStringUtils is a small helper class from the accompanying github repo
        RandomStringUtils randomStringUtils = new RandomStringUtils();
        long ts = 0, tf = 0;
        long timer1 = 0;
        long timer2 = 0;
        long bs = System.nanoTime();
        for (long i = 0; i < N; i++) {
            ts = trackTime ? System.nanoTime() : 0;
            String text = randomStringUtils.generate();
            tf = trackTime ? System.nanoTime() : 0;
            timer1 += tf - ts;
            ts = tf;
            crypt(text);
            tf = trackTime ? System.nanoTime() : 0;
            timer2 += tf - ts;
            ts = tf;
        }
        long bt = System.nanoTime() - bs;
        System.out.print(String.format("Hash rate: %.2f Mm/s", 0.01 * (N * TimeUnit.SECONDS.toNanos(1) / bt / 10000)));
        if (trackTime) {
            System.out.print(String.format(" | Generation: %.1f %%", 0.1 * (1000 * timer1 / (timer1 + timer2))));
            System.out.print(String.format(" | Hashing: %.1f %%", 0.1 * (1000 * timer2 / (timer1 + timer2))));
        }
        System.out.println();
    }

    public String crypt(String str) {
        if (str == null || str.length() == 0) {
            throw new IllegalArgumentException("String to encrypt cannot be null or zero length");
        }
        StringBuilder hexString = new StringBuilder();
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            md.update(str.getBytes());
            byte[] hash = md.digest();
            for (byte aHash : hash) {
                if ((0xff & aHash) < 0x10) {
                    hexString.append("0" + Integer.toHexString(0xFF & aHash));
                } else {
                    hexString.append(Integer.toHexString(0xFF & aHash));
                }
            }
        } catch (NoSuchAlgorithmException e) {
            e.printStackTrace();
        }
        return hexString.toString();
    }
}
The full code is available on github.

Now let’s use VisualVM (the profiler bundled with JDK 8) and look at how much time is actually spent in the CryptoBench.crypt() method.

Something is definitely off in the screenshot above!

CryptoBench.crypt(), the method doing the actual cryptography, is attributed only 33% of execution time.
At the same time, CryptoBench.execute() has 67% of self time, and that method does nothing besides calling other methods.

Probably I just need a cooler profiler here. /s

Let’s use Java Flight Recorder for the very same case.
Below is a screenshot from Mission Control.

That looks much better!

CryptoBench.crypt() now takes 86% of our time budget. The rest of the time is spent in random string generation.
These numbers look more believable to me.

Wait, wait, wait!

Integer.toHexString() is taking as much time as the actual MD5 calculation. I cannot believe that.

The numbers are better than the ones produced by VisualVM, but they are still fishy.

Flight recorder is not cool enough for that task! We need really cool profiler! /s

Ok, let me bring some sense into this discrepancy between the tools.

We were using thread stack sampling in both tools (VisualVM and Flight Recorder), though these tools capture stack traces differently.

VisualVM is actually sampling thread dumps (via the thread dump support in the JVM). A thread dump includes stack traces for every application thread in the JVM, regardless of the thread's state (blocked, sleeping or actually executing code), and the dump is taken atomically. It reflects the instant execution state of the whole JVM (which is important for deadlock/contention analysis). In practice, that implies a short Stop the World pause for each dump. A Stop the World pause means a safepoint in the HotSpot JVM. And safepoints bring some nuances.

When VisualVM requests a thread dump, the JVM notifies threads to suspend execution, but a thread executing Java code will not stop immediately (unless it is interpreted). The thread continues to run until the next safepoint check, where it can suspend itself. Checks cost CPU cycles, so they are sparse in JIT-generated code.

Checks are placed inside loops and after method returns. Though, checks are omitted for loops considered “fast” by the JIT compiler (typically integer-indexed loops). Small methods are aggressively inlined too, omitting the safepoint check at return. As a consequence, hot and calculation-intensive code may be optimized by the JIT into a single chunk of machine code which is mostly free of safepoint checks.
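To illustrate the “fast” loop rule, here is a hedged sketch (exact behavior depends on JVM version and JIT settings; newer HotSpot releases with loop strip mining soften this effect):

```java
public class SafepointLoops {

    // "Fast" counted loop: int-indexed with a known bound. The JIT typically
    // elides the safepoint poll on the back edge, so while this loop runs,
    // the thread cannot reach a safepoint and a thread dump will not see it.
    static long sumFast(int[] data) {
        long s = 0;
        for (int i = 0; i < data.length; i++) {
            s += data[i];
        }
        return s;
    }

    // A long-indexed loop is not treated as a counted loop, so the JIT keeps
    // a safepoint poll on the back edge and the loop stays visible in dumps.
    static long sumSlow(long n) {
        long s = 0;
        for (long i = 0; i < n; i++) {
            s += i;
        }
        return s;
    }

    public static void main(String[] args) {
        System.out.println(sumFast(new int[] {1, 2, 3})); // 6
        System.out.println(sumSlow(4));                   // 0+1+2+3 = 6
    }
}
```

Both methods compute the same kind of result; the difference is only in what a safepoint-based sampler can observe while they run.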

If you are lucky, the thread dump will show you the line invoking the method containing the hot code. With less luck, the result will be even more misleading.

So in the VisualVM call tree we see the method CryptoBench.execute() at the top of the stack for 66% of samples. If we were able to see the call tree at line-number granularity, it would be the line calling the CryptoBench.crypt() method.

Bad, ugly safepoint bias, I’ve caught you red-handed! /s

So, how does Flight Recorder sample stacks, and why are the numbers different?

Flight Recorder sampling doesn’t involve full thread dumps. Instead, it freezes threads one by one using OS-provided facilities. Once a thread is frozen, we can get the address of the next instruction to be executed out of the stack memory area. The address of the instruction is converted into a line number of the Java source code via a bytecode-to-machine-code symbol map. The map is generated during JIT compilation. This is how the stack trace is reconstructed.

In the case of Flight Recorder, safepoint bias does not apply. Though the results still look inaccurate. Why?

Below is another session with Flight Recorder for the very same code.

The picture is different now.

Integer.toHexString() is just 2.25% of our execution budget, which is more trustworthy in my eyes.

Flight Recorder has to resolve memory addresses back to references to bytecode instructions (which are further translated into Java source lines). The mapping generated by the JIT compiler is used for that purpose.

Though the compiler is aware that we can see a thread stack trace only at safepoints. By default, only safepoint checks are mapped into bytecode instruction indexes. Flight Recorder takes the execution address from the stack, then it finds the next address mapped to Java code in the symbol table. In case of aggressive inlining, Flight Recorder can map the address to a wholly wrong point in the code.
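Conceptually, the lookup can be sketched as a ceiling search in an address-to-line map (a hypothetical illustration, not Flight Recorder’s actual code): with a safepoint-only map, a sample is attributed to the next mapped address, which may be far from where execution really was.

```java
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

public class SymbolMapSketch {

    // Given a sampled instruction address, report the line of the nearest
    // mapped address at or after it; -1 if the sample falls past the map.
    static int resolveLine(long ip, NavigableMap<Long, Integer> addressToLine) {
        Map.Entry<Long, Integer> e = addressToLine.ceilingEntry(ip);
        return e != null ? e.getValue() : -1;
    }

    public static void main(String[] args) {
        NavigableMap<Long, Integer> sparseMap = new TreeMap<>();
        sparseMap.put(0x100L, 10); // safepoint check mapped to line 10
        sparseMap.put(0x200L, 42); // next safepoint check, mapped to line 42
        // A sample taken at 0x150 (possibly a different line entirely)
        // gets attributed to line 42, the next mapped point.
        System.out.println(resolveLine(0x150L, sparseMap));
    }
}
```

A denser map (as produced with -XX:+DebugNonSafepoints) shrinks the distance between mapped addresses and thus the attribution error.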

Though the sampling itself is not biased by safepoints, the symbol map generated by the JIT compiler is.

In the second example, I’ve used two JVM options to force more detailed symbol maps to be generated by the JIT compiler. The options are below.

-XX:+UnlockDiagnosticVMOptions -XX:+DebugNonSafepoints

A more accurate, bias-free symbol map allows Flight Recorder to produce more accurate stack traces.

In our mental model, code is executed line by line (bytecode instruction by instruction). But the compiler lumps a bunch of methods together and generates a single blob of machine code, aggressively reordering operations in the middle of the process to make the code faster.
Our mental model of line-by-line execution is totally broken by compiler optimization.

Though, in practice, artifacts of operation reordering are not as striking as safepoint bias.

So Java Flight Recorder is cool and VisualVM is not. Should I draw that conclusion?

Let me present a counter example.

Case 2

Below are profiling reports from a different case.

Now I’m using flame graphs generated from data captured by VisualVM and Flight Recorder (with -XX:+DebugNonSafepoints).

VisualVM report

Flight Recorder report

Both graphs show InflaterInputStream to be the bottleneck, though VisualVM assesses the time spent at 98%, while in Flight Recorder it is just 47%.

Who is right?

The correct answer is 92% (approximated using differential analysis).

My heart is broken! Flight Recorder is not a silver bullet. /s

What has gone wrong?

In this example, the hot spot was related to JNI overhead involved in calling native code in zlib. It seems that Flight Recorder was unable to reconstruct stack traces for certain samples outside of Java code and dropped these samples. The sample population was biased by native code execution, and that bias played against Flight Recorder in this case.

Conclusion

Both profilers are doing what they are intended to do. Some sort of bias is natural for almost any kind of sampling.

Each sampling profiler can be characterized by three aspects.

  • Blind spot bias – which samples are excluded from the data set collected by the profiler.
  • Attractor bias – how samples are attracted to specific discrete points (e.g. safepoints).
  • Resolution – the unit of code to which profiling data is aggregated (e.g. method, line number, etc.).

Below is a summary table for the sampling methods mentioned in this article.

Sampling method                            | Blind spot              | Attractor                                       | Resolution
JVM thread dump sampling                   | non-Java threads        | safepoint bias                                  | Java frames only
Java Flight Recorder                       | non-Java code execution | CPU pipeline bias + code-to-source mapping skew | Java frames only
Java Flight Recorder + DebugNonSafepoints  | non-Java code execution | CPU pipeline bias + code-to-source mapping skew | Java frames only

Wednesday, May 30, 2018

SJK is learning new tricks

SJK (or Swiss Java Knife) has been my secret weapon for firefighting various types of performance problems for a long time.

A new version of SJK was released not too long ago, and it contains a bunch of new and powerful features I would like to highlight.

ttop contention monitoring

SJK lives up to its name by bundling a number of tools into a single executable jar. Though, ttop is likely the single most commonly used tool under the SJK roof.

ttop is a kind of top for the threads of a JVM process. Besides the CPU usage counter (provided by the OS) and allocation rate (tracked by the JVM), a new thread contention metric was introduced in the recent SJK release.

Thread contention metrics are calculated by the JVM, which counts and times Java threads entering the BLOCKED or WAITING state.

If enabled, SJK uses these metrics to display rates and the percentage of time spent in either state.

2018-05-29T14:20:03.382+0300 Process summary 
  process cpu=231.09%
  application cpu=212.78% (user=195.86% sys=16.92%)
  other: cpu=18.31% 
  thread count: 157
  GC time=4.72% (young=4.72%, old=0.00%)
  heap allocation rate 976mb/s
  safe point rate: 6.3 (events/s) avg. safe point pause: 8.24ms
  safe point sync time: 0.07% processing time: 5.09% (wallclock time)
[000180] user=19.40% sys= 0.31% wait=183.6/s(75.77%) block=    0/s( 0.00%) alloc=  110mb/s - hz._hzInstance_2_dev.cached.thread-8
[000094] user=16.92% sys= 0.16% wait=58.50/s(81.54%) block=    0/s( 0.00%) alloc=   94mb/s - hz._hzInstance_3_dev.generic-operation.thread-0
[000057] user=15.05% sys= 0.62% wait=56.91/s(82.35%) block= 0.20/s( 0.01%) alloc=   91mb/s - hz._hzInstance_2_dev.generic-operation.thread-0
[000095] user=15.21% sys= 0.00% wait=55.61/s(82.32%) block= 0.30/s( 0.04%) alloc=   87mb/s - hz._hzInstance_3_dev.generic-operation.thread-1
[000022] user=14.59% sys= 0.00% wait=56.01/s(83.42%) block= 0.30/s( 0.08%) alloc=   86mb/s - hz._hzInstance_1_dev.generic-operation.thread-1
[000058] user=13.97% sys= 0.16% wait=56.91/s(84.13%) block= 0.10/s( 0.02%) alloc=   81mb/s - hz._hzInstance_2_dev.generic-operation.thread-1

An important fact about these metrics: in an ideal world, CPU time + WAITING + BLOCKED should add up to 100%.

In reality, you are likely to see a gap (e.g. for the first thread in the listing above, 19.40% + 0.31% CPU plus 75.77% wait leaves roughly 4.5% unaccounted). A few reasons why the equation above does not hold:

  • GC pauses are freezing thread execution, but not accounted by thread contention monitoring,
  • thread may be waiting for IO operation, but it is not accounted as BLOCKED or WAITING state by JVM,
  • system may starve on CPU resource and thread is waiting for CPU core on OS level (which is also not accounted by JVM).

Contention monitoring is not enabled by default; use the -c flag with the ttop command to enable it.
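The underlying JVM facility is the standard ThreadMXBean contention monitoring; below is a minimal sketch of the raw data a tool like ttop builds on (the actual SJK accounting may differ):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class ContentionProbe {

    // Collect per-thread blocked/waited times - the raw data which
    // a tool like ttop aggregates into rates and percentages.
    static String dump() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        if (threads.isThreadContentionMonitoringSupported()) {
            threads.setThreadContentionMonitoringEnabled(true); // off by default, like in ttop
        }
        StringBuilder sb = new StringBuilder();
        for (ThreadInfo ti : threads.dumpAllThreads(false, false)) {
            // times are in milliseconds, accumulated since monitoring was enabled
            sb.append(String.format("%-40s blocked=%dms waited=%dms%n",
                    ti.getThreadName(), ti.getBlockedTime(), ti.getWaitedTime()));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(dump());
    }
}
```

Note that these counters cover only monitor/wait contention; GC pauses, IO waits and CPU starvation are invisible to them, which is exactly where the gap above comes from.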

HTML5 based flame graph

SJK has been able to produce flame graphs for some time already. Though, the old flame graphs were generated as SVG with limited interactivity.

The new version offers a new type of flame graph, based on HTML5 and interactive. Right in the browser, it allows:

  • filtering data by threads,
  • zooming into specific paths or by presence of a specific frame,
  • filtering data by thread state (if state information is available).

The HTML5 report is a 100% self-contained file with no dependencies; it can be sent by email and opened on any machine. Here is an example of the new flame graph you can play with right now.

The new flame command is used to generate HTML5 flame graphs.

`jstack` dump support

SJK accepts a number of input data formats for thread sampling data, which is used for flame graphs and other types of performance analysis.

A new format added in version 0.10 is the text thread dump format produced by jstack. The full list of input formats is now:

  • SJK native thread sampling format
  • JVisualVM sampling snapshots (.nps)
  • Java Flight Recorder recording (.jfr)
  • jstack produced text thread dumps

Thursday, June 15, 2017

HeapUnit - Test your Java heap content

There are usually a number of tests which you would like to run for each build to make sure that your code makes sense. Typically, such tests focus on the business function of your code.

Though, on rare occasions, you would really like to test certain non-functional aspects. A memory/resource leak would be a good example.

How would you test for a memory leak?

This is quite a challenge, right?

You can use a debugger or profiler to inspect the internal state of your system. Though, that approach assumes manual testing.

You can write a test which stresses your system, provoking an OutOfMemoryError that fails the test if the code has a defect. That generally works, though adding a stress test to a mostly functional automatic test pack may not be the best idea. That approach also may not work for other kinds of resource leaks.

You can exploit weak or phantom references to trace the garbage collector's work. This approach makes the test more lightweight compared to fully fledged stress testing, but it is not applicable in many cases. E.g. you may not have a reference to the leak-suspected objects.
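For cases where you do hold a reference to the suspected object, the weak reference approach can be sketched as follows (System.gc() is only a best-effort request, hence the retry loop):

```java
import java.lang.ref.WeakReference;

public class LeakProbe {

    // Returns true if the referent was collected after the last strong
    // reference was dropped - i.e. nothing else was accidentally holding it.
    static boolean wasCollected(WeakReference<?> ref) throws InterruptedException {
        for (int i = 0; i < 5 && ref.get() != null; i++) {
            System.gc();      // best-effort request, usually honored by HotSpot
            Thread.sleep(10); // give the collector a moment
        }
        return ref.get() == null;
    }

    public static void main(String[] args) throws InterruptedException {
        Object suspect = new byte[1024 * 1024];           // object we suspect of leaking
        WeakReference<Object> ref = new WeakReference<>(suspect);
        suspect = null;                                   // drop the only strong reference
        System.out.println(wasCollected(ref)
                ? "collected - no leak"
                : "still reachable - possible leak");
    }
}
```

If some component still held the object, ref.get() would stay non-null and the probe would report a possible leak.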

For some time I have been actively practising automated inspection of JVM heap dumps for diagnostic purposes. A JVM can easily produce its own heap dump (using the JVM attach interface), and that dump can be inspected via an API to assert certain invariants (e.g. the number of live instances of a particular type). Why not use it for resource leak testing and similar cases?

Resurrecting object from dump

The heap dump API allows you to inspect fields of dumped objects; there is also a heap path notation for writing sophisticated selectors. Though, you cannot invoke methods, not even toString() or equals(), on objects from the dump. For quantitative analysis this is ok. But for asserting the complex conditions typical of test scenarios, dealing with real Java objects may be much more convenient.

A heap dump doesn't contain full class information. But if the dump is produced from the JVM we are running in, we can rely on class metadata available through reflection.

The Objenesis library and Java reflection are used to convert instance data from the heap dump back into normal Java objects.

In the end, usage of HeapUnit is fairly simple. Using the API you can:

  • take a heap dump
  • select certain instances from the dump by class or by heap path notation
  • inspect an instance's fields using symbolic names
  • or rehydrate an instance into a Java object

Example

Below is a simple example listing Socket objects in a JVM.

@Test
public void printSockets() throws IOException {

    ServerSocket ss = new ServerSocket();
    ss.bind(sock(5000)); // sock(port) - a helper building a localhost InetSocketAddress (omitted for brevity)

    Socket s1 = new Socket();
    Socket s2 = new Socket();

    s1.connect(sock(5000));
    s2.connect(sock(5000));

    ss.close();
    s1.close();
    // s2 remains unclosed

    HeapImage hi = HeapUnit.captureHeap();

    for(HeapInstance i: hi.instances(SocketImpl.class)) {
        // fd field in SocketImpl class is nullified when socket gets closed
        boolean open = i.value("fd") != null;
        System.out.println(i.rehydrate() + (open ? " - open" : " - closed"));
    }
}

The HeapUnit library is available in the Maven Central repo. You can bring it into your project using the Maven coordinates below.

<dependency>
    <groupId>org.gridkit.heapunit</groupId>
    <artifactId>heapunit</artifactId>
    <version>0.2</version>
</dependency>