Thursday, July 21, 2016

Rust, JNI, Java

Recently, I needed to make some calls to kernel32.dll from my Java code. Just a few system calls on the Windows platform, as simple as it sounds. I also wanted to keep the resulting binary as small as possible.

The latter requirement added a fair challenge to the task.

How to call platform code from Java?

JNI - Java Native Interface

JNI is built into the JVM and is part of the Java standard. Sounds good, but there is a catch. To call native code from Java via JNI, you have to write native code (e.g. in C). That is, JNI requires some glue code (aka bindings) between native calls and Java methods.
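To make the glue code requirement a bit more concrete: for every Java method declared native, the JVM looks up an exported function whose name is derived from the class and method names. Below is a minimal sketch of that name derivation for non-overloaded methods, following the escaping rules of the JNI specification. The JniNames helper itself, and the class and method names in the usage note, are hypothetical and exist only for illustration.

```java
final class JniNames {

    private JniNames() {}

    /** Symbol name the JVM resolves for a non-overloaded native method. */
    static String exportName(String className, String methodName) {
        return "Java_" + mangle(className) + "_" + mangle(methodName);
    }

    private static String mangle(String name) {
        StringBuilder sb = new StringBuilder();
        for (char c : name.toCharArray()) {
            boolean alphanumeric = (c >= 'a' && c <= 'z')
                    || (c >= 'A' && c <= 'Z')
                    || (c >= '0' && c <= '9');
            if (alphanumeric) {
                sb.append(c);
            } else if (c == '.' || c == '/') {
                sb.append('_');      // package separator becomes an underscore
            } else if (c == '_') {
                sb.append("_1");     // a literal underscore is escaped
            } else {
                // any other character is written as _0xxxx (lower case hex)
                sb.append(String.format("_0%04x", (int) c));
            }
        }
        return sb.toString();
    }
}
```

For example, JniNames.exportName("com.example.Kernel32", "getTickCount64") yields Java_com_example_Kernel32_getTickCount64, and that is the function name the native dll would have to export, whatever language it is written in.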

Hmm... do we have other alternatives?

JNA - Java Native Access

JNA is an alternative to JNI. You can call native code from Java with no glue code. Cool, what is the cost?

The JNA jar is 1.1 MiB. An extra megabyte just to make a couple of simple calls to the Windows kernel: no deal.

Back to JNI

Ok, I need to write some glue code for JNI. What language to choose?

C/C++ - no, just no. The C/C++ tool chain (compiler, headers, build tools) is an abomination, especially on Windows. Please, I just need literally half a screen of code compiled to a dll binary. I do not want 10 GiB worth of Visual Studio polluting my desktop.

A die-hard Java guy is speaking :)

Free Pascal

Pascal is an ancient language. It was the programming language of my youth. MS DOS, Turbo Pascal ... colors were so bright in those days.

Twenty years later, I was surprised to find Pascal in pretty good shape. Free Pascal has an impressive list of supported platforms. The Pascal compiler is lightning fast. The produced binaries have no dependency on libc / msvcrt.

Using Free Pascal, I got my kernel32-to-JNI dll down to 33 KiB. That sounds much, much better.

Can we do better?

Rust

Rust is the new kid on the language block. It has a strong ambition to replace C/C++ as a system level language. It gives you all the power of C plus memory safety, a modernized build system, and language level modules (crates).

Sounds promising, let's try Rust for a little JNI glue dll.

Calling

rustc -C debuginfo=0 --crate-type dylib myjni.rs

results in a disappointing 2.5 MiB binary.

A Rust dylib is a dll which can be used by other Rust code, so it exposes a lot of language-specific metadata. cdylib is a new crate type introduced in Rust 1.10, which is more suitable for JNI bindings.

Command line

rustc -C lto -C debuginfo=0 --crate-type cdylib myjni.rs

produced a 1.6 MiB binary. The -C lto option instructs the compiler to do "link time optimization". For some reason, cdylib would not compile without the lto option for me.

Ok, the direction is right, but we need to go much further. Let's try more compiler options.

Command line

rustc -C opt-level=3 -C lto -C debuginfo=0 --crate-type cdylib myjni.rs

produced a 200 KiB binary. Optimization allows the compiler to throw away a big portion of the standard library which I will never need for my simple JNI binding.

Still, a large portion of the standard library is there.

In Rust you can turn off the standard library completely (e.g. to run on bare metal).

Normally you would need at least memory management, but for a simple JNI binding you can get away with stack allocation only.

At the moment, using Rust with the no_std option requires a nightly build of the compiler. I also had to rewrite some of the kernel32 and JNI declarations to avoid a dependency on libc types.

rustc -C opt-level=3 -C panic=abort -C lto -C debuginfo=0 --crate-type cdylib myjni.rs

Binary size is 22.5 KiB.

Cool, we have beaten Free Pascal.

One more tweak: run strip -s on the resulting dll, and the final binary size is 16.9 KiB.

Honestly, 16.9 KiB for a couple of calls is still overkill. But I'm not desperate enough to try assembly for a JNI binding, at least not today.

Conclusion

Free Pascal. IMHO, Free Pascal is a good choice if you need simple JNI bindings. As a bonus, Free Pascal on Linux has no dependency on the platform's dynamic libraries, so you can build cross-Linux-distro binaries.

Rust. I believe Rust has great potential. Rust has a unique memory safety model, yet it lets you get as close to bare metal as C does. Among other features, Rust has really promising cross-compiling capabilities, which give it a very strong position in the embedded / IoT space.

Yet, Rust needs to get more stable. The no_std feature is not available in the latest (1.10) stable. cdylib is not supported by the latest stable cargo tool. The Rust tool chain on Windows depends on either MS Visual Studio or MSys. The resulting binaries are slightly incompatible with each other (the Oracle JVM is built with Visual Studio, so using MSys-built JNI bindings leads to a process crash in certain cases).

Wednesday, March 16, 2016

Finalizers and References in Java

Automatic memory management (garbage collection) is one of the essential aspects of the Java platform. Garbage collection relieves developers from the pain of memory management and protects them from a whole range of memory related issues. However, working with external resources (e.g. files and sockets) from Java becomes tricky, because the garbage collector alone is not enough to manage such resources.

Originally, Java had the finalizer facility. Later, special reference classes were added to deal with the same problem.

If we have some external resource which should be deallocated explicitly (a common case with native libraries), this task can be solved using either a finalizer or a phantom reference. What is the difference?

Finalizer approach

The code below implements resource housekeeping using a Java finalizer.

public class Resource implements ResourceFacade {

    public static AtomicLong GLOBAL_ALLOCATED = new AtomicLong(); 
    public static AtomicLong GLOBAL_RELEASED = new AtomicLong(); 

    int[] data = new int[1 << 10];
    protected boolean disposed;

    public Resource() {
        GLOBAL_ALLOCATED.incrementAndGet();
    }

    public synchronized void dispose() {
        if (!disposed) {
            disposed = true;
            releaseResources();
        }
    }

    protected void releaseResources() {
        GLOBAL_RELEASED.incrementAndGet();
    }    
}

public class FinalizerHandle extends Resource {

    protected void finalize() {
        dispose();
    }
}

public class FinalizedResourceFactory {

    public static ResourceFacade newResource() {
        return new FinalizerHandle();
    }    
}

Phantom reference approach

public class PhantomHandle implements ResourceFacade {

    private final Resource resource;

    public PhantomHandle(Resource resource) {
        this.resource = resource;
    }

    public void dispose() {
        resource.dispose();
    }    

    Resource getResource() {
        return resource;
    }
}

public class PhantomResourceRef extends PhantomReference<PhantomHandle> {

    private Resource resource;

    public PhantomResourceRef(PhantomHandle referent, ReferenceQueue<? super PhantomHandle> q) {
        super(referent, q);
        this.resource = referent.getResource();
    }

    public void dispose() {
        Resource r = resource;
        if (r != null) {
            r.dispose();
        }        
    }    
}

public class PhantomResourceFactory {

    private static Set<Resource> GLOBAL_RESOURCES = Collections.synchronizedSet(new HashSet<Resource>());
    private static ResourceDisposalQueue REF_QUEUE = new ResourceDisposalQueue();
    private static ResourceDisposalThread REF_THREAD = new ResourceDisposalThread(REF_QUEUE);

    public static ResourceFacade newResource() {
        ReferedResource resource = new ReferedResource();
        GLOBAL_RESOURCES.add(resource);
        PhantomHandle handle = new PhantomHandle(resource);
        PhantomResourceRef ref = new PhantomResourceRef(handle, REF_QUEUE);
        resource.setPhantomReference(ref);
        return handle;
    }

    private static class ReferedResource extends Resource {

        @SuppressWarnings("unused")
        private PhantomResourceRef handle;

        void setPhantomReference(PhantomResourceRef ref) {
            this.handle = ref;
        }

        @Override
        public synchronized void dispose() {
            handle = null;
            GLOBAL_RESOURCES.remove(this);
            super.dispose();
        }
    }

    private static class ResourceDisposalQueue extends ReferenceQueue<PhantomHandle> {

    }

    private static class ResourceDisposalThread extends Thread {

        private ResourceDisposalQueue queue;

        public ResourceDisposalThread(ResourceDisposalQueue queue) {
            this.queue = queue;
            setDaemon(true);
            setName("ReferenceDisposalThread");
            start();
        }

        @Override
        public void run() {
            while(true) {
                try {
                    PhantomResourceRef ref = (PhantomResourceRef) queue.remove();
                    ref.dispose();
                    ref.clear();
                } catch (InterruptedException e) {
                    // ignore
                }
            }
        }
    }
}

Implementing the same task using phantom references requires more boilerplate. We need a separate thread to handle the reference queue. In addition, we need to keep strong references to the allocated reference objects.

How finalizers work in Java

Under the hood, finalizers work very similarly to our phantom reference implementation, though the JVM hides the boilerplate from us.

Each time an instance of a class with a finalizer is created, the JVM creates an instance of the FinalReference class to track it. Once the object becomes unreachable, the FinalReference is triggered and added to the global final reference queue, which is processed by the system finalizer thread.

So the finalizer and phantom reference approaches work very similarly. Why should you bother with phantom references?

Comparing GC impact

Let's run a simple test: a resource object is allocated and added to a queue; once the queue size hits a limit, the oldest reference is evicted and thrown away. For this test we will monitor reference processing via GC logs.

Running finalizer based implementation.

[GC [ParNew[ ... [FinalReference, 5718 refs, 0.0063374 secs] ... 
Released: 6937 In use: 59498

Running phantom based implementation.

[GC [ParNew[ ... [PhantomReference, 5532 refs, 0.0037622 secs] ... 
Released: 5468 In use: 38897

As you can see, once an object becomes unreachable, it needs to be handled in the GC reference processing phase. Reference processing is a part of the Stop-the-World pause. If too many references become eligible for processing between collections, it may prolong the Stop-the-World pause significantly.

In the case above, there is not much difference between finalizers and phantom references. But let's change the workflow a little. Now we will explicitly dispose 99% of handles and rely on GC only for 1% of references (i.e. semiautomatic resource management).
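The workload used in both experiments can be sketched roughly as follows. This is a simplified illustration, not the benchmark code from the repository: the EvictionWorkload class and its Disposable interface are hypothetical names (Disposable stands in for ResourceFacade), and the choice of which evicted handles get disposed explicitly is deterministic here to keep the sketch easy to reason about.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Supplier;

final class EvictionWorkload {

    /** Hypothetical stand-in for the ResourceFacade used in the listings above. */
    interface Disposable {
        void dispose();
    }

    private EvictionWorkload() {}

    /**
     * Allocates `iterations` resources while keeping at most `limit` of them alive.
     * Out of every 100 evicted resources, `explicitPercent` are disposed by hand;
     * the rest are simply dropped and left to the garbage collector.
     * Returns the number of explicit disposals.
     */
    static int run(Supplier<? extends Disposable> factory,
                   int iterations, int limit, int explicitPercent) {
        Deque<Disposable> live = new ArrayDeque<>();
        int evicted = 0;
        int disposed = 0;
        for (int i = 0; i < iterations; i++) {
            live.addLast(factory.get());
            if (live.size() > limit) {
                Disposable oldest = live.removeFirst();
                if (evicted % 100 < explicitPercent) {
                    oldest.dispose();   // explicit (manual) release
                    disposed++;
                }
                evicted++;              // otherwise the handle just becomes garbage
            }
        }
        return disposed;
    }
}
```

With explicitPercent set to 0 this corresponds to the first experiment (everything is left to GC), and with 99 to the second one.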

Running finalizer based implementation.

[GC [ParNew[ ... [FinalReference, 6295 refs, 0.0070033 secs] ...
Released: 6707 In use: 1457

Running phantom based implementation.

[GC [ParNew[ ... [PhantomReference, 625 refs, 0.0001551 secs] ... 
Released: 21682 In use: 1217

For the finalizer based implementation there is no difference. Explicit resource disposal doesn't help reduce GC overhead. But with phantoms, we can see that GC does not need to handle explicitly disposed references (so the number of references processed by GC is reduced by an order of magnitude).

Why is this happening? When a resource handle is disposed, we drop the reference to the phantom reference object. Once the phantom reference is unreachable, it will never be queued for processing by GC, which saves time in the reference processing phase. It is quite the opposite with final references: once created, a final reference is strongly referenced by the JVM until it is processed by the finalizer thread.
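The mechanism can be reduced to a small sketch. It is a stripped down variation of the factory shown earlier, not the actual code: TrackedHandles and its members are hypothetical names, and calling clear() on explicit release is an extra precaution I am assuming here, on top of dropping the strong reference.

```java
import java.lang.ref.PhantomReference;
import java.lang.ref.ReferenceQueue;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

final class TrackedHandles {

    static final AtomicLong RELEASED = new AtomicLong();

    private static final ReferenceQueue<Object> QUEUE = new ReferenceQueue<>();
    // Strong references keep the reference objects alive until they are processed
    private static final Set<Ref> PENDING = ConcurrentHashMap.newKeySet();

    private TrackedHandles() {}

    static final class Ref extends PhantomReference<Object> {

        private boolean released;

        Ref(Object handle) {
            super(handle, QUEUE);
        }

        /** Releases the resource once; safe to call from user code or from drain(). */
        synchronized void release() {
            if (!released) {
                released = true;
                RELEASED.incrementAndGet();
            }
            PENDING.remove(this); // reference object may now become garbage itself
            clear();              // a cleared reference is never enqueued by GC
        }
    }

    static Ref track(Object handle) {
        Ref ref = new Ref(handle);
        PENDING.add(ref);
        return ref;
    }

    static int pending() {
        return PENDING.size();
    }

    /** Processes references enqueued by GC; returns how many were handled. */
    static int drain() {
        int processed = 0;
        Ref ref;
        while ((ref = (Ref) QUEUE.poll()) != null) {
            ref.release();
            processed++;
        }
        return processed;
    }
}
```

A handle released explicitly through Ref.release() leaves nothing behind for the collector to discover and enqueue; only handles which were forgotten end up in the queue and are picked up by drain().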

Conclusion

Using phantom references for resource housekeeping requires more work compared to the plain finalizer approach. But with phantom references you have far more granular control over the whole process and can implement a number of optimizations, such as hybrid (manual + automatic) resource management.

Full source code used for this article is available at https://github.com/aragozin/example-finalization.

Sunday, January 24, 2016

Flame Graphs Vs. Cold Numbers

Stack trace sampling is a very powerful technique for performance troubleshooting. The advantages of stack trace sampling are

  • it doesn't require upfront configuration
  • cost added by sampling is small and controllable
  • it is easy to compare analysis results from different experiments
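To show how little is needed to collect samples, here is a minimal in-process sketch based on the standard Thread.getAllStackTraces() call. It is only an illustration of the idea: the StackSampler class is a hypothetical name, and a real tool would usually sample a target JVM from outside rather than from within the application.

```java
import java.util.ArrayList;
import java.util.List;

final class StackSampler {

    private StackSampler() {}

    /**
     * Takes `count` dumps of all live threads, `intervalMillis` apart.
     * In each returned trace, element 0 is the tip of the stack.
     */
    static List<StackTraceElement[]> sample(int count, long intervalMillis) {
        List<StackTraceElement[]> traces = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            for (StackTraceElement[] trace : Thread.getAllStackTraces().values()) {
                if (trace.length > 0) {
                    traces.add(trace);
                }
            }
            if (i + 1 < count) {
                try {
                    Thread.sleep(intervalMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // stop sampling early
                    break;
                }
            }
        }
        return traces;
    }
}
```

The sampling interval is the knob that controls overhead: a longer interval means less cost and a coarser picture.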

Unfortunately, the tools offered for stack trace analysis by widespread Java profilers are very limited.

Solving performance problems in complex applications (a lot of business logic etc.) is one of my regular challenges. Let's assume I have another misbehaving application on my hands. The first step would be to localize the bottleneck to a specific part of the stack.

Meet call tree

A call tree is built by digesting a large number of stack traces. Each node in the tree has a frequency: the number of traces passing through this node.
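A call tree of this kind can be built in a few lines. The sketch below assumes each trace is a list of frame names ordered from the thread entry point to the tip of the stack; the CallTree class is a hypothetical name used for illustration only.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class CallTree {

    final String frame;
    int count; // number of traces passing through this node
    final Map<String, CallTree> children = new LinkedHashMap<>();

    CallTree(String frame) {
        this.frame = frame;
    }

    /** Adds one trace; frames are ordered from the entry point to the tip. */
    void add(List<String> trace) {
        CallTree node = this;
        node.count++;
        for (String f : trace) {
            node = node.children.computeIfAbsent(f, CallTree::new);
            node.count++;
        }
    }

    /** Builds a tree with a synthetic root from a population of traces. */
    static CallTree build(List<List<String>> traces) {
        CallTree root = new CallTree("<root>");
        for (List<String> trace : traces) {
            root.add(trace);
        }
        return root;
    }
}
```

A flame graph is a rendering of the same structure, with the width of each box proportional to the count of the node.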

Usually tools allow you to navigate through call tree reconstructed from stack trace population.

There is also the flame graph visualization (shown at the top right of the page), which is fancier but is just the same tree.

Looking at these visualizations, what can I see? Not too much.

Why? Business logic somewhere in the middle of the call tree produces too many branches. The tree beneath the business logic is blurred beyond the point of usability.

Dissecting call tree

The application is built using frameworks. For the sake of this article, I'm using an example based on JBoss, JSF, Seam, Hibernate.

Now, if 13% of traces in our dump contain JDBC, we can conclude that 13% of time is spent in JDBC / database calls.
13% is a reasonable number, so the database is not to blame here.

Let's go down the stack; Hibernate is the next layer. Now we need to count all traces containing Hibernate classes, excluding the ones containing JDBC. This way we can attribute traces to a particular framework and quickly get a picture of where time is spent at runtime.
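The attribution rule is essentially "first matching category wins". Here is a minimal sketch of it, assuming plain substring matching on frame names instead of a real pattern syntax; the TraceCategorizer class and the frame names in the usage note are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class TraceCategorizer {

    // category -> frame name fragment, checked in insertion order
    private final Map<String, String> rules = new LinkedHashMap<>();

    TraceCategorizer rule(String category, String fragment) {
        rules.put(category, fragment);
        return this;
    }

    /** Returns the first category whose fragment occurs in any frame of the trace. */
    String classify(List<String> trace) {
        for (Map.Entry<String, String> rule : rules.entrySet()) {
            for (String frame : trace) {
                if (frame.contains(rule.getValue())) {
                    return rule.getKey();
                }
            }
        }
        return "Other";
    }

    /** Counts traces per category; every trace lands in exactly one bucket. */
    Map<String, Integer> count(List<List<String>> traces) {
        Map<String, Integer> result = new LinkedHashMap<>();
        for (String category : rules.keySet()) {
            result.put(category, 0);
        }
        result.put("Other", 0);
        for (List<String> trace : traces) {
            result.merge(classify(trace), 1, Integer::sum);
        }
        return result;
    }
}
```

Because JDBC is registered before Hibernate, a trace containing both a Hibernate frame and a JDBC frame is counted as JDBC only, which gives the "Hibernate excluding JDBC" semantics described above.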

I didn't find any tool that can do this kind of analysis for me, so I built one for myself a few years ago. SJK is my universal Java troubleshooting toolkit.

Below is the command doing the analysis explained above.

sjk ssa -f tracedump.std  --categorize -tf **.CoyoteAdapter.service -nc
JDBC=**.jdbc 
Hibernate=org.hibernate
"Facelets compile=com.sun.faces.facelets.compiler.Compiler.compile"
"Seam bijection=org.jboss.seam.**.aroundInvoke/!**.proceed"
JSF.execute=com.sun.faces.lifecycle.LifecycleImpl.execute
JSF.render=com.sun.faces.lifecycle.LifecycleImpl.render
Other=**

Below is the output of this command.

Total samples    2732050 100.00%
JDBC              405439  14.84%
Hibernate         802932  29.39%
Facelets compile  395784  14.49%
Seam bijection    385491  14.11%
JSF.execute       290355  10.63%
JSF.render        297868  10.90%
Other             154181   5.64%

Well, we clearly see a large amount of time spent in Hibernate. This is very wrong, so it is the first candidate for investigation. We also see that a lot of CPU is spent on JSF compilation, though pages should be compiled just once and cached (it turned out to be a configuration issue). Actual application logic falls into the JSF life cycle calls (execute(), render()). It would be possible to introduce an additional category to isolate pure application logic execution time, but looking at the numbers, I would say it is not necessary until the other problems are solved.

Hibernate is our primary suspect, so how do we look inside? Let's look at the method histogram for traces attributed to Hibernate, trimming away all frames up to the first Hibernate method call.

Below is the command to do this.

sjk ssa -f tracedump.std --histo -tf **!**.jdbc -tt org.hibernate

Here is the top of the histogram produced by the command

Trc     (%)  Frm  N  Term    (%)  Frame                                                                                                                                                                                  
699506  87%  699506       0   0%  org.hibernate.internal.SessionImpl.autoFlushIfRequired(SessionImpl.java:1204)                                                                                                          
689370  85%  689370      10   0%  org.hibernate.internal.QueryImpl.list(QueryImpl.java:101)                                                                                                                              
676524  84%  676524       0   0%  org.hibernate.event.internal.DefaultAutoFlushEventListener.onAutoFlush(DefaultAutoFlushEventListener.java:58)                                                                          
675136  84%  675136       0   0%  org.hibernate.internal.SessionImpl.list(SessionImpl.java:1261)                                                                                                                         
573836  71%  573836       4   0%  org.hibernate.ejb.QueryImpl.getResultList(QueryImpl.java:264)                                                                                                                          
550968  68%  550968       1   0%  org.hibernate.event.internal.AbstractFlushingEventListener.flushEverythingToExecutions(AbstractFlushingEventListener.java:99)                                                          
533892  66%  533892     132   0%  org.hibernate.event.internal.AbstractFlushingEventListener.flushEntities(AbstractFlushingEventListener.java:227)                                                                       
381514  47%  381514     882   0%  org.hibernate.event.internal.AbstractVisitor.processEntityPropertyValues(AbstractVisitor.java:76)                                                                                      
271018  33%  271018       0   0%  org.hibernate.event.internal.DefaultFlushEntityEventListener.onFlushEntity(DefaultFlushEntityEventListener.java:161)

Here is our suspect. We spend 87% of Hibernate time in the autoFlushIfRequired() call (and JDBC time is already excluded).
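The trimming and counting behind such a histogram can be sketched as follows. This is an approximation of the idea, not SJK's implementation: FrameHistogram is a hypothetical class, frames are matched by a simple name prefix, the JDBC exclusion filter is left out, and I am assuming the relevant columns mean "number of traces containing the frame" and "number of traces ending at the frame".

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class FrameHistogram {

    // frame -> { traces containing the frame, traces where it is the tip }
    private final Map<String, int[]> rows = new LinkedHashMap<>();
    private int traces;

    /**
     * Adds a trace (entry point first). Frames before the first frame starting
     * with `trimPrefix` are ignored; traces without such a frame are skipped.
     */
    void add(List<String> trace, String trimPrefix) {
        int start = -1;
        for (int i = 0; i < trace.size(); i++) {
            if (trace.get(i).startsWith(trimPrefix)) {
                start = i;
                break;
            }
        }
        if (start < 0) {
            return;
        }
        traces++;
        List<String> tail = trace.subList(start, trace.size());
        Set<String> seen = new HashSet<>();
        for (String frame : tail) {
            if (seen.add(frame)) { // count each trace once, even with recursion
                rows.computeIfAbsent(frame, k -> new int[2])[0]++;
            }
        }
        rows.get(tail.get(tail.size() - 1))[1]++;
    }

    int totalTraces() {
        return traces;
    }

    int traceCount(String frame) {
        int[] row = rows.get(frame);
        return row == null ? 0 : row[0];
    }

    int terminalCount(String frame) {
        int[] row = rows.get(frame);
        return row == null ? 0 : row[1];
    }
}
```

Dividing traceCount(frame) by totalTraces() gives the share of the selected traces passing through a frame, which is how a figure like the 87% above would be obtained.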

Using a few commands, we have narrowed down one performance bottleneck. Fixing it is another topic though.

In the case I'm using as an example, CPU usage of the application was reduced by a factor of 10. The problems found and addressed during that case were

  • optimization of Hibernate usage
  • facelets compilation caching was properly configured
  • a workaround for a performance bug in the Seam framework was implemented
  • JSF layouts were optimized to reduce number of Seam injections / outjections

Limitations of this approach

During statistical analysis of stack traces you deal with wallclock time; you cannot guess real CPU time using this method. If the CPU on the host is saturated, your numbers will be skewed by the threads' idle time due to CPU starvation.

Normally you can get a stack trace only at JVM safepoints. So if some methods are inlined by the JIT compiler, they may never appear in a trace even if they are really busy. In other words, the tip of the stack trace may be skewed by JIT effects. In practice, it was never an obstacle for me, but you should keep the possibility of such an effect in mind.

What about flame graphs?

Well, despite not being so useful, they look good in presentations. Support for flame graphs was added to SJK recently.