Friday, February 1, 2013

How to simulate Coherence node failure in JUnit test

Recently, I was working on renewed version of Coherence data loss listener. New version provides simple utility to attach partition loader to any cache. Such partition loader is guaranteed to be called for every partition of newly created cache or after partition has been lost. Unlike CacheStore, partition loader will be called on the node there it has been added. This way you could have dedicated set loader processes, which are not involved at storing a data. Also it is guaranties that

  • only one instance of partition loader are executed for particular partition in cluster,
  • if there is at least one live node with registered partition loader it would be invoked for empty partition.
  • But let’s get back to a topic of post. I’m actively using JUnit and my test-utils library to automate testing of my Coherence stuff. Test-utils is using class loader isolation to run multiple Coherence nodes in single JVM.

    But, unlike many other tests, here I need to test disaster case. Coherence should think that one of its nodes has died. Normally, I’m using CacheFactory.shutdown() to kill virtualized Coherence node, but this way it would be a graceful shutdown.

    For data loss listener, I really want to test disaster case.

    How Coherence track node liveness?

    Naïve approach using timeout is not working well with data grid prioritizing resilience and performance such as Coherence.

    What is problem with timeout?

    If you let it be too short, there will be too many false positives making grid unstable (JVM may do a GC, OS may start swapping, etc).

    If you let it be too long, time of recovery from disaster would be too long.

    How this can be improved? Let’s see that kind of disaster could possibly happen with your cluster:

  • JVM process could be killed, crushed or just exited without shutdown.
  • Sever could crush or become unreachable via network.
  • Death of process could be easily tracked if you keep open TCP connection open. OS will close all TCP connections for dead process, so you could make very good assumption that remote process is dead.

    Coherence is using so called TCP ring for that purpose. Each cluster node keeps two open TCP connections to other cluster nodes (forming a ring). If cluster detects that both TCP connections have been closed, it has very good reason to disconnect node right now and start recovery procedure.

    In case of server/network failure, TCP connection will not be closed immediately. In addition to TCP ring, Coherence is using IP monitor to track reachability of IP addresses. If IP address cannot be reached by rest of nodes, cluster will not hesitate to disconnect all nodes from that IP.

    This two tricks allow Coherence to detect real failure very fast, yet to be very tolerant to long GC pauses and other non fatal slowdowns.

    Steps to kill node in JUnit test

    In JUnit test all nodes in cluster are sharing same JVM. I cannot really kill a process. To simulate node death, I’m calling Thread.suspend() on all threads related to victim node (a feature of test-utils). This is making node totally unresponsive.

    Two mechanisms above should be turned off in Coherence operational configuration. Disconnect timeout also should be set to smaller value (otherwise each test will take too long).

    That is it, now I can test disaster cases for Coherence using JUnit.

    Below is snippet of actual test:

    @Test public void verify_parallel_init_crash_case() throws InterruptedException { final int partitions = 2000; final int timeout = 15000; CacheTemplate.usePartitionedServicePartitionCount(cluster, partitions); CohHelper.setTCMPTimeout(server(0), timeout); CohHelper.disableTcpRing(server(0)); CohHelper.setTCMPTimeout(client(0), timeout); CohHelper.disableTcpRing(client(0)); ... server(0).getCache("a-cache1"); statics().initPartitionCounter(partitions); // init Coherence nodes client(0).getCache("a-cache1"); client(1).getCache("a-cache1"); ... // attaching test partition loader attachTouchMonitor(0, 20, "a-cache1"); attachTouchMonitor(1, 20, "a-cache1"); ... Thread.sleep(500); System.out.println("Simulating crash for 2,3,4,5"); // simulating client crash, verify lock revocation client(2).suspend(); client(3).suspend(); client(4).suspend(); client(5).suspend(); // waiting for all test listeners to finish statics().waitAllLatches(); System.out.println("Latches are open"); Thread.sleep(200); // checking cache state assertAllCanaries(0, "a-cache1"); }

    Here is a link to full java file in SVN.