Showing posts with label gridkit. Show all posts

Thursday, December 12, 2013

TechTalk: Virtualizing Java in Java

On December 12, I spoke at the JUG in Saint Petersburg, Russia.

It was a long talk about using NanoCloud.

Below are the video and the slide deck from the event.

Friday, February 1, 2013

How to simulate Coherence node failure in JUnit test

Recently, I was working on a renewed version of the Coherence data loss listener. The new version provides a simple utility to attach a partition loader to any cache. Such a partition loader is guaranteed to be called for every partition of a newly created cache, or after a partition has been lost. Unlike CacheStore, the partition loader will be called on the node where it has been added. This way you can have a dedicated set of loader processes that are not involved in storing data. It is also guaranteed that

  • only one instance of the partition loader is executed for a particular partition in the cluster,
  • if there is at least one live node with a registered partition loader, it will be invoked for an empty partition.

    But let’s get back to the topic of this post. I’m actively using JUnit and my test-utils library to automate testing of my Coherence stuff. Test-utils uses class loader isolation to run multiple Coherence nodes in a single JVM.

    But, unlike many other tests, here I need to test a disaster case. Coherence should think that one of its nodes has died. Normally, I use CacheFactory.shutdown() to kill a virtualized Coherence node, but that would be a graceful shutdown.

    For the data loss listener, I really want to test a disaster case.

    How does Coherence track node liveness?

    A naïve approach based on a timeout does not work well for a data grid that prioritizes resilience and performance, such as Coherence.

    What is the problem with a timeout?

    If you make it too short, there will be too many false positives, making the grid unstable (the JVM may do a GC, the OS may start swapping, etc.).

    If you make it too long, recovery from a disaster will take too long.

    How can this be improved? Let’s see what kinds of disaster could possibly happen to your cluster:

  • The JVM process could be killed, crash, or just exit without shutdown.
  • The server could crash or become unreachable via the network.

    The death of a process can easily be tracked if you keep a TCP connection to it open. The OS will close all TCP connections of a dead process, so once the connection is closed you can make a very good assumption that the remote process is dead.

    Coherence uses a so-called TCP ring for that purpose. Each cluster node keeps two open TCP connections to other cluster nodes (forming a ring). If the cluster detects that both TCP connections of a node have been closed, it has a very good reason to disconnect that node right away and start the recovery procedure.
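The idea that a closed TCP connection signals peer death almost immediately can be tried out with plain sockets. The sketch below is only an illustration, not Coherence's TCP ring implementation: the class and method names are made up for this example, both ends live in one JVM on the loopback interface, and closing the socket stands in for the OS cleaning up after a dead process.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class TcpDeathDetection {

    /**
     * Opens a loopback connection, closes the "victim" end and reports
     * whether the surviving end notices it (end-of-stream or reset)
     * without waiting for any application level timeout.
     */
    public static boolean peerDeathIsVisible() throws IOException {
        try (ServerSocket listener = new ServerSocket(0, 1, InetAddress.getLoopbackAddress())) {
            Socket victim = new Socket(listener.getInetAddress(), listener.getLocalPort());
            try (Socket survivor = listener.accept()) {
                // Safety net only: do not hang forever if nothing arrives.
                survivor.setSoTimeout(5000);

                // Stand-in for the OS closing connections of a dead process.
                victim.close();

                InputStream in = survivor.getInputStream();
                try {
                    // A blocking read returns -1 once the peer has closed.
                    return in.read() == -1;
                } catch (SocketTimeoutException silent) {
                    // Nothing was observed within the safety timeout.
                    return false;
                } catch (IOException reset) {
                    // An abrupt close may surface as a connection reset.
                    return true;
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("peer death detected: " + peerDeathIsVisible());
    }
}
```

Note that this works only because the operating system on the victim side is still alive to close the connection. If the whole server disappears, nothing closes it, which is exactly the gap a reachability check has to cover.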

    In case of a server or network failure, the TCP connection will not be closed immediately. So, in addition to the TCP ring, Coherence uses an IP monitor to track reachability of IP addresses. If an IP address cannot be reached by the rest of the nodes, the cluster will not hesitate to disconnect all nodes at that IP.

    These two tricks allow Coherence to detect a real failure very fast, yet remain very tolerant of long GC pauses and other non-fatal slowdowns.

    Steps to kill a node in a JUnit test

    In a JUnit test, all nodes of the cluster share the same JVM, so I cannot really kill a process. To simulate node death, I call Thread.suspend() on all threads related to the victim node (a feature of test-utils). This makes the node totally unresponsive.

    The two mechanisms above (TCP ring and IP monitor) should be turned off in the Coherence operational configuration. The disconnect timeout should also be set to a smaller value (otherwise each test will take too long).
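With the faster detectors switched off, the only thing left to notice a frozen node is the timeout. The toy model below sketches that situation under loud assumptions: it does not use Coherence or test-utils, every name in it is hypothetical, and instead of Thread.suspend() (deprecated, and throwing UnsupportedOperationException since JDK 20) it freezes the node with a cooperative flag.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicLong;

public class FrozenNodeDemo {

    /** A toy "node" that publishes a heartbeat timestamp until it is frozen. */
    static final class Node implements Runnable {
        final AtomicLong lastHeartbeatNanos = new AtomicLong(System.nanoTime());
        final AtomicBoolean frozen = new AtomicBoolean(false);
        volatile boolean running = true;

        @Override
        public void run() {
            while (running) {
                // Cooperative stand-in for Thread.suspend(): a frozen node
                // stays alive but stops reporting.
                if (!frozen.get()) {
                    lastHeartbeatNanos.set(System.nanoTime());
                }
                try {
                    Thread.sleep(10);
                } catch (InterruptedException e) {
                    return;
                }
            }
        }
    }

    /** Timeout based liveness check: dead if silent for longer than the timeout. */
    static boolean isConsideredDead(Node node, long timeoutMillis) {
        long silentMillis = (System.nanoTime() - node.lastHeartbeatNanos.get()) / 1_000_000;
        return silentMillis > timeoutMillis;
    }

    /** Returns { verdict while healthy, verdict after freezing }. */
    public static boolean[] run(long timeoutMillis) throws InterruptedException {
        Node node = new Node();
        Thread thread = new Thread(node, "toy-node");
        thread.setDaemon(true);
        thread.start();

        Thread.sleep(timeoutMillis / 2);
        boolean deadWhileHealthy = isConsideredDead(node, timeoutMillis);

        node.frozen.set(true);               // simulate the crash
        Thread.sleep(timeoutMillis * 2);     // wait out the disconnect timeout
        boolean deadAfterFreeze = isConsideredDead(node, timeoutMillis);

        node.running = false;
        thread.interrupt();
        return new boolean[] { deadWhileHealthy, deadAfterFreeze };
    }

    public static void main(String[] args) throws InterruptedException {
        boolean[] verdicts = run(500);
        System.out.println("dead while healthy: " + verdicts[0]);
        System.out.println("dead after freeze:  " + verdicts[1]);
    }
}
```

The test has to wait longer than the timeout before the node is declared dead, which is why a small disconnect timeout keeps such tests reasonably fast.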

    That is it; now I can test disaster cases for Coherence using JUnit.

    Below is a snippet of the actual test:

    @Test
    public void verify_parallel_init_crash_case() throws InterruptedException {

        final int partitions = 2000;
        final int timeout = 15000;

        CacheTemplate.usePartitionedServicePartitionCount(cluster, partitions);

        CohHelper.setTCMPTimeout(server(0), timeout);
        CohHelper.disableTcpRing(server(0));
        CohHelper.setTCMPTimeout(client(0), timeout);
        CohHelper.disableTcpRing(client(0));
        ...

        server(0).getCache("a-cache1");
        statics().initPartitionCounter(partitions);

        // init Coherence nodes
        client(0).getCache("a-cache1");
        client(1).getCache("a-cache1");
        ...

        // attaching test partition loader
        attachTouchMonitor(0, 20, "a-cache1");
        attachTouchMonitor(1, 20, "a-cache1");
        ...

        Thread.sleep(500);

        System.out.println("Simulating crash for 2,3,4,5");
        // simulating client crash, verify lock revocation
        client(2).suspend();
        client(3).suspend();
        client(4).suspend();
        client(5).suspend();

        // waiting for all test listeners to finish
        statics().waitAllLatches();
        System.out.println("Latches are open");

        Thread.sleep(200);

        // checking cache state
        assertAllCanaries(0, "a-cache1");
    }

    Here is a link to the full Java file in SVN.

    Monday, January 28, 2013

    Remote code execution in Java made trivial

    SSH offers a very convenient way to execute shell scripts remotely. Code like

    ssh myserver.acme.com << EOC
    cd $APP_HOME
    ./start_my_stuff.sh
    EOC

    is fairly easy to write and read.

    But while remote execution itself is easy, writing actual distributed code is a total mess.

    I’m a Java guy. I wish I could run Java code remotely as easily as I can with shell.
    And now, finally, I can.

    @Test
    public void remote_hello_world() throws InterruptedException {

        ViManager cloud = CloudFactory.createSimpleSshCloud();
        cloud.node("myserver.acme.com");

        cloud.node("**").exec(new Callable<Void>() {
            @Override
            public Void call() throws Exception {
                String localHost = InetAddress.getLocalHost().toString();
                System.out.println("Hi! I'm running on " + localHost);
                return null;
            }
        });

        // Console output is transferred asynchronously,
        // so it is better to wait a few seconds for it to arrive
        Thread.sleep(1000);
    }

    And that snippet will work for you too. There are only two requirements:

  • You should be able to SSH to the target host without a password (you have SSH key pairs set up).
  • The remote host should have a JDK installed and the java command on the PATH.

    That is it; there is no need to install anything special on the target servers.

    You can find more details in the tutorial at the GridKit site.

    How does it work?

    A lot of black magic is happening behind the scenes. In particular:

  • The classpath of the current Java process is replicated and cached on the remote host using SFTP.
  • SSH is used to start the remote Java process.
  • All communication between master and slave is tunneled through stdIn and stdOut (so you need not care much about NAT and firewalls).
  • A modified version of RMI/serialization is used to allow anonymous classes to be run remotely.

    But despite all that internal complexity, it just works. I’m using it to run agents on Linux and Solaris. I have used it on Amazon EC2, and I could have the master process running on my Windows desktop while slaves were scattered across Unix hosts.
    It was a long road. A lot of issues, such as firewalls, bugs in JSch (a Java SSH client), subtle SSH limitations, etc., have been solved along that path.
    But now, I believe, it will “just work” for you too.
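The core trick of shipping a task as bytes over a stream and running it on the other side can be sketched with standard Java serialization. This is a minimal illustration, not the library's actual code: all names are hypothetical, an in-memory byte array stands in for the SSH stdIn/stdOut tunnel, and a named Serializable class is used because plain Java serialization lacks the modifications that make anonymous classes work remotely.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.concurrent.Callable;

public class StreamTunnelSketch {

    /** A task that can travel over a byte stream. */
    static final class Greeting implements Callable<String>, Serializable {
        private static final long serialVersionUID = 1L;
        private final String name;

        Greeting(String name) {
            this.name = name;
        }

        @Override
        public String call() {
            return "Hi, " + name + "!";
        }
    }

    /** Sending side: turn the task into bytes, as if writing to the remote stdIn. */
    static byte[] send(Serializable task) throws IOException {
        ByteArrayOutputStream wire = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(wire)) {
            out.writeObject(task);
        }
        return wire.toByteArray();
    }

    /** Receiving side: rebuild the task from bytes and execute it. */
    static Object receiveAndRun(byte[] bytes) throws Exception {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            Callable<?> task = (Callable<?>) in.readObject();
            return task.call();
        }
    }

    /** Full round trip: serialize, "transfer", deserialize, run. */
    public static String roundTrip(String name) throws Exception {
        return (String) receiveAndRun(send(new Greeting(name)));
    }

    public static void main(String[] args) throws Exception {
        System.out.println(roundTrip("JUG"));
    }
}
```

Deserialization here succeeds only because the receiving side already has the Greeting class available, which hints at why the classpath has to be replicated to the remote host first.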

    Java vs shell

    But what is the need for such a library in the first place? Below are just a few of my reasons:

  • I’m a Java guy; I do not want to invest much in shell scripting skills.
  • Java is "really" cross-platform; heck, I can even debug stuff on Windows and then run it on Linux.
  • I have to do a lot of distributed orchestration (starting/stopping processes in the right order, etc.); it is so much easier to do in Java.
  • It is hard to write reusable shell scripts, but easy in Java (heck, I can even unit test pieces of deployment logic).

    Remoting is just remoting

    The ability to effortlessly run remote Java code is not much by itself. But it enables you to reach new levels of day-to-day automation (which were prohibitively expensive before).

    But that will be a topic for another post.

    Friday, September 18, 2009

    Oracle Coherence using POF, without a single line of code

    People developing distributed Java applications know the importance of wire formats for objects. Native Java serialization has only one advantage: it is built in. It is relatively slow, not very compact, and has other quirks. Starting with version 3.2, Oracle Coherence offers its own proprietary binary wire format for objects, POF serialization. POF is not only cross-platform, but also much more compact and faster than built-in serialization. Both compactness and speed are extremely important for data grid applications. The only disadvantage of POF is that you have to write custom serialization/deserialization code for each of your mobile objects. Not only domain objects stored in the cache need serializers, but also entry processors, aggregators, etc. The amount of code you have to write may look daunting and force you to stick with built-in Java serialization.

    The full text of the article is available on the GridDynamics blog: http://blog.griddynamics.com/2009/09/oracle-coherence-using-pof-without.html