His suggestion was to simply send N start messages, where N >= the number of processors. In theory, performance then becomes optimal with N threads. Unfortunately, this is not what happens in practice: single-threaded performance is still better even if you send 16 messages on a biprocessor machine.

public static void main(String[] args) throws Exception {
    OptimizedRing ring = new OptimizedRing();
    RingNode node = ring.startRing(Integer.parseInt(args[0]));
    node.sendMessage(new StartMessage());
    node.sendMessage(new TokenMessage(node.nodeId, 1));
    node.sendMessage(new TokenMessage(node.nodeId, 1));
    node.sendMessage(new TokenMessage(node.nodeId, 1));
    ring.executor.awaitTermination(10, TimeUnit.MINUTES);
}

My idea was that it was related to the overhead of switching from thread to thread, which is precisely what I think the original author of the test had in mind to measure. I am not 100% convinced that is really what's happening. I wanted a test that would actually be faster with N threads, so I decided to add a bit of computation before processing each token. Unfortunately, I had the bad idea of computing Pi by the Monte Carlo method to do that. Running my tests, I was surprised that it did not change the results, and things got worse the more compute-intensive the computation was (increasing the number of Monte Carlo iterations). It scared me a bit, wondering what could possibly be wrong there. The following class performs much worse with 2 threads than with 1:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class BadParallelPi {

    private static void startExecutors() throws Exception {
        // Run 2 computations sequentially on a single thread.
        long startTime = System.currentTimeMillis();
        System.out.println(startTime);
        ExecutorService executor1 = Executors.newFixedThreadPool(1);
        executor1.execute(new Computation());
        executor1.execute(new Computation());
        executor1.shutdown();
        executor1.awaitTermination(60, TimeUnit.SECONDS);
        long delay = System.currentTimeMillis() - startTime;
        System.out.println("finished single thread in " + (delay / 1000.0));

        // Run the same 2 computations concurrently on 2 threads.
        startTime = System.currentTimeMillis();
        System.out.println(startTime);
        executor1 = Executors.newFixedThreadPool(2);
        executor1.execute(new Computation());
        executor1.execute(new Computation());
        executor1.shutdown();
        executor1.awaitTermination(60, TimeUnit.SECONDS);
        delay = System.currentTimeMillis() - startTime;
        System.out.println("finished 2 threads in " + (delay / 1000.0));
    }

    public static class Computation implements Runnable {

        public volatile int count = 0;

        private double computePi() {
            double pi = 0;
            double x, y;
            int n = 10000000;
            for (int i = 0; i < n; i++) {
                x = Math.random();
                x *= x;
                y = Math.random();
                y *= y;
                if (x + y < 1) {
                    pi += 1;
                }
            }
            pi = 4 * pi / n;
            return pi;
        }

        public void run() {
            double pi = computePi();
            long time = System.currentTimeMillis();
            System.out.println(time + " thread " + Thread.currentThread().getId() + " pi=" + pi);
            count++;
        }
    }

    public static void main(String[] args) throws Exception {
        startExecutors();
    }
}

Did you figure out why?

It took me less time with this simple code than with the original ring test to find out why. It is simply because of the Math.random call. Math.random creates only one random number generator, which is shared among threads, so each thread ends up waiting for the other at this point. Creating one random generator per thread finally showed 2 threads being much faster than 1.
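To illustrate the fix, here is a minimal sketch of the per-thread-generator approach (class and method names are mine, not from the original code): each thread draws from its own java.util.Random held in a ThreadLocal, so there is no contention on the single shared generator behind Math.random.

```java
import java.util.Random;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class GoodParallelPi {

    // One generator per thread: removes the contention on the single
    // shared Random instance that Math.random uses internally.
    private static final ThreadLocal<Random> LOCAL_RANDOM = new ThreadLocal<Random>() {
        @Override
        protected Random initialValue() {
            return new Random();
        }
    };

    public static double computePi(int n) {
        Random rnd = LOCAL_RANDOM.get(); // thread-private generator
        int inside = 0;
        for (int i = 0; i < n; i++) {
            double x = rnd.nextDouble();
            double y = rnd.nextDouble();
            if (x * x + y * y < 1) {
                inside++;
            }
        }
        return 4.0 * inside / n;
    }

    public static void main(String[] args) throws Exception {
        ExecutorService executor = Executors.newFixedThreadPool(2);
        Callable<Double> task = new Callable<Double>() {
            public Double call() {
                return computePi(10000000);
            }
        };
        long start = System.currentTimeMillis();
        Future<Double> first = executor.submit(task);
        Future<Double> second = executor.submit(task);
        System.out.println("pi=" + first.get() + " pi=" + second.get());
        System.out.println("2 threads took " + (System.currentTimeMillis() - start) / 1000.0 + "s");
        executor.shutdown();
    }
}
```

With this version, the 2-thread run scales as expected on a multicore machine, because the only shared state between the two tasks is gone.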

Back to the original ring test. Adding the correct way to compute Pi by Monte Carlo, I now had decent test results as long as the number of iterations was not too small; 10 iterations is enough to show a real difference between N threads and 1. Adding a small computation helps figure out what happens behind the scenes. You can also verify D. Andreou's claim: using only 1 start message, the single-threaded version is faster. If the computation is too light (for example, a Monte Carlo iteration count of 0), one only measures method calls between threads (context switching), which is obviously optimal for 1 thread. Measuring actor libraries on such a test is dangerous: if I write a single-threaded actor library, it will be the fastest on this test, but it is certainly not what you want to use as an actor library.

Let's see now how Scala fares compared to the Plain Java solution, using computation:

| Machine | Algorithm | Time for 100000 ring count, 10 mc, 4 messages | Time for 10000 ring count, 100 mc, 4 messages |
|---|---|---|---|
| Core2Duo | OptimizedRing 2 Threads | 57s | 37s |
| Core2Duo | OptimizedRing 4 Threads | 78s | 39s |
| Core2Duo | Scala Actors | 82s | 47s |
| Core2Duo | SimpleRing (100 Threads) | 137s | 58s |
| Core2Duo | OptimizedRing 1 Thread | 89s | 71s |
| Core2Quad | OptimizedRing 4 Threads | 81s | 25s |
| Core2Quad | Scala Actors | 71s | 30s |
| Core2Quad | OptimizedRing 2 Threads | 61s | 43s |
| Core2Quad | OptimizedRing 1 Thread | 100s | 80s |

The Core2Quad is Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz

It is interesting to compare the results of 4 threads on a biprocessor with Monte Carlo counts of 10 and 100: thread overhead looms much larger when there is less computation per message. With too little computation in the Monte Carlo step, the overhead of running more than 2 concurrent threads outweighs the gain. This explains why the very simple threading architecture fares much better in the last column than in the previous one.

Scala Actors fare much better when not hindered by the creation of too many threads. They seem actually very good at abstracting away multithreading intricacies, while still providing near-Java performance in the real world, where each actor does enough computation and multithreading is important.

By the way, Java 7 will probably come with a way to generate random numbers efficiently from more than one thread.

Yes. Here it is in its current form: ThreadLocalRandom

Of course, it's functionally equivalent to a ThreadLocal<Random>, just easier to use, and it is a subclass of Random (so it can be injected into existing code to remove that contention).
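For reference, a minimal sketch of how ThreadLocalRandom (in java.util.concurrent since Java 7) is used; the class and the pi-estimation wrapper here are my own illustration, not code from the post:

```java
import java.util.concurrent.ThreadLocalRandom;

public class ThreadLocalRandomDemo {

    // current() returns the generator bound to the calling thread,
    // so concurrent threads never contend on a shared Random instance.
    public static double estimatePi(int n) {
        ThreadLocalRandom rnd = ThreadLocalRandom.current();
        int inside = 0;
        for (int i = 0; i < n; i++) {
            double x = rnd.nextDouble();
            double y = rnd.nextDouble();
            if (x * x + y * y < 1) {
                inside++;
            }
        }
        return 4.0 * inside / n;
    }

    public static void main(String[] args) {
        System.out.println("pi ~= " + estimatePi(1000000));
    }
}
```

Since ThreadLocalRandom extends Random, it can also be passed to any existing code that expects a Random, which is exactly the injection point mentioned above.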

@Fabien, that was quite sneaky indeed. I had never noticed that Math.random introduces contention too, just like Random (I only now saw that it is implemented on top of Random).

Doug Lea also claimed that even some SPEC benchmarks suffered from this very oversight.