His suggestion was to simply send N start messages, where N >= number of processors. In theory, performance should then become optimal with N threads. Unfortunately, this is not what happens in practice: the single-threaded version is still faster even when you send 16 messages on a biprocessor machine.
public static void main(String[] args) throws Exception {
    OptimizedRing ring = new OptimizedRing();
    RingNode node = ring.startRing(Integer.parseInt(args[0]));
    node.sendMessage(new StartMessage());
    node.sendMessage(new TokenMessage(node.nodeId, 1));
    node.sendMessage(new TokenMessage(node.nodeId, 1));
    node.sendMessage(new TokenMessage(node.nodeId, 1));
    ring.executor.awaitTermination(10, TimeUnit.MINUTES);
}
My idea was that it was related to the overhead of switching from thread to thread, which is precisely what I think the original author of the test had in mind to measure. Still, I was not 100% convinced that this is really what was happening. I wanted a test that would actually be faster with N threads, so I decided to add a bit of computation before processing each token. Unfortunately, I had the bad idea of computing Pi by the Monte Carlo method to do so. Running my tests, I was surprised to see that it did not change the results; worse, the more compute-intensive the task became (increasing the number of Monte Carlo iterations), the worse things got. It scared me a bit, wondering what the hell could be wrong there. The following class performs much worse with 2 threads than with 1:
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class BadParallelPi {

    private static void startExecutors() throws Exception {
        long startTime = System.currentTimeMillis();
        System.out.println(startTime);
        ExecutorService executor1 = Executors.newFixedThreadPool(1);
        executor1.execute(new Computation());
        executor1.execute(new Computation());
        executor1.shutdown();
        executor1.awaitTermination(60, TimeUnit.SECONDS);
        long delay = System.currentTimeMillis() - startTime;
        System.out.println("finished single thread in " + (delay / 1000.0));

        startTime = System.currentTimeMillis();
        System.out.println(startTime);
        executor1 = Executors.newFixedThreadPool(2);
        executor1.execute(new Computation());
        executor1.execute(new Computation());
        executor1.shutdown();
        executor1.awaitTermination(60, TimeUnit.SECONDS);
        delay = System.currentTimeMillis() - startTime;
        System.out.println("finished 2 threads in " + (delay / 1000.0));
    }

    public static class Computation implements Runnable {
        public volatile int count = 0;

        // Monte Carlo estimation of Pi: draw random points in the unit
        // square and count those falling inside the quarter circle.
        private double computePi() {
            double pi = 0;
            double x, y;
            int n = 10000000;
            for (int i = 0; i < n; i++) {
                x = Math.random();
                x *= x;
                y = Math.random();
                y *= y;
                if (x + y < 1) {
                    pi += 1;
                }
            }
            pi = 4 * pi / n;
            return pi;
        }

        public void run() {
            double pi = computePi();
            long time = System.currentTimeMillis();
            System.out.println(time + " thread " + Thread.currentThread().getId() + " pi=" + pi);
            count++;
        }
    }

    public static void main(String[] args) throws Exception {
        startExecutors();
    }
}
Did you figure out why?
It took me less time to find out why with this simple code than with the original ring test. It is simply the Math.random call: Math.random creates only one random number generator, shared among all threads, so each thread ends up waiting for the other at that point. Creating one random generator per thread finally showed 2 threads being much faster than 1.
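A minimal sketch of that fix (the class name FixedComputation is mine, not the class from the original test; the real fix may differ in detail): each task owns its own java.util.Random, so the threads no longer serialize on the shared generator behind Math.random.

import java.util.Random;

public static class FixedComputation implements Runnable {
    // one generator per instance -> no cross-thread contention
    private final Random random = new Random();

    private double computePi() {
        double inside = 0;
        int n = 10000000;
        for (int i = 0; i < n; i++) {
            double x = random.nextDouble();
            double y = random.nextDouble();
            if (x * x + y * y < 1) {
                inside++;
            }
        }
        return 4 * inside / n;
    }

    public void run() {
        System.out.println(System.currentTimeMillis() + " thread "
                + Thread.currentThread().getId() + " pi=" + computePi());
    }
}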
Back to the original ring test. After adding the correct way to compute Pi by Monte Carlo, I finally had decent test results, as long as the number of iterations was not too small: 10 iterations are enough to show a real difference between N threads and 1. Adding a small computation helps to figure out what happens behind the scenes; a sketch of what it looks like follows below. You can also verify D. Andreou's claim: using only 1 start message, the single-threaded version is faster. If the computation is too light (for example, a Monte Carlo iteration count of 0), one only measures the cost of method calls between threads (context switching), which is obviously optimal for 1 thread. Measuring actor libraries with such a benchmark is dangerous: if I write a single-threaded actor library, it will be the fastest on this test, but it is certainly not what you want to use as an actor library.
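Here is the promised sketch of what "adding a small computation" means; the method and field names (handleToken, forward) are illustrative, not the actual OptimizedRing API. Each node burns a few Monte Carlo iterations with its own Random (avoiding the Math.random contention described above) before forwarding the token:

// Sketch only: handleToken and forward are assumed names, not the real code.
private final java.util.Random random = new java.util.Random();

private void handleToken(TokenMessage token) {
    monteCarloPi(10); // a little CPU work per hop; 10 iterations already shows the difference
    forward(token);   // pass the token to the next node in the ring
}

private double monteCarloPi(int n) {
    double inside = 0;
    for (int i = 0; i < n; i++) {
        double x = random.nextDouble();
        double y = random.nextDouble();
        if (x * x + y * y < 1) {
            inside++;
        }
    }
    return 4 * inside / n;
}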
Let's now see how Scala fares compared to the plain Java solution, with the computation added:
Machine   | Algorithm                | 100,000 ring counts, 10 MC iter., 4 messages | 10,000 ring counts, 100 MC iter., 4 messages
----------+--------------------------+----------------------------------------------+---------------------------------------------
Core2Duo  | OptimizedRing 2 Threads  |  57s                                         | 37s
Core2Duo  | OptimizedRing 4 Threads  |  78s                                         | 39s
Core2Duo  | Scala Actors             |  82s                                         | 47s
Core2Duo  | SimpleRing (100 Threads) | 137s                                         | 58s
Core2Duo  | OptimizedRing 1 Thread   |  89s                                         | 71s
Core2Quad | OptimizedRing 4 Threads  |  81s                                         | 25s
Core2Quad | Scala Actors             |  71s                                         | 30s
Core2Quad | OptimizedRing 2 Threads  |  61s                                         | 43s
Core2Quad | OptimizedRing 1 Thread   | 100s                                         | 80s
The Core2Quad is an Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz.
It is interesting to compare the results of 4 threads on a biprocessor with a Monte Carlo count of 10 versus 100: the thread overhead is much higher when there is less computation. With too little computation in the Monte Carlo step, the overhead of extra threads is too high compared to just 2 concurrent threads. This explains why the very simple threading architecture fares much better in the last column than in the previous one.
Scala Actors fares much better when it is not hindered by the creation of too many threads. It actually seems very good at abstracting away the intricacies of multithreading, while still providing near-Java performance in the real world, where each actor does enough computation and multithreading is important.
By the way, Java 7 will probably come with a way to create random numbers efficiently from more than one thread.
Yes. Here it is in its current form: ThreadLocalRandom. Of course, it's functionally equivalent to a ThreadLocal<Random>, just easier to use, and it is a subclass of Random (so it can be injected into existing code to remove that contention).
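For illustration, a minimal sketch of the Monte Carlo step rewritten with it; ThreadLocalRandom.current() and nextDouble() are the real API, while the surrounding method is assumed from the post above:

import java.util.concurrent.ThreadLocalRandom;

// Each calling thread transparently gets its own generator: no shared
// state, no hand-written ThreadLocal<Random> plumbing.
private double computePi(int n) {
    double inside = 0;
    for (int i = 0; i < n; i++) {
        double x = ThreadLocalRandom.current().nextDouble();
        double y = ThreadLocalRandom.current().nextDouble();
        if (x * x + y * y < 1) {
            inside++;
        }
    }
    return 4 * inside / n;
}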
@Fabien, that was quite sneaky indeed. I had never noticed that Math.random introduces contention too, just like Random (I only now saw that it is implemented on top of Random). Doug Lea also claimed that even some SPEC benchmarks suffered from this very oversight.