I was just running some multithreaded code on a 4-core machine in the hope that it would be faster than on a single-core machine. Here's the idea: I have a fixed number of threads (in my case, one thread per core). Every thread executes a Runnable of the form:
private static int[] data; // data shared across all threads
public void run() {
    int i = 0;
    while (i++ < 5000) {
        // do some work
        for (int j = 0; j < 10000 / numberOfThreads; j++) {
            // each thread performs calculations and reads from and
            // writes to a different part of the data array
        }
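        // wait for the other threads
        try {
            barrier.await();
        } catch (InterruptedException | BrokenBarrierException e) {
            // await() throws checked exceptions; restore the flag and stop
            Thread.currentThread().interrupt();
            return;
        }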
    }
}
On a quad-core machine, this code performs worse with 4 threads than it does with 1 thread. Even with the CyclicBarrier's overhead, I would have expected it to run at least twice as fast. Why does it run slower?
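For anyone who wants to reproduce this, the setup can be sketched as a self-contained program. This is only a minimal sketch under assumptions: the class name BarrierBenchmark, the run and snapshot helpers, and the placeholder work (data[j] += j) are made up for illustration and stand in for the real calculation. Timings will vary by machine and JVM.

```java
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;

// Hypothetical harness; names and the placeholder work are assumptions.
class BarrierBenchmark {
    // Data shared across all threads; each thread touches only its own slice.
    private static int[] data;

    /** Runs the barrier-synchronised rounds and returns the elapsed time in nanoseconds. */
    static long run(int numberOfThreads, int rounds, int size) {
        data = new int[size];
        CyclicBarrier barrier = new CyclicBarrier(numberOfThreads);
        Thread[] threads = new Thread[numberOfThreads];
        int chunk = size / numberOfThreads;

        long start = System.nanoTime();
        for (int t = 0; t < numberOfThreads; t++) {
            final int from = t * chunk;
            // The last thread also takes any remainder of the array.
            final int to = (t == numberOfThreads - 1) ? size : from + chunk;
            threads[t] = new Thread(() -> {
                try {
                    for (int i = 0; i < rounds; i++) {
                        for (int j = from; j < to; j++) {
                            data[j] += j; // placeholder for the real calculation
                        }
                        barrier.await(); // wait for the other threads
                    }
                } catch (InterruptedException | BrokenBarrierException e) {
                    Thread.currentThread().interrupt();
                }
            });
            threads[t].start();
        }
        try {
            for (Thread thread : threads) {
                thread.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while joining workers", e);
        }
        return System.nanoTime() - start;
    }

    /** Returns a copy of the shared array after the last run. */
    static int[] snapshot() {
        return data.clone();
    }

    public static void main(String[] args) {
        for (int n : new int[] {1, 4}) {
            long nanos = run(n, 5000, 10000);
            System.out.println(n + " thread(s): " + nanos / 1_000_000 + " ms");
        }
    }
}
```

Running main prints one timing for 1 thread and one for 4 threads, which is the comparison described above. With this little work per round (10000 / numberOfThreads array updates), the time spent in barrier.await() is likely to be a large share of each round.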
EDIT: Here's a busy-wait implementation I tried. Unfortunately, it makes the program run slower on more cores (this is also being discussed in a separate question here):
public void run() {
    // do work
    synchronized (this) {
        if (atomicInt.decrementAndGet() == 0) {
            atomicInt.set(numberOfOperations);
            for (int i = 0; i < threads.length; i++)
                threads[i].interrupt();
        }
    }
    while (!Thread.interrupted()) {}
}