Is multiplication faster than array access?

Question

To my surprise I get a longer time (10 milliseconds) when "optimizing" multiplications by pregenerating the results in an array compared to the original 8 milliseconds. Is that just a Java quirk or is that general of the PC architecture? I have a Core i5 760 with Java 7, Windows 8 64 Bit.

public class Test {
    public static void main(String[] args)  {
        long start = System.currentTimeMillis();
        long sum=0;
        int[] sqr = new int[1000];
        for(int a=1;a<1000;a++) {sqr[a]=a*a;}

        for(int b=1;b<1000;b++)
//          for(int a=1;a<1000;a++) {sum+=a*a+b*b;}
            for(int a=1;a<1000;a++) {sum+=sqr[a]+sqr[b];}
        System.out.println(System.currentTimeMillis()-start+"ms");
        System.out.println(sum);
    }
}

That measurement is way too short to be anywhere near accurate. The numbers are essentially purely random. Don’t put stock in them. And the Java JIT optimiser probably does’t even start up for 1000 iterations so even if the clock were miraculously accurate, the measurement would be meaningless in the context of a real program. — Konrad Rudolph, Feb 03 '14 at 23:31
However: Remember that array access, in general, involves a multiplication (or at least a shift), an addition, and a fetch. It isn't at all clear that, in the simple integer case, this will perform better than a simple multiplication. — keshlam, Feb 03 '14 at 23:33
See also: http://stackoverflow.com/questions/504103/how-do-i-write-a-correct-micro-benchmark-in-java — meriton, Feb 03 '14 at 23:34
In my case is 4ms(multy) vs. 7ms(add). I think the reason accessing to objects from Heap (java object memory) in first case, and operating with primitive types is - second case. A little processor spend time doesn't make time. — Ashot Karakhanyan, Feb 03 '14 at 23:37
Don't you want to start the timer after you've populated the array? Otherwise it's not a fair comparison. — Paul Hicks, Feb 04 '14 at 02:47
@KonradRudolph In fact it is 1 million iterations (nested loop) but I can try it again for higher numbers. — Konrad Höffner, Feb 04 '14 at 09:59
@PaulHicks Ah you are right. I originally wanted to measure squares by multiplications instead of array access and include the pregeneration time but the way I rephrased the question I should have left it out. Still with 1000 to 1 million iterations I hope it is not a big difference. — Konrad Höffner, Feb 04 '14 at 10:01
@kirdie Ah, good point. That does alleviate parts of my concerns. But the warming-up of the JIT is still pretty high (although I don’t have hard numbers) and the time thus too short for it to make much of a difference. And the clock accuracy issue also still stands. — Konrad Rudolph, Feb 04 '14 at 10:41

score 12 · Accepted Answer · edited May 23 '17 at 12:09

Konrad Rudolph commented on the issues with the benchmarking. So I am ignoring the benchmark and focus on the question:

Is multiplication faster than array access?

Yes, it is very likely. It used to be the other way around 20 or 30 years ago.

Roughly speaking, you can do an integer multiplication in 3 cycles (pessimistic, if you don't get vector instructions), and a memory access costs you 4 cycles if you get it straight from the L1 cache but it is straight downhill from there. For reference, see

Latencies and throughput in Appendix C of the Intel 64 and IA-32 Architectures Optimization Reference Manual
Approximate cost to access various caches and main memory?
Herb Sutter's presentation on this very subject: Machine Architecture: Things Your Programming Language Never Told You

One thing specific to Java was pointed out by Ingo in a comment below: You also get bounds checking in Java, which makes the already slower array access even slower...

@Ingo That's a good one, yes, I completely forgot that one :) Thanks for pointing this out, added that to the answer. — Ali, Feb 04 '14 at 01:03

meriton · Answer 2 · 2014-02-04T00:59:58.177

A more reasonable benchmark would be:

public abstract class Benchmark {

    final String name;

    public Benchmark(String name) {
        this.name = name;
    }

    abstract int run(int iterations) throws Throwable;

    private BigDecimal time() {
        try {
            int nextI = 1;
            int i;
            long duration;
            do {
                i = nextI;
                long start = System.nanoTime();
                run(i);
                duration = System.nanoTime() - start;
                nextI = (i << 1) | 1;
            } while (duration < 1000000000 && nextI > 0);
            return new BigDecimal((duration) * 1000 / i).movePointLeft(3);
        } catch (Throwable e) {
            throw new RuntimeException(e);
        }
    }

    @Override
    public String toString() {
        return name + "\t" + time() + " ns";
    }

    private static void shuffle(int[] a) {
        Random chaos = new Random();
        for (int i = a.length; i > 0; i--) {
            int r = chaos.nextInt(i);
            int t = a[r];
            a[r] = a[i - 1];
            a[i - 1] = t;
        }
    }


    public static void main(String[] args) throws Exception {
        final int[] table = new int[1000];
        final int[] permutation = new int[1000];

        for (int i = 0; i < table.length; i++) {
            table[i] = i * i;
            permutation[i] = i;
        }
        shuffle(permutation);

        Benchmark[] marks = {
            new Benchmark("sequential multiply") {
                @Override
                int run(int iterations) throws Throwable {
                    int sum = 0;
                    for (int j = 0; j < iterations; j++) {
                        for (int i = 0; i < table.length; i++) {
                            sum += i * i;
                        }
                    }
                    return sum;
                }
            },
            new Benchmark("sequential lookup") {
                @Override
                int run(int iterations) throws Throwable {
                    int sum = 0;
                    for (int j = 0; j < iterations; j++) {
                        for (int i = 0; i < table.length; i++) {
                            sum += table[i];
                        }
                    }
                    return sum;
                }
            },
            new Benchmark("random order multiply") {
                @Override
                int run(int iterations) throws Throwable {
                    int sum = 0;
                    for (int j = 0; j < iterations; j++) {
                        for (int i = 0; i < table.length; i++) {
                            sum += permutation[i] * permutation[i];
                        }
                    }
                    return sum;
                }
            },
            new Benchmark("random order lookup") {
                @Override
                int run(int iterations) throws Throwable {
                    int sum = 0;
                    for (int j = 0; j < iterations; j++) {
                        for (int i = 0; i < table.length; i++) {
                            sum += table[permutation[i]];
                        }
                    }
                    return sum;
                }
            }
        };

        for (Benchmark mark : marks) {
            System.out.println(mark);
        }
    }
}

which prints on my intel core duo (yes, it's old):

sequential multiply    2218.666 ns
sequential lookup      1081.220 ns
random order multiply  2416.923 ns
random order lookup    2351.293 ns

So, if I access the lookup array sequentially, which minimizes the number of cache misses, and permits the hotspot JVM to optimize bounds checking on array access, there is a slight improvement on an array of 1000 elements. If we do random access into the array, that advantage disappears. Also, if the table is larger, the lookup gets slower. For instance, for 10000 elements, I get:

sequential multiply    23192.236 ns
sequential lookup      12701.695 ns
random order multiply  24459.697 ns
random order lookup    31595.523 ns

So, array lookup is not faster than multiplication, unless the access pattern is (nearly) sequential and the lookup array small.

In any case, my measurements indicate that a multiplication (and addition) takes merely 4 processor cycles (2.3 ns per loop iteration on a 2GHz CPU). You're unlikely to get much faster than that. Also, unless you do half a billion multiplications per second, the multiplications are not your bottleneck, and optimizing other parts of the code will be more fruitful.

Is multiplication faster than array access?

2 Answers2