A more reasonable benchmark would be:
public abstract class Benchmark {
    final String name;
    public Benchmark(String name) {
        this.name = name;
    }
    abstract int run(int iterations) throws Throwable;
    private BigDecimal time() {
        try {
            int nextI = 1;
            int i;
            long duration;
            do {
                i = nextI;
                long start = System.nanoTime();
                run(i);
                duration = System.nanoTime() - start;
                nextI = (i << 1) | 1;
            } while (duration < 1000000000 && nextI > 0);
            return new BigDecimal((duration) * 1000 / i).movePointLeft(3);
        } catch (Throwable e) {
            throw new RuntimeException(e);
        }
    }
    @Override
    public String toString() {
        return name + "\t" + time() + " ns";
    }
    private static void shuffle(int[] a) {
        Random chaos = new Random();
        for (int i = a.length; i > 0; i--) {
            int r = chaos.nextInt(i);
            int t = a[r];
            a[r] = a[i - 1];
            a[i - 1] = t;
        }
    }
    public static void main(String[] args) throws Exception {
        final int[] table = new int[1000];
        final int[] permutation = new int[1000];
        for (int i = 0; i < table.length; i++) {
            table[i] = i * i;
            permutation[i] = i;
        }
        shuffle(permutation);
        Benchmark[] marks = {
            new Benchmark("sequential multiply") {
                @Override
                int run(int iterations) throws Throwable {
                    int sum = 0;
                    for (int j = 0; j < iterations; j++) {
                        for (int i = 0; i < table.length; i++) {
                            sum += i * i;
                        }
                    }
                    return sum;
                }
            },
            new Benchmark("sequential lookup") {
                @Override
                int run(int iterations) throws Throwable {
                    int sum = 0;
                    for (int j = 0; j < iterations; j++) {
                        for (int i = 0; i < table.length; i++) {
                            sum += table[i];
                        }
                    }
                    return sum;
                }
            },
            new Benchmark("random order multiply") {
                @Override
                int run(int iterations) throws Throwable {
                    int sum = 0;
                    for (int j = 0; j < iterations; j++) {
                        for (int i = 0; i < table.length; i++) {
                            sum += permutation[i] * permutation[i];
                        }
                    }
                    return sum;
                }
            },
            new Benchmark("random order lookup") {
                @Override
                int run(int iterations) throws Throwable {
                    int sum = 0;
                    for (int j = 0; j < iterations; j++) {
                        for (int i = 0; i < table.length; i++) {
                            sum += table[permutation[i]];
                        }
                    }
                    return sum;
                }
            }
        };
        for (Benchmark mark : marks) {
            System.out.println(mark);
        }
    }
}
which prints on my intel core duo (yes, it's old):
sequential multiply    2218.666 ns
sequential lookup      1081.220 ns
random order multiply  2416.923 ns
random order lookup    2351.293 ns
So, if I access the lookup array sequentially, which minimizes the number of cache misses, and permits the hotspot JVM to optimize bounds checking on array access, there is a slight improvement on an array of 1000 elements. If we do random access into the array, that advantage disappears. Also, if the table is larger, the lookup gets slower. For instance, for 10000 elements, I get:
sequential multiply    23192.236 ns
sequential lookup      12701.695 ns
random order multiply  24459.697 ns
random order lookup    31595.523 ns
So, array lookup is not faster than multiplication, unless the access pattern is (nearly) sequential and the lookup array small. 
In any case, my measurements indicate that a multiplication (and addition) takes merely 4 processor cycles (2.3 ns per loop iteration on a 2GHz CPU). You're unlikely to get much faster than that. Also, unless you do half a billion multiplications per second, the multiplications are not your bottleneck, and optimizing other parts of the code will be more fruitful.