The performance comparison of the code below on Java 8 is counter-intuitive.
import java.util.Arrays;
class Main {
    interface Dgemv {
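        // Computes y = A * x, where the n-by-n matrix A is stored column-major in a
        // (i.e. a[i + j*n] holds A(i,j)).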
        void dgemv(int n, double[] a, double[] x, double[] y);
    }
    static final class Dgemv1 implements Dgemv {
        public void dgemv(int n, double[] a, double[] x, double[] y) {
            Arrays.fill(y, 0.0);
            for (int j = 0; j < n; ++j)
                dgemvImpl(x[j], j * n, n, a, y);
        }
        private void dgemvImpl(final double xj, final int aoff,
                final int n, double[] a, double[] y) {
            for (int i = 0; i < n; ++i)
                y[i] += xj * a[i + aoff];
        }
    }
    static final class Dgemv2 implements Dgemv {
        public void dgemv(int n, double[] a, double[] x, double[] y) {
            Arrays.fill(y, 0.0);
            for (int j = 0; j < n; ++j)
                new DgemvImpl(x[j], j * n).dgemvImpl(n, a, y);
        }
        private static final class DgemvImpl {
            private final double xj;
            private final int aoff;
            DgemvImpl(double xj, int aoff) {
                this.xj = xj;
                this.aoff = aoff;
            }
            void dgemvImpl(final int n, double[] a, double[] y) {
                for (int i = 0; i < n; ++i)
                    y[i] += xj * a[i + aoff];
            }
        }
    }
    static long runDgemv(long niter, int n, Dgemv op) {
        double[] a = new double[n * n];
        double[] x = new double[n];
        double[] y = new double[n];
        long start = System.currentTimeMillis();
        for (long i = 0; i < niter; ++i) {
            op.dgemv(n, a, x, y);
        }
        return System.currentTimeMillis() - start;
    }
    static void testDgemv(long niter, int n, int mode) {
        Dgemv op = null;
        switch (mode) {
        case 1: op = new Dgemv1(); break;
        case 2: op = new Dgemv2(); break;
        }
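        // warm-up run: give the JIT a chance to compile before the timed run below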
        runDgemv(niter, n, op);
        double sec = runDgemv(niter, n, op) * 1e-3;
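        // each dgemv call performs n*n multiply-adds, i.e. 2*n*n floating-point operations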
        double gflps = (2.0 * n * n * niter) / sec * 1e-9;
        System.out.format("mode=%d,N=%d,%f sec,%f GFLPS\n", mode, n, sec, gflps);
    }
    public static void main(String[] args) {
        int n = Integer.parseInt(args[0]);
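        // choose niter so that the total flop count per timed run is roughly 2^32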
        long niter = (1L << 32) / (2L * n * n);
        testDgemv(niter, n, 1);
        testDgemv(niter, n, 2);
    }
}
The result on Java 8 (1.8.0_60) and a Core i5-4570 (3.2 GHz) is:
$ java -server Main 500
mode=1,N=500,1.239000 sec,3.466102 GFLPS
mode=2,N=500,1.100000 sec,3.904091 GFLPS
And the result of the same calculation on Java 7 (1.7.0_80) is:
mode=1,N=500,1.291000 sec,3.326491 GFLPS
mode=2,N=500,1.491000 sec,2.880282 GFLPS
It seems as if HotSpot optimizes the functor-based Dgemv2 more eagerly than the plain private-method call in Dgemv1, despite the additional complexity.
Could anyone explain why Dgemv2 runs faster?
Edit:
More precise benchmark statistics, obtained with openjdk/jmh. (Thank you, Kayaman, for your comment.)
N=500 / 20 × 1-sec warm-up iterations / 20 × 1-sec measurement iterations (10 forks)
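The exact JMH harness is not shown in this post; the sketch below is roughly what it looks like. The class and method names are taken from the result tables and the annotation values mirror the settings above, but the rest (the @State fields, the @Setup method, and the assumption that the Dgemv implementations are visible to the benchmark class, e.g. by living in the same package) is filled in here as an illustration:
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@State(Scope.Thread)
@BenchmarkMode(Mode.Throughput)
@Warmup(iterations = 20, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 20, time = 1, timeUnit = TimeUnit.SECONDS)
@Fork(10)
public class MyBenchmark {
    private static final int N = 500;
    private final Main.Dgemv op1 = new Main.Dgemv1();
    private final Main.Dgemv op2 = new Main.Dgemv2();
    private double[] a, x, y;

    @Setup
    public void setup() {
        // same array shapes as in runDgemv above
        a = new double[N * N];
        x = new double[N];
        y = new double[N];
    }

    @Benchmark
    public double[] runDgemv1() {
        op1.dgemv(N, a, x, y);
        return y;   // return the output so JMH consumes it and the work is not optimized away
    }

    @Benchmark
    public double[] runDgemv2() {
        op2.dgemv(N, a, x, y);
        return y;
    }
}
Reusing the same arrays across invocations is safe here because dgemv zeroes y with Arrays.fill on every call, so no state accumulates between measurements.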
Java 8 (1.8.0_60)
Benchmark               Mode  Cnt     Score   Error  Units
MyBenchmark.runDgemv1  thrpt  200  6965.459 ± 2.186  ops/s
MyBenchmark.runDgemv2  thrpt  200  7329.138 ± 1.598  ops/s
Java 7 (1.7.0_80)
Benchmark               Mode  Cnt     Score   Error  Units
MyBenchmark.runDgemv1  thrpt  200  7344.570 ± 1.994  ops/s
MyBenchmark.runDgemv2  thrpt  200  7358.988 ± 2.189  ops/s
From these statistics, it seems that Java 8's HotSpot does not optimize the plain-method version (Dgemv1) as well. One more thing I noticed is that performance is about 10% better during some of the warm-up sections. Picking an extreme case:
N=500 / 8 × 1-sec warm-up iterations / 8 × 1-sec measurement iterations (10 forks)
Java 8 (1.8.0_60)
Benchmark               Mode  Cnt     Score    Error  Units
MyBenchmark.runDgemv1  thrpt   80  6952.315 ± 11.483  ops/s
MyBenchmark.runDgemv2  thrpt   80  7719.843 ± 66.773  ops/s
The Dgemv2 iterations between 9 sec and 15 sec consistently outperform the long-run average by about 5%. It seems that HotSpot does not always produce faster code as the optimization process goes on.
My current guess is that the functor object in Dgemv2 actually disturbs the HotSpot optimization procedure, resulting in execution code that is faster than the 'fully optimized' code.
Still, I am not at all clear on why this happens. Any answers and comments are welcome.
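In case it is useful for the analysis: HotSpot's compilation and inlining decisions can be printed with the standard diagnostic flags below. This is only a suggested invocation; I have not attached that log here.
$ java -server -XX:+PrintCompilation -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining Main 500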