Would you please give me reference why there is significant difference in the execution time between the following 2 factorial implementations using the Java Stream API:
- Serial implementation
- Parallel implementation (using Stream.parallel()) executed in a custom fork join pool with parallelism set to 1
My expectations were to have near execution times, however the parallel version has a speedup by a factor of 2. I did not run any specialized benchmarks however the execution time should not differ so much even in a cold start jvm. Bellow I attach the source code of the two implementations:
public class FastFactorialSupplier implements FactorialSupplier {
  private final ExecutorService executorService;
  public FastFactorialSupplier(ExecutorService executorService) {
      this.executorService = executorService;
  }
  @Override
  public BigInteger get(long k) {
      try {
          return executorService
                  .submit(
                          () -> LongStream.range(2, k + 1)
                                  .parallel()
                                  .mapToObj(BigInteger::valueOf)
                                  .reduce(BigInteger.ONE, (current, factSoFar) -> factSoFar.multiply(current))
                  )
                  .get();
      } catch (InterruptedException | ExecutionException e) {
          e.printStackTrace();
      }
      return BigInteger.ZERO;
  }
}
public class MathUtils {
  public static BigInteger factorial(long k) {
      return LongStream.range(2, k + 1)
              .mapToObj(BigInteger::valueOf)
              .reduce(BigInteger.ONE, (current, factSoFar) -> factSoFar.multiply(current));
  }
}
Here are the test cases with attached sample execution time as a comments based on what the intellij junit runner showed.
    @Test
    public void testWithoutParallel() {
        //2s 403
        runTest(new DummyFactorialSupplier()); // uses MathUtils.factorial
    }
    @Test
    public void testParallelismWorkStealing1() {
        //1s 43
        runTest(new FastFactorialSupplier(Executors.newWorkStealingPool(1)));
    }
    @Test
    public void testParallelismForkJoin1() {
        // 711ms
        runTest(new FastFactorialSupplier(new ForkJoinPool(1)));
    }
    @Test
    public void testExecutorForkJoin() {
        //85ms
        runTest(new FastFactorialSupplier(new ForkJoinPool()));
    }
    private void runTest(FactorialSupplier factorialSupplier) {
        BigInteger result = factorialSupplier.get(100000);
        assertNotNull(result);
//        assertEquals(456574, result.toString().length());
    }
The tests were run using java 11 since there was a issue in java 8 with custom fork join pools - https://bugs.openjdk.java.net/browse/JDK-8190974
Can there be an optimisation related with the pseudo parallel processing and how the execution is scheduled whereas there is no such given the execution is purely sequential?
Edit:
I also run microbenchmark using jmh
Parallel:
public class FastFactorialSupplierP1Test {
    @Benchmark
    @BenchmarkMode({Mode.AverageTime, Mode.SampleTime, Mode.SingleShotTime, Mode.Throughput, Mode.All})
    @Fork(value = 1, warmups = 1)
    public void measure() {
        runTest(new FastFactorialSupplier(new ForkJoinPool(1)));
    }
    private void runTest(FactorialSupplier factorialSupplier) {
        BigInteger result = factorialSupplier.get(100000);
        assertNotNull(result);
    }
    public static void main(String[] args) throws Exception {
        org.openjdk.jmh.Main.main(args);
    }
}
Serial:
public class SerialFactorialSupplierTest {
    @Benchmark
    @BenchmarkMode({Mode.AverageTime, Mode.SampleTime, Mode.SingleShotTime, Mode.Throughput, Mode.All})
    @Fork(value = 1, warmups = 1)
    public void measure() {
        runTest(new DummyFactorialSupplier());
    }
    private void runTest(FactorialSupplier factorialSupplier) {
        BigInteger result = factorialSupplier.get(100000);
        assertNotNull(result);
    }
    public static void main(String[] args) throws Exception {
        org.openjdk.jmh.Main.main(args);
    }
}
public class IterativeFactorialTest {
    @Benchmark
    @BenchmarkMode({Mode.AverageTime, Mode.SampleTime, Mode.SingleShotTime, Mode.Throughput, Mode.All})
    @Fork(value = 1, warmups = 1)
    public void measure() {
        runTest(new IterativeFact());
    }
    private void runTest(FactorialSupplier factorialSupplier) {
        BigInteger result = factorialSupplier.get(100000);
        assertNotNull(result);
    }
    public static void main(String[] args) throws Exception {
        org.openjdk.jmh.Main.main(args);
    }
    class IterativeFact implements FactorialSupplier {
        @Override
        public BigInteger get(long k) {
            BigInteger result = BigInteger.ONE;
            while (k-- != 0) {
                result = result.multiply(BigInteger.valueOf(k));
            }
            return result;
        }
    }
}
Results:
FastFactorialSupplierP1Test.measure                    avgt    5  0.437 ± 0.006   s/op
IterativeFactorialTest.measure                         avgt    5  2.643 ± 0.383   s/op
SerialFactorialSupplierTest.measure                    avgt    5  2.226 ± 0.044   s/op
 
    