I want to compare the two Java 8 stream terminal operations reduce() and collect() in terms of their parallel performance.
Let's have a look at the following Java 8 parallel stream example:
import java.math.BigInteger;
import java.util.function.BiConsumer;
import java.util.function.Function;
import java.util.function.Supplier;
import java.util.stream.Stream;
import static java.math.BigInteger.ONE;
public class StartMe {
    static Function<Long, BigInteger> fac;
    static {
        fac = x -> x==0? ONE : BigInteger.valueOf(x).multiply(fac.apply(x - 1));
    }
    static long N = 2000;
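    // Supplier for collect(): every call returns the same one-element array.
    static Supplier<BigInteger[]> one() {
        BigInteger[] result = new BigInteger[1];
        result[0] = ONE;
        return () -> result;
    }
    // Accumulator for collect(): multiplies each element into the shared array slot.
    static BiConsumer<BigInteger[], ? super BigInteger> accumulator() {
        return (BigInteger[] ba, BigInteger b) -> {
            synchronized (fac) {
                ba[0] = ba[0].multiply(b);
            }
        };
    }
    // Combiner for collect(): empty, because all partial results already live in the shared array.
    static BiConsumer<BigInteger[], BigInteger[]> combiner() {
        return (BigInteger[] b1, BigInteger[] b2) -> {};
    }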
    public static void main(String[] args) throws Exception {
        long t0 = System.currentTimeMillis();
        BigInteger result1 = Stream.iterate(ONE, x -> x.add(ONE)).parallel().limit(N).reduce(ONE, BigInteger::multiply);
        long t1 = System.currentTimeMillis();
        BigInteger[] result2 = Stream.iterate(ONE, x -> x.add(ONE)).parallel().limit(N).collect(one(), accumulator(), combiner());
        long t2 = System.currentTimeMillis();
        BigInteger result3 = fac.apply(N);
        long t3 = System.currentTimeMillis();
        System.out.println("reduce():  deltaT = " + (t1-t0) + "ms, result 1 = " + result1);
        System.out.println("collect(): deltaT = " + (t2-t1) + "ms, result 2 = " + result2[0]);
        System.out.println("recursive: deltaT = " + (t3-t2) + "ms, result 3 = " + result3);
    }
}
It computes n! using some - admittedly weird ;-) - algorithms.
The performance results are surprising, however:
 reduce():  deltaT = 44ms, result 1 = 3316275...
 collect(): deltaT = 22ms, result 2 = 3316275...
 recursive: deltaT = 11ms, result 3 = 3316275...
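These numbers come from a single cold run of each variant, so whichever operation runs first also pays for one-time costs such as loading the stream and lambda classes and starting worker threads in the common fork-join pool. Below is a minimal sketch of a warmed-up measurement. The class name WarmTiming and the helper bestNanos are hypothetical and not part of the program above, and the warm-up and run counts are arbitrary choices, so treat it as a rough cross-check and not as a rigorous benchmark.

```java
import java.math.BigInteger;
import java.util.function.Supplier;
import java.util.stream.Stream;

public class WarmTiming {

    // Same parallel reduce() pipeline as in the program above.
    static BigInteger viaReduce(long n) {
        return Stream.iterate(BigInteger.ONE, x -> x.add(BigInteger.ONE))
                .parallel()
                .limit(n)
                .reduce(BigInteger.ONE, BigInteger::multiply);
    }

    // Hypothetical helper: runs the task untimed `warmups` times, then
    // returns the fastest of `runs` timed executions in nanoseconds.
    static long bestNanos(Supplier<?> task, int warmups, int runs) {
        for (int i = 0; i < warmups; i++) {
            task.get();
        }
        long best = Long.MAX_VALUE;
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.get();
            best = Math.min(best, System.nanoTime() - start);
        }
        return best;
    }

    public static void main(String[] args) {
        long ns = bestNanos(() -> viaReduce(2000), 20, 20);
        System.out.println("reduce(): best of 20 warmed-up runs = " + ns / 1_000_000.0 + " ms");
    }
}
```

The same helper can wrap the collect() and recursive variants, so that all three are timed under the same conditions and the order of the calls matters less.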
Some remarks:
- I had to synchronize the accumulator() because all threads write to the same array in parallel.
- I expected reduce() and collect() to perform about the same, but reduce() is ~2 times slower than collect(), even though collect() has to be synchronized!
- The fastest algorithm is the sequential, recursive one (which might show the huge overhead of parallel stream management).
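For comparison, here is a minimal sketch of how the collect() variant could look without the synchronized block, assuming the supplier creates a fresh container on every call and the combiner actually merges two partial results. The class FactorialCollect and the holder class Product are hypothetical names introduced only for this illustration. The stream source is kept the same as above so that only the collect() arguments differ.

```java
import java.math.BigInteger;
import java.util.stream.Stream;

public class FactorialCollect {

    // Hypothetical mutable holder: each call to the supplier creates a
    // separate instance, so no two threads share one.
    static final class Product {
        BigInteger value = BigInteger.ONE;

        void accept(BigInteger b) {
            value = value.multiply(b);
        }

        void combine(Product other) {
            value = value.multiply(other.value);
        }
    }

    // collect() with a fresh container per supplier call and a real combiner.
    static BigInteger factorialCollect(long n) {
        return Stream.iterate(BigInteger.ONE, x -> x.add(BigInteger.ONE))
                .parallel()
                .limit(n)
                .collect(Product::new, Product::accept, Product::combine)
                .value;
    }

    // reduce() exactly as in the program above, for cross-checking the result.
    static BigInteger factorialReduce(long n) {
        return Stream.iterate(BigInteger.ONE, x -> x.add(BigInteger.ONE))
                .parallel()
                .limit(n)
                .reduce(BigInteger.ONE, BigInteger::multiply);
    }

    public static void main(String[] args) {
        System.out.println(factorialCollect(20)); // 20! = 2432902008176640000
        System.out.println(factorialCollect(2000).equals(factorialReduce(2000)));
    }
}
```

With this shape, collect() and reduce() do structurally the same work (partial products per chunk, then a merge), which should make a timing comparison between the two more meaningful.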
I didn't expect reduce() to perform worse than collect(). Why is this so?
 
     
    