Running NumPy version 1.19.2, I get better performance computing the mean over each axis of an array in turn than by calculating the mean over an already flattened array.
import numpy as np

shape = (10000, 32, 32, 3)
mat = np.random.random(shape)
# Call this Method A.
%%timeit
mat_means = mat.mean(axis=0).mean(axis=0).mean(axis=0)
14.6 ms ± 167 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
mat_reshaped = mat.reshape(-1,3)
# Call this Method B.
%%timeit
mat_means = mat_reshaped.mean(axis=0)
135 ms ± 227 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
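For reference, the two methods agree numerically, so the difference is purely one of performance (a quick check, assuming the same `mat` as above):

```python
import numpy as np

mat = np.random.random((10000, 32, 32, 3))

# Method A: reduce one axis at a time.
means_a = mat.mean(axis=0).mean(axis=0).mean(axis=0)

# Method B: flatten all but the last axis, then reduce once.
means_b = mat.reshape(-1, 3).mean(axis=0)

print(np.allclose(means_a, means_b))  # True, up to floating-point rounding
```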
This is odd, since taking the mean repeatedly has the same bad access pattern as the single reduction over the reshaped array (perhaps an even worse one), and we also perform more operations this way. As a sanity check, I converted the array to Fortran order:
mat_reshaped_fortran = mat.reshape(-1,3, order='F')
%%timeit
mat_means = mat_reshaped_fortran.mean(axis=0)
12.2 ms ± 85.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
This yields the performance improvement I expected.
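The layout difference between the two reshaped arrays can be made explicit by inspecting strides and contiguity flags (a quick illustration, assuming the same array as above):

```python
import numpy as np

mat = np.random.random((10000, 32, 32, 3))

# C-order reshape is a view: walking down a column of the (N, 3) view
# jumps 3 doubles (24 bytes) through memory between consecutive elements.
c_view = mat.reshape(-1, 3)
print(c_view.strides)             # (24, 8)
print(c_view.base is mat)         # it shares mat's memory

# Fortran-order reshape of a C-contiguous array must copy, and the copy
# is F-contiguous: each column is a contiguous run of 8-byte doubles.
f_copy = mat.reshape(-1, 3, order='F')
print(f_copy.flags['F_CONTIGUOUS'])
print(f_copy.strides[0])          # 8
```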
For Method A, prun gives:
36 function calls in 0.019 seconds
   Ordered by: internal time
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        3    0.018    0.006    0.018    0.006 {method 'reduce' of 'numpy.ufunc' objects}
        1    0.000    0.000    0.019    0.019 {built-in method builtins.exec}
        3    0.000    0.000    0.019    0.006 _methods.py:143(_mean)
        3    0.000    0.000    0.000    0.000 _methods.py:59(_count_reduce_items)
        1    0.000    0.000    0.019    0.019 <string>:1(<module>)
        3    0.000    0.000    0.019    0.006 {method 'mean' of 'numpy.ndarray' objects}
        3    0.000    0.000    0.000    0.000 _asarray.py:86(asanyarray)
        3    0.000    0.000    0.000    0.000 {built-in method numpy.array}
        3    0.000    0.000    0.000    0.000 {built-in method numpy.core._multiarray_umath.normalize_axis_index}
        6    0.000    0.000    0.000    0.000 {built-in method builtins.isinstance}
        6    0.000    0.000    0.000    0.000 {built-in method builtins.issubclass}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
While for Method B:
    14 function calls in 0.166 seconds
   Ordered by: internal time
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.166    0.166    0.166    0.166 {method 'reduce' of 'numpy.ufunc' objects}
        1    0.000    0.000    0.166    0.166 {built-in method builtins.exec}
        1    0.000    0.000    0.166    0.166 _methods.py:143(_mean)
        1    0.000    0.000    0.000    0.000 _methods.py:59(_count_reduce_items)
        1    0.000    0.000    0.166    0.166 <string>:1(<module>)
        1    0.000    0.000    0.166    0.166 {method 'mean' of 'numpy.ndarray' objects}
        1    0.000    0.000    0.000    0.000 _asarray.py:86(asanyarray)
        1    0.000    0.000    0.000    0.000 {built-in method numpy.array}
        1    0.000    0.000    0.000    0.000 {built-in method numpy.core._multiarray_umath.normalize_axis_index}
        2    0.000    0.000    0.000    0.000 {built-in method builtins.isinstance}
        2    0.000    0.000    0.000    0.000 {built-in method builtins.issubclass}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
Note: np.setbufsize(1e7) doesn't seem to have any effect.
What is the reason for this performance difference?