The following snippet comes from the book Python Cookbook. There are three files:
sample.pyx
cimport cython
@cython.boundscheck(False)
@cython.wraparound(False)
cpdef clip(double[:] a, double min, double max, double[:] out):
    if min > max:
        raise ValueError('min must be <= max')
    if a.shape[0] != out.shape[0]:
        raise ValueError('input and output arrays must be the same size!')
    for i in range(a.shape[0]):
        if a[i] < min:
            out[i] = min
        elif a[i] > max:
            out[i] = max
        else:
            out[i] = a[i]
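For reference, the element-wise logic of the Cython loop above can be expressed in pure NumPy. This is only a sketch for checking correctness (`clip_ref` is a hypothetical helper name, not from the book):

```python
import numpy as np

def clip_ref(a, lo, hi, out):
    """Pure-NumPy reference for the Cython clip(): bound each element
    of a into [lo, hi], writing the result into out."""
    if lo > hi:
        raise ValueError('min must be <= max')
    if a.shape[0] != out.shape[0]:
        raise ValueError('input and output arrays must be the same size!')
    # clamp from below, then from above, writing the result into out
    np.minimum(np.maximum(a, lo), hi, out=out)
    return out
```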
setup.py
from distutils.core import setup
from Cython.Build import cythonize
setup(ext_modules=cythonize("sample.pyx"))
and main.py as the test file:
import time
import numpy as np
import sample  # the compiled Cython module

b = np.random.uniform(-10, 10, size=1000000)
a = np.zeros_like(b)
since = time.time()
np.clip(b, -5, 5, a)
print(time.time() - since)
since = time.time()
sample.clip(b, -5, 5, a)
print(time.time() - since)
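As an aside, a single pair of `time.time()` calls is quite noisy for sub-millisecond measurements. A steadier sketch using the standard `timeit` module might look like this (the `sample.clip` call is commented out since it requires the compiled module):

```python
import timeit
import numpy as np

b = np.random.uniform(-10, 10, size=1_000_000)
a = np.zeros_like(b)

# best-of-several repeats smooths out warm-up effects and scheduler noise
t_numpy = min(timeit.repeat(lambda: np.clip(b, -5, 5, a),
                            number=10, repeat=5))
print('np.clip per call:', t_numpy / 10)

# the Cython version would be timed the same way:
# import sample
# t_cython = min(timeit.repeat(lambda: sample.clip(b, -5, 5, a),
#                              number=10, repeat=5))
```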
Surprisingly, NumPy runs about 2x faster than the Cython code, while the book claims the opposite. The timings on my machine are:
0.0035216808319091797  (np.clip)
0.00608062744140625    (sample.clip)
Why is that?
Thank you in advance.