Why is Cython slower than Python+numpy here?

Question

I want to implement some fast convex analysis operations - proximal operators and the like. I'm new to Cython and thought that this would be the right tool for the job. I have implementations both in pure Python and in Cython (mwe_py.py and mwe_c.pyx below). However, when I compare them, the Python + Numpy version is significantly faster than the Cython version. Why is this? I have tried using memoryviews, which are supposed to allow for faster indexing/operations; however, the performance difference is very pronounced! Any advice on how to fix mwe_c.pyx below to approach "optimal" Cython code would be greatly appreciated.

import pyximport; pyximport.install(language_level=3)

import mwe_c
import mwe_py
import numpy as np
from time import time

n = 100000
nreps = 10000
x = np.random.randn(n)
z = np.random.randn(n)
tau = 1.0

t0 = time()
for _ in range(nreps):
    out = mwe_c.prox_translation(mwe_c.prox_two_norm, x, z, tau)
t1 = time()
print(t1 - t0)


t0 = time()
for _ in range(nreps):
    out = mwe_py.prox_translation(mwe_py.prox_two_norm, x, z, tau)
t1 = time()
print(t1 - t0)

which gives the outputs, respectively:

10.76103401184082  # (seconds)
5.988733291625977  # (seconds)

Below is mwe_py.py:

import numpy.linalg as la

def proj_two_norm(x):
    """projection onto l2 unit ball"""
    X = la.norm(x)
    if X <= 1:
        return x
    return x / X

def prox_two_norm(x, tau):
    """proximal mapping of l2 norm with parameter tau"""
    return x - proj_two_norm(x / tau)

def prox_translation(prox_func, x, z, tau=None):
  """Compute prox(f(. - z))(x) where prox_func(x, tau) is prox(tau * f)(x)."""
  if tau is None:
      tau = 1.0
  return z + prox_func(x - z, tau)

And here, at last, is mwe_c.pyx:

import numpy as np
cimport numpy as np


cdef double [::1] aasubtract(double [::1] x, double [::1] z):
    cdef unsigned int i, m = len(x), n = len(z);
    assert m == n, f"vectors must have the same length"
    cdef double [::1] out = np.copy(x);
    for i in range(n):
        out[i] -= z[i]
    return out


cdef double [::1] vsdivide(double [::1] x, double tau):
    """Divide an array by a scalar element-wise."""
    cdef:
        unsigned int i, n = len(x);
        double [::1] out = np.copy(x);
    for i in range(n):
        out[i] /= tau
    return out


cdef double two_norm(double [::1] x):
    cdef:
        double out = 0.0;
        unsigned int i, n=len(x);
    for i in range(n):
        out = out + x[i]**2
    out = out **.5
    return out


cdef double [::1] proj_two_norm(double [::1] x):
    """project x onto the unit two ball."""
    cdef double x_norm = two_norm(x);
    cdef unsigned int i, n = len(x);
    cdef double [::1] p = np.copy(x);
    if x_norm <= 1:
        return p
    for i in range(n):
        p[i] = p[i] / x_norm
    return p


cpdef double [::1] prox_two_norm(double [::1] x, double tau):
    """double [::1] prox_two_norm(double [::1] x, double tau)"""
    cdef unsigned int i, n = len(x);
    cdef double [::1] out = x.copy(), Px = x.copy();
    Px = proj_two_norm(vsdivide(Px, tau));
    for i in range(n):
        out[i] = out[i] - Px[i]
    return out


cpdef prox_translation(
    prox_func,
    double [::1] x,
    double [::1] z,
    double tau=1.0
):
    cdef:
        unsigned int i, n = len(x);
        double [::1] out = prox_func(aasubtract(x, z), tau);
    for i in range(n):
        out[i] += z[i];
    return out

It might be worth using [Cython compiler directives](https://cython.readthedocs.io/en/latest/src/userguide/source_files_and_compilation.html#compiler-directives) like `wraparound`, `boundscheck` and `cdivision`. This should increase the performance of your Cython code. However, I don't expect it to be faster than your numpy code for reasons Jérôme gave in his nice answer. — joni, Aug 15 '22 at 10:16

Jérôme Richard · Accepted Answer · 2022-08-14T19:24:13.070

The main issue is that you compare an optimized Numpy code with a less-optimized Cython code. Indeed, Numpy makes use of SIMD instructions (like SSE and AVX/AVX2 on x86-64 processors) that are able to compute many items in a row. Cython use the default -O2 optimization level by default which do not enable any auto-vectorization strategy resulting in a slower scalar code (unless you use a very recent version of GCC). You can use -O3 to tell to most compilers (eg. old GCC and Clang) to enable auto-vectorization. Note that this is not sufficient to produce a very fast code. Indeed, compilers only use legacy SIMD instruction on x86-64 processors for sake of compatibility. -mavx and -mavx2 enable the AVX/AVX-2 instruction set so to produce a faster code assuming it is supported on your machine (otherwise it will simply crash). -mfma might also help. -march=native can also be used so to select the best instruction set available on the target platform. Note that Numpy does this check (partially) at runtime (thanks to GCC-specific C features).

The second main issue is that out = out + x[i]**2 results in a big loop-carried dependency chain that compilers cannot optimize without breaking the IEEE-754 standard. Indeed, there is a very long chain of additions to perform and the processor cannot execute this faster than executing each addition instruction serially with the current code. The thing is adding two floating-point number has a quite big latency (typically 3 to 4 cycles on quite-modern x86-64 processors). This means the processor cannot pipeline the instruction. In fact, modern processor can often execute two addition in parallel (per core) but the current loop prevent that. In the end, this loop is completely latency-bound. You can fix this problem by unrolling the loop manually.

Using -ffast-math can help compilers to do such optimizations but at the expense of breaking the IEEE-754 standard. If you use this option, then you need not to use special values like NaN numbers nor some operations. For more information, please read What does gcc's ffast-math actually do? .

Moreover, note that array copies are expensive and I am not sure all copies are needed. You can create a new empty array and fill it rather than operating on a copy of an array. This will be faster, especially for big arrays.

Finally, divisions are slow. Please consider a multiplication by the inverse. Compilers cannot do this optimization because of the IEEE-754 standard but you can easily do it. That being said, you need to make sure this is fine in your case as it can slightly change the result. Using -ffast-math should also fix this problem automatically.

Note that Numpy many developers know how compilers and processors works so they already do manual optimizations so to generate a fast code (like I did several times). You can hardly beat Numpy when dealing with large arrays except if you merge loops or you use multiple-threads. Indeed, the RAM is quite slow nowadays compared to computing units and Numpy create many temporary arrays. Cython can be used to avoid creating most of these temporary arrays.

Thank you for the detailed and clear answer. Since it seems like numpy must usually be the right choice for such applications, I'm left with the follow-up question: "When should someone [who is a non-expert with C compilers] use Cython?" — bashfuloctopus, Aug 15 '22 at 15:36
This is a complex question. A part of the answer is provided at the end of the post. Also note that for algorithm operating on scalar and small arrays, Cython should be significantly faster too. In fact, the dependency chain issue is a pretty unusual case (it typically arise for floating-point reduction loops). Using fastmath+O3 is sufficient to make Cython faster in most cases (assuming you can use them and the algorithm is efficient). In fact, with new version of GCC, O3 is not even really needed to make program fast in that case. — Jérôme Richard, Aug 15 '22 at 16:08
For the IEEE-754 issue, this is a tricky problem. There is fundamental treade-off: performance VS correctness+reproducibility+standard. Correctness matters more than performance (so `-ffast-math` it is disabled by default). The thing is people have pay attention to this since it impact the numerical stability of many high-performance algorithms. Providing high-level functions (like Numpy do) is a way to hide the problem, but it is not really fixed. — Jérôme Richard, Aug 15 '22 at 16:18
Note that it can also matter for some Numpy users as basic operation like `np.sum` are not perfect in term of accuracy which can be a source of numerical instability (eg. Numpy use a pair-wise summation algorithm with a `O(log n)` error while there are algorithm like Kahan summation providing a `O(1)` error). Thus, I would ague that if someone only has a basic knowledge in computer-science (CS), Numpy is good start. Cython/Numba is good to speed up a specific part of a slow Numpy code and you have an advanced knowledge in CS. Experts can optimize Cython code further if needed. — Jérôme Richard, Aug 15 '22 at 16:28

Why is Cython slower than Python+numpy here?

1 Answers1