Indexing individual array elements
This is the main type of code where Cython can really help you. Indexing an individual element (e.g. an = a[n]) can be a fairly slow operation in Python: partly because Python is not a hugely quick language, so running Python code many times within a loop is slow, and partly because, although the array is stored as a tightly-packed array of C floats, the indexing operation has to return a Python object. Indexing a Numpy array therefore requires new Python objects to be allocated.
In Cython you can declare the arrays as typed memoryviews, or as np.ndarray. (Typed memoryviews are the more modern approach and you should usually prefer them.) Doing so allows you to access the tightly-packed C array directly and retrieve the C value without creating a Python object.
The directives cython.boundscheck and cython.wraparound can give worthwhile further speed-ups to the indexing (but remember they remove useful safety features, so think before using them).
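As a minimal sketch of the two declaration styles (the sum_squares_* function names are just illustrative, and this assumes a .pyx file compiled with the Numpy headers available):

cimport cython
cimport numpy as cnp

cnp.import_array()

@cython.boundscheck(False)   # skip per-index bounds checks
@cython.wraparound(False)    # skip negative-index handling
def sum_squares_memview(double[::1] a):      # typed memoryview - preferred
    cdef double total = 0
    cdef Py_ssize_t i
    for i in range(a.shape[0]):
        total += a[i] * a[i]                 # C-level indexing, no Python objects created
    return total

def sum_squares_ndarray(cnp.ndarray[cnp.float64_t, ndim=1] a):  # older buffer syntax
    cdef double total = 0
    cdef Py_ssize_t i
    for i in range(a.shape[0]):
        total += a[i] * a[i]
    return total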
vs vectorization
A lot of the time a loop over a Numpy array can be written as a vectorized operation - something that acts on the whole array at once. It is usually a good idea to write Python+Numpy code like this. If you have multiple chained vectorized operations, it is sometimes worthwhile to write them as a single explicit Cython loop to avoid allocating intermediate arrays.
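As a rough sketch (the function names are hypothetical), the vectorized version allocates a temporary array for each intermediate step, while the Cython loop does the same work in a single pass:

import numpy as np
cimport cython

def diff_product_numpy(a, b):
    # each of (a - b) and (a + b) allocates a temporary array
    return (a - b) * (a + b)

@cython.boundscheck(False)
@cython.wraparound(False)
def diff_product_cython(double[::1] a, double[::1] b):
    cdef Py_ssize_t i
    cdef double[::1] out = np.empty(a.shape[0])
    for i in range(a.shape[0]):
        out[i] = (a[i] - b[i]) * (a[i] + b[i])   # no intermediate arrays
    return np.asarray(out)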
Alternatively, the little-known Pythran backend for Cython can convert a set of vectorized Numpy operations to optimized C++ code.
Indexing array slices
This isn't a problem in Cython, but it typically isn't something that will get you significant speed-ups on its own.
Calling Numpy functions
e.g. last = np.sin(an)
These require a Python call, so Cython typically cannot accelerate them - it has no visibility into the contents of the Numpy function.
However, here the operation is on a single value, and not on a Numpy array. In this case we can use sin from the C standard library, which will be significantly quicker than a Python function call. You'd do from libc.math cimport sin and call sin rather than np.sin.
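As a tiny sketch (the fast_sin name is just for illustration):

from libc.math cimport sin   # C sin operating on a double - no Python call overhead

def fast_sin(double x):
    return sin(x)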
Numba is an alternative Python accelerator that has better visibility into Numpy functions and can often optimize them without code changes.
Array allocations
e.g. transformed_a = np.zeros_like(a).
This is just a Numpy function call, so Cython has no ability to speed it up. If it's only an intermediate value and not returned to Python, you might consider a fixed-size C array on the stack:
cdef double transformed_a[10]  # note - you must know the size at compile-time
or you could allocate it with the C malloc function (remember to free it), or use Cython's cython.view.array (which is still a Python object, but can be a little quicker than allocating a Numpy array).
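A minimal sketch of the malloc route (the doubled_sum function and its contents are just for illustration):

from libc.stdlib cimport malloc, free

def doubled_sum(double[::1] a):
    cdef Py_ssize_t n = a.shape[0]
    cdef Py_ssize_t i
    cdef double total = 0
    # heap-allocated C array used only as intermediate scratch space
    cdef double* tmp = <double*>malloc(n * sizeof(double))
    if tmp == NULL:
        raise MemoryError()
    try:
        for i in range(n):
            tmp[i] = 2 * a[i]
        for i in range(n):
            total += tmp[i]
    finally:
        free(tmp)   # always free what you malloc
    return total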
Whole-array arithmetic
e.g. transformed_a * b, which multiplies transformed_a and b element by element.
Cython doesn't help you here - it's just a disguised function call (although Pythran+Cython may have some benefits). For large arrays this kind of operation is pretty efficient in Numpy so don't overthink it.
Note that whole-array operations aren't defined for Cython typed memoryviews, so you need to do np.asarray(memview) to get them back to Numpy arrays. This typically doesn't need a copy and is quick.
For some operations like this, you can use BLAS and LAPACK functions (which are fast, low-level implementations of array and matrix operations). Scipy provides a Cython interface to them (https://docs.scipy.org/doc/scipy/reference/linalg.cython_blas.html). They're a little more complex to use than the natural Python code.
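For example, a sketch of calling the BLAS dot product through Scipy's Cython interface (the blas_dot name is just illustrative):

from scipy.linalg.cython_blas cimport ddot

def blas_dot(double[::1] x, double[::1] y):
    cdef int n = <int>x.shape[0]
    cdef int inc = 1            # stride of 1 through both arrays
    # ddot takes pointers to its arguments, Fortran-style
    return ddot(&n, &x[0], &inc, &y[0], &inc)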
The illustrative example
Just for completeness, I'd write it something like:
import numpy as np
from libc.math cimport sin
cimport cython
@cython.boundscheck(False)
@cython.wraparound(False)
def some_func(double[::1] a, b):
    cdef double[::1] transformed_a = np.zeros_like(a)
    cdef double last = 0
    cdef double an, delta
    cdef Py_ssize_t n
    for n in range(1, a.shape[0]):
        an = a[n]
        if an > 0:
            delta = an - a[n-1]
            transformed_a[n] = delta*last
        else:
            last = sin(an)
    return np.asarray(transformed_a) * b
which is a little over 10x faster.
cython -a is helpful here - it produces an annotated HTML file that shows which lines contain a lot of interaction with Python.