#cython: boundscheck=False, wraparound=False, nonecheck=False, cdivision=True, language_level=3
cpdef int query(double[::1] q, double[:,::1] data) nogil:
    cdef:
        int n = data.shape[0]
        int dim = data.shape[1]
        int i, j
        int best_i = -1
        double best_ip = -1
        double ip
    for i in range(n):
        ip = 0
        for j in range(dim):
            ip += q[j] * data[i, j]
        if ip > best_ip:
            best_i = i
            best_ip = ip
    return best_i
After compiling, I time the code from Python:
import numpy as np
import ip
n, dim = 10**6, 10**2
X = np.random.randn(n, dim)
q = np.random.randn(dim)
%timeit ip.query(q, X)
This takes roughly 100ms. Meanwhile, the equivalent NumPy code:
%timeit np.argmax(q @ X.T)
takes only around 50ms.
This is odd, since the NumPy code seemingly has to allocate the big array q @ X.T before taking the argmax, so I wonder whether there are some optimizations I am missing.
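Spelling the NumPy line out, it effectively makes two passes and materializes the full score vector as a temporary, which for n = 10**6 float64 values is about 8 MB (the variable names below are just for illustration):
scores = q @ X.T          # vector-matrix product, typically a single BLAS call; allocates a (10**6,) float64 array (~8 MB)
best = np.argmax(scores)  # second pass over the temporary to find the index of the maximum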
I have added extra_compile_args=["-O3", "-march=native"] to my setup.py, and I have also tried changing the function definition to
cpdef int query(np.ndarray[double] q, np.ndarray[double, ndim=2] data):
but it made virtually no difference in performance.
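For reference, my setup.py looks roughly like the following sketch (the source file name ip.pyx is an assumption, chosen to match the import above):
from setuptools import setup, Extension
from Cython.Build import cythonize

extensions = [
    Extension(
        "ip",                # module name matching the import ip in the timing code
        ["ip.pyx"],
        extra_compile_args=["-O3", "-march=native"],
    )
]

# built in place with: python setup.py build_ext --inplace
setup(ext_modules=cythonize(extensions))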