Poor performance of C++ function in Cython

Question

I have this C++ function, which I call from Python with the code below. It reaches only half the performance of the pure C++ version. Is there a way to bring the two to the same level? I compile both with the -Ofast -march=native flags. I do not understand where the 50% is lost, because most of the time should be spent in the C++ kernel. Is Cython making a memory copy that I can avoid?

namespace diff
{
    void diff_cpp(double* __restrict__ at, const double* __restrict__ a, const double visc,
                  const double dxidxi, const double dyidyi, const double dzidzi,
                  const int itot, const int jtot, const int ktot)
    {
        const int ii = 1;
        const int jj = itot;
        const int kk = itot*jtot;

        for (int k=1; k<ktot-1; k++)
            for (int j=1; j<jtot-1; j++)
                for (int i=1; i<itot-1; i++)
                {
                    const int ijk = i + j*jj + k*kk;
                    at[ijk] += visc * (
                            + ( (a[ijk+ii] - a[ijk   ]) 
                              - (a[ijk   ] - a[ijk-ii]) ) * dxidxi 
                            + ( (a[ijk+jj] - a[ijk   ]) 
                              - (a[ijk   ] - a[ijk-jj]) ) * dyidyi
                            + ( (a[ijk+kk] - a[ijk   ]) 
                              - (a[ijk   ] - a[ijk-kk]) ) * dzidzi
                            );
                }
    }
}

I have this .pyx file:

# import both numpy and the Cython declarations for numpy
import cython
import numpy as np
cimport numpy as np

# declare the interface to the C code
cdef extern from "diff_cpp.cpp" namespace "diff":
    void diff_cpp(double* at, double* a, double visc, double dxidxi, double dyidyi, double dzidzi, int itot, int jtot, int ktot)

@cython.boundscheck(False)
@cython.wraparound(False)
def diff(np.ndarray[double, ndim=3, mode="c"] at not None,
         np.ndarray[double, ndim=3, mode="c"] a not None,
         double visc, double dxidxi, double dyidyi, double dzidzi):
    cdef int ktot, jtot, itot
    ktot, jtot, itot = at.shape[0], at.shape[1], at.shape[2]
    diff_cpp(&at[0,0,0], &a[0,0,0], visc, dxidxi, dyidyi, dzidzi, itot, jtot, ktot)
    return None

I call this function in Python:

import numpy as np
import diff
import time

nloop = 20
itot = 256
jtot = 256
ktot = 256
ncells = itot*jtot*ktot

at = np.zeros((ktot, jtot, itot))

index = np.arange(ncells)
a = (index/(index+1))**2
a.shape = (ktot, jtot, itot)

# Check results
diff.diff(at, a, 0.1, 0.1, 0.1, 0.1)
print("at={0}".format(at.flatten()[itot*jtot+itot+itot//2]))

# Time the loop
start = time.perf_counter()
for i in range(nloop):
    diff.diff(at, a, 0.1, 0.1, 0.1, 0.1)
end = time.perf_counter()

print("Time/iter: {0} s ({1} iters)".format((end-start)/nloop, nloop))

This is the setup.py:

from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext

import numpy

setup(
    cmdclass = {'build_ext': build_ext},
    ext_modules = [Extension("diff",
                             sources=["diff.pyx"],
                             language="c++",
                             extra_compile_args=["-Ofast", "-march=native"],
                             include_dirs=[numpy.get_include()])],
)

And here the C++ reference file that reaches twice the performance:

#include <iostream>
#include <iomanip>
#include <cstdlib>
#include <stdlib.h>
#include <cstdio>
#include <ctime>
#include "math.h"

void init(double* const __restrict__ a, double* const __restrict__ at, const int ncells)
{
    for (int i=0; i<ncells; ++i)
    {
        a[i]  = pow(i,2)/pow(i+1,2);
        at[i] = 0.;
    }
}

void diff(double* const __restrict__ at, const double* const __restrict__ a, const double visc, 
          const double dxidxi, const double dyidyi, const double dzidzi, 
          const int itot, const int jtot, const int ktot)
{
    const int ii = 1;
    const int jj = itot;
    const int kk = itot*jtot;

    for (int k=1; k<ktot-1; k++)
        for (int j=1; j<jtot-1; j++)
            for (int i=1; i<itot-1; i++)
            {
                const int ijk = i + j*jj + k*kk;
                at[ijk] += visc * (
                        + ( (a[ijk+ii] - a[ijk   ]) 
                          - (a[ijk   ] - a[ijk-ii]) ) * dxidxi 
                        + ( (a[ijk+jj] - a[ijk   ]) 
                          - (a[ijk   ] - a[ijk-jj]) ) * dyidyi
                        + ( (a[ijk+kk] - a[ijk   ]) 
                          - (a[ijk   ] - a[ijk-kk]) ) * dzidzi
                        );
            }
}

int main()
{
    const int nloop = 20;
    const int itot = 256;
    const int jtot = 256;
    const int ktot = 256;
    const int ncells = itot*jtot*ktot;

    double *a  = new double[ncells];
    double *at = new double[ncells];

    init(a, at, ncells);

    // Check results
    diff(at, a, 0.1, 0.1, 0.1, 0.1, itot, jtot, ktot); 
    printf("at=%.20f\n",at[itot*jtot+itot+itot/2]);

    // Time performance 
    std::clock_t start = std::clock(); 

    for (int i=0; i<nloop; ++i)
        diff(at, a, 0.1, 0.1, 0.1, 0.1, itot, jtot, ktot); 

    double duration = (std::clock() - start ) / (double)CLOCKS_PER_SEC;

    printf("time/iter = %f s (%i iters)\n",duration/(double)nloop, nloop);

    return 0;
}

What does your C++ testing code look like? And maybe the setup file? — ead, Sep 29 '17 at 21:02
It may be a typo, but `nloop` in your C++ source is half that in your Python source, 10 vs 20. That could certainly explain a factor of two performance difference. — bnaecker, Sep 29 '17 at 21:17
@bnaecker. You are correct, but I divide benchmark time by `nloop`, so that does not explain the difference. I changed it nonetheless, thanks! — Chiel, Sep 29 '17 at 21:20
@Chiel Right you are. What about the `#pragma ivdep`? Could be vectorization by the compiler in the pure C++ case. Have you tried removing that and comparing? — bnaecker, Sep 29 '17 at 21:30
@bnaecker. That is a flag for the intel compiler which I am not using in these tests. Removed it now, to avoid confusion... — Chiel, Sep 29 '17 at 21:31

ead · Accepted Answer · 2017-10-01T09:03:11.273

The problem here is not what happens at run time, but which optimizations happen at compile time.

Which optimizations are applied depends on the compiler (and even on its version), and there is no guarantee that every optimization that can be done will be done.

There are actually two different reasons why the Cython build is slower, depending on whether you use g++ or clang++:

g++ is unable to optimize because of the -fwrapv flag in the Cython build.
clang++ is unable to optimize in the first place (read on to see what happens).

First issue (g++): Cython compiles with different flags than your pure C++ program, and as a result some optimizations can't be done.

If you look at the log of the setup, you will see:

 x86_64-linux-gnu-gcc ... -O2 ..-fwrapv .. -c diff.cpp ... -Ofast -march=native

As you said, -Ofast wins over -O2 because it comes last. The problem is -fwrapv, which seems to prevent some optimizations, because signed integer overflow can no longer be treated as UB and exploited by the optimizer.
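To check where -fwrapv comes from without reading the whole build log: as far as I know, on typical Unix builds of CPython the build tools reuse the compiler flags the interpreter itself was built with, and these can be queried through the standard sysconfig module. A minimal sketch, assuming a Unix-like platform (elsewhere the variable may be unset, which the helper treats as an empty list); the helper name default_cflags is made up for illustration:

```python
import sysconfig


def default_cflags():
    """Return the C compiler flags CPython was built with, as a list."""
    # CFLAGS can be None on platforms that do not define it (e.g. Windows).
    cflags = sysconfig.get_config_var("CFLAGS") or ""
    return cflags.split()


if __name__ == "__main__":
    flags = default_cflags()
    print("default CFLAGS:", " ".join(flags))
    print("-fwrapv present:", "-fwrapv" in flags)
```

If -fwrapv shows up here, it will most likely also show up in the compile line of the extension build.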

So you have the following options:

add -fno-wrapv to extra_compile_args. The disadvantage is that all files are now compiled with the changed flags, which might be unwanted.
build a library from the cpp file with only the flags you like and link it to your Cython module. This solution has some overhead, but has the advantage of being robust: as you pointed out, different Cython flags could be the problem for different compilers, so the first solution might be too brittle.
I'm not sure whether the default flags can be disabled, but there may be some information in the docs.
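The first two options could look roughly like this in setup.py. This is a sketch, not a tested build: the helper names are invented for illustration, the library name diff_cpp and its location are hypothetical, and the second variant assumes the .pyx declares diff_cpp from a header instead of including diff_cpp.cpp directly. Note that every flag has to be its own list element.

```python
# Option 1: keep compiling diff_cpp.cpp through Cython, but cancel -fwrapv.
# extra_compile_args are appended after the default flags, so -fno-wrapv
# comes last and overrides the default -fwrapv.
def extension_kwargs_no_wrapv(numpy_include):
    return dict(
        name="diff",
        sources=["diff.pyx"],
        language="c++",
        extra_compile_args=["-Ofast", "-march=native", "-fno-wrapv"],
        include_dirs=[numpy_include],
    )


# Option 2: link a library that was built separately with hand-picked flags.
# "diff_cpp" (libdiff_cpp.a in the current directory) is a hypothetical name.
def extension_kwargs_prebuilt_lib(numpy_include):
    return dict(
        name="diff",
        sources=["diff.pyx"],
        language="c++",
        libraries=["diff_cpp"],
        library_dirs=["."],
        include_dirs=[numpy_include],
    )


# Usage inside setup.py (not executed here):
#   import numpy
#   from setuptools import setup, Extension
#   from Cython.Build import cythonize
#   ext = Extension(**extension_kwargs_no_wrapv(numpy.get_include()))
#   setup(ext_modules=cythonize([ext]))
```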

Second issue (clang++): inlining in the C++ test program.

When I compile your C++ program with my pretty old g++ (version 5.4):

 g++ test.cpp -o test -Ofast -march=native -fwrapv

it becomes almost 3 times slower than when compiled without -fwrapv. This is, however, a weakness of the optimizer: when inlining, it should see that no signed integer overflow is possible (all dimensions are 256), so the -fwrapv flag shouldn't have any impact.

My old clang++ version (3.8) seems to do a better job here: with the flags above I cannot see any degradation of the performance. I need to disable inlining via -fno-inline to get slower code, but then it is slower even without -fwrapv, i.e.:

 clang++ test.cpp -o test -Ofast -march=native -fno-inline

So there is a systematic bias in favor of your C++ program: after inlining, the optimizer can specialize the code for the known values, which is something Cython cannot do.

So we can see: clang++ was not able to optimize the function diff for arbitrary sizes, but was able to optimize it for size=256. Cython, however, can only use the non-optimized version of diff. That is why -fno-wrapv has no positive impact.

My take-away: disallow inlining of the function of interest in the C++ tester (e.g. compile it in its own object file) to ensure a level playing field with Cython; otherwise one sees the performance of a program that was specially optimized for this one input.
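Compiling the kernel in its own object file could be scripted roughly as follows. The sketch only assembles the compiler commands; the file names diff_kernel.cpp and main.cpp are hypothetical (the question keeps everything in one file), and it assumes link-time optimization is not enabled, since LTO could inline across object files again.

```python
import shlex


def separate_build_commands(compiler="g++", flags=("-Ofast", "-march=native")):
    """Build commands that keep the kernel out of main()'s translation unit."""
    flags = list(flags)
    return [
        # 1) Kernel alone: the optimizer cannot see the call site or its constants.
        [compiler, *flags, "-c", "diff_kernel.cpp", "-o", "diff_kernel.o"],
        # 2) Driver plus the prebuilt object file.
        [compiler, *flags, "main.cpp", "diff_kernel.o", "-o", "test"],
    ]


if __name__ == "__main__":
    for cmd in separate_build_commands():
        # Pass cmd to subprocess.run(cmd, check=True) to actually execute it.
        print(shlex.join(cmd))
```

Another way to get the same effect is to read the dimensions at run time (for example from the command line), so that they are not compile-time constants.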

NB: A funny thing is that if all ints are replaced by unsigned ints, then naturally -fwrapv plays no role, but the unsigned int version is as slow as the int version with -fwrapv. This is only logical, as there is no undefined behavior to be exploited.
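To make the comparison concrete: -fwrapv makes signed overflow wrap around in two's complement, and unsigned arithmetic always wraps modulo 2^32, so in both cases the compiler has to preserve the wrapped result instead of assuming that overflow never happens. The wrapped values themselves can be illustrated with Python's ctypes, whose integer types truncate without overflow checking. This assumes a 32-bit int and only mimics the arithmetic; it says nothing about what a particular compiler does with it.

```python
import ctypes

INT_MAX = 2**31 - 1

# Signed 32-bit: INT_MAX + 1 wraps to INT_MIN (the behavior -fwrapv mandates).
wrapped_signed = ctypes.c_int32(INT_MAX + 1).value

# Unsigned 32-bit: UINT_MAX + 1 wraps to 0 (always defined in C and C++).
wrapped_unsigned = ctypes.c_uint32(2**32).value

print(wrapped_signed)    # -2147483648
print(wrapped_unsigned)  # 0
```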

The last flag wins, as far as I know, otherwise it is impossible to override if the flags are already set. — Chiel, Sep 29 '17 at 21:47
@Chiel OK, you are right, but something in the other flags interferes with the optimization: the assemblies are pretty different for files compiled via Cython and directly with g++ — ead, Sep 29 '17 at 21:50
This is not true. If I compile the fast C++ code with all the flags that come from the Cython compilation, I still reproduce the performance difference. Can you reproduce the difference? — Chiel, Sep 29 '17 at 22:05
@Chiel With `g++ test.cpp -o test -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fno-strict-aliasing -Wdate-time -D_FORTIFY_SOURCE=2 -g -fstack-protector-strong -Wformat -Werror=format-security -fPIC -ftrapv -Ofast -march=native`- 0.20 seconds — ead, Sep 29 '17 at 22:18
Doesn't `-ftrapv` override `-fwrapv` anyhow? I don't find any difference in the speed of my Cython code, unfortunately. — Chiel, Sep 29 '17 at 22:24
@Chiel `-ftrapv` overrides `-fwrapv`, but now `-ftrapv` prevents the optimization. My line in setup is `extra_compile_args=['-std=c++11', '-Ofast', '-march=native', '-fno-wrapv']` and I get exactly the same speed as with cpp — ead, Sep 29 '17 at 22:26
Interesting. I do reach good results with gcc, but not with clang, where the Cython version is still slower — Chiel, Sep 29 '17 at 22:31
@Chiel I see, I copied a wrong line with `-ftrapv`, sorry for the confusion. `-ftrapv` is not from Cython, but I tried to use it first... — ead, Sep 29 '17 at 22:31
@Chiel if it is so brittle, I would build a (static) library with my cpp and with flags I like and link it to the cython module. You don't really need inlining here... — ead, Sep 29 '17 at 22:36
@Chiel I think I found the explanation for why clang++ is slower: it was never fast in the first place. Please take a look at my updated answer for more info — ead, Oct 01 '17 at 09:04
That is a great piece of research. I reproduce it. If I retrieve the dimensions as a command line parameter I have identical speeds. Awesome. Thanks a lot! — Chiel, Oct 01 '17 at 10:56
