I wrote a program for calculating certain polynomials in Python, and it's reasonably fast, but when I ran cProfile on it the results were disturbing. The specific run takes 296 seconds, which is fine, but the cumulative time spent in abc.py's __instancecheck__ is 43 seconds. That is really pushing me toward rewriting it in something other than Python, especially since there's a calculation I want to run that, with the current code, would take 50 days.
  ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    201/1    0.001    0.000  296.104  296.104 {built-in method builtins.exec}
        1    0.000    0.000  296.104  296.104 double_samuel_schubmult7.py:1(<module>)
        1   85.132   85.132  295.717  295.717 double_samuel_schubmult7.py:205(schubmult)
    72618   34.608    0.000   94.825    0.001 double_samuel_schubmult7.py:247(<listcomp>)
 46620756   19.717    0.000   60.217    0.000 double_samuel_schubmult7.py:171(elem_sym_func)
        1    0.002    0.002   54.962   54.962 parallel.py:1000(__call__)
        1    0.584    0.584   54.886   54.886 parallel.py:960(retrieve)
    12039    0.013    0.000   54.181    0.005 pool.py:767(get)
    12039    0.008    0.000   54.160    0.004 pool.py:764(wait)
    12054    0.018    0.000   54.156    0.004 threading.py:589(wait)
      768    0.010    0.000   54.119    0.070 threading.py:288(wait)
     3126   54.105    0.017   54.105    0.017 {method 'acquire' of '_thread.lock' objects}
 58143528    7.986    0.000   43.820    0.000 abc.py:117(__instancecheck__)
 58143528   12.915    0.000   35.834    0.000 {built-in method _abc._abc_instancecheck}
 8188300/3054264   30.769    0.000   30.769    0.000 double_samuel_schubmult7.py:148(elem_sym_poly)
58143557/58143542    7.913    0.000   22.918    0.000 abc.py:121(__subclasscheck__)
58143557/58143542   15.006    0.000   15.006    0.000 {built-in method _abc._abc_subclasscheck}
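(In case the exact setup matters: the table is from profiling the whole script and sorting by cumulative time. Something like the following reproduces that view from a saved profile and also shows which functions are the ones calling into abc.py; the output filename here is just an example.)

import pstats

# Load a profile saved with e.g. `python -m cProfile -o schub.prof double_samuel_schubmult7.py`
# (the output filename is just an example).
stats = pstats.Stats('schub.prof')
stats.sort_stats('cumulative').print_stats(20)   # the view shown above
stats.print_callers('abc.py')                    # which functions call into the abc machinery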
The threading portion corresponds to a small part of the code at the end, which isn't the problem; most of the code is single-threaded.
Do these 43 seconds spent in abc's __instancecheck__ really mean that if I rewrite this in, say, C, it will be at least 43 seconds faster? Is there a way to suppress these checks?
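From what I understand, any isinstance() check against a class whose metaclass is ABCMeta dispatches to abc.py's __instancecheck__, so these entries can pile up even if my own code never calls isinstance directly. A tiny self-contained example that produces the same profile entries (numbers.Number is just a stand-in ABC here, not necessarily the one symengine or numpy is actually hitting):

import cProfile
from numbers import Number  # Number is an ABC, so isinstance() goes through ABCMeta.__instancecheck__

def count_numbers(xs):
    return sum(1 for x in xs if isinstance(x, Number))

# Profiling this shows abc.py:__instancecheck__ and _abc._abc_instancecheck near the top,
# the same entries that account for the 43 seconds in the table above.
cProfile.run('count_numbers(list(range(10**5)))')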
I should note that for the polynomial calculations I'm using symengine and numpy, and the instance checks could be coming from either of them. Below is the main function (schubmult).
from symengine import *
import numpy as np
from joblib import Parallel, delayed  # Parallel/delayed below are joblib's (parallel.py in the profile above)
..200 lines of omitted code..
def schubmult(perm_dict,v):
    vn1 = inverse(v)
    th = theta(vn1)
    if th[0]==0:
        return perm_dict        
    mu = permtrim(uncode(th))
    vmu = permtrim(mulperm(list(v),mu))
    inv_vmu = inv(vmu)
    inv_mu = inv(mu)
    ret_dict = {}
    vpaths = [([(vmu,0)],1)]
    while th[-1] == 0:
        th.pop()
    for i in range(len(th)):
        k = i+1
        vpaths2 = []
        for path,s in vpaths:
            last_perm = path[-1][0]
            newperms = kdown_perms(last_perm,th[i],k)
            for new_perm,s2,vdiff in newperms:
                new_perm2 = permtrim(new_perm)
                if i == len(th)-1 and (len(new_perm2) != 2 or new_perm2[0]!=1):
                    continue
                path2 = [*path,(new_perm2,vdiff)]
                vpaths2 += [(path2,s*s2)]
        vpaths = vpaths2
    arr0 = [0 for vpath in vpaths]
    for u,val in perm_dict.items():
        inv_u = inv(u)
        vpathsums = {u: val*np.array([vpath[1] for vpath in vpaths])}
        for index in range(len(th)):            
            newpathsums = {}
            for up, arr in vpathsums.items():
                inv_up = inv(up)
                newperms = elem_sym_perms(up,min(th[index],(inv_mu-(inv_up-inv_u))-inv_vmu),th[index])
                for up2, udiff in newperms:
                    coeffs = [elem_sym_func(th[index], index + 1, up, up2,
                                            vpaths[i][0][index][0], vpaths[i][0][index + 1][0],
                                            udiff, vpaths[i][0][index + 1][1], var2, var3)
                              for i in range(len(vpaths))]
                    newpathsums[up2] = newpathsums.get(up2, np.array(arr0)) + arr * coeffs
            vpathsums = newpathsums
        if len(vpaths)<300:
            ret_dict = add_perm_dict({ep: np.sum(arr) for ep,arr in vpathsums.items()},ret_dict)
        else:
            ret_dict = add_perm_dict(dict(Parallel(n_jobs=-1,require='sharedmem')(delayed(pairsum)(ep,arr) for ep,arr in vpathsums.items())),ret_dict)            
    return ret_dict
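If it helps, one thing I've been considering in order to pin down exactly which call sites trigger the checks is to temporarily wrap ABCMeta.__instancecheck__ while running a small input. A rough sketch (untested; small_perm_dict and small_v are placeholders for a small test case):

import abc
import collections
import sys

caller_counts = collections.Counter()
_orig_instancecheck = abc.ABCMeta.__instancecheck__

def _counting_instancecheck(cls, obj):
    # isinstance() itself is a C function, so one frame up is the Python code
    # that made the check (checks issued from C show up as whatever Python frame is active).
    f = sys._getframe(1)
    caller_counts[(f.f_code.co_filename, f.f_lineno)] += 1
    return _orig_instancecheck(cls, obj)

abc.ABCMeta.__instancecheck__ = _counting_instancecheck
try:
    schubmult(small_perm_dict, small_v)  # placeholder: a small test case
finally:
    abc.ABCMeta.__instancecheck__ = _orig_instancecheck

for (filename, lineno), n in caller_counts.most_common(10):
    print(n, filename, lineno)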
