I have the following operation:
import pandas as pd
import numpy as np
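def some_calc(x, y):
    # Index both inputs by category so the two Series line up row by row
    x = x.set_index('Cat')
    y = y.set_index('Cat')
    y = np.sqrt(y['data_point2'])
    # Single-column DataFrame, so that vec.dot(vec.T) yields the full MxM matrix
    vec = pd.DataFrame(x['data_point1'] * y)
    # Random MxM weights, redrawn on every call
    grid = np.random.rand(len(x), len(x))
    # Weight the outer product element-wise, then sum every entry to a scalar
    result = vec.dot(vec.T).mul(grid).sum().sum()
    return result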
sample_size = 100
cats = ['a','b','c','d']
df1 = pd.DataFrame({'Cat':[cats[np.random.randint(4)] for _ in range(sample_size)],
                    'data_point1':np.random.rand(sample_size),
                    'data_point2':np.random.rand(sample_size)})
df2 = df1.groupby('Cat').sum().reset_index()
I would like to run some_calc for each row of df2, using the rows of df1 that belong to the same Cat.
The code below works well:
df2['Apply'] = df2.apply(lambda x: some_calc(x=df1[df1['Cat']==x['Cat']][['Cat','data_point1']], 
                                             y=df1[df1['Cat']==x['Cat']][['Cat','data_point2']]),axis=1)
(I reset the index in df2 because I don't know how to apply across an index.
Also, I pass Cat to some_calc as the index field together with each data_point column because, without an index, vec.dot(vec.T) collapses the dot product into a single number. That then errors in .mul(), because I need the full MxM matrix rather than a float. I might be doing something wrong here, though...)
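To illustrate the dot-product point, here is a minimal sketch on a toy three-element vector (made-up values, not my real data). As far as I can tell, the scalar comes from calling .dot on a 1-D Series, while a single-column DataFrame or np.outer gives the full matrix:

```python
import numpy as np
import pandas as pd

# Toy vector, purely illustrative
v = pd.Series([1.0, 2.0, 3.0])

# For a 1-D Series, .T returns the Series itself, so this is the inner product
inner = v.dot(v.T)

# A single-column DataFrame keeps the column orientation, so .dot gives a 3x3 matrix
vec = pd.DataFrame(v)
outer_df = vec.dot(vec.T)

# np.outer builds the same 3x3 matrix directly from 1-D arrays
outer_np = np.outer(v.to_numpy(), v.to_numpy())
```

If that is right, np.outer might let me drop the Cat index from some_calc altogether.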
I'm currently exploring how to vectorize the above so that the calculation does not slow down as sample_size grows.
I saw in previous threads that you can set raw=True so that the function receives each row as an np.ndarray instead of a pd.Series.
df2['ApplyRaw'] = df2.apply(lambda x: some_calc(x=df1[df1['Cat']==x['Cat']][['Cat','data_point1']],
                                                y=df1[df1['Cat']==x['Cat']][['Cat','data_point2']]),axis=1, raw=True)
However, it throws an error:
IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices
I tried omitting Cat from the arguments, but I still get the same error.
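The error can be reproduced on a much smaller frame. This is a minimal sketch with a made-up two-row DataFrame (the names df, label_lookup and positional_lookup are only for illustration). It suggests that the problem is the x['Cat'] lookup itself: with raw=True each row arrives as a plain ndarray, which has no column labels.

```python
import pandas as pd

# Tiny stand-in for df2; values are made up
df = pd.DataFrame({'Cat': ['a', 'b'], 'val': [1.0, 2.0]})

def label_lookup(row):
    # Works when row is a pd.Series, fails when row is an np.ndarray
    return row['val']

def positional_lookup(row):
    # Position 1 is 'val' here; this relies on the column order of df
    return row[1]

# Default behaviour: each row is a Series, so label lookup works
by_label = df.apply(label_lookup, axis=1)

# raw=True: each row is an ndarray, so a string key raises IndexError
error_message = None
try:
    df.apply(label_lookup, axis=1, raw=True)
except IndexError as exc:
    error_message = str(exc)

# Integer positions still work with raw=True
by_position = df.apply(positional_lookup, axis=1, raw=True)
```

So with raw=True I would presumably have to use x[0] instead of x['Cat'], assuming Cat is the first column of df2.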
Are there any code improvements or tricks I can employ that allow me to vectorize the above?
Or do I have to amend some_calc?
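For reference, this is the kind of amended some_calc I have in mind. It is only a sketch under a few assumptions: some_calc_arrays and per_group are hypothetical names, the grid is passed in explicitly so the result can be checked, and I use np.random.default_rng with a fixed seed instead of np.random.rand. It relies on the identity that the sum over all i, j of vec[i] * vec[j] * grid[i, j] equals vec @ grid @ vec, so the MxM outer product is never built.

```python
import numpy as np
import pandas as pd

def some_calc_arrays(data_point1, data_point2, grid):
    # Hypothetical variant of some_calc that works on plain 1-D arrays
    vec = data_point1 * np.sqrt(data_point2)
    # Equivalent to (np.outer(vec, vec) * grid).sum()
    return vec @ grid @ vec

rng = np.random.default_rng(0)  # fixed seed, only for reproducibility
sample_size = 100
cats = np.array(['a', 'b', 'c', 'd'])
df1 = pd.DataFrame({'Cat': rng.choice(cats, size=sample_size),
                    'data_point1': rng.random(sample_size),
                    'data_point2': rng.random(sample_size)})

def per_group(group):
    # One random MxM grid per category, as in the original some_calc
    grid = rng.random((len(group), len(group)))
    return some_calc_arrays(group['data_point1'].to_numpy(),
                            group['data_point2'].to_numpy(),
                            grid)

# Iterate over the groups once instead of filtering df1 for every row of df2
result = pd.Series({cat: per_group(group) for cat, group in df1.groupby('Cat')})
```

This still loops over the categories, so it is not fully vectorized, but each group is handled with NumPy only and df1 is no longer re-filtered for every row.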