I have a dask dataframe grouped by the index (first_name).
import pandas as pd
import numpy as np
from multiprocessing import cpu_count
from dask import dataframe as dd
from dask.distributed import Client
from fuzzywuzzy import fuzz

NCORES = cpu_count()
client = Client()
entities = pd.DataFrame({
    'first_name': ['Jake', 'John', 'Danae', 'Beatriz', 'Jacke', 'Jon'],
    'last_name': ['Del Toro', 'Foster', 'Smith', 'Patterson', 'Toro', 'Froster'],
    'ID': ['X', 'U', 'X', 'Y', '12', '13'],
})
df = dd.from_pandas(entities, npartitions=NCORES)
df = client.persist(df.set_index('first_name'))
(Obviously, in real life entities has several thousand rows.)
I want to apply a user-defined function to each grouped dataframe. Within each group, I want to compare every row with all the other rows (similar to Pandas compare each row with all rows in data frame and save results in list for each row).
The following is the function that I try to apply:
def contraster(x, DF):
    # True for each row of DF whose last_name fuzzily matches x
    matches = DF.apply(lambda row: fuzz.partial_ratio(row['last_name'], x) >= 50, axis=1)
    # positions of the matching rows
    return [i for i, m in enumerate(matches) if m]
For the test entities data frame, you could apply the function as usual:
entities.apply(lambda row: contraster(row['last_name'], entities), axis=1)
And the expected result is:
Out[35]: 
0    [0, 4]
1    [1, 5]
2       [2]
3       [3]
4    [0, 4]
5    [1, 5]
dtype: object
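As a side note for readers who do not have fuzzywuzzy installed, the same toy output can be reproduced with a standard-library stand-in. The sketch below swaps fuzz.partial_ratio >= 50 for difflib.SequenceMatcher with a ratio >= 0.5 (an assumption for illustration only; the two scorers are not equivalent in general):

```python
# Sketch: same row-vs-all comparison, with difflib.SequenceMatcher
# standing in for fuzz.partial_ratio (an assumption; the scorers differ).
from difflib import SequenceMatcher

import pandas as pd

entities = pd.DataFrame({
    'first_name': ['Jake', 'John', 'Danae', 'Beatriz', 'Jacke', 'Jon'],
    'last_name': ['Del Toro', 'Foster', 'Smith', 'Patterson', 'Toro', 'Froster'],
    'ID': ['X', 'U', 'X', 'Y', '12', '13'],
})

def contraster(x, DF):
    # True where a row's last_name is similar enough to x
    matches = DF.apply(
        lambda row: SequenceMatcher(None, row['last_name'], x).ratio() >= 0.5,
        axis=1,
    )
    return [i for i, m in enumerate(matches) if m]

result = entities.apply(lambda row: contraster(row['last_name'], entities), axis=1)
print(result)
# On this toy data the matches come out the same:
# [0, 4], [1, 5], [2], [3], [0, 4], [1, 5]
```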
When entities is huge, the solution is to use dask. Note that DF in the contraster function must be the grouped dataframe.
I am trying to use the following:
df.groupby('first_name').apply(func=contraster, args=????)
But how should I specify the grouped dataframe (i.e. DF in contraster)?
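For comparison, in plain pandas the callable handed to groupby(...).apply already receives each group as a DataFrame, so the group itself can play the role of DF with no extra argument. A hedged sketch (contraster_group is a hypothetical per-group variant of contraster, and difflib stands in for fuzz so the snippet is self-contained):

```python
# Sketch (plain pandas): the callable passed to groupby(...).apply
# receives each group as a DataFrame, so it can reach the group directly.
# contraster_group is a hypothetical per-group variant of contraster;
# difflib.SequenceMatcher stands in for fuzz.partial_ratio.
from difflib import SequenceMatcher

import pandas as pd

def contraster_group(group):
    # compare each row's last_name with every last_name in the SAME group
    def matches_for(x):
        return [i for i, other in enumerate(group['last_name'])
                if SequenceMatcher(None, other, x).ratio() >= 0.5]
    return group['last_name'].apply(matches_for)

entities = pd.DataFrame({
    'first_name': ['Jake', 'John', 'Jake', 'John', 'John'],
    'last_name': ['Del Toro', 'Foster', 'Toro', 'Froster', 'Smith'],
})

for name, group in entities.groupby('first_name'):
    print(name, contraster_group(group).tolist())
# Jake [[0, 1], [0, 1]]
# John [[0, 1], [0, 1], [2]]
```

Dask's groupby(...).apply hands each group to the function in the same way, but also expects a meta argument describing the output (otherwise it warns and guesses), so the question reduces to writing contraster as a per-group function rather than passing the group in through args.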