I need to do some computation on different slices of some big dataframes.
Suppose I have 3 big dataframes df1, df2 and df3, each of which has a "Date" column.
I need to do some computation on these dataframes based on date slices, and since each iteration is independent of the others, I want to run these iterations concurrently.
df1              # a big dataframe
df2              # a big dataframe
df3              # a big dataframe
So I define my desired function; within each child process, slices of df1, df2 and df3 are created first, and then the other computations follow.
Since df1, df2 and df3 are global dataframes, I need to pass them as default arguments to my function; otherwise they aren't recognized inside the child processes.
Something like this:
slices = ['2020-04-11', '2020-04-12', '2020-04-13']
# a list of dates used to slice the dataframes further
def my_func(slice, df1=df1, df2=df2, df3=df3):
    # the default arguments bind the global dataframes at definition time
    sliced_df1 = df1[df1.Date >  slice]
    sliced_df2 = df2[df2.Date <  slice]
    sliced_df3 = df3[df3.Date >= slice]
    #
    # other computations
    # ...
    #
    return desired_df
The concurrent processing is configured as below:
import multiprocess
import pandas as pd
import psutil

pool = multiprocess.Pool(psutil.cpu_count(logical=False))  # one worker per physical core
final_df = pool.map(my_func, slices)
pool.close()
pool.join()
final_df = pd.concat(final_df, ignore_index=True)
However, it seems that only one core's utilization goes up during execution.
I suppose that since each child process needs to access the global dataframes df1, df2 and df3, there should be some shared memory for the child processes. From what I have found online, I guess I have to use multiprocessing.Manager(), but I am not sure how to use it, or whether I am right about using it at all.
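For what it's worth, below is roughly what I imagine the Manager approach would look like (untested; ns and my_func_shared are just names I made up, and I have no idea whether this is correct or whether it would actually help):

import multiprocessing

manager = multiprocessing.Manager()
ns = manager.Namespace()
ns.df1 = df1  # assuming df1, df2 and df3 already exist, as above
ns.df2 = df2
ns.df3 = df3

def my_func_shared(slice, ns=ns):
    # every attribute access fetches a pickled copy from the manager process
    sliced_df1 = ns.df1[ns.df1.Date > slice]
    sliced_df2 = ns.df2[ns.df2.Date < slice]
    sliced_df3 = ns.df3[ns.df3.Date >= slice]
    #
    # other computations
    # ...
    #
    return desired_df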
I am actually new to the concept of concurrent processing, and I would appreciate it if someone could help.
PS: It seems that my question is similar to this post. However, it doesn't have an accepted answer.
