I have a class defined and used as follows, that includes a method called in the context of a multiprocessing pool:

    from multiprocessing import Pool

    import pandas as pd


    class MyClass:
        def __init__(self, big_object):
            self.big_object = big_object

        def compute_one(self, myvar):
            return self.big_object.do_something_with(myvar)

        def compute_all(self, myvars):
            with Pool() as pool:
                # The bound method is pickled together with self (and so big_object)
                results = pool.map(self.compute_one, myvars)
            return results


    big_object = pd.read_csv('very_big_file.csv')
    cls = MyClass(big_object)
    vars = ['foo', 'bar', 'baz']
    all_results = cls.compute_all(vars)

This "works", but big_object takes several GB of RAM, and with the code above it is copied into RAM for every process launched by multiprocessing.Pool.
How can I modify the code above so that big_object is shared among all processes, while still being defined at class instantiation?
-- EDIT
Some context into what I'm trying to do might help identify a different approach altogether.
Here, big_object is a Pandas dataframe with 1M+ rows and tens of columns, and compute_one computes statistics based on a particular column.
A simplified high-level view would be (extremely summarized):
- take one of the columns (e.g. current_col = manufacturer)
- for each rem_col of the remaining categorical columns:
  - compute a big_object.groupby([current_col, rem_col]).size()

The end result of this would look like:
| manufacturer | country_us | country_gb | client_male | client_female |
|---|---|---|---|---|
| bmw | 15 | 18 | 74 | 13 |
| mercedes | 2 | 24 | 12 | 17 |
| jaguar | 48 | 102 | 23 | 22 |
So in essence, this is about computing statistics for every column, on every other column, over the entire source dataframe.
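
To make this concrete, here is a minimal sketch of the counting part on toy data. The helper names count_by and stats_for and the sample rows are made up for illustration; the real code also computes other statistics on a much larger dataframe.

```python
import pandas as pd

# Hypothetical toy data standing in for the real 1M+ row dataframe.
df = pd.DataFrame({
    "manufacturer": ["bmw", "bmw", "mercedes", "jaguar", "jaguar", "jaguar"],
    "country": ["us", "gb", "gb", "us", "gb", "gb"],
    "client": ["male", "male", "female", "male", "female", "male"],
})


def count_by(df, current_col, rem_col):
    """Count rows per (current_col, rem_col) pair, one column per rem_col value."""
    counts = df.groupby([current_col, rem_col]).size().unstack(fill_value=0)
    # Prefix the column names with rem_col, e.g. "us" becomes "country_us".
    counts.columns = [f"{rem_col}_{value}" for value in counts.columns]
    return counts


def stats_for(df, current_col, rem_cols):
    """Place the per-column count tables side by side, aligned on current_col."""
    return pd.concat([count_by(df, current_col, c) for c in rem_cols], axis=1)


result = stats_for(df, "manufacturer", ["country", "client"])
print(result)
```

Each call to count_by is independent of the others, which is why the work for the remaining columns can be handed out to separate processes.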
Using multiprocessing here makes it possible, for a given current_col, to compute all the statistics on remaining_cols in parallel. These statistics include not only sum but also mean (for the remaining numerical columns).
A dirty approach that uses a global variable instead (a global big_object instantiated outside the class) cuts the total running time from 5+ hours to about 20 minutes. My only issue is that I'd like to avoid this global object approach.