I've searched probably 10 threads on multiprocessing, but nothing seems to fit my use case. Here is a general idea of what I want to parallelize:
class Foo:
    def boo(self):
        filename = 'path to the data file'
        # fileReader is pykaldi's sequential matrix reader; it yields (id, feature) pairs
        with reader(filename) as fileReader:
            for id, feature in fileReader:
                self.boo2(id, feature)

    def boo2(self, id, feature):
        # process feature, then save the output to a folder based on id
        ...
Here I want to parallelize the calls to boo2(), where fileReader is an iterator (a SequentialMatrixReader from pykaldi) with tens of thousands of rows of (id, feature), where id is a string and each feature is a matrix (hundreds of rows x tens of columns). boo2 computes a smaller matrix and saves the result to a folder based on id. Each call to boo2 is independent of the others, so I want to parallelize it.
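For context, the reader is essentially pykaldi's SequentialMatrixReader; iterating it lazily yields one (id, feature) pair at a time, roughly like this (the 'scp:...' specifier below is just a placeholder path, not my real one):

from kaldi.util.table import SequentialMatrixReader

# 'scp:...' is a placeholder for the actual feature specifier
with SequentialMatrixReader('scp:path/to/feats.scp') as fileReader:
    for id, feature in fileReader:
        # id is a string key, feature is a kaldi matrix with
        # hundreds of rows and tens of columns
        print(id)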
From my understanding I can't use multiprocessing.Pool, since boo2 is a method of the class and I can't pull it out of the class due to its complexity.
I don't know how to use multiprocessing.Process either, since the number of cores is much smaller than the number of rows in the iterator, and I am unsure how to queue new calls to boo2 once I've start()ed and join()ed the processes. (I've tried splitting fileReader into n batches and assigning a Process per batch, but I'd much prefer to queue the calls in one line rather than in multiple batches.)
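For reference, the batching attempt looked roughly like the sketch below (not my exact code; _run_batch and n_workers are placeholder names, and reading everything into a list up front is itself part of the problem). It relies on fork-style process start so that self and the batches don't need to be pickled:

import multiprocessing as mp

class Foo:
    # ... boo2 as above ...

    def _run_batch(self, batch):
        # batch is a list of (id, feature) pairs read ahead of time
        for id, feature in batch:
            self.boo2(id, feature)

    def boo_batched(self, n_workers=4):
        filename = 'path to the data file'
        with reader(filename) as fileReader:
            rows = list(fileReader)          # materialises everything up front
        size = (len(rows) + n_workers - 1) // n_workers
        batches = [rows[i:i + size] for i in range(0, len(rows), size)]
        procs = [mp.Process(target=self._run_batch, args=(b,)) for b in batches]
        for p in procs:
            p.start()
        for p in procs:
            p.join()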
I've also looked into the pathos module, since it doesn't have problems with class methods. However, from the sample use cases, the closest fit to my need is:
pathos.threading.ThreadPool().imap(boo2, [feature for feature in fileReader])
But because of how large fileReader is, I am unable to fit [feature for feature in fileReader] in memory.
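Written out inside the class, that attempt looks roughly like this (a sketch only; the nodes count is arbitrary, and I pass (id, feature) pairs so boo2 gets both arguments). The list comprehension is where the memory blows up, because it materialises every pair before imap even starts:

from pathos.threading import ThreadPool

class Foo:
    # ... boo2 as above ...

    def boo_pathos(self):
        filename = 'path to the data file'
        with reader(filename) as fileReader:
            pool = ThreadPool(nodes=4)
            # this list comprehension pulls the whole reader into memory
            pairs = [(id, feature) for id, feature in fileReader]
            # threads can call the bound method directly, no pickling needed
            results = pool.imap(lambda pair: self.boo2(*pair), pairs)
            list(results)   # drain the lazy imap iterator so every call runs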
Any and all help is appreciated. Thank you.