I have two consecutive functions that process large lists.
I call one after the other, each wrapped in joblib's Parallel and delayed, in an attempt to speed up the processing of each function individually.
However, as soon as Parallel dispatches function_2, I still see output coming from function_1, and I don't understand why. In a nutshell, function_2 never appears to be called.
The main code:
from mycode import function_1, function_2  # function_1 must be imported as well
from joblib import Parallel, delayed
import gc

if __name__ == '__main__':
    items = list_1  # renamed to avoid shadowing the built-in list
    print ">>> First call"
    Parallel(n_jobs=-1)(delayed(function_1)(item) for item in items)

    gc.collect()
    do_other_stuff()

    items = list_2
    print ">>> Second call"
    Parallel(n_jobs=-1, backend='threading')(delayed(function_2)(item) for item in items)
The parallelized functions (defined in mycode):
def function_1(item):  # Gets called first
    print "this comes from function 1"

def function_2(item):  # Gets called second
    print "this comes from function 2"
Output:
>>> First call
this comes from function 1
this comes from function 1
this comes from function 1
this comes from function 1
>>> Second call
this comes from function 1
this comes from function 1
this comes from function 1
this comes from function 1
My hypothesis is that some state from function_1 is retained in memory after the first call (possibly due to joblib's memory mapping / sharing feature?).
That is why I call gc.collect() between the two calls. Since this doesn't help, I am considering reloading the modules involved (joblib, Parallel, delayed) between the calls, which seems ugly.
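For clarity, this is roughly the module-reload workaround I have in mind between the two Parallel calls; it is only an untested sketch (relying on Python 2's built-in reload), and I would prefer to avoid it:

# Untested idea: force a fresh import of the modules between the two
# Parallel calls, then re-bind the names imported via "from ... import ...".
import joblib
import mycode

reload(joblib)   # Python 2 built-in reload
reload(mycode)

from joblib import Parallel, delayed
from mycode import function_1, function_2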
Has anyone experienced similar behavior (on Windows)?
Is there a fix?
Do I need to unload/reload the joblib or mycode modules between the Parallel steps, and if so, why?