I have a single process that I run using the subprocess module's Popen:
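    import subprocess
    from time import time, sleep

    result = subprocess.Popen(['tesseract', 'mypic.png', 'myop'])
    st = time()
    # poll() returns None while the process is still running
    while result.poll() is None:
        sleep(0.001)
    en = time()
    print('Took :' + str(en - st))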
Which results in:
    Took :0.44703030586242676
Here, a tesseract call is made to process the image mypic.png (attached) and write the OCR result to myop.txt.
Now I want to do the same with multiple processes, as suggested in this comment (or see this directly), so the code is:
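For reference, the same measurement can be written without a polling loop. This is only a minimal sketch, not the code I actually ran: `time_command` is a hypothetical helper name, it blocks on `Popen.wait()` instead of calling `poll()` and `sleep()`, and the demo uses a stand-in command so that it runs without tesseract installed.

```python
import subprocess
import sys
from time import perf_counter


def time_command(cmd):
    """Run cmd, block until it exits, and return (exit code, elapsed seconds)."""
    start = perf_counter()
    proc = subprocess.Popen(cmd)
    returncode = proc.wait()  # blocks until the child process exits
    return returncode, perf_counter() - start


if __name__ == '__main__':
    # Stand-in command (an assumption for illustration); replace it with
    # ['tesseract', 'mypic.png', 'myop'] for the real measurement.
    code, elapsed = time_command([sys.executable, '-c', 'pass'])
    print('Took :' + str(elapsed))
```

Starting the timer before `Popen` also includes the process start-up cost, which the loop above leaves out.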
    lst = []
    for i in range(4):
        lst.append(subprocess.Popen(['tesseract', 'mypic.png', 'myop' + str(i)]))

    i = 0
    l = len(lst)
    val = 0
    while val != (1 << l) - 1:
        if lst[i].poll() is None:
            print('Waiting for :' + str(i))
            sleep(0.01)
        else:
            temp = val
            val = val or (1 << i)
            if val != temp:
                print('Completed for :' + str(i))
        i = (i + 1) % l
This code makes 4 calls to tesseract, saves the process objects in a list lst, and cycles through those objects until all of them have completed. An explanation of how the loop is implemented is given at the bottom.
The problem is that the latter program takes an extremely long time to complete. It keeps waiting for the processes to finish using the poll() function, which returns None until the process has completed. This should not happen. It should take only a little more than 0.44s, not something like 10 minutes! Why is this happening?
I ran into this specific problem while digging into pytesseract, which was taking a long time when run in parallel using multiprocessing or pathos. So this is a scaled-down version of a much bigger issue. My question on that can be found here.
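For comparison, the intent described above (start every process, then wait until each one has exited) can be sketched without any bookkeeping by calling `wait()` on each process in turn. This is an assumption about an equivalent approach, not the code I am asking about: `run_all` is a hypothetical helper, and the commands in the demo are stand-ins for the tesseract calls.

```python
import subprocess
import sys


def run_all(commands):
    """Start every command at once, then block until each one has exited."""
    procs = [subprocess.Popen(cmd) for cmd in commands]
    # wait() returns each child's exit code; the children run concurrently,
    # so the total time is governed by the slowest one.
    return [proc.wait() for proc in procs]


if __name__ == '__main__':
    # Stand-in commands (an assumption for illustration); for the real case use
    # ['tesseract', 'mypic.png', 'myop' + str(i)] for i in range(4).
    commands = [[sys.executable, '-c', 'pass'] for _ in range(4)]
    print(run_all(commands))
```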
Explanation of the loop:
val is 0 initially. It is ORed with 2^i when the i-th process completes. So, if there are 3 processes and the first process (i=0) completes, then 2^0 = 1 is ORed with val, making it 1. Once the second and third processes have completed as well, val becomes 2^0 | 2^1 | 2^2 = 7, and 2^3 - 1 is also 7. So the loop runs until val equals 2^{number of processes} - 1.
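The arithmetic above can be checked in isolation. This is a minimal sketch, assuming that "ORed" means the bitwise operator `|` combined with the shift `<<`; `all_done_mask` and `mark_done` are hypothetical names used only for illustration.

```python
def all_done_mask(n):
    """Bitmask with the lowest n bits set, i.e. 2**n - 1."""
    return (1 << n) - 1


def mark_done(val, i):
    """Return val with bit i set, using bitwise OR."""
    return val | (1 << i)


if __name__ == '__main__':
    val = 0
    for i in range(3):
        val = mark_done(val, i)
        print(i, bin(val))
    # True once every one of the 3 bits has been set
    print(val == all_done_mask(3))
```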
