Parallel python iteration

Question

I want to create a number of instances of a class based on values in a pandas.DataFrame. This I've got down.

import itertools
import multiprocessing as mp
import pandas as pd

class Toy:
    id_iter = itertools.count(1)

    def __init__(self, row):
        self.id = next(self.id_iter)  # built-in next() works on Python 2.6+ and 3
        self.type = row['type']

if __name__ == "__main__":

    table = pd.DataFrame({
        'type': ['a', 'b', 'c'],
        'number': [5000, 4000, 30000]
        })

    for index, row in table.iterrows():
        [Toy(row) for _ in range(row['number'])]

Multiprocessing Attempts

I've been able to parallelize this (sort of) by adding the following:

pool = mp.Pool(processes=mp.cpu_count())
m = mp.Manager()
q = m.Queue()

for index, row in table.iterrows():
    pool.apply_async([Toy(row) for _ in range(row['number'])])

It seems this would only be faster if the values in row['number'] were substantially larger than the number of rows in table. But in my actual case, table is thousands of rows long, and each row['number'] is relatively small.

It seems smarter to break table up into cpu_count() chunks and iterate within each chunk. But now we're at the edge of my Python skills.

I've tried things that the python interpreter screams at me for, like:

pool.apply_async(
        for index, row in table.iterrows(): 
        [Toy(row) for _ in range(row['number'])]
        )

I've also tried things that "can't be pickled":

Parallel(n_jobs=4)(
    delayed(Toy)([row for _ in range(row['number'])]) \
            for index, row in table.iterrows()
)

Edit

This may have gotten me a little bit closer, but I'm still not there. I create the class instances in a separate function,

def create_toys(row):
    [Toy(row) for _ in range(row['number'])]

....

Parallel(n_jobs=4, backend="threading")(
    (create_toys)(row) for i, row in table.iterrows()
)

but I'm told 'NoneType' object is not iterable.
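
A likely cause, sketched here under stated assumptions: create_toys has no return statement, so calling it yields None, and the generator above calls it directly instead of wrapping it in joblib's delayed. Assuming Parallel tries to unpack each item it receives as a (function, args, kwargs) task, it ends up unpacking None. The snippet below reproduces that with plain Python only; the dict row and the name create_toys_no_return are stand-ins for illustration.

```python
import itertools


class Toy:
    id_iter = itertools.count(1)

    def __init__(self, row):
        self.id = next(self.id_iter)
        self.type = row['type']


def create_toys_no_return(row):
    # Same shape as the function above: the list is built, then discarded.
    [Toy(row) for _ in range(row['number'])]


def create_toys(row):
    # Returning the list gives the caller something to collect.
    return [Toy(row) for _ in range(row['number'])]


row = {'type': 'a', 'number': 3}  # stand-in for one DataFrame row

missing = create_toys_no_return(row)  # None, because nothing is returned
toys = create_toys(row)               # a list of three Toy instances

# Unpacking None the way a task tuple would be unpacked raises TypeError.
try:
    function, args, kwargs = missing
    unpack_failed = False
except TypeError:
    unpack_failed = True

# With joblib, the call would be deferred rather than executed in place:
#   Parallel(n_jobs=4, backend="threading")(
#       delayed(create_toys)(row) for i, row in table.iterrows()
#   )
```

With delayed in place and a return value in create_toys, Parallel should hand back one list of toys per row.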

Did you see this question? http://stackoverflow.com/questions/26784164/solved-pandas-multiprocessing-apply — JD Long, Jun 09 '15 at 19:36
I can see how that applies, but I can't quite coerce it to my problem. — gregmacfarlane, Jun 09 '15 at 20:15
You create a number of `Toy` instances, but it looks like you just throw them away. It's not clear why you're doing any of this, which makes it hard to suggest ways to do it better. — user2357112, Jun 10 '15 at 01:43
In my real case the class calls a `write` method that writes the instance to an xml tree. That's an entirely different question... — gregmacfarlane, Jun 10 '15 at 01:48

score 3 · Accepted Answer · answered Jun 10 '15 at 01:35

It's a little unclear to me what output you are expecting. Do you just want a big list of the form

[Toy(row_1) ... Toy(row_n)]

where each Toy(row_i) appears with multiplicity row_i.number?
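
To make "multiplicity" concrete, here is a tiny sketch with made-up numbers. The dicts stand in for DataFrame rows, and the type string stands in for a Toy instance; both are assumptions for illustration only.

```python
# Stand-in rows with small counts so the result is easy to read.
rows = [
    {'type': 'a', 'number': 2},
    {'type': 'b', 'number': 1},
    {'type': 'c', 'number': 3},
]

# Each row contributes row['number'] entries, in row order.
expanded = [row['type'] for row in rows for _ in range(row['number'])]
# expanded == ['a', 'a', 'b', 'c', 'c', 'c']
```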

Based on the answer mentioned by @JD Long I think you could do something like this:

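import multiprocessing as mp
import numpy as np
import pandas as pd

# Toy is the class defined in the question.

def process(df):
    # Build the Toy instances for one chunk of the table.
    L = []
    for index, row in df.iterrows():
        L += [Toy(row) for _ in range(row['number'])]
    return L

if __name__ == "__main__":
    table = pd.DataFrame({
        'type': ['a', 'b', 'c']*10,
        'number': [5000, 4000, 30000]*10
        })

    p = mp.Pool(processes=8)
    # one chunk of rows per worker process
    split_dfs = np.array_split(table, 8)
    pool_results = p.map(process, split_dfs)
    p.close()
    p.join()

    # merging parts processed by different processes
    result = [a for L in pool_results for a in L]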
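
If the nested comprehension in the merging step is hard to read, itertools.chain.from_iterable from the standard library does the same flattening. A minimal sketch, where pool_results holds stand-in strings rather than real Toy instances:

```python
from itertools import chain

# Stand-in for what p.map returns: one list per chunk.
pool_results = [['a1', 'a2'], ['b1'], ['c1', 'c2', 'c3']]

# Nested comprehension: the loops read left to right, outer loop first.
nested = [a for L in pool_results for a in L]

# Equivalent flattening with the standard library.
flat = list(chain.from_iterable(pool_results))
```

Both forms keep the order of the chunks and the order of items within each chunk.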

answered Jun 10 '15 at 01:35

maxymoo

This is exactly what I needed, though that last line took me a long time to figure out. I ended up on [this question](http://stackoverflow.com/questions/406121/flattening-a-shallow-list-in-python) before seeing you had already covered what I needed! – gregmacfarlane Jun 10 '15 at 12:27
Nice one, I actually quite dislike that syntax, I find it quite unreadable and I can never remember which order the loops run in. (not sure how I'd do it differently though) – maxymoo Jun 10 '15 at 23:21
Can you please have a look at this question: https://stackoverflow.com/questions/53561794/iteration-over-a-pandas-df-in-parallel – ak3191 Dec 17 '18 at 18:36
