I am doing a calculation on permutations of things from a generator created by itertools. I have a piece of code in this form (this is a dummy example):
import itertools
import pandas as pd
combos = itertools.permutations('abcdefghi',2)
results = []
i=0
for combo in combos:
    i+=1 #this line is actually other stuff that's expensive
    results.append([combo[0]+'-'+combo[1],i])
rdf = pd.DataFrame(results, columns=['combo','value'])
Except in the real code,
- there are several hundred thousand permutations
- instead of i+=1, I am opening files and getting results of clf.predict, where clf is a classifier trained in scikit-learn
- in place of i, I'm storing a value from that prediction
I think the combo[0]+'-'+combo[1] is trivial though.
This takes too long. What should I do to make it faster? The options I have in mind are:
1) writing better code (maybe I should initialize results with the proper length instead of using append, but how much will that help? And what's the best way to do that when I don't know the length before iterating through combos?)
2) initializing a pandas dataframe instead of a list and using apply?
3) using cython in pandas? Total newbie to this.
4) parallelizing? I think I probably need to do this, but again, total newbie, and I don't know whether it's better to do it within a list or a pandas dataframe. I understand I would need to iterate over the generator and initialize some kind of container before parallelizing.
Which combination of these options would be best and how can I put it together?
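For concreteness, here is a minimal sketch of how options 1 and 4 might be combined, using only the standard library's concurrent.futures on top of the dummy example. The names expensive_value, score and build_frame are hypothetical, and expensive_value is a placeholder for the real work (opening files and calling clf.predict). It assumes each permutation can be processed independently. The number of permutations is known up front as math.perm(n, 2) on Python 3.8+, although building the list in one go makes preallocation unnecessary here.

```python
import itertools
import math
from concurrent.futures import ProcessPoolExecutor

import pandas as pd


def expensive_value(combo):
    """Hypothetical placeholder for the real work (file reads, clf.predict)."""
    first, second = combo
    return ord(first) + ord(second)  # cheap stand-in computation


def score(combo):
    # Return label and value together so each row stays aligned with its combo.
    return combo[0] + '-' + combo[1], expensive_value(combo)


def build_frame(items, workers=None, chunksize=1000, serial=False):
    """Score every 2-permutation of items and collect the rows in a DataFrame."""
    combos = itertools.permutations(items, 2)
    if serial:
        # Plain loop equivalent, useful as a baseline for timing.
        rows = list(map(score, combos))
    else:
        # Worker processes; chunksize batches combos to cut per-task overhead.
        with ProcessPoolExecutor(max_workers=workers) as pool:
            rows = list(pool.map(score, combos, chunksize=chunksize))
    return pd.DataFrame(rows, columns=['combo', 'value'])


# The row count is known without iterating: n! / (n - 2)!
expected_rows = math.perm(len('abcdefghi'), 2)
```

The parallel path would be called as build_frame('abcdefghi'), placed under an if __name__ == '__main__': guard on platforms that spawn worker processes. Whether it beats the serial path depends on how expensive each call really is compared with the cost of sending work to the processes, so timing both on a subset seems worthwhile.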
 
    