I have a list of lists of tuples. Each tuple has the form (string,int), e.g.
lst = list()
lst.append([('a',5),('c',10),('d',3),('b',1)])
lst.append([('c',14),('f',4),('b',1)])
lst.append([('d',22),('f',2)])
Think of the int's as counts of each string in different blocks of text.
What I need to do is produce a list of top-N occurring strings together with their cumulative counts. So in the example above, a appears 5 times, b appears twice, c appears 24 times etc. If N=2, then I would have to produce either a pair of parallel lists ['d','c'] and [25,24] or a list of tuples [('d',25),('c',24)]. I need to do it as quickly as possible. My machine has lots of RAM so memory is not an issue.
I have this implementation:
import numpy as np
def getTopN(lst,N):
    sOut = []
    cOut = []
    for l in lst:
        for tpl in l:
            s = tpl[0]
            c = tpl[1]
            try:
                i = sOut.index(s)
                cOut[i] += c
            except:
                sOut.append(s)
                cOut.append(c)
    sIndAsc = np.argsort(cOut).tolist()
    sIndDes = sIndAsc[::-1]
    cOutDes = [cOut[sir] for sir in sIndDes]
    sOutDes = [sOut[sir] for sir in sIndDes]
    return sOutDes[0:N],cOutDes[0:N]
There's gotta be a better way, but what would it be?
 
     
     
    