create row number by group, using python datatable

Question

If I have a python datatable like this:

from datatable import f, dt
data = dt.Frame(grp=["a","a","b","b","b","b","c"], value=[2,3,1,2,5,9,2])

how do I create an new column that has the row number, by group?. That is, what is the equivalent of R data.table's

data[, id:=1:.N, by=.(grp)]

This works, but seems completely ridiculous

data['id'] = np.concatenate(
                [np.arange(x)
                    for x in data[:,dt.count(), dt.by(f.grp)]['count'].to_numpy()])

desired output:

   | grp    value     id
   | str32  int32  int64
-- + -----  -----  -----
 0 | a          2      0
 1 | a          3      1
 2 | b          1      0
 3 | b          2      1
 4 | b          5      2
 5 | b          9      3
 6 | c          2      0

score 2 · Answer 1 · answered Oct 18 '21 at 21:34

2

One approach is to convert to_pandas, groupby (on the pandas DataFrame) and use cumcount:

import datatable as dt

data = dt.Frame(grp=["a", "a", "b", "b", "b", "b", "c"], value=[2, 3, 1, 2, 5, 9, 2])

data["id"] = data.to_pandas().groupby("grp").cumcount()
print(data)

Output

   | grp    value     id
   | str32  int32  int64
-- + -----  -----  -----
 0 | a          2      0
 1 | a          3      1
 2 | b          1      0
 3 | b          2      1
 4 | b          5      2
 5 | b          9      3
 6 | c          2      0
[7 rows x 3 columns]

answered Oct 18 '21 at 21:34

Dani Mesejo

61,499
6
49
76

1

appreciate the quick response, but I was hoping for something fast.. On my ~60million row dataset, with about 3 million groups, the R approach takes about ~2.5 seconds, while both your pandas approach, and my verbose numpy concatenate arange approach takes ~30 seconds... – langtang Oct 18 '21 at 22:17
Out of curiosity why are you using datatable and not pandas directly? – Dani Mesejo Oct 18 '21 at 22:19
2

speed. I find pandas intolerably slow. – langtang Oct 18 '21 at 23:02

sammywemmy · Accepted Answer · 2022-08-03T00:40:14.283

Update:

Datatable now has a cumcount function in dev :

data[:, [f.value, dt.cumcount()], 'grp']

   | grp    value     C0
   | str32  int32  int64
-- + -----  -----  -----
 0 | a          2      0
 1 | a          3      1
 2 | b          1      0
 3 | b          2      1
 4 | b          5      2
 5 | b          9      3
 6 | c          2      0
[7 rows x 3 columns]

Old Answer:

datatable does not have a cumulative count function, in fact there is no cumulative function for any aggregation at the moment.

One way to possibly improve the speed is to use a faster iteration of numpy, where the for loop is done within C, and with more efficiency. The code is from here and modified for this purpose:

from datatable import dt, f, by
import numpy as np

In [244]: def create_ranges(indices):
     ...:     cum_length = indices.cumsum()
     ...:     ids = np.ones(cum_length[-1], dtype=int)
     ...:     ids[0] = 0
     ...:     ids[cum_length[:-1]] = -1 * indices[:-1] + 1
     ...:     return ids.cumsum()


counts =  data[:, dt.count(), by('grp', add_columns=False)].to_numpy().ravel()
data[:, f[:].extend({"counts" : create_ranges(counts)})]

   | grp    value  counts
   | str32  int32   int64
-- + -----  -----  ------
 0 | a          2       0
 1 | a          3       1
 2 | b          1       0
 3 | b          2       1
 4 | b          5       2
 5 | b          9       3
 6 | c          2       0
[7 rows x 3 columns]

The create_ranges function is wonderful (the logic built on cumsum is nice) and really kicks in as the array size increases.

Of course this has its drawbacks; you are stepping out of datatable into numpy territory and then back into datatable; the other aspect is that I am banking on the fact that the groups are sorted lexically; this won't work if the data is unsorted (and has to be sorted on the grouping column).

Preliminary tests show a marked improvement in speed; again it is limited in scope and it would be much easier/better if this was baked into the datatable library.

If you are good with C++, you could consider contributing this function to the library; I and so many others would appreciate your effort.

You could have a look at pypolars and see if it helps with your use case. From the h2o benchmarks it looks like a very fast tool.

Fantastic! Works like a charm.. True, have to do a sort prior to this, but in my instance, that was required anyway, because I need a certain order with in the groups. On my 60 million row dataset, this was reduced from ~25 seconds (with my loop above) to about 1.2 seconds.. Thanks for all the help — langtang, Oct 22 '21 at 13:43

create row number by group, using python datatable

2 Answers2