I would like to create a subindex for my dataframe based on the index. For example, I have a dataframe like this:
      Content        Date
ID                       
Bob  birthday  2010.03.01
Bob    school  2010.04.01
Tom  shopping  2010.02.01
Tom      work  2010.09.01
Tom   holiday  2010.10.01
I'd like create a subindex for for my ID and the resulting dataframe looks like below:
               Content        Date
ID  subindex                      
Bob 1         birthday  2010.03.01
    2           school  2010.04.01
Tom 1         shopping  2010.02.01
    2             work  2010.09.01
    3          holiday  2010.10.01
To do this I need to first create my subindex list. I searched in the help document and it seems to most neat way is to use transform:
subindex = df['Date'].groupby(df.index).transform(lambda x: np.arange(1, len(x) + 1))
However, it is really slow. I looked around and found apply can do the work too:
subindex = df['Date'].groupby(df.index).apply(lambda x: np.arange(1, len(x) + 1))
Of course the subindex needed to be flattened since it is a list of lists here. This works much faster than the transform method. Then I tested with a for loop of my own:
subindex_size = df.groupby(df.index, sort = False).size()
subindex = []
for i in np.arange(len(subindex_size)):
    subindex.extend(np.arange(1,subindex_size[i]+1))
It's even faster. With my larger dataset (about 90k rows), the transform method takes about 44 secs on my computer, apply takes ~2 secs and the for loop takes only ~1 secs. I need to work on much larger dataset so even the time difference between the apply and for loop makes a difference to me. However, the for loop looks ugly and may not be easily applied if I need to create other group-based variables.
So my question is, why the built-in functions that are supposed to do the right thing are slower? Am I missing something here or is there a reason for this? Is there any other way to improve this process?
 
     
    