Pandas dataframe: how to summarize columns containing value

Question

Here is my dataframe:

import pandas as pd

df = pd.DataFrame(
    {"mat": ['A', 'A', 'A', 'A', 'B'],
     "ppl": ['P', 'P', 'P', '',  'P'],
     "ia1": ['',  'X', 'X', '',  'X'],
     "ia2": ['X', '',  '',  'X', 'X']},
    index=[1, 2, 3, 4, 5])

I want to select the unique combinations of values in the first two columns. I do:

df2 = df.loc[:,['mat','ppl']].drop_duplicates(subset=['mat','ppl']).sort_values(by=['mat','ppl'])

I get, as expected:

  mat ppl
4   A    
1   A   P
5   B   P

What I want now is for df3 to be:

 mat ppl ia1 ia2
   A           X
   A   P   X   X
   B   P   X   X

That is: in df3, the row for A+P has an X in column ia1 because at least one row of df with A+P has an X in column ia1.

Actually, very close to question http://stackoverflow.com/questions/14246817/python-pandas-custom-agg-function — thdox, Apr 07 '17 at 13:15

jezrael · Accepted Answer · 2017-04-26T14:41:56.040


A solution using agg and unique; if a group has several unique values, they are joined with a comma:

df = df.groupby(['mat','ppl']).agg(lambda x: ','.join(x[x != ''].unique())).reset_index()
print (df)
  mat ppl ia1 ia2
0   A           X
1   A   P   X   X
2   B   P   X   X

Explanation:

agg calls the aggregation function once per column per group, passing that column's values as a Series, and expects a scalar back. The custom function first filters out the empty strings by boolean indexing (x[x != '']), then takes the unique values. join turns the result into a scalar: it returns an empty string when the filtered Series is empty (all values in the group were empty strings), and if there are several unique values it combines them into one comma-separated string.
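To make the steps above concrete, here is a minimal sketch that applies them by hand to a single group of the question's dataframe. The variable names (x, non_empty, uniques, joined, empty_group, empty_joined) are illustrative only and are not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame(
    {"mat": ["A", "A", "A", "A", "B"],
     "ppl": ["P", "P", "P", "", "P"],
     "ia1": ["", "X", "X", "", "X"],
     "ia2": ["X", "", "", "X", "X"]},
    index=[1, 2, 3, 4, 5])

# The ia1 values of the group mat == 'A', ppl == 'P',
# i.e. roughly what agg hands to the function for that group and column.
x = df.loc[(df["mat"] == "A") & (df["ppl"] == "P"), "ia1"]

non_empty = x[x != ""]        # boolean indexing drops the empty strings
uniques = non_empty.unique()  # distinct remaining values
joined = ",".join(uniques)    # collapse them into one scalar string

# A group whose ia1 values are all empty: join on nothing gives ''.
empty_group = df.loc[(df["mat"] == "A") & (df["ppl"] == ""), "ia1"]
empty_joined = ",".join(empty_group[empty_group != ""].unique())

print(repr(joined), repr(empty_joined))
```

Here joined corresponds to the ia1 cell of the A+P row in the result, and empty_joined to the ia1 cell of the row where ppl is empty.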

For testing, you can use a named function that does the same thing as the lambda:

def f(x):
    # x is one column of one group (a Series); drop '' and join the unique values
    a = ','.join(x[x != ''].unique().tolist())
    return a

df = df.groupby(['mat','ppl']).agg(f).reset_index()
print (df)
  mat ppl ia1 ia2
0   A           X
1   A   P   X   X
2   B   P   X   X

As the OP mentioned in a comment:

Instead of using lambda x: ','.join(x[x != ''].unique()), I used lambda x: ','.join(set(x)-set([''])). I went from 13min 5s to 43.2 s
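For reference, the set-based variant from the comment can be sketched as a complete snippet on the question's data. The function name join_unique_set and the use of sorted() are additions for illustration: a plain set has no guaranteed order, so sorting keeps the output deterministic when a group has more than one distinct value. The timings quoted above are the OP's and are not reproduced here.

```python
import pandas as pd

df = pd.DataFrame(
    {"mat": ["A", "A", "A", "A", "B"],
     "ppl": ["P", "P", "P", "", "P"],
     "ia1": ["", "X", "X", "", "X"],
     "ia2": ["X", "", "", "X", "X"]},
    index=[1, 2, 3, 4, 5])


def join_unique_set(x):
    """Join the distinct non-empty values of one group's column."""
    # The set difference removes the empty string; sorted() fixes the order.
    return ",".join(sorted(set(x) - {""}))


result = df.groupby(["mat", "ppl"]).agg(join_unique_set).reset_index()
print(result)
```

On this small dataframe the result has the same three rows as the unique-based version.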

edited Apr 26 '17 at 14:41

answered Apr 07 '17 at 12:21

jezrael


Can you please explain the `lambda x: ','.join(x[x != ''].unique())` ? – thdox Apr 07 '17 at 13:01
Please check answer. – jezrael Apr 07 '17 at 13:06
What I was not understanding is that `x` is representing all columns to aggregate. – thdox Apr 07 '17 at 13:16
Hmmm, I think if no column is specified, as it is in `df = df.groupby(['mat','ppl']).agg({'ia1':f}).reset_index()` or `df = df.groupby(['mat','ppl'])['ia1'].agg(f).reset_index()`, then `agg` uses all columns and applies the aggregate function to each. Btw, thank you. – jezrael Apr 07 '17 at 13:19
Well, this is hugely slow on a dataframe with 100K rows and groupby on 10 columns + 4 columns to aggregate. – thdox Apr 07 '17 at 13:20
Hmmm, it is really large df and complicated function. I am afraid about it :( – jezrael Apr 07 '17 at 13:21
Instead of using `lambda x: ','.join(x[x != ''].unique())`, I used `lambda x: ','.join(set(x)-set(['']))`. I went from 13min 5s to 43.2 s. @jezrael Can you update answer? – thdox Apr 26 '17 at 14:38
Good news. Sure. But I am on phone only. – jezrael Apr 26 '17 at 14:40
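The two ways of restricting the aggregation to a single column that are mentioned in the comments can be compared with a short sketch. It reuses the question's dataframe and a function f like the one in the answer; the names by_dict and by_selection are illustrative only.

```python
import pandas as pd

df = pd.DataFrame(
    {"mat": ["A", "A", "A", "A", "B"],
     "ppl": ["P", "P", "P", "", "P"],
     "ia1": ["", "X", "X", "", "X"],
     "ia2": ["X", "", "", "X", "X"]},
    index=[1, 2, 3, 4, 5])


def f(x):
    # x is one column of one group; drop '' and join the unique values
    return ",".join(x[x != ""].unique().tolist())


# Form 1: a dict maps column names to the aggregation function.
by_dict = df.groupby(["mat", "ppl"]).agg({"ia1": f}).reset_index()

# Form 2: select the column on the groupby object before aggregating.
by_selection = df.groupby(["mat", "ppl"])["ia1"].agg(f).reset_index()

print(by_dict)
print(by_selection)
```

Both forms aggregate only ia1, so ia2 does not appear in either result. Calling agg(f) with no column selection aggregates every non-key column, as in the answer.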
