Imagine i have the following dataframe:
import pandas as pd
df = pd.DataFrame({'col1': ['a','b','c','d','e','f','g','h','i','j','k','l'], 'col2': [1,1,1,2,2,3,3,3,4,5,5,6]})
col1 col2
0 a 1
1 b 1
2 c 1
3 d 2
4 e 2
5 f 3
6 g 3
7 h 3
8 i 4
9 j 5
10 k 5
11 l 6
if i use this code:
df[df.col2.isin(df.groupby('col2').size().head(3).index)]
i can retrieve the 3 most frecuent categories in col2.
EDIT:
What i would like to do is to filter the dataframe in such a way that only the most frecuent categories of col2 remains in col2. Then i would like to create dummy columns for each category, indicating how many records of the same category there are per category and per letter in col1.
This would be the resulting dataframe:
col1 col2_1 col2_2 col2_3 rest_count
0 a 1 0 0 0
1 b 1 0 0 0
2 c 1 0 0 0
3 d 0 1 0 0
4 e 0 1 0 0
5 f 0 0 1 0
6 g 0 0 1 0
7 h 0 0 1 0
8 i 0 0 0 1
9 j 0 0 0 1
10 k 0 0 0 1
11 l 0 0 0 1
How can i store a count of the rest of categories way in a newly created column rest_count??
Thanks in advance