Here are my 2 cents. Assume that we have the following unbalanced dataset:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Category': np.random.choice(['A','B','C'], size=1000, replace=True, p=[0.3, 0.5, 0.2]),
                   'Sentiment': np.random.choice([0,1], size=1000, replace=True, p=[0.35, 0.65]),
                   'Gender': np.random.choice(['M','F'], size=1000, replace=True, p=[0.70, 0.30])})
print(df.head())
The first rows:
Category Sentiment Gender
0 C 1 M
1 B 0 M
2 B 0 M
3 B 0 M
4 A 0 M
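Since the data is drawn at random, the exact rows and counts will differ from run to run. As a rough sketch (the seed value 42 and the names rng, df_seeded and sentiment_share are my own choices, not part of the setup above), a seeded generator makes the dataset reproducible, and value_counts shows how unbalanced Sentiment is before any resampling:

```python
import numpy as np
import pandas as pd

# Assumption: seed 42 is arbitrary; any fixed seed makes the data reproducible
rng = np.random.default_rng(42)
n_rows = 1000

df_seeded = pd.DataFrame({
    'Category': rng.choice(['A', 'B', 'C'], size=n_rows, p=[0.3, 0.5, 0.2]),
    'Sentiment': rng.choice([0, 1], size=n_rows, p=[0.35, 0.65]),
    'Gender': rng.choice(['M', 'F'], size=n_rows, p=[0.70, 0.30]),
})

# Share of each Sentiment label; roughly 0.35 / 0.65 is expected
sentiment_share = df_seeded['Sentiment'].value_counts(normalize=True)
print(sentiment_share)
```

The shares will not be exactly 0.35 and 0.65, because each label is sampled independently.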
Now assume that we want to balance the dataset by Sentiment, by downsampling each class to the size of the smallest one:
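df_grouped_by = df.groupby(['Sentiment'])
# Size of the smallest class; every class is downsampled to this many rows
n_min = df_grouped_by.size().min()
df_balanced = df_grouped_by.apply(lambda x: x.sample(n_min).reset_index(drop=True))
# Drop the 'Sentiment' level that groupby.apply adds to the index
df_balanced = df_balanced.droplevel(['Sentiment'])
print(df_balanced.head())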
The first rows of the balanced dataset:
Category Sentiment Gender
0 C 0 F
1 C 0 M
2 C 0 F
3 C 0 M
4 C 0 M
Let's verify that it is balanced in terms of Sentiment:
df_balanced.groupby(['Sentiment']).size()
We get:
Sentiment
0 369
1 369
dtype: int64
As we can see, we ended up with 369 positive and 369 negative Sentiment labels.
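As a side note, on pandas 1.1 or newer the same downsampling can probably be written more compactly with GroupBy.sample, which keeps the original columns and index, so no droplevel is needed. This is only a sketch: balance_by and the small demo frame are hypothetical names I made up for illustration.

```python
import pandas as pd

def balance_by(frame, column, random_state=None):
    """Downsample every group in `column` to the size of the smallest group."""
    n_min = frame[column].value_counts().min()
    # GroupBy.sample draws n_min rows per group and keeps the original columns
    return frame.groupby(column).sample(n=n_min, random_state=random_state)

# Small deterministic example: 7 rows of label 1, 3 rows of label 0
demo = pd.DataFrame({
    'Sentiment': [1] * 7 + [0] * 3,
    'Gender': ['M', 'F'] * 5,
})

demo_balanced = balance_by(demo, 'Sentiment', random_state=0)
print(demo_balanced.groupby('Sentiment').size())
```

Here each label ends up with three rows, the size of the smaller class. Passing random_state is optional and only fixes which rows get picked.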