I want to segment a dataset of items (identified by an ID) that have several categorical features, each taking its own set of values (for instance, color takes 'Blue', 'Orange', 'Green'; size takes 'S', 'M', 'L'; brand takes 'Brand 1', 'Brand 2', etc.):
| ID | Brand | Color | Size | Price |
|---|---|---|---|---|
| 1 | Brand 1 | Orange | S | 23 |
| 2 | Brand 2 | Blue | XXL | 3 |
| 3 | Brand 1 | Green | XXXL | 45 |
| 4 | Brand 2 | Blue | M | 200 |
I can easily do it by hand for 1 or 2 features (with a small number of values). For example, if I segment by brand I get:
| ID | Brand | Color | Size | Price |
|---|---|---|---|---|
| 1 | Brand 1 | Orange | S | 23 |
| 3 | Brand 1 | Green | XXXL | 45 |
and
| ID | Brand | Color | Size | Price |
|---|---|---|---|---|
| 2 | Brand 2 | Blue | XXL | 3 |
| 4 | Brand 2 | Blue | M | 200 |
Unfortunately, some features take 10+ values. Moreover, the number of subsets explodes if I segment by more than one feature. I want to test different levels of segmentation (e.g. color + brand, color + brand + size), which is why doing it by hand is not an option.
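To give an idea of how fast the number of subsets grows, here is a rough sketch that counts the non-empty segments for every combination of features. It assumes the data sits in a pandas DataFrame with the column names from the example table; the names `features` and `segment_counts` are only illustrative.

```python
from itertools import combinations

import pandas as pd

# Example data copied from the table above (pandas is an assumption).
df = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Brand": ["Brand 1", "Brand 2", "Brand 1", "Brand 2"],
    "Color": ["Orange", "Blue", "Green", "Blue"],
    "Size": ["S", "XXL", "XXXL", "M"],
    "Price": [23, 3, 45, 200],
})

features = ["Brand", "Color", "Size"]

# For every combination of 1, 2 or 3 features, count how many distinct
# value combinations actually occur in the data (empty ones are not counted).
segment_counts = {}
for r in range(1, len(features) + 1):
    for combo in combinations(features, r):
        segment_counts[combo] = df.groupby(list(combo)).ngroups

for combo, n in segment_counts.items():
    print(" + ".join(combo), "->", n, "segments")
```

With 3 features there are already 7 feature combinations to look at, and each one can contain as many segments as there are observed value combinations.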
I am trying to write a function that takes the dataframe and a list of features as input and outputs all the different subsets, but so far I have nothing that works.
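For reference, a minimal sketch of the kind of function described here, assuming the data is a pandas DataFrame and relying on `DataFrame.groupby`. The function name `segment` and the choice of a dict as the return type are hypothetical, not something settled in this post.

```python
import pandas as pd


def segment(df, features):
    """Split df into one sub-DataFrame per observed combination of `features`.

    Returns a dict mapping a tuple of feature values to the matching rows.
    Rows with a missing value in any of the features are left out, which is
    the default behaviour of groupby (dropna=True).
    """
    segments = {}
    for key, subset in df.groupby(list(features), sort=False):
        # Older pandas versions return a scalar key when grouping by a
        # single feature; wrap it so that keys are always tuples.
        if not isinstance(key, tuple):
            key = (key,)
        segments[key] = subset
    return segments


# Example data copied from the table at the top of the post.
df = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Brand": ["Brand 1", "Brand 2", "Brand 1", "Brand 2"],
    "Color": ["Orange", "Blue", "Green", "Blue"],
    "Size": ["S", "XXL", "XXXL", "M"],
    "Price": [23, 3, 45, 200],
})

by_brand = segment(df, ["Brand"])
by_brand_color = segment(df, ["Brand", "Color"])

for key, subset in by_brand_color.items():
    print(key)
    print(subset, end="\n\n")
```

Here `by_brand[("Brand 1",)]` holds the rows with IDs 1 and 3, matching the hand-made split shown earlier. Only combinations that actually occur in the data produce a subset, so no empty segments are generated.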
Thank you in advance if you think you can help me!