sklearn stratified sampling based on a column

Question

I have a fairly large CSV file of Amazon review data, which I read into a pandas DataFrame. I want to split the data 80-20 (train-test), and in doing so I want the split to represent the values of one column (Categories) proportionally, i.e. every review category should be present in both the train and test data in the same proportions.

The data looks like this:

**ReviewerID**       **ReviewText**        **Categories**       **ProductId**

1212                   good product         Mobile               14444425
1233                   will buy again       drugs                324532
5432                   not recomended       dvd                  789654123

I'm using the following code to do so:

import pandas as pd
Meta = pd.read_csv('C:\\Users\\xyz\\Desktop\\WM Project\\Joined.csv')
import numpy as np
from sklearn.cross_validation import train_test_split

train, test = train_test_split(Meta.categories, test_size = 0.2, stratify=y)

It gives the following error:

NameError: name 'y' is not defined

As I'm relatively new to Python, I can't figure out what I'm doing wrong, or whether this code will stratify based on the Categories column. It seems to work fine when I remove both the stratify option and the categories column from the train-test split.

Any help will be appreciated.

You haven't defined `y` before using it in `train_test_split`. — Quazi Marufur Rahman, May 03 '16 at 07:01
You need to define the variable y first. From the sklearn page: stratify : array-like or None (default is None). If not None, data is split in a stratified fashion, using this as the labels array. So y has to be the labels that you are using. — nEO, May 03 '16 at 07:07
The categories column is your y, and you need to split the data (X and y). You are not doing any split on the data right now. — nEO, May 03 '16 at 07:20

nEO · Accepted Answer · 2018-12-20T23:09:43.547

    >>> import pandas as pd
    >>> Meta = pd.read_csv('C:\\Users\\*****\\Downloads\\so\\Book1.csv')
    >>> import numpy as np
    >>> from sklearn.model_selection import train_test_split
    >>> y = Meta.pop('Categories')
    >>> Meta
        ReviewerID      ReviewText  ProductId
        0        1212    good product   14444425
        1        1233  will buy again     324532
        2        5432  not recomended  789654123
    >>> y
        0    Mobile
        1     drugs
        2       dvd
        Name: Categories, dtype: object
    >>> X = Meta
    >>> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42, stratify=y)
    >>> X_test
        ReviewerID    ReviewText  ProductId
        0        1212  good product   14444425
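
The three-row sample above has one row per category, which is too small to show the proportions being preserved (as far as I know, scikit-learn needs at least two rows per class to stratify at all). Here is a minimal sketch of the same X/y approach on a larger frame. The data is hypothetical (100 rows, 60% Mobile, 30% drugs, 10% dvd) and only the column names come from the question.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the review data: 60% Mobile, 30% drugs, 10% dvd.
Meta = pd.DataFrame({
    "ReviewerID": range(100),
    "Categories": ["Mobile"] * 60 + ["drugs"] * 30 + ["dvd"] * 10,
})

y = Meta.pop("Categories")  # labels to stratify on
X = Meta                    # the remaining columns

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Share of each category in the two halves.
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))
```

Both printouts should show the same 60/30/10 shares, which is what stratify=y is meant to guarantee.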

What if there is more than one column to stratify on, for example Category 1 and Category 2? Is there a way to stratify over multiple columns as opposed to just one? — Ankhnesmerira, Aug 12 '21 at 05:08
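
One common workaround for the multi-column case is to join the columns into a single label per row and stratify on that. This is only a sketch: the frame, the column names Category1 and Category2, and the "_" separator are all hypothetical, and every combination still needs enough rows to be split.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical frame with two categorical columns; each of the
# four combinations (a/x, a/y, b/x, b/y) has 20 rows.
df = pd.DataFrame({
    "Category1": ["a"] * 40 + ["b"] * 40,
    "Category2": (["x"] * 20 + ["y"] * 20) * 2,
    "value": range(80),
})

# Build one combined label per row, e.g. "a_x", and stratify on it.
strata = df["Category1"].astype(str) + "_" + df["Category2"].astype(str)

train, test = train_test_split(
    df, test_size=0.25, stratify=strata, random_state=0
)

# Rows per combination in the test set.
print(test.groupby(["Category1", "Category2"]).size())
```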

score 14 · Answer 2 · edited Apr 22 '18 at 01:43

sklearn.model_selection.train_test_split

stratify : array-like or None (default is None)

If not None, data is split in a stratified fashion, using this as the class labels.

Following the API docs, I think you should try something like X_train, X_test, y_train, y_test = train_test_split(Meta_X, Meta_Y, test_size=0.2, stratify=Meta_Y).

You need to assign Meta_X and Meta_Y properly yourself (I think Meta_Y should be Meta.categories, based on your code).
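
If the goal is just two DataFrames (train and test) with all columns intact, as in the question, the whole frame can be passed as the only array and the label column given to stratify. This is a sketch on hypothetical data, with column names borrowed from the question.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical frame using the question's column names.
Meta = pd.DataFrame({
    "ReviewerID": range(50),
    "ReviewText": ["text"] * 50,
    "Categories": ["Mobile"] * 25 + ["drugs"] * 15 + ["dvd"] * 10,
    "ProductId": range(1000, 1050),
})

# One input array gives two outputs; stratify only supplies the labels.
train, test = train_test_split(
    Meta, test_size=0.2, random_state=0, stratify=Meta["Categories"]
)

print(train["Categories"].value_counts())
print(test["Categories"].value_counts())
```

The Categories column stays in both halves, so there is no need to split off a separate y unless the model requires it.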

score 4 · Answer 3 · answered Jul 08 '21 at 18:23

I am not sure why StratifiedShuffleSplit isn't mentioned by anyone:

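from sklearn.model_selection import StratifiedShuffleSplit

# n_splits=1 because only a single train/test split is needed here
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(df, df['Categories']):
    # split() yields integer positions, so index with iloc rather than loc
    strat_train_set = df.iloc[train_index]
    strat_test_set = df.iloc[test_index]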

For documentation, refer to StratifiedShuffleSplit.
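
StratifiedShuffleSplit is mainly useful when several independent stratified splits are wanted. Here is a small sketch that keeps every split. The frame is hypothetical and deliberately has index labels that do not start at 0, because split() returns integer positions rather than index labels.

```python
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Hypothetical frame whose index labels are 100..109, not 0..9.
df = pd.DataFrame(
    {"Categories": ["Mobile"] * 6 + ["dvd"] * 4},
    index=range(100, 110),
)

splitter = StratifiedShuffleSplit(n_splits=3, test_size=0.2, random_state=42)

folds = []
for train_pos, test_pos in splitter.split(df, df["Categories"]):
    # Positions, not labels: select rows with iloc.
    folds.append((df.iloc[train_pos], df.iloc[test_pos]))

for train_part, test_part in folds:
    print(test_part["Categories"].value_counts().to_dict())
```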

Tomasz Bartkowiak · Answer 4 · 2022-10-06T08:21:35.377

You don't need to use sklearn - use DataFrame.groupby with DataFrame.sample instead:

df.groupby([cols]).apply(lambda f: f.sample(frac=ratio))

Note: you might also need to call reset_index(drop=True) afterwards.
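
A runnable sketch of the pandas-only idea, on hypothetical data. It uses DataFrameGroupBy.sample (available in pandas 1.1 and later, as far as I know) instead of apply with a lambda, and builds the test set from the rows that were not sampled. Both choices are mine, not part of the answer above.

```python
import pandas as pd

# Hypothetical frame: 60% Mobile, 30% drugs, 10% dvd.
df = pd.DataFrame({
    "Categories": ["Mobile"] * 60 + ["drugs"] * 30 + ["dvd"] * 10,
    "ReviewerID": range(100),
})

# Draw 80% of the rows within each category for training.
train = df.groupby("Categories").sample(frac=0.8, random_state=42)

# Whatever was not drawn becomes the test set.
test = df.drop(train.index)

print(train["Categories"].value_counts())
print(test["Categories"].value_counts())
```

The sampled rows keep their original index labels, which is what makes df.drop(train.index) work. Reset the index only after the test set has been built.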

edited Oct 06 '22 at 08:21

answered Oct 04 '22 at 16:20
