I have a fairly large CSV file containing amazon review data which I read into a pandas data frame. I want to split the data 80-20(train-test) but while doing so I want to ensure that the split data is proportionally representing the values of one column (Categories), i.e all the different category of reviews are present both in train and test data proportionally.
The data looks like this:
**ReviewerID**       **ReviewText**        **Categories**       **ProductId**
1212                   good product         Mobile               14444425
1233                   will buy again       drugs                324532
5432                   not recomended       dvd                  789654123 
Im using the following code to do so:
import pandas as pd
Meta = pd.read_csv('C:\\Users\\xyz\\Desktop\\WM Project\\Joined.csv')
import numpy as np
from sklearn.cross_validation import train_test_split
train, test = train_test_split(Meta.categories, test_size = 0.2, stratify=y)
it gives the following error
NameError: name 'y' is not defined
As I'm relatively new to python I cant figure out what I'm doing wrong or whether this code will stratify based on column categories. It seems to work fine when i remove the stratify option as well as the categories column from train-test split.
Any help will be appreciated.
 
     
     
     
     
    