I need to split my dataset into two parts: 80% for training and 20% for testing.
My dataset looks like this:
PersonID    Timestamp   Foo1    Foo2    Foo3    Label
1   1626184812  6   5   2   1
2   616243602   8   5   2   1
2   634551342   4   8   3   1
2   1531905378  3   8   8   1
3   616243602   10  7   8   2
3   634551342   7   5   8   2
4   1626184812  7   9   1   2
4   616243602   5   7   9   1
4   634551342   9   1   6   2
4   1531905378  3   3   3   1
4   768303369   6   1   7   2
5   1626184812  5   7   8   2
5   616243602   6   2   6   1
6   1280851467  3   2   2   2
7   1626184812  10  1   10  1
7   616243602   6   3   6   2
7   1531905378  9   5   7   2
7   634551342   3   7   9   1
8   616243602   8   7   4   2
8   634551342   2   2   4   1
(Note: you should be able to use pd.read_clipboard() to load this data into a dataframe.)
What I am trying to accomplish is:
- Split this dataset into an 80/20 split (training, testing)
- The split should be roughly chronological by Timestamp: older data should go into training and newer data into testing
- Additionally, a single Person should not be split between training and testing. For example, all of the samples for a given PersonID must be in one set or the other!
The minimal example below, using sklearn's train_test_split, is my attempt at the first two points. The third point is what I am having trouble with:
# Imports
import pandas as pd
from sklearn.model_selection import train_test_split
# Normal split
x = pd.read_clipboard()
train, test = train_test_split(x, train_size=0.8, test_size=0.20, random_state=8)
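# Organizing it by time: sort first, then split without shuffling.
# train_test_split shuffles by default, which would undo the sort.
x = pd.read_clipboard()
x = x.sort_values(by='Timestamp')
train, test = train_test_split(x, train_size=0.8, test_size=0.20, shuffle=False)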
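To make the problem concrete, here is a small check that lists the PersonID values landing on both sides of a plain chronological 80/20 cut. It is only a sketch: the helper name shared_person_ids is made up for illustration, and the data is retyped inline (PersonID and Timestamp only) so it runs without the clipboard.

```python
import pandas as pd

# (PersonID, Timestamp) pairs copied from the table above; other columns omitted.
rows = [
    (1, 1626184812), (2, 616243602), (2, 634551342), (2, 1531905378),
    (3, 616243602), (3, 634551342), (4, 1626184812), (4, 616243602),
    (4, 634551342), (4, 1531905378), (4, 768303369), (5, 1626184812),
    (5, 616243602), (6, 1280851467), (7, 1626184812), (7, 616243602),
    (7, 1531905378), (7, 634551342), (8, 616243602), (8, 634551342),
]
df = pd.DataFrame(rows, columns=["PersonID", "Timestamp"])


def shared_person_ids(train, test, col="PersonID"):
    """Return the sorted IDs that occur in both frames (hypothetical helper)."""
    return sorted(set(train[col].tolist()) & set(test[col].tolist()))


# Plain chronological cut: oldest 80% of rows to train, newest 20% to test.
ordered = df.sort_values("Timestamp", kind="mergesort")
cut = int(len(ordered) * 0.8)
train, test = ordered.iloc[:cut], ordered.iloc[cut:]

print(shared_person_ids(train, test))  # [4, 5, 7]
```

An empty list from this check would mean that no person is split between the two sets.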
I am struggling to figure out how to group the dataframe so that no person is split across train and test. In the example above, the same PersonID can end up in both the train and test dataframes. How can I keep the proportions close to 80/20 while ensuring that no PersonID is split?
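For reference, here is a rough sketch of the behaviour I am after, not a confirmed solution. It assumes a person's position in time can be summarised by their latest Timestamp, and it assigns whole persons to training, oldest first, until about 80% of the rows are covered. The function name group_time_split and its parameters are hypothetical, and the data is again retyped inline.

```python
import pandas as pd


def group_time_split(df, group_col="PersonID", time_col="Timestamp", train_frac=0.8):
    """Split df so that no group appears in both halves (hypothetical helper).

    Assumption: a group's position in time is its latest timestamp. Whole
    groups are added to train, oldest first, while the cumulative share of
    rows stays at or below train_frac; the remaining groups go to test.
    """
    grouped = df.groupby(group_col)[time_col]
    # Mergesort is stable, so tied groups keep their group-key order.
    last_seen = grouped.max().sort_values(kind="mergesort")
    n_rows = grouped.size().reindex(last_seen.index)
    cum_frac = n_rows.cumsum() / len(df)
    train_groups = cum_frac[cum_frac <= train_frac].index
    in_train = df[group_col].isin(train_groups)
    return df[in_train], df[~in_train]


# (PersonID, Timestamp) pairs copied from the table above; other columns omitted.
rows = [
    (1, 1626184812), (2, 616243602), (2, 634551342), (2, 1531905378),
    (3, 616243602), (3, 634551342), (4, 1626184812), (4, 616243602),
    (4, 634551342), (4, 1531905378), (4, 768303369), (5, 1626184812),
    (5, 616243602), (6, 1280851467), (7, 1626184812), (7, 616243602),
    (7, 1531905378), (7, 634551342), (8, 616243602), (8, 634551342),
]
data = pd.DataFrame(rows, columns=["PersonID", "Timestamp"])

train_part, test_part = group_time_split(data)
print(len(train_part), len(test_part))                  # 16 4
print(sorted(test_part["PersonID"].unique().tolist()))  # [7]
```

On this sample the result happens to be exactly 80/20, but in general the proportions depend on how many rows each person has. The ordering is also only approximately chronological: person 7's older rows end up in testing along with their newest one. sklearn's GroupShuffleSplit also keeps groups together, but as far as I can tell it does not take time order into account.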
 
    