My data look like this:
Title  Source Y
aaaaa  a      1
bbbbb  a      0
ccccc  b      0
ddddd  c      0
eeeee  c      0
fffff  a      0
ggggg  b      0
hhhhh  c      1
iiiii  a      0
jjjjj  a      0
....
....
....
Y is the expected value (the target label). Rows with Y = 1 make up 20% of the data and rows with Y = 0 make up 80%.
I'm splitting the dataset like this (note: train_val_split = 0.4):
def split_dataset(self, dataset: Dataset | DatasetDict) -> Dataset | DatasetDict:
        if self.train_val_split is not None:
            split = dataset["train"].train_test_split(self.train_val_split)
            dataset["train"] = split["train"]
            dataset["validation"] = split["test"]
        dataset = self._select_samples(dataset)
        return dataset
And I'm getting this:
Training set
Title  Source Y
aaaaa  a      1
ddddd  c      0
eeeee  c      0
fffff  a      0
ggggg  b      0
hhhhh  c      1
Test set
Title  Source Y
bbbbb  a      0
ccccc  b      0
iiiii  a      0
jjjjj  a      0
I would like to split the data so that both parts keep the Y = 1 / Y = 0 percentages of the initial dataset. In other words, I would like to get something like this:
Training set
Title  Source Y
ccccc  b      0
ddddd  c      0
eeeee  c      0
fffff  a      0
ggggg  b      0
hhhhh  c      1
Test set
Title  Source Y
aaaaa  a      1
bbbbb  a      0
iiiii  a      0
jjjjj  a      0
Is there any way of doing this?
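What is being asked for here is usually called a stratified split: each class of Y is split separately with the same ratio, so both parts keep roughly the original 20% / 80% balance. Below is a minimal pure-Python sketch of that idea on the ten sample rows above. The helper name stratified_split and its parameters are hypothetical, made up for illustration, and not part of any library.

```python
import random
from collections import defaultdict


def stratified_split(rows, label_key, test_size, seed=0):
    """Split rows into (train, test), splitting each label group separately.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)

    # Group the rows by their label value (here: Y = 0 or Y = 1).
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)

    train, test = [], []
    for group in by_label.values():
        rng.shuffle(group)
        # Same ratio for every class. Note that round() can give 0 for
        # very small classes, leaving that class out of the test part.
        n_test = round(len(group) * test_size)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


# The ten sample rows from the question (2 rows with Y = 1, 8 with Y = 0).
rows = [
    {"Title": "aaaaa", "Source": "a", "Y": 1},
    {"Title": "bbbbb", "Source": "a", "Y": 0},
    {"Title": "ccccc", "Source": "b", "Y": 0},
    {"Title": "ddddd", "Source": "c", "Y": 0},
    {"Title": "eeeee", "Source": "c", "Y": 0},
    {"Title": "fffff", "Source": "a", "Y": 0},
    {"Title": "ggggg", "Source": "b", "Y": 0},
    {"Title": "hhhhh", "Source": "c", "Y": 1},
    {"Title": "iiiii", "Source": "a", "Y": 0},
    {"Title": "jjjjj", "Source": "a", "Y": 0},
]

train, test = stratified_split(rows, "Y", test_size=0.4)
print(len(train), len(test))
print(sum(r["Y"] for r in train), sum(r["Y"] for r in test))
```

With train_val_split = 0.4 this puts one Y = 1 row and three Y = 0 rows in the test part, which matches the desired output above. If the data is a Hugging Face datasets.Dataset, train_test_split also accepts a stratify_by_column argument, so something like dataset["train"].train_test_split(self.train_val_split, stratify_by_column="Y") may be enough. This assumes Y is stored as a ClassLabel feature, which class_encode_column("Y") can convert it to.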
 
    