There might be a more elegant and/or efficient way to accomplish your goal. I do not yet have a solution in mind for randomly picking a fixed number of sets of n consecutive elements from a list (without replacement).
I would probably start by doing something like this, though:
import random

import numpy as np


def custom_split(df, train_size, n_adjacent=3):
    # Number of desired sets of n_adjacent consecutive rows.
    # round() guards against float error in (1 - train_size).
    test_size = int(round(len(df) * (1 - train_size), 6) // n_adjacent)
    n_attempt = 10
    while n_attempt > 0:
        retry = False
        available_idx = list(range(len(df)))
        test_idx = []
        for _ in range(test_size):
            # If too few indices remain, try again from the beginning.
            if len(available_idx) < n_adjacent:
                retry = True
                n_attempt -= 1
                break
            # Choose a start index from the available ones, excluding the
            # last n_adjacent - 1 (an explicit length avoids an empty [:-0]
            # slice when n_adjacent == 1).
            last_start = len(available_idx) - (n_adjacent - 1)
            add_idx = random.choice(available_idx[:last_start])
            # Extend with this index and the n_adjacent - 1 following ones.
            new_idx = list(range(add_idx, add_idx + n_adjacent))
            # Remove those indices from the available list, and also
            # remove indices that can no longer start a run of
            # n_adjacent consecutive rows.
            available_idx = [
                idx for idx in available_idx
                if idx not in new_idx and idx + n_adjacent - 1 not in new_idx
            ]
            test_idx.extend(new_idx)
        if not retry:
            # It succeeded.
            # Mask the test_idx positions as False.
            train_idx = np.ones(len(df), dtype=bool)
            train_idx[test_idx] = False
            return df.iloc[train_idx, :], df.iloc[test_idx, :]
    # Raise an exception after 10 failed attempts.
    raise RuntimeError("Could not find consecutive indices to randomly choose from.")


# 80% train, 20% test, rounding up the train portion.
# Thanks to the mask, the whole DataFrame is represented.
train_set, test_set = custom_split(a_dataframe, train_size=0.8, n_adjacent=5)
The major issue with this solution is that you can run out of consecutive indices before all the test sets have been drawn. That's the reason for the while loop: it starts over after each failure, up to 10 times, and raises an exception if every attempt fails.
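One way to avoid the retry loop entirely, sketched here as an alternative and not as the method above, is to cut the rows into non-overlapping blocks of n_adjacent and sample whole blocks. The function name block_split_positions and the seed parameter are hypothetical, and the trade-off is that a test block can only start at a multiple of n_adjacent, so the result is less random than with the function above.

```python
import random


def block_split_positions(n_rows, train_size, n_adjacent=3, seed=None):
    """Return (train_positions, test_positions) built from grid-aligned blocks."""
    rng = random.Random(seed)
    # Start position of every complete, non-overlapping block of n_adjacent rows.
    starts = list(range(0, n_rows - n_adjacent + 1, n_adjacent))
    # Number of test blocks; round() guards against float error in (1 - train_size).
    n_blocks = int(round(n_rows * (1 - train_size), 6) // n_adjacent)
    n_blocks = min(n_blocks, len(starts))
    # Sampling block starts without replacement can never produce overlaps.
    chosen = rng.sample(starts, n_blocks)
    test_pos = [pos for start in chosen for pos in range(start, start + n_adjacent)]
    taken = set(test_pos)
    train_pos = [pos for pos in range(n_rows) if pos not in taken]
    return train_pos, test_pos


train_pos, test_pos = block_split_positions(20, train_size=0.7, n_adjacent=3, seed=0)
```

The returned positions can then be used exactly like the ones above, for instance with df.iloc[train_pos, :] and df.iloc[test_pos, :].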
The "idx" values are not taken from the index column of the DataFrame; they are the positions of the rows along their axis. That's why I use them with iloc and not with loc.
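To illustrate the difference, here is a small sketch with a made-up DataFrame whose index labels do not start at 0 (the column name and values are assumptions for the example only):

```python
import pandas as pd

# Index labels 100..103 deliberately differ from row positions 0..3.
df = pd.DataFrame({"value": [10, 20, 30, 40]}, index=[100, 101, 102, 103])

positions = [1, 2]

# iloc selects by position: the second and third rows.
by_position = df.iloc[positions, :]

# loc selects by label, so the same rows need the labels 101 and 102.
# df.loc[positions, :] would look up labels 1 and 2, which do not exist here.
by_label = df.loc[[101, 102], :]
```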
Result with a 20-row DataFrame, train_size=0.7 and n_adjacent=3:
# IDX
# train:
[0, 1, 2, 3, 4, 5, 6, 7, 8, 12, 13, 14, 15, 16]
# test:
[17, 18, 19, 9, 10, 11]
Don't forget to shuffle the train set, or both sets, afterwards, according to your needs. Here is an elegant way to shuffle DataFrame rows: https://stackoverflow.com/a/34879805/10409093
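As a minimal sketch of that shuffling step, one common approach is DataFrame.sample with frac=1. The small train_set below is a stand-in for the real one, and random_state is only set to make the example repeatable:

```python
import pandas as pd

# Stand-in for the train set returned by the split.
train_set = pd.DataFrame({"value": range(6)})

# frac=1 draws every row exactly once, in random order.
# reset_index(drop=True) discards the old index; skip it to keep the labels.
shuffled = train_set.sample(frac=1, random_state=0).reset_index(drop=True)
```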