How to randomly split a DataFrame into several smaller DataFrames?

Question

I'm having trouble randomly splitting DataFrame df into groups of smaller DataFrames.

df
  movie_id  1   2   4   5   6   7   8   9   10  11  12  borda
0   1       5   4   0   4   4   0   0   0   4   0   0   21
1   2       3   0   0   3   0   0   0   0   0   0   0   6   
2   3       4   0   0   0   0   0   0   0   0   0   0   4   
3   4       3   0   0   0   0   5   0   0   4   0   5   17  
4   5       3   0   0   0   0   0   0   0   0   0   0   3   
5   6       5   0   0   0   0   0   0   5   0   0   0   10  
6   7       4   0   0   0   2   5   3   4   4   0   0   22  
7   8       1   0   0   0   4   5   0   0   0   4   0   14  
8   9       5   0   0   0   4   5   0   0   4   5   0   23  
9   10      3   2   0   0   0   4   0   0   0   0   0   9   
10  11      2   0   4   0   0   3   3   0   4   2   0   18  
11  12      5   0   0   0   4   5   0   0   5   2   0   21  
12  13      5   4   0   0   2   0   0   0   3   0   0   14  
13  14      5   4   0   0   5   0   0   0   0   0   0   14  
14  15      5   0   0   0   3   0   0   0   0   5   5   18  
15  16      5   0   0   0   0   0   0   0   4   0   0   9   
16  17      3   0   0   4   0   0   0   0   0   0   0   7   
17  18      4   0   0   0   0   0   0   0   0   0   0   4   
18  19      5   3   0   0   4   0   0   0   0   0   0   12  
19  20      4   0   0   0   0   0   0   0   0   0   0   4   
20  21      1   0   0   3   3   0   0   0   0   0   0   7   
21  22      4   0   0   0   3   5   5   0   5   4   0   26  
22  23      4   0   0   0   4   3   0   0   5   0   0   16  
23  24      3   0   0   4   0   0   0   0   0   3   0   10

I've tried sample and np.arange, but neither gave the result I want.

ran1 = df.sample(frac=0.2, replace=False, random_state=1)
ran2 = df.sample(frac=0.2, replace=False, random_state=1)
ran3 = df.sample(frac=0.2, replace=False, random_state=1)
ran4 = df.sample(frac=0.2, replace=False, random_state=1)
ran5 = df.sample(frac=0.2, replace=False, random_state=1)

print(ran1, '\n')
print(ran2, '\n')
print(ran3, '\n')
print(ran4, '\n')
print(ran5, '\n')

This produced five identical DataFrames:

   movie_id  1  2  4  5  6  7  8  9  10  11  12  borda  
13    14     5  4  0  0  5  0  0  0   0   0   0     14  
18    19     5  3  0  0  4  0  0  0   0   0   0     12  
3     4      3  0  0  0  0  5  0  0   4   0   5     17  
14    15     5  0  0  0  3  0  0  0   0   5   5     18  
20    21     1  0  0  3  3  0  0  0   0   0   0      7
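
The likely cause is that every call passes the same `random_state=1`, so each call draws the same rows. The following is a minimal sketch on a stand-in frame (the 24-row `df` built here is an assumption, not the original data). It also shows that changing the seed per call does not fix the problem: the draws are independent, so they overlap and do not partition `df`.

```python
import numpy as np
import pandas as pd

# Stand-in data (assumption): 24 rows, like the frame in the question.
df = pd.DataFrame({"movie_id": np.arange(1, 25)})

# Same seed -> same rows on every call.
a = df.sample(frac=0.2, replace=False, random_state=1)
b = df.sample(frac=0.2, replace=False, random_state=1)
print(a.equals(b))  # True

# Different seeds give different draws, but each draw is independent,
# so rows can repeat across samples and some rows may never be chosen.
samples = [df.sample(frac=0.2, random_state=seed) for seed in range(5)]
covered = set().union(*(s.index for s in samples))
print(len(covered), "of", len(df), "rows covered")
```

Five samples of 5 rows each make 25 draws from 24 rows, so at least one row must appear twice.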

I've also tried:

g = df.groupby(['movie_id'])
h = np.arange(g.ngroups)
np.random.shuffle(h)

df[g.ngroup().isin(h[:6])]

The output:

    movie_id    1   2   4   5   6   7   8   9   10  11  12  borda   
4      5        3   0   0   0   0   0   0   0   0   0   0   3   
6      7        4   0   0   0   2   5   3   4   4   0   0   22  
7      8        1   0   0   0   4   5   0   0   0   4   0   14  
16     17       3   0   0   4   0   0   0   0   0   0   0   7   
17     18       4   0   0   0   0   0   0   0   0   0   0   4   
18     19       5   3   0   0   4   0   0   0   0   0   0   12

But this still gives only one smaller group; the remaining rows of df aren't grouped.

I expect the whole of df to be split into smaller groups of equal size, with the size given as a percentage.

Does [this](https://stackoverflow.com/questions/17315737/split-a-large-pandas-dataframe) help? — m13op22, Feb 17 '19 at 05:14

Dawei · Accepted Answer · 2019-02-17T05:49:23.880

Use np.array_split

shuffled = df.sample(frac=1)
result = np.array_split(shuffled, 5)

df.sample(frac=1) shuffles the rows of df. np.array_split then splits the shuffled frame into parts that are as equal in size as possible (here, four parts of 5 rows and one part of 4).

It gives you:

for part in result:
    print(part,'\n')

    movie_id  1  2  4  5  6  7  8  9  10  11  12  borda
5          6  5  0  0  0  0  0  0  5   0   0   0     10
4          5  3  0  0  0  0  0  0  0   0   0   0      3
7          8  1  0  0  0  4  5  0  0   0   4   0     14
16        17  3  0  0  4  0  0  0  0   0   0   0      7
22        23  4  0  0  0  4  3  0  0   5   0   0     16 

    movie_id  1  2  4  5  6  7  8  9  10  11  12  borda
13        14  5  4  0  0  5  0  0  0   0   0   0     14
14        15  5  0  0  0  3  0  0  0   0   5   5     18
21        22  4  0  0  0  3  5  5  0   5   4   0     26
1          2  3  0  0  3  0  0  0  0   0   0   0      6
20        21  1  0  0  3  3  0  0  0   0   0   0      7 

    movie_id  1  2  4  5  6  7  8  9  10  11  12  borda
10        11  2  0  4  0  0  3  3  0   4   2   0     18
9         10  3  2  0  0  0  4  0  0   0   0   0      9
11        12  5  0  0  0  4  5  0  0   5   2   0     21
8          9  5  0  0  0  4  5  0  0   4   5   0     23
12        13  5  4  0  0  2  0  0  0   3   0   0     14 

    movie_id  1  2  4  5  6  7  8  9  10  11  12  borda
18        19  5  3  0  0  4  0  0  0   0   0   0     12
3          4  3  0  0  0  0  5  0  0   4   0   5     17
0          1  5  4  0  4  4  0  0  0   4   0   0     21
23        24  3  0  0  4  0  0  0  0   0   3   0     10
6          7  4  0  0  0  2  5  3  4   4   0   0     22 

    movie_id  1  2  4  5  6  7  8  9  10  11  12  borda
17        18  4  0  0  0  0  0  0  0   0   0   0      4
2          3  4  0  0  0  0  0  0  0   0   0   0      4
15        16  5  0  0  0  0  0  0  0   4   0   0      9
19        20  4  0  0  0  0  0  0  0   0   0   0      4
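
To make the split repeatable and to check that it really partitions the frame, one possible variant is sketched below. It is not the answer's exact code: the stand-in `df`, the fixed `random_state=0`, and the choice to split row positions and slice with `iloc` (rather than passing the DataFrame itself to `np.array_split`) are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Stand-in data (assumption): 24 rows, like the frame in the question.
df = pd.DataFrame({"movie_id": np.arange(1, 25)})

# Shuffle once with a fixed seed so the split is repeatable.
shuffled = df.sample(frac=1, random_state=0)

# Split the row positions rather than the frame itself, then slice with iloc.
positions = np.array_split(np.arange(len(shuffled)), 5)
parts = [shuffled.iloc[pos] for pos in positions]

print([len(p) for p in parts])  # [5, 5, 5, 5, 4]

# Every original row appears in exactly one part.
recombined = pd.concat(parts).sort_index()
print(recombined.equals(df))  # True
```

Splitting positions keeps the example independent of how a given NumPy version handles a DataFrame argument.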

This is so nice and clean. Thanks a lot! – Jerry Chen Feb 17 '19 at 08:57
But what if we want different sizes? – seeker_after_truth Apr 02 '22 at 16:10
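
For groups of different sizes, one possible approach (a sketch, not something given in the thread) is to turn the desired fractions into cut positions and pass those to `np.split`. The `fractions` list and the stand-in `df` are hypothetical.

```python
import numpy as np
import pandas as pd

# Stand-in data (assumption): 24 rows, like the frame in the question.
df = pd.DataFrame({"movie_id": np.arange(1, 25)})

# Hypothetical group sizes as fractions of the frame; they should sum to 1.
fractions = [0.5, 0.25, 0.25]

shuffled = df.sample(frac=1, random_state=0)

# Cumulative fractions -> row positions at which to cut (the final 1.0 is dropped).
cuts = (np.cumsum(fractions)[:-1] * len(shuffled)).astype(int)
parts = [shuffled.iloc[pos] for pos in np.split(np.arange(len(shuffled)), cuts)]

print([len(p) for p in parts])  # [12, 6, 6]
```

When a fraction does not divide the row count evenly, the cut positions are truncated, so group sizes can differ by a row from the exact proportion.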

keineahnung2345 · Answer 2 · 2019-02-17T05:40:10.773

A simple demo:

import numpy as np
import pandas as pd

df = pd.DataFrame({"movie_id": np.arange(1, 25),
          "borda": np.random.randint(1, 25, size=(24,))})
n_split = 5
# the row positions used to select parts of the dataframe
ixs = np.arange(df.shape[0])
np.random.shuffle(ixs)
# np.split raises an error when the rows cannot be divided equally,
# so we compute the split points ourselves
# we need (n_split-1) split points
split_points = [i*df.shape[0]//n_split for i in range(1, n_split)]
# use the shuffled positions to select each part
for ix in np.split(ixs, split_points):
    print(df.iloc[ix])

The result:

    borda  movie_id
8       3         9
10      2        11
22     14        23
7      14         8

    borda  movie_id
0      16         1
20      4        21
17     15        18
15      1        16
6       6         7

    borda  movie_id
9       9        10
19      4        20
5       1         6
16     23        17
21     20        22

    borda  movie_id
11     24        12
23      5        24
1      22         2
12      7        13
18     15        19

    borda  movie_id
3      11         4
14     10        15
2       6         3
4       7         5
13     21        14
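
The split-point arithmetic can be checked on its own. This short sketch uses the same 24 rows and 5 groups as the demo; the names `bounds` and `sizes` are introduced only for illustration.

```python
n_rows, n_split = 24, 5

# Integer division spreads the remainder across the groups.
split_points = [i * n_rows // n_split for i in range(1, n_split)]
print(split_points)  # [4, 9, 14, 19]

# Group sizes are the gaps between consecutive boundaries.
bounds = [0] + split_points + [n_rows]
sizes = [hi - lo for lo, hi in zip(bounds, bounds[1:])]
print(sizes)  # [4, 5, 5, 5, 5]
```

This matches the result above: one group of 4 rows followed by four groups of 5.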

score 2 · Answer 3 · answered Feb 17 '19 at 05:45

IIUC, you can do this:

frames = {}
for e, i in enumerate(np.split(df, 6)):
    frames['df_' + str(e + 1)] = pd.DataFrame(np.random.permutation(i), columns=df.columns)
print(frames['df_1'])

   movie_id  1  2  4  5  6  7  8  9  10  11  12  borda
0         4  3  0  0  0  0  5  0  0   4   0   5     17
1         3  4  0  0  0  0  0  0  0   0   0   0      4
2         2  3  0  0  3  0  0  0  0   0   0   0      6
3         1  5  4  0  4  4  0  0  0   4   0   0     21

Explanation: np.split(df, 6) splits df into 6 equal-sized parts (this requires the number of rows to be divisible by 6). pd.DataFrame(np.random.permutation(i), columns=df.columns) shuffles the rows of each part and builds a new DataFrame from them, which is then stored in a dictionary named frames.

Finally, look up each DataFrame by its key: try print(frames['df_1']), print(frames['df_2']), and so on. Each one is a random permutation of one split of the DataFrame.
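
Because the split happens before the shuffle, each frame holds a contiguous block of the original rows, only in random order (the output above contains movie_id 1 to 4). The sketch below illustrates this on a stand-in frame; it swaps in `iloc` and `DataFrame.sample` for `np.split(df, 6)` and `np.random.permutation`, which is an assumption made to keep the example simple, not this answer's exact code.

```python
import numpy as np
import pandas as pd

# Stand-in data (assumption): 24 rows, like the frame in the question.
df = pd.DataFrame({"movie_id": np.arange(1, 25)})

frames = {}
for k, pos in enumerate(np.split(np.arange(len(df)), 6), start=1):
    block = df.iloc[pos]  # 4 consecutive rows of the original frame
    # Shuffle only within the block, then renumber the rows from 0.
    frames[f"df_{k}"] = block.sample(frac=1, random_state=k).reset_index(drop=True)

print(sorted(frames["df_1"]["movie_id"].tolist()))  # [1, 2, 3, 4]
```

If the groups should be random across the whole frame, shuffle df first and split afterwards.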
