TL;DR: np.random.shuffle(ndarray) can do the job.
So, in your case:
np.random.shuffle(DataFrame.values)
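Under the hood, a DataFrame uses NumPy ndarrays to hold its data (you can check this in the DataFrame source code).
np.random.shuffle() shuffles a multi-dimensional array along its first axis, so the rows are reordered in place while the index of the DataFrame remains unshuffled. Note that this relies on DataFrame.values handing back a writable view of the data, which is typical for a single-dtype frame but may not hold for mixed dtypes or when pandas copy-on-write is enabled.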
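As a minimal sketch of the behaviour described above: the example shuffles an explicit copy of the values and rebuilds the frame with the original index, rather than shuffling df.values directly. That substitution is my own, made so the example does not depend on whether .values is a writable view; the column names and data are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: in every row, b == a + 10.
df = pd.DataFrame({"a": [10, 11, 12, 13, 14], "b": [20, 21, 22, 23, 24]})

# Take an explicit, writable copy of the underlying values.
values = df.to_numpy().copy()

# Shuffles rows (first axis) in place; the return value is None.
result = np.random.shuffle(values)

# Rebuild the frame: rows are reordered, index labels keep their order.
shuffled = pd.DataFrame(values, index=df.index, columns=df.columns)

print(result)
print(shuffled)
```

Each row stays intact (b is still a + 10), but the rows no longer line up with their original index labels.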
There are some points to consider, though.
- np.random.shuffle() shuffles in place and returns None. If you want to keep a copy of the original object, make it before you pass the object to the function.
- sklearn.utils.shuffle(), as user tj89 suggested, accepts random_state along with another option (n_samples) to control the output. You may want that for dev purposes.
- sklearn.utils.shuffle() is faster, but it WILL SHUFFLE the index of the DataFrame along with the ndarray it contains: each row keeps its original index label, so the index ends up out of order. The column order is not changed.
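The points about sklearn.utils.shuffle() can be sketched as follows, assuming scikit-learn is installed. The toy frame and the seed value 0 are arbitrary choices for illustration.

```python
import pandas as pd
from sklearn.utils import shuffle

# Hypothetical toy frame with the default RangeIndex 0..4.
df = pd.DataFrame({"a": [10, 11, 12, 13, 14], "b": [20, 21, 22, 23, 24]})

# Returns a shuffled copy; the original frame is left as it was.
shuffled = shuffle(df, random_state=0)

# The same seed gives the same permutation, which helps reproducibility.
again = shuffle(df, random_state=0)

# n_samples limits how many rows come back.
subset = shuffle(df, random_state=0, n_samples=3)

print(shuffled)        # index labels appear in shuffled order
print(subset.shape)
```

Because the index labels travel with their rows, calling shuffled.reset_index(drop=True) afterwards gives a fresh 0..n-1 index if the old labels are not needed.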
Benchmark results for sklearn.utils.shuffle() versus np.random.shuffle() (total time for 1000 calls on a 1000 x 100 array):
ndarray
- nd = sklearn.utils.shuffle(nd): 0.10793248389381915 sec (8x faster)
- np.random.shuffle(nd): 0.8897626010002568 sec
DataFrame
- df = sklearn.utils.shuffle(df): 0.3183923360193148 sec (3x faster)
- np.random.shuffle(df.values): 0.9357550159329548 sec
Conclusion: If it is okay for the index to be shuffled along with the data, use sklearn.utils.shuffle(). Otherwise, use np.random.shuffle().
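Code used:
import timeit

# Shared setup: a 1000 x 100 float array and a DataFrame wrapping it
setup = '''
import numpy as np
import pandas as pd
import sklearn.utils
nd = np.random.random((1000, 100))
df = pd.DataFrame(nd)
'''

# Each call returns the total time in seconds for 1000 runs
print(timeit.timeit('nd = sklearn.utils.shuffle(nd)', setup=setup, number=1000))
print(timeit.timeit('np.random.shuffle(nd)', setup=setup, number=1000))
print(timeit.timeit('df = sklearn.utils.shuffle(df)', setup=setup, number=1000))
# Assumes df.values is a writable view (may not hold with copy-on-write enabled)
print(timeit.timeit('np.random.shuffle(df.values)', setup=setup, number=1000))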