
I need to randomly shuffle a very large CSV file (I don't know in advance how many columns it will have) so that each column is permuted independently. That is, I have to go from this form

1,a,...
2,b,...
3,c,...

to something like this

3,b,...
1,c,...
2,a,...

I know I can shuffle the rows with shuf, but I need to shuffle each column independently. I am wondering whether this is possible with a combination of bash commands.

emanuele

1 Answer


I wrote a Python script that generates a bash script. I don't think it is the most elegant way, but it works quite well.

import csv

FILENAME = 'my_huge_csv.csv'

# Read only the first row to find out how many columns there are.
with open(FILENAME, 'r') as f:
    reader = csv.reader(f, delimiter=',')
    NCOL = len(next(reader))

with open("shuffle_{}.sh".format(FILENAME), "w+") as f:
    f.write("#!/bin/bash \n")
    # Copy the header line unchanged.
    f.write("/usr/bin/head -n 1 {} > final.csv \n".format(FILENAME))
    # Extract and shuffle each column in its own background job.
    for i in range(NCOL):
        f.write("/usr/bin/tail -n +2 {}|/usr/bin/cut -d, -f{}|shuf > tmp_file_{}.csv &\n".format(FILENAME, i+1, i+1))
    f.write("wait \n")
    # Paste the shuffled columns back together, then remove the temporary files.
    cut_arg = ['tmp_file_{}.csv'.format(i+1) for i in range(NCOL)]
    cut_cmd = '/usr/bin/paste -d , ' + ' '.join(cut_arg) + ' >> final.csv \n'
    f.write(cut_cmd)
    f.write('rm ' + ' '.join(cut_arg) + ' \n')

Then I simply run chmod +x on the generated script and execute it.
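For a file small enough to fit in memory, the same idea (keep the header, permute every column on its own, glue the columns back together) can be sketched in pure Python. This is only an illustration, not a replacement for the pipeline above on a huge file: the function name shuffle_columns and its seed parameter are made up for this example, and it assumes the first row is a header and that every row has the same number of fields. Unlike cut -d, it goes through the csv module, so quoted fields containing commas should survive.

```python
import csv
import io
import random


def shuffle_columns(src, dst, seed=None):
    """Shuffle every column of a CSV independently, keeping the header row.

    Hypothetical helper: src and dst are open text file objects.
    The whole file is loaded into memory.
    """
    rng = random.Random(seed)
    reader = csv.reader(src)
    header = next(reader)
    rows = list(reader)

    # Transpose rows into columns and shuffle each column on its own.
    columns = [list(col) for col in zip(*rows)]
    for col in columns:
        rng.shuffle(col)

    writer = csv.writer(dst, lineterminator="\n")
    writer.writerow(header)
    # Transpose back into rows and write them out.
    writer.writerows(zip(*columns))


if __name__ == "__main__":
    sample = io.StringIO("id,letter\n1,a\n2,b\n3,c\n")
    out = io.StringIO()
    shuffle_columns(sample, out, seed=0)
    print(out.getvalue())
```

Each output column contains the same values as the input column, only in a different order, and the orders are independent across columns. For a really large file the bash approach is still the better fit, since shuf runs once per column in parallel and nothing has to be transposed in memory.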

emanuele