I have multiple large CSV files, each between 1 GB and 7 GB in size.
None of the files has a header row; every row is in the format ID,col1,col2,col3.
File 1 has this structure:
1,23,22,27
2,67,29,22
3,34,34,23
File 2 has this structure:
4,23,22,27
5,67,29,22
6,34,34,23
That is, the IDs are unique across all files.
I would like to merge these files into a single CSV file and then sort the rows by the ID column. The resulting file will be around 75 GB, and it starts with ID 1:
1,23,22,27
2,67,29,22
3,34,34,23
4,23,22,27
5,67,29,22
6,34,34,23
At the moment I'm doing it like this:
import pandas as pd

CHUNK_SIZE = 10000000  # number of rows per chunk
output_file = 'combined.csv'

# sorted_fnames: list of input CSV file paths (defined elsewhere)
for csv_file_name in sorted_fnames:
    # header=None because the input files have no header row
    chunk_container = pd.read_csv(csv_file_name, header=None, chunksize=CHUNK_SIZE)
    print(csv_file_name)
    for chunk in chunk_container:
        chunk.to_csv(output_file, mode="a", header=False, index=False)
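For reference, a byte-level concatenation of the same files (without any CSV parsing) would look roughly like the sketch below. It assumes the inputs really have no header rows and that every file ends with a newline, and sorted_fnames is the same list of paths as above; I'm including it only for comparison with the pandas version.

import shutil

# Sketch only: copy the raw bytes of each input file into combined.csv,
# skipping CSV parsing entirely. Assumes header-less files that each end
# with a newline, so rows from different files are not glued together.
with open('combined.csv', 'wb') as out_f:
    for csv_file_name in sorted_fnames:
        print(csv_file_name)
        with open(csv_file_name, 'rb') as in_f:
            shutil.copyfileobj(in_f, out_f, length=16 * 1024 * 1024)  # 16 MiB copy buffer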
Then I'm sorting the file like this:
sort --parallel=2 -t, -k1,1 -n combined.csv > combined_sorted.csv
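(For context on the sort step: GNU sort also accepts -S/--buffer-size and -T/--temporary-directory to control its in-memory buffer and where it writes temporary files. The 8G value and the temp path below are just example placeholders, not something I have benchmarked.)

sort --parallel=2 -S 8G -T /mnt/fast_disk/tmp -t, -k1,1 -n combined.csv > combined_sorted.csv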
However, the merging process is incredibly slow: it takes more than an hour just to produce the merged file.
Note: I have only 16 GB of RAM, which is why I'm using the chunking option.
Is there a faster way to do this?
Thanks