I want to match the CSV writing speed of R's data.table::fwrite in Python.
Let's check some timings. First, R:
library(data.table)

nRow <- 5e6
nCol <- 30

# 5e6 x 30 data.frame of random integers in [1, 100]
df <- data.frame(matrix(sample.int(100, nRow * nCol, replace = TRUE), nRow, nCol))

ta <- Sys.time()
fwrite(x = df, file = "/home/cryo111/test2.csv")
tb <- Sys.time()
tb - ta
# Time difference of 1.907027 secs
The same in Python using pandas.to_csv:
import datetime

import numpy as np
import pandas as pd

nRow = int(5e6)
nCol = 30

# 5e6 x 30 DataFrame of random integers in [0, 100)
df = pd.DataFrame(np.random.randint(0, 100, size=(nRow, nCol)))

ta = datetime.datetime.now()
df.to_csv("/home/cryo111/test.csv")
tb = datetime.datetime.now()
(tb - ta).total_seconds()
# 96.421676
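
One caveat about the comparison itself: to_csv writes the row index by default, while fwrite does not write row names, so the pandas file carries one extra column. Dropping it makes the benchmark slightly fairer, although I would not expect that alone to close a roughly 50x gap:

# Skip the index column so the output matches what fwrite produces.
df.to_csv("/home/cryo111/test.csv", index=False)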
Currently there is a huge performance gap (roughly a factor of 50 here). One main reason might be that fwrite parallelizes the write across all cores, whereas to_csv is presumably single-threaded.
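
To illustrate the kind of workaround I have in mind, here is a minimal sketch that splits the frame row-wise, writes each slice from a separate process, and stitches the part files together. The names (parallel_to_csv, write_chunk), the worker count, and the paths are my own invention, and pickling the chunks over to the workers has its own cost, so I don't expect this to actually match fwrite:

import os
import multiprocessing as mp

import numpy as np
import pandas as pd

def write_chunk(args):
    # Each worker serializes its own row slice to a part file.
    chunk, path, header = args
    chunk.to_csv(path, index=False, header=header)
    return path

def parallel_to_csv(df, path, n_jobs=4):
    # Split row-wise; only the first part keeps the header line.
    bounds = np.linspace(0, len(df), n_jobs + 1, dtype=int)
    args = [(df.iloc[bounds[i]:bounds[i + 1]], "%s.part%d" % (path, i), i == 0)
            for i in range(n_jobs)]
    with mp.Pool(n_jobs) as pool:
        parts = pool.map(write_chunk, args)
    # Concatenate the part files into the final CSV.
    with open(path, "wb") as out:
        for part in parts:
            with open(part, "rb") as f:
                out.write(f.read())
            os.remove(part)

if __name__ == "__main__":
    df = pd.DataFrame(np.random.randint(0, 100, size=(10 ** 6, 30)))
    parallel_to_csv(df, "/tmp/test_parallel.csv")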
I wasn't able to find any Python package with an out-of-the-box CSV writer that could match data.table::fwrite. Have I missed something? Is there another way to speed up the write process?
The file size is around 400 MB in both cases, and the code ran on the same machine.
I have tried Python 2.7, 3.4, and 3.5. I am using R 3.3.2 with data.table 1.10.4; on Python 3.4 I was using pandas 0.20.1.
