I have 40 GB of gzipped TSV files stored on S3.
I load them with
df = spark.read.csv()
and write the DataFrame to HDFS with
df.write.parquet()
The resulting Parquet output is about 20 GB.
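For reference, here is a minimal sketch of the whole job; the app name, bucket, and paths are placeholders, and I'm assuming tab-separated columns with no header row:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("tsv-to-parquet").getOrCreate()  # placeholder app name

# Read the gzipped TSV files from S3 (placeholder bucket/path).
df = spark.read.csv("s3a://my-bucket/data/", sep="\t", header=False)

# Write to HDFS as Parquet (snappy-compressed by default); placeholder output path.
df.write.parquet("hdfs:///user/me/data_parquet")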
But if I call repartition on the DataFrame before writing it, the output size increases by about 10x:

df = df.repartition(num)
df.write.parquet()

Even when I pass repartition an argument equal to the existing number of partitions, the output size still increases a lot.
This makes the operation extremely slow.
But I do need the repartition step, because spark.read.csv doesn't return a reasonably partitioned DataFrame.
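For what it's worth, this is roughly how I check the partitioning; num = 200 is just a placeholder for the target partition count:

# How many partitions spark.read.csv actually produced
# (gzip is not splittable, so this tends to be one partition per file).
print(df.rdd.getNumPartitions())

num = 200  # placeholder target count
df = df.repartition(num)
print(df.rdd.getNumPartitions())  # now equals num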
Does anyone know why this happens?