I have 40 GB of gzipped TSV files stored on S3.
I load them with
df = spark.read.csv()
and write the DataFrame to HDFS with
df.write.parquet()
The resulting Parquet output is about 20 GB.
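For reference, here is a minimal sketch of the whole job; the app name, bucket, and paths are placeholders, and I'm assuming tab-separated columns with no header row:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("tsv-to-parquet").getOrCreate()  # placeholder app name

# Read the gzipped TSV files from S3 (placeholder bucket/path).
df = spark.read.csv("s3a://my-bucket/data/", sep="\t", header=False)

# Write to HDFS as Parquet (snappy-compressed by default); placeholder output path.
df.write.parquet("hdfs:///user/me/data_parquet")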
But if I call repartition on the DataFrame before writing it, the output size increases by about 10x:

df = df.repartition(num)
df.write.parquet()

Even when I pass repartition an argument equal to the existing number of partitions, the output size still increases a lot.
This makes the operation extremely slow.
But I do need the repartition step, because spark.read.csv doesn't return a reasonably partitioned DataFrame.
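For what it's worth, this is roughly how I check the partitioning; num = 200 is just a placeholder for the target partition count:

# How many partitions spark.read.csv actually produced
# (gzip is not splittable, so this tends to be one partition per file).
print(df.rdd.getNumPartitions())

num = 200  # placeholder target count
df = df.repartition(num)
print(df.rdd.getNumPartitions())  # now equals num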
Does anyone know why this happens?