I'm confused about how the behaviour of the numPartitions parameter differs between the following methods:
- DataFrameReader.jdbc
- Dataset.repartition
The official docs of DataFrameReader.jdbc say the following regarding the numPartitions parameter:
numPartitions: the number of partitions. This, along with lowerBound (inclusive), upperBound (exclusive), form partition strides for generated WHERE clause expressions used to split the column columnName evenly.
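To make the quoted description concrete, here is a rough sketch in plain Python (no Spark required) of how partition strides could be turned into one WHERE clause per partition. The function name stride_predicates and its arguments are hypothetical, and the logic is a simplified illustration of my reading of the docs, not Spark's actual implementation. It assumes non-negative integer bounds, and that the first and last ranges are left open-ended so that rows outside the bounds are still read.

```python
def stride_predicates(column, lower_bound, upper_bound, num_partitions):
    """Return one WHERE-clause predicate per partition (illustrative sketch only)."""
    # Assumption: never create more partitions than there are distinct values in range.
    num_partitions = min(num_partitions, upper_bound - lower_bound)
    if num_partitions <= 1:
        # Placeholder meaning "no range filter": everything lands in one partition.
        return ["1 = 1"]

    # Width of each range; integer division, so bounds are assumed non-negative.
    stride = upper_bound // num_partitions - lower_bound // num_partitions

    predicates = []
    current = lower_bound
    for i in range(num_partitions):
        # The first partition has no lower limit, the last has no upper limit.
        lower = f"{column} >= {current}" if i > 0 else None
        current += stride
        upper = f"{column} < {current}" if i < num_partitions - 1 else None

        if lower is None:
            # Assumption: NULL keys are collected by the first partition.
            predicates.append(f"{upper} OR {column} IS NULL")
        elif upper is None:
            predicates.append(lower)
        else:
            predicates.append(f"{lower} AND {upper}")
    return predicates


# Example: 4 partitions over id in [0, 100)
# stride_predicates("id", 0, 100, 4) returns:
#   ["id < 25 OR id IS NULL",
#    "id >= 25 AND id < 50",
#    "id >= 50 AND id < 75",
#    "id >= 75"]
```

Under these assumptions, each predicate corresponds to one query against the database and hence one partition of the resulting DataFrame, which is why numPartitions here looks like a degree of read parallelism.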
And the official docs of Dataset.repartition say:
Returns a new Dataset that has exactly numPartitions partitions.
My current understanding:
- The numPartitions parameter in the DataFrameReader.jdbc method controls the degree of parallelism in reading the data from the database
- The numPartitions parameter in Dataset.repartition controls the number of output files that will be generated when this DataFrame is written to disk
My questions:
- If I read a DataFrame via DataFrameReader.jdbc and then write it to disk (without invoking the repartition method), would there still be as many files in the output as there would have been had I written the DataFrame to disk after invoking repartition on it?
- If the answer to the above question is:
  - Yes: Then is it redundant to invoke the repartition method on a DataFrame that was read using the DataFrameReader.jdbc method (with the numPartitions parameter)?
  - No: Then please correct the lapses in my understanding. Also, in that case, shouldn't the numPartitions parameter of the DataFrameReader.jdbc method be called something like 'parallelism'?