Our reduceByKey operation produces an RDD that is quite skewed, with most of the data in one or two partitions. To increase the parallelism of the processing after the reduceByKey, we call repartition, which forces a shuffle:
rdd.reduceByKey(_+_).repartition(64)
I know that it is possible to pass a Partitioner into the reduceByKey operation, but (short of writing a custom one) I think the only options are HashPartitioner and RangePartitioner. I suspect both of these will leave the data skewed after partitioning, since the keys are mostly unique.
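For context, Spark's HashPartitioner places a key by taking its hashCode modulo the number of partitions (with negative results wrapped into range). Here is a minimal plain-Scala sketch of that assignment logic, using made-up example keys; it mimics HashPartitioner's behavior for illustration and is not Spark code itself:

```scala
// Illustrative sketch of HashPartitioner's placement rule:
// partition = nonNegativeMod(key.hashCode, numPartitions).
object HashPartitionSketch {
  // Modulo that always returns a value in [0, mod), even for
  // negative hash codes (Java's % can return negatives).
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val raw = x % mod
    if (raw < 0) raw + mod else raw
  }

  def getPartition(key: Any, numPartitions: Int): Int =
    nonNegativeMod(key.hashCode, numPartitions)

  def main(args: Array[String]): Unit = {
    // Hypothetical mostly-unique string keys.
    val keys = (1 to 1000).map(i => s"key-$i")
    val sizes = keys
      .groupBy(k => getPartition(k, 8))
      .map { case (p, ks) => p -> ks.size }
    sizes.toSeq.sortBy(_._1).foreach { case (p, n) =>
      println(s"partition $p: $n keys")
    }
  }
}
```

With mostly-unique keys the hash codes scatter, so the per-partition counts from a sketch like this tend to come out fairly balanced, though nothing guarantees evenness for any particular key set.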
Is it possible to have reduceByKey shuffle its output RDD evenly, without the additional repartition call?