How to apply partial sort on a Spark DataFrame?

Question

The following code:

val myDF = Seq(83, 90, 40, 94, 12, 70, 56, 70, 28, 91).toDF("number")
myDF.orderBy("number").limit(3).show

outputs:

+------+
|number|
+------+
|    12|
|    28|
|    40|
+------+

Does Spark's laziness in combination with the limit call and the implementation of orderBy automatically result in a partially sorted DataFrame, or are the remaining 7 numbers also sorted, even though that is not needed? If they are, is there a way to avoid this needless computational work?

Using .explain() shows that two sort stages are performed: first on each partition and then (with the top 3 of each) a global one. But it does not state whether these sorts are full or partial.

myDF.orderBy("number").limit(3).explain(true)

== Parsed Logical Plan ==
GlobalLimit 3
+- LocalLimit 3
   +- Sort [number#3416 ASC NULLS FIRST], true
      +- Project [value#3414 AS number#3416]
         +- LocalRelation [value#3414]

== Analyzed Logical Plan ==
number: int
GlobalLimit 3
+- LocalLimit 3
   +- Sort [number#3416 ASC NULLS FIRST], true
      +- Project [value#3414 AS number#3416]
         +- LocalRelation [value#3414]

== Optimized Logical Plan ==
GlobalLimit 3
+- LocalLimit 3
   +- Sort [number#3416 ASC NULLS FIRST], true
      +- LocalRelation [number#3416]

== Physical Plan ==
TakeOrderedAndProject(limit=3, orderBy=[number#3416 ASC NULLS FIRST], output=[number#3416])
+- LocalTableScan [number#3416]

Probably related https://stackoverflow.com/questions/59195346/does-pyspark-changes-order-of-instructions-for-optimization — abiratsis, Jul 29 '20 at 11:11

mazaneicha · Answer 1 · 2020-07-26T18:47:21.187

3

If you explain() your DataFrame, you'll find that Spark first does a "local" sort within each partition, then picks only the top three elements from each partition for a final global sort, and finally takes the top three out of that.

scala> myDF.orderBy("number").limit(3).explain(true)
== Parsed Logical Plan ==
GlobalLimit 3
+- LocalLimit 3
   +- Sort [number#3 ASC NULLS FIRST], true
      +- Project [value#1 AS number#3]
         +- LocalRelation [value#1]

== Analyzed Logical Plan ==
number: int
GlobalLimit 3
+- LocalLimit 3
   +- Sort [number#3 ASC NULLS FIRST], true
      +- Project [value#1 AS number#3]
         +- LocalRelation [value#1]

== Optimized Logical Plan ==
GlobalLimit 3
+- LocalLimit 3
   +- Sort [number#3 ASC NULLS FIRST], true
      +- LocalRelation [number#3]

== Physical Plan ==
TakeOrderedAndProject(limit=3, orderBy=[number#3 ASC NULLS FIRST], output=[number#3])
+- LocalTableScan [number#3]

I think it's best seen in the Optimized Logical Plan section, but the Physical Plan says the same thing.
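
To make the two-level idea concrete, here is a minimal sketch in plain Scala (no Spark involved), assuming the partitions are modelled as a Seq[Seq[Int]]. The names PartialSortSketch, topK and topKAcrossPartitions are hypothetical helpers for illustration, not Spark APIs, and the bounded heap only stands in for whatever Spark actually uses internally.

```scala
import scala.collection.mutable

object PartialSortSketch {
  // Keep the k smallest elements of `xs` in a max-heap bounded at size k.
  // Only the k retained elements are ever ordered; everything else is dropped.
  def topK(xs: Iterator[Int], k: Int): Seq[Int] = {
    if (k <= 0) Seq.empty
    else {
      val heap = mutable.PriorityQueue.empty[Int] // largest element is at the head
      xs.foreach { x =>
        if (heap.size < k) heap.enqueue(x)
        else if (x < heap.head) {
          heap.dequeue() // evict the current largest of the k kept so far
          heap.enqueue(x)
        }
      }
      heap.toList.sorted // sort only the k survivors
    }
  }

  // Step 1: top k within each (simulated) partition.
  // Step 2: top k over the at most k * numPartitions local winners.
  def topKAcrossPartitions(partitions: Seq[Seq[Int]], k: Int): Seq[Int] = {
    val localWinners = partitions.flatMap(p => topK(p.iterator, k))
    topK(localWinners.iterator, k)
  }
}
```

For example, splitting the ten numbers from the question into the three made-up partitions Seq(83, 90, 40, 94), Seq(12, 70, 56) and Seq(70, 28, 91) and calling topKAcrossPartitions(..., 3) yields 12, 28, 40 without ever fully sorting any partition.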

edited Jul 26 '20 at 18:47

answered Jul 26 '20 at 18:14

mazaneicha


Nice, thanks. And do you know if the local sorts, and the final global one, are full or partial? (I've edited my question accordingly.) – Tobias Hermann Jul 27 '20 at 11:50

Good question! I believe it uses Guava's TopK internally, so it should be partial. But looking at the source might be the only option if you want to be really sure :) – mazaneicha Jul 27 '20 at 12:28
@abiratsis pointed to his great answer confirming this ^ – mazaneicha Jul 29 '20 at 14:05
Right, so on both levels (partial and final) no full sort is happening. That's good news! :) – Tobias Hermann Jul 29 '20 at 15:00

score -1 · Answer 2 · answered Jul 26 '20 at 16:33

myDF.orderBy("number").limit(3).show
myDF.limit(3).orderBy("number").show

1 => will sort first and then pick the first 3 elements of the sorted result.

2 => will take the first 3 elements of the DataFrame as it is and then sort only those 3.
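
The difference in results can be sketched with a plain Scala collection standing in for the DataFrame. This is only an analogy, assuming the rows come back in their original order; on a distributed DataFrame, which 3 rows an unordered limit(3) returns is generally not something to rely on. The object name LimitOrderSketch is made up for this example.

```scala
object LimitOrderSketch {
  // Same values as in the question, in the same order.
  val numbers: Seq[Int] = Seq(83, 90, 40, 94, 12, 70, 56, 70, 28, 91)

  // Analogue of orderBy("number").limit(3): the 3 smallest values overall.
  val sortThenLimit: Seq[Int] = numbers.sorted.take(3)

  // Analogue of limit(3).orderBy("number"): the first 3 values, then sorted.
  val limitThenSort: Seq[Int] = numbers.take(3).sorted
}
```

Here sortThenLimit is 12, 28, 40, while limitThenSort is 40, 83, 90, so the two statements are not interchangeable.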

rohitash jain


I know. But that is not what I'm asking. ;) – Tobias Hermann Jul 26 '20 at 17:25
