I created a Spark Dataset[Long]:
scala> val ds = spark.range(100000000)
ds: org.apache.spark.sql.Dataset[Long] = [id: bigint]
When I ran `ds.count`, it gave me the result in 0.2 s (on a 4-core, 8 GB machine). The DAG it created is as follows:
But when I ran `ds.rdd.count`, it gave me the result in 4 s (same machine), and the DAG it created is as follows:
So, my questions are:
- Why is `ds.rdd.count` creating only one stage, whereas `ds.count` is creating 2 stages?
- Also, when `ds.rdd.count` has only one stage, why is it slower than `ds.count`, which has 2 stages?
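For reference, here is a minimal sketch of how both counts can be timed side by side in spark-shell; it assumes Spark 2.1+, where `spark.time` is available:

```scala
// Minimal sketch: time both counts on the same Dataset in spark-shell.
// Assumes Spark 2.1+, where SparkSession.time is available.
val ds = spark.range(100000000)

// Count through the Dataset/SQL API.
spark.time(ds.count())

// Count through the underlying RDD (ds.rdd converts each row back to a JVM object first).
spark.time(ds.rdd.count())
```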

