Spark Dataframe native performance vs Pyspark RDD map on simple string split operation

Question

I don't expect the following code to benefit from the Dataframe Catalyst query optimizer, but I do expect there to be a performance difference between the Scala/native performance of string split and the Python performance. However, my performance results are disappointing, as the native Dataframe API appears to be slower.

My test is as follows:

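import pyspark.sql.functions

def get_df(spark):
    # Load the CSV test data, inferring column types from the contents.
    return spark.read.load('s3://BUCKET/test-data.csv',
                           format='com.databricks.spark.csv',
                           inferSchema='true', header='true')

def upsize_df(df, exponent=10):
    # Each self-union doubles the row count (2**exponent times the original).
    for i in range(exponent):
        df = df.unionAll(df)
    return df

def rdd_ver(df):
    # Python path: split order_id inside an RDD map, then rebuild a DataFrame.
    df = df.rdd.map(lambda row: row + tuple(
                        row.order_id.split('-'))).toDF(
                            df.columns + ['psrid', 'eoid'])
    df.show()

def df_ver(df):
    # Native path: split order_id with the built-in SQL function.
    split_col = pyspark.sql.functions.split(df['order_id'], '-')
    df = df.withColumn('psrid', split_col.getItem(0))
    df = df.withColumn('eoid', split_col.getItem(1))
    df.show()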
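
Both functions are meant to add the same two columns. As a plain-Python sketch of the per-row logic in rdd_ver, here is the same expression applied to a namedtuple standing in for a PySpark Row (which is also a tuple subclass). The field names and the order_id value are made up for illustration:

```python
from collections import namedtuple

# Stand-in for pyspark.sql.Row; field names and values are hypothetical.
Row = namedtuple("Row", ["order_id", "amount"])

def add_split_fields(row):
    """Append the '-'-separated parts of order_id, as rdd_ver's lambda does."""
    return row + tuple(row.order_id.split("-"))

example = add_split_fields(Row(order_id="123-456", amount=9.99))
print(example)  # ('123-456', 9.99, '123', '456')
```

Adding a tuple to the row yields a plain tuple, which is why rdd_ver has to pass the column names to toDF again.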

Cluster/YARN details:

Spark 2.0 on AWS
6 executors
2 cores per executor

Test procedure:

Create a new PySpark shell in IPython
Get a Dataframe of the toy-sized dataset (1000 rows)
Repartition the Dataframe to 12 partitions
Run upsize_df (unionAll) to get to roughly 1 million rows
Run df.count() to force execution of the repartition and upsize_df
Finally, run %time rdd_ver(df) or %time df_ver(df)
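
For reference, the arithmetic behind the upsizing step: each pass through the loop in upsize_df unions the Dataframe with itself, doubling it. The sketch below is plain Python, not Spark. The partition figure rests on the assumption that a union concatenates the partitions of its inputs, which is worth checking with df.rdd.getNumPartitions() on the actual cluster:

```python
def upsized_counts(rows, partitions, exponent=10):
    """Return (rows, partitions) after `exponent` self-unions.

    Assumption: each union doubles both the row count and the partition
    count, i.e. nothing coalesces the partitions in between.
    """
    factor = 2 ** exponent
    return rows * factor, partitions * factor

rows, partitions = upsized_counts(rows=1000, partitions=12)
print(rows, partitions)  # 1024000 12288
```

If that assumption holds, the 1,024,000 rows end up spread over far more partitions than the 12 available cores, so some of the measured time may be scheduling overhead rather than string splitting.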

My results so far have been surprising and disappointing. Here is a sampling of the results I've received, in seconds:

rdd_ver: 14.5, 22.4, 13.1, 24.7, 17.8 --- mean: 18.5

df_ver: 30.5, 26.9, 32.0, 29.7, 39.8 --- mean: 31.8
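
The means can be reproduced, and the run-to-run spread made explicit, with the standard library. This is only a summary of the five timings listed for each version, not a new measurement:

```python
import statistics

rdd_times = [14.5, 22.4, 13.1, 24.7, 17.8]  # seconds, rdd_ver runs
df_times = [30.5, 26.9, 32.0, 29.7, 39.8]   # seconds, df_ver runs

for name, times in (("rdd_ver", rdd_times), ("df_ver", df_times)):
    mean = statistics.mean(times)
    spread = statistics.stdev(times)  # sample standard deviation
    print(f"{name}: mean={mean:.1f}s stdev={spread:.1f}s")
```

With only five runs per version and this much variation between runs, the comparison is indicative at best.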

I'd appreciate any thoughts, either on the test procedure itself (the operation itself is derived from some production code) or on the poor performance of the Dataframe version.

EDIT:

The Spark Web UI indicates that my jobs are not being scheduled/submitted very quickly. I am not sure how reliable the Web UI's information is, but the 'Submitted' time displayed on the active job in this screenshot is over a minute after I hit 'enter' in the active PySpark session to kick off %time df_ver(df).

Furthermore, it seems that none of the 6 executors are doing anything. They've all apparently been killed by Spark since I wasn't actively doing anything in the Spark session for more than a few seconds. It seems like the entire job is being run by the driver node, but I can't confirm that since I don't know the Spark Web UI well enough.

I would run `df.explain(extended=True)` to see the plans. Also, look at the web UI's SQL tab and drill down to jobs/tasks and other metrics. – Jacek Laskowski, Aug 31 '16 at 21:26
It does seem like the wall clock numbers are very different from what the Spark Web UI is giving me. However, the Spark Web UI numbers are pretty hard to interpret. I'm attaching a screenshot to the question. – Peter Gaultney, Aug 31 '16 at 23:13

score 0 · Answer 1 · answered Aug 31 '16 at 23:55

Why do you think it should be faster in Scala? Python string operations are very fast:

Python:

In [58]: %time "this is my string".split()
CPU times: user 5 µs, sys: 1 µs, total: 6 µs
Wall time: 7.87 µs

Scala:

bash-3.2$ echo '
object TimeSplit {
   def main(args: Array[String]): Unit = {
     val now = System.nanoTime
     val split = "this is my string".split(" ")
     val diff = System.nanoTime - now
     println("%d microseconds".format(diff/1000))
   }
 }' > timesplit.scala

bash-3.2$ scalac timesplit.scala
bash-3.2$ scala TimeSplit
21 microseconds
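
A single timed call like the ones above mostly measures noise (and, on the JVM, warm-up), so the two figures are only a rough indication. A slightly more careful sketch for the Python side, using the standard timeit module to average over many calls. The function name and the repeat counts are arbitrary choices for illustration:

```python
import timeit

def time_split(text="this is my string", number=100_000, repeat=5):
    """Return the best per-call time in microseconds for text.split()."""
    timer = timeit.Timer("text.split()", globals={"text": text})
    # Each repeat runs the statement `number` times; keep the fastest total.
    best_total = min(timer.repeat(repeat=repeat, number=number))
    return best_total / number * 1e6

print(f"{time_split():.3f} microseconds per split")
```

A comparable Scala measurement would need the same treatment: many iterations after a warm-up phase.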

maxymoo


This doesn't seem to me like a very robust benchmark. I would expect Scala to be faster for large numbers of operations because it is running compiled bytecode on the JVM as opposed to interpreted code. Also, using the Dataframe API should require fewer context switches, since there will be no need to invoke a Python interpreter on the worker nodes. Even if Python is just as fast as Scala (it generally isn't https://www.quora.com/Which-one-is-faster-Scala-or-Python), I would expect the Dataframe API to be as fast as Python RDD maps. I'm not observing that, hence my question. – Peter Gaultney Sep 01 '16 at 01:12
The bottleneck won't be starting Python interpreters. In any case, my point is that the string processing is actually faster in Python than in Scala, so that could explain your timings. – maxymoo Sep 01 '16 at 01:59
Take a look at http://stackoverflow.com/questions/32464122/spark-performance-for-scala-vs-python for a more in-depth exploration. – maxymoo Sep 01 '16 at 01:59
