I would like to run a UDF on a pandas-on-Spark DataFrame. I thought it would be easy, but I'm having a tough time figuring it out.
For example, consider my psdf (pandas-on-Spark DataFrame):
     name     p1     p2
0     AAA    1.0    1.0
1     BBB    1.0    1.0
I have a simple function:
import math

def f(a: float, b: float) -> float:
    return math.sqrt(a**2 + b**2)
I expect the following psdf:
     name     p1     p2  R
0     AAA    1.0    1.0  1.4
1     BBB    1.0    1.0  1.4
The function is quite dynamic, and I showed only a sample here. For example, there is another function with 3 arguments.
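For instance, a three-argument variant might look like this (illustrative only; the real functions vary):
# Illustrative only: a variant of f with three arguments
def g(a: float, b: float, c: float) -> float:
    return math.sqrt(a**2 + b**2 + c**2)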
I tried the code below, but I get an error saying the compute.ops_on_diff_frames option is not set, and the documentation says enabling it is expensive, so I want to avoid it:
psdf["R"] = psdf[["p1","p2"]].apply(lambda x: f(*x), axis=1)
Note: I saw that one can convert to a regular Spark DataFrame and use withColumn, but I'm not sure whether that has a performance penalty. I sketch what I mean below.
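To clarify, the conversion route would be something like this (a sketch; I'm assuming a plain Python UDF, and its serialization cost is exactly what I'm unsure about):
from pyspark.sql import functions as F

sdf = psdf.to_spark()                         # pandas-on-Spark -> Spark DataFrame
f_udf = F.udf(f, "double")                    # wrap f as a Python UDF returning double
sdf = sdf.withColumn("R", f_udf("p1", "p2"))
psdf = sdf.pandas_api()                       # back to pandas-on-Spark (Spark 3.2+)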
Any suggestions?