I have a Spark Scala library and I am building a Python wrapper on top of it. One class of my library provides the following method:
package com.example

import org.apache.spark.sql.DataFrame

class F {
  def transform(df: DataFrame): DataFrame
}
and I am using Py4J in the following way to create a wrapper for F:
from pyspark import SparkContext

def F():
    return SparkContext.getOrCreate()._jvm.com.example.F()
which allows me to call the method transform from Python.
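For example (illustrative; df here stands for an ordinary pyspark.sql.DataFrame built elsewhere):

f = F()
result = f.transform(df)  # df is a Python DataFrame; the Java method expects a Java DataFrame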
The problem is that the Python DataFrame object is obviously different from the Java DataFrame object. So I need a way to convert a Python DataFrame into a Java one, for which I use the following code from the Py4J docs:
from py4j import protocol
from pyspark import SparkContext

class DataframeConverter(object):
    def can_convert(self, object):
        from pyspark.sql.dataframe import DataFrame
        return isinstance(object, DataFrame)

    def convert(self, object, gateway_client):
        from pyspark.ml.common import _py2java
        return _py2java(SparkContext.getOrCreate(), object)

protocol.register_input_converter(DataframeConverter())
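With that converter registered, the input side works: I can pass a pyspark DataFrame straight to transform and Py4J converts it on the way into the JVM (illustrative call below), but what comes back is a raw Py4J JavaObject:

f = F()
jresult = f.transform(df)  # df is converted automatically on the way in,
                           # but jresult is a py4j JavaObject, not a pyspark DataFrame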
My problem is that now I want to do the inverse: get a Java DataFrame back from transform and keep using it in Python. I tried to use protocol.register_output_converter, but I couldn't find any useful example, apart from code dealing with Java collections.
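In other words, I can do the round trip by hand for each call (a sketch; it assumes the private _java2py helper in pyspark.ml.common, the counterpart of _py2java, and skips the registered input converter for clarity), but I would like the output conversion to happen automatically:

from pyspark import SparkContext
from pyspark.ml.common import _py2java, _java2py

sc = SparkContext.getOrCreate()
jdf = _py2java(sc, df)          # pyspark DataFrame -> Java DataFrame
jresult = F().transform(jdf)    # Java-side call, returns a Java DataFrame
result = _java2py(sc, jresult)  # back to a pyspark.sql.DataFrame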
How can I do that? An obvious solution would be to create a Python class F which defines all the methods present in the Java F, forwards all Python calls to the JVM, gets back the result, and converts it accordingly. This approach works, but it means I have to redefine every method of F, which leads to code duplication and a lot more maintenance.
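To make that workaround concrete, it would look roughly like this (a sketch; FWrapper and its method bodies are illustrative, and every method of the Java class would need the same kind of forwarding code):

from pyspark import SparkContext
from pyspark.ml.common import _py2java, _java2py

class FWrapper(object):
    def __init__(self):
        self._sc = SparkContext.getOrCreate()
        self._jf = self._sc._jvm.com.example.F()

    def transform(self, df):
        jdf = _py2java(self._sc, df)        # Python -> Java DataFrame
        jresult = self._jf.transform(jdf)   # call into the JVM
        return _java2py(self._sc, jresult)  # Java -> Python DataFrame
    # ...and so on for every other method of com.example.F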