I have a dataframe and I apply a function to one of its columns. The function returns a numpy array. The code looks like this:
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, FloatType
create_vector_udf = udf(create_vector, ArrayType(FloatType()))
dataframe = dataframe.withColumn('vector', create_vector_udf('text'))
dataframe.select('lang','url','vector').show(20)
Spark does not seem to accept a numpy array as the return value, even though the return type is declared as ArrayType(FloatType()).
I get the following error message:
net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for numpy.core.multiarray._reconstruct)
I could just call numpyarray.tolist() and return a list version of it, but then I would have to recreate the array every time I want to use it with numpy.
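To make the workaround concrete, here is a minimal sketch of the round trip I mean, without Spark involved. The body of create_vector is a hypothetical stand-in (my real function is different), and create_vector_as_list is just an illustrative wrapper name:

```python
import numpy as np


def create_vector(text):
    # Hypothetical stand-in for the real function: one float per character.
    return np.array([float(ord(c)) for c in text], dtype=np.float32)


def create_vector_as_list(text):
    # tolist() turns the array into a plain Python list of Python floats,
    # which is the kind of value ArrayType(FloatType()) expects from a UDF.
    return create_vector(text).tolist()


# What would be stored in the column:
as_list = create_vector_as_list("abc")

# What I would have to do every time I read the column back:
restored = np.asarray(as_list, dtype=np.float32)
```

The wrapper is what I would pass to udf() instead of create_vector. The np.asarray call on the way back is the extra step I would like to avoid.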
So is there a way to store a numpy array directly in a dataframe column?