I'm writing a process that needs to generate a UUID for certain groups that match based on some criteria. I got my code working, but I'm worried about potential issues from creating the UUID within my UDF (thus making it non-deterministic). Here's a simplified example of some code to illustrate:
from uuid import uuid1
from pyspark.sql import SparkSession
from pyspark.sql.functions import PandasUDFType, pandas_udf
spark = (
    SparkSession.builder.master("local")
    .appName("Word Count")
    .config("spark.some.config.option", "some-value")
    .getOrCreate()
)
df = spark.createDataFrame([["j", 3], ["h", 3], ["a", 2]], ["name", "age"])
@pandas_udf("name string, age integer, uuid string", PandasUDFType.GROUPED_MAP)
def create_uuid(df):
    df["uuid"] = str(uuid1())
    return df
>>> df.groupby("age").apply(create_uuid).show()
+----+---+--------------------+
|name|age|                uuid|
+----+---+--------------------+
|   j|  3|1f8f48ac-0da8-430...|
|   h|  3|1f8f48ac-0da8-430...|
|   a|  2|d5206d03-bcce-445...|
+----+---+--------------------+
This currently runs in a data-processing job over 200k records on AWS Glue, and I haven't found any bugs yet.
I use uuid1 because it includes the host's node information in the UUID, which should ensure that no two nodes generate the same ID.
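To illustrate what I mean by node information, here is a minimal sketch using only the standard library. The `EXAMPLE_NODE` value is a made-up placeholder so the field is easy to see; in the real job `uuid1()` is called with no arguments and the library chooses the node value itself.

```python
from uuid import uuid1

# Hypothetical 48-bit node value, fixed here only to make the field visible;
# calling uuid1() with no arguments lets the library pick the host's node ID.
EXAMPLE_NODE = 0x0123456789AB

first = uuid1(node=EXAMPLE_NODE)
second = uuid1(node=EXAMPLE_NODE)

print(first.version)     # UUID version number
print(hex(first.node))   # the node field embedded in the UUID
print(first != second)   # successive calls differ through the timestamp field
```

The node field is what distinguishes hosts, while the timestamp field distinguishes successive calls on the same host.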
One thought I had was to register the UDF as non-deterministic with:
udf = pandas_udf(
    create_uuid, "name string, age integer, uuid string", PandasUDFType.GROUPED_MAP
).asNondeterministic()
But that gave me the following error:
py4j.protocol.Py4JJavaError: An error occurred while calling o60.flatMapGroupsInPandas.
: org.apache.spark.sql.AnalysisException: nondeterministic expressions are only allowed in
Project, Filter, Aggregate or Window, found:
`age`,create_uuid(name, age),`name`,`age`,`uuid`
in operator FlatMapGroupsInPandas [age#1L], create_uuid(name#0, age#1L), [name#7, age#8, uuid#9]
;;
FlatMapGroupsInPandas [age#1L], create_uuid(name#0, age#1L), [name#7, age#8, uuid#9]
+- Project [age#1L, name#0, age#1L]
   +- LogicalRDD [name#0, age#1L], false
My questions are:
- What are some potential issues this could encounter?
- If it does have potential issues, what are some ways in which I could make this deterministic?
- Why can't GROUPED_MAP functions be labeled as non-deterministic?
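To make the second question concrete, one direction that might work is deriving the ID from the group key with `uuid5`, so that recomputing a group yields the same value. This is only a sketch under assumptions: the namespace string and the helper name `group_uuid` are hypothetical, and I have not tried this on Glue.

```python
from uuid import NAMESPACE_OID, uuid5

# Hypothetical namespace for this job; any fixed UUID would serve the same purpose.
GROUP_NAMESPACE = uuid5(NAMESPACE_OID, "example-grouping-job")


def group_uuid(key):
    """Return the same UUID string every time for the same group key."""
    return str(uuid5(GROUP_NAMESPACE, str(key)))


print(group_uuid(3) == group_uuid(3))  # same key, same ID
print(group_uuid(3) == group_uuid(2))  # different keys, different IDs
```

Inside `create_uuid` the assignment would then become something like `df["uuid"] = group_uuid(df["age"].iloc[0])`. The trade-off is that the same key would map to the same UUID across runs, which may or may not be acceptable.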