I am trying to add a column to my Spark DataFrame that flags whether each value in a column also appears in a separate DataFrame.
This is my main Spark DataFrame (df_main):
+--------+
|main    |
+--------+
|28asA017|
|03G12331|
|1567L044|
|02TGasd8|
|1asd3436|
|A1234567|
|B1234567|
+--------+
This is my reference (df_ref). There are hundreds of rows in this reference, so I obviously can't hard-code them like this solution or this one:
+--------+
|mask_vl |
+--------+
|A1234567|
|B1234567|
...
+--------+
Normally, this is what I'd do with a pandas DataFrame:
import numpy as np

df_main['is_inref'] = np.where(df_main['main'].isin(df_ref.mask_vl.values), "YES", "NO")
So that I would get this
+--------+--------+
|main    |is_inref|
+--------+--------+
|28asA017|NO      |
|03G12331|NO      |
|1567L044|NO      |
|02TGasd8|NO      |
|1asd3436|NO      |
|A1234567|YES     |
|B1234567|YES     |
+--------+--------+
I have tried the following code, but I don't understand what the error in the picture means:
df_main = df_main.withColumn('is_inref', "YES" if F.col('main').isin(df_ref) else "NO")
df_main.show(20, False)
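I suspect I need to either collect the reference column into a Python list first, or do some kind of join, roughly like the untested sketch below (the name ref_values is just something I made up), but I don't know whether this is the idiomatic way or whether it will scale:

from pyspark.sql import functions as F

# Pull the reference values onto the driver as a plain Python list
# (hopefully fine while df_ref only has hundreds of rows)
ref_values = [row['mask_vl'] for row in df_ref.select('mask_vl').distinct().collect()]

df_main = df_main.withColumn(
    'is_inref',
    F.when(F.col('main').isin(ref_values), 'YES').otherwise('NO')
)
df_main.show(20, False)

If collecting to the driver is the wrong approach, I assume the alternative is a left join against df_ref with the flag built via when/otherwise, so I'd also like to know which of the two is preferred.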