I have a pyspark dataframe:
Example:
text <String> | name <String> | original_name <String>
----------------------------------------------------------------------------
HELLOWORLD2019THISISGOOGLE | WORLD2019 | WORLD_2019
----------------------------------------------------------------------------
NATUREISVERYGOODFORHEALTH | null | null
----------------------------------------------------------------------------
THESUNCONTAINVITAMIND | VITAMIND | VITAMIN_D
----------------------------------------------------------------------------
BECARETOOURHEALTHISVITAMIND | OURHEALTH | OUR_/HEALTH
----------------------------------------------------------------------------
I want to loop the name column and look if name values exists in text, if yes, I create a new_column, will be contain the original_name value of the name values found in text. Knowing that some times the dataframe columns are null.
Example:
in the line 4 in the dataframe example, the
textcontain 2 values fromnamecolumn:[OURHEALTH, VITAMIND], I should take itsoriginal_namevalues and store them in anew_column.in the line 2, the
textcontainOURHEALTHfromnamecolumn, I should store in thenew_columnthe originalnamevalue that found ==>[OUR_/HEALTH]
Expect result:
text <String> | name <String> | original_name <String> | new_column <Array>
------------------------------|------------------|---------------------------|----------------------------
HELLOWORLD2019THISISGOOGLE | WORLD2019 | WORLD_2019 | [WORLD_2019]
------------------------------|------------------|---------------------------|----------------------------
NATUREISVERYGOODFOROURHEALTH | null | null | [OUR_/HEALTH]
------------------------------|------------------|---------------------------|----------------------------
THESUNCONTAINVITAMIND | VITAMIND | VITAMIN_D | [VITAMIN_D]
------------------------------|------------------|---------------------------|----------------------------
BECARETOOURHEALTHISVITAMIND | OURHEALTH | OUR_/HEALTH | [OUR_/HEALTH, VITAMIN_D ]
-----------------------------------------------------------------------------|----------------------------
I hope that I was clear in my explanation.
I tried by the following code:
df = df.select("text", "name", "original_name").agg(collect_set("name").alias("name_array"))
for name_item in name_array:
df.withColumn("new_column", F.when(df.text.contains(name_item), "original_name").otherwise(None))
Someone can help me please ? Thank you