I have 2 dataframes.
One dataframe DF1 has a column with values that are seperated by a delimiter say comma. The second dataframe DF2 has a column with single value (this could be part of the comma seperated column values in the other dataframe, DF1). I need to iterate DF2 records/rows, and see if DF2.color exists in the comma seperated column values in DF1.csv_column, and if exists add the df1 row ID TO A NEW DATAFRAME.
df1= sqlContext.createDataFrame([("A001","RED, WHITE, BLUE"),("A002","RED, YELLOW"),("A003","GREEN,RED"),("A004","WHITE,YELLOW")], ["id","csv_column"])
df1.show()
df2= sqlContext.createDataFrame([("C1","RED"),("C2","WHITE"),("C3","BLUE"),("C4","YELLOW"),("C5","RED"),("C6","GREEN"),("C7","BLUE")], ["CLRCODE","COLOR"])
df2.show()
+----+----------------+ 
| id | csv_column     | 
+----+----------------+ 
|A001|RED, WHITE, BLUE| 
|A002|RED, YELLOW     | 
|A003|GREEN, RED      | 
|A004|WHITE, YELLOW   |
 +----+----------------+
+-------+-------+ 
|CLRCODE| COLOR | 
+-------+-------+ 
| C1    | RED   | 
| C2    | WHITE | 
| C3    | BLUE  | 
| C4    | YELLOW| 
| C5    | RED   | 
| C6    | GREEN | 
| C7    | BLUE  | 
+-------+-------+
Expected Result: The column csv_column in df1 has RED,WHITE,BLUE and so I have add the IDs for RED,WHITE,BLUE as a row to the new dataframe, and so on. Please note CLRCODE in DF2, is just a place holder and is not used hence. How to get this result.
+-------+
|df1.id |
+-------+
|A001   |
|A002   |
|A003   |
|A001   |
|A004   |
|A001   |
|A002   |
|A004   |
|A001   |
|A002   |
|A003   |
|A003   |
|A001   |
+-------+
I checked in this SO solution here, but there the dataframe comma seperated column is validated against a static string value, but I need to iterate through a dataframe which contains many rows of various values.
Thanks for the help.
 
    