I have a sample Hive/Spark table as below:
| row_key | data_as_of_date | key | value |
|---|---|---|---|
| A | 20210121 | key1 | value1 |
| A | 20210121 | key2 | value2 |
| A | 20210121 | key3 | value3 |
| B | 20210121 | key1 | value1 |
| B | 20210121 | key2 | value1 |
| B | 20210121 | key3 | value2 |
| B | 20210121 | key4 | value3 |
| C | 20210121 | key1 | value2 |
...and so on.
I have another Hive/Spark table with the same columns. Sample data below:
| row_key | data_as_of_date | key | value |
|---|---|---|---|
| A | 20210121 | key1 | value1 |
| A | 20210121 | key2 | value2 |
| B | 20210121 | key1 | value1 |
| B | 20210121 | key4 | value3 |
| C | 20210121 | key1 | value2 |
row_key is the joining column between these two tables, and the same row_key can repeat in multiple rows in both tables.
I am struggling to write a Spark SQL query (or the equivalent DataFrame operations) that selects all rows from table 1 whose key value is not present in table 2 for the same row_key.
For the sample data the result should be:
| row_key | data_as_of_date | key | value |
|---|---|---|---|
| A | 20210121 | key3 | value3 |
| B | 20210121 | key2 | value1 |
| B | 20210121 | key3 | value2 |
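I suspect the right shape is a left anti join on both row_key and key, but I'm not sure I have the syntax right. A sketch of what I've been attempting (assuming the two tables are loaded as DataFrames df1 and df2, and also registered as table1 and table2 for the SQL version):

```scala
// DataFrame API: keep only rows of df1 whose (row_key, key) pair
// has no match in df2.
val result = df1.join(df2, Seq("row_key", "key"), "left_anti")

// Equivalent Spark SQL:
val resultSql = spark.sql("""
  SELECT t1.*
  FROM table1 t1
  LEFT ANTI JOIN table2 t2
    ON t1.row_key = t2.row_key AND t1.key = t2.key
""")
```

Is this the correct approach, or is there a better way?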
Please help with the Spark SQL query or the DataFrame operations in Scala.
Let me know if any more info is required.
