I have a DataFrame that contains the following data:
|id|Name|Country|version|
|--|----|-------|-------|
|1 |Jack|UK     |new    |
|1 |Jack|USA    |old    |
|2 |Rose|Germany|new    |
|3 |Sam |France |old    |
I would like to create a DataFrame where, whenever rows are duplicated on "id", the "new" version is kept in preference to the "old" one, like so:
|id|Name|Country|version|
|--|----|-------|-------|
|1 |Jack|UK     |new    |
|2 |Rose|Germany|new    |
|3 |Sam |France |old    |
What is the best way to do this in Java/Spark, or do I have to use some sort of nested SQL query?
The simplified SQL version would look something like the query below:
```sql
WITH new_version AS (
    SELECT
      ad.id
      ,ad.name
      ,ad.country
      ,ad.version
    FROM allData ad
    WHERE ad.version = 'new'
),
old_version AS (
    SELECT
      ad.id
      ,ad.name
      ,ad.country
      ,ad.version
    FROM allData ad
    LEFT JOIN new_version nv ON nv.id = ad.id
    WHERE ad.version = 'old'
      AND nv.id IS NULL
)
SELECT id, name, country, version FROM new_version
UNION ALL
SELECT id, name, country, version FROM old_version
```
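
For reference, this is a rough sketch of the window-function approach I was considering in Java (assuming the DataFrame above is called `allData`); I'm not sure whether it is better than running the SQL:

```java
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.row_number;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.expressions.WindowSpec;

// Rank rows within each "id"; "new" sorts before "old" alphabetically,
// so ascending order on "version" puts the preferred row first.
WindowSpec byId = Window.partitionBy("id").orderBy(col("version").asc());

Dataset<Row> deduplicated = allData
        .withColumn("rn", row_number().over(byId)) // 1 = preferred row per id
        .filter(col("rn").equalTo(1))
        .drop("rn");
```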