I have multiple data files belonging to different weeks, all in the same format. I need to consolidate them using Scala code running on Spark. The end result should contain only unique records by key, and when the same key appears in several files it should keep the record from the latest file.
Each data file can have close to half a billion records, so the code needs to perform well.
Example:
Latest data file
CID PID Metric
C1  P1  10
C2  P1  20
C2  P2  30
Previous data File
CID PID Metric
C1  P1  20
C2  P1  30
C3  P1  40
C3  P2  50
Oldest data File
CID PID Metric
C1  P1  30
C2  P1  40
C3  P1  50
C3  P2  60
C4  P1  30
Expected output
CID PID Metric
C1  P1  10
C2  P1  20
C2  P2  30
C3  P1  40
C3  P2  50
C4  P1  30
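One way to sketch this in Spark: tag each file with a recency rank as it is read (newest file = rank 0), union everything, then keep one row per (CID, PID) via a window ordered by that rank. The file paths, column names, and CSV format below are assumptions based on the example; adjust them to your actual layout.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

object ConsolidateWeeklyFiles {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("ConsolidateWeeklyFiles").getOrCreate()
    import spark.implicits._

    // Hypothetical paths: list the files newest first so the
    // latest file gets the lowest recency rank and wins the dedup.
    val filesNewestFirst = Seq("latest.csv", "previous.csv", "oldest.csv")

    val combined = filesNewestFirst.zipWithIndex.map { case (path, rank) =>
      spark.read.option("header", "true").csv(path)
        .withColumn("recency", lit(rank))
    }.reduce(_ unionByName _)

    // One row per key, preferring the row from the most recent file.
    val w = Window.partitionBy("CID", "PID").orderBy($"recency")
    val result = combined
      .withColumn("rn", row_number().over(w))
      .filter($"rn" === 1)
      .drop("rn", "recency")

    result.write.mode("overwrite").option("header", "true").csv("consolidated")
    spark.stop()
  }
}
```

At half a billion rows per file the window approach triggers a full shuffle on the key columns, which is unavoidable for a global dedup; make sure `spark.sql.shuffle.partitions` is sized for your cluster. If the key space is much smaller than the row count, an alternative is `groupBy("CID", "PID").agg(min_by(...))`-style aggregation on the recency rank, which can use partial aggregation before the shuffle.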