I have a HIVE table with 10 columns where first 9 columns will have duplicate rows while the 10th column will not as it CREATE_DATE which will have the date it was created.
Example:
If I insert 10 rows into the table today it will have the CREATE_DATE as todays date.. If I insert the same 10 rows again tomorrow it will have a different CREATE_DATE which creates the problem of me using DISTINCT..
Is there a way of deleting the duplicate records based on 9 columns and ignoring the 10th.
Example: Lets consider i have 5 columns in the table. This is an EXTERNAL HIVE TABLE partitioned by DAYID and MARKETID. Whenever the columns other than CREATEDATE (as referred by Row 1 and 2) are same OR if the rows are duplicate (as referred by Row 3 and 4) it should retain any one of those rows. Doesn't matter which it retains.
COL1 COL2 CREATEDATE   DAYID    MARKETID  
A     1    20131206   20131207 1234  
A     1    20131207   20131207 1234  
A     1    20131206   20131207 1234  
B     1    20131206   20131207 1234  
B     1    20131206   20131207 1234  
C     2    20131206   20131207 1234  
C     2    20131207   20131207 5678 
output---
COL1 COL2 CREATEDATE   DAYID    MARKETID
A     1    20131206   20131207   1234
B     1    20131206   20131207   1234
C     2    20131206   20131207   1234
C     2    20131207   20131207   5678
Thanks Nates
 
     
     
     
    