Removing DUPLICATE rows in hive based on columns

Question

I have a HIVE table with 10 columns where first 9 columns will have duplicate rows while the 10th column will not as it CREATE_DATE which will have the date it was created.

Example:

If I insert 10 rows into the table today it will have the CREATE_DATE as todays date.. If I insert the same 10 rows again tomorrow it will have a different CREATE_DATE which creates the problem of me using DISTINCT..

Is there a way of deleting the duplicate records based on 9 columns and ignoring the 10th.

Example: Lets consider i have 5 columns in the table. This is an EXTERNAL HIVE TABLE partitioned by DAYID and MARKETID. Whenever the columns other than CREATEDATE (as referred by Row 1 and 2) are same OR if the rows are duplicate (as referred by Row 3 and 4) it should retain any one of those rows. Doesn't matter which it retains.

COL1 COL2 CREATEDATE   DAYID    MARKETID  
A     1    20131206   20131207 1234  
A     1    20131207   20131207 1234  
A     1    20131206   20131207 1234  
B     1    20131206   20131207 1234  
B     1    20131206   20131207 1234  
C     2    20131206   20131207 1234  
C     2    20131207   20131207 5678

output---

COL1 COL2 CREATEDATE   DAYID    MARKETID
A     1    20131206   20131207   1234
B     1    20131206   20131207   1234
C     2    20131206   20131207   1234
C     2    20131207   20131207   5678

Thanks Nates

Can you provide some sample data here and the output you want ? — Mukesh S, Dec 06 '13 at 06:34
Example: Lets consider i have 5 columns in the table. COL1 COL2 CREATEDATE DAYID MARKETID A — user3072054, Dec 06 '13 at 14:11

dimamah · Accepted Answer · 2016-06-05T12:01:54.100

21

You can do the following :

select col1,col2,dayid,marketid,max(createdate) as createdate
from tablename
group by col1,col2,dayid,marketid

This way you are grouping the data by all the columns except the data so if there are rows with the same values in these columns they will be in the same group, and then, just "choose" the createdate you want by using an aggregate function like max/min etc.

edited Jun 05 '16 at 12:01

answered Dec 07 '13 at 10:26

dimamah

2,883
18
31

score 1 · Answer 2 · answered Jan 15 '20 at 03:54

1

we don't need to write all the column name in sql code by this way:

select * from (
  select *, row_number() over (partition by (col1, col2) order by col1) tmp_row_number
  from table_name
) t
where t.tmp_row_number==1

the only side effect is add an extra column tmp_row_number to the table.

answered Jan 15 '20 at 03:54

shiqin zhang

400
1
4
19

You can remove the extra column by specifying the select (not using `*`). – Mukyuu Jan 15 '20 at 04:12
this sql is used when the table has thousands of columns and we don't want to write all the column names in the sql code ^_^ – shiqin zhang Jan 15 '20 at 07:55
1

in that case, you can try to put the result in a temp table and remove the extra column before dropping the temp table: https://stackoverflow.com/a/1712243/3654837 – Mukyuu Jan 15 '20 at 08:41

score 0 · Answer 3 · edited Mar 28 '17 at 10:24

Well, hive does not provide row level update/delete, therefore we can avoid the duplicate data while loading the data in base tables.As shown below

CREATE TABLE RAW_TABLE  
(
    COL1 STRING,
    COL2 STRING,
    CREATEDATE STRING,
    DAYID STRING,
    MARKETID STRING
)
ROW FORMAT DELIMITED 
FIELDS TERMINATE BY'\t'
STORED AS TEXTFILE;

LOAD DATA INPATH '/FOLDER/TO/EXAMPLE.txt  INTO RAW_TABLE;

CREATE TABLE JLT_CLEAN AS
SELECT col1,
  col2,
  dayid,
  marketid,
  MAX(createdate) AS createdate
FROM JLT_STAHING
GROUP BY col1,
  col2,
  dayid,
  marketid;

This what we can use.

Removing DUPLICATE rows in hive based on columns

3 Answers3

Linked