0

I have a couple of tab delimited files.

Each file is structured like so:

ID          Title                                 Rating    Date_Rated

What I want to do is merge all these files into one, and keep only the latest rating.

file1 may have data such as:

70202148    Sherlock Holmes: A Game of Shadows    5         28/12/13

file2 may have data such as:

70202148    Sherlock Holmes: A Game of Shadows    4.5       25/12/13

2 Answers2

1

Assuming the date field is in dd/mm/yy format following should do the trick:

cat file1 file2 ... | \
    sort -t$'\t' -n -k1,1 -k4.7r -k4.4r -k4.1r | sort -t$'\t' -k1,1n -u
0

Using shell tools for this task would be dangerous as sort will not be able to understand date formats. An example is that if you run Miroslav's code in the below file the output will print the line with date record 28/02/14 which is wrong.

$ cat file.txt 
70202148    Sherlock Holmes: A Game of Shadows    5         28/12/14
70202148    Sherlock Holmes: A Game of Shadows    5         28/02/14
70202148    Sherlock Holmes: A Game of Shadows    5         28/12/13
70202148    Sherlock Holmes: A Game of Shadows    5         28/12/13

Instead of using shell tools we need to use high level scripting/programming languages for this. You can use Python, Perl, Ruby or any other language for this. Below is a Python script which does the job.

#!/usr/bin/env python3


import datetime

data = {}

for line in open('file.txt'):

    line = line.strip().split()
    if len(line) == 0:
        continue

    if line[0] not in data:
        date = datetime.datetime.strptime(line.pop(-1), '%d/%m/%y')
        data[line.pop(0)] = {'rating':line.pop(-1), 'year':date, 'title': ' '.join(line[1:]) }
    else:
        date = datetime.datetime.strptime(line.pop(-1), '%d/%m/%y')
        if date > data[line[0]]['year']:
            data[line.pop(0)] = {'rating':line.pop(-1), 'year':date, 'title': ' '.join(line[1:]) }


for val in sorted(data):
    print('{} {} {} {}'.format(val, data[val]['title'], data[val]['rating'], data[val]['year'].strftime('%d/%m/%y')))

Output:

$ ./filter.py 
70080038 Iron Man 4 18/02/14
70202148 Sherlock Holmes: A Game of Shadows 5 28/12/14
Kannan Mohan
  • 458
  • 2
  • 6