I have a 90 GB file made up of JSON items, one per line. Below is a sample of 3 lines only:
{"description":"id1","payload":{"cleared":"2020-01-31T10:23:54Z","first":"2020-01-31T01:29:23Z","timestamp":"2020-01-31T09:50:47Z","last":"2020-01-31T09:50:47Z"}}
{"description":"id2","payload":{"cleared":"2020-01-31T11:01:54Z","first":"2020-01-31T02:45:23Z","timestamp":"2020-01-31T09:50:47Z","last":"2020-01-31T09:50:47Z"}}
{"description":"id3","payload":{"cleared":"2020-01-31T5:33:54Z","first":"2020-01-31T01:29:23Z","timestamp":"2020-01-31T07:50:47Z","last":"2019-01-31T04:50:47Z"}}
The end goal is, for each line, to take the max of first, cleared and last and update timestamp with that max, then sort all the items by timestamp. Ignore the sorting for now.
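To make the per-line step concrete, here is a minimal sketch of what I mean. It assumes every line is a valid JSON object with a payload key and that the timestamps follow the layout in the sample, and it compares them as parsed datetimes rather than as strings. The names parse_ts, update_timestamp and TS_FORMAT are made up for illustration.

```python
import json
from datetime import datetime

# Assumed timestamp layout, based only on the sample lines above.
TS_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def parse_ts(value):
    """Parse one timestamp string into a datetime for comparison."""
    return datetime.strptime(value, TS_FORMAT)


def update_timestamp(item):
    """Set payload['timestamp'] to the latest of first, cleared and last."""
    payload = item["payload"]
    # Skip keys that are missing or null.
    candidates = [
        payload[key]
        for key in ("first", "cleared", "last")
        if payload.get(key) is not None
    ]
    if candidates:
        # Keep the original string, but pick it by its parsed value.
        payload["timestamp"] = max(candidates, key=parse_ts)
    return item


line = '{"description":"id1","payload":{"cleared":"2020-01-31T10:23:54Z","first":"2020-01-31T01:29:23Z","timestamp":"2020-01-31T09:50:47Z","last":"2020-01-31T09:50:47Z"}}'
print(json.dumps(update_timestamp(json.loads(line))))
```

For the first sample line this should set timestamp to the cleared value, since that is the latest of the three.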
I initially converted the file into a single JSON array and used the code below:
#!/usr/bin/python
import json as simplejson
from collections import OrderedDict
with open("input.json", "r") as jsonFile:
    data = simplejson.load(jsonFile, object_pairs_hook=OrderedDict)
for x in data:
    maximum = max(x['payload']['first'],x['payload']['cleared'],x['payload']['last'])
    x['payload']['timestamp']= maximum
data_sorted = sorted(data, key = lambda x: x['payload']['timestamp'])
with open("output.json", "w") as write_file:
    simplejson.dump(data_sorted, write_file)
The above code worked for a small test file, but the script got killed when I ran it on the 90 GB file.
I then decided to process the file line by line using the code below:
#!/usr/bin/python
import sys
import json as simplejson
from collections import OrderedDict
first_arg = sys.argv[1]
data = []
with open(first_arg, "r") as jsonFile:
    for line in jsonFile:
        y = simplejson.loads(line,object_pairs_hook=OrderedDict)
        payload = y['payload']
        first  = payload.get('first', None)
        clearedAt = payload.get('cleared')
        last = payload.get('last')
        lst = [first, clearedAt, last]
        maximum = max((x for x in lst if x is not None))
        y['payload']['timestamp']= maximum
        data.append(y)
with open("jl2json_new.json", "w") as write_file:
    simplejson.dump(data, write_file, indent=4)
It still got killed, so I'm wondering what the best way to approach this problem is.
I tried the following approach but it wasn't helpful: https://stackoverflow.com/a/21709058/322541
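For context, the direction I'm considering next is to drop the data list entirely and write each updated item out as soon as it is read, so that memory use is bounded by a single line. This is only a rough sketch under the assumption of one JSON object per line. The function name rewrite_timestamps is hypothetical, sorting is left out, and the timestamps are still compared as plain strings like in my scripts above.

```python
import json


def rewrite_timestamps(src, dst):
    """Copy JSON lines from src to dst, updating payload['timestamp'] on each.

    src and dst are open text file objects. Only one item is held in
    memory at a time. Returns the number of items written.
    """
    count = 0
    for line in src:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        item = json.loads(line)
        payload = item["payload"]
        values = [payload.get(key) for key in ("first", "cleared", "last")]
        values = [v for v in values if v is not None]
        if values:
            # String comparison, same as in the scripts above.
            payload["timestamp"] = max(values)
        # Write immediately instead of appending to a list.
        dst.write(json.dumps(item) + "\n")
        count += 1
    return count
```

I would call it with the input file opened for reading and an output file opened for writing. The output stays in the one-item-per-line layout rather than becoming one big indented JSON document.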
 
     
    