I can't figure out a way to reduce memory usage for this program further. Basically, I'm reading from JSON log files into a pandas dataframe, but:
- the list append function is what is causing the issue. It creates two different objects in memory, causing huge memory usage.
- the .to_pickle method of pandas is also a huge memory hog, because the biggest spike in memory is when writing to the pickle.
Here is my most efficient implementation to date:
import glob
import json

import pandas as pd

columns = ['eventName', 'sessionId', "eventTime", "items", "currentPage", "browserType"]
df = pd.DataFrame(columns=columns)
l = []
for i, file in enumerate(glob.glob("*.log")):
    print("Going through log file #%s named %s..." % (i+1, file))
    with open(file) as myfile:
        l += [json.loads(line) for line in myfile]
        tempdata = pd.DataFrame(l)
        for column in tempdata.columns:
            if not column in columns:
                try:
                    tempdata.drop(column, axis=1, inplace=True)
                except ValueError:
                    print("oh no! We've got a problem with %s column! It don't exist!" % column)
        l = []
        # DataFrame.append was removed in pandas 2.0; pd.concat copies the data in the same way
        df = pd.concat([df, tempdata], ignore_index=True)
        # very slow version, but is most memory efficient
        # length = len(df)
        # length_temp = len(tempdata)
        # for i in range(1, length_temp):
        #     update_progress((i*100.0)/length_temp)
        #     for column in columns:
        #         df.at[length+i, column] = tempdata.at[i, column]
        tempdata = 0
print ("Data Frame initialized and filled! Now Sorting...")
df.sort_values(by=["sessionId", "eventTime"], inplace=True)
print ("Done Sorting... Changing indices...")
df.index = range(1, len(df)+1)
print ("Storing in Pickles...")
df.to_pickle('data.pkl')
Is there an easy way to reduce memory? The commented code does the job but takes 100-1000x longer. I'm currently at 45% memory usage at max during the .to_pickle part, 30% during the reading of the logs. But the more logs there are, the higher that number goes.
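One possible direction, offered only as a sketch and not measured on the real logs: drop the unwanted keys while each line is parsed, build one small DataFrame per file, and concatenate once at the end instead of growing df inside the loop, since every append to a DataFrame copies the whole frame. The helper names iter_records and load_logs are hypothetical, and filling missing keys with None is an assumption about the data. This does not address the spike during .to_pickle.

```python
import glob
import json

import pandas as pd

COLUMNS = ["eventName", "sessionId", "eventTime", "items", "currentPage", "browserType"]


def iter_records(path, columns):
    """Yield one dict per JSON line, keeping only the wanted keys.

    Assumption: a key missing from a record is stored as None.
    """
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            yield {key: record.get(key) for key in columns}


def load_logs(pattern, columns=COLUMNS):
    """Build one frame per file, then concatenate a single time."""
    frames = [
        pd.DataFrame(list(iter_records(path, columns)), columns=columns)
        for path in sorted(glob.glob(pattern))
    ]
    if not frames:
        return pd.DataFrame(columns=columns)
    df = pd.concat(frames, ignore_index=True)
    df = df.sort_values(by=["sessionId", "eventTime"])
    df.index = range(1, len(df) + 1)  # 1-based index, as in the original code
    return df
```

Calling load_logs("*.log") would return the sorted frame, which could then be written with df.to_pickle('data.pkl') as before. Whether this lowers peak memory enough depends on how many unwanted keys the log lines carry, so it would need to be profiled.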