I have some efficiency problems with the algorythm I am going to explain by snippets:
- First the - df_fsis created. I create a random DataFrame here to make it work with the example- import pandas as pd import numpy as np import random as rd import string R = 2500 # ROWS / 2 C = 100 # COLUMNS NPROF = 1 # NUMBER OF PROFILES, up to 6 STNNBR = 'STNNBR' d = {} for x in range(C): key = ''.join(rd.choices(string.ascii_uppercase + string.digits, k=10)) nan_list = [np.nan] * R float_list = [float(1000 * rd.random()) for i in range(R)] l = nan_list + float_list rd.shuffle(l) d[key] = l d[STNNBR] = [int(200 * rd.random()) for i in range(R*2)] df_fs = pd.DataFrame(d)
- The list - cur_plot_colsindicates the name of columns we are going to work with:- pos_list = [] while len(pos_list) < 20: v = int(C * rd.random()) if v not in pos_list: pos_list.append(v) d_keys = list(d.keys()) cur_plot_cols = [d_keys[p] for p in pos_list]
- The - prof_dfis a huge DataFrame that I initialize with many- NaNvalues and many columns. The number of columns increases with- cur_plot_colsand- NFPROF:- tab_list = ['SALNTY', 'OXYGEN', 'NITRAT', 'PHSPHT', 'SILCAT', 'ALKALI', 'TCARBN', 'PH_TOT', 'CFC_11', 'CFC_12', 'CFC113', 'SF6'] compound_cols = [] for tab in tab_list: for col in cur_plot_cols: for n in range(NPROF): compound_cols.append('{}_{}_{}'.format(tab, col, n)) d_aux = {} if compound_cols != []: d_aux = dict.fromkeys(compound_cols, []) prof_df = pd.DataFrame(d_aux) # init empty columns prof_df['INDEX'] = df_fs.index.values prof_df = prof_df.set_index(['INDEX'])
- More variables that I need to make the example work: - plot_prof_invsbl_points = True stt_order_reversed = [31] # up to 6 elements tabs_flags_plots = { 'NITRAT': { # tab name 'flag': 'NITRAT_FLAG_W', }, 'SALNTY': { 'flag': 'SALNTY_FLAG_W', }, } visible_flags = [3, 4, 5, 6]
- Finally the problematic algorithm, the line labeled with - FIXMEis the main bottle neck- f = cur_plot_cols + [STNNBR] df_fs = df_fs.filter(f) for tab in tab_list: i = NPROF - 1 for stt in stt_order_reversed: for col in cur_plot_cols: df_aux = df_fs[(df_fs[STNNBR] == stt) & df_fs[col].notnull()] if plot_prof_invsbl_points is False: # this is never True in this example extracted from the original code if tab in tabs_flags_plots.keys(): flag = tabs_flags_plots[tab]['flag'] df_aux = df_aux[df_aux[flag].isin(visible_flags)] prof_df.loc[df_aux.index.values, '{}_{}_{}'.format(tab, col, i)] = df_aux[col] # FIXME: this is the main bottle neck i -= 1
Measurements
I have measured the time with the line_profile tool and this is the result:
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    31        13        114.0      8.8      0.0      for tab in tab_list:
    32        12        148.0     12.3      0.0          i = NPROF - 1
    37        24        267.0     11.1      0.0          for stt in stt_order_reversed:
    38       372      12430.0     33.4      0.0              for col in cur_plot_cols:
    39       360   12890156.0  35806.0     13.1                  df_aux = df_fs[(df_fs[STNNBR] == stt) & df_fs[col].notnull()]
    40       360      11687.0     32.5      0.0                  if plot_prof_invsbl_points is False:
    41                                                               flag = self.env.tabs_flags_plots[tab]['flag']
    42                                                               df_aux = df_aux[df_aux[flag].isin(self.env.visible_flags)]
    43       360   85075802.0 236321.7     86.3                  prof_df.loc[df_aux.index.values, '{}_{}_{}'.format(tab, col, i)] = df_aux[col]
    44        12        201.0     16.8      0.0              i -= 1
Any suggestion to make these lines faster?
df_aux = df_fs[(df_fs[STNNBR] == stt) & df_fs[col].notnull()]
prof_df.loc[df_aux.index.values, '{}_{}_{}'.format(tab, col, i)] = df_aux[col]
Notes
Actually the real index I am using into the DataFrames are hashes, that is to say, strings.
I have to update the prof_df DataFrame columns. But the columns names are made with the parameters [tab, col, i], I need to iterate over them to set the column I want to update in each iteration. Is there a way to iterate faster and update those columns? Any alternative?
Some time ago I used this expression to assign grouped values:
ml_df['xs{}'.format(n_plot)] = df_p.groupby('STNNBR').apply(lambda x: list(x[col_x_name]))
But I do not know how to apply the same method here because this time I need the column name and the i value on the left side of the assigment.
The algorythm takes 6 seconds to complete, too much.
 
    