I have a pandas.Series of business dates called s_dates. I want to pass each of these dates (together with some other hyper-parameters) to a function called func_sql_to_df which formats an SQL-query and then returns a pandas.DataFrame. Finally, all of the DataFrames should be concatenated (appended) into a single pandas.DataFrame called df_summary where the business date is the identifier.
From here I need to do two things:
- export df_summaryto an Excel sheet or csv-file.
- group df_summaryby the dates and then apply another function calledfunc_analysisto each column.
My attempt is something like this
df_summary = pd.concat(list(
    s_dates.apply(func_sql_to_df, args=hyper_param)
))
df_summary.groupby('dates').apply(func_analysis)
# Export data
...
However, the first statement where df_summary is defined takes quite long. There are a total of 250 dates where the first couple of iterations takes approximately 3 seconds but it increases to over 3 minutes after about 100 iterations (and continues to do so). All of the SQL-queries take more or less the same time to execute individually and the resulting dataframes all have the same number of observations.
I want to increase the performance of this setup, but I am already not using any loops (only apply-functions) and the SQL-query has already been optimized a lot. Any suggestions?
Update: If I am not mistaken then my attempt is actually the suggested solution as stated in the accepted answer to this post.
Update2: My SQL-query looks something like this. I do not know if all the dates can be passed at ones as the conditions specified in the WHERE-statement must hold for each passed value in dates.
select /*+ parallel(auto) */ 
         MY_DATE as EOD_DATE -- These are all the elements in 'DATES' passed
       , Var2
       , Var3
       , ColA
       , ColB
       , ...
       , ColN
from Database1
where
    Var2 in (select Var2 from Datebase2 where update_time < MY_DATE) -- Cond1
and Var3 in (select Var3 from DataBase3 where EOD_DATE = MY_DATE) -- Cond2
and cond3
and cond4
...
Running the query for any date in dates on its own seems to take around 2-8 seconds. However, as mentioned some of the iterations in the apply-function takes more than 3 minutes.
