Vectorizing/Speeding Up Pandas When Using One Df to Query Another

Question

I'm trying to find the annual percent change for each ticker in fundamentals_df, by looking at pricing data in another dataframe (prices_df). However, the dataset is quite large, and a full loop can take over 3 hours. I've been looking into vectorization as an alternative, but I'm quite new, and am looking for advice on how I could apply it to my particular problem. Lot's of solutions out there are for single dataframe, and not cases where you have to use data from one dataframe to access another.

Here's an outline of the steps I'm taking right now:

I am looping though each row in fundamentals_df, and getting the ticker, and year attributes.
Then going into prices_df and finding all rows with the same ticker
Then further filtering by getting the the rows between a given year, and the next year (year + 1).
Finally, getting the first and last values, and calculating the percent change.

The Dataset

prices_df:

ticker                 object
date           datetime64[ns]
close                 float64

fundamentals_df has a handful of fields, but these are the two most relevant:

ticker             object
year               object

The ticker attribute is unique, and is shared by both dataframes.

The code

Here's the loop that goes through fundamentals_df, getting each ticker, and year. Then using the ticker to perform an initial filter on prices_df, just getting the rows relevant to the current ticker.

current_ticker = ''
returns_dict = {}

for index, row in fundamentals_df.iterrows():
    if row.ticker != current_ticker:
        # This first "if" statement determines if we're working on a new ticker. 
        # If we are, then get the new ticker's price history, and pretty it up.

        current_ticker = row.ticker
        
        # Filter prices_df by ticker
        temp_price_df = prices_df.query(f'ticker == "{current_ticker}"')

        # Make sure it's sorted by date. SUPER IMPORTANT since we calculate pct_change using first and last rows.
        temp_price_df = temp_price_df.sort_values(by='date')
    else:
        # If we're working on the same ticker, then no need to revisit prices_df
        pass
    
    annual_return = calc_annual_return(temp_price_df, row.year)
    returns_dict[index] = annual_return

Here's the function to calculate annual return. It is responsible for taking a temporary dataframe with all the prices for a given ticker, and filter out all the dates that aren't relevant. then use the first and last values remaining to calculate percent change.

def calc_annual_return(prices, year):
    # "year" represents the end of the period;
    # we want the stocks performance for the following year, so we add 1
    year = int(year) + 1
    
    # Just get the slice of time we meed
    prices = prices.loc[prices['date'].between(f'{year}-01-01', f'{year + 1}-01-01')]

    try:
        start_price = prices.iloc[0].closeadj
        end_price = prices.iloc[-1].closeadj

        pct_change = (end_price - start_price) / start_price
    except:
        return 'n/a'
    
    return pct_change

I know this is quite a long post, I hope it's thorough, and clear. Any help would be greatly appreciated.

As much detail as you put into your question, the overall process is still not quite clear. Have a look at [How to make good pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) and [edit] your question to include a sample of each input df and your expected output, so that we can better understand how to help. — G. Anderson, Jul 08 '22 at 17:42

score 0 · Answer 1 · answered Jul 08 '22 at 18:21

One way to do it in a vectorized fashion (although most likely not fully optimized) would be to do the following:

Exemplary input:

fundamentals_df = pd.DataFrame(
    columns=['ticker', 'year'],
    data=[['A', '2021'],
          ['A', '2020'],
          ['A', '2019'],
          ['B', '2020'],
          ['B', '2019'],
         ]
)

prices_df = pd.DataFrame(
    columns=['ticker', 'date', 'close'],
    data=[['A', '2022-01-04', 10],
          ['A', '2022-01-08', 11],
          ['A', '2021-02-24', 8],
          ['A', '2021-12-04', 10],
          ['A', '2020-01-01', 4],
          ['A', '2021-01-01', 7],
          ['B', '2022-01-04', 9],
          ['B', '2022-01-08', 11],
          ['B', '2021-02-24', 2],
          ['B', '2021-12-04', 12],
          ['B', '2021-01-01', 6],
         ]
)

Code:

pct_change_df = pd.DataFrame()
# iterate over every year, could be improved to not require iteration at all
for year in fundamentals_df['year'].unique():
    # reduce DataFrames to given year
    fundamentals_df_tmp = fundamentals_df[fundamentals_df['year']==year]
    year = int(year) + 1
    prices_df_tmp = prices_df[prices_df['date'].between(
        pd.to_datetime(f'{year}-01-01'), pd.to_datetime(f'{year + 1}-01-01')
    )]
    
    # merge price data of the year to ticker + year combination
    fundamentals_df_tmp = fundamentals_df_tmp.merge(prices_df_tmp)
    
    # reduce to minimum and maximum date of each ticker in given year
    min_df = fundamentals_df_tmp.groupby('ticker')['date'].min().reset_index()
    max_df = fundamentals_df_tmp.groupby('ticker')['date'].max().reset_index()
    date_df = pd.concat([min_df, max_df])
    fundamentals_df_tmp = fundamentals_df_tmp.merge(date_df)
    
    # match starting close and ending close of year and calculate percentage change
    fundamentals_df_tmp = fundamentals_df_tmp.sort_values(['ticker', 'date'])
    fundamentals_df_tmp['close_end'] = fundamentals_df_tmp['close'].shift(-1)
    fundamentals_df_tmp['pct_change'] = fundamentals_df_tmp['close_end'] / fundamentals_df_tmp['close'] - 1
    fundamentals_df_tmp = fundamentals_df_tmp.loc[fundamentals_df_tmp.groupby('ticker')['date'].idxmin()]
    fundamentals_df_tmp = fundamentals_df_tmp[['ticker', 'year', 'pct_change']]
    pct_change_df = pct_change_df.append(fundamentals_df_tmp)

Output (pct_change_df):

  ticker  year  pct_change
0      A  2021    0.100000
1      A  2020    0.428571
3      B  2020    1.000000
0      A  2019    0.750000
2      B  2019    0.000000

Vectorizing/Speeding Up Pandas When Using One Df to Query Another

The Dataset

The code

1 Answers1