I'm trying to find the annual percent change for each ticker in fundamentals_df, by looking at pricing data in another dataframe (prices_df). However, the dataset is quite large, and a full loop can take over 3 hours. I've been looking into vectorization as an alternative, but I'm quite new, and am looking for advice on how I could apply it to my particular problem. Lot's of solutions out there are for single dataframe, and not cases where you have to use data from one dataframe to access another.
Here's an outline of the steps I'm taking right now:
- I am looping though each row in fundamentals_df, and getting theticker, andyearattributes.
- Then going into prices_dfand finding all rows with the sameticker
- Then further filtering by getting the the rows between a given year, and the next year (year + 1).
- Finally, getting the first and last values, and calculating the percent change.
The Dataset
prices_df:
ticker                 object
date           datetime64[ns]
close                 float64
fundamentals_df has a handful of fields, but these are the two most relevant:
ticker             object
year               object
The ticker attribute is unique, and is shared by both dataframes.
The code
Here's the loop that goes through fundamentals_df, getting each ticker, and year. Then using the ticker to perform an initial filter on prices_df, just getting the rows relevant to the current ticker.
current_ticker = ''
returns_dict = {}
for index, row in fundamentals_df.iterrows():
    if row.ticker != current_ticker:
        # This first "if" statement determines if we're working on a new ticker. 
        # If we are, then get the new ticker's price history, and pretty it up.
        current_ticker = row.ticker
        
        # Filter prices_df by ticker
        temp_price_df = prices_df.query(f'ticker == "{current_ticker}"')
        # Make sure it's sorted by date. SUPER IMPORTANT since we calculate pct_change using first and last rows.
        temp_price_df = temp_price_df.sort_values(by='date')
    else:
        # If we're working on the same ticker, then no need to revisit prices_df
        pass
    
    annual_return = calc_annual_return(temp_price_df, row.year)
    returns_dict[index] = annual_return
Here's the function to calculate annual return. It is responsible for taking a temporary dataframe with all the prices for a given ticker, and filter out all the dates that aren't relevant. then use the first and last values remaining to calculate percent change.
def calc_annual_return(prices, year):
    # "year" represents the end of the period;
    # we want the stocks performance for the following year, so we add 1
    year = int(year) + 1
    
    # Just get the slice of time we meed
    prices = prices.loc[prices['date'].between(f'{year}-01-01', f'{year + 1}-01-01')]
    try:
        start_price = prices.iloc[0].closeadj
        end_price = prices.iloc[-1].closeadj
        pct_change = (end_price - start_price) / start_price
    except:
        return 'n/a'
    
    return pct_change 
I know this is quite a long post, I hope it's thorough, and clear. Any help would be greatly appreciated.
 
    