I have a large dataframe (2,000,000+ rows by 2,000+ columns) and I want to create some new columns based on the data contained in each row. One column is a number which I want to use as an indexer to select the data in other columns. This indexer changes for each row, which means selecting the columns by their name isn't possible so I need to use numeric indexing. After identifying the columns I need, I then want to output the values to a new set of columns (after which I will perform some calculations).
Below is a simplified version of what I have tried:
# import packages
import pandas as pd
import numpy as np
# create dataframe
n = 10000
a_bar = 9; a_sd = 2
b_bar = 1000; b_sd = 100
np.random.seed(12345)
df = pd.DataFrame(dict(month_index=np.random.normal(a_bar, a_sd, size=n),
                       month_1=np.random.normal(b_bar, b_sd, size=n),
                       month_2=np.random.normal(b_bar, b_sd, size=n),
                       month_3=np.random.normal(b_bar, b_sd, size=n),
                       month_4=np.random.normal(b_bar, b_sd, size=n),
                       month_5=np.random.normal(b_bar, b_sd, size=n),
                       month_6=np.random.normal(b_bar, b_sd, size=n),
                       month_7=np.random.normal(b_bar, b_sd, size=n),
                       month_8=np.random.normal(b_bar, b_sd, size=n),
                       month_9=np.random.normal(b_bar, b_sd, size=n),
                       month_10=np.random.normal(b_bar, b_sd, size=n),
                       month_11=np.random.normal(b_bar, b_sd, size=n),
                       month_12=np.random.normal(b_bar, b_sd, size=n)
                       ),
                  columns=['month_index', 'month_1', 'month_2', 'month_3',\
                           'month_4', 'month_5', 'month_6', 'month_7',\
                           'month_8', 'month_9', 'month_10', 'month_11', 'month_12'])
# round all column values 
   
df = df.round()
# restrict value of month index
conditions = [df.month_index < 7, df.month_index > 12, (df.month_index >= 7) & (df.month_index <= 12)]
values = [7, 12, df.month_index]
df["month_index"] = np.select(conditions, values)
# reduce size of dataframe
for column in df.columns:
    df[column] = pd.to_numeric(df[column], downcast='integer')
# select relevant data using function
def select_columns(df):
    
    i = 0
         
    while i < len(df):
        j = df.at[i, 'month_index']
        df.at[i, "temp_1"] = df.iat[i, j-5]
        df.at[i, "temp_2"] = df.iat[i, j-4]
        df.at[i, "temp_3"] = df.iat[i, j-3]
        df.at[i, "temp_4"] = df.iat[i, j-2]
        df.at[i, "temp_5"] = df.iat[i, j-1]
        df.at[i, "temp_6"] = df.iat[i, j]
    
        i += 1
    
    return df
df = select_columns(df)
As I've found out this is terribly inefficient and I would much rather use pandas .apply method, or (if possible) vectorisation. Whilst I have tried out numpy's vectorisation functionality, I am aware that this approach is only feasible if columns can be passed to a function as numpy arrays. my questions are:
- What is the best approach for my problem (given the number of rows/columns I am working with)?
- Is it feasible to vectorise over a dataframe with a large number of columns?
 
    