I'm using pandas to do an outer merge on a set of roughly 1000-2000 CSV files. Each CSV file has an identifier column id that is shared between all of the files, but each file has its own unique set of 3-5 columns. There are roughly 20,000 unique id rows in each file. All I want to do is merge the files together, bringing all of the new columns in and using the id column as the merge key.
I do this with a simple merge call inside a loop:
import pandas as pd

merged_df = first_df  # dataframe loaded from the first CSV file
for next_filename in filenames:
    # load up the next dataframe
    next_df = pd.read_csv(next_filename)
    merged_df = merged_df.merge(next_df, on=["id"], how="outer")
The problem is that with nearly 2000 CSV files, pandas throws a MemoryError during the merge operation. I'm not sure whether this is a limitation of the merge operation itself.
The final dataframe would have 20,000 rows and roughly (2000 x 3) = 6,000 columns. That is large, but not large enough to consume all the memory on the machine I am using, which has over 20 GB of RAM. Is this size too much for pandas to manipulate? Should I be using something like sqlite instead? Is there anything I can change in the merge operation to make it work at this scale?
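For example, would something along these lines, setting id as the index on each file and letting concat align everything in one pass, be a better fit than chained merges? This is only a sketch of what I mean (and here filenames would include the first file as well):

import pandas as pd

# sketch: index each file on id and let concat align them column-wise,
# instead of repeatedly merging into an ever-wider dataframe
frames = [pd.read_csv(filename).set_index("id") for filename in filenames]
merged_df = pd.concat(frames, axis=1)  # outer join on the id index by default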
Thanks.