I have the following snippet, which iterates over a list of .csv files and calls an insert_csv_data function that reads, preprocesses, and inserts each .csv file's data into a .hyper file (Hyper is Tableau's new in-memory data engine technology, designed for fast data ingest and analytical query processing on large or complex data sets):
A detailed description of the insert_csv_data function can be found here.
for csv in csv_list:
    insert_csv_data(csv)
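(For context, in case the link is not accessible: the function looks roughly like the sketch below. The table name, the preprocessing, and the exact insert mechanism are simplified placeholders rather than the real implementation; hyper is the already-open connection to the .hyper file.)

import pandas as pd
from tableauhyperapi import Inserter, TableName

def insert_csv_data(csv_path):
    # Simplified placeholder: read and preprocess one CSV, then append
    # its rows to an existing table in the open .hyper connection.
    df = pd.read_csv(csv_path)
    # ... preprocessing steps ...
    with Inserter(hyper, TableName("Extract", "Extract")) as inserter:
        inserter.add_rows(rows=df.itertuples(index=False, name=None))
        inserter.execute()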
The issue with the above code is that it inserts one .csv file into the .hyper file at a time, which is pretty slow at the moment.
I would like to know if there is a faster or parallel workaround, since I'm already using Apache Spark for processing on Databricks. I've done some research and found modules like multiprocessing, joblib and asyncio that might fit my scenario, but I'm unsure how to implement them correctly.
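For example, one direction I could imagine (a rough sketch only; load_and_prepare and insert_dataframe are hypothetical halves of insert_csv_data, not existing functions) is to parallelise the read/preprocess step across worker processes and keep the write into the single .hyper file sequential, so that only one writer ever touches the file:

from concurrent.futures import ProcessPoolExecutor
import pandas as pd

def load_and_prepare(csv_path):
    # Hypothetical helper: the read/preprocess half of insert_csv_data.
    df = pd.read_csv(csv_path)
    # ... preprocessing steps ...
    return df

def insert_dataframe(df, connection):
    # Hypothetical helper: the insert half of insert_csv_data.
    ...

# Read and preprocess all CSVs in parallel worker processes.
with ProcessPoolExecutor() as pool:
    frames = list(pool.map(load_and_prepare, csv_list))

# Keep the inserts sequential so the .hyper file has a single writer.
for df in frames:
    insert_dataframe(df, hyper)

I don't know whether this is the right split, or whether holding all the preprocessed DataFrames in memory at once would become a problem.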
Please advise.
Edit:
Parallel Code:
from joblib import Parallel, delayed
element_run = Parallel(n_jobs=-1)(delayed(insert_csv_data)(csv) for csv in csv_list)  # n_jobs=-1 uses all available cores
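From what I understand, joblib's default backend runs each call in a separate process, so the connection to the .hyper file that I open in the driver probably cannot be shared with the workers, and insert_csv_data would have to open its own. One idea I'm considering (sketch only; the schema, paths and table names below are made up) is to have each worker write its CSV into its own intermediate .hyper file and combine them afterwards:

from joblib import Parallel, delayed
from tableauhyperapi import (Connection, CreateMode, HyperProcess, SqlType,
                             TableDefinition, TableName, Telemetry,
                             escape_string_literal)

def csv_to_own_hyper(csv_path):
    # Sketch: each worker writes one CSV into its own .hyper file, so no
    # connection has to be shared between processes. Starting a HyperProcess
    # per file is heavyweight; this only illustrates the isolation.
    out_path = csv_path.replace(".csv", ".hyper")
    table = TableDefinition(TableName("Extract", "Extract"), [
        # made-up columns; the real schema would go here
        TableDefinition.Column("id", SqlType.int()),
        TableDefinition.Column("value", SqlType.text()),
    ])
    with HyperProcess(telemetry=Telemetry.DO_NOT_SEND_USAGE_DATA_TO_TABLEAU) as hp:
        with Connection(hp.endpoint, out_path, CreateMode.CREATE_AND_REPLACE) as conn:
            conn.catalog.create_schema("Extract")
            conn.catalog.create_table(table)
            conn.execute_command(
                f"COPY {table.table_name} FROM {escape_string_literal(csv_path)} "
                "WITH (format csv, header, delimiter ',')"
            )
    return out_path

hyper_files = Parallel(n_jobs=-1)(delayed(csv_to_own_hyper)(csv) for csv in csv_list)

The downside is that I would still have to merge the resulting files into a single .hyper extract, so I'm not sure whether this actually ends up faster than the plain sequential insert.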