Question is simple:
master_dim.py calls dim_1.py and dim_2.py to execute in parallel. Is this possible in databricks pyspark?
Below image is explaning what am trying to do, it errors for some reason, am i missing something here?
Question is simple:
master_dim.py calls dim_1.py and dim_2.py to execute in parallel. Is this possible in databricks pyspark?
Below image is explaning what am trying to do, it errors for some reason, am i missing something here?
Just for others in case they are after how it worked:
from multiprocessing.pool import ThreadPool
pool = ThreadPool(5)
notebooks = ['dim_1', 'dim_2']
pool.map(lambda path: dbutils.notebook.run("/Test/Threading/"+path, timeout_seconds= 60, arguments={"input-data": path}),notebooks)
your problem is that you're passing only Test/ as first argument to the dbutils.notebook.run (the name of notebook to execute), but you don't have notebook with such name.
You need either modify list of paths from ['Threading/dim_1', 'Threading/dim_2'] to ['dim_1', 'dim_2'] and replace dbutils.notebook.run('Test/', ...) with dbutils.notebook.run(path, ...)
Or change dbutils.notebook.run('Test/', ...) to dbutils.notebook.run('/Test/' + path, ...)
Databricks now has workflows/multitask jobs. Your master_dim can call other jobs to execute in parallel after finishing/passing taskvalue parameters to dim_1, dim_2 etc.