I have two DAGs that need to run at different frequencies: dag1 runs weekly and dag2 runs daily. On every occasion that dag1 runs, dag2 should start only after dag1 has finished.
I have defined the two DAGs as follows, in two separate Python modules.
dag1.py
import datetime as dt
from os import path

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

PROJECT_PATH = path.abspath(path.join(path.dirname(__file__), '../..'))

with DAG('dag1',
         default_args={
             'owner': 'airflow',
             'start_date': dt.datetime(2019, 8, 19, 9, 30, 0),
             'concurrency': 1,
             'retries': 0
         },
         schedule_interval='00 10 * * 1',
         catchup=True
         ) as dag:
    CRAWL_PARAMS = BashOperator(
        task_id='crawl_params',
        bash_command='cd {}/scraper && scrapy crawl crawl_params'.format(PROJECT_PATH)
    )
dag2.py
import datetime as dt
from os import path

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

PROJECT_PATH = path.abspath(path.join(path.dirname(__file__), '../..'))

with DAG('dag2',
         default_args={
             'owner': 'airflow',
             'start_date': dt.datetime(2019, 8, 25, 9, 30, 0),
             'concurrency': 1,
             'retries': 0
         },
         schedule_interval='5 10 * * *',
         catchup=True
         ) as dag:
    CRAWL_DATASET = BashOperator(
        task_id='crawl_dataset',
        bash_command='''
            cd {}/scraper && scrapy crawl crawl_dataset
        '''.format(PROJECT_PATH)
    )
For now I have manually set a 5-minute gap between the two DAGs. This setup is not working, and it also lacks the required behavior of making dag2 dependent on dag1.
I checked the answers here and here but was not able to figure it out.
NOTE: the schedule_intervals are indicative only. The intention is to run dag1 every Monday at a fixed time and run dag2 daily at a fixed time; on Mondays, dag2 should run only after dag1 finishes.
Each DAG also has multiple tasks.
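For reference, the direction I have been attempting is an ExternalTaskSensor at the start of dag2 that waits for the last task of dag1. This is only a sketch, assuming Airflow 1.10.x import paths and that crawl_params is the final task of dag1; the execution_delta encodes the 5-minute offset between the two schedules:

```python
import datetime as dt

from airflow.sensors.external_task_sensor import ExternalTaskSensor

# Sketch only (Airflow 1.10.x assumed). dag2 runs at 10:05 and dag1 at
# 10:00, so the matching dag1 run is 5 minutes before dag2's
# execution_date; execution_delta tells the sensor to look there.
wait_for_dag1 = ExternalTaskSensor(
    task_id='wait_for_dag1',
    external_dag_id='dag1',
    external_task_id='crawl_params',  # assumed last task of dag1
    execution_delta=dt.timedelta(minutes=5),
    timeout=3600,
    dag=dag,  # the dag2 object from the `with DAG(...)` block
)

wait_for_dag1 >> CRAWL_DATASET
```

The unresolved part is that on days other than Monday there is no matching dag1 run, so this sensor simply times out instead of letting dag2 proceed.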