I'm still in the process of deploying Airflow and I've already felt the need to merge operators together. The most common use-case would be coupling an operator and the corresponding sensor. For instance, one might want to chain together the EmrStepOperator and EmrStepSensor.
I'm creating my DAGs programmatically, and the biggest of them contains 150+ identical branches, each performing the same series of operations on a different piece of data (a table). Grouping the tasks that make up a single logical step in my DAG would therefore be of great help.
Here are two concrete examples from my project to motivate the question.
1. Deleting data from S3 path and then writing new data
This step comprises 2 operators
- DeleteS3PathOperator: Extends from BaseOperator & uses S3Hook
- HadoopDistcpOperator: Extends from SSHOperator
2. Conditionally performing MSCK REPAIR on Hive table
This step contains 4 operators
- BranchPythonOperator: Checks whether the Hive table is partitioned
- MsckRepairOperator: Extends from HiveOperator and performs MSCK REPAIR on the (partitioned) table
- Dummy(Branch)Operator: Makes up the alternate branching path to MsckRepairOperator (for non-partitioned tables)
- Dummy(Join)Operator: Makes up the join step for both branches
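To make example 2 concrete, here is a minimal sketch of what collapsing the four tasks into one might look like. It deliberately uses plain stand-in classes instead of real Airflow imports so that it runs anywhere: `FakeHiveHook`, its `is_partitioned` and `run` methods, and `ConditionalMsckRepairOperator` are hypothetical names of mine, not Airflow APIs. In a real DAG the operator would extend `BaseOperator` and the hook would wrap an actual Hive connection.

```python
class FakeHiveHook:
    """Stand-in for a Hive hook (hypothetical, not an Airflow class)."""

    def __init__(self, partitioned_tables):
        self.partitioned_tables = set(partitioned_tables)
        self.executed = []  # records every HQL statement that was "run"

    def is_partitioned(self, table):
        # A real hook would query the metastore here.
        return table in self.partitioned_tables

    def run(self, hql):
        self.executed.append(hql)


class ConditionalMsckRepairOperator:
    """Hypothetical single task replacing branch + repair + dummy + join."""

    def __init__(self, table, hook):
        self.table = table
        self.hook = hook

    def execute(self, context=None):
        # The branch and join tasks become an ordinary if statement.
        if not self.hook.is_partitioned(self.table):
            return "skipped"
        self.hook.run(f"MSCK REPAIR TABLE {self.table}")
        return "repaired"
```

The trade-off is the one described in this question: the DAG shrinks from four tasks to one per table, but the decision to skip the repair is now only visible in the task's log or return value rather than in the graph view.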
Using operators in isolation certainly offers smaller modules and more fine-grained logging / debugging, but in large DAGs, reducing the clutter might be desirable. From my current understanding there are 2 ways to chain operators together
1. Hooks
Write actual processing logic in hooks and then use as many hooks as you want within a single operator (certainly the better way, in my opinion)
2. SubDagOperator
A risky and controversial way of doing things; additionally, the naming convention for SubDagOperator makes me frown.
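As an illustration of the hooks approach applied to example 1 (delete the S3 path, then write new data), here is a rough sketch. Everything in it is a stand-in so that it is runnable without Airflow: `FakeS3Hook`, `FakeSshHook`, their methods, and `ReplaceS3DataOperator` are hypothetical and do not mirror the real `S3Hook` or SSH hook signatures. The point is only the shape: the logic lives in the hooks, and one operator's `execute` calls them in sequence.

```python
class FakeS3Hook:
    """Stand-in for an S3 hook; keeps keys in memory (hypothetical)."""

    def __init__(self, keys):
        self.keys = list(keys)

    def delete_prefix(self, prefix):
        # Remove every key under the prefix and report what was removed.
        removed = [k for k in self.keys if k.startswith(prefix)]
        self.keys = [k for k in self.keys if not k.startswith(prefix)]
        return removed


class FakeSshHook:
    """Stand-in for an SSH hook; records commands instead of running them."""

    def __init__(self):
        self.commands = []

    def run(self, command):
        self.commands.append(command)
        return 0  # pretend the remote command succeeded


class ReplaceS3DataOperator:
    """Hypothetical operator that fuses the delete and distcp steps."""

    def __init__(self, source, target_prefix, s3_hook, ssh_hook):
        self.source = source
        self.target_prefix = target_prefix
        self.s3_hook = s3_hook
        self.ssh_hook = ssh_hook

    def execute(self, context=None):
        removed = self.s3_hook.delete_prefix(self.target_prefix)
        exit_code = self.ssh_hook.run(
            f"hadoop distcp {self.source} {self.target_prefix}"
        )
        if exit_code != 0:
            raise RuntimeError("distcp failed")
        return len(removed)
```

One consequence worth noting: if the copy fails, a retry of this fused task repeats the delete as well, whereas with two separate tasks only the failed one is retried.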
My questions are
- Should operators be composed at all or is it better to have discrete steps?
- Any pitfalls or improvements in the above approaches?
- Any other ways to combine operators together?
- In Airflow's taxonomy, is the primary purpose of hooks the same as above, or do they serve other purposes too?
UPDATE-1
3. Multiple Inheritance
While this is a Python feature rather than something Airflow-specific, it's worthwhile to point out that multiple inheritance can come in handy for combining the functionality of operators. QuboleCheckOperator, for instance, is already written that way. In the past I tried this approach to fuse EmrCreateJobFlowOperator and EmrJobFlowSensor, but I ran into issues with the @apply_defaults decorator and abandoned the idea.
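For reference, the pattern I have in mind looks roughly like the sketch below. It uses plain Python stand-ins rather than the real EMR classes, so `SubmitStepOperator`, `StepSensor`, `SubmitAndWaitOperator`, and the `wait` helper are all hypothetical names, and the sketch says nothing about how @apply_defaults behaves. It only shows cooperative `__init__` calls via `super()` and an `execute` that submits and then polls.

```python
import time


class SubmitStepOperator:
    """Stand-in for an operator that submits work and returns an id."""

    def __init__(self, step_name, **kwargs):
        super().__init__(**kwargs)  # pass remaining kwargs along the MRO
        self.step_name = step_name

    def execute(self, context=None):
        return f"id-for-{self.step_name}"


class StepSensor:
    """Stand-in for a sensor that polls until the work is done."""

    def __init__(self, poke_interval=0.0, max_pokes=10, **kwargs):
        super().__init__(**kwargs)
        self.poke_interval = poke_interval
        self.max_pokes = max_pokes
        self.pokes = 0

    def poke(self, step_id):
        # Pretend the step finishes on the third check.
        self.pokes += 1
        return self.pokes >= 3

    def wait(self, step_id):
        for _ in range(self.max_pokes):
            if self.poke(step_id):
                return True
            time.sleep(self.poke_interval)
        raise TimeoutError(f"{step_id} did not finish in time")


class SubmitAndWaitOperator(SubmitStepOperator, StepSensor):
    """Hypothetical fused task: submit the step, then block until done."""

    def execute(self, context=None):
        step_id = SubmitStepOperator.execute(self, context)
        self.wait(step_id)
        return step_id
```

With real Airflow classes both parents would share BaseOperator as an ancestor, so the constructor arguments of the two sides have to be reconciled, which is presumably where the decorator trouble came from.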