download_images >> train >> serve
This line sets the sequence of operations for an ML pipeline in Airflow.
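A minimal sketch (plain Python, not real Airflow internals) of how the `>>` syntax can build a dependency graph: each task records its downstream tasks, mimicking what Airflow's bitshift overload on operators does.

```python
class Task:
    """Toy stand-in for an Airflow task, just to illustrate the chaining syntax."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.downstream = []  # tasks that must run after this one

    def __rshift__(self, other):
        # "a >> b" declares that b depends on a; returning b makes chains work
        self.downstream.append(other)
        return other


download_images = Task("download_images")
train = Task("train")
serve = Task("serve")

# The same chaining syntax as the Airflow line above
download_images >> train >> serve

print([t.task_id for t in download_images.downstream])  # ['train']
print([t.task_id for t in train.downstream])            # ['serve']
```

In real Airflow, `BaseOperator` overloads `>>` (and `<<`) in the same spirit to register upstream/downstream relationships on the DAG.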
A metaphor for Airflow is an air-traffic controller: it orchestrates, sequences, and mediates the flow of airplane traffic. This is an example of the mediator pattern, which decouples dependencies in a complex system: the airplanes do not talk directly to each other; they talk to the air-traffic controller.
A functional alternative to Airflow is a bunch of cron jobs scheduling bash scripts. Airflow instead defines pipelines as Directed Acyclic Graphs (DAGs) in Python code. A critical talk, “Don’t use Apache Airflow”, describes it as cron on steroids.
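For contrast, the cron version of the same pipeline might look like the hypothetical crontab below (script paths invented for illustration). The ordering is only implicit in the staggered times; there is no dependency tracking, no retries, and a slow download silently breaks training:

```
# crontab sketch: stagger the stages and hope each finishes in time
0 2 * * * /opt/pipeline/download_images.sh
0 3 * * * /opt/pipeline/train.sh
0 5 * * * /opt/pipeline/serve.sh
```

A DAG makes those implicit time-based dependencies explicit and lets the scheduler enforce them.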
A complete example of an ML pipeline built with Airflow that outputs its results to a Streamlit app – https://github.com/bhlr/docker-airflow-streamlit
Each operation in the pipeline is performed by an Operator, which does the job locally or remotely.
How does it perform an operation remotely on another node? SSH/remote execution? The Docker daemon? A Kubernetes pod? There can be many different ways – this logic is encapsulated by an Executor.
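The choice of Executor is a deployment-level setting. A sketch of the relevant fragment of `airflow.cfg` (values here are illustrative, not a recommended configuration):

```ini
[core]
# SequentialExecutor (the default) and LocalExecutor run tasks on the
# scheduler's own machine; CeleryExecutor farms tasks out to remote workers
# via a message queue; KubernetesExecutor launches each task in its own pod.
executor = LocalExecutor
```

The DAG code stays the same regardless of which executor runs it, which is what makes the encapsulation useful.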
A thread on Airflow and alternatives – https://news.ycombinator.com/item?id=23349507
https://github.com/pditommaso/awesome-pipeline – A number of pipeline tools for ETL
Intro talk on Airflow by Astronomer – https://www.youtube.com/watch?v=GIztRAHc3as,
and on an ETL use case with Snowflake – https://www.youtube.com/watch?v=3-XGY0bGJ6g
How can one compose these DAGs further and manage cross-DAG dependencies? One approach, discussed in https://medium.com/quintoandar-tech-blog/effective-cross-dags-dependency-in-apache-airflow-1885dc7ece9f, is to define an explicit mediator between multiple DAGs.
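The idea behind that approach can be sketched in plain Python (not Airflow code; names are invented for illustration): rather than DAGs triggering each other directly, each upstream DAG reports its completion to a mediator, which decides which downstream DAGs to kick off. In Airflow, this mediator role is played by a dedicated DAG using sensors/trigger operators.

```python
class DagMediator:
    """Toy mediator: the single place where cross-DAG dependencies live."""

    def __init__(self):
        # maps an upstream DAG id to the downstream DAG ids it unblocks
        self.dependencies = {}
        self.triggered = []  # record of downstream DAGs we have kicked off

    def register(self, upstream_id, downstream_id):
        self.dependencies.setdefault(upstream_id, []).append(downstream_id)

    def on_dag_success(self, dag_id):
        # called when an upstream DAG finishes; trigger its dependents
        for downstream in self.dependencies.get(dag_id, []):
            self.triggered.append(downstream)


mediator = DagMediator()
mediator.register("ingest_dag", "feature_dag")
mediator.register("feature_dag", "training_dag")

mediator.on_dag_success("ingest_dag")
print(mediator.triggered)  # ['feature_dag']
```

As with the air-traffic controller, the individual DAGs never reference each other; only the mediator knows the cross-DAG wiring, so dependencies can change in one place.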