4. Apache Airflow : What is it?
In a nutshell:
Airflow is a platform to programmatically author, schedule
and monitor workflows (a.k.a. DAGs, or Directed Acyclic Graphs)
13. Apache Airflow : Why use it?
When would you use a Workflow Scheduler like Airflow?
• ETL Pipelines
• Machine Learning Pipelines
• Predictive Data Pipelines
  • Fraud Detection, Scoring/Ranking, Classification, Recommender Systems, etc.
• General Job Scheduling (e.g. Cron)
  • DB Back-ups, Scheduled code/config deployment
14. What should a Workflow Scheduler do well?
• Schedule a graph of dependencies
  • where Workflow = a DAG of Tasks
• Handle task failures
  • Report / Alert on failures
• Monitor performance of tasks over time
• Enforce SLAs
  • E.g. alerting if time or correctness SLAs are not met
• Easily scale for growing load
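The "graph of dependencies" idea is plain dependency resolution: a task may only run once everything upstream of it has finished. Python's stdlib `graphlib` can sketch this (task names are illustrative):

```python
from graphlib import TopologicalSorter

# Map each task to the set of tasks it depends on (its upstream tasks).
deps = {
    "score": {"extract"},
    "load": {"score"},
    "report": {"load", "score"},
}

# static_order() yields one valid execution order for the DAG:
# every task appears after all of its dependencies.
order = list(TopologicalSorter(deps).static_order())
```

A scheduler like Airflow does this continuously, re-evaluating which tasks are runnable as upstream tasks succeed or fail.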
15. What Does Apache Airflow Add?
• Configuration-as-code
• Usability - Stunning UI / UX
• Centralized configuration
• Resource Pooling
• Extensibility
17. Use-Case : Message Scoring
Enterprises A, B, and C upload data to an S3 bucket every 15 minutes.
18. Use-Case : Message Scoring
Airflow kicks off a Spark message-scoring job every hour.
19. Use-Case : Message Scoring
The Spark job writes scored messages and stats to another S3 bucket.
20. Use-Case : Message Scoring
This triggers event notifications: S3 publishes to SNS, which delivers the messages to an SQS queue.
21. Use-Case : Message Scoring
An Auto Scaling Group (ASG) of Importers spins up when it detects SQS messages.
22. Use-Case : Message Scoring
The Importers rapidly ingest scored messages and aggregate statistics into the DB.
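The importer loop can be sketched with a stand-in queue; a real importer would poll SQS via boto3's `receive_message` and write to the DB, but the shape is the same (queue contents, field names, and the trust threshold are all illustrative):

```python
import json
import queue

def drain(q, db):
    """Pull scored-message events off the queue and aggregate stats into db."""
    while True:
        try:
            body = q.get_nowait()        # real code: sqs.receive_message(...)
        except queue.Empty:
            return
        event = json.loads(body)
        # Aggregate per-enterprise counts, as the Importers do into the DB.
        stats = db.setdefault(event["enterprise"], {"messages": 0, "untrusted": 0})
        stats["messages"] += 1
        if event["score"] > 0.5:         # illustrative trust threshold
            stats["untrusted"] += 1

q = queue.Queue()
for msg in [
    {"enterprise": "A", "score": 0.9},
    {"enterprise": "A", "score": 0.2},
    {"enterprise": "B", "score": 0.7},
]:
    q.put(json.dumps(msg))

db = {}
drain(q, db)
```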
23. Use-Case : Message Scoring
Users receive alerts of untrusted emails & can review them in the web app.
30. Apache Airflow : Behind the Scenes
Airflow is a platform to programmatically author, schedule and monitor workflows (a.k.a. DAGs)
It ships with:
• a DAG Scheduler
• a Web application (UI)
• a powerful CLI
• Celery Workers!
31. Apache Airflow : Behind the Scenes
Components: Webserver, Scheduler, Workers, Meta DB, Celery / RabbitMQ
1. A user schedules / manages DAGs using the Airflow UI
2. Airflow’s webserver stores scheduling metadata in the metadata DB
3. The scheduler picks up new schedules and distributes work over Celery / RabbitMQ
4. Airflow workers pick up Airflow tasks over Celery
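Wiring these components together is configuration. A sketch of the relevant `airflow.cfg` entries for a Celery setup like this one (Airflow 2.x section/key names; hosts and credentials are placeholders):

```ini
[core]
# Scheduler hands tasks to distributed workers via Celery
executor = CeleryExecutor

[database]
# The metadata DB shared by webserver, scheduler, and workers
sql_alchemy_conn = postgresql+psycopg2://airflow:airflow@metadb:5432/airflow

[celery]
# RabbitMQ as the Celery message broker
broker_url = amqp://guest:guest@rabbitmq:5672//
# Where Celery stores task results
result_backend = db+postgresql://airflow:airflow@metadb:5432/airflow
```

In older Airflow 1.x releases the database connection lived under `[core]`; key locations vary by version, so check the configuration reference for the release in use.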