2. What is Airflow?
● Apache Airflow is an open-source workflow
management platform. It started at Airbnb in October
2014 as a solution for managing the company's
increasingly complex workflows.
● Airflow is a platform to programmatically author,
schedule and monitor workflows.
● Use Airflow to author workflows as directed acyclic
graphs (DAGs) of tasks. The Airflow scheduler executes
your tasks on an array of workers while following the
specified dependencies.
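The idea of running tasks in dependency order can be sketched in plain Python. This is not Airflow's API: it is a minimal illustration using the standard-library graphlib module (Python 3.9+), and the task names (extract, transform, load, report) are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical tasks: each key maps to the set of tasks it depends on.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load"},
}


def run_in_dependency_order(deps):
    """Return the tasks in an order where every task follows its upstream tasks."""
    executed = []
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the graph is not acyclic
    while sorter.is_active():
        for task in sorter.get_ready():
            # A real scheduler would hand the task to a worker here.
            executed.append(task)
            sorter.done(task)
    return executed


order = run_in_dependency_order(dependencies)
print(order)
```

Because the graph must be acyclic, there is always at least one task with no unfinished upstream work, so the loop is guaranteed to make progress until every task has run.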
3. Why use Airflow?
● Have you ever managed a messy data pipeline? In
real life, data pipelines can be pretty messy.
● Using the command-line interface (CLI)
● Have you ever had to adjust your workflow? Surely
you want to be able to scale your workflow up and
down quickly and effectively.
4. When not to use Airflow
A sampling of use cases that Airflow cannot satisfy in a first-class
way includes:
● DAGs which need to be run off-schedule or with no schedule
at all
● DAGs that run concurrently with the same start time
● DAGs with complicated branching logic
● DAGs with many fast tasks
● DAGs which rely on the exchange of data
● Parametrized DAGs
● Dynamic DAGs
5. What is Airflow used for in general?
● Monitoring cron jobs.
● Transferring data from one place to another.
● Automating your DevOps operations.
● Periodically fetching data from websites and
updating the database for your awesome price
comparison system.
● Data processing for recommendation-based
systems.
● Machine Learning Pipelines.
7. Robinhood
● Managing dependencies between jobs was difficult.
With cron we would use worst-case expected durations
for upstream jobs to schedule downstream jobs.
● Failure handling and alerting had to be managed by the
job. We would have to rely on the job, or the on-call
engineer to handle retries and upstream failures in the
case of dependent jobs.
● Retrospection was difficult. We would need to sift
through logs or alerts to check how a job may have
performed on a certain day in the past.
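The first point above, scheduling downstream jobs by the worst-case duration of upstream jobs, can be illustrated with a small calculation. This is only a sketch: all times and durations below are made-up values chosen for illustration, not figures from Robinhood.

```python
from datetime import datetime, timedelta

# Hypothetical values, for illustration only.
upstream_start = datetime(2024, 1, 1, 2, 0)  # upstream job starts at 02:00
worst_case = timedelta(hours=3)              # worst-case expected upstream duration
actual = timedelta(minutes=45)               # upstream duration on a typical night

# Cron-style: the downstream job is pinned to a fixed clock time
# derived from the worst-case upstream duration.
cron_downstream_start = upstream_start + worst_case

# Dependency-driven: the downstream job starts as soon as
# the upstream job actually finishes.
dependency_downstream_start = upstream_start + actual

# Time the downstream job sits idle under the cron-style schedule.
idle_time = cron_downstream_start - dependency_downstream_start
print(idle_time)
```

The fixed schedule fails in the other direction too: if the upstream job ever runs longer than the assumed worst case, the cron-scheduled downstream job starts before its input is ready.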
8. Google
● In May 2018 Google announced Google Cloud
Composer, a managed Apache Airflow service that is
fully integrated in the Google Cloud Platform and has
thus become one of the cornerstones for
orchestrating managed services in Google Cloud.