4. What is Airflow?
• Programmatically author, schedule and monitor data pipelines
• Components : Web Server, Scheduler, Executor, Worker, Metadata Database
• Key concepts : DAG, Operator, Task, TaskInstance, Workflow
5. Airflow Architecture
• Airflow Webserver : serves the UI dashboard over HTTP
• Airflow Scheduler : a daemon that decides which tasks to run, and when
• Airflow Worker : the process that actually executes the work defined by the tasks
• Metadata Database : stores information regarding the state of tasks
• Executor : message-queuing process that is bound to the scheduler and determines the worker processes that execute scheduled tasks
[Architecture diagram: Airflow Webserver and Airflow Scheduler read the Dags folder and Logs; the Scheduler dispatches tasks to Workers; all components share the Meta DB]
6. How Airflow works?
• 1. The Scheduler reads the DAG folder.
• 2. Your DAG is parsed by a process to create a DagRun based on the scheduling parameters of your DAG.
• 3. A TaskInstance is instantiated for each Task that needs to be executed and is flagged “Scheduled” in the metadata database.
• 4. The Scheduler gets all TaskInstances flagged “Scheduled” from the metadata database, changes their state to “Queued” and sends them to the executors to be executed.
• 5. Executors pull Tasks from the queue (depending on your execution setup), change the state from “Queued” to “Running”, and Workers start executing the TaskInstances.
• 6. When a Task is finished, the Executor changes the state of that task to its final state (success, failed, etc.) in the database, and the DagRun is updated by the Scheduler with the state “Success” or “Failed”. The web server periodically fetches data from the metadata database to update the UI.
9. What is a DAG?
• A finite directed graph with no directed cycles
• A DAG represents a collection of tasks to run, organized in a way that represents their dependencies and relations
• Each node is a Task
• Each edge is a Dependency
• It describes how the workflow should be executed
10. DAG’s important properties
š Defined in Python files placed into Airflow’s DAG_FOLDER ( usually ~/airflow/dags)
š Dag_id
š Description
š Start_date
š Schedule_interval
š Dependent_on_past : run the next DAGRun if the Previous one completed successfully
š Default_args : constructor keyword parameter when initializing opeators
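( Example ) A minimal sketch of a DAG file showing these properties; the dag_id, description and bash command are made up for illustration:

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator  # Airflow 1.x import path

# default_args are passed as constructor keyword parameters to every
# operator instantiated in this DAG
default_args = {
    "owner": "airflow",
    "depends_on_past": True,  # only run if the previous run of this task succeeded
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

dag = DAG(
    dag_id="example_pipeline",        # unique identifier shown in the UI
    description="Demo of the main DAG properties",
    start_date=datetime(2020, 1, 1),  # first logical execution date
    schedule_interval="@daily",       # how often a DagRun is created
    default_args=default_args,
)

# Placed in DAG_FOLDER, this file is picked up by the Scheduler,
# which creates DagRuns from it
print_date = BashOperator(task_id="print_date", bash_command="date", dag=dag)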
11. What is an Operator?
• Determines what actually gets done
• Operators are usually (but not always) atomic, meaning they can stand on their own and don’t need to share resources with any other operators
• The definition of a single task
• Should be idempotent (always produces the same result)
• A Task is created by instantiating an Operator class
• An operator defines the nature of the task and how it should be executed
• Once the operator is instantiated, the task becomes a node in your DAG
12. Many Operators
• BashOperator
• PythonOperator
• EmailOperator (sends an email)
• SqlOperator (executes a SQL command)
• All operators inherit from BaseOperator
• 3 types of operators:
• Action operators that perform an action (BashOperator, PythonOperator, EmailOperator, ...); a PythonOperator sketch follows this list
• Transfer operators that move data from one system to another (SqlOperator, SftpOperator)
• Sensor operators that wait for data to arrive at a defined location
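( Example ) A sketch of an action operator: instantiating a PythonOperator turns the function below into a task. It assumes a dag object like the one defined earlier; the names are illustrative:

from airflow.operators.python_operator import PythonOperator  # Airflow 1.x import path

def greet():
    # The callable should be idempotent: running it twice
    # produces the same result
    print("Hello from Airflow")

# Instantiating the operator creates a Task, a node in the DAG
greet_task = PythonOperator(
    task_id="greet",
    python_callable=greet,
    dag=dag,
)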
13. Operator ++
• Transfer Operators
• Move data from one system to another
• Data is pulled from the source, staged on the machine where the executor is running, and then transferred to the target system
• Don’t use them if you are dealing with a large amount of data
• Sensor Operators
• Inherit from BaseSensorOperator
• Useful for monitoring external processes, like waiting for files to be uploaded to HDFS or for a partition to appear in Hive
• Basically a long-running task
• A sensor operator has a poke method that is called repeatedly until it returns True (the method used for monitoring the external process); a sketch follows this list
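( Example ) A minimal custom sensor sketch; the class name and file path are made up. poke() is called every poke_interval seconds until it returns True:

import os

from airflow.sensors.base_sensor_operator import BaseSensorOperator  # Airflow 1.10 path; older 1.x uses airflow.operators.sensors
from airflow.utils.decorators import apply_defaults

class WaitForFileSensor(BaseSensorOperator):
    """Waits for a file to appear at `filepath` (illustrative example)."""

    @apply_defaults
    def __init__(self, filepath, *args, **kwargs):
        super(WaitForFileSensor, self).__init__(*args, **kwargs)
        self.filepath = filepath

    def poke(self, context):
        # Called repeatedly until it returns True
        return os.path.exists(self.filepath)

wait_for_file = WaitForFileSensor(
    task_id="wait_for_file",
    filepath="/tmp/data_ready.csv",  # hypothetical location
    poke_interval=60,                # seconds between pokes
    timeout=60 * 60,                 # fail after one hour of waiting
    dag=dag,                         # assumes a dag object as defined earlier
)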
14. Make Dependencies in Python
• set_upstream()
• set_downstream()
• << ( = set_upstream )
• >> ( = set_downstream )
[Diagram: A → B, A → C, B → D, C → D]
• B depends on A
• C depends on A
• D depends on B and C
( Example )
A.set_downstream(B)
A >> B
A >> [B, C] >> D
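( Example ) A runnable version of the dependencies above, assuming a dag object like the one defined earlier; DummyOperator tasks do nothing:

from airflow.operators.dummy_operator import DummyOperator  # Airflow 1.x import path

a = DummyOperator(task_id="A", dag=dag)
b = DummyOperator(task_id="B", dag=dag)
c = DummyOperator(task_id="C", dag=dag)
d = DummyOperator(task_id="D", dag=dag)

# B and C depend on A; D depends on both B and C
a >> [b, c] >> d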
15. How the Scheduler Works
• DagRun
• A DAG consists of Tasks, and it needs those tasks to run
• When the Scheduler parses a DAG, it automatically creates a DagRun, which is an instantiation of the DAG in time, according to start_date and schedule_interval
• Backfill and Catchup
• Schedule interval
• None
• @once
• @hourly
• @daily
• @weekly
• @monthly
• @yearly
• A cron string can also be used: * * * * * = Minute (0-59), Hour (0-23), Day of the month (1-31), Month (1-12), Day of the week (0-7); see the example after this list
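( Example ) A DAG scheduled with a cron string instead of a preset; the values are illustrative (06:30 on weekdays):

from datetime import datetime

from airflow import DAG

dag = DAG(
    dag_id="weekday_report",
    start_date=datetime(2020, 1, 1),
    schedule_interval="30 6 * * 1-5",  # minute 30, hour 6, Monday-Friday
)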
16. Concurrency vs Parallelism
• Concurrent – it can support two or more actions in progress at the same time
• Parallel – it can support two or more actions executing simultaneously
• In concurrent systems, multiple actions can be in progress (but not necessarily executing) at the same time
• In parallel systems, multiple actions are executed simultaneously
17. Database and Executor
• Sequential Executor (default executor, SQLite)
• The default executor you get when you run Apache Airflow
• Runs only one task at a time (sequentially), useful for debugging
• It is the only executor that can be used with SQLite, since SQLite doesn’t support multiple writers
• Local Executor (PostgreSQL)
• It can run multiple tasks at a time
• Uses the Python multiprocessing library and queues to parallelize the execution of tasks
• Runs tasks by spawning processes in a controlled fashion in different modes on the same machine
• The number of processes to spawn can be tuned with the parallelism parameter (see the configuration sketch after this list)
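( Example ) A sketch of the relevant airflow.cfg entries for the LocalExecutor; the connection string is illustrative:

# airflow.cfg
[core]
executor = LocalExecutor
# LocalExecutor needs a database that supports multiple connections,
# e.g. PostgreSQL; SQLite cannot be used here
sql_alchemy_conn = postgresql+psycopg2://airflow:airflow@localhost:5432/airflow
# maximum number of task instances that can run in parallel
parallelism = 8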
18. Database and Executor
• Celery Executor
• Celery == a Python task-queue system
• A task-queue system handles the distribution of tasks to workers across threads or network nodes
• Tasks are pushed into a broker (e.g. RabbitMQ)
• Celery workers pop them off and schedule the task executions
• Recommended for production use of Airflow
• Allows distributing the execution of task instances to multiple worker nodes (machines); a configuration sketch follows this list
• Other options: Dask, Mesos, Kubernetes, etc.
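( Example ) A sketch of airflow.cfg for the CeleryExecutor; key names can differ slightly between Airflow versions (this follows 1.10), and the URLs are illustrative. Workers are then started on each machine with `airflow worker`:

# airflow.cfg
[core]
executor = CeleryExecutor
sql_alchemy_conn = postgresql+psycopg2://airflow:airflow@localhost:5432/airflow

[celery]
# broker the tasks are pushed into (RabbitMQ here; Redis also works)
broker_url = amqp://guest:guest@localhost:5672//
# where Celery stores task state and results
result_backend = db+postgresql://airflow:airflow@localhost:5432/airflow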
20. Executor Architecture
[Diagram 1 - Local Executor (single machine): Web Server, Scheduler + Worker, Meta DB]
[Diagram 2 - Celery Executor: Web Server, Scheduler, Celery broker, multiple Workers, Meta DB]
21. Advanced Concepts
• SubDAG
• Minimizes repetitive patterns
• The main DAG manages all the subDAGs as normal tasks
• SubDAGs must be scheduled the same as their parent DAG
• Hooks
• Interfaces to interact with external systems such as PostgreSQL, Spark, SFTP, ... (see the sketch after this list)
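( Example ) A sketch of using a hook inside a PythonOperator callable; the connection id and SQL are illustrative:

from airflow.hooks.postgres_hook import PostgresHook  # Airflow 1.x import path

def fetch_row_count():
    # "my_postgres" refers to a connection configured in the Airflow UI
    hook = PostgresHook(postgres_conn_id="my_postgres")
    # get_records runs the query and returns the result rows
    return hook.get_records("SELECT count(*) FROM events")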
22. XCOM
• Lets tasks communicate (cross-communication: allows multiple tasks to exchange messages)
• Principally defined by a key, a value and a timestamp
• XCOM data can be “pushed” or “pulled”
• xcom_push()
• If a task returns a value, an XCOM containing that value is automatically pushed
• xcom_pull()
• The task gets the message based on parameters such as “key”, “task_ids” and “dag_id”
• XCOMs that are automatically pushed by being returned from a task get the key “return_value” (see the sketch after this list)
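( Example ) A push/pull sketch, assuming a dag object is already defined; the task ids and values are illustrative (provide_context=True is the Airflow 1.x way to receive the context kwargs):

from airflow.operators.python_operator import PythonOperator

def push(**context):
    # explicit push under a custom key
    context["ti"].xcom_push(key="row_count", value=42)
    # returning a value also pushes an XCOM, under the key "return_value"
    return "done"

def pull(**context):
    row_count = context["ti"].xcom_pull(task_ids="push_task", key="row_count")
    print("row_count =", row_count)

push_task = PythonOperator(task_id="push_task", python_callable=push,
                           provide_context=True, dag=dag)
pull_task = PythonOperator(task_id="pull_task", python_callable=pull,
                           provide_context=True, dag=dag)
push_task >> pull_task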
23. Branching
• Allows a DAG to choose between different paths according to the result of a specific task
• Use BranchPythonOperator (see the sketch after this list)
• When using branching, do not use the depends_on_past property
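( Example ) A branching sketch, assuming a dag object is already defined; the callable returns the task_id of the branch to follow, and the other branch is skipped:

from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import BranchPythonOperator

def choose_path(**context):
    # branch on the logical execution date: weekday vs weekend
    if context["execution_date"].weekday() < 5:
        return "weekday_task"
    return "weekend_task"

branch = BranchPythonOperator(task_id="branch", python_callable=choose_path,
                              provide_context=True, dag=dag)
weekday_task = DummyOperator(task_id="weekday_task", dag=dag)
weekend_task = DummyOperator(task_id="weekend_task", dag=dag)
branch >> [weekday_task, weekend_task]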
24. Service Level Agreements (SLAs)
• An SLA is a contract between a service provider and the end user that defines the level of service expected from the service provider
• Defines what the end user will receive (and must receive)
• An SLA is a time relative to the execution_date of the task, not its start time (e.g. “no more than 30 minutes after the execution date”)
• Different from the execution_timeout parameter, which stops the task and marks it failed (see the sketch after this list)
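( Example ) A sketch contrasting sla and execution_timeout on a task; the dag object and command are illustrative:

from datetime import timedelta

from airflow.operators.bash_operator import BashOperator

report = BashOperator(
    task_id="report",
    bash_command="sleep 60",  # stand-in for real work
    # sla: fires an SLA-miss alert if the task hasn't finished within
    # 30 minutes of the execution_date; the task itself keeps running
    sla=timedelta(minutes=30),
    # execution_timeout: kills the task and marks it failed after 10 minutes
    execution_timeout=timedelta(minutes=10),
    dag=dag,
)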