Apache Airflow in the Cloud
Programmatically orchestrating workloads with Python
Basic instructions
● Join the chat room: https://tlk.io/pydata-london-airflow (all necessary links are in this chat room)
○ Please fill in the Google form (for those who haven’t filled it in prior to this tutorial). This is to add you to our Google Cloud Platform project
○ Download the JSON file for Google Cloud access
○ Clone the GitHub repo for this tutorial
○ Follow the link for Airflow installation
○ Link to this slide deck
● Clone the repository
○ GitHub link: https://github.com/DataReplyUK/airflow-workshop-pydata-london-2018
Agenda
● Who we are and why we are here
● Different workflow management systems
● What is Apache Airflow?
● Basic concepts and UI walkthrough
● Tutorial 1: Basic workflow
● Tutorial 2: Dynamic workflow
● GCP introduction
● Tutorial 3: Workflow in GCP
Who we are!
Hello! I am Kaxil Naik, Data Engineer at Data Reply and an Airflow contributor
Hi! I am Satyasheel, Data Engineer at Data Reply
First of all, some truths
Data is weird and it breaks stuff
Bad data can make everything fail
From the internet to local computers to service APIs
Image Source: https://www.free-vectors.com/images/Objects/263-hand-wrench-tool---spanner-vector.png
Robust pipelines need robust workflows
● Cron (Seriously!)
● Oozie
● Luigi
● Apache Airflow
Image Source: https://visitkokomo.files.wordpress.com/2010/10/elwood-first-car.jpg
Oozie
Pros:
○ Used by thousands of companies
○ Web API, Java API, CLI and HTML support
○ Oldest among all
Cons:
○ XML
○ Significant effort in managing
○ Difficult to customize
https://i.pinimg.com/originals/31/f9/bc/31f9bc4099f1ab2c553f29b8d95274c7.jpg
Luigi
Pros:
○ Pythonic way to write a DAG
○ Pretty stable
○ Huge community
○ Came from the Spotify engineering team
Cons:
○ Have to schedule workflows externally
○ The open source Luigi UI is hard to use
○ No built-in monitoring or alerting
https://cdn.vectorstock.com/i/1000x1000/88/00/car-sketch-vector-98800.jpg
Airflow
Pros:
○ Python code base
○ Active community
○ Trigger rules
○ Cool web UI and rich CLI
○ Queues & pools
○ Zombie cleanup
○ Easily extendable
Cons:
○ No role-based access control
○ Minor issues (deleting DAGs is not straightforward)
○ Umm!!!
Image Source: https://s3.envato.com/files/133677334/Preview_Image_Set/GRID_06.jpg
What is Airflow?
http://www.belairdaily.com/wp-content/uploads/2018/04/Cable-Management-Market-1.jpg
Airflow is a platform to programmatically author, schedule and monitor workflows (a.k.a. DAGs)
● Open source ETL workflow management tool written purely in Python
● It’s the glue that binds your data ecosystem together
● It can handle failures
● Alert on failures (email, Slack)
● Monitor performance of tasks over time
● Scale!
● Developed by Airbnb
● Inspired by Facebook’s Dataswarm
● It is production ready
● It ships with:
○ DAG scheduler
○ Web application UI
○ Powerful CLI
Airflow Web UI
First look at the Airflow Web UI right after installation
Similar to Cron
This is how it looks once you start running your DAGs:
● No. of successfully completed tasks
● No. of queued tasks
● No. of failed tasks
● Status of recent DAG runs
● Can pause a DAG by switching it off
● No. of successful DAG runs
● No. of running DAG instances
● No. of times the DAG failed to run
● Links to detailed DAG info
Web UI: Tree View
A tree representation of the DAG that spans across time (task run history)
Web UI: Graph View
Visualize task dependencies & current status for a specific run
Web UI: Task Duration
The duration of your different tasks over the past N runs.
Web UI: Gantt
Which task is a blocker?
https://speakerdeck.com/artwr/apache-airflow-at-airbnb-introduction-and-lessons-learned
Web UI: DAG details
See task metadata, rendered template, execution logs etc… for debugging
DAG Anatomy
Workflow as code: a DAG (workflow) is Python code made up of:
● DAG Definition
● DAG Default Configurations
● Tasks
● Task Dependencies
Concepts: Core Ideas
Concepts: DAG
● DAG - Directed Acyclic Graph
● Define workflow logic as shape of the graph
● It is a collection of all the tasks you want to run, organized in a way
that reflects their relationships and dependencies.
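To make the "shape of the graph" idea concrete, here is a minimal sketch in plain Python. It is not Airflow's scheduler: the task names and the `run_order` helper are made up for illustration. It stores each task's upstream dependencies and derives an order in which every task runs after its parents, rejecting graphs that contain a cycle.

```python
from collections import deque

# Hypothetical workflow: each task maps to the set of tasks it depends on.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"transform"},
}


def run_order(deps):
    """Return tasks in an order that respects dependencies (Kahn's algorithm)."""
    remaining = {task: set(parents) for task, parents in deps.items()}
    ready = deque(sorted(t for t, parents in remaining.items() if not parents))
    order = []
    while ready:
        task = ready.popleft()
        order.append(task)
        # A finished task may make some of its children runnable.
        for child in sorted(remaining):
            if task in remaining[child]:
                remaining[child].discard(task)
                if not remaining[child]:
                    ready.append(child)
    if len(order) != len(remaining):
        raise ValueError("cycle detected: not a valid DAG")
    return order


print(run_order(dependencies))
```

The "acyclic" part matters: if two tasks depended on each other, neither could ever start, which is why the sketch raises an error in that case.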
Concepts: Operators
● Workflows are composed of Operators
● While DAGs describe how to run a workflow, Operators determine
what actually gets done
Concepts: Operators
● 3 main types of operators:
○ Sensors are a certain type of operator that will keep running until a certain criterion is met
○ Operators that perform an action, or tell another system to perform an action
○ Transfer Operators move data from one system to another
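The sensor idea ("keep running until a certain criterion is met") can be sketched as a polling loop. This is a toy in plain Python, not Airflow's sensor class; the function name `wait_until`, its parameters and the fake criterion are assumptions made for illustration.

```python
import time


def wait_until(criterion, poke_interval=0.01, timeout=1.0):
    """Call `criterion` repeatedly until it returns True or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        if criterion():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poke_interval)


# Hypothetical criterion: succeeds on the third check, standing in for
# something like "has the file arrived yet?".
attempts = {"count": 0}


def file_has_arrived():
    attempts["count"] += 1
    return attempts["count"] >= 3


print(wait_until(file_has_arrived), attempts["count"])
```

A timeout is worth having in any loop like this, otherwise a criterion that never becomes true would block the workflow forever.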
Concepts: Tasks
● A parameterized instance of an operator
Once an operator is instantiated, it is referred to as a “task”. The instantiation defines specific values when calling the abstract operator
● The parameterized task becomes a node in a DAG
Concepts: Tasks
Parameterised Operator
Setting Dependencies
t1.set_downstream(t2)
OR
t2.set_upstream(t1)
OR
t1 >> t2
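The three spellings above express the same relationship. The sketch below shows one way such an interface can be built; the `Task` class is a toy stand-in written for this example, not Airflow's operator base class, and only the method names and the `>>` operator are taken from the slide.

```python
class Task:
    """Toy stand-in for an instantiated operator (not Airflow's own class)."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.upstream = set()
        self.downstream = set()

    def set_downstream(self, other):
        self.downstream.add(other.task_id)
        other.upstream.add(self.task_id)

    def set_upstream(self, other):
        other.set_downstream(self)

    def __rshift__(self, other):
        # `a >> b` means "a runs before b"; returning `other` allows chaining.
        self.set_downstream(other)
        return other


def link(style):
    """Wire t1 before t2 using one of the three spellings."""
    t1, t2 = Task("t1"), Task("t2")
    if style == "set_downstream":
        t1.set_downstream(t2)
    elif style == "set_upstream":
        t2.set_upstream(t1)
    else:
        t1 >> t2
    return t1.downstream, t2.upstream


results = [link(s) for s in ("set_downstream", "set_upstream", "rshift")]
print(results)
```

Because `__rshift__` returns the right-hand task, a chain such as `a >> b >> c` reads left to right in execution order.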
Concepts: Task Instance
● Represents a specific run of a task
● Characterized as the combination of a dag, a task, and a point in
time.
● Task instances also have an indicative state, which could be
“running”, “success”, “failed”, “skipped”, “up for retry”, etc.
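As a rough illustration of "a dag, a task, and a point in time", the sketch below models a task instance as a small record. The class name, field names and the example DAG are assumptions for this example and do not mirror Airflow's own model.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class ToyTaskInstance:
    """One run of one task: identified by DAG, task and point in time."""
    dag_id: str
    task_id: str
    execution_date: datetime
    state: str = "queued"

    @property
    def key(self):
        return (self.dag_id, self.task_id, self.execution_date)


# Same hypothetical DAG and task, two different points in time.
monday = ToyTaskInstance("daily_report", "load", datetime(2018, 4, 23), state="success")
tuesday = ToyTaskInstance("daily_report", "load", datetime(2018, 4, 24), state="failed")

print(monday.key != tuesday.key)
```

The same task can therefore succeed for one date and fail for another, and each run can be inspected or retried on its own.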
Architecture
Airflow comes with 5 main types of built-in execution modes:
● Runs on a single machine:
○ Sequential
○ Local
● Runs on a distributed system:
○ Celery (out of the box)
○ Mesos (community driven)
○ Kubernetes (community driven)
Architecture: Sequential Executor
● Default mode
● Minimum setup: works with SQLite as well
● Processes 1 task at a time
● Good for demo purposes
Architecture: Local Executor
● Spawned by the scheduler process
● Vertical scaling
● Production grade
● Does not need a broker or any other negotiator
Architecture: Celery Executor
● Vertical & horizontal scaling
● Production grade
● Can be monitored (via Flower)
● Supports pools and queues
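The difference between running one task at a time and running several on the same machine can be sketched with the standard library. This is only an analogy for the Sequential and Local modes: the task names and helper functions are invented, and threads are used here purely to keep the example short, which is not a claim about how Airflow's executors are implemented.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_task(name):
    """Stand-in for real work; sleeps briefly and returns its name."""
    time.sleep(0.05)
    return name


tasks = ["a", "b", "c", "d"]  # independent tasks, hypothetical names


def run_sequential(names):
    # One task at a time, like the "Sequential" mode on the slide.
    return [fake_task(n) for n in names]


def run_local_pool(names, workers=4):
    # Several tasks at once on one machine, like the "Local" mode.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fake_task, names))


start = time.perf_counter()
sequential_result = run_sequential(tasks)
sequential_seconds = time.perf_counter() - start

start = time.perf_counter()
pool_result = run_local_pool(tasks)
pool_seconds = time.perf_counter() - start

print(sequential_result == pool_result)
print(f"sequential: {sequential_seconds:.2f}s, pool: {pool_seconds:.2f}s")
```

Both runs produce the same results; the pool simply overlaps the waiting time. A distributed mode such as Celery extends the same idea across several machines, which is where a broker becomes necessary.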
Useful Feature
Task callbacks for success / failure / SLA miss
https://speakerdeck.com/takus/building-data-pipelines-with-apache-airflow
Starting Airflow
Instructions to start Airflow
● Set up the Airflow home directory
$ export AIRFLOW_HOME=~/airflow
● Initialise the Airflow database
$ source airflow_workshop/bin/activate
$ airflow initdb
● Start the web server (the default port is 8080)
$ airflow webserver -p 8080
● Start the scheduler (in another terminal)
$ source airflow_workshop/bin/activate
$ airflow scheduler
● Visit localhost:8080 in the browser to see the Airflow Web UI
Copy DAGs
● Clone the Git repo
$ git clone https://github.com/DataReplyUK/airflow-workshop-pydata-london-2018.git
● Copy the dags folder into AIRFLOW_HOME
$ cp -r airflow-workshop-pydata-london-2018/dags $AIRFLOW_HOME/dags
Tutorial 1: Basic Workflow
Tutorial 2: Dynamic Workflow
Advanced Concepts
➔ XCom
➔ Trigger Rules
➔ Variables
➔ Branching
➔ SubDAGs
Concepts: XCOM
● Abbreviation of “cross-communication”
● Means of communication between task instances
● Saved in database as a pickled object
● Best suited for small pieces of data (ids, etc.)
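The points above (a value passed between tasks, pickled, stored in a database) can be sketched with `sqlite3` and `pickle` from the standard library. This is a toy model, not Airflow's XCom table or API: the table layout, the helper functions and the task names are assumptions for illustration.

```python
import pickle
import sqlite3

# In-memory database standing in for the metadata database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE toy_xcom (task_id TEXT, key TEXT, value BLOB, "
    "PRIMARY KEY (task_id, key))"
)


def xcom_push(task_id, key, value):
    """Store a small Python object under (task_id, key) as a pickle."""
    conn.execute(
        "INSERT OR REPLACE INTO toy_xcom VALUES (?, ?, ?)",
        (task_id, key, pickle.dumps(value)),
    )


def xcom_pull(task_id, key):
    """Fetch and unpickle a value pushed by another task, or None."""
    row = conn.execute(
        "SELECT value FROM toy_xcom WHERE task_id = ? AND key = ?",
        (task_id, key),
    ).fetchone()
    return pickle.loads(row[0]) if row else None


# An upstream task pushes a small list of ids; a downstream task pulls it.
xcom_push("extract", "new_ids", [101, 102, 103])
print(xcom_pull("extract", "new_ids"))
```

Every value makes a round trip through the database, which is why this mechanism suits ids and other small pieces of data rather than whole datasets.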
Concepts: Trigger Rule
● The condition under which a task is triggered, based on the state of its upstream (parent) tasks
● Each operator has ‘trigger_rule’ argument
● Following are the different trigger rules:
○ all_success: (default) all parents have succeeded
○ all_failed: all parents are in a failed or upstream_failed state
○ all_done: all parents are done with their execution
○ one_failed: fires as soon as at least one parent has failed, it
does not wait for all parents to be done
○ one_success: fires as soon as at least one parent succeeds, it
does not wait for all parents to be done
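The rules listed above can be written down as a small decision function. This is a sketch of the descriptions on the slide, not Airflow's implementation; the function name `should_trigger` and the use of `None` for "parent not finished yet" are assumptions, as is counting `upstream_failed` as a failure for `one_failed`.

```python
def should_trigger(rule, parent_states):
    """Decide whether a task may run, given its parents' states so far.

    `parent_states` holds one entry per parent; None means "not finished yet".
    """
    finished = [s for s in parent_states if s is not None]
    all_finished = len(finished) == len(parent_states)
    failed_states = {"failed", "upstream_failed"}

    if rule == "all_success":
        return all_finished and all(s == "success" for s in finished)
    if rule == "all_failed":
        return all_finished and all(s in failed_states for s in finished)
    if rule == "all_done":
        return all_finished
    if rule == "one_failed":
        # Fires early: does not wait for the remaining parents.
        # Assumption: upstream_failed counts as a failure here too.
        return any(s in failed_states for s in finished)
    if rule == "one_success":
        # Fires early as well, as soon as any parent has succeeded.
        return any(s == "success" for s in finished)
    raise ValueError(f"unknown trigger rule: {rule}")


# Two parents: one has failed, the other is still running.
states = ["failed", None]
for rule in ("all_success", "all_failed", "all_done", "one_failed", "one_success"):
    print(rule, should_trigger(rule, states))
```

With one parent failed and one still running, only `one_failed` fires, which shows how the "does not wait" rules differ from the "all" rules.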
Concepts: Variables
● Generic way to store and retrieve arbitrary content
● Can be used for storing settings as a simple key value store
● Variables can be created, updated, deleted and exported to a JSON file from the UI
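As a sketch of the key-value idea, the example below round-trips a set of settings through JSON, assuming a variables file is a flat JSON object of names and values. The `gcs_bucket` key and all of the values are placeholders invented for this example; the workshop repository ships its own variables.json.

```python
import json

# Placeholder settings for illustration only.
variables = {
    "gcs_bucket": "example-bucket",
    "bq_destination_dataset_table": "example_dataset.example_table",
}

# Export to the JSON text that a variables file would contain...
exported = json.dumps(variables, indent=2, sort_keys=True)

# ...and read it back, as an import would.
imported = json.loads(exported)

# Updating and deleting work as on any key-value store.
imported["gcs_bucket"] = "another-bucket"
del imported["bq_destination_dataset_table"]

print(exported)
print(imported)
```

Keeping settings like bucket or table names in variables means a DAG file does not need to be edited when they change.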
Concepts: Branching
● Branches the workflow based on condition.
● Condition can be defined using BranchPythonOperator
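Branching can be pictured as a function that returns the id of the task to follow, with the other candidates being skipped. The sketch below is plain Python rather than `BranchPythonOperator` itself; the task ids, the row-count condition and the `resolve_branch` helper are hypothetical.

```python
def choose_branch(row_count):
    """Hypothetical condition: return the id of the task to follow."""
    return "load_to_warehouse" if row_count > 0 else "notify_empty_file"


def resolve_branch(chosen, candidates):
    """Mark the chosen task to run and every other candidate as skipped."""
    if chosen not in candidates:
        raise ValueError(f"{chosen!r} is not a downstream task")
    return {task: ("run" if task == chosen else "skipped") for task in candidates}


downstream = ["load_to_warehouse", "notify_empty_file"]
print(resolve_branch(choose_branch(row_count=42), downstream))
print(resolve_branch(choose_branch(row_count=0), downstream))
```

Checking that the returned id is one of the downstream tasks catches a common mistake: a typo in the id would otherwise leave no branch to follow.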
Concepts: SubDAGs
Perfect for repeating patterns
Tutorial 3: Workflow in GCP
GCP: Introduction
● A cloud computing service offered by Google
● Popular products that we are going to use today:
○ Google Cloud Storage: File and object storage
○ BigQuery: Large scale analytics data warehouse
○ And many more..
● Click here to access our GCP Project
○ Google Cloud Storage - link
○ BigQuery - link
Tutorial 3: Create Connection
● Click on Admin ↠ Connections
● Or visit http://localhost:8080/admin/connection/
● Click on Create and enter following details:
○ Conn Id: airflow-service-account
○ Conn Type: Google Cloud Platform
○ Project ID: pydata2018-airflow
○ Keyfile Path: PATH_TO_YOUR_JSON_FILE
■ E.g. “/Users/kaxil/Desktop/Service_Account_Keys/sb01-service-account.json”
○ Keyfile JSON:
○ Scopes: https://www.googleapis.com/auth/cloud-platform
● Click on Save
Tutorial 3: Import Variables
● Click on Admin ↠ Variables
● Or visit http://localhost:8080/admin/variable/
● Click on Choose file and select variables.json (file in the
directory where you have cloned the Git repo)
● Click on Import Variables button.
● Edit bq_destination_dataset_table variable to enter:
“pydata_airflow.kaxil_usa_names” after replacing kaxil with
your firstname_lastname
Tutorial 3
● Objective:
○ Wait for a file to be uploaded to Google Cloud Storage
○ Once the files are uploaded, a BigQuery table is created and the data from GCS is imported into it
● Visit:
○ Folder where files would be uploaded: click here
○ Dataset where the table would be created: click here
Tutorial 3
● Trigger DAG_3_GCS_To_BigQuery dag and check Graph View
to see the current running task.
Summary
● Airflow = workflow as code
● Integrates seamlessly into “pythonic” data science stack
● Easily extensible
● Clean management of workflow metadata
● Different alerting systems (email, Slack)
● Huge community and under active development
● Proven real-world projects
Thank You

Weitere ähnliche Inhalte

Was ist angesagt?

Apache Airflow Architecture
Apache Airflow ArchitectureApache Airflow Architecture
Apache Airflow ArchitectureGerard Toonstra
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Databricks
 
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in SparkSpark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in SparkBo Yang
 
Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...Flink Forward
 
Airflow - a data flow engine
Airflow - a data flow engineAirflow - a data flow engine
Airflow - a data flow engineWalter Liu
 
Running Airflow Workflows as ETL Processes on Hadoop
Running Airflow Workflows as ETL Processes on HadoopRunning Airflow Workflows as ETL Processes on Hadoop
Running Airflow Workflows as ETL Processes on Hadoopclairvoyantllc
 
End-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks Delta
End-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks DeltaEnd-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks Delta
End-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks DeltaDatabricks
 
Incremental Processing on Large Analytical Datasets with Prasanna Rajaperumal...
Incremental Processing on Large Analytical Datasets with Prasanna Rajaperumal...Incremental Processing on Large Analytical Datasets with Prasanna Rajaperumal...
Incremental Processing on Large Analytical Datasets with Prasanna Rajaperumal...Databricks
 
3D: DBT using Databricks and Delta
3D: DBT using Databricks and Delta3D: DBT using Databricks and Delta
3D: DBT using Databricks and DeltaDatabricks
 
Real-time Analytics with Trino and Apache Pinot
Real-time Analytics with Trino and Apache PinotReal-time Analytics with Trino and Apache Pinot
Real-time Analytics with Trino and Apache PinotXiang Fu
 
Spark Autotuning: Spark Summit East talk by Lawrence Spracklen
Spark Autotuning: Spark Summit East talk by Lawrence SpracklenSpark Autotuning: Spark Summit East talk by Lawrence Spracklen
Spark Autotuning: Spark Summit East talk by Lawrence SpracklenSpark Summit
 
Building a Data Pipeline using Apache Airflow (on AWS / GCP)
Building a Data Pipeline using Apache Airflow (on AWS / GCP)Building a Data Pipeline using Apache Airflow (on AWS / GCP)
Building a Data Pipeline using Apache Airflow (on AWS / GCP)Yohei Onishi
 
Hive Bucketing in Apache Spark with Tejas Patil
Hive Bucketing in Apache Spark with Tejas PatilHive Bucketing in Apache Spark with Tejas Patil
Hive Bucketing in Apache Spark with Tejas PatilDatabricks
 
Introducing Apache Airflow and how we are using it
Introducing Apache Airflow and how we are using itIntroducing Apache Airflow and how we are using it
Introducing Apache Airflow and how we are using itBruno Faria
 
The Patterns of Distributed Logging and Containers
The Patterns of Distributed Logging and ContainersThe Patterns of Distributed Logging and Containers
The Patterns of Distributed Logging and ContainersSATOSHI TAGOMORI
 
How to Actually Tune Your Spark Jobs So They Work
How to Actually Tune Your Spark Jobs So They WorkHow to Actually Tune Your Spark Jobs So They Work
How to Actually Tune Your Spark Jobs So They WorkIlya Ganelin
 
The Top Five Mistakes Made When Writing Streaming Applications with Mark Grov...
The Top Five Mistakes Made When Writing Streaming Applications with Mark Grov...The Top Five Mistakes Made When Writing Streaming Applications with Mark Grov...
The Top Five Mistakes Made When Writing Streaming Applications with Mark Grov...Databricks
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsDatabricks
 

Was ist angesagt? (20)

Apache Airflow Architecture
Apache Airflow ArchitectureApache Airflow Architecture
Apache Airflow Architecture
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
 
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in SparkSpark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
 
Airflow and supervisor
Airflow and supervisorAirflow and supervisor
Airflow and supervisor
 
Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...
 
Airflow - a data flow engine
Airflow - a data flow engineAirflow - a data flow engine
Airflow - a data flow engine
 
Running Airflow Workflows as ETL Processes on Hadoop
Running Airflow Workflows as ETL Processes on HadoopRunning Airflow Workflows as ETL Processes on Hadoop
Running Airflow Workflows as ETL Processes on Hadoop
 
End-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks Delta
End-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks DeltaEnd-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks Delta
End-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks Delta
 
Incremental Processing on Large Analytical Datasets with Prasanna Rajaperumal...
Incremental Processing on Large Analytical Datasets with Prasanna Rajaperumal...Incremental Processing on Large Analytical Datasets with Prasanna Rajaperumal...
Incremental Processing on Large Analytical Datasets with Prasanna Rajaperumal...
 
Apache airflow
Apache airflowApache airflow
Apache airflow
 
3D: DBT using Databricks and Delta
3D: DBT using Databricks and Delta3D: DBT using Databricks and Delta
3D: DBT using Databricks and Delta
 
Real-time Analytics with Trino and Apache Pinot
Real-time Analytics with Trino and Apache PinotReal-time Analytics with Trino and Apache Pinot
Real-time Analytics with Trino and Apache Pinot
 
Spark Autotuning: Spark Summit East talk by Lawrence Spracklen
Spark Autotuning: Spark Summit East talk by Lawrence SpracklenSpark Autotuning: Spark Summit East talk by Lawrence Spracklen
Spark Autotuning: Spark Summit East talk by Lawrence Spracklen
 
Building a Data Pipeline using Apache Airflow (on AWS / GCP)
Building a Data Pipeline using Apache Airflow (on AWS / GCP)Building a Data Pipeline using Apache Airflow (on AWS / GCP)
Building a Data Pipeline using Apache Airflow (on AWS / GCP)
 
Hive Bucketing in Apache Spark with Tejas Patil
Hive Bucketing in Apache Spark with Tejas PatilHive Bucketing in Apache Spark with Tejas Patil
Hive Bucketing in Apache Spark with Tejas Patil
 
Introducing Apache Airflow and how we are using it
Introducing Apache Airflow and how we are using itIntroducing Apache Airflow and how we are using it
Introducing Apache Airflow and how we are using it
 
The Patterns of Distributed Logging and Containers
The Patterns of Distributed Logging and ContainersThe Patterns of Distributed Logging and Containers
The Patterns of Distributed Logging and Containers
 
How to Actually Tune Your Spark Jobs So They Work
How to Actually Tune Your Spark Jobs So They WorkHow to Actually Tune Your Spark Jobs So They Work
How to Actually Tune Your Spark Jobs So They Work
 
The Top Five Mistakes Made When Writing Streaming Applications with Mark Grov...
The Top Five Mistakes Made When Writing Streaming Applications with Mark Grov...The Top Five Mistakes Made When Writing Streaming Applications with Mark Grov...
The Top Five Mistakes Made When Writing Streaming Applications with Mark Grov...
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIs
 

Ähnlich wie Apache Airflow in the Cloud: Programmatically orchestrating workloads with Python - PyData London 2018

What's coming in Airflow 2.0? - NYC Apache Airflow Meetup
What's coming in Airflow 2.0? - NYC Apache Airflow MeetupWhat's coming in Airflow 2.0? - NYC Apache Airflow Meetup
What's coming in Airflow 2.0? - NYC Apache Airflow MeetupKaxil Naik
 
Prefect Paris Airflow Meetup Jeff Hale April 2023.pdf
Prefect Paris Airflow Meetup Jeff Hale April 2023.pdfPrefect Paris Airflow Meetup Jeff Hale April 2023.pdf
Prefect Paris Airflow Meetup Jeff Hale April 2023.pdfJeff Hale
 
DevOpsDays Taipei 2019 - Mastering IaC the DevOps Way
DevOpsDays Taipei 2019 - Mastering IaC the DevOps WayDevOpsDays Taipei 2019 - Mastering IaC the DevOps Way
DevOpsDays Taipei 2019 - Mastering IaC the DevOps Waysmalltown
 
Bootstrapping a ML platform at Bluevine [Airflow Summit 2020]
Bootstrapping a ML platform at Bluevine [Airflow Summit 2020]Bootstrapping a ML platform at Bluevine [Airflow Summit 2020]
Bootstrapping a ML platform at Bluevine [Airflow Summit 2020]Noam Elfanbaum
 
Upcoming features in Airflow 2
Upcoming features in Airflow 2Upcoming features in Airflow 2
Upcoming features in Airflow 2Kaxil Naik
 
Sweet Streams (Are made of this)
Sweet Streams (Are made of this)Sweet Streams (Are made of this)
Sweet Streams (Are made of this)Corneil du Plessis
 
Snowflake Automated Deployments / CI/CD Pipelines
Snowflake Automated Deployments / CI/CD PipelinesSnowflake Automated Deployments / CI/CD Pipelines
Snowflake Automated Deployments / CI/CD PipelinesDrew Hansen
 
Thinking DevOps in the era of the Cloud - Demi Ben-Ari
Thinking DevOps in the era of the Cloud - Demi Ben-AriThinking DevOps in the era of the Cloud - Demi Ben-Ari
Thinking DevOps in the era of the Cloud - Demi Ben-AriDemi Ben-Ari
 
Designing flexible apps deployable to App Engine, Cloud Functions, or Cloud Run
Designing flexible apps deployable to App Engine, Cloud Functions, or Cloud RunDesigning flexible apps deployable to App Engine, Cloud Functions, or Cloud Run
Designing flexible apps deployable to App Engine, Cloud Functions, or Cloud Runwesley chun
 
How to Create a Service in Choreo
How to Create a Service in ChoreoHow to Create a Service in Choreo
How to Create a Service in ChoreoWSO2
 
Aws uk ug #8 not everything that happens in vegas stay in vegas
Aws uk ug #8   not everything that happens in vegas stay in vegasAws uk ug #8   not everything that happens in vegas stay in vegas
Aws uk ug #8 not everything that happens in vegas stay in vegasPeter Mounce
 
Bots on guard of sdlc
Bots on guard of sdlcBots on guard of sdlc
Bots on guard of sdlcAlexey Tokar
 
Interconnection Automation For All - Extended - MPS 2023
Interconnection Automation For All - Extended - MPS 2023Interconnection Automation For All - Extended - MPS 2023
Interconnection Automation For All - Extended - MPS 2023Chris Grundemann
 
How I learned to time travel, or, data pipelining and scheduling with Airflow
How I learned to time travel, or, data pipelining and scheduling with AirflowHow I learned to time travel, or, data pipelining and scheduling with Airflow
How I learned to time travel, or, data pipelining and scheduling with AirflowPyData
 
A GitOps Kubernetes Native CICD Solution with Argo Events, Workflows, and CD
A GitOps Kubernetes Native CICD Solution with Argo Events, Workflows, and CDA GitOps Kubernetes Native CICD Solution with Argo Events, Workflows, and CD
A GitOps Kubernetes Native CICD Solution with Argo Events, Workflows, and CDJulian Mazzitelli
 
On component interface
On component interfaceOn component interface
On component interfaceLaurence Chen
 

Ähnlich wie Apache Airflow in the Cloud: Programmatically orchestrating workloads with Python - PyData London 2018 (20)

Google Cloud Dataflow
Google Cloud DataflowGoogle Cloud Dataflow
Google Cloud Dataflow
 
What's coming in Airflow 2.0? - NYC Apache Airflow Meetup
What's coming in Airflow 2.0? - NYC Apache Airflow MeetupWhat's coming in Airflow 2.0? - NYC Apache Airflow Meetup
What's coming in Airflow 2.0? - NYC Apache Airflow Meetup
 
Prefect Paris Airflow Meetup Jeff Hale April 2023.pdf
Prefect Paris Airflow Meetup Jeff Hale April 2023.pdfPrefect Paris Airflow Meetup Jeff Hale April 2023.pdf
Prefect Paris Airflow Meetup Jeff Hale April 2023.pdf
 
Dataflow.pptx
Dataflow.pptxDataflow.pptx
Dataflow.pptx
 
Airflow Intro-1.pdf
Airflow Intro-1.pdfAirflow Intro-1.pdf
Airflow Intro-1.pdf
 
DevOpsDays Taipei 2019 - Mastering IaC the DevOps Way
DevOpsDays Taipei 2019 - Mastering IaC the DevOps WayDevOpsDays Taipei 2019 - Mastering IaC the DevOps Way
DevOpsDays Taipei 2019 - Mastering IaC the DevOps Way
 
Bootstrapping a ML platform at Bluevine [Airflow Summit 2020]
Bootstrapping a ML platform at Bluevine [Airflow Summit 2020]Bootstrapping a ML platform at Bluevine [Airflow Summit 2020]
Bootstrapping a ML platform at Bluevine [Airflow Summit 2020]
 
Upcoming features in Airflow 2
Upcoming features in Airflow 2Upcoming features in Airflow 2
Upcoming features in Airflow 2
 
Sweet Streams (Are made of this)
Sweet Streams (Are made of this)Sweet Streams (Are made of this)
Sweet Streams (Are made of this)
 
Snowflake Automated Deployments / CI/CD Pipelines
Snowflake Automated Deployments / CI/CD PipelinesSnowflake Automated Deployments / CI/CD Pipelines
Snowflake Automated Deployments / CI/CD Pipelines
 
Thinking DevOps in the era of the Cloud - Demi Ben-Ari
Thinking DevOps in the era of the Cloud - Demi Ben-AriThinking DevOps in the era of the Cloud - Demi Ben-Ari
Thinking DevOps in the era of the Cloud - Demi Ben-Ari
 
Designing flexible apps deployable to App Engine, Cloud Functions, or Cloud Run
Designing flexible apps deployable to App Engine, Cloud Functions, or Cloud RunDesigning flexible apps deployable to App Engine, Cloud Functions, or Cloud Run
Designing flexible apps deployable to App Engine, Cloud Functions, or Cloud Run
 
How to Create a Service in Choreo
How to Create a Service in ChoreoHow to Create a Service in Choreo
How to Create a Service in Choreo
 
Aws uk ug #8 not everything that happens in vegas stay in vegas
Aws uk ug #8   not everything that happens in vegas stay in vegasAws uk ug #8   not everything that happens in vegas stay in vegas
Aws uk ug #8 not everything that happens in vegas stay in vegas
 
Bots on guard of sdlc
Bots on guard of sdlcBots on guard of sdlc
Bots on guard of sdlc
 
Interconnection Automation For All - Extended - MPS 2023
Interconnection Automation For All - Extended - MPS 2023Interconnection Automation For All - Extended - MPS 2023
Interconnection Automation For All - Extended - MPS 2023
 
How I learned to time travel, or, data pipelining and scheduling with Airflow
How I learned to time travel, or, data pipelining and scheduling with AirflowHow I learned to time travel, or, data pipelining and scheduling with Airflow
How I learned to time travel, or, data pipelining and scheduling with Airflow
 
A GitOps Kubernetes Native CICD Solution with Argo Events, Workflows, and CD
A GitOps Kubernetes Native CICD Solution with Argo Events, Workflows, and CDA GitOps Kubernetes Native CICD Solution with Argo Events, Workflows, and CD
A GitOps Kubernetes Native CICD Solution with Argo Events, Workflows, and CD
 
On component interface
On component interfaceOn component interface
On component interface
 
Airflow presentation
Airflow presentationAirflow presentation
Airflow presentation
 

Mehr von Kaxil Naik

Building and deploying LLM applications with Apache Airflow
Building and deploying LLM applications with Apache AirflowBuilding and deploying LLM applications with Apache Airflow
Building and deploying LLM applications with Apache AirflowKaxil Naik
 
Introducing airflowctl: A CLI to streamline getting started with Airflow - Ai...
Introducing airflowctl: A CLI to streamline getting started with Airflow - Ai...Introducing airflowctl: A CLI to streamline getting started with Airflow - Ai...
Introducing airflowctl: A CLI to streamline getting started with Airflow - Ai...Kaxil Naik
 
Airflow: Save Tons of Money by Using Deferrable Operators
Airflow: Save Tons of Money by Using Deferrable OperatorsAirflow: Save Tons of Money by Using Deferrable Operators
Airflow: Save Tons of Money by Using Deferrable OperatorsKaxil Naik
 
Why Airflow? & What's new in Airflow 2.3?
Apache Airflow in the Cloud: Programmatically orchestrating workloads with Python - PyData London 2018

  • 1. Apache Airflow in the Cloud Programmatically orchestrating workloads with Python
  • 2. Basic instructions ● Join the chat room: https://tlk.io/pydata-london-airflow (all necessary links are in this chat room) ○ Please fill in the Google form (for those who haven’t filled it in prior to this tutorial); this is to add you to our Google Cloud Platform project ○ Download the JSON file for Google Cloud access ○ Clone the GitHub repo for this tutorial ○ Follow the link for Airflow installation ○ Link to this slide deck ● Clone the repository ○ GitHub link: https://github.com/DataReplyUK/airflow-workshop-pydata-london-2018
  • 3. Agenda ● Who we are and why we are here? ● Different workflow management systems ● What is Apache Airflow? ● Basic Concepts and UI Walkthrough ● Tutorial 1: Basic workflow ● Tutorial 2: Dynamic workflow ● GCP introduction ● Tutorial 3: Workflow in GCP
  • 4. Who we are! Hello! I am Kaxil Naik, Data Engineer at Data Reply and an Airflow contributor. Hi! I am Satyasheel, Data Engineer at Data Reply.
  • 5. First of all, some truth
  • 6. Data is weird and it breaks stuff
  • 7. Bad data can make everything fail, from the internet to local computers to service APIs
  • 8. Robust pipelines need robust workflows ● Cron (Seriously!) ● Oozie ● Luigi ● Apache Airflow Image Source: https://www.free-vectors.com/images/Objects/263-hand-wrench-tool---spanner-vector.png
  • 11. Oozie Pros: ○ Used by thousands of companies ○ Web API, Java API, CLI and HTML support ○ Oldest among all Cons: ○ XML ○ Significant effort in managing ○ Difficult to customize https://i.pinimg.com/originals/31/f9/bc/31f9bc4099f1ab2c553f29b8d95274c7.jpg
  • 12. Luigi Pros: ○ Pythonic way to write a DAG ○ Pretty stable ○ Huge community ○ Came from the Spotify engineering team Cons: ○ Have to schedule workflows externally ○ The open source Luigi UI is hard to use ○ No built-in monitoring or alerting https://cdn.vectorstock.com/i/1000x1000/88/00/car-sketch-vector-98800.jpg
  • 13. Airflow Pros: ○ Python code base ○ Active community ○ Trigger rules ○ Cool web UI and rich CLI ○ Queues & Pools ○ Zombie cleanup ○ Easily extendable Cons: ○ No role-based access control ○ Minor issues (deleting DAGs is not straightforward) ○ Umm!!! Image Source: https://s3.envato.com/files/133677334/Preview_Image_Set/GRID_06.jpg
  • 16. What is Airflow? http://www.belairdaily.com/wp-content/uploads/2018/04/Cable-Management-Market-1.jpg
  • 17. What is Airflow? Airflow is a platform to programmatically author, schedule and monitor workflows (a.k.a. DAGs)
  • 18. What is Airflow? ● Open-source ETL workflow management tool written purely in Python ● It’s the glue that binds your data ecosystem together ● It can handle failures ● Alert on failures (email, Slack) ● Monitor performance of tasks over time ● Scale! ● Developed by Airbnb ● Inspired by Facebook’s Dataswarm ● It is production ready ● It ships with: ○ DAG scheduler ○ Web application UI ○ Powerful CLI
  • 26. Airflow Web UI: first look at the Airflow web UI right after installation
  • 27. Similar to Cron: this is how it looks once you start running your DAGs…
  • 28. ● No. of successfully completed tasks ● No. of queued tasks ● No. of failed tasks ● Status of recent DAG runs
  • 29. ● A DAG can be paused by switching it off ● No. of successful DAG runs ● No. of DAG instances running ● No. of times the DAG failed to run
  • 30. Links to detailed DAG info
  • 31. Web UI: Tree View A tree representation of the DAG that spans across time (task run history)
  • 32. Web UI: Graph View Visualize task dependencies & current status for a specific run
  • 33. Web UI: Task Duration The duration of your different tasks over the past N runs.
  • 34. Web UI: Gantt Which task is a blocker? https://speakerdeck.com/artwr/apache-airflow-at-airbnb-introduction-and-lessons-learned
  • 35. Web UI: DAG details See task metadata, rendered template, execution logs etc… for debugging
  • 37. DAG Anatomy: workflow as code. A DAG (workflow) is Python code made up of: ● DAG definition ● DAG default configurations ● Tasks ● Task dependencies
  • 42. Concepts: DAG ● DAG - Directed Acyclic Graph ● Defines workflow logic as the shape of the graph ● It is a collection of all the tasks you want to run, organized in a way that reflects their relationships and dependencies.
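To make the "shape of the graph" idea concrete, here is a minimal sketch in plain Python (standard library only, not Airflow code). The task names are hypothetical, and `graphlib` requires Python 3.9 or later. It shows how a dependency mapping determines a valid run order, and why a cycle is rejected.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical tasks: each key maps to the set of upstream tasks it depends on.
dependencies = {
    "extract": set(),
    "clean": {"extract"},
    "validate": {"extract"},
    "load": {"clean", "validate"},
}


def execution_order(deps):
    """Return one valid run order; raises CycleError if the graph has a cycle."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)  # "extract" always comes first and "load" always comes last


def has_cycle(deps):
    """Report whether the dependency mapping contains a cycle."""
    try:
        execution_order(deps)
    except CycleError:
        return True
    return False
```

"clean" and "validate" do not depend on each other, so either may come first. That freedom is what lets a scheduler run independent tasks in parallel.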
  • 43. Concepts: OPERATORS ● Workflows are composed of Operators ● While DAGs describe how to run a workflow, Operators determine what actually gets done
  • 44. Concepts: OPERATORS ● 3 main types of operators: ○ Sensors are a certain type of operator that will keep running until a certain criterion is met ○ Operators that perform an action, or tell another system to perform an action ○ Transfer Operators move data from one system to another
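The sensor behaviour described above, "keep running until a certain criterion is met", can be sketched as a simple polling loop. This is a plain-Python illustration under assumptions, not Airflow's sensor classes: `wait_until`, `poke_interval`, `timeout` and `file_has_arrived` are names chosen for this sketch, and it returns `False` on timeout.

```python
import time


def wait_until(poke, poke_interval=1.0, timeout=10.0):
    """Call poke() repeatedly until it returns True or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        if poke():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poke_interval)


# Hypothetical criterion: pretend the file shows up on the third check.
attempts = {"count": 0}


def file_has_arrived():
    attempts["count"] += 1
    return attempts["count"] >= 3


arrived = wait_until(file_has_arrived, poke_interval=0.01, timeout=5.0)
print(arrived, attempts["count"])
```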
  • 45. Concepts: TASKS ● A parameterized instance of an operator. Once an operator is instantiated, it is referred to as a “task”. The instantiation defines specific values when calling the abstract operator ● The parameterized task becomes a node in a DAG
  • 48. Concepts: TASK INSTANCE ● Represents a specific run of a task ● Characterized as the combination of a DAG, a task, and a point in time. ● Task instances also have an indicative state, which could be “running”, “success”, “failed”, “skipped”, “up for retry”, etc.
  • 50. Architecture Airflow comes with 5 main types of built-in execution modes: ● Runs on a single machine: Sequential, Local ● Runs on a distributed system: Celery (out of the box), Mesos (community driven), Kubernetes (community driven)
  • 51. Architecture: Sequential Executor ● Default mode ● Minimal setup; works with SQLite as well ● Processes one task at a time ● Good for demo purposes
  • 52. Architecture: Local Executor ● Spawned by scheduler process ● Vertical scaling ● Production grade ● Does not need broker or any other negotiator
  • 53. Architecture: Celery Executor ● Vertical & horizontal scaling ● Production grade ● Can be monitored (via Flower) ● Supports pools and queues
  • 54. Useful feature: task callbacks for success / failure / SLA miss https://speakerdeck.com/takus/building-data-pipelines-with-apache-airflow
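As a rough illustration of a failure callback, the sketch below formats an alert from a context dictionary. It is plain Python with a stand-in context. It assumes the callback receives a dict containing `task_instance` and `execution_date`, and the DAG and task names are hypothetical. The message is appended to a list rather than sent to email or Slack.

```python
from types import SimpleNamespace

sent_alerts = []  # stand-in for an email or Slack client


def notify_failure(context):
    """Build a short alert from the context passed to a failure callback."""
    ti = context["task_instance"]  # assumed key; carries dag_id and task_id
    message = "Task {}.{} failed for run {}".format(
        ti.dag_id, ti.task_id, context["execution_date"]
    )
    sent_alerts.append(message)
    return message


# Stand-in context so the function can be exercised without a scheduler.
fake_context = {
    "task_instance": SimpleNamespace(dag_id="example_dag", task_id="load_data"),
    "execution_date": "2018-04-27",
}
notify_failure(fake_context)
print(sent_alerts)
```

In a DAG file, a function like this would typically be passed as a task's `on_failure_callback` argument.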
  • 56. Instructions to start Airflow
      ● Set up the Airflow installation directory
        $ export AIRFLOW_HOME=~/airflow
      ● Initialize the Airflow database
        $ source airflow_workshop/bin/activate
        $ airflow initdb
      ● Start the web server (default port is 8080)
        $ airflow webserver -p 8080
      ● Start the scheduler (in another terminal)
        $ source airflow_workshop/bin/activate
        $ airflow scheduler
      ● Visit localhost:8080 in the browser to see the Airflow web UI
  • 57. Copy DAGs
      ● Clone the Git repo
        $ git clone https://github.com/DataReplyUK/airflow-workshop-pydata-london-2018.git
      ● Copy the dags folder into AIRFLOW_HOME
        $ cp -r airflow-workshop-pydata-london-2018/dags $AIRFLOW_HOME/dags
  • 58. Tutorial 1: Basic Workflow
  • 60. Advanced Concepts ➔ XCom ➔ Trigger Rules ➔ Variables ➔ Branching ➔ SubDAGs
  • 61. Concepts: XCOM ● Abbreviation of “cross-communication” ● Means of communication between task instances ● Saved in database as a pickled object ● Best suited for small pieces of data (ids, etc.)
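The XCom idea, small values pickled and stored so another task instance can read them, can be modelled with a toy in-memory store. This is only a sketch of the concept. `ToyXComStore` and its method signatures are invented here and are not Airflow's API; the dictionary stands in for the metadata database.

```python
import pickle


class ToyXComStore:
    """In-memory stand-in for the database table that holds XCom values."""

    def __init__(self):
        self._rows = {}

    def push(self, dag_id, task_id, execution_date, key, value):
        # Values are stored pickled, so keep them small (ids, paths, counts).
        self._rows[(dag_id, task_id, execution_date, key)] = pickle.dumps(value)

    def pull(self, dag_id, task_id, execution_date, key):
        blob = self._rows.get((dag_id, task_id, execution_date, key))
        return None if blob is None else pickle.loads(blob)


store = ToyXComStore()
# A hypothetical upstream task publishes the ids it produced...
store.push("example_dag", "extract", "2018-04-27", "batch_ids", [101, 102])
# ...and a downstream task in the same run reads them back.
batch_ids = store.pull("example_dag", "extract", "2018-04-27", "batch_ids")
print(batch_ids)
```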
  • 62. Concepts: Trigger Rule ● Condition under which a task is triggered, based on the state of its upstream (parent) tasks ● Each operator has a ‘trigger_rule’ argument ● The trigger rules are: ○ all_success: (default) all parents have succeeded ○ all_failed: all parents are in a failed or upstream_failed state ○ all_done: all parents are done with their execution ○ one_failed: fires as soon as at least one parent has failed, it does not wait for all parents to be done ○ one_success: fires as soon as at least one parent succeeds, it does not wait for all parents to be done
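The rules listed above can be read as predicates over the parents' states. The function below is a simplified plain-Python model of that reading, not Airflow's scheduler logic. The state strings follow the slide, and two choices are assumptions of this sketch: `upstream_failed` counts as a failure for `one_failed`, and "done" covers success, failed, upstream_failed and skipped.

```python
FAILED_STATES = {"failed", "upstream_failed"}
DONE_STATES = {"success", "skipped"} | FAILED_STATES


def should_trigger(rule, parent_states):
    """Decide whether a task may fire, given its trigger rule and parent states."""
    states = list(parent_states)
    if rule == "all_success":
        return all(s == "success" for s in states)
    if rule == "all_failed":
        return all(s in FAILED_STATES for s in states)
    if rule == "all_done":
        return all(s in DONE_STATES for s in states)
    if rule == "one_failed":
        # Does not wait for the remaining parents to finish.
        return any(s in FAILED_STATES for s in states)
    if rule == "one_success":
        # Does not wait for the remaining parents to finish.
        return any(s == "success" for s in states)
    raise ValueError("unknown trigger rule: {}".format(rule))


print(should_trigger("all_success", ["success", "running"]))
print(should_trigger("one_failed", ["running", "failed"]))
```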
  • 63. Concepts: Variables ● Generic way to store and retrieve arbitrary content ● Can be used for storing settings as a simple key-value store ● Variables can be created, updated, deleted and exported to a JSON file from the UI
  • 64. Concepts: Branching ● Branches the workflow based on a condition. ● The condition can be defined using BranchPythonOperator
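A branching condition is usually written as a Python callable that returns the `task_id` of the downstream task to follow, with the other branches skipped. The sketch below shows such a callable plus a toy resolver, in plain Python with no Airflow import. The task ids, the `threshold` parameter and `resolve_branches` are hypothetical names for this illustration.

```python
def choose_branch(row_count, threshold=1000):
    """Return the task_id of the branch to follow for this run."""
    return "process_in_bulk" if row_count > threshold else "process_inline"


def resolve_branches(chosen, candidates):
    """Toy model: the chosen branch runs and every other candidate is skipped."""
    return {
        task_id: ("run" if task_id == chosen else "skipped")
        for task_id in candidates
    }


candidates = ["process_in_bulk", "process_inline"]
outcome = resolve_branches(choose_branch(row_count=25), candidates)
print(outcome)
```

With Airflow, a function like `choose_branch` would typically be passed as the `python_callable` of a BranchPythonOperator.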
  • 65. Concepts: SubDAGs ● Perfect for repeating patterns
  • 67. GCP: Introduction ● A cloud computing service offered by Google ● Popular products that we are going to use today: ○ Google Cloud Storage: file and object storage ○ BigQuery: large-scale analytics data warehouse ○ And many more... ● Click here to access our GCP Project ○ Google Cloud Storage - link ○ BigQuery - link
  • 68. Tutorial 3: Create Connection ● Click on Admin ↠ Connections ● Or visit http://localhost:8080/admin/connection/
  • 69. Tutorial 3: Create Connection ● Click on Create and enter the following details: ○ Conn Id: airflow-service-account ○ Conn Type: Google Cloud Platform ○ Project ID: pydata2018-airflow ○ Keyfile Path: PATH_TO_YOUR_JSON_FILE ■ E.g. “/Users/kaxil/Desktop/Service_Account_Keys/sb01-service-account.json” ○ Keyfile JSON: ○ Scopes: https://www.googleapis.com/auth/cloud-platform ● Click on Save
  • 71. Tutorial 3: Import Variables ● Click on Admin ↠ Variables ● Or visit http://localhost:8080/admin/variable/
  • 72. Tutorial 3: Import Variables ● Click on Choose file and select variables.json (a file in the directory where you cloned the Git repo) ● Click on the Import Variables button. ● Edit the bq_destination_dataset_table variable to enter “pydata_airflow.kaxil_usa_names”, replacing kaxil with your firstname_lastname
  • 74. Tutorial 3 ● Objective: ○ Wait for a file to be uploaded to Google Cloud Storage ○ Once the files are uploaded, a BigQuery table is created and the data from GCS is imported into it ● Visit: ○ Folder where files would be uploaded: click here ○ Dataset where the table would be created: click here
  • 75. Tutorial 3 ● Trigger the DAG_3_GCS_To_BigQuery DAG and check the Graph View to see the currently running task.
  • 76. Summary ● Airflow = workflow as code ● Integrates seamlessly into the “pythonic” data science stack ● Easily extensible ● Clean management of workflow metadata ● Multiple alerting options (email, Slack) ● Huge community and under active development ● Proven in real-world projects