Airflow Clustering
and High Availability
By: Robert Sanders
2Page:
Agenda
• Airflow Daemons
• Single Node Deployment
• Cluster Deployment
• Scaling
• Worker Nodes
• Master Nodes
• Limitations
• Airflow Scheduler Failover Controller
• Failover Controller Procedure
3Page:
Airflow Daemons
• Web Server
• Daemon that runs the Airflow Webserver
• 1 to many gunicorn processes to accept and process requests in
parallel.
• Lets you track job progress, run jobs, and more
• Scheduler
• Periodically runs (every X seconds) to determine if a DAG or Task
needs to be run based on the DAG schedule
• Pushes messages to the Queuing Service to be executed
• Worker
• Daemon that runs only if you’re using the CeleryExecutor (as opposed to
the SequentialExecutor or LocalExecutor)
• 1 to many dedicated celeryd processes which execute functions
• Pulls messages from the Queuing Service to determine which
functions to execute
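The Scheduler-to-Worker hand-off described above can be sketched in a few lines. This is only an illustration of the push/pull pattern, assuming an in-process `queue.Queue` in place of the real Queuing Service and threads in place of celeryd processes; the names `scheduler`, `worker`, and `run_once` are hypothetical and are not Airflow APIs.

```python
import queue
import threading


def scheduler(task_queue, due_tasks):
    """Stand-in for the Scheduler: push every due task onto the queue."""
    for task_id in due_tasks:
        task_queue.put(task_id)


def worker(task_queue, results, lock):
    """Stand-in for one worker process: pull task ids until the queue is empty."""
    while True:
        try:
            task_id = task_queue.get_nowait()
        except queue.Empty:
            return
        with lock:
            # "Executing" a task here just means recording that it ran.
            results.append(task_id)
        task_queue.task_done()


def run_once(due_tasks, num_workers=3):
    """Queue all due tasks, drain them with num_workers workers, return what ran."""
    task_queue = queue.Queue()
    results = []
    lock = threading.Lock()
    scheduler(task_queue, due_tasks)
    threads = [
        threading.Thread(target=worker, args=(task_queue, results, lock))
        for _ in range(num_workers)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return sorted(results)
```

Each task id is pulled by exactly one worker, which is the property the queue provides between the Scheduler and the Workers.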
4Page:
Single Node Deployment
5Page:
Cluster Deployment
6Page:
Why set up a Cluster Deployment?
• Distributes heavy processes onto many machines for better
use of resources
• Makes the Airflow environment more highly available
• If you have many Workflows with many Tasks, the executors on a
single node may not be able to get to all the messages in the queue.
Adding more executors addresses this.
7Page:
Scaling Workers
• Horizontally
• Add more machines to the cluster
• No need to register the machines with the master. You
just need to start the Airflow Worker process on the new
machine.
• Vertically
• Increase the number of executors (celeryd processes)
per node and restart the workers
8Page:
Scaling Master
9Page:
Limitations
• There can only be one scheduler running at a time
• If you have multiple Scheduler processes running, there's
a possibility that multiple instances of a single task
will be scheduled to run.
• If the Scheduler Daemon or Machine with the process goes
down then no jobs will get scheduled
10Page:
Airflow Scheduler Failover Controller
• Dedicated Daemon that runs with Airflow on the Master
Nodes
• Ensures that there is always one and only one Scheduler
running on the Master nodes at a time
• Developed Internally and Open Sourced
• https://github.com/teamclairvoyant/airflow-scheduler-failover-controller
• High Level Steps
• Polls (every x seconds) to check if the scheduler is
running
• If scheduler isn’t running, restart the scheduler
• If it still doesn’t start up, then try starting it up on the
other master nodes
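The high-level steps above can be read as a single poll cycle. The sketch below is an assumption-laden illustration, not the controller's actual code: `ensure_scheduler_running`, `is_running`, `start_on`, and `FakeCluster` are hypothetical names, and the real daemon would repeat this cycle every x seconds and talk to the nodes over the network.

```python
def ensure_scheduler_running(nodes, current_node, is_running, start_on):
    """Run one poll cycle.

    Returns the node the Scheduler ends up running on, or None if every
    start attempt failed.
    """
    if is_running(current_node):
        return current_node
    # Try a restart on the current node first, then the other master nodes.
    candidates = [current_node] + [n for n in nodes if n != current_node]
    for node in candidates:
        start_on(node)
        if is_running(node):
            return node
    return None


class FakeCluster:
    """Test double: a start attempt only succeeds on a healthy node."""

    def __init__(self, healthy_nodes):
        self.healthy = set(healthy_nodes)
        self.running_on = None

    def is_running(self, node):
        return self.running_on == node

    def start_on(self, node):
        if node in self.healthy:
            self.running_on = node
```

With `FakeCluster({"master2"})` and `current_node="master1"`, the restart on `master1` fails and the Scheduler is started on `master2` instead.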
11Page:
Failover Controller Diagram
12Page:
Start Up Scenario
13Page:
Failover Controller Process (Start Up)
Master Node 1: Failover Controller (standby)
Master Node 2: Failover Controller (standby)
On startup, the processes start out in STANDBY
14Page:
Failover Controller Process (Start Up)
Master Node 1: Failover Controller (active)
Master Node 2: Failover Controller (standby)
The first one to enter data into the Metastore is elected as the active
controller.
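One way to picture "first one to enter data into the Metastore wins" is an insert guarded by a uniqueness constraint. The sketch below assumes an in-memory SQLite database and a hypothetical single-row table named `active_controller`; the real controller's schema and database are not shown in these slides.

```python
import sqlite3


def create_metastore():
    """Create a throwaway metastore with room for exactly one active controller."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE active_controller ("
        "slot INTEGER PRIMARY KEY CHECK (slot = 1), "
        "node TEXT NOT NULL)"
    )
    return conn


def try_become_active(conn, node):
    """Return True if this node claimed the active slot, False if it was taken."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO active_controller (slot, node) VALUES (1, ?)",
                (node,),
            )
        return True
    except sqlite3.IntegrityError:
        # Another controller inserted first, so this one stays in STANDBY.
        return False
```

The primary key makes the second insert fail, so only one controller can ever hold the active slot at a time.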
15Page:
Failover Controller Process (Start Up)
Scheduler
Master Node 1: Failover Controller (active)
Master Node 2: Failover Controller (standby)
The Failover controller checks to see if the Scheduler is running, but it
isn’t.
16Page:
Failover Controller Process (Start Up)
Scheduler
Master Node 1: Failover Controller (active)
Master Node 2: Failover Controller (standby)
Failover Controller starts up the Scheduler
17Page:
Scheduler Failure
Scenario
18Page:
Failover Controller Process (Process Failure)
Scheduler
Master Node 1: Failover Controller (active)
Master Node 2: Failover Controller (standby)
Scheduler process has died
19Page:
Failover Controller Process (Process Failure)
Scheduler
Master Node 1: Failover Controller (active)
Master Node 2: Failover Controller (standby)
Failover Controller restarts the Scheduler
20Page:
Scheduler Failure and
Failed Restart Scenario
21Page:
Failover Controller Process (Process Failure 2)
Scheduler
Master Node 1: Failover Controller (active)
Master Node 2: Failover Controller (standby)
Scheduler process has died
22Page:
Failover Controller Process (Process Failure 2)
Scheduler
Master Node 1: Failover Controller (active)
Master Node 2: Failover Controller (standby)
Failover Controller tries to restart the Scheduler, but it's still not running
23Page:
Failover Controller Process (Process Failure 2)
Scheduler
Master Node 1: Failover Controller (active)
Master Node 2: Failover Controller (standby)
Failover Controller tries to restart the Scheduler on a different node
24Page:
Failover Controller Process (Process Failure 2)
Scheduler
Master Node 1: Failover Controller (active)
Master Node 2: Failover Controller (standby)
Failover Controller succeeds in restarting the Scheduler and the cluster is
back to normal
25Page:
Node Failure Scenario
26Page:
Failover Controller Process (Node Failure)
Scheduler
Master Node 1: Failover Controller (active)
Master Node 2: Failover Controller (standby)
Everything is running as expected
27Page:
Failover Controller Process (Node Failure)
Scheduler
Master Node 1: Failover Controller (dead)
Master Node 2: Failover Controller (standby)
Master Node 1 dies and all the processes running on it are gone
28Page:
Failover Controller Process (Node Failure)
Scheduler
Master Node 1: Failover Controller (dead)
Master Node 2: Failover Controller (active)
Failover Controller on Master Node 2 becomes active because the one running
on Master Node 1 has stopped sending a heartbeat
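The takeover decision can be sketched as a staleness check on the active controller's last heartbeat. Everything here is a hypothetical illustration: `ControllerRecord`, `next_active`, and the timeout value are assumptions, since the slides do not say how the heartbeat is stored or how long the standby waits.

```python
from dataclasses import dataclass


@dataclass
class ControllerRecord:
    """What the standby is assumed to read back about the active controller."""

    node: str
    last_heartbeat: float  # seconds, on any monotonic or epoch clock


def next_active(record, standby_node, now, timeout_seconds):
    """Return the node that should be active after this check."""
    if now - record.last_heartbeat > timeout_seconds:
        # The active controller went silent for too long: the standby takes over.
        return standby_node
    return record.node
```

For example, with a 10 second timeout and a last heartbeat at t=100, a check at t=105 leaves `master1` active, while a check at t=120 promotes the standby.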
29Page:
Failover Controller Process (Node Failure)
Scheduler
Master Node 1: Failover Controller (dead)
Master Node 2: Failover Controller (active)
The newly active Failover Controller tries to check in with and restart the
Scheduler on the node the metadata says it's running on, and fails.
30Page:
Failover Controller Process (Node Failure)
Scheduler
Master Node 1: Failover Controller (dead)
Master Node 2: Failover Controller (active)
The Failover Controller then starts it on another node and it succeeds
Scheduler
31Page:
Failover Controller Process (Node Failure)
Master Node 1: Failover Controller (standby)
Master Node 2: Failover Controller (active)
When Master Node 1 is brought back, the old Failover Controller goes
into STANDBY state
Scheduler
32Page:
Q&A

Airflow Clustering and High Availability

  • 1. Airflow Clustering and High Availability By: Robert Sanders
  • 2. 2Page: Agenda • Airflow Daemons • Single Node Deployment • Cluster Deployment • Scaling • Worker Nodes • Master Nodes • Limitations • Airflow Scheduler Failover Controller • Failover Controller Procedure
  • 3. 3Page: Airflow Daemons • Web Server • Daemon that runs the Airflow Webserver • 1 to many gunicorn processes to accept and process requests in parallel. • Allows you to track jobs progress, run jobs and more • Scheduler • Periodically runs (every X seconds) to determine if a DAG or Task needs to be ran based off the DAG schedule • Pushes messages to the Queuing Service to be executed • Worker • Daemon runs if you’re using the CeleryExecutors (as opposed to SequentialExecutor and LocalExecutor) • 1 to many dedicated celeryd processes which execute functions • Pulls messages from a Queuing services to determine what functions to execute
  • 6. 6Page: Why setup a Cluster Deployment? • Distributes heavy processes onto many machines for better use of resources • More Highly Available Airflow environment • If you have many Workflows with many Tasks your executors would not be able to get to all the messages in the queue. Adding more executors would fix this issue.
  • 7. 7Page: Scaling Workers • Horizontally • Add more machines to the cluster • No need to register the machines with the master. You just need to start up the Airflow Worker task on the new Machine. • Vertically • Increase the number of executors (celeryd processes) per node and restart the workers
  • 9. 9Page: Limitations • There can only be one scheduler running at a time • If you have multiple Scheduler processes running, there's a possibility that multiple instances of a single task that will be scheduled to run. • If the Scheduler Daemon or Machine with the process goes down then no jobs will get scheduled
  • 10. Airflow Scheduler Failover Controller
      • Dedicated daemon that runs alongside Airflow on the Master Nodes
      • Ensures that one and only one Scheduler is running on the Master Nodes at any time
      • Developed internally and open sourced
          • https://github.com/teamclairvoyant/airflow-scheduler-failover-controller
      • High-Level Steps
          • Polls (every X seconds) to check whether the Scheduler is running
          • If the Scheduler isn't running, restarts it
          • If it still doesn't start, tries starting it on the other Master Nodes
  • 13. Failover Controller Process (Start Up)
      • Diagram labels: Master Node 1: Failover Controller (standby); Master Node 2: Failover Controller (standby)
      • On startup, the processes start out in STANDBY.
  • 14. Failover Controller Process (Start Up)
      • Diagram labels: Master Node 1: Failover Controller (active); Master Node 2: Failover Controller (standby)
      • The first one to enter data into the Metastore is elected as the active controller.
  • 15. Failover Controller Process (Start Up)
      • Diagram labels: Scheduler; Master Node 1: Failover Controller (active); Master Node 2: Failover Controller (standby)
      • The Failover Controller checks to see if the Scheduler is running, but it isn't.
  • 16. Failover Controller Process (Start Up)
      • Diagram labels: Scheduler; Master Node 1: Failover Controller (active); Master Node 2: Failover Controller (standby)
      • The Failover Controller starts up the Scheduler.
  • 18. Failover Controller Process (Process Failure)
      • Diagram labels: Scheduler; Master Node 1: Failover Controller (active); Master Node 2: Failover Controller (standby)
      • The Scheduler process has died.
  • 19. Failover Controller Process (Process Failure)
      • Diagram labels: Scheduler; Master Node 1: Failover Controller (active); Master Node 2: Failover Controller (standby)
      • The Failover Controller restarts the Scheduler.
  • 21. Failover Controller Process (Process Failure 2)
      • Diagram labels: Scheduler; Master Node 1: Failover Controller (active); Master Node 2: Failover Controller (standby)
      • The Scheduler process has died.
  • 22. Failover Controller Process (Process Failure 2)
      • Diagram labels: Scheduler; Master Node 1: Failover Controller (active); Master Node 2: Failover Controller (standby)
      • The Failover Controller tries to restart the Scheduler, but it's still not running.
  • 23. Failover Controller Process (Process Failure 2)
      • Diagram labels: Scheduler; Master Node 1: Failover Controller (active); Master Node 2: Failover Controller (standby)
      • The Failover Controller tries to restart the Scheduler on a different node.
  • 24. Failover Controller Process (Process Failure 2)
      • Diagram labels: Scheduler; Master Node 1: Failover Controller (active); Master Node 2: Failover Controller (standby)
      • The Failover Controller succeeds in restarting the Scheduler, and the cluster is back to normal.
  • 26. Failover Controller Process (Node Failure)
      • Diagram labels: Scheduler; Master Node 1: Failover Controller (active); Master Node 2: Failover Controller (standby)
      • Everything is running as expected.
  • 27. Failover Controller Process (Node Failure)
      • Diagram labels: Scheduler; Master Node 1: Failover Controller (dead); Master Node 2: Failover Controller (standby)
      • Master Node 1 dies and all the processes running on it are gone.
  • 28. Failover Controller Process (Node Failure)
      • Diagram labels: Scheduler; Master Node 1: Failover Controller (dead); Master Node 2: Failover Controller (active)
      • The Failover Controller on Master Node 2 becomes active because the one running on Master Node 1 has stopped sending a heartbeat.
  • 29. Failover Controller Process (Node Failure)
      • Diagram labels: Scheduler; Master Node 1: Failover Controller (dead); Master Node 2: Failover Controller (active)
      • The newly active Failover Controller tries to check in with and restart the Scheduler on the node the Metastore says it is running on, and fails.
  • 30. Failover Controller Process (Node Failure)
      • Diagram labels: Scheduler; Master Node 1: Failover Controller (dead); Master Node 2: Failover Controller (active); Scheduler
      • The Failover Controller then starts it on another node, and it succeeds.
  • 31. Failover Controller Process (Node Failure)
      • Diagram labels: Master Node 1: Failover Controller (standby); Master Node 2: Failover Controller (active); Scheduler
      • When Master Node 1 is brought back, the old Failover Controller goes into the STANDBY state.
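
The failover procedure walked through in slides 13 to 31 can be sketched as a small in-memory simulation. This is only an illustration of the described behaviour, not the code of the airflow-scheduler-failover-controller project: the class names (`Metastore`, `Cluster`, `FailoverController`), the metastore fields, and the 30-second heartbeat timeout are all assumptions made up for this example, and the simulated `Cluster` stands in for whatever mechanism a real controller would use to check and start the Scheduler on a node.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Metastore:
    """In-memory stand-in for the shared metastore (hypothetical fields)."""
    active_controller: Optional[str] = None  # node whose controller is ACTIVE
    last_heartbeat: float = 0.0              # time of the last ACTIVE heartbeat
    scheduler_node: Optional[str] = None     # node the scheduler was last started on


@dataclass
class Cluster:
    """Simulated master nodes (hypothetical; no real processes are touched)."""
    nodes_up: Dict[str, bool]
    can_start: Dict[str, bool]        # whether a start attempt succeeds on a node
    running_on: Optional[str] = None  # node currently hosting the scheduler

    def scheduler_running(self, node: str) -> bool:
        return self.nodes_up[node] and self.running_on == node

    def start_scheduler(self, node: str) -> bool:
        if self.nodes_up[node] and self.can_start[node]:
            self.running_on = node
            return True
        return False


class FailoverController:
    def __init__(self, node: str, store: Metastore, cluster: Cluster,
                 heartbeat_timeout: float = 30.0) -> None:
        self.node = node
        self.store = store
        self.cluster = cluster
        self.heartbeat_timeout = heartbeat_timeout  # assumed value
        self.state = "STANDBY"  # every controller starts in STANDBY

    def poll(self, now: float) -> None:
        """One polling cycle; `now` is passed in to keep the sketch deterministic."""
        if not self.cluster.nodes_up[self.node]:
            return  # a controller on a dead node does nothing
        store = self.store
        stale = now - store.last_heartbeat > self.heartbeat_timeout
        if store.active_controller is None or (
                store.active_controller != self.node and stale):
            store.active_controller = self.node  # first writer, or takeover
        if store.active_controller != self.node:
            self.state = "STANDBY"
            return
        self.state = "ACTIVE"
        store.last_heartbeat = now
        recorded = store.scheduler_node
        if recorded is not None and self.cluster.scheduler_running(recorded):
            return  # scheduler is healthy, nothing to do
        # Try the recorded node first (or this node on first start), then the rest.
        first = recorded if recorded is not None else self.node
        candidates: List[str] = [first] + [
            n for n in self.cluster.nodes_up if n != first]
        for candidate in candidates:
            if self.cluster.start_scheduler(candidate):
                store.scheduler_node = candidate
                return


store = Metastore()
cluster = Cluster(nodes_up={"master1": True, "master2": True},
                  can_start={"master1": True, "master2": True})
c1 = FailoverController("master1", store, cluster)
c2 = FailoverController("master2", store, cluster)

c1.poll(now=0.0)   # master1 writes first: becomes ACTIVE and starts the scheduler
c2.poll(now=0.0)   # master2 stays in STANDBY
cluster.nodes_up["master1"] = False  # master1 dies, taking the scheduler with it
c2.poll(now=60.0)  # heartbeat is stale: master2 takes over and restarts it
print(c2.state, store.scheduler_node)  # prints "ACTIVE master2"
```

The run at the bottom replays the Start Up and Node Failure scenarios. Setting `cluster.running_on = None` before a poll reproduces the Process Failure case, and additionally setting `cluster.can_start["master1"] = False` reproduces Process Failure 2, where the restart only succeeds on the other master node.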