Airflow Clustering and High Availability

This presentation covers how to set up an Airflow instance as a cluster that spans multiple machines, instead of the traditional single-machine deployment. It also covers an additional step you can take to ensure High Availability in that cluster.

Published in: Software

  1. Airflow Clustering and High Availability
     By: Robert Sanders
  2. Agenda
     • Airflow Daemons
     • Single Node Deployment
     • Cluster Deployment
     • Scaling
       • Worker Nodes
       • Master Nodes
     • Limitations
     • Airflow Scheduler Failover Controller
     • Failover Controller Procedure
  3. Airflow Daemons
     • Web Server
       • Daemon that runs the Airflow web server
       • One or more gunicorn processes accept and process requests in parallel
       • Lets you track job progress, run jobs, and more
     • Scheduler
       • Runs periodically (every X seconds) to determine whether a DAG or task needs to be run, based on the DAG schedule
       • Pushes messages to the Queuing Service to be executed
     • Worker
       • Daemon that runs only if you're using the CeleryExecutor (as opposed to the SequentialExecutor or LocalExecutor)
       • One or more dedicated celeryd processes that execute functions
       • Pulls messages from the Queuing Service to determine which functions to execute
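The division of labour between the Scheduler, the Queuing Service, and the Workers can be illustrated with a toy model. This is only a sketch using Python's standard library, not Airflow or Celery code: the function name `run_toy_cluster` is hypothetical, an in-process `queue.Queue` stands in for the Queuing Service, and threads stand in for celeryd processes.

```python
import queue
import threading


def run_toy_cluster(task_ids, num_workers=2):
    """Toy model: a 'scheduler' enqueues task ids, 'workers' pull and run them."""
    task_queue = queue.Queue()  # stands in for the Queuing Service
    results = []
    results_lock = threading.Lock()

    def worker():
        while True:
            task_id = task_queue.get()
            if task_id is None:  # sentinel: no more work for this worker
                return
            with results_lock:
                results.append(task_id)  # "executing" the task

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()

    # Scheduler role: push one message per task that is due.
    for task_id in task_ids:
        task_queue.put(task_id)
    # One sentinel per worker so that every worker exits.
    for _ in threads:
        task_queue.put(None)

    for t in threads:
        t.join()
    return sorted(results)
```

Calling `run_toy_cluster(["extract", "transform", "load"], num_workers=2)` returns all three task ids regardless of which worker picked up which message, which is the property the queue-based design relies on.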
  4. Single Node Deployment
  5. Cluster Deployment
  6. Why set up a Cluster Deployment?
     • Distributes heavy processes across many machines for better use of resources
     • Makes the Airflow environment more highly available
     • With many workflows and many tasks, the existing executors may not be able to get to all the messages in the queue; adding more executors fixes this
  7. Scaling Workers
     • Horizontally
       • Add more machines to the cluster
       • No need to register the machines with the master; just start the Airflow Worker daemon on the new machine
     • Vertically
       • Increase the number of executors (celeryd processes) per node and restart the workers
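The effect of the two scaling directions can be made concrete with some back-of-the-envelope arithmetic. The sketch below is an illustration only: the function names are hypothetical, and it assumes every task takes the same amount of time, which real workloads rarely do.

```python
import math


def total_slots(nodes, processes_per_node):
    """Number of tasks the cluster can execute at the same time."""
    return nodes * processes_per_node


def rounds_to_drain(queued_tasks, nodes, processes_per_node):
    """Rounds needed to work through the queue, assuming equal-length tasks."""
    slots = total_slots(nodes, processes_per_node)
    if slots <= 0:
        raise ValueError("need at least one worker process")
    return math.ceil(queued_tasks / slots)
```

Under these assumptions, 100 queued tasks on 2 nodes with 4 processes each take 13 rounds. Adding a third node (horizontal) or raising each node to 6 processes (vertical) both give 12 slots and bring that down to 9 rounds.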
  8. Scaling Master
  9. Limitations
     • Only one Scheduler can run at a time
     • If multiple Scheduler processes are running, multiple instances of a single task may be scheduled to run
     • If the Scheduler daemon, or the machine it runs on, goes down, no jobs will be scheduled
  10. Airflow Scheduler Failover Controller
     • Dedicated daemon that runs alongside Airflow on the master nodes
     • Ensures that exactly one Scheduler is running on the master nodes at any time
     • Developed internally and open sourced
       • https://github.com/teamclairvoyant/airflow-scheduler-failover-controller
     • High-level steps
       • Polls (every X seconds) to check whether the Scheduler is running
       • If the Scheduler isn't running, restarts it
       • If it still doesn't start, tries starting it on the other master nodes
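The high-level steps on this slide can be sketched as a single polling iteration. This is not the code of the airflow-scheduler-failover-controller project: `poll_once` and `FakeCluster` are hypothetical names, and the "is it running?" and "start it" operations are passed in as plain callables so the logic can be exercised without real machines. A real daemon would presumably wrap this in a loop with a sleep of X seconds between iterations.

```python
def poll_once(nodes, current_node, is_running, start_scheduler):
    """One polling iteration.

    Returns the node where the Scheduler is running afterwards,
    or None if it could not be started anywhere.
    """
    if is_running(current_node):
        return current_node
    # Try a restart on the same node first, then fall back to the other nodes.
    candidates = [current_node] + [n for n in nodes if n != current_node]
    for node in candidates:
        start_scheduler(node)
        if is_running(node):
            return node
    return None


class FakeCluster:
    """In-memory stand-in for real master nodes (illustration only)."""

    def __init__(self, broken=()):
        self.running = set()        # nodes with a live Scheduler
        self.broken = set(broken)   # nodes where starting always fails

    def is_running(self, node):
        return node in self.running

    def start(self, node):
        if node not in self.broken:
            self.running.add(node)
```

With `FakeCluster(broken={"master1"})`, calling `poll_once(["master1", "master2"], "master1", cluster.is_running, cluster.start)` fails to restart on `master1` and ends up starting the Scheduler on `master2`.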
  11. Failover Controller Diagram
  12. Start Up Scenario
  13. Failover Controller Process (Start Up)
     • Diagram: Master Node 1 with Failover Controller (standby); Master Node 2 with Failover Controller (standby)
     • On startup, the processes start out in STANDBY
  14. Failover Controller Process (Start Up)
     • Diagram: Master Node 1 with Failover Controller (active); Master Node 2 with Failover Controller (standby)
     • The first one to enter data into the Metastore is elected as the active controller
  15. Failover Controller Process (Start Up)
     • Diagram: Scheduler; Master Node 1 with Failover Controller (active); Master Node 2 with Failover Controller (standby)
     • The Failover Controller checks whether the Scheduler is running, but it isn't
  16. Failover Controller Process (Start Up)
     • Diagram: Scheduler; Master Node 1 with Failover Controller (active); Master Node 2 with Failover Controller (standby)
     • The Failover Controller starts up the Scheduler
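The election rule in the start-up scenario ("the first one to enter data into the Metastore is elected") can be sketched with a uniqueness constraint. The slides do not show the Metastore layout, so everything here is an assumption: the table `active_controller`, its columns, and the function names are hypothetical, and an in-memory SQLite database stands in for the real Metastore.

```python
import sqlite3


def make_metastore():
    """Create a throwaway in-memory database with a single-row election table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE active_controller ("
        " id INTEGER PRIMARY KEY CHECK (id = 1),"  # at most one row can exist
        " node TEXT NOT NULL)"
    )
    return conn


def try_become_active(conn, node_name):
    """Return True if node_name wins the election by inserting the row first."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO active_controller (id, node) VALUES (1, ?)",
                (node_name,),
            )
        return True
    except sqlite3.IntegrityError:
        # Another controller already inserted the row; stay in STANDBY.
        return False
```

The design point is that the database, not the controllers, arbitrates the race: whichever insert lands first succeeds, and every later attempt is rejected by the primary key.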
  17. Scheduler Failure Scenario
  18. Failover Controller Process (Process Failure)
     • Diagram: Scheduler; Master Node 1 with Failover Controller (active); Master Node 2 with Failover Controller (standby)
     • The Scheduler process has died
  19. Failover Controller Process (Process Failure)
     • Diagram: Scheduler; Master Node 1 with Failover Controller (active); Master Node 2 with Failover Controller (standby)
     • The Failover Controller restarts the Scheduler
  20. Scheduler Failure and Failed Restart Scenario
  21. Failover Controller Process (Process Failure 2)
     • Diagram: Scheduler; Master Node 1 with Failover Controller (active); Master Node 2 with Failover Controller (standby)
     • The Scheduler process has died
  22. Failover Controller Process (Process Failure 2)
     • Diagram: Scheduler; Master Node 1 with Failover Controller (active); Master Node 2 with Failover Controller (standby)
     • The Failover Controller tries to restart the Scheduler, but it's still not running
  23. Failover Controller Process (Process Failure 2)
     • Diagram: Scheduler; Master Node 1 with Failover Controller (active); Master Node 2 with Failover Controller (standby)
     • The Failover Controller tries to restart the Scheduler on a different node
  24. Failover Controller Process (Process Failure 2)
     • Diagram: Scheduler; Master Node 1 with Failover Controller (active); Master Node 2 with Failover Controller (standby)
     • The Failover Controller succeeds in restarting the Scheduler and the cluster is back to normal
  25. Node Failure Scenario
  26. Failover Controller Process (Node Failure)
     • Diagram: Scheduler; Master Node 1 with Failover Controller (active); Master Node 2 with Failover Controller (standby)
     • Everything is running as expected
  27. Failover Controller Process (Node Failure)
     • Diagram: Scheduler; Master Node 1 with Failover Controller (dead); Master Node 2 with Failover Controller (standby)
     • Master Node 1 dies and all the processes running on it are gone
  28. Failover Controller Process (Node Failure)
     • Diagram: Scheduler; Master Node 1 with Failover Controller (dead); Master Node 2 with Failover Controller (active)
     • The Failover Controller on Master Node 2 becomes active because the one on Master Node 1 has stopped sending a heartbeat
  29. Failover Controller Process (Node Failure)
     • Diagram: Scheduler; Master Node 1 with Failover Controller (dead); Master Node 2 with Failover Controller (active)
     • The newly active Failover Controller tries to check in with, and restart, the Scheduler on the node where the metadata says it is running, and fails
  30. Failover Controller Process (Node Failure)
     • Diagram: Master Node 1 with Failover Controller (dead); Master Node 2 with Failover Controller (active); Scheduler
     • The Failover Controller then starts the Scheduler on another node and succeeds
  31. Failover Controller Process (Node Failure)
     • Diagram: Master Node 1 with Failover Controller (standby); Master Node 2 with Failover Controller (active); Scheduler
     • When Master Node 1 is brought back, the old Failover Controller goes into the STANDBY state
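The heartbeat-based takeover in the node failure scenario can be sketched as a small state function. The slides do not say how heartbeats are stored or what timeout is used, so the names `next_state`, `STANDBY`, `ACTIVE`, and the `timeout_seconds` parameter are assumptions made for illustration.

```python
STANDBY = "STANDBY"
ACTIVE = "ACTIVE"


def next_state(current_state, active_last_heartbeat, now, timeout_seconds):
    """Decide a controller's next state from the active controller's heartbeat.

    active_last_heartbeat and now are timestamps in seconds;
    active_last_heartbeat is None when no active controller has checked in.
    """
    if current_state == ACTIVE:
        return ACTIVE
    # A standby controller takes over only when the heartbeat is missing or stale.
    if active_last_heartbeat is None:
        return ACTIVE
    if now - active_last_heartbeat > timeout_seconds:
        return ACTIVE
    return STANDBY
```

A controller that comes back after a node failure starts in `STANDBY`, and as long as the other controller keeps sending fresh heartbeats, `next_state` keeps it there, which matches the last step of the scenario above.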
  32. Q&A
