In Apache Cassandra Lunch #52: Airflow and Cassandra for Cluster Management, we discussed using Airflow to schedule tasks on a Cassandra cluster beyond what could be accomplished with the Cassandra provider package.
2. Airflow Overview
● A tool for scheduling and automating workflows and tasks
● Good for automating repeated processes
○ Common ETL tasks
○ Machine learning model training
● Write workflows in Python
○ Any tool that can be driven from Python can be automated as well
○ Define dependencies for different sections of workflows
○ Workflows are DAGs of tasks
● Schedule workflows or execute the processes by hand
○ Cron-like syntax or frequency tags
● Monitor tasks and collect/view logs
4. DAGs
● DAG - Directed Acyclic Graph
○ A DAG of tasks w/ dependencies as edges
○ Individual data engineering tasks combine together to form a DAG
■ Airflow allows the definition of relationships between tasks
■ Define dependencies and run order
○ DAGs are written in Python and saved as normal .py files
● DAGs run on a defined schedule
○ They can also be triggered manually
○ Schedules are defined using cron notation
■ Preset frequency tags (e.g. @daily) are also available
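The points above can be sketched as a minimal DAG file, assuming Airflow 2.x; the dag_id, schedule, and echo commands are illustrative placeholders, not from the slides:

```python
# minimal_dag.py - a sketch of a DAG file, assuming Airflow 2.x is installed.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="example_etl",
    start_date=datetime(2021, 1, 1),
    schedule_interval="0 2 * * *",  # cron notation; a tag like "@daily" also works
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extracting")
    transform = BashOperator(task_id="transform", bash_command="echo transforming")
    load = BashOperator(task_id="load", bash_command="echo loading")

    # Dependencies define the edges of the DAG: extract -> transform -> load
    extract >> transform >> load
```

Dropping this file into Airflow's dags/ folder is enough for the scheduler to pick it up; the same DAG can also be triggered manually from the web UI.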
5. Airflow Providers
● Airflow provider packages allow for integration with external systems
● They are mostly maintained by the Airflow community
● It is possible to create your own provider packages
6. Airflow Connections
● Airflow connections manage the network connections with external systems
● Different types of connections are used to connect with different external tools
● Connection types are added alongside their provider package, with fields customized to their application
● Connections are ultimately JSON strings which Airflow turns into Python dictionaries to pull data from
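As a sketch of that last point: Airflow (2.3+) accepts connections serialized as JSON, for example in an AIRFLOW_CONN_&lt;CONN_ID&gt; environment variable, and deserializes them to pull out fields. The host, port, and keyspace values below are made-up examples:

```python
import json

# A connection serialized as a JSON string, in the shape Airflow (2.3+)
# accepts in an AIRFLOW_CONN_<CONN_ID> environment variable.
# All field values here are illustrative assumptions.
conn_json = """
{
    "conn_type": "cassandra",
    "host": "cassandra-node-1",
    "port": 9042,
    "login": "airflow",
    "extra": {"keyspace": "demo"}
}
"""

# The JSON string becomes a Python dictionary that fields are pulled from
conn = json.loads(conn_json)
print(conn["conn_type"], conn["host"], conn["port"])
```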
7. Airflow Operators for Cassandra
● Previously covered the Cassandra Operators (table and record sensors), the Cassandra Hooks (which give access to all Python driver functionality), and the Cassandra Connection (which holds data for connecting to Cassandra nodes)
● More potential Airflow Operators that might be useful with Cassandra
○ The Docker Operator brings up a new Docker container on a given machine (specified via a Docker API URL) from a given image and can run defined commands in that container
○ The Bash Operator runs commands on the local machine; it can be used to interact with local Cassandra installs, or to use docker exec against Dockerized Cassandra installs
○ The SSH Operator connects with an SSHHook and SSH Connection to run bash commands on a machine with SSH access
8. Cluster Management Tasks
● Can therefore use Airflow to trigger any given nodetool command on a schedule
○ nodetool flush - flushes in-memory data (memtables) to disk in the form of SSTables
○ nodetool compact - performs compaction, merging duplicate rows, purging droppable tombstones, and consolidating data into fewer SSTable files
○ nodetool repair - repairs data mismatches between nodes
○ Change configurations using commands like nodetool disableautocompaction, etc.
○ Save status info to the Airflow logs using nodetool status, etc.
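Putting the pieces together, a maintenance DAG along these lines could trigger nodetool commands on a schedule via the Bash Operator and docker exec; the container name "cassandra", the weekly schedule, and the task ordering are illustrative assumptions:

```python
# cassandra_maintenance.py - sketch of a scheduled nodetool DAG, assuming
# Airflow 2.x and a Cassandra node running in a Docker container named
# "cassandra" on the same host as the Airflow worker.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="cassandra_maintenance",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@weekly",  # frequency tag instead of cron notation
    catchup=False,
) as dag:
    # Flush memtables to SSTables on disk
    flush = BashOperator(
        task_id="nodetool_flush",
        bash_command="docker exec cassandra nodetool flush",
    )
    # Consolidate SSTables and purge droppable tombstones
    compact = BashOperator(
        task_id="nodetool_compact",
        bash_command="docker exec cassandra nodetool compact",
    )
    # nodetool status output ends up in the Airflow task logs
    status = BashOperator(
        task_id="nodetool_status",
        bash_command="docker exec cassandra nodetool status",
    )

    flush >> compact >> status
```

For nodes without Docker, the same commands could run through the SSH Operator instead, with an SSH Connection pointing at the Cassandra host.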