4. Table of contents
01 Agenda
02 Setting up
03 Introduction to Cloud Composer
04 Disaster Recovery in Cloud Composer
05 Data Lineage in Cloud Composer
7. GCP Projects used during Workshop
● Main activities and exercises will be done in environments created in a dedicated project that we set
up for you. Composer environments were pre-created for you as well.
These projects and environments will be deleted after the workshop.
● The Composer project info you should use will be sent to you in a separate email.
● If the email address you used to register for the workshop is not associated with an active
Google Account, you will need to complete this registration as described here.
● You can also get this information during the workshop.
8. A voucher for Google Cloud Platform (GCP) Credits
● As part of this workshop you will receive a GCP credits voucher worth $500.
To be able to redeem the credits, in addition to an active Google Account you will need to set up your
own GCP project and associate it with an active billing account.
This project and the GCP credits will be owned by you.
You can activate the GCP coupon within 2 months after the workshop.
The workshop's GCP credits are valid for 1 year after activation.
10. Cloud Composer 2 architecture
Cloud Composer 2 interacts with the following services:
● Cloud SQL - runs the Airflow metadata storage
● Cloud Storage - user-uploaded content (DAGs, user data)
● Kubernetes - runs the scheduler(s), web server, Redis queue,
SQL proxy and Airflow workloads
● Cloud Logging - stores and indexes component logs
● Cloud Monitoring - searchable Cloud Composer metrics
… and many more that we manage for you.
11. Cloud Composer in a nutshell
Orchestrate work across Google Cloud (BigQuery, Data Fusion, Dataflow, Dataproc, Storage, 100+ APIs and more), external SaaS services and proprietary APIs.
14. What is a disaster?
A disaster is an event where Cloud Composer or other components essential for
your environment's operation are unavailable.
In Google Cloud, the impact of a disaster can be zonal, regional or global.
15. What about Composer with High Resilience?
HR - Highly resilient Cloud Composer environments are Cloud Composer 2
environments that use built-in redundancy and failover mechanisms that reduce
the environment's susceptibility to zonal failures and single-point-of-failure
outages.
DR - Disaster Recovery (DR), in the context of Cloud Composer, is the process
of restoring the environment's operation after a disaster. The process involves
recreating the environment, possibly in another region.
16. HR != DR
Composer HR makes sure your application is available right now. DR makes sure
you can get it back up later.
High resilience is critical for the availability of Cloud Composer, but it is
often useless for recoverability. For example, a critical historical
transactions table may be lost, but new transactions will still be processed.
17. Definition of availability
Availability = Uptime / Total Time
Availability should be calculated based on how long a service was unavailable
over a specified period of time. Planned downtime is still downtime.
18. Availability in distributed systems
Chaining (AND): Availability = Aa * Ab
E.g. Service A (SLO 99%) chained with Service B (SLO 99%): 0.99 * 0.99 = 98.01% ≈ 98%
Parallelization (OR): Availability = 1 - (1 - Aa) * (1 - Ab)
E.g. Service A (SLO 99%) in parallel with Service B (SLO 99%): 1 - 0.01 * 0.01 = 99.99%
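The same arithmetic can be checked in a few lines of Python; the 99% SLO values are taken from the slide above.

```python
# Worked example of the two availability formulas above, using 99% SLOs.
a_a = 0.99  # availability of Service A
a_b = 0.99  # availability of Service B

chained = a_a * a_b                    # services in series (AND)
parallel = 1 - (1 - a_a) * (1 - a_b)   # redundant services (OR)

print(f"Chaining (AND):       {chained:.4%}")   # 98.0100%
print(f"Parallelization (OR): {parallel:.4%}")  # 99.9900%
```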
19. DR Process: Failover
Step 1: Everything is fine
Diagram: the primary environment saves scheduled snapshots to the snapshots storage bucket.
Note: Use a multi-regional GCS bucket for snapshot storage.
20. DR Process: Failover
Step 2: Disaster!
Diagram: a disaster hits the primary environment; the snapshots storage bucket remains available.
That's why the snapshots storage should be multi-regional.
21. DR Process: Failover
Step 3: Create Cloud Composer in DR region
Diagram: a failover environment is created in the DR region, alongside the primary environment and the snapshots storage bucket.
22. RTO and RPO
Recovery Time Objective (RTO): the maximum acceptable length of time that your
application can be offline.
Recovery Point Objective (RPO): the maximum acceptable length of time during
which data might be lost from your application due to a major incident.
23. DR scenarios
Warm DR scenario is a variant of Disaster Recovery where you use a standby
failover environment, which you create before a disaster occurs (RTO ↘, Cost ↗).
Cold DR scenario is a variant of Disaster Recovery where you create a failover
environment after a disaster occurs (RTO ↗, Cost ↘).
25. DR Process: Failover
Step 4: Load snapshot and resume workflows
Diagram: the latest scheduled snapshot is loaded from the snapshots storage bucket into the failover environment.
26. DR Process: Failover
Step 5: Disaster mitigated
Diagram: the failover environment now runs the workloads from the loaded snapshot.
Note: Make sure to pause DAGs in the primary environment.
27. DR Process: Failover
Next steps
🥶 Option 1: Switch the Failover with the Primary environment and delete the Primary environment (Cold DR)
🌤 Option 1a: Switch the Failover with the Primary environment and keep the Primary environment (Warm DR)
🥶 Option 2: Fall back to the Primary environment and delete the Failover environment (Cold DR)
🌤 Option 2a: Fall back to the Primary environment and keep the Failover environment (Warm DR)
28. Creating a detailed DR plan
1. What is your RTO?
2. What is your RPO?
3. How do you want to verify your plan?
https://cloud.google.com/architecture/dr-scenarios-planning-guide
30. Step 1: Create Snapshots storage bucket
● Use a multi-regional bucket to ensure
resiliency to regional failures.
● Make sure the bucket is accessible to your environment's service account
○ Grant permissions to
lab-sa@airflow-summit-workshop-{project}.iam.gserviceaccount.com to access the created bucket
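A minimal sketch of this step using the google-cloud-storage client library; the bucket name and project ID are placeholders, the service account is the lab-sa account shown above, and gcloud or the Cloud Console work just as well.

```python
# Sketch: create a multi-regional snapshots bucket and grant the environment
# service account access. Bucket and project names are placeholders.
from google.cloud import storage

PROJECT_ID = "my-project"              # placeholder
BUCKET_NAME = "my-composer-snapshots"  # placeholder
SERVICE_ACCOUNT = (
    "lab-sa@airflow-summit-workshop-{project}.iam.gserviceaccount.com"
)

client = storage.Client(project=PROJECT_ID)
# "US" is a multi-region location, which keeps snapshots available during a
# regional outage.
bucket = client.create_bucket(BUCKET_NAME, location="US")

# Grant the environment service account object read/write on the bucket.
policy = bucket.get_iam_policy(requested_policy_version=3)
policy.bindings.append(
    {
        "role": "roles/storage.objectAdmin",
        "members": {f"serviceAccount:{SERVICE_ACCOUNT}"},
    }
)
bucket.set_iam_policy(policy)
```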
33. [Optional] Step 3a: Setup metadata db maintenance
The Airflow metadata database must contain less than 20 GB of data to support
snapshots.
1. Upload a maintenance DAG - http://bit.ly/3t1iiYJ (a sketch of such a DAG follows below)
a. The DAG is already in your environment.
2. Verify the database size metric.
3. [Optional] Set up an alert on the database size metric.
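A sketch of what such a maintenance DAG can look like, built on the `airflow db clean` command (Airflow 2.3+); the DAG id, schedule and 90-day retention are assumptions, and the actual DAG behind the link above may differ.

```python
# Hypothetical maintenance DAG: trims old metadata rows so the database stays
# under the 20 GB snapshot limit. Schedule and retention are assumptions.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="airflow_db_cleanup",    # assumed DAG id
    schedule_interval="@weekly",    # assumption: weekly cleanup
    start_date=datetime(2023, 1, 1),
    catchup=False,
) as dag:
    # `airflow db clean` deletes records older than the given timestamp from
    # tables such as task_instance, dag_run and log.
    BashOperator(
        task_id="db_clean",
        bash_command=(
            "airflow db clean "
            "--clean-before-timestamp {{ macros.ds_add(ds, -90) }} "  # keep ~90 days
            "--skip-archive --yes"
        ),
    )
```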
34. Step 4: Verify a snapshot has been created
1. Visit the storage bucket to observe the created snapshot objects (see the sketch below).
2. … or better, delegate this effort to Cloud Monitoring.
a. https://cloud.google.com/composer/docs/composer-2/disaster-recovery-with-snapshots
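A quick way to do the first check from a notebook or Cloud Shell, assuming the google-cloud-storage library and the placeholder bucket name used earlier.

```python
# Sketch: list the snapshot "folders" saved to the bucket so far.
from google.cloud import storage

client = storage.Client()
blobs = client.list_blobs("my-composer-snapshots", delimiter="/")  # placeholder bucket
list(blobs)  # consuming the iterator populates .prefixes with folder names
for prefix in sorted(blobs.prefixes):
    print(prefix)  # e.g. myproject_us-central1_test-env_2023-09-15T05-02-06/
```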
36. Step 6: Load snapshot in failover environment
Note: In the case of Warm DR you should skip some options to reduce the loading
time and therefore the RTO (see the sketch below).
Diagram: the snapshot is loaded into the secondary (failover) environment.
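A sketch of this step that shells out to gcloud from Python; the environment name, region and snapshot path are placeholders. For Warm DR, check `gcloud composer environments snapshots load --help` for flags that skip parts of the load to shorten it.

```python
# Sketch: load a snapshot into the failover environment via gcloud.
# Environment name, region and snapshot path are placeholders.
import subprocess

subprocess.run(
    [
        "gcloud", "composer", "environments", "snapshots", "load",
        "failover-env",            # placeholder failover environment name
        "--location", "us-east1",  # placeholder DR region
        "--snapshot-path",
        "gs://my-composer-snapshots/"
        "myproject_us-central1_test-env_2023-09-15T05-02-06",
    ],
    check=True,
)
```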
37. [Extra topic] Anatomy of Snapshot
● A snapshot is a folder in a GCS bucket
● Naming convention for the folder (the convention is not validated during load):
{project}_{region}_{environment}_{ISO_DATE_AND_TIME}
E.g.: myproject_us-central1_test-env_2023-09-15T05-02-06/
● Contents:
airflow-database.postgres.sql.gz
environment.json
fernet-key.txt
gcs/
dags/
metadata.json
preliminary_metadata.json
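An illustrative helper (not part of Composer) that builds a folder name following the convention above.

```python
# Illustrative only: build a snapshot folder name per the convention above.
from datetime import datetime, timezone

def snapshot_folder(project: str, region: str, environment: str) -> str:
    # ISO date and time with ":" replaced by "-", matching the example
    # myproject_us-central1_test-env_2023-09-15T05-02-06/
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H-%M-%S")
    return f"{project}_{region}_{environment}_{stamp}/"

print(snapshot_folder("myproject", "us-central1", "test-env"))
```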
38. Step 7: Verify Failover environment health
1. Visit the Monitoring Dashboard.
2. Check that your DAGs are running.
3. Verify the DAG history - it should have been loaded with the snapshot.
39. Limitations
1. The database size cannot exceed 20 GB - the metric is available in the Monitoring dashboard.
2. Snapshots can be saved at intervals of 2 hours or more.
40. Good practices
1. Prepare your DR plan.
2. Test your disaster recovery procedure on a regular basis.
3. Decide what to do with the primary environment afterwards.
4. Set up DB maintenance and monitor DB size.
5. Set up monitoring for scheduled snapshot operations.
43. Data lineage traces the
relationship between data
sources based on movement of
data, explaining how data was
sourced and transformed.
● Airflow gains rich lineage capabilities thanks to
the OpenLineage integration.
● Implemented by Dataplex in Google Cloud.
44. Data Lineage in Google Cloud: Dataplex
● Process
● Run (execution of Process)
● Event (emitted in a Run)
45. Data Lineage in Cloud Composer
- Currently based on the Airflow Lineage Backend feature.
- The backend exports lineage data to Dataplex.
- Working on migrating to OpenLineage.
47. Data Lineage in other Google Cloud services
- A growing number of Google Cloud services support Data Lineage (e.g.
BigQuery, Dataproc, Cloud Data Fusion).
- Goal: complete data lake lineage.
48. Exercise 1: Lineage from Composer
orchestrated BigQuery pipeline
This exercise covers data lineage with Cloud Composer in a BigQuery context,
using a minimum viable data engineering pipeline. We will demonstrate lineage
capture from an Airflow DAG composed of BigQuery actions.
We will first use an Apache Airflow DAG on Cloud Composer to orchestrate the
BigQuery jobs and observe the lineage captured by Dataplex. Note that the
lineage shows up minutes after a process is run or an entity is created. A
sketch of this kind of DAG follows below.
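A minimal sketch of such a DAG, assuming the google provider's BigQueryInsertJobOperator; the project, dataset and table names are placeholders and the actual workshop DAG may differ.

```python
# Illustrative sketch (not the exact workshop DAG): two chained BigQuery jobs.
# Dataplex captures lineage automatically from the BigQuery jobs they run.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.bigquery import (
    BigQueryInsertJobOperator,
)

PROJECT = "my-project"  # placeholder

with DAG(
    dag_id="bq_lineage_demo",   # assumed DAG id
    schedule_interval=None,
    start_date=datetime(2023, 1, 1),
    catchup=False,
) as dag:
    curate = BigQueryInsertJobOperator(
        task_id="curate",
        configuration={
            "query": {
                "query": (
                    f"CREATE OR REPLACE TABLE `{PROJECT}.curated.crimes_curated` AS "
                    f"SELECT * FROM `{PROJECT}.raw.crimes_raw`"
                ),
                "useLegacySql": False,
            }
        },
    )
    report = BigQueryInsertJobOperator(
        task_id="report",
        configuration={
            "query": {
                "query": (
                    f"CREATE OR REPLACE TABLE `{PROJECT}.product.crimes_report` AS "
                    f"SELECT COUNT(*) AS n FROM `{PROJECT}.curated.crimes_curated`"
                ),
                "useLegacySql": False,
            }
        },
    )
    curate >> report
```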
67. 7. Navigate back to the Composer Airflow DAG
using a link in the Data Lineage UI.
TODO screenshot
68. Exercise 2: Lineage from Composer
orchestrated Dataproc Spark job
In this exercise, we will repeat what we did with the lineage of the BigQuery
based Airflow DAG, except we will use Apache Spark on Dataproc Serverless
instead. Note that Dataproc Serverless is not a natively supported service for
Dataplex automated lineage capture. We will use the custom lineage feature in
Cloud Composer.
69. 1. Review the DAG
Navigate to the Cloud Composer UI and launch the Airflow UI
Click on the Spark DAG
72. 2. Verify inlets & outlets definition
Scroll to look at the "inlets" and "outlets", where we specify lineage for the
BigQuery external tables (a sketch follows below).
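A hedged sketch of how such inlets and outlets can be declared, assuming Composer's BigQueryTable lineage entity (verify the import path against your Composer version's docs) and the google provider's DataprocCreateBatchOperator; all names, regions and URIs are placeholders.

```python
# Sketch: declare custom lineage on a Dataproc Serverless task via inlets/outlets.
# The BigQueryTable entity and all names/paths below are assumptions.
from datetime import datetime

from airflow import DAG
from airflow.composer.data_lineage.entities import BigQueryTable  # Composer-specific
from airflow.providers.google.cloud.operators.dataproc import (
    DataprocCreateBatchOperator,
)

with DAG(
    dag_id="spark_lineage_demo",  # assumed DAG id
    schedule_interval=None,
    start_date=datetime(2023, 1, 1),
    catchup=False,
) as dag:
    DataprocCreateBatchOperator(
        task_id="curate_crimes_spark",
        project_id="my-project",   # placeholder
        region="us-central1",      # placeholder
        batch={
            "pyspark_batch": {
                "main_python_file_uri": "gs://my-bucket/curate_crimes.py"  # placeholder
            }
        },
        # Custom lineage: source and target BigQuery external tables.
        inlets=[
            BigQueryTable(
                project_id="my-project",
                dataset_id="oda_raw_zone",   # assumed source dataset
                table_id="crimes_raw",       # assumed source table
            )
        ],
        outlets=[
            BigQueryTable(
                project_id="my-project",
                dataset_id="oda_curated_zone",
                table_id="crimes_curated_spark",  # table named later in the exercise
            )
        ],
    )
```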
74. 4. Verify the Dataproc Serverless batch jobs in
Dataproc Batches UI in Google Cloud Console.
75. 5. Analyze the lineage captured from
the Composer environment's Airflow.
The lineage captured is custom and centered on BigQuery external tables, and therefore it is not visible in the Dataplex UI. The latency of
lineage availability depends on the discovery settings for the asset.
Navigate to the BigQuery UI and click on the external table oda_curated_zone.crimes_curated_spark.