Cloud Composer workshop at Airflow Summit 2023

Leah Cole, Developer Programs Engineer at Google
Cloud Composer
Airflow Summit 2023
Workshop
September 21, 2023
Hi! It's nice to meet you!
Bartosz Jankiewicz - Engineering Manager
Filip Knapik - Group Product Manager
Michał Modras - Engineering Manager
Leah Cole - Developer Relations Engineer
Rafal Biegacz - Engineering Manager
Arun Vattoly - Technical Solutions Engineer
Victor - Cloud Support Manager
Mateusz Henc - Software Engineer
Agenda
of the workshop
01
Table of contents
01 Agenda
02 Setting up
03 Introduction to Cloud Composer
04 Disaster Recovery in Cloud Composer
05 Data Lineage in Cloud Composer
Agenda
Introduction (15m)
● Composer Architecture
● Composer Features
Disaster Recovery (1h)
● Snapshots
☕ Break
Data Lineage (1h)
● Composer + BigQuery
● Composer + Dataproc
Setting up
Workshop Projects
02
GCP Projects used during Workshop
● Main activities/exercises will be done in environments created in a dedicated project that we set up for you. The Composer environments were also pre-created for you.
These projects and environments will be deleted after the workshop.
● The Composer project info you should use will be sent to you in a separate email.
● If the email address you used to register for the workshop is not associated with an active Google Account, you will need to complete the registration as described here.
● You can also get this information during the workshop.
A voucher for Google Cloud Platform (GCP) Credits
● As part of this workshop you will receive a GCP credits voucher worth $500.
To redeem the credits, in addition to an active Google Account you will need to set up your own GCP project and associate it with an active billing account.
This project and the GCP credits will be owned by you.
You can activate the GCP coupon within 2 months after the workshop.
The workshop's GCP credits are valid for 1 year from activation.
Introduction to
Cloud Composer
03
Cloud Composer 2 architecture
Cloud Composer 2 interacts with the following services:
● Cloud SQL - hosts the Airflow metadata database
● Cloud Storage - user-uploaded content (DAGs, user data)
● Kubernetes - runs the Scheduler(s), Web Server, Redis queue, SQL proxy and Airflow workloads
● Cloud Logging - stores and indexes component logs
● Cloud Monitoring - searchable Cloud Composer metrics
… and many more that we manage for you.
Cloud Composer in a nutshell
Orchestrate work across Google Cloud (BigQuery, Data Fusion, Dataflow, Dataproc, Storage, 100+ APIs, …), external SaaS services and proprietary APIs.
Cloud Composer benefits
● Simple deployment
● Robust Monitoring & Logging
● Enterprise Security Features
● DAG code portability
● Technical Support
● Managed infrastructure
Disaster Recovery
in Cloud Composer
04
What is a disaster?
A disaster is an event where Cloud Composer or other components essential for your environment's operation are unavailable.
In Google Cloud, the impact of a disaster can be zonal, regional or global.
HR
Highly resilient Cloud Composer
environments are Cloud Composer 2
environments that use built-in
redundancy and failover
mechanisms that reduce the
environment's susceptibility to zonal
failures and single point of failure
outages.
DR
Disaster Recovery (DR), in the context
of Cloud Composer, is a process of
restoring the environment's operation
after a disaster. The process involves
recreating the environment, possibly in
another region.
What about Composer with High Resilience?
Composer HR makes sure your application is available right now. DR makes sure you can get it back up later.
HR != DR
High resilience is critical for the availability of Cloud Composer, but it is often of no help for recoverability.
For example, a critical historical transactions table may be lost while new transactions are still being processed.
Definition of availability
Availability should be calculated based on how long a service was unavailable over a specified period of time. Planned downtime is still downtime.
Availability = Uptime / Total Time
Availability in distributed systems
Chaining (AND) - Service A (SLO 99%) and Service B (SLO 99%) in series:
Availability = Aa * Ab = 0.99 * 0.99 ≈ 98%
Parallelization (OR) - Service A (SLO 99%) and Service B (SLO 99%) in parallel:
Availability = 1 - (1 - Aa) * (1 - Ab) = 1 - 0.01 * 0.01 = 99.99%
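For reference, a minimal Python sketch (illustration only, not part of the workshop materials) that reproduces the numbers above for two services with 99% SLOs:

```python
# Illustration: composite availability for chained vs. parallel services.

def chained(*slos):
    """Chaining (AND): every service must be up, so availabilities multiply."""
    result = 1.0
    for a in slos:
        result *= a
    return result

def parallel(*slos):
    """Parallelization (OR): the system is down only if every replica is down."""
    downtime = 1.0
    for a in slos:
        downtime *= (1.0 - a)
    return 1.0 - downtime

a_a, a_b = 0.99, 0.99
print(f"Chaining (AND):       {chained(a_a, a_b):.4%}")   # ~98.01%
print(f"Parallelization (OR): {parallel(a_a, a_b):.4%}")  # 99.99%
```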
DR Process: Failover
Step 1: Everything is fine
Diagram: the Primary environment saves Scheduled Snapshots to the Snapshots storage bucket.
Note: Use a multi-regional GCS bucket for snapshot storage.
DR Process: Failover
Step 2: Disaster!
Diagram: the Primary environment becomes unavailable; the Snapshots storage bucket remains. That's why the snapshot storage should be multi-regional.
DR Process: Failover
Step 3: Create a Cloud Composer environment in the DR region
Diagram: a Failover environment is created alongside the (unavailable) Primary environment and the Snapshots storage bucket.
RTO and RPO
Recovery Point Objective (RPO): the maximum acceptable length of time during which data might be lost from your application due to a major incident.
Recovery Time Objective (RTO): the maximum acceptable length of time that your application can be offline.
DR scenarios
Warm DR is a variant of Disaster Recovery where you use a standby failover environment, created before a disaster occurs (RTO ↘, cost ↗).
Cold DR is a variant of Disaster Recovery where you create the failover environment after a disaster occurs (RTO ↗, cost ↘).
DR Process: Failover
Step 4: Load snapshot
Diagram: the Failover environment loads a snapshot from the Snapshots storage bucket; the Primary environment is still unavailable.
DR Process: Failover
Step 4 (continued): Snapshot loaded, workflows resumed
Diagram: the Failover environment resumes the workflows; Scheduled Snapshots continue to the Snapshots storage bucket.
DR Process: Failover
Step 5: Disaster mitigated
Diagram: the Failover environment runs the workflows and saves Scheduled Snapshots; the Primary environment is available again.
Note: Make sure to pause DAGs in the Primary environment. ⏸
DR Process: Failover
Next steps
🥶 Option 1: Switch the Failover environment with the Primary environment and delete the Primary environment (Cold DR)
🌤 Option 1a: Switch the Failover environment with the Primary environment and keep the Primary environment as a standby (Warm DR)
🥶 Option 2: Fall back to the Primary environment and delete the Failover environment (Cold DR)
🌤 Option 2a: Fall back to the Primary environment and keep the Failover environment as a standby (Warm DR)
Creating a detailed DR plan
1. What is your RTO?
2. What is your RPO?
3. How do you want to verify your plan?
https://cloud.google.com/architecture/dr-scenarios-planning-guide
Practice
Step 1: Create a Snapshots storage bucket
● Use a multi-regional bucket to ensure resiliency to regional failures.
● Make sure the bucket is accessible to your environment's service account.
○ Grant permissions to lab-sa@airflow-summit-workshop-{project}.iam.gserviceaccount.com to access the created bucket (see the sketch below).
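If you prefer to script this step instead of using the console, here is a minimal sketch using the google-cloud-storage client library. The project ID and bucket name are placeholders, not the workshop values:

```python
# Sketch only: create a multi-region snapshot bucket and grant the lab
# service account access. Project and bucket names are placeholders.
from google.cloud import storage

PROJECT_ID = "your-project-id"                      # placeholder
BUCKET_NAME = f"{PROJECT_ID}-composer-snapshots"    # placeholder
SERVICE_ACCOUNT = f"lab-sa@airflow-summit-workshop-{PROJECT_ID}.iam.gserviceaccount.com"

client = storage.Client(project=PROJECT_ID)

# A multi-region location (e.g. "US") gives resiliency to regional failures.
bucket = client.create_bucket(BUCKET_NAME, location="US")

# Grant the environment service account object access on the bucket.
policy = bucket.get_iam_policy(requested_policy_version=3)
policy.bindings.append(
    {
        "role": "roles/storage.objectAdmin",
        "members": {f"serviceAccount:{SERVICE_ACCOUNT}"},
    }
)
bucket.set_iam_policy(policy)
print(f"Created gs://{BUCKET_NAME} and granted access to {SERVICE_ACCOUNT}")
```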
Step 2: Configure scheduled snapshots
Primary environment
Step 3: Create manual snapshot
[Optional] Step 3a: Set up metadata DB maintenance
The Airflow metadata database must hold less than 20 GB of data to support snapshots.
1. Upload a maintenance DAG - http://bit.ly/3t1iiYJ
a. The DAG is already in your environment (a minimal illustration follows below).
2. Verify the database size metric.
3. [Optional] Set up an alert on the database size metric.
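Use the maintenance DAG linked above during the workshop. Purely as an illustration of the idea, a stripped-down cleanup DAG could look like the sketch below. It assumes Airflow 2.3+ (which ships the `airflow db clean` CLI command) and a hypothetical 30-day retention:

```python
# Illustrative sketch only - the workshop uses the maintenance DAG linked above.
# Assumes Airflow 2.3+ (provides `airflow db clean`) and a 30-day retention.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="airflow_db_cleanup_sketch",
    start_date=datetime(2023, 9, 1),
    schedule_interval="@weekly",   # trim the metadata DB once a week
    catchup=False,
) as dag:
    # Delete metadata rows older than 30 days so the DB stays well below 20 GB.
    BashOperator(
        task_id="db_clean",
        bash_command=(
            "airflow db clean "
            "--clean-before-timestamp '{{ macros.ds_add(ds, -30) }}' --yes"
        ),
    )
```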
Step 4: Verify a snapshot has been created
1. Visit the storage bucket to observe the created snapshot objects (or list them programmatically, as in the sketch below).
2. …or, better, delegate this effort to Cloud Monitoring.
a. https://cloud.google.com/composer/docs/composer-2/disaster-recovery-with-snapshots
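As a quick alternative to clicking through the console, a small sketch that lists the snapshot objects (the bucket name is a placeholder):

```python
# Sketch only: list objects in the snapshot bucket to confirm snapshots exist.
from google.cloud import storage

BUCKET_NAME = "your-project-id-composer-snapshots"  # placeholder

client = storage.Client()
# Snapshot folders follow {project}_{region}_{environment}_{timestamp}/
# (see "Anatomy of a Snapshot" below).
for blob in client.list_blobs(BUCKET_NAME, max_results=20):
    print(blob.name)
```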
Step 5: Disaster!
Step 6: Load snapshot in failover environment
Note: in the case of Warm DR, you should skip some load options to reduce loading time and therefore RTO (see the sketch below).
Secondary environment
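For scripting the failover, a sketch of triggering the snapshot load via the Composer API client is shown below. The method and field names mirror the Composer v1 API's environments.loadSnapshot call and are assumed to be exposed unchanged by the google-cloud-orchestration-airflow client; verify them against the current client documentation. All environment and path names are placeholders:

```python
# Sketch only: load a snapshot into the failover environment.
# Method/field names are assumptions based on the Composer v1 API
# (environments.loadSnapshot); names below are placeholders.
from google.cloud.orchestration.airflow import service_v1

ENVIRONMENT = "projects/your-project-id/locations/us-east1/environments/failover-env"
SNAPSHOT = "gs://your-project-id-composer-snapshots/your-project-id_us-central1_primary-env_2023-09-15T05-02-06"

client = service_v1.EnvironmentsClient()
operation = client.load_snapshot(
    request=service_v1.LoadSnapshotRequest(
        environment=ENVIRONMENT,
        snapshot_path=SNAPSHOT,
        # For Warm DR, skipping these steps shortens the load and reduces RTO,
        # assuming the standby already matches the primary's configuration.
        skip_pypi_packages_installation=True,
        skip_environment_variables_setting=True,
        skip_airflow_overrides_setting=True,
        skip_gcs_data_copying=False,
    )
)
operation.result()  # wait for the long-running operation to finish
print("Snapshot loaded into the failover environment")
```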
[Extra topic] Anatomy of a Snapshot
● A snapshot is a folder in a GCS bucket
● Folder naming convention (not validated during load):
{project}_{region}_{environment}_{ISO_DATE_AND_TIME}
E.g.: myproject_us-central1_test-env_2023-09-15T05-02-06/
● Contents:
airflow-database.postgres.sql.gz
environment.json
fernet-key.txt
gcs/
dags/
metadata.json
preliminary_metadata.json
Step 7: Verify Failover environment health
1. Visit the Monitoring dashboard.
2. Check that your DAGs are running.
3. Verify the DAG history - it should have been loaded from the snapshot.
Limitations
1. The database size cannot exceed 20 GB - the metric is available in the Monitoring dashboard.
2. Snapshots can be saved at intervals of 2 hours or more.
Good practices
1. Prepare your DR plan.
2. Test your disaster recovery procedure on a regular basis.
3. Decide what to do with the primary environment afterwards.
4. Set up DB maintenance and monitor DB size.
5. Set up monitoring for scheduled snapshot operations.
Let's take a break ☕
Data lineage
in Cloud Composer
05
Data lineage traces the
relationship between data
sources based on movement of
data, explaining how data was
sourced and transformed.
● Airflow gains rich lineage capabilities thanks to
the OpenLineage integration.
● Implemented by Dataplex in Google Cloud.
Data Lineage in Google Cloud: Dataplex
● Process
● Run (execution of Process)
● Event (emitted in a Run)
- Currently based on the Airflow
Lineage Backend feature.
- Backend exports lineage data to
Dataplex.
- Working on migrating to
OpenLineage.
Data Lineage in Cloud Composer
Supported operators:
- BigQueryExecuteQueryOperator
- BigQueryInsertJobOperator
- BigQueryToBigQueryOperator
- BigQueryToCloudStorageOperator
- BigQueryToGCSOperator
- GCSToBigQueryOperator
- GoogleCloudStorageToBigQueryOperator
- DataprocSubmitJobOperator
Data Lineage in Cloud Composer
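As a minimal illustration of how lineage is captured for the operators listed above, the sketch below shows a DAG using BigQueryInsertJobOperator. No lineage-specific code is needed: with data lineage enabled on the environment, Composer's lineage backend reports the query's source and destination tables. This is a generic sketch with placeholder project/dataset names, not the workshop DAG:

```python
# Generic sketch (not the workshop DAG): lineage for BigQueryInsertJobOperator
# is captured automatically by Composer's lineage backend when the feature is
# enabled. Project/dataset/table names are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

PROJECT_ID = "your-project-id"  # placeholder

with DAG(
    dag_id="bq_lineage_sketch",
    start_date=datetime(2023, 9, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    # A simple query job: the source and destination tables become the
    # input and output entities of the lineage graph in Dataplex.
    curate = BigQueryInsertJobOperator(
        task_id="curate_orders",
        configuration={
            "query": {
                "query": (
                    f"SELECT * FROM `{PROJECT_ID}.raw_zone.orders` "
                    "WHERE status = 'COMPLETE'"
                ),
                "useLegacySql": False,
                "destinationTable": {
                    "projectId": PROJECT_ID,
                    "datasetId": "curated_zone",
                    "tableId": "orders_curated",
                },
                "writeDisposition": "WRITE_TRUNCATE",
            }
        },
    )
```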
- A growing number of Google Cloud
services support Data Lineage (e.g.
BigQuery, Dataproc, Cloud Data
Fusion).
- Goal: Complete data lake lineage.
Data Lineage in other Google Cloud services
Exercise 1: Lineage from Composer
orchestrated BigQuery pipeline
This exercise covers data lineage with Cloud Composer in a BigQuery context, using a minimum viable data engineering pipeline.
We will demonstrate lineage capture from an Airflow DAG composed of BigQuery actions.
We will first use an Apache Airflow DAG on Cloud Composer to orchestrate the BigQuery jobs and observe the lineage captured by Dataplex. Note that the lineage shows up a few minutes after a process is run or an entity is created.
1. Review the existing Composer environment
2. Review the existing lineage graph
3. Review the Airflow DAG code
Go to the DAGs section of the Composer environment.
4. Run the Airflow DAG.
5. Validate the creation of the tables in BigQuery.
6. Review the lineage captured in the Dataplex UI.
7. Navigate back to the Composer Airflow DAG using the link in the Data Lineage UI.
Exercise 2: Lineage from Composer
orchestrated Dataproc Spark job
In this exercise, we will repeat what we did for the BigQuery-based Airflow DAG, except that we will use Apache Spark on Dataproc Serverless instead. Note that Dataproc Serverless is not natively supported by Dataplex automated lineage capture, so we will use the custom lineage feature in Cloud Composer.
1. Review the DAG
Navigate to the Cloud Composer UI and launch the Airflow UI.
Click on the Spark DAG.
2. Verify the inlets & outlets definition
Scroll to the "inlets" and "outlets" definitions, where we specify lineage for the BigQuery external tables (a sketch follows below).
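The sketch below illustrates the idea: custom lineage is attached to the Dataproc Serverless task through the operator's inlets and outlets. This is not the workshop DAG; the BigQueryTable entity import path follows Composer's custom-lineage documentation as best recalled and should be verified in your environment, and the project, raw-zone table and file URIs are placeholders (only oda_curated_zone.crimes_curated_spark comes from the exercise):

```python
# Sketch only (not the workshop DAG): custom lineage via inlets/outlets on a
# Dataproc Serverless batch task. The BigQueryTable import path is an
# assumption to verify; project/table/URI names are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.composer.data_lineage.entities import BigQueryTable  # Composer-specific
from airflow.providers.google.cloud.operators.dataproc import DataprocCreateBatchOperator

PROJECT_ID = "your-project-id"  # placeholder

with DAG(
    dag_id="spark_custom_lineage_sketch",
    start_date=datetime(2023, 9, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    curate_with_spark = DataprocCreateBatchOperator(
        task_id="curate_with_spark",
        project_id=PROJECT_ID,
        region="us-central1",
        batch={
            "pyspark_batch": {
                "main_python_file_uri": f"gs://{PROJECT_ID}-code/curate_crimes.py",
            }
        },
        # Dataproc Serverless lineage is not auto-captured, so declare the
        # BigQuery external tables this Spark job reads and writes.
        inlets=[BigQueryTable(project_id=PROJECT_ID,
                              dataset_id="oda_raw_zone",
                              table_id="crimes_raw")],
        outlets=[BigQueryTable(project_id=PROJECT_ID,
                               dataset_id="oda_curated_zone",
                               table_id="crimes_curated_spark")],
    )
```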
3. Run the DAG
4. Verify the Dataproc Serverless batch jobs in the Dataproc Batches UI in the Google Cloud Console.
5. Analyze the lineage captured from the Composer environment's Airflow.
The captured lineage is custom and centered on BigQuery external tables, and is therefore not visible in the Dataplex UI. The latency of lineage availability depends on the discovery settings for the asset.
Navigate to the BigQuery UI and click on the external table oda_curated_zone.crimes_curated_spark.
Thank you.