Build Data Lakes with Apache Airflow


Build a simple Data Lake on AWS using a combination of services, including Amazon Managed Workflows for Apache Airflow (Amazon MWAA), AWS Glue, AWS Glue Studio, Amazon Athena, and Amazon S3. Blog post and link to the video: https://garystafford.medium.com/building-a-data-lake-with-apache-airflow-b48bd953c2b

Data & Analytics

Data Lake Demonstration
Building Data Lakes with Apache Airflow
Gary A. Stafford

Twitter/LinkedIn: GaryStafford
Blog: garystafford.medium.com

Agenda
What is a Data Lake?
Dataset
Architecture
Source Code
Demonstration

What is a Data Lake?
“A data lake is a central location that holds a large amount of data in its native, raw
format. Compared to a hierarchical data warehouse, which stores data in files or
folders, a data lake uses a flat architecture and object storage to store the data.” - Databricks
“A centralized repository that allows you to store all your structured and
unstructured data at any scale. You can store your data as-is, without having to
first structure the data, and run different types of analytics—from dashboards and
visualizations to big data processing, real-time analytics, and machine learning to
guide better decisions.” - AWS
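
To make the "flat architecture" in the Databricks quote more concrete, here is a minimal sketch that models an object store as one flat mapping from key to bytes. The `ObjectStore` class, the keys, and the sample payloads are hypothetical and exist only for illustration; this is not the API of Amazon S3 or any other real store, although real object stores offer a comparable list-by-prefix idea.

```python
from typing import Dict, List


class ObjectStore:
    """Toy in-memory object store: one flat namespace of key -> bytes."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def put(self, key: str, body: bytes) -> None:
        # No directories are created; the slashes are just characters in the key.
        self._objects[key] = body

    def list_keys(self, prefix: str = "") -> List[str]:
        # "Folders" are emulated by filtering keys on a shared prefix.
        return sorted(k for k in self._objects if k.startswith(prefix))


# Hypothetical keys and payloads; files of different formats sit side by side.
store = ObjectStore()
store.put("raw/sales/part-0000.csv", b"salesid,qtysold\n1,4\n")
store.put("raw/sales/part-0001.json", b'{"salesid": 2, "qtysold": 2}')
store.put("raw/venue/venues.csv", b"venueid,venuename\n1,Example Arena\n")

sales_keys = store.list_keys("raw/sales/")
print(sales_keys)
```

The data is stored as-is in whatever format it arrived in, and structure is applied later by whichever tool reads the objects under a given prefix.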

Dataset
TICKIT database
E-commerce platform
Brings together buyers and sellers of tickets to entertainment events
Designed to demonstrate Amazon Redshift Cloud Data Warehouse
Small database consisting of seven tables: two fact and five dimension tables
Tables: Categories, Events, Venues, Users, Listings, Sales, Dates
docs.aws.amazon.com/redshift/latest/dg/c_sampledb.html

Dataset
Table | Simulated Datasource | Demo Datasource
Category | Software as a Service (SaaS) 3rd Party Provider | Amazon RDS for PostgreSQL
Event | Software as a Service (SaaS) 3rd Party Provider | Amazon RDS for PostgreSQL
Venue | Software as a Service (SaaS) 3rd Party Provider | Amazon RDS for PostgreSQL
Listing | COTS E-commerce Platform | Amazon RDS for MySQL
Sales | COTS E-commerce Platform | Amazon RDS for MySQL
Date | COTS E-commerce Platform | Amazon RDS for MySQL
Users | Custom Customer Relationship Management (CRM) | Amazon RDS for SQL Server
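
The table-to-datasource mapping above can be captured as data, which makes it easy to see which tables would be read from each database engine. The following is a minimal sketch; the names `TICKIT_SOURCES` and `tables_by_engine` are hypothetical and are not taken from the demo repository. Only the table names and datasources come from the slide.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (table, simulated datasource, demo datasource), copied from the slide.
TICKIT_SOURCES: List[Tuple[str, str, str]] = [
    ("Category", "SaaS 3rd Party Provider", "Amazon RDS for PostgreSQL"),
    ("Event", "SaaS 3rd Party Provider", "Amazon RDS for PostgreSQL"),
    ("Venue", "SaaS 3rd Party Provider", "Amazon RDS for PostgreSQL"),
    ("Listing", "COTS E-commerce Platform", "Amazon RDS for MySQL"),
    ("Sales", "COTS E-commerce Platform", "Amazon RDS for MySQL"),
    ("Date", "COTS E-commerce Platform", "Amazon RDS for MySQL"),
    ("Users", "Custom CRM", "Amazon RDS for SQL Server"),
]


def tables_by_engine(rows: List[Tuple[str, str, str]]) -> Dict[str, List[str]]:
    """Group table names by the demo database engine that hosts them."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for table, _simulated, demo in rows:
        grouped[demo].append(table)
    return dict(grouped)


GROUPED = tables_by_engine(TICKIT_SOURCES)
for engine, tables in GROUPED.items():
    print(f"{engine}: {', '.join(tables)}")
```

Grouping by engine is one plausible way to plan extraction, for example one connection per source database; the demo itself may organize this differently.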

Architecture: AWS Services Used
Amazon Simple Storage Service (Amazon S3)
AWS Glue Studio (alt. AWS Glue DataBrew)
AWS Glue Data Catalog (alt. Apache Hive on EMR)
AWS Glue Crawlers (alt. CDC with AWS DMS or Kafka Connect)
AWS Glue Jobs (alt. AWS Glue DataBrew, or Apache Spark or Presto on EMR)
Amazon Athena (alt. Presto on EMR)
Amazon Managed Workflows for Apache Airflow (MWAA) (alt. AWS Step Functions)
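
An orchestrator such as Amazon MWAA (or AWS Step Functions) exists to run the other services in dependency order. The sketch below illustrates only that ordering idea, using Python's standard-library `graphlib` rather than Airflow itself. The four task names are assumptions chosen to match the services listed above; they are not the DAG from the demo repository.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Dict, List, Set

# Hypothetical tasks: each key maps to the set of tasks it must wait for.
DEPENDENCIES: Dict[str, Set[str]] = {
    "crawl_source_databases": set(),                 # AWS Glue Crawlers
    "run_glue_jobs_to_s3": {"crawl_source_databases"},  # AWS Glue Jobs writing to Amazon S3
    "crawl_s3_data": {"run_glue_jobs_to_s3"},        # catalog the new S3 data
    "query_with_athena": {"crawl_s3_data"},          # Amazon Athena
}


def execution_order(dependencies: Dict[str, Set[str]]) -> List[str]:
    """Return the tasks in an order that respects every dependency."""
    return list(TopologicalSorter(dependencies).static_order())


ORDER = execution_order(DEPENDENCIES)
print(" -> ".join(ORDER))
```

A real workflow engine adds scheduling, retries, and monitoring on top of this ordering, which is why a managed service is used instead of a hand-written script.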

Architecture: Out of Scope (but critically important)
Change Data Capture (CDC): Handling changes to systems of record
Transactional Storage Layer: Managing changes to the SoR in the data lake
Streaming Data: Data continuously generated by different sources
Fine-grained Authorization: database-, table-, column-, and row-level access
Data Lineage: Tracking data’s lifecycle as it flows from sources to consumption

Data Discovery/Inspection: Scanning data for sensitive or unexpected content (PII)
DataOps: Automating testing, deployment, job execution
Infrastructure as Code (IaC): Infrastructure provisioning automation
Data Warehousing (Lake House architecture)
Data Lake Storage Tiering, Archival, and Backup

Source Code
github.com/garystafford/tickit-data-lake-demo
