Most data practitioners grapple with data reliability issues—it’s the bane of their existence. Data engineers, in particular, strive to design, deploy, and serve reliable data in a performant manner so that their organizations can make the most of their valuable corporate data assets.
Delta Lake is an open-source storage layer that brings ACID transactions to Apache Spark™ and big data workloads. Built on open standards, Delta Lake employs co-designed compute and storage and is compatible with the Spark APIs. It delivers high data reliability and query performance to support big data use cases, from batch and streaming ingest to fast interactive queries and machine learning. In this tutorial we will discuss the requirements of modern data engineering, the challenges data engineers face when it comes to data reliability and performance, and how Delta Lake can help. Through presentation, code examples and notebooks, we will explain these challenges and how Delta Lake addresses them. You will walk away with an understanding of how to apply this innovation to your data architecture and the benefits you can gain.
This tutorial is both an instructor-led and a hands-on interactive session. Instructions on how to get the tutorial materials will be covered in class.
What you’ll learn:
Understand the key data reliability challenges
How Delta Lake brings reliability to data lakes at scale
Understand how Delta Lake fits within an Apache Spark™ environment
How to use Delta Lake to realize data reliability improvements
Prerequisites
A fully-charged laptop (8-16GB memory) with Chrome or Firefox
Pre-register for Databricks Community Edition
2. Steps to run this tutorial
Instructions - https://dbricks.co/saiseu19-delta
1. Create an account + sign in to Databricks Community Edition
https://databricks.com/try
2. Create a cluster with Databricks Runtime 6.1
3. Import the Python notebook and attach it to the cluster
You can also use the Scala notebook if you prefer
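Once the notebook is attached, a quick smoke test confirms the cluster can write and read Delta tables. This is a minimal sketch, assuming a Databricks notebook where `spark` (a SparkSession) is predefined; the path `/tmp/delta_smoke_test` is just an example:

```python
# Runs inside a Databricks notebook, where `spark` is predefined.
# Write a tiny DataFrame in Delta format, then read it back.
df = spark.range(5)  # single `id` column, values 0-4
df.write.format("delta").mode("overwrite").save("/tmp/delta_smoke_test")

readback = spark.read.format("delta").load("/tmp/delta_smoke_test")
print(readback.count())  # expect 5 rows if Delta Lake is working
```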
3. The Promise of the Data Lake
1. Collect Everything
2. Store it all in the Data Lake
3. Data Science & Machine Learning
• Recommendation Engines
• Risk, Fraud Detection
• IoT & Predictive Maintenance
• Genomics & DNA Sequencing
Garbage In → Garbage Stored → Garbage Out
Tutorial instructions - https://dbricks.co/saiseu19-delta
4. What does a typical data lake project look like?
5. Evolution of a Cutting-Edge Data Lake
(Diagram: Events flow into the Data Lake, which should feed Streaming Analytics and AI & Reporting; a question mark sits over how to get there.)
6. Evolution of a Cutting-Edge Data Lake
(Diagram: the same flow, Events into the Data Lake, feeding Streaming Analytics and AI & Reporting.)
7. Challenge #1: Historical Queries?
(Diagram: a lambda architecture (λ-arch) adds a batch path alongside the streaming path so that historical queries can be answered.)
8. Challenge #2: Messy Data?
(Diagram: validation steps are bolted onto both the streaming and batch paths of the lambda architecture to catch messy data.)
9. Challenge #3: Mistakes and Failures?
(Diagram: partitioned reprocessing jobs are added on top of the lambda architecture and validation steps to recover from mistakes and failures.)
11. Wasting Time & Money
Solving Systems Problems Instead of Extracting Value From Data
12. Data Lake Distractions
✗ No atomicity: failed production jobs leave data in a corrupt state, requiring tedious recovery.
✗ No quality enforcement: inconsistent and unusable data.
✗ No consistency / isolation: almost impossible to mix appends and reads, batch and streaming.
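To make the "no quality enforcement" point concrete, Delta Lake rejects an append whose schema does not match the table, instead of silently storing inconsistent records the way a plain-file data lake would. A minimal sketch, assuming a Databricks notebook with `spark` predefined and an illustrative path:

```python
# Delta's schema enforcement in action (Databricks notebook sketch).
path = "/tmp/delta_schema_demo"  # example path
spark.range(3).write.format("delta").mode("overwrite").save(path)  # schema: id BIGINT

bad = spark.createDataFrame([("oops",)], ["id"])  # id is now a STRING
try:
    bad.write.format("delta").mode("append").save(path)
except Exception as e:
    print("append rejected:", type(e).__name__)  # schema-mismatch error, table unchanged
```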
13. Let’s try it instead with Delta Lake
14. Challenges of the Data Lake
(Diagram: the full picture so far: lambda architecture, validation, partitioned reprocessing, and UPDATE & MERGE jobs scheduled to avoid concurrent modifications, four layers of systems work stacked on the data lake.)
15. The Architecture: Data Quality Levels
Delta Lake allows you to incrementally improve the quality of your data until it is ready for consumption.
• Bronze: raw ingestion (CSV, JSON, TXT…, Kinesis)
• Silver: filtered, cleaned, augmented
• Gold: business-level aggregates, feeding Streaming Analytics and AI & Reporting
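The Bronze → Silver → Gold flow can be sketched as a short pipeline. This is illustrative only, assuming a Databricks notebook with `spark` predefined; the paths, event columns, and source (JSON files landed from Kinesis) are example choices:

```python
# Sketch of the Bronze -> Silver -> Gold data quality levels.
from pyspark.sql import functions as F

# Bronze: raw ingestion, stored as-is
raw = spark.read.json("/data/events/raw/")  # e.g. JSON landed from Kinesis
raw.write.format("delta").mode("append").save("/delta/bronze/events")

# Silver: filtered, cleaned, augmented
bronze = spark.read.format("delta").load("/delta/bronze/events")
silver = (bronze
          .filter(F.col("event_type").isNotNull())     # drop malformed records
          .withColumn("date", F.to_date("timestamp"))) # augment with a date column
silver.write.format("delta").mode("append").save("/delta/silver/events")

# Gold: business-level aggregates
gold = silver.groupBy("date", "event_type").count()
gold.write.format("delta").mode("overwrite").save("/delta/gold/daily_counts")
```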
16. Full ACID Transactions
Focus on your data flow, instead of worrying about failures.
(Same Bronze / Silver / Gold architecture diagram as slide 15.)
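Atomicity is easiest to see with an overwrite: a Delta commit either fully succeeds or leaves the previous table version intact, so readers never see a half-written result. A minimal sketch, assuming a Databricks notebook with `spark` predefined and an example path:

```python
# Sketch: atomic overwrite with Delta Lake (Databricks notebook).
path = "/delta/demo_summary"  # illustrative path
spark.range(100).write.format("delta").mode("overwrite").save(path)

# If this second job dies midway, concurrent readers still see the
# last committed version (100 rows), never a partial mix of files.
spark.range(10).write.format("delta").mode("overwrite").save(path)
print(spark.read.format("delta").load(path).count())  # 10 after the commit lands
```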
17. Open Standards, Open Source
Store petabytes of data without worries of lock-in. Growing community including Spark, Presto, Hive and more.
(Same Bronze / Silver / Gold architecture diagram as slide 15.)
18. Unifies Streaming / Batch
Convert existing jobs with minimal modifications.
(Same Bronze / Silver / Gold architecture diagram as slide 15.)
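The same Delta table can serve as a streaming sink and a batch source at once, which is what lets you retire the separate lambda-architecture paths. A sketch, assuming a Databricks notebook with `spark` predefined; paths and the checkpoint location are illustrative:

```python
# Sketch: one Delta table, read as a stream and as a batch source.
stream = (spark.readStream
          .format("delta")
          .load("/delta/bronze/events"))  # treat the bronze table as a stream

query = (stream.writeStream
         .format("delta")
         .option("checkpointLocation", "/delta/_checkpoints/silver")
         .start("/delta/silver/events"))  # continuously append downstream

# Meanwhile, a batch query can read the same silver table at any time:
spark.read.format("delta").load("/delta/silver/events").count()
```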
19. Support for DML
UPDATE, DELETE, MERGE, OVERWRITE, INSERT
Use Delete / Update / Merge operations for data corrections, GDPR compliance, Change Data Capture, etc.
(Same Bronze / Silver / Gold architecture diagram as slide 15.)
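An upsert with MERGE covers most of these cases in one statement, e.g. applying change-data-capture records while honoring deletion requests. A sketch, assuming a Databricks notebook with `spark` predefined; the table paths, `customer_id` join key, and `gdpr_delete` flag are illustrative names:

```python
# Sketch: upsert + conditional delete with the Delta Lake MERGE API.
from delta.tables import DeltaTable

target = DeltaTable.forPath(spark, "/delta/silver/customers")
updates = spark.read.format("delta").load("/delta/bronze/customer_changes")

(target.alias("t")
 .merge(updates.alias("u"), "t.customer_id = u.customer_id")
 .whenMatchedDelete(condition="u.gdpr_delete = true")  # honor deletion requests
 .whenMatchedUpdateAll()                               # apply changed fields
 .whenNotMatchedInsertAll()                            # insert new customers
 .execute())
```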
20. Delta Lake capabilities
• Open source and open formats
• Unified batch and streaming sources
• ACID transactions
• Schema enforcement and evolution
• Delete, Update, Merge
• Audit history
• Versioning and time travel
• Scalable metadata management
• Support from Spark, Presto, Hive
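Versioning, time travel, and audit history are all exposed through reads on the transaction log. A sketch, assuming a Databricks notebook with `spark` predefined and an illustrative table path:

```python
# Sketch: Delta Lake time travel and audit history.
# Query an earlier version of the table by version number:
v0 = spark.read.format("delta").option("versionAsOf", 0).load("/delta/silver/events")

# Or by timestamp:
old = (spark.read.format("delta")
       .option("timestampAsOf", "2019-10-01")
       .load("/delta/silver/events"))

# Inspect the full audit history of commits:
spark.sql("DESCRIBE HISTORY delta.`/delta/silver/events`").show()
```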
21. Used by 1000s of organizations worldwide
> 2 exabytes processed last month alone