As presented: Sajith Appukuttan, Solution Architect, Databricks
Sept 12, 2019 at Vancouver Spark Meetup
Abstract: Delta Lake is an open-source storage layer that brings reliability to data lakes. Delta Lake offers ACID transactions, scalable metadata handling, and unifies streaming and batch data processing. It runs on top of your existing data lake and is fully compatible with Apache Spark APIs.
Delta Lake: Open Source Reliability w/ Apache Spark
1. Delta Lake: Open Source
Reliability with Apache Spark
Sajith Appukuttan
2. 1. Collect
Everything
• Recommendation Engines
• Risk, Fraud Detection
• IoT & Predictive Maintenance
• Genomics & DNA Sequencing
3. Data Science &
Machine Learning
2. Store it all in
the Data Lake
The Promise of the Data Lake
Garbage In Garbage Stored Garbage Out
��
��
��
����
��
��
3. What does a typical
data lake project look like?
4. Evolution of a Cutting-Edge Data Lake
Events
?
AI & Reporting
Streaming
Analytics
Data Lake
5. Evolution of a Cutting-Edge Data Lake
Events
AI & Reporting
Streaming
Analytics
Data Lake
6. Challenge #1: Historical Queries?
Data Lake
λ-arch
λ-arch
Streaming
Analytics
AI & Reporting
Events
λ-arch1
1
1
7. Challenge #2: Messy Data?
Data Lake
λ-arch
λ-arch
Streaming
Analytics
AI & Reporting
Events
Validation
λ-arch
Validation
1
21
1
2
8. Reprocessing
Challenge #3: Mistakes and Failures?
Data Lake
λ-arch
λ-arch
Streaming
Analytics
AI & Reporting
Events
Validation
λ-arch
Validation
Reprocessing
Partitioned
1
2
3
1
1
3
2
9. Reprocessing
Challenge #4: Updates?
Data Lake
λ-arch
λ-arch
Streaming
Analytics
AI & Reporting
Events
Validation
λ-arch
Validation
Reprocessing
Updates
Partitioned
UPDATE &
MERGE
Scheduled to
Avoid
Modifications
1
2
3
1
1
3
4
4
4
2
10. Wasting Time & Money
Solving Systems Problems
Instead of Extracting Value From Data
11. Data Lake Distractions
No atomicity means failed production jobs
leave data in corrupt state requiring tedious
recovery
✗
No quality enforcement creates inconsistent
and unusable data
No consistency / isolation makes it almost
impossible to mix appends and reads, batch and
streaming
13. Reprocessing
Challenges of the Data Lake
Data Lake
λ-arch
λ-arch
Streaming
Analytics
AI & Reporting
Events
Validation
λ-arch
Validation
Reprocessing
Updates
Partitioned
UPDATE &
MERGE
Scheduled to
Avoid
Modifications
1
2
3
1
1
3
4
4
4
2
15. AI & Reporting
Streaming
Analytics
The Architecture
Data Lake
CSV,
JSON,
TXT…
Kinesis
Full ACID Transactions on your Big Data
Focus on your data flow, instead of worrying about failures.
16. AI & Reporting
Streaming
Analytics
The Architecture
Data Lake
CSV,
JSON,
TXT…
Kinesis
Open Standards, Open Source (Apache License)
Store petabytes of data without worries of lock-in. Growing
community including Presto, Spark and more.
17. AI & Reporting
Streaming
Analytics
The Architecture
Data Lake
CSV,
JSON,
TXT…
Kinesis
Powered by
Unifies Streaming / Batch. Convert existing jobs with minimal
modifications.
18. Data Lake
AI & Reporting
Streaming
Analytics
Business-level
Aggregates
Filtered, Cleaned
Augmented
Raw
Ingestion
The
Bronze Silver Gold
CSV,
JSON,
TXT…
Kinesis
Delta Lake allows you to incrementally improve the
quality of your data until it is ready for consumption.
*Data Quality Levels *
19. Data Lake
AI & Reporting
Streaming
Analytics
Business-level
Aggregates
Filtered, Cleaned
Augmented
Raw
Ingestion
The
Bronze Silver Gold
CSV,
JSON,
TXT…
Kinesis
•Dumping ground for raw data
•Often with long retention (years)
•Avoid error-prone parsing
��
20. Data Lake
AI & Reporting
Streaming
Analytics
Business-level
Aggregates
Filtered, Cleaned
Augmented
Raw
Ingestion
The
Bronze Silver Gold
CSV,
JSON,
TXT…
Kinesis
Intermediate data with some cleanup applied.
Queryable for easy debugging!
21. Data Lake
AI & Reporting
Streaming
Analytics
Business-level
Aggregates
Filtered, Cleaned
Augmented
Raw
Ingestion
The
Bronze Silver Gold
CSV,
JSON,
TXT…
Kinesis
Clean data, ready for consumption.
Read with Spark or Presto*
*Coming Soon
22. Data Lake
AI & Reporting
Streaming
Analytics
Business-level
Aggregates
Filtered, Cleaned
Augmented
Raw
Ingestion
The
Bronze Silver Gold
CSV,
JSON,
TXT…
Kinesis
Streams move data through the Delta Lake
•Low-latency or manually triggered
•Eliminates management of schedules and jobs
23. Data Lake
AI & Reporting
Streaming
Analytics
Business-level
Aggregates
Filtered, Cleaned
Augmented
Raw
Ingestion
The
Bronze Silver Gold
CSV,
JSON,
TXT…
Kinesis
Delta Lake also supports batch jobs
and standard DML
UPDATE
DELETE
MERGE
OVERWRITE
• Retention
• Corrections
• GDPR
INSERT
*DML released in 0.3.0
24. Data Lake
AI & Reporting
Streaming
Analytics
Business-level
Aggregates
Filtered, Cleaned
Augmented
Raw
Ingestion
The
Bronze Silver Gold
CSV,
JSON,
TXT…
Kinesis
Easy to recompute when business logic changes:
• Clear tables
• Restart streams
DELETE DELETE
31. Sign up for
Databricks Community Edition
Go to:
databricks.com/try
and choose Community Edition
32.
33.
34.
35.
36. Notebooks
01 - Delta Lake Primer https://dbricks.co/dlw-01
02 - Delta Lake - Introducing ML https://dbricks.co/dlw-02
03 - Delta Lake - XGBoost 0.81 https://dbricks.co/dlw-03
37. Join the Delta Lake
Community!
Slack Channel | Mailing List
38. Apache Spark™
• Use Cases
• Research
• Technical Deep Dives
AI
• Productionizing ML
• Deep Learning
Fields
• Data Science
• Data Engineering
• Enterprise
1700+ ATTENDEES
Practitioners:
Data Scientists, Data Engineers,
Analysts, Architects
Leaders:
Engineering Management, VPs,
Heads of Analytics & Data, CxOs
TRACKS
databricks.com/sparkaisummit/europe
CODE: Databricks20