We’re always told to ‘Go for the Gold!’, but how do we get there? This talk will walk you through the process of moving your data to the finish line to get that gold medal! A common data engineering pipeline architecture uses tables that correspond to different quality levels, progressively adding structure to the data: data ingestion (‘Bronze’ tables), transformation/feature engineering (‘Silver’ tables), and machine learning training or prediction (‘Gold’ tables). Combined, we refer to these tables as a ‘multi-hop’ architecture. It allows data engineers to build a pipeline that begins with raw data as a ‘single source of truth’ from which everything flows. In this session, we will show how to build a scalable data engineering pipeline using Delta Lake, so you can be the champion in your organization.
3. Today’s Speaker
● Solutions Architect @ Databricks
● MS Computer Science, BS Biology
● Previously: HP, Teradata, DataStax, Esgyn
● PMC member and committer on Apache Trafodion
● 5 different distributed systems
● Course with Udacity on Data Engineering
4. Agenda
● Data Engineers’ Nightmares and Dreams
● The Data Lifecycle vs. the Delta Lifecycle
● Transitioning Your Data Pipeline to Delta
● How Dreams Come True
● DEMO!
● How to Use Delta
6. The Data Engineer’s Journey…
[Diagram: Events arrive as a stream into a table where data gets written continuously; a second stream compacts the data into another table every hour; batch jobs, reprocessing, and update & merge steps feed AI & Reporting]
7. The Data Engineer’s Journey… into a Nightmare
[Same diagram, now annotated with the pain points: building a unified view over the streaming and batch tables, validation between hops, and updates & merges that get complex on a data lake]
8. The Data Engineer’s Journey… into a Nightmare
[Repeats the previous diagram and pain-point annotations]
Can this be simplified?
9. A Data Engineer’s Dream...
[Diagram: sources (CSV, JSON, TXT…, Kinesis) flow into a Data Lake and on to AI & Reporting]
Process data continuously and incrementally as new data arrives, in a cost-efficient way, without having to choose between batch or streaming.
10. What’s missing?
1. Ability to read consistent data while data is being written
2. Ability to read incrementally from a large table with good throughput
3. Ability to roll back in case of bad writes
4. Ability to replay historical data alongside newly arrived data
5. Ability to handle late-arriving data without having to delay downstream processing
[Diagram: sources (CSV, JSON, TXT…, Kinesis) → Data Lake → ? → AI & Reporting]
11. So… What is the answer?
[Structured Streaming + Delta Lake = The Delta Architecture]
1. Unify batch & streaming with a continuous data flow model
2. Infinite retention to replay/reprocess historical events as needed
3. Independent, elastic compute and storage to scale while balancing costs
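The “continuous and incremental” idea can be illustrated without Spark at all: a consumer that persists its last-processed offset (much as a Structured Streaming checkpoint does) and handles only new records on each run. A minimal plain-Python sketch; the checkpoint file name and record source are invented for illustration:

```python
import json
import os
import tempfile

def process_new_records(records, checkpoint_path):
    """Process only records beyond the last checkpointed offset,
    then advance the checkpoint (mimicking one streaming trigger run)."""
    offset = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            offset = json.load(f)["offset"]
    new = records[offset:]                      # incremental slice only
    with open(checkpoint_path, "w") as f:
        json.dump({"offset": len(records)}, f)  # commit the new offset
    return new

ckpt = os.path.join(tempfile.mkdtemp(), "offsets.json")
events = ["e1", "e2", "e3"]
print(process_new_records(events, ckpt))   # first run: all three events
events += ["e4"]
print(process_new_records(events, ckpt))   # second run: only the new e4
```

Each invocation is cheap because it touches only data that arrived since the last run — the same property that lets the Delta Architecture run either continuously or on a manual trigger.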
13. The Delta Architecture: Data Quality Levels
[Diagram: sources (CSV, JSON, TXT…, Kinesis) and the Data Lake feed Bronze (raw ingestion) → Silver (filtered, cleaned, augmented) → Gold (business-level aggregates) → Streaming Analytics and AI & Reporting; quality increases with each hop]
Delta Lake allows you to incrementally improve the quality of your data until it is ready for consumption.
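In the talk this multi-hop pipeline is built with Spark and Delta tables; purely to make the Bronze → Silver → Gold hops concrete, here is a plain-Python sketch (the record fields and cleaning rule are invented for illustration):

```python
import json
from collections import defaultdict

# Bronze: raw ingestion -- keep every record exactly as it arrived
bronze = [
    '{"user": "a", "amount": "10"}',
    '{"user": "b", "amount": "oops"}',   # malformed value
    '{"user": "a", "amount": "5"}',
]

# Silver: filtered, cleaned, augmented -- parse and drop bad rows
silver = []
for line in bronze:
    rec = json.loads(line)
    try:
        rec["amount"] = int(rec["amount"])
    except ValueError:
        continue                          # drop (or quarantine) malformed rows
    silver.append(rec)

# Gold: business-level aggregates -- ready for reporting
gold = defaultdict(int)
for rec in silver:
    gold[rec["user"]] += rec["amount"]

print(dict(gold))   # → {'a': 15}
```

The Bronze layer stays untouched as the single source of truth, so Silver and Gold can always be recomputed from it if the cleaning or aggregation logic changes.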
15-16. The Data Lifecycle of the Past
[Diagram: the same Bronze/Silver/Gold pipeline from sources (CSV, JSON, TXT…, Kinesis) to Streaming Analytics and AI & Reporting, with a Data Lake as the storage layer]
17-18. The Data Lifecycle
[The diagram builds up the systems that historically implemented it: the Data Lake for storage, Apache Spark for processing, and a DW/OLAP system for serving]
21. Bronze
[Same pipeline diagram, Bronze stage highlighted]
• Dumping ground for raw data
• Often with long retention (years)
• Avoid error-prone parsing
22. Silver
[Same pipeline diagram, Silver stage highlighted]
Intermediate data with some cleanup applied.
Queryable for easy debugging!
23. Gold
[Same pipeline diagram, Gold stage highlighted]
Clean data, ready for consumption.
Read with Spark or Presto*
24. [Same pipeline diagram]
Streams move data through the Delta Lake:
• Low-latency or manually triggered
• Eliminates management of schedules and jobs
25. [Same pipeline diagram]
Delta Lake also supports batch jobs and standard DML:
• INSERT, UPDATE, DELETE, MERGE, OVERWRITE
• GDPR, CCPA
• Upserts
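Delta’s MERGE performs these upserts transactionally at table scale; the semantics it implements can be sketched in plain Python (the table layout and key column here are invented for illustration):

```python
def merge(target, updates, key="id"):
    """Upsert `updates` into `target` (both lists of dicts), keyed on `key`:
    matching rows are updated, new rows are inserted -- the core semantics
    of MERGE INTO ... WHEN MATCHED / WHEN NOT MATCHED."""
    by_key = {row[key]: dict(row) for row in target}
    for row in updates:
        if row[key] in by_key:
            by_key[row[key]].update(row)   # WHEN MATCHED THEN UPDATE
        else:
            by_key[row[key]] = dict(row)   # WHEN NOT MATCHED THEN INSERT
    return list(by_key.values())

target = [{"id": 1, "v": "old"}, {"id": 2, "v": "keep"}]
updates = [{"id": 1, "v": "new"}, {"id": 3, "v": "ins"}]
print(merge(target, updates))   # id 1 updated, id 2 untouched, id 3 inserted
```

On a plain data lake this read-modify-rewrite dance is exactly what gets complex and error-prone; Delta makes it a single atomic statement, which is what enables GDPR/CCPA deletes and upserts.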
26. [Same pipeline diagram, with DELETE applied to the Silver and Gold tables]
Easy to recompute when business logic changes:
• Clear tables
• Restart streams
30-34. Connecting the dots...
[Diagram: sources (CSV, JSON, TXT…, Kinesis) → Data Lake → ? → AI & Reporting]
Each missing ability maps to a Delta Lake feature:
1. Ability to read consistent data while data is being written → Snapshot isolation between writers and readers
2. Ability to read incrementally from a large table with good throughput → Optimized file source with scalable metadata handling
3. Ability to roll back in case of bad writes → Time travel
4. Ability to replay historical data alongside newly arrived data → Stream the backfilled historical data through the same pipeline
5. Ability to handle late-arriving data without having to delay downstream processing → Stream any late-arriving data as it gets added to the table
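Snapshot isolation and time travel both fall out of the same mechanism: a versioned transaction log where every write is a new numbered commit and a reader pins one version. A minimal plain-Python sketch of that idea (the log is an in-memory list here; Delta’s real `_delta_log` is a sequence of JSON commit files on storage):

```python
class VersionedTable:
    """Toy transaction log: each commit appends a full table snapshot.
    Readers pin a version, so in-flight writes never affect them
    (snapshot isolation), and old versions stay readable (time travel)."""

    def __init__(self):
        self._log = [[]]                  # version 0: empty table

    def commit(self, rows):
        new = self._log[-1] + list(rows)  # next version = old data + new rows
        self._log.append(new)
        return len(self._log) - 1         # the new version number

    def read(self, version=None):
        if version is None:
            version = len(self._log) - 1  # latest committed snapshot
        return list(self._log[version])   # consistent view of one version

t = VersionedTable()
t.commit(["good-1", "good-2"])   # version 1
snapshot = t.read()              # reader pins version 1
t.commit(["bad-row"])            # version 2: a bad write lands
print(snapshot)                  # reader still sees ['good-1', 'good-2']
print(t.read(version=1))         # time travel past the bad write
```

Rolling back a bad write is then just reading (or re-committing) an earlier version, which is exactly what makes the reprocessing story in the earlier slides safe.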