Dragan Berić will take a deep dive into Lakehouse architecture, a game-changing concept that combines the best elements of the data lake and the data warehouse. The presentation will focus on the Delta Lake format as the foundation of the Lakehouse philosophy, and on Databricks as the primary platform for its implementation.
6. Architecture evolution
two-tier architecture
• flexible storage
• low-cost storage
• easy access for ML & DS
• data duplication
• additional jobs for data movement
• maintenance (data + development)
13. Delta lake
what is delta?
Databricks Delta is a unified data management system that brings reliability and performance to existing data lakes
Delta Lake is an optimized, managed format for organizing & working with Parquet files
“It’s Parquet, just better!”
14. Delta lake
challenges with Parquet
• Hard to append data
• Update / Merge not supported
• Metadata does not scale (a lot of small files)
• A lot of small Parquet files (no auto-compaction)
• Jobs failing midway
17. Delta lake
delta log – transactional layer
folder that contains:
• table schema
• commit info
• metadata
• checkpoint file, written every 10 commits (transactions)
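To make the folder layout concrete, here is a minimal Python sketch that lists the file names a _delta_log folder might hold after a given number of commits. It assumes the naming used by open-source Delta Lake (zero-padded version numbers, 20 digits by default, with `.json` commits and `.checkpoint.parquet` checkpoints) and a checkpoint at every non-zero multiple of 10. The function name `log_file_names` and its parameters are hypothetical, chosen for illustration only.

```python
def log_file_names(num_commits, checkpoint_interval=10, width=20):
    """Return the file names a _delta_log folder would hold after num_commits commits.

    Assumption: one JSON file per commit, plus a checkpoint file whenever the
    version is a non-zero multiple of checkpoint_interval.
    """
    names = []
    for version in range(num_commits):
        names.append(f"{version:0{width}d}.json")
        if version > 0 and version % checkpoint_interval == 0:
            names.append(f"{version:0{width}d}.checkpoint.parquet")
    return names


# width=4 mirrors the shortened names used on the slides (0000.json, 0001.json, ...)
names = log_file_names(12, width=4)
print(names[-3:])
```

With 12 commits (versions 0 to 11) the folder holds 12 commit files and a single checkpoint, written at version 10.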
18. Delta lake
query execution plan
• Query is received
• Processing _delta_log:
  • Find and read latest checkpoint file
  • Read transactions after the checkpoint
• Read data referenced by checkpoint and transactions
• Return results
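The log-processing steps can be sketched in a few lines of Python. This is a simplified in-memory model, not the actual Delta Lake reader: the checkpoint is represented as a dictionary with a version and a set of file names, each commit as a list of `(action, path)` pairs, and all file names are made up for the example.

```python
def active_files(checkpoint, commits):
    """Replay the log: start from the checkpoint, then apply later commits in order."""
    files = set(checkpoint["files"])
    for version in sorted(commits):
        if version <= checkpoint["version"]:
            continue  # already folded into the checkpoint
        for action, path in commits[version]:
            if action == "add":
                files.add(path)
            elif action == "remove":
                files.discard(path)
    return files


# Hypothetical log state: checkpoint at version 10, two commits after it.
checkpoint = {"version": 10, "files": {"part-07.parquet", "part-08.parquet"}}
commits = {
    10: [("add", "part-08.parquet")],
    11: [("remove", "part-07.parquet"), ("add", "part-09.parquet")],
    12: [("add", "part-10.parquet")],
}

print(sorted(active_files(checkpoint, commits)))
# ['part-08.parquet', 'part-09.parquet', 'part-10.parquet']
```

Only the files in the resulting set are read to answer the query, so the reader never has to replay the log from version 0.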
19. Delta lake
DML operations – merge/update
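table before the update
Product   Price ($)
Apple     1.5
Banana    0.53
Lemon     1.42
storage before the update
• part-01 (3 rows)
_delta_log before the update
• 0000.json: "add":{"part-01.parquet",…}
statement
UPDATE DimProduct
SET Price = 1
WHERE Product = 'Lemon'
_delta_log after the update
• 0000.json: "add":{"part-01.parquet",…}
• 0001.json: "remove":{"part-01.parquet",…}, "add":{"part-02.parquet",…}
storage after the update
• part-01 (3 rows)
• part-02 (3 rows)
table after the update
Product   Price ($)
Apple     1.5
Banana    0.53
Lemon     1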
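The update can be mimicked with a small copy-on-write sketch in Python. It is an illustration under simplifying assumptions, not how Databricks implements UPDATE: storage is a dictionary of file name to rows, the log is a list of commits, and the helper `update_rows` is hypothetical. The data matches the DimProduct example.

```python
def update_rows(storage, active, log, predicate, change):
    """Rewrite every active file that holds a matching row and append one commit."""
    actions = []
    new_active = set(active)
    for path in sorted(active):
        rows = storage[path]
        if not any(predicate(r) for r in rows):
            continue  # files without matching rows stay active unchanged
        new_path = f"part-{len(storage) + 1:02d}.parquet"
        # The old file is never edited in place; a full copy is written instead.
        storage[new_path] = [change(r) if predicate(r) else dict(r) for r in rows]
        actions += [("remove", path), ("add", new_path)]
        new_active.discard(path)
        new_active.add(new_path)
    log.append(actions)
    return new_active


storage = {
    "part-01.parquet": [
        {"Product": "Apple", "Price": 1.5},
        {"Product": "Banana", "Price": 0.53},
        {"Product": "Lemon", "Price": 1.42},
    ]
}
log = [[("add", "part-01.parquet")]]   # commit 0000
active = {"part-01.parquet"}

active = update_rows(
    storage, active, log,
    predicate=lambda r: r["Product"] == "Lemon",
    change=lambda r: {**r, "Price": 1},
)
print(log[1])
print(storage["part-02.parquet"])
```

Note that part-01 is still present in storage with the old price; only the log decides that readers should now use part-02.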
20. Delta lake
DML operations – delete
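table before the delete
Product   Price ($)
Apple     1.5
Banana    0.53
Lemon     1
storage before the delete
• part-01 (3 rows)
• part-02 (3 rows)
_delta_log before the delete
• 0001.json: "remove":{"part-01.parquet",…}, "add":{"part-02.parquet",…}
statement
DELETE FROM DimProduct
WHERE Product = 'Lemon'
_delta_log after the delete
• 0001.json: "remove":{"part-01.parquet",…}, "add":{"part-02.parquet",…}
• 0002.json: "remove":{"part-02.parquet",…}, "add":{"part-03.parquet",…}
storage after the delete
• part-01 (3 rows)
• part-02 (3 rows)
• part-03 (2 rows)
table after the delete
Product   Price ($)
Apple     1.5
Banana    0.53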
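The delete follows the same pattern, and because the older Parquet files stay in storage, earlier versions of the table remain readable by replaying fewer commits. The Python sketch below illustrates this with a simplified in-memory model. The helpers `files_at`, `read_table`, and `delete_rows` are hypothetical, and the starting state assumes the table as it stands after the update on the previous slide.

```python
def files_at(log, version):
    """Files that are active after replaying commits 0..version."""
    files = set()
    for actions in log[: version + 1]:
        for action, path in actions:
            if action == "add":
                files.add(path)
            else:
                files.discard(path)
    return files


def read_table(storage, log, version=None):
    """Read the table at the given log version (latest if omitted)."""
    version = len(log) - 1 if version is None else version
    return [row for path in sorted(files_at(log, version)) for row in storage[path]]


def delete_rows(storage, log, predicate):
    """Rewrite files that contain matching rows without them, then commit."""
    actions = []
    for path in sorted(files_at(log, len(log) - 1)):
        rows = storage[path]
        kept = [r for r in rows if not predicate(r)]
        if len(kept) == len(rows):
            continue  # nothing to delete in this file
        actions.append(("remove", path))
        if kept:
            new_path = f"part-{len(storage) + 1:02d}.parquet"
            storage[new_path] = kept
            actions.append(("add", new_path))
    log.append(actions)


storage = {
    "part-01.parquet": [("Apple", 1.5), ("Banana", 0.53), ("Lemon", 1.42)],
    "part-02.parquet": [("Apple", 1.5), ("Banana", 0.53), ("Lemon", 1)],
}
log = [
    [("add", "part-01.parquet")],                                  # 0000
    [("remove", "part-01.parquet"), ("add", "part-02.parquet")],   # 0001
]

delete_rows(storage, log, lambda r: r[0] == "Lemon")               # writes 0002

print(read_table(storage, log))             # latest version, Lemon is gone
print(read_table(storage, log, version=1))  # table as of commit 0001
```

In this model nothing is ever physically removed from storage, which is why three part files exist after two DML statements even though only part-03 is active.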