
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Dragan Beric


Dragan Berić will take a deep dive into Lakehouse architecture, a game-changing concept bridging the best elements of data lake and data warehouse. The presentation will focus on the Delta Lake format as the foundation of the Lakehouse philosophy, and Databricks as the primary platform for its implementation.


  1. 1. Lakehouse architecture with Delta Lake & Databricks | Dragan Beric, Data engineer / Technical & Engineering lead
  2. 2. Agenda ● Motivation ● Architecture evolution ● Lakehouse ● Delta lake
  3. 3. Motivation Top 3 challenges in enterprise data usage: ○ Access ○ Reliability ○ Timeliness
  4. 4. Architecture evolution two-tier architecture
  5. 5. Architecture evolution two-tier architecture • flexible storage • low-cost storage • easy access for ML & DS
  6. 6. Architecture evolution two-tier architecture • flexible storage • low-cost storage • easy access for ML & DS • data duplication • additional jobs for data movement • maintenance (data + development)
  7. 7. Architecture evolution lakehouse architecture
  8. 8. Architecture evolution lakehouse architecture • flexible storage • low-cost storage • easy access for ML & DS • single source of truth • governance • schema validation • ACID transactions over files
  9. 9. Lakehouse A lakehouse is a new, open architecture that combines the best elements of data lakes and data warehouses
  10. 10. Lakehouse (diagram) • data warehouse: fast, structured, operational, governed • data lake: cheap, flexible, scalable • lakehouse
  11. 11. Delta lake what is delta? Databricks delta is a unified data management system that brings reliability and performance to existing data lakes
  12. 12. Delta lake what is delta? Databricks delta is a unified data management system that brings reliability and performance to existing data lakes Delta Lake is an optimized, managed format for organizing & working with Parquet files
  13. 13. Delta lake what is delta? Databricks delta is a unified data management system that brings reliability and performance to existing data lakes Delta Lake is an optimized, managed format for organizing & working with Parquet files “It’s Parquet, just better!”
  14. 14. Delta lake challenges with Parquet • Hard to append data • Update / Merge not supported • Metadata does not scale (a lot of small files) • A lot of small Parquet files (no auto-compaction) • Jobs failing midway
  15. 15. Delta lake structure folder that contains: • data in parquet • delta log • data
  17. 17. Delta lake delta log – transactional layer folder that contains: • table schema • commit info • metadata • a checkpoint file every 10 commits (transactions)
  18. 18. Delta lake query execution plan: Query is received → Process _delta_log → Find and read the latest checkpoint file → Read transactions after the checkpoint → Read data referenced by the checkpoint and transactions → Return results
  19. 19. Delta lake DML operations – merge/update • Table before (Product, Price($)): Apple 1.5, Banana 0.53, Lemon 1.42 • Statement: UPDATE DimProduct SET Price = 1 WHERE Product = 'Lemon' • _delta_log: 0000.json "add":{"part-01.parquet",…}; 0001.json "remove":{"part-01.parquet",…}, "add":{"part-02.parquet",…} • storage: part-01 (3 rows), part-02 (3 rows) • Table after: Apple 1.5, Banana 0.53, Lemon 1
  20. 20. Delta lake DML operations – delete • Table before (Product, Price($)): Apple 1.5, Banana 0.53, Lemon 1.42 • Statement: DELETE FROM DimProduct WHERE Product = 'Lemon' • _delta_log: 0001.json "remove":{"part-01.parquet",…}, "add":{"part-02.parquet",…}; 0002.json "remove":{"part-02.parquet",…}, "add":{"part-03.parquet",…} • storage: part-01 (3 rows), part-02 (3 rows), part-03 (2 rows) • Table after: Apple 1.5, Banana 0.53
  21. 21. Delta lake DML operations – insert • Table before (Product, Price($)): Apple 1.5, Banana 0.53 • Statement: INSERT INTO DimProduct VALUES ('Mango', 3) • _delta_log: 0002.json "remove":{"part-02.parquet",…}, "add":{"part-03.parquet",…}; 0003.json "add":{"part-04.parquet",…} • storage: part-01 (3 rows), part-02 (3 rows), part-03 (2 rows), part-04 (1 row) • Table after: Apple 1.5, Banana 0.53, Mango 3
  22. 22. Delta lake features – conclusion • ACID transactions • indexing • metadata scaling • governance • schema validation • time travel
  23. 23. HTEC group | USA, San Francisco | 535 Mission St, 14th floor | +1 415 490 8175 | office-sf@htecgroup.com | htecgroup.com Q&A
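
The log mechanics on slides 17 to 21 can be illustrated with a small, self-contained Python sketch. It is a simplified model, not Delta Lake's actual reader: the names `COMMITS`, `build_checkpoint`, and `live_files` are hypothetical, commits are held in a dictionary instead of being read from JSON files in `_delta_log`, and each action carries only a file name. The commit contents mirror the UPDATE, DELETE, and INSERT examples from the slides.

```python
from typing import Dict, List, Optional, Set

# Hypothetical in-memory stand-in for the _delta_log folder.
# Keys are commit versions; values are the actions recorded on the slides.
COMMITS: Dict[int, List[Dict[str, str]]] = {
    0: [{"add": "part-01.parquet"}],                                  # initial load
    1: [{"remove": "part-01.parquet"}, {"add": "part-02.parquet"}],   # UPDATE
    2: [{"remove": "part-02.parquet"}, {"add": "part-03.parquet"}],   # DELETE
    3: [{"add": "part-04.parquet"}],                                  # INSERT
}


def live_files(
    commits: Dict[int, List[Dict[str, str]]],
    as_of: Optional[int] = None,
    checkpoint: Optional[dict] = None,
) -> Set[str]:
    """Return the data files that make up the table at version `as_of`."""
    target = max(commits) if as_of is None else as_of
    files: Set[str] = set()
    start = 0
    # Use the checkpoint only if it is not newer than the requested version.
    if checkpoint is not None and checkpoint["version"] <= target:
        files = set(checkpoint["files"])
        start = checkpoint["version"] + 1
    # Replay the commits that came after the checkpoint, in order.
    for version in range(start, target + 1):
        for action in commits[version]:
            if "add" in action:
                files.add(action["add"])
            elif "remove" in action:
                files.discard(action["remove"])
    return files


def build_checkpoint(commits: Dict[int, List[Dict[str, str]]], version: int) -> dict:
    """Collapse commits 0..version into a single snapshot of live files."""
    return {"version": version, "files": sorted(live_files(commits, as_of=version))}


if __name__ == "__main__":
    checkpoint = build_checkpoint(COMMITS, 1)
    # Latest state: checkpoint at version 1 plus commits 2 and 3.
    print(sorted(live_files(COMMITS, checkpoint=checkpoint)))
    # prints ['part-03.parquet', 'part-04.parquet']
    # State as of version 1, right after the UPDATE.
    print(sorted(live_files(COMMITS, as_of=1)))
    # prints ['part-02.parquet']
```

Replaying the log only up to an older version is the idea behind the time travel feature listed on slide 22. In this model it works only because removed files such as part-01 remain in storage, as the storage column on the slides shows.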
