Making Data Timelier and More Reliable
with Lakehouse Technology
Matei Zaharia
Databricks and Stanford University
Talk Outline
§ Many problems with data analytics today stem from the complex data
architectures we use
§ New “Lakehouse” technologies can remove this complexity by enabling
fast data warehousing, streaming & ML directly on data lake storage
The biggest challenges with data today:
data quality and staleness
Data Analyst Survey
60% reported data quality as top challenge
86% of analysts had to use stale data, with
41% using data that is >2 months old
90% regularly had unreliable data sources
Data Scientist Survey
[Chart: 75%, 51%, 42%; category labels not recoverable]
Getting high-quality, timely data is hard…
but it’s partly a problem of our own making!
The Evolution of
Data Management
1980s: Data Warehouses
§ ETL data directly from operational
database systems
§ Purpose-built for SQL analytics & BI:
schemas, indexes, caching, etc
§ Powerful management features such as
ACID transactions and time travel
[Diagram: Operational Data → ETL → Data Warehouses → BI Reports]
2010s: New Problems for Data Warehouses
§ Could not support rapidly growing
unstructured and semi-structured data:
time series, logs, images, documents, etc
§ High cost to store large datasets
§ No support for data science & ML
2010s: Data Lakes
§ Low-cost storage to hold all raw data
(e.g. Amazon S3, HDFS)
▪ $12/TB/month for S3 infrequent tier!
§ ETL jobs then load specific data into
warehouses, possibly for further ELT
§ Directly readable in ML libraries (e.g.
TensorFlow) due to open file format
[Diagram: a Data Lake (Structured, Semi-Structured & Unstructured Data) with Data Preparation and ETL steps feeding Data Warehouses and a Real-Time Database; consumers are BI, Reports, Data Science and Machine Learning]
Problems with Today’s Data Lakes
Cheap to store all the data, but system architecture is much more complex!
Data reliability suffers:
§ Multiple storage systems with different
semantics, SQL dialects, etc
§ Extra ETL steps that can go wrong
Timeliness suffers:
§ Extra ETL steps before data is available
in data warehouses
Summary
At least some of the problems in modern data architectures are due to
unnecessary system complexity
§ We wanted low-cost storage for large historical data, but we
designed separate storage systems (data lakes) for that
§ Now we need to sync data across systems all the time!
What if we didn’t need to have all these different data systems?
Lakehouse Technology
New techniques to provide data warehousing features directly on
data lake storage
§ Retain existing open file formats (e.g. Apache Parquet, ORC)
§ Add management and performance features on top
(transactions, data versioning, indexes, etc)
§ Can also help eliminate other data systems, e.g. message queues
Key parts: metadata layers such as Delta Lake (from Databricks)
and Apache Iceberg (from Netflix) + new engine designs
Lakehouse Vision
§ Data lake storage for all data (Structured, Semi-Structured & Unstructured Data)
§ Single platform for every use case (Streaming Analytics, BI, Data Science, Machine Learning)
§ Management features (transactions, versioning, etc)
Key Technologies Enabling Lakehouse
1. Metadata layers for data lakes: add transactions, versioning & more
2. New query engine designs: great SQL performance on data lake
storage systems and file formats
3. Optimized access for data science & ML
Metadata Layers for Data Lakes
§ A data lake is normally just a collection of files
§ Metadata layers keep track of which files are part of a table to enable
richer management features such as transactions
▪ Clients can then still access the underlying files at high speed
§ Implemented in multiple systems:
ACID
▪ Keep metadata in the object store itself
▪ Keep metadata in a database
Example: Basic Data Lake
“events” table: file1.parquet, file2.parquet, file3.parquet
Query: delete all events data about customer #17
▪ rewrite file1.parquet → file1b.parquet
▪ rewrite file3.parquet → file3b.parquet
▪ + delete file1.parquet
▪ + delete file3.parquet
Problem: What if a query reads the table while the delete is running?
Problem: What if the query doing the delete fails partway through?
Example with Delta Lake
“events” table: file1.parquet, file2.parquet, file3.parquet
_delta_log / v1.parquet
           / v2.parquet
▪ track which files are part of each version of the table (e.g. v2 = file1, file2, file3)
Query: delete all events data about customer #17
▪ rewrite file1.parquet → file1b.parquet
▪ rewrite file3.parquet → file3b.parquet
▪ atomically add new log file: _delta_log / v3.parquet (v3 = file1b, file2, file3b)
Clients now always read a consistent table version!
• If a client reads v2 of log, it sees file1, file2, file3 (no delete)
• If a client reads v3 of log, it sees file1b, file2, file3b (all deleted)
See our VLDB 2020 paper for details
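The versioned-log idea above can be sketched in a few lines of plain Python. This is only an illustrative in-memory model: the `TableLog` class and its methods are hypothetical names, not Delta Lake's actual log format or API, and the contents of v1 are assumed for the example.

```python
class TableLog:
    """Hypothetical in-memory table log: one complete file list per version."""

    def __init__(self):
        self._versions = []  # index i holds the file list for version i + 1

    def commit(self, files):
        """Publish a new version; the single append stands in for the atomic step."""
        self._versions.append(tuple(files))
        return len(self._versions)

    def latest_version(self):
        return len(self._versions)

    def snapshot(self, version):
        """Return the files that make up the given table version."""
        return self._versions[version - 1]


log = TableLog()
log.commit(["file1.parquet", "file2.parquet"])  # v1 (assumed contents)
log.commit(["file1.parquet", "file2.parquet", "file3.parquet"])  # v2

# A reader pins the current version before the delete starts.
reader_version = log.latest_version()

# The delete rewrites file1 and file3 first; nothing changes for readers
# until the new file list is committed as v3.
v3 = log.commit(["file1b.parquet", "file2.parquet", "file3b.parquet"])

print(log.snapshot(reader_version))  # the reader still sees the v2 files
print(log.snapshot(v3))              # new readers see the rewritten files
```

If the delete job crashes before the commit, the rewritten files are never listed in any version, so readers never observe a half-finished delete.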
Other Management Features with Delta Lake
§ Time travel to an old table version
SELECT * FROM my_table
TIMESTAMP AS OF '2020-05-01'
§ Zero-copy CLONE by forking the log
CREATE TABLE my_table_dev
SHALLOW CLONE my_table
§ DESCRIBE HISTORY
§ INSERT, UPSERT, DELETE & MERGE
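One way to picture time travel by timestamp: the log records when each version was committed, and a query "as of" a time reads the newest version committed at or before it. The sketch below assumes a hypothetical commit history and a hypothetical helper `version_as_of`; it is not the engine's implementation.

```python
from bisect import bisect_right
from datetime import datetime

# Hypothetical commit history: (commit time, version number), sorted by time.
commits = [
    (datetime(2020, 4, 1), 1),
    (datetime(2020, 4, 20), 2),
    (datetime(2020, 5, 10), 3),
]


def version_as_of(commits, timestamp):
    """Return the newest version committed at or before `timestamp`."""
    times = [t for t, _ in commits]
    i = bisect_right(times, timestamp)  # number of commits not after timestamp
    if i == 0:
        raise ValueError("no table version exists at that time")
    return commits[i - 1][1]


print(version_as_of(commits, datetime(2020, 5, 1)))
```

With this history, a query as of 2020-05-01 resolves to version 2, because version 3 was committed later.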
Other Management Features with Delta Lake (continued)
§ Streaming I/O: treat a table as a
stream of changes to remove need
for message buses like Kafka
§ Schema enforcement & evolution
§ Expectations for data quality
CREATE TABLE orders (
product_id INTEGER NOT NULL,
quantity INTEGER CHECK(quantity > 0),
list_price DECIMAL CHECK(list_price > 0),
discount_price DECIMAL
CHECK(discount_price > 0 AND
discount_price <= list_price)
);
spark.readStream
.format("delta")
.table("events")
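Treating a table as a stream of changes can be pictured as a consumer that remembers the last log version it processed and asks only for files added since then. The following is a minimal sketch with a hypothetical log layout and a hypothetical `read_changes` helper; it does not reproduce how Spark's streaming reader works internally.

```python
# Hypothetical log: each version lists the files added by that commit.
added_files_by_version = {
    1: ["events-000.parquet"],
    2: ["events-001.parquet", "events-002.parquet"],
    3: ["events-003.parquet"],
}


def read_changes(log, last_seen_version):
    """Return (new files, new offset) for commits after `last_seen_version`."""
    new_versions = sorted(v for v in log if v > last_seen_version)
    files = [f for v in new_versions for f in log[v]]
    offset = new_versions[-1] if new_versions else last_seen_version
    return files, offset


# A consumer that has already processed version 1 picks up versions 2 and 3.
batch, offset = read_changes(added_files_by_version, 1)
print(batch, offset)

# Asking again with the new offset returns nothing until another commit lands.
print(read_changes(added_files_by_version, offset))
```

Because the offset is just a version number in the table's own log, no separate message bus is needed to track what each consumer has seen.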
Adoption
§ Used by thousands of companies to process exabytes of data/day
§ Grew from zero to ~50% of the Databricks workload in 3 years
§ Largest deployments: exabyte tables and 1000s of users
Available Connectors
[Logos omitted: systems to store data in, ingest from, and query from]
Key Technologies Enabling Lakehouse
1. Metadata layers for data lakes: add transactions, versioning & more
2. New query engine designs: great SQL performance on data lake
storage systems and file formats
3. Optimized access for data science & ML
The Challenge
§ Most data warehouses have full control over the data storage system
and query engine, so they design them together
§ The key idea in a Lakehouse is to store data in open storage formats
(e.g. Parquet) for direct access from many systems
§ How can we get great performance with these standard, open formats?
Enabling Lakehouse Performance
Even with a fixed, directly-accessible storage format, four optimizations
can enable great SQL performance:
§ Caching hot data, possibly in a different format
§ Auxiliary data structures like statistics and indexes
§ Data layout optimizations to minimize I/O
§ Vectorized execution engines for modern CPUs
New query engines such as Databricks Delta Engine use these ideas
Optimization 1: Caching
§ Most query workloads have concentrated accesses on “hot” data
▪ Data warehouses use SSD and memory caches to improve performance
§ The same techniques work in a Lakehouse if we have a metadata layer
such as Delta Lake to correctly maintain the cache
▪ Caches can even hold data in a faster format (e.g. decompressed)
§ Example: SSD cache in Databricks Delta Engine
[Chart: values read per second per core (millions), comparing Parquet on S3, Parquet on SSD, and Delta Engine cache]
Optimization 2: Auxiliary Data Structures
§ Even if the base data is in Parquet, we can build many other data
structures to speed up queries and maintain them transactionally
▪ Inspired by the literature on databases for “raw” data formats
§ Example: min/max statistics on Parquet files for data skipping
file1.parquet: year min 2018, max 2019; uid min 12000, max 23000
file2.parquet: year min 2018, max 2020; uid min 12000, max 14000
file3.parquet: year min 2020, max 2020; uid min 23000, max 25000
(statistics updated transactionally with Delta table log)
Query: SELECT * FROM events
WHERE year=2020 AND uid=24000
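The data-skipping check for this query can be sketched as follows, using the per-file statistics from the example. The function name `files_to_scan` and the dictionary layout are illustrative assumptions.

```python
# Per-file (min, max) statistics taken from the example above.
stats = {
    "file1.parquet": {"year": (2018, 2019), "uid": (12000, 23000)},
    "file2.parquet": {"year": (2018, 2020), "uid": (12000, 14000)},
    "file3.parquet": {"year": (2020, 2020), "uid": (23000, 25000)},
}


def files_to_scan(stats, predicates):
    """Keep files whose [min, max] range could contain every requested value."""
    selected = []
    for name, columns in stats.items():
        if all(columns[col][0] <= value <= columns[col][1]
               for col, value in predicates.items()):
            selected.append(name)
    return selected


# WHERE year=2020 AND uid=24000
print(files_to_scan(stats, {"year": 2020, "uid": 24000}))
```

Only file3.parquet can match: file1 has no rows from 2020, and file2's uid range stops at 14000, so both are skipped without being read.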
Optimization 2: Auxiliary Data Structures
§ Even if the base data is in Parquet, we can build many other data
structures to speed up queries and maintain them transactionally
▪ Inspired by the literature on databases for “raw” data formats
§ Example: indexes over Parquet files
[Diagram: a tree index over file1.parquet, file2.parquet and file3.parquet]
Query: SELECT * FROM events
WHERE type = 'DELETE_ACCOUNT'
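As a rough illustration of how an index narrows a lookup to the files that hold a value, the sketch below uses a sorted list with binary search as a simple stand-in for a tree index. The index entries are hypothetical, and `files_for` is an illustrative name.

```python
from bisect import bisect_left

# Hypothetical index entries: (event type, file containing it), kept sorted.
index = sorted([
    ("CREATE_ACCOUNT", "file1.parquet"),
    ("CREATE_ACCOUNT", "file2.parquet"),
    ("DELETE_ACCOUNT", "file3.parquet"),
    ("LOGIN", "file1.parquet"),
    ("LOGIN", "file2.parquet"),
    ("LOGIN", "file3.parquet"),
])


def files_for(index, event_type):
    """Binary-search the sorted entries and collect the files for one key."""
    i = bisect_left(index, (event_type, ""))  # first entry with this key, if any
    files = []
    while i < len(index) and index[i][0] == event_type:
        files.append(index[i][1])
        i += 1
    return files


# WHERE type = 'DELETE_ACCOUNT'
print(files_for(index, "DELETE_ACCOUNT"))
```

With these assumed entries, the query only needs to open file3.parquet.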
Optimization 3: Data Layout
§ Query execution time primarily depends on amount of data accessed
§ Even with a fixed storage format such as Parquet, we can optimize the
data layout within tables to reduce execution time
§ Example: sorting a table for fast access
file1.parquet: uid = 0…1000
file2.parquet: uid = 1001…2000
file3.parquet: uid = 2001…3000
file4.parquet: uid = 3001…4000
Optimization 3: Data Layout
§ Query execution time primarily depends on amount of data accessed
§ Even with a fixed storage format such as Parquet, we can optimize the
data layout within tables to reduce execution time
§ Example: Z-ordering for multi-dimensional access
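[Illustration: Z-order curve over dimension 1 and dimension 2]
[Chart: Data Files Skipped, Sort by col1 vs Z-Order by col1-4]
Filter on col1: 99% (Sort by col1), 67% (Z-Order by col1-4)
Filter on col2: 0% (Sort by col1), 60% (Z-Order by col1-4)
Filter on col3: 0% (Sort by col1), 47% (Z-Order by col1-4)
Filter on col4: 0% (Sort by col1), 44% (Z-Order by col1-4)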
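Z-ordering can be sketched as sorting rows by a key that interleaves the bits of several columns, so that rows close together in every dimension land near each other in the file layout. This is a minimal two-dimensional sketch; `z_value` is an illustrative name, and production engines handle more columns and non-integer types.

```python
def z_value(x, y, bits=16):
    """Interleave the low `bits` bits of x and y into one Z-order key."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)       # bits of x go to even positions
        z |= ((y >> i) & 1) << (2 * i + 1)   # bits of y go to odd positions
    return z


# Order a 4x4 grid of (x, y) points along the Z-order curve.
points = [(x, y) for x in range(4) for y in range(4)]
ordered = sorted(points, key=lambda p: z_value(*p))
print(ordered[:4])
```

The first four points in Z-order form a 2x2 block, so a file holding consecutive keys covers a narrow range in both dimensions, which lets min/max statistics skip files for filters on either column.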
Optimization 4: Vectorized Execution
§ Modern data warehouses optimize CPU time by using vector (SIMD)
instructions on modern CPUs, e.g., AVX512
§ Many of these optimizations can also be applied over Parquet
§ Databricks Delta Engine: ~10x faster
than Java-based engines
[Chart: TPC-DS Benchmark Time (s): Delta Engine 3220, Apache Spark 3.0 25544, Presto 230 35861]
Putting These Optimizations Together
§ Given that (1) most reads are from a cache, (2) I/O cost is the key factor
for non-cached data, and (3) CPU time can be optimized via SIMD…
§ Lakehouse engines can offer similar performance to DWs!
[Chart: TPC-DS 30TB Benchmark Time (s) for DW1, DW2, DW3 and Delta Engine]
Key Technologies Enabling Lakehouse
1. Metadata layers for data lakes: add transactions, versioning & more
2. New query engine designs: great SQL performance on data lake
storage systems and file formats
3. Optimized access for data science & ML
ML over a Data Warehouse is Painful
§ Unlike SQL workloads, ML workloads need to process large amounts of
data with non-SQL code (e.g. TensorFlow, XGBoost, etc)
§ SQL over JDBC/ODBC interface is too slow for this at scale
§ Export data to a data lake? → adds a third ETL step and more staleness!
§ Maintain production datasets in both DW & lake? → even more complex
ML over a Lakehouse
§ Direct access to data files without overloading the SQL frontend
(e.g., just run a GPU cluster to do deep learning on S3 data)
▪ ML frameworks already support reading Parquet!
§ New declarative APIs for ML data prep enable further optimization
Example: Spark’s Declarative DataFrame API
Users write DataFrame code in Python, R or Java
users = spark.table("users")
buyers = users[users.kind == "buyer"]
train_set = buyers["start_date", "zip", "quantity"].fillna(0)
...
model.fit(train_set)
Lazily evaluated query plan:
users → SELECT(kind = "buyer") → PROJECT(start_date, zip, …) → PROJECT(NULL → 0)
Optimized execution using cache, statistics, index, etc
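Lazy evaluation can be illustrated with a toy class that only records each step and runs the steps when results are requested. `LazyTable` and its methods are hypothetical and far simpler than Spark's DataFrame API; the sample rows are made up for the example.

```python
class LazyTable:
    """Hypothetical stand-in for a DataFrame: records steps, runs them on collect()."""

    def __init__(self, rows, plan=()):
        self._rows = rows
        self.plan = plan  # tuple of (description, function) pairs

    def where(self, description, predicate):
        step = (f"SELECT({description})",
                lambda rows: [r for r in rows if predicate(r)])
        return LazyTable(self._rows, self.plan + (step,))

    def select(self, *columns):
        step = (f"PROJECT({', '.join(columns)})",
                lambda rows: [{c: r.get(c) for c in columns} for r in rows])
        return LazyTable(self._rows, self.plan + (step,))

    def fillna(self, value):
        step = (f"PROJECT(NULL -> {value})",
                lambda rows: [{k: value if v is None else v for k, v in r.items()}
                              for r in rows])
        return LazyTable(self._rows, self.plan + (step,))

    def explain(self):
        """Return the recorded plan without touching any data."""
        return [description for description, _ in self.plan]

    def collect(self):
        """Run the recorded steps in order and return the resulting rows."""
        rows = self._rows
        for _, fn in self.plan:
            rows = fn(rows)
        return rows


users = LazyTable([
    {"kind": "buyer", "start_date": "2020-01-01", "zip": "94105", "quantity": None},
    {"kind": "seller", "start_date": "2020-02-01", "zip": "10001", "quantity": 3},
])
train_set = (users.where('kind = "buyer"', lambda r: r["kind"] == "buyer")
                  .select("start_date", "zip", "quantity")
                  .fillna(0))

print(train_set.explain())  # the plan exists before any row is processed
print(train_set.collect())  # work happens only here
```

Because the whole plan is known before execution, an engine is free to reorder or prune steps, for example by reading only the three projected columns from storage.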
ML over Lakehouse: Management Features
Lakehouse systems’ management features also make ML easier!
§ Use time travel for data versioning and reproducible experiments
§ Use transactions to reliably update tables
§ Always access the latest data from streaming I/O
Example: organizations using Delta Lake as an ML “feature store”
Summary
Lakehouse systems combine the benefits of data warehouses & lakes
§ Management features via metadata
layers (transactions, CLONE, etc)
§ Performance via new query engines
§ Direct access via open file formats
§ Low cost equal to cloud storage
[Diagram: Streaming Analytics, BI, Data Science and Machine Learning over Structured, Semi-Structured & Unstructured Data]
Result: simplify data architectures to
improve both reliability & timeliness
Before and After Lakehouse
Typical Architecture with Many Data Systems
[Diagram: input, Message Queue, Cloud Object Store, ETL Jobs, Parquet Tables 1-3 and two Data Warehouses, serving Streaming Analytics, Data Scientists and BI Users]
Lakehouse Architecture: All Data in Object Store
[Diagram: input, Cloud Object Store, ETL Jobs and Delta Lake Tables 1-3, serving Streaming Analytics, Data Scientists and BI Users]
Fewer copies of the data, fewer ETL steps, no divergence & faster results!
Learn More
Download and learn Delta Lake at delta.io
View free content from our conferences at spark-summit.org:

 
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night StandCall Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 

Kürzlich hochgeladen (20)

Aspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - AlmoraAspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - Almora
 
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
 
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
 
Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...
Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...
Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
 
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men 🔝Mathura🔝 Escorts...
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men  🔝Mathura🔝   Escorts...➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men  🔝Mathura🔝   Escorts...
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men 🔝Mathura🔝 Escorts...
 
Predicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science ProjectPredicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science Project
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
 
CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
 
Anomaly detection and data imputation within time series
Anomaly detection and data imputation within time seriesAnomaly detection and data imputation within time series
Anomaly detection and data imputation within time series
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
 
Just Call Vip call girls roorkee Escorts ☎️9352988975 Two shot with one girl ...
Just Call Vip call girls roorkee Escorts ☎️9352988975 Two shot with one girl ...Just Call Vip call girls roorkee Escorts ☎️9352988975 Two shot with one girl ...
Just Call Vip call girls roorkee Escorts ☎️9352988975 Two shot with one girl ...
 
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night StandCall Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
 
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
 

Making Data Timelier and More Reliable with Lakehouse Technology

  • 1. Making Data Timelier and More Reliable with Lakehouse Technology Matei Zaharia Databricks and Stanford University
  • 2. Talk Outline § Many problems with data analytics today stem from the complex data architectures we use § New “Lakehouse” technologies can remove this complexity by enabling fast data warehousing, streaming & ML directly on data lake storage
  • 3. The biggest challenges with data today: data quality and staleness
  • 4. Data Analyst Survey: 60% reported data quality as top challenge; 86% of analysts had to use stale data, with 41% using data that is >2 months old; 90% regularly had unreliable data sources
  • 6. Getting high-quality, timely data is hard… but it’s partly a problem of our own making!
  • 8. 1980s: Data Warehouses § ETL data directly from operational database systems § Purpose-built for SQL analytics & BI: schemas, indexes, caching, etc § Powerful management features such as ACID transactions and time travel [Diagram labels: Operational Data; ETL; Data Warehouses; BI; Reports]
  • 9. 2010s: New Problems for Data Warehouses § Could not support rapidly growing unstructured and semi-structured data: time series, logs, images, documents, etc § High cost to store large datasets § No support for data science & ML [Diagram labels: Operational Data; ETL; Data Warehouses; BI; Reports]
  • 10. 2010s: Data Lakes § Low-cost storage to hold all raw data (e.g. Amazon S3, HDFS) ▪ $12/TB/month for S3 infrequent tier! § ETL jobs then load specific data into warehouses, possibly for further ELT § Directly readable in ML libraries (e.g. TensorFlow) due to open file format [Diagram labels: Structured, Semi-Structured & Unstructured Data; Data Lake; Data Preparation; ETL; Real-Time Database; Data Warehouses; BI; Reports; Data Science; Machine Learning]
  • 11. Problems with Today’s Data Lakes Cheap to store all the data, but system architecture is much more complex! Data reliability suffers: § Multiple storage systems with different semantics, SQL dialects, etc § Extra ETL steps that can go wrong Timeliness suffers: § Extra ETL steps before data is available in data warehouses
  • 13. Summary At least some of the problems in modern data architectures are due to unnecessary system complexity § We wanted low-cost storage for large historical data, but we designed separate storage systems (data lakes) for that § Now we need to sync data across systems all the time! What if we didn’t need to have all these different data systems?
  • 14. Lakehouse Technology New techniques to provide data warehousing features directly on data lake storage § Retain existing open file formats (e.g. Apache Parquet, ORC) § Add management and performance features on top (transactions, data versioning, indexes, etc) § Can also help eliminate other data systems, e.g. message queues Key parts: metadata layers such as Delta Lake (from Databricks) and Apache Iceberg (from Netflix) + new engine designs
  • 15. Lakehouse Vision § Data lake storage for all data (structured, semi-structured & unstructured) § Management features (transactions, versioning, etc) § Single platform for every use case (streaming analytics, BI, data science, machine learning)
  • 16. Key Technologies Enabling Lakehouse 1. Metadata layers for data lakes: add transactions, versioning & more 2. New query engine designs: great SQL performance on data lake storage systems and file formats 3. Optimized access for data science & ML
  • 18. Metadata Layers for Data Lakes § A data lake is normally just a collection of files § Metadata layers keep track of which files are part of a table to enable richer management features such as transactions ▪ Clients can then still access the underlying files at high speed § Implemented in multiple systems: ACID Keep metadata in the object store itself Keep metadata in a database
  • 19. Example: Basic Data Lake § “events” table: file1.parquet, file2.parquet, file3.parquet § Query: delete all events data about customer #17 ▪ Rewrite file1.parquet as file1b.parquet and file3.parquet as file3b.parquet ▪ Then delete file1.parquet and file3.parquet § Problem: What if a query reads the table while the delete is running? § Problem: What if the query doing the delete fails partway through?
  • 20. Example with Delta Lake § “events” table: file1.parquet, file2.parquet, file3.parquet § _delta_log / v1.parquet, v2.parquet: track which files are part of each version of the table (e.g. v2 = file1, file2, file3) § Query: delete all events data about customer #17 ▪ Rewrite file1.parquet as file1b.parquet and file3.parquet as file3b.parquet ▪ Atomically add new log file _delta_log / v3.parquet (v3 = file1b, file2, file3b) § Clients now always read a consistent table version! ▪ If a client reads v2 of log, it sees file1, file2, file3 (no delete) ▪ If a client reads v3 of log, it sees file1b, file2, file3b (all deleted) § See our VLDB 2020 paper for details
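The versioned-log idea on this slide can be sketched in a few lines of Python. This is only an in-memory illustration under stated assumptions: `TableLog`, `commit` and `snapshot` are hypothetical names, not Delta Lake's API, and a real log lives as files in the object store rather than in a dictionary.

```python
class TableLog:
    """Hypothetical in-memory stand-in for a table's transaction log."""

    def __init__(self):
        # version number -> immutable set of data files in that version
        self.versions = {0: frozenset()}  # version 0 is the empty table

    def latest(self):
        return max(self.versions)

    def snapshot(self, version=None):
        """Files visible at a version (default: the latest one)."""
        return self.versions[self.latest() if version is None else version]

    def commit(self, read_version, add=(), remove=()):
        """Publish a new version in one step, or fail if the table moved on."""
        if read_version != self.latest():
            raise RuntimeError("conflict: another writer committed first")
        files = (self.versions[read_version] - frozenset(remove)) | frozenset(add)
        self.versions[read_version + 1] = files
        return read_version + 1


log = TableLog()
v1 = log.commit(0, add=["file1.parquet", "file2.parquet"])
v2 = log.commit(v1, add=["file3.parquet"])

# Delete customer #17: the rewritten files only become visible at commit time.
v3 = log.commit(
    v2,
    add=["file1b.parquet", "file3b.parquet"],
    remove=["file1.parquet", "file3.parquet"],
)

print(sorted(log.snapshot(v2)))  # reader pinned to v2: no delete applied
print(sorted(log.snapshot(v3)))  # reader on v3: delete fully applied
```

A reader never observes a half-finished delete because it resolves one version and then reads only the files listed for it. If the delete job dies before `commit`, the rewritten files are simply never referenced.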
  • 21. Other Management Features with Delta Lake § Time travel to an old table version: SELECT * FROM my_table TIMESTAMP AS OF "2020-05-01" § Zero-copy CLONE by forking the log: CREATE TABLE my_table_dev SHALLOW CLONE my_table § DESCRIBE HISTORY § INSERT, UPSERT, DELETE & MERGE
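Time travel and zero-copy cloning both fall out of keeping a commit history. The sketch below is a simplified model, assuming a history of (commit time, file list) pairs; the dates, file names and the helpers `files_as_of` and `shallow_clone` are invented for illustration and are not Delta Lake functions.

```python
from bisect import bisect_right
from datetime import datetime

# Hypothetical commit history: (commit time, data files in that version).
history = [
    (datetime(2020, 4, 1), ("file1.parquet",)),
    (datetime(2020, 4, 20), ("file1.parquet", "file2.parquet")),
    (datetime(2020, 5, 10), ("file1b.parquet", "file2.parquet")),
]


def files_as_of(history, when):
    """Files of the last version committed at or before `when`."""
    times = [t for t, _ in history]
    i = bisect_right(times, when)
    if i == 0:
        raise ValueError("table did not exist at that time")
    return history[i - 1][1]


def shallow_clone(history):
    """Fork the log: a new history list that references the same data files."""
    return list(history)


# Analogue of: SELECT * FROM my_table TIMESTAMP AS OF "2020-05-01"
as_of_may_1 = files_as_of(history, datetime(2020, 5, 1))

# Analogue of: CREATE TABLE my_table_dev SHALLOW CLONE my_table
dev = shallow_clone(history)
dev.append(
    (datetime(2020, 5, 11), ("file1b.parquet", "file2.parquet", "dev_only.parquet"))
)
```

The clone copies only log entries, so writes to `dev` add new entries and new files without touching the source table's history.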
  • 22. Other Management Features with Delta Lake § Streaming I/O: treat a table as a stream of changes to remove need for message buses like Kafka: spark.readStream.format("delta").table("events") § Schema enforcement & evolution § Expectations for data quality: CREATE TABLE orders ( product_id INTEGER NOT NULL, quantity INTEGER CHECK(quantity > 0), list_price DECIMAL CHECK(list_price > 0), discount_price DECIMAL CHECK(discount_price > 0 AND discount_price <= list_price) );
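To make the data-quality expectations concrete, here is a rough Python equivalent of the constraints in the `orders` table definition. It is a sketch only: `EXPECTATIONS` and `violated` are hypothetical names, the sample rows are invented, and every checked column is assumed to be present and non-null apart from the explicit `product_id` test.

```python
from decimal import Decimal

# One predicate per constraint in the CREATE TABLE statement on the slide.
EXPECTATIONS = {
    "product_id NOT NULL": lambda r: r["product_id"] is not None,
    "quantity > 0": lambda r: r["quantity"] > 0,
    "list_price > 0": lambda r: r["list_price"] > 0,
    "discount_price > 0 AND discount_price <= list_price": lambda r: (
        0 < r["discount_price"] <= r["list_price"]
    ),
}


def violated(row):
    """Names of the expectations that this row fails, in declaration order."""
    return [name for name, check in EXPECTATIONS.items() if not check(row)]


good = {
    "product_id": 1,
    "quantity": 2,
    "list_price": Decimal("10.00"),
    "discount_price": Decimal("8.00"),
}
bad = {
    "product_id": 2,
    "quantity": 0,
    "list_price": Decimal("10.00"),
    "discount_price": Decimal("12.00"),
}

print(violated(good))
print(violated(bad))
```

A writer that runs such checks before committing can reject a bad batch as a whole, so readers never see rows that break the table's declared expectations.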
  • 23. Adoption § Used by thousands of companies to process exabytes of data/day § Grew from zero to ~50% of the Databricks workload in 3 years § Largest deployments: exabyte tables and 1000s of users
  • 25. Key Technologies Enabling Lakehouse 1. Metadata layers for data lakes: add transactions, versioning & more 2. New query engine designs: great SQL performance on data lake storage systems and file formats 3. Optimized access for data science & ML
  • 26. The Challenge § Most data warehouses have full control over the data storage system and query engine, so they design them together § The key idea in a Lakehouse is to store data in open storage formats (e.g. Parquet) for direct access from many systems § How can we get great performance with these standard, open formats?
  • 27. Enabling Lakehouse Performance Even with a fixed, directly-accessible storage format, four optimizations can enable great SQL performance: § Caching hot data, possibly in a different format § Auxiliary data structures like statistics and indexes § Data layout optimizations to minimize I/O § Vectorized execution engines for modern CPUs New query engines such as Databricks Delta Engine use these ideas
  • 28. Optimization 1: Caching § Most query workloads have concentrated accesses on “hot” data ▪ Data warehouses use SSD and memory caches to improve performance § The same techniques work in a Lakehouse if we have a metadata layer such as Delta Lake to correctly maintain the cache ▪ Caches can even hold data in a faster format (e.g. decompressed) § Example: SSD cache in Databricks Delta Engine [Chart: values read per second per core (millions), comparing Parquet on S3, Parquet on SSD, and Delta Engine cache]
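A minimal sketch of a hot-data cache, assuming data files are never modified in place so that a file name is a safe cache key (updates arrive as new files listed in a new table version). `FileCache` and `slow_fetch` are hypothetical; this is a plain LRU policy, not the Delta Engine cache.

```python
from collections import OrderedDict


class FileCache:
    """Tiny LRU cache in front of a slow fetch function (hypothetical sketch)."""

    def __init__(self, capacity, fetch):
        self.capacity = capacity
        self.fetch = fetch            # file name -> contents (the slow path)
        self.entries = OrderedDict()  # least recently used entry comes first
        self.hits = 0
        self.misses = 0

    def read(self, name):
        if name in self.entries:
            self.entries.move_to_end(name)  # mark as most recently used
            self.hits += 1
            return self.entries[name]
        self.misses += 1
        data = self.fetch(name)
        self.entries[name] = data
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used
        return data


def slow_fetch(name):
    # Stand-in for a read from object storage.
    return f"<contents of {name}>"


cache = FileCache(capacity=2, fetch=slow_fetch)
for name in ["a.parquet", "a.parquet", "b.parquet", "c.parquet", "a.parquet"]:
    cache.read(name)

print(cache.hits, cache.misses, list(cache.entries))
```

The cached value could just as well be a decoded or decompressed form of the file, which is the "faster format" the slide mentions.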
  • 29. Optimization 2: Auxiliary Data Structures § Even if the base data is in Parquet, we can build many other data structures to speed up queries and maintain them transactionally ▪ Inspired by the literature on databases for “raw” data formats § Example: min/max statistics on Parquet files for data skipping, updated transactionally with Delta table log ▪ file1.parquet: year min 2018, max 2019; uid min 12000, max 23000 ▪ file2.parquet: year min 2018, max 2020; uid min 12000, max 14000 ▪ file3.parquet: year min 2020, max 2020; uid min 23000, max 25000 ▪ Query: SELECT * FROM events WHERE year=2020 AND uid=24000
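Data skipping with min/max statistics can be worked through with the numbers on this slide. The sketch assumes the three statistics blocks belong to file1, file2 and file3 in the order listed; `FILE_STATS` and `files_to_scan` are hypothetical names.

```python
# Per-file (min, max) statistics, taken from the slide.
FILE_STATS = {
    "file1.parquet": {"year": (2018, 2019), "uid": (12000, 23000)},
    "file2.parquet": {"year": (2018, 2020), "uid": (12000, 14000)},
    "file3.parquet": {"year": (2020, 2020), "uid": (23000, 25000)},
}


def files_to_scan(stats, equals):
    """Keep only files whose min/max ranges could contain every requested value."""
    keep = []
    for name, columns in stats.items():
        if all(columns[c][0] <= v <= columns[c][1] for c, v in equals.items()):
            keep.append(name)
    return keep


# SELECT * FROM events WHERE year=2020 AND uid=24000
to_scan = files_to_scan(FILE_STATS, {"year": 2020, "uid": 24000})
print(to_scan)  # ['file3.parquet']
```

file1 is skipped on `year`, file2 on `uid`, so only file3 has to be read. The statistics can only rule files out; a file that passes the range test may still hold no matching rows.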
  • 31. Optimization 2: Auxiliary Data Structures § Even if the base data is in Parquet, we can build many other data structures to speed up queries and maintain them transactionally ▪ Inspired by the literature on databases for “raw” data formats § Example: indexes over Parquet files ▪ Query: SELECT * FROM events WHERE type = "DELETE_ACCOUNT" [Diagram: tree index over file1.parquet, file2.parquet, file3.parquet]
  • 32. Optimization 3: Data Layout § Query execution time primarily depends on amount of data accessed § Even with a fixed storage format such as Parquet, we can optimize the data layout within tables to reduce execution time § Example: sorting a table for fast access ▪ file1.parquet: uid = 0…1000 ▪ file2.parquet: uid = 1001…2000 ▪ file3.parquet: uid = 2001…3000 ▪ file4.parquet: uid = 3001…4000
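When a table is sorted by `uid`, each file covers one contiguous range, so a point lookup touches a single file. A small sketch using the ranges from the slide (the helper name `file_for_uid` is hypothetical):

```python
from bisect import bisect_left

# (largest uid in the file, file name), in sorted order as on the slide.
FILES = [
    (1000, "file1.parquet"),
    (2000, "file2.parquet"),
    (3000, "file3.parquet"),
    (4000, "file4.parquet"),
]
UPPER_BOUNDS = [upper for upper, _ in FILES]


def file_for_uid(uid):
    """Binary-search the upper bounds to find the one file that can hold `uid`."""
    if uid < 0:
        return None
    i = bisect_left(UPPER_BOUNDS, uid)
    return FILES[i][1] if i < len(FILES) else None


print(file_for_uid(1000), file_for_uid(1001), file_for_uid(4001))
```

The same table sorted by `uid` gives no such help for filters on other columns, which is the gap that multi-dimensional layouts try to close.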
  • 33. Optimization 3: Data Layout § Query execution time primarily depends on amount of data accessed § Even with a fixed storage format such as Parquet, we can optimize the data layout within tables to reduce execution time § Example: Z-ordering for multi-dimensional access [Diagram axes: dimension 1, dimension 2] [Chart: data files skipped when filtering on col1 / col2 / col3 / col4. Sort by col1: 99% / 0% / 0% / 0%. Z-Order by col1-4: 67% / 60% / 47% / 44%]
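Z-ordering can be illustrated with a two-column bit-interleaving (Morton code) sketch. This is a toy under stated assumptions: a 4 x 4 grid of integer points, four rows per file, and a hypothetical helper `z_value`; production engines handle more columns and arbitrary types.

```python
def z_value(x, y, bits=16):
    """Interleave the low `bits` bits of x and y into a single Morton code."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)      # x fills the even bit positions
        z |= ((y >> i) & 1) << (2 * i + 1)  # y fills the odd bit positions
    return z


# Lay out a 4 x 4 grid of (x, y) rows in Z-order, four rows per file.
points = [(x, y) for x in range(4) for y in range(4)]
ordered = sorted(points, key=lambda p: z_value(*p))
files = [ordered[i:i + 4] for i in range(0, len(ordered), 4)]

# Per-file (min, max) ranges for each dimension.
ranges = [
    ((min(x for x, _ in f), max(x for x, _ in f)),
     (min(y for _, y in f), max(y for _, y in f)))
    for f in files
]
for r in ranges:
    print(r)
```

Each file ends up covering a small square of the grid, so a filter on either `x` or `y` alone can skip half of the files here. Sorting by `x` only would help filters on `x` and leave every file spanning the full range of `y`.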
  • 34. Optimization 4: Vectorized Execution § Modern data warehouses optimize CPU time by using vector (SIMD) instructions on modern CPUs, e.g., AVX512 § Many of these optimizations can also be applied over Parquet § Databricks Delta Engine: ~10x faster than Java-based engines [Chart: TPC-DS benchmark time (s). Delta Engine: 3220; Apache Spark 3.0: 25544; Presto 230: 35861]
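The "~10x" figure can be checked against the chart values with a line of arithmetic, assuming the three times map to the engines in the order they are listed on the slide:

```python
# TPC-DS benchmark times in seconds, as read off the slide's chart.
times = {"Delta Engine": 3220, "Apache Spark 3.0": 25544, "Presto 230": 35861}

baseline = times["Delta Engine"]
speedup = {
    engine: round(seconds / baseline, 1)
    for engine, seconds in times.items()
    if engine != "Delta Engine"
}
print(speedup)  # {'Apache Spark 3.0': 7.9, 'Presto 230': 11.1}
```

Under that reading the measured ratios are about 8x and 11x, which is consistent with the rounded "~10x" on the slide.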
  • 35. Putting These Optimizations Together § Given that (1) most reads are from a cache, (2) I/O cost is the key factor for non-cached data, and (3) CPU time can be optimized via SIMD… § Lakehouse engines can offer similar performance to DWs! [Chart: TPC-DS 30TB benchmark time (s) for DW1, DW2, DW3 and Delta Engine]
  • 36. Key Technologies Enabling Lakehouse 1. Metadata layers for data lakes: add transactions, versioning & more 2. New query engine designs: great SQL performance on data lake storage systems and file formats 3. Optimized access for data science & ML
  • 37. ML over a Data Warehouse is Painful § Unlike SQL workloads, ML workloads need to process large amounts of data with non-SQL code (e.g. TensorFlow, XGBoost, etc) § SQL over JDBC/ODBC interface is too slow for this at scale § Export data to a data lake? → adds a third ETL step and more staleness! § Maintain production datasets in both DW & lake? → even more complex
  • 38. ML over a Lakehouse § Direct access to data files without overloading the SQL frontend (e.g., just run a GPU cluster to do deep learning on S3 data) ▪ ML frameworks already support reading Parquet! § New declarative APIs for ML data prep enable further optimization
  • 39. Example: Spark’s Declarative DataFrame API Users write DataFrame code in Python, R or Java: users = spark.table("users"); buyers = users[users.kind == "buyer"]; train_set = buyers["start_date", "zip", "quantity"].fillna(0)
  • 40. Example: Spark’s Declarative DataFrame API Users write DataFrame code in Python, R or Java: users = spark.table("users"); buyers = users[users.kind == "buyer"]; train_set = buyers["start_date", "zip", "quantity"].fillna(0); ... model.fit(train_set) § Lazily evaluated query plan: users, SELECT(kind = "buyer"), PROJECT(start_date, zip, …), PROJECT(NULL → 0) § Optimized execution using cache, statistics, index, etc
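Lazy evaluation of a DataFrame-style pipeline can be mimicked in plain Python. The `LazyFrame` class below is a hypothetical stand-in, not Spark's implementation: each call only records a plan step, and nothing touches the rows until `collect()` runs, which is the point at which an engine could reorder or optimize the plan. The sample rows are invented.

```python
class LazyFrame:
    """Records operations as a plan and runs them only on collect()."""

    def __init__(self, rows, plan=()):
        self._rows = rows
        self.plan = plan  # tuple of (description, function) steps

    def _with(self, description, fn):
        return LazyFrame(self._rows, self.plan + ((description, fn),))

    def filter(self, column, value):
        return self._with(
            f"SELECT({column} = {value!r})",
            lambda rows: (r for r in rows if r[column] == value),
        )

    def select(self, *columns):
        return self._with(
            f"PROJECT({', '.join(columns)})",
            lambda rows: ({c: r[c] for c in columns} for r in rows),
        )

    def fillna(self, default):
        return self._with(
            f"PROJECT(NULL -> {default!r})",
            lambda rows: (
                {k: default if v is None else v for k, v in r.items()} for r in rows
            ),
        )

    def explain(self):
        return [description for description, _ in self.plan]

    def collect(self):
        rows = iter(self._rows)
        for _, fn in self.plan:
            rows = fn(rows)
        return list(rows)


users = LazyFrame([
    {"kind": "buyer", "start_date": "2020-01-05", "zip": "94305", "quantity": None},
    {"kind": "seller", "start_date": "2020-01-09", "zip": "10001", "quantity": 3},
    {"kind": "buyer", "start_date": "2020-02-11", "zip": None, "quantity": 2},
])
train_set = (
    users.filter("kind", "buyer")
    .select("start_date", "zip", "quantity")
    .fillna(0)
)

print(train_set.explain())  # the plan exists before any row is processed
print(train_set.collect())  # execution happens here
```

Because the whole plan is visible before execution, an engine is free to push the filter down to the storage layer and use caches, statistics and indexes when it finally runs.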
  • 41. ML over Lakehouse: Management Features Lakehouse systems’ management features also make ML easier! § Use time travel for data versioning and reproducible experiments § Use transactions to reliably update tables § Always access the latest data from streaming I/O Example: organizations using Delta Lake as an ML “feature store”
  • 42. Summary Lakehouse systems combine the benefits of data warehouses & lakes § Management features via metadata layers (transactions, CLONE, etc) § Performance via new query engines § Direct access via open file formats § Low cost equal to cloud storage Result: simplify data architectures to improve both reliability & timeliness [Diagram labels: Structured, Semi-Structured & Unstructured Data; Streaming Analytics; BI; Data Science; Machine Learning]
  • 43. Before and After Lakehouse § Typical Architecture with Many Data Systems [Diagram labels: input; Message Queue; Cloud Object Store; ETL Jobs; Parquet Tables 1-3; Data Warehouses; Streaming Analytics; Data Scientists; BI Users] § Lakehouse Architecture: All Data in Object Store [Diagram labels: input; Cloud Object Store; ETL Jobs; Delta Lake Tables 1-3; Streaming Analytics; Data Scientists; BI Users] Fewer copies of the data, fewer ETL steps, no divergence & faster results!
  • 44. Learn More Download and learn Delta Lake at delta.io View free content from our conferences at spark-summit.org: