14.05.2023 SQLDay 2023 1
Azure Databricks – performance tuning
Adrian Chodkowski
About me
• Adrian Chodkowski
• Microsoft Data Platform MVP
• Data Engineer & Architect at Elitmind
• Specialization: Microsoft data platform (Azure & on-premises)
• Data Community
• seequality.net
• Adrian.Chodkowski@outlook.com
• @Twitter: Adrian_SQL
• LinkedIn: http://tinyurl.com/adrian-sql
AGENDA
• Databricks disk cache
• AutoLoader
• Static & Dynamic Partition pruning
• File pruning
• Z-Ordering
• Additional tips & summary
Level: 300
Databricks disk cache
aka Delta cache, DBIO
Disk caching
• Disk caching is a proprietary Databricks feature, previously called Delta Cache and DBIO,
• It is different from the Spark cache!
• Works with Parquet and Delta,
• Databricks uses disk caching to accelerate data reads by creating copies of remote Parquet data files in the nodes' local storage,
• Successive reads of the same data are then performed locally,
• Once enabled, caching is managed automatically as data is read, or it can be forced with the CACHE SELECT syntax,
• You benefit most by using cache-accelerated worker instance types.
Disk cache vs Spark cache
• Prefer the disk cache!
Disk caching - configuration
• spark.conf.set("spark.databricks.io.cache.enabled", "[true | false]")
• Be careful with autoscaling: the cache lives in each worker's local storage, so a worker removed on scale-down takes its cached data with it,
• spark.databricks.io.cache.maxDiskUsage: disk space per node reserved for cached data, in bytes,
• spark.databricks.io.cache.maxMetaDataCache: disk space per node reserved for cached metadata, in bytes,
• spark.databricks.io.cache.compression.enabled: whether the cached data is stored in compressed format.
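As a rough sketch of how these settings fit together, the helper below renders them as "key value" lines of the kind pasted into a cluster's Spark config. The helper name and the example sizes are illustrative assumptions, not recommendations from the talk.

```python
def disk_cache_spark_conf(max_disk_usage_bytes, max_metadata_cache_bytes, compression=False):
    """Render disk cache settings as 'key value' lines for a cluster Spark config."""
    settings = {
        "spark.databricks.io.cache.enabled": "true",
        "spark.databricks.io.cache.maxDiskUsage": str(max_disk_usage_bytes),
        "spark.databricks.io.cache.maxMetaDataCache": str(max_metadata_cache_bytes),
        "spark.databricks.io.cache.compression.enabled": str(compression).lower(),
    }
    return "\n".join(f"{key} {value}" for key, value in settings.items())


# Hypothetical sizes: 50 GiB for data and 1 GiB for metadata per node.
conf_text = disk_cache_spark_conf(50 * 1024**3, 1 * 1024**3)
print(conf_text)
```

Only the enabled flag is shown on the slide as a runtime spark.conf.set call; treating the size limits as cluster-level settings is an assumption of this sketch.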
Ingestion
With Autoloader
Autoloader
• Functionality based on Structured Streaming that can identify and load new files,
• Works in two modes:
  • Directory listing: internally lists the structure of files and partitions. Can be done incrementally,
  • File notification: leverages Event Grid and storage queues to track the arrival of new files,
• Avoid overwriting files,
• Have a backfill strategy defined!
• In many cases consider it as the default loading mechanism!
• There is also COPY INTO!
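A minimal sketch of how the two modes and a backfill interval might be expressed as Auto Loader ("cloudFiles") reader options. The helper names and all paths are hypothetical, and `spark` is assumed to be the session a Databricks notebook already provides.

```python
def autoloader_options(file_format, schema_location, use_notifications=False,
                       backfill_interval=None):
    """Build Auto Loader reader options.

    use_notifications=False -> directory listing mode
    use_notifications=True  -> file notification mode
    """
    options = {
        "cloudFiles.format": file_format,
        "cloudFiles.schemaLocation": schema_location,
        "cloudFiles.useNotifications": str(use_notifications).lower(),
    }
    if backfill_interval is not None:
        # Periodic backfill as a safety net for files the notifications missed.
        options["cloudFiles.backfillInterval"] = backfill_interval
    return options


def read_new_files(spark, source_path, options):
    """Return a streaming DataFrame over files arriving in source_path."""
    return spark.readStream.format("cloudFiles").options(**options).load(source_path)


# Hypothetical locations, file notification mode with a daily backfill.
opts = autoloader_options("json", "/mnt/checkpoints/events/_schema",
                          use_notifications=True, backfill_interval="1 day")
# df = read_new_files(spark, "/mnt/landing/events", opts)  # needs a Databricks session
```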
Partitioning
Dynamic Pruning & Design
Partitioning
• Division of a table into a hierarchy of folders,
• Can be beneficial when reading data (partition discovery) and when optimizing (OPTIMIZE with a WHERE on the partition key),
• Databricks recommendation:
  • don't use it for tables smaller than 1 TB,
  • each partition should hold more than 1 GB,
• When a query filters on the partition key, only the matching partitions are read (the rest are skipped),
• Can be created using the PARTITIONED BY or PARTITION keywords,
• Don't use CREATE TABLE PARTITION BY AS SELECT: it adds Hive overhead and can take ages to finish!
[Figure: folder hierarchy of a partitioned table. Year=2003 contains Month=01, Month=02, Month=03 and Month=04; each month folder holds *.parquet files]
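To make the skipping concrete, here is a small pure-Python sketch of static partition pruning over a Year/Month folder layout. The file names are made up for illustration; Spark does the equivalent work on folder names before it opens any file.

```python
# Hypothetical layout: (year, month) partition -> data files in that folder.
partitions = {
    (2003, 1): ["part-0.parquet", "part-1.parquet"],
    (2003, 2): ["part-2.parquet", "part-3.parquet"],
    (2003, 3): ["part-4.parquet", "part-5.parquet"],
    (2003, 4): ["part-6.parquet", "part-7.parquet"],
}


def prune_partitions(partitions, predicate):
    """Keep only partitions whose key satisfies the filter on the partition columns."""
    return {key: files for key, files in partitions.items() if predicate(*key)}


# Equivalent of: WHERE Year = 2003 AND Month = 2
selected = prune_partitions(partitions, lambda year, month: year == 2003 and month == 2)
files_to_read = [name for files in selected.values() for name in files]
print(files_to_read)  # ['part-2.parquet', 'part-3.parquet']
```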
Dynamic Partition Pruning
• Especially useful in a star schema,
• Introduced in Spark 3.0,
• Let's assume the query:
    SELECT d.c1, f.c2
    FROM Fact AS f
    JOIN Dimension AS d
      ON f.join_key = d.join_key
    WHERE d.c2 = 10
• Since there is a filter on one table (d.c2 = 10), internally DPP can create a subquery:
    SELECT d.join_key FROM Dimension AS d
    WHERE d.c2 = 10;
• and then broadcast the result of this subquery, so that it can be used to prune partitions of the other table (Fact).
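The steps can be imitated in plain Python. This is only a sketch of the idea, assuming the fact table is partitioned by join_key; the rows are invented for the example.

```python
dimension = [
    {"join_key": 1, "c1": "a", "c2": 10},
    {"join_key": 2, "c1": "b", "c2": 20},
    {"join_key": 3, "c1": "c", "c2": 10},
]

# Assumed layout: fact table partitioned by join_key (partition value -> rows).
fact_partitions = {
    1: [{"join_key": 1, "c2": 100}],
    2: [{"join_key": 2, "c2": 200}],
    3: [{"join_key": 3, "c2": 300}, {"join_key": 3, "c2": 301}],
    4: [{"join_key": 4, "c2": 400}],
}

# Step 1: run the subquery on the filtered dimension (d.c2 = 10).
filtered_dim = {row["join_key"]: row for row in dimension if row["c2"] == 10}
matching_keys = set(filtered_dim)

# Step 2: use the broadcast key set to prune fact partitions before scanning.
scanned = {key: rows for key, rows in fact_partitions.items() if key in matching_keys}

# Step 3: join only the rows of the partitions that survived pruning.
result = [(filtered_dim[row["join_key"]]["c1"], row["c2"])
          for rows in scanned.values() for row in rows]

print(sorted(scanned))  # [1, 3]
print(result)           # [('a', 100), ('c', 300), ('c', 301)]
```

Partitions 2 and 4 are never scanned, even though the query contains no literal filter on the fact table.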
File pruning
By using stats
File pruning
• In addition to eliminating data at partition granularity, Delta Lake on Databricks dynamically skips unnecessary files when possible,
• Delta Lake automatically collects metadata about data files (files can be skipped without data file access),
• Statistics are taken for the first 32 columns (can be changed).

    SELECT * FROM Dimension AS d
    WHERE d.category_id IN (1, 2, 5);

File     Column        Min   Max
File 1   Category_id     1     2
File 2   Category_id     1    10
File 3   Category_id     6    10
File 4   Category_id     8    20
File 5   Category_id     4   100
File 6   Category_id     1     1
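The min/max check behind this example can be sketched in a few lines of Python. The function name is hypothetical; the statistics are the ones from the table, and the logic is a simplification of what Delta Lake does with its per-file metadata.

```python
# (min, max) of Category_id per file, as in the table.
file_stats = {
    "File 1": (1, 2),
    "File 2": (1, 10),
    "File 3": (6, 10),
    "File 4": (8, 20),
    "File 5": (4, 100),
    "File 6": (1, 1),
}


def files_to_read(file_stats, wanted_values):
    """A file must be read if any wanted value falls inside its [min, max] range."""
    return [name for name, (low, high) in file_stats.items()
            if any(low <= value <= high for value in wanted_values)]


to_read = files_to_read(file_stats, [1, 2, 5])
skipped = [name for name in file_stats if name not in to_read]
print(to_read)  # ['File 1', 'File 2', 'File 5', 'File 6']
print(skipped)  # ['File 3', 'File 4']
```

File 3 and File 4 are skipped without being opened. File 5 still has to be read because its wide range (4 to 100) may contain the value 5.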
Z-Ordering
By using OPTIMIZE
OPTIMIZE Z-Order
• OPTIMIZE optimizes the layout of Delta Lake data files,
• By default OPTIMIZE sets the maximum file size to 1 GB (1073741824 bytes),
• You can control the size with spark.databricks.delta.optimize.maxFileSize,
• Two kinds of optimization are available, bin-packing and column-based Z-ordering:
  • Bin-packing: an idempotent technique that aims to produce evenly balanced data files with respect to their size on disk (not number of rows),
  • Z-ordering: a non-idempotent technique that aims to produce evenly balanced data files with respect to the number of rows (not size on disk).
    OPTIMIZE events

    OPTIMIZE events WHERE date >= '2017-01-01'

    OPTIMIZE events
    WHERE date >= current_timestamp() - INTERVAL 1 day
    ZORDER BY (eventType)
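To illustrate the bin-packing idea, the sketch below groups small files so that each group stays under a target size. First-fit decreasing is used here as a stand-in; the talk does not describe the algorithm Delta actually uses, and the file sizes are hypothetical.

```python
def bin_pack(file_sizes_mb, max_file_size_mb=1024):
    """Group files so each group fits in max_file_size_mb (first-fit decreasing)."""
    bins = []
    for size in sorted(file_sizes_mb, reverse=True):
        for group in bins:
            if sum(group) + size <= max_file_size_mb:
                group.append(size)
                break
        else:
            # No existing group has room, so start a new output file.
            bins.append([size])
    return bins


small_files_mb = [300, 100, 400, 200, 600, 500]  # hypothetical sizes in MB
compacted = bin_pack(small_files_mb)
compacted_sizes = [sum(group) for group in compacted]
print(compacted_sizes)  # [1000, 1000, 100]

# Packing the already compacted files again changes nothing (idempotent).
print([sum(group) for group in bin_pack(compacted_sizes)])  # [1000, 1000, 100]
```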
OPTIMIZE Z-Order
[Figure: five files (File 1 800 MB, File 2 2200 MB, File 3 900 MB, File 4 300 MB, File 5 100 MB) are rewritten as three files of 1 GB each]

Before:
File     Column        Min   Max
File 1   Category_id     1     2
File 2   Category_id     1    10
File 3   Category_id     6    10
File 4   Category_id     8    20
File 5   Category_id     4   100

    OPTIMIZE table
    ZORDER BY (category_id)

After:
File     Column        Min   Max
File 1   Category_id     1    10
File 2   Category_id    11    20
File 3   Category_id    20   100
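Why the rewritten files prune better can be shown with a toy model. For a single column, Z-ordering is approximated here by a plain sort followed by splitting into files of equal row count; that is a simplifying assumption, and the values are invented.

```python
def write_files(values, rows_per_file):
    """Split values into files of equal row count and return (min, max) per file."""
    chunks = [values[i:i + rows_per_file] for i in range(0, len(values), rows_per_file)]
    return [(min(chunk), max(chunk)) for chunk in chunks]


def files_matching(stats, value):
    """Count files whose [min, max] range may contain the value."""
    return sum(1 for low, high in stats if low <= value <= high)


category_ids = [1, 8, 4, 2, 20, 6, 10, 100, 1, 9, 5, 15]  # arrival order

before = write_files(category_ids, rows_per_file=4)
after = write_files(sorted(category_ids), rows_per_file=4)

print(before)  # [(1, 8), (6, 100), (1, 15)]
print(after)   # [(1, 4), (5, 9), (10, 100)]
print(files_matching(before, 9), files_matching(after, 9))  # 2 1
```

After clustering, the ranges barely overlap, so a filter on category_id touches fewer files.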
Dynamic File pruning
• Files can be skipped based on join values, not only literal values,
• To make it happen, the following requirements must be met:
  • The inner table (probe side) being joined is in Delta format,
  • The join type is INNER or LEFT-SEMI,
  • The join strategy is BROADCAST HASH JOIN,
  • The number of files in the inner table is greater than the value set in spark.databricks.optimizer.deltaTableFilesThreshold (default 1000),
  • spark.databricks.optimizer.dynamicFilePruning is true (default),
  • The size of the inner table is greater than spark.databricks.optimizer.deltaTableSizeThreshold (default 10 GB).
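The requirements read naturally as a checklist. The sketch below encodes them as a predicate; the JoinInfo type and its field names are hypothetical, and the defaults are the values quoted on the slide.

```python
from dataclasses import dataclass


@dataclass
class JoinInfo:
    probe_is_delta: bool      # inner (probe side) table stored in Delta format
    join_type: str            # e.g. "INNER", "LEFT-SEMI", "LEFT OUTER"
    join_strategy: str        # e.g. "BROADCAST HASH JOIN", "SORT MERGE JOIN"
    probe_file_count: int
    probe_size_bytes: int


def dfp_eligible(join, enabled=True, files_threshold=1000,
                 size_threshold_bytes=10 * 10**9):
    """Return True when every listed requirement for dynamic file pruning holds."""
    return (
        enabled
        and join.probe_is_delta
        and join.join_type in {"INNER", "LEFT-SEMI"}
        and join.join_strategy == "BROADCAST HASH JOIN"
        and join.probe_file_count > files_threshold
        and join.probe_size_bytes > size_threshold_bytes
    )


big_fact = JoinInfo(True, "INNER", "BROADCAST HASH JOIN", 5000, 50 * 10**9)
small_fact = JoinInfo(True, "INNER", "BROADCAST HASH JOIN", 200, 2 * 10**9)
print(dfp_eligible(big_fact))    # True
print(dfp_eligible(small_fact))  # False
```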
Other
• Use the newest Databricks Runtime,
• For fast UPDATE/MERGE, rewrite as few files as possible: set spark.databricks.delta.optimize.maxFileSize to 16-128 MB or turn on optimized writes,
• Don't use a Python or Scala UDF if a native function exists: moving data between Python and Spark requires serialization, which drastically slows down queries,
• Move numeric columns, keys and high-cardinality query predicates to the left, and move long strings that are not distinct enough for stats collection to the right (only the first 32 columns have statistics),
• OPTIMIZE benefits from Compute Optimized clusters (because of the heavy encoding and decoding of Parquet files),
• Think about spot instances,
• For some operations consider Databricks Standard.
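A small sketch of the column-ordering tip: predicate-friendly columns go first and long strings go last, so the former land inside the window of columns that get statistics. Which types count as "stats-friendly" is an assumption made for this example, as are the column names.

```python
# Assumed set of types worth collecting min/max statistics on.
STATS_FRIENDLY_TYPES = {"int", "bigint", "decimal", "date", "timestamp"}


def order_for_skipping(columns, indexed_cols=32):
    """columns: list of (name, type) pairs.

    Returns the reordered columns and the names that fall inside the
    first `indexed_cols` positions, i.e. the ones that get statistics.
    """
    # sorted() is stable, so the original order is kept within each group.
    ordered = sorted(columns, key=lambda col: col[1] not in STATS_FRIENDLY_TYPES)
    with_stats = [name for name, _ in ordered[:indexed_cols]]
    return ordered, with_stats


columns = [
    ("description", "string"),
    ("customer_id", "bigint"),
    ("comment", "string"),
    ("order_date", "date"),
]
ordered, with_stats = order_for_skipping(columns, indexed_cols=2)
print([name for name, _ in ordered])  # ['customer_id', 'order_date', 'description', 'comment']
print(with_stats)                     # ['customer_id', 'order_date']
```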
Adaptive Query Execution
• Game changer in Spark 3.x,
• For example: a SortMergeJoin is chosen initially, but once the ingest stages complete the plan is updated to use a BroadcastHashJoin.
Other
• Turn Adaptive Query Execution on (default),
• Turn Coalesce Partitions on (spark.sql.adaptive.coalescePartitions.enabled),
• Turn Skew Join on (spark.sql.adaptive.skewJoin.enabled),
• Turn Local Shuffle Reader on (spark.sql.adaptive.localShuffleReader.enabled),
• Tune the broadcast join threshold (spark.sql.autoBroadcastJoinThreshold),
• Don't prefer SortMergeJoin: spark.sql.join.preferSortMergeJoin -> false
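One possible way to apply these switches in a notebook is sketched below. The helper is hypothetical and only assumes an object with a set(key, value) method, such as spark.conf; the broadcast threshold is left out because the slide gives no value for it.

```python
AQE_SETTINGS = {
    "spark.sql.adaptive.enabled": "true",
    "spark.sql.adaptive.coalescePartitions.enabled": "true",
    "spark.sql.adaptive.skewJoin.enabled": "true",
    "spark.sql.adaptive.localShuffleReader.enabled": "true",
    "spark.sql.join.preferSortMergeJoin": "false",
}


def apply_settings(conf, settings):
    """Apply each setting through conf.set(key, value); conf is e.g. spark.conf."""
    for key, value in settings.items():
        conf.set(key, value)


# In a Databricks notebook, where `spark` already exists:
# apply_settings(spark.conf, AQE_SETTINGS)
```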
• Turn off stats collection
  • dataSkippingNumIndexedCols 0
• OPTIMIZE with ZORDER BY the merge keys between Bronze & Silver
• Turn on Optimized Writes
• Restructure columns for skipping
• OPTIMIZE with ZORDER BY join keys or high-cardinality columns used in WHERE
• Turn on Optimized Writes
• Enable Databricks IO cache
• Consider using Photon
• Consider using Premium Storage
• Build pre-aggregated tables
Thank you!