SlideShare ist ein Scribd-Unternehmen logo
1 von 35
Downloaden Sie, um offline zu lesen
Best Practices for building Robust Data Platform
with Apache Spark & Delta
Vini Jaiswal
Spark+AI Summit - June 2020
https://www.linkedin.com/in/vinijaiswal/
▪ Data Strategy
Optimizing the cost to drive Business value
▪ Performance and tuning with Delta Lake & Apache Spark
▪ Governance and security controls
Bringing it all together - A reference architecture
Agenda
Data Strategy
Data Challenges
Data Warehouse limits the potential of
intelligence
Data Volume is growing rapidly
More Variety of data -> Different
applications
Need for faster processing and scalability
Data silos limits innovation
Promise of the Data Lake
1. Collect
Everything
2. Store it all in
the Data Lake
🔥
🔥🔥
3. Data
Science &
Machine
Learning
🔥
🔥
Usual Data Lake
Garbage
In
Garbage
Out
Garbage Stored
Ideal Data Lake with
Ideal data lakes with
No atomicity
No quality enforcement
No consistency /
isolation
✗ Reliability - High Quality Data
● Schema Enforcement
● ACID Transactions
● Time Travel
● Open Standards, Open Source
● Powered by
● Unifies Streaming / Batch
Usual Data Lake
References: https://youtu.be/qtCxNSmTejk
Getting the Data Right
Audience Segmentation
CSV,
JSON, TXT…
Data
Types
Business-level
Aggregates
Filtered, Cleaned
Augmented
Raw
Ingestion
Bronze Silver Gold
Table
Categorization
Align with
Business
Outcomes
Is my data use
case worthy?
Is my data ready
for Analytics / ML?
Optimizing the Cost to Drive Business Value
Best Practices for Cluster Sizing & Selection
1. Selection of Instance Types
a. Workload type
b. Use cases
2. Selection of node size
a. Observe Metrics
b. Tweak workloads
Best Practices for Cluster Sizing & Selection
1. Selection of Instance Types
a. Workload type
b. Use cases
2. Selection of node size
a. Observe Metrics
b. Tweak workloads
Workload Type AWS
Type
Azure
Type
Recommended Use Case
Memory Optimized r5 Dsv2 Memory-intensive applications
Use Case: ML workload with data caching
Compute Optimized c5 Fsv2 Structured Streaming, Distributed Analytics, Data
Science Applications
Use Case: ETL with full file scans and no data reuse
Storage Optimized i3 Lsv2 Use cases that require higher disk throughput and IO
Use Case: Analytics - Storage Optimized i3 class with
Delta IO Cache
Selection of Instance Types
Reference for Azure Type:: https://docs.microsoft.com/en-us/azure/virtual-machines/windows/sizes
Reference for AWS Type::https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instance-types.html
Best Practices for Cluster Sizing & Selection
1. Selection of Instance Types
a. Workload type
b. Use cases
2. Selection of node size
a. Observe Metrics
b. Tweak workloads
Selection of node size
Rule of thumb
1. Fewer big instances > more small instances
a. (larger heap = larger GC)
b. Multiple executors per machine
2. Size based on the number of tasks initially, tweak later
a. Run the job with a small cluster to get idea of # of tasks
b. Observe Cluster metrics for CPU, memory and network utilization
Observe Spark UI & tweak the workloads
Fully cached with room to spare?
> decrease instances
Almost completely cached?
> Increase cluster size
Not even close to cached?
> Consider instance with SSD
instead of EBS or use R class
Reference: https://docs.databricks.com/delta/optimizations/delta-cache.html#delta-and-rdd-cache-comparison
Observe Ganglia Metrics & tweak the workloads
○ Are we compute bound?
○ Are we network bound?
○ Are we spilling a ton?
Performance and Tuning with
Delta Lake & Apache Spark
Performance Symptoms
Look for these 4 symptoms
Shuffle
Spill
Skew
Small Files
Can I make
Spark application run faster?
Use broadcast join
Review Join order
I found Shuffle, now what?
Query completion time
28 Minutes
Sort Merge Join
rows
output:
2,509,189,31
3
Before
1.8 Minutes
rows
output:
1023
After
Reference: https://spark.apache.org/docs/latest/sql-
performance-tuning.html#broadcast-hint-for-sql-queries
● Increase Shuffle Partitions
(for this example: 48)
● Reduce the number of cores
spark.executor.cores < total
cores per worker
● Larger cluster - faster disk
SSDs
Shuffle Partitions = 16
I found Spill, now what?
set spark.sql.shuffle.partitions=48
More spill you can remove, larger
the impact!
Symptom
● Ganglia CPU usage becomes low for long time after
initial high usage
● Task duration -> Significant difference in max than
75% and 25% values
● Input Size/Records
What to do?
● Use broadcast join
● Use Skew Join
● Filter out large keys/salt keys and set
up multiple reduce steps
● Explicitly repartition the data on a
different field
I found Skew, now what?
Reference: https://docs.databricks.com/delta/join-performance/skew-join.html
Adaptive Query Execution
Reduced manual effort of tuning spark.sql.shuffle.partitions
By default it is turned off, Set spark.sql.adaptive.enabled=true
Dynamically change sort-merge join into broadcast-hash join
▪ Dynamically optimizing skew joins
*Available in DBR 7.x/Spark 3.0
Upstream
● Fix the upstream application building tons of files
● Use a seperate tool to compact them before
processing with Spark
Changes in Spark Application
● Write your own compaction job
● Delta solves this problem!
I found a lot of small files, now what?
Achieving Performance with
Compaction
● Improves the Read
Performance
● Solves Small Files problem
Reference: https://docs.delta.io/latest/best-practices.html#compact-files
● Optimizes Apache Spark partition
● Maximizes the throughput of data being
written
● Compacts files for partitions
Auto Optimize
Auto Optimize consists of two complementary features:
Optimized Writes and Auto Compaction.
Reference: https://docs.databricks.com/delta/optimizations/auto-optimize.html#auto-optimize
Reference:https://docs.databricks.com/delta/optimizations/file-mgmt.html#z-ordering-multi-dimensional-clustering
Z-order Sorting
0 1 2 3 4 5 6 7
0
1
2
3
4
5
6
7
Z-Ordering
A technique to colocate related information in the same set of files
● Safely skips more data
● Faster queries
Governance & Security Controls
Data Governance with Delta Lake
Create retention policy to age out and
erase raw data that may contain
personal information
High Level Aggregates
(e.g. # of users that took an action)
Historical Data Repository
● Easy to navigate
● Pseudonymization
Data Lake
Satisfy Compliance requests using
UPDATE / DELETE commands
Create tables that don't contain
personal data
Reference: https://www.youtube.com/watch?v=tCPslvUjG1w
Audit & Monitoring
▪ Use cluster tags for chargeback
▪ Audit logs
▪ Monitor Databricks DBU usage
▪ Delta Transactional Logs
Governance - The Who/What/Where
Perform standard extraction,
transformation and loading
tasks (ETL) and apply best
coding practices including
source control, unit test, and
automation
drives product innovation
with state-of-the-art
Machine Learning models
applied to big data
Improves business process
through providing standardized
and ad-hoc business analysis.
Acts as intermediary between
Analytics and Business team
Performs automated jobs based
on Data Engineering configs.
Data Scientist Data Engineer Data/Business
Analyst Automated Jobs
Many players in the Org. Managing Access, roles and responsibilities, as well as managing usage is a must.
Business Unit
Serving
Operations
& Security
Data Science & MLIngest
OrchestrationCI/CD
Bringing it together - A reference pipeline
APIs
Jobs
Models
Notebooks
Dashboards
ML Runtime
Delta Pipelines
BLOB
DB/DW
Streaming
Massively scalable data cleansing & transformation
ETL/Data
Processing
Bronze
Silver
Gold
Execution
Databricks Runtime
Reliability & Performance
Optimized Spark
Clusters
Storage
Business Unit
Serving
Operations
& Security
Data Science & MLIngest
OrchestrationCI/CD
Bringing it together - A reference pipeline
APIs
Jobs
Models
Notebooks
Dashboards
ML Runtime
Delta Pipelines
BLOB
DB/DW
Streaming
Massively scalable data cleansing & transformation
ETL/Data
Processing
Bronze
Silver
Gold
Execution
Databricks Runtime
Reliability & Performance
Optimized Spark
Clusters
Storage
Data Strategy
Cost Optimization &
Performance Tuning
Business Value
Security
THANK YOU!!!
Feedback
Your feedback is important to us.
Don’t forget to rate and
review the sessions.

Weitere ähnliche Inhalte

Was ist angesagt?

Practical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & PigPractical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & Pig
Milind Bhandarkar
 

Was ist angesagt? (20)

Web analytics at scale with Druid at naver.com
Web analytics at scale with Druid at naver.comWeb analytics at scale with Druid at naver.com
Web analytics at scale with Druid at naver.com
 
Etude sur le Big Data
Etude sur le Big DataEtude sur le Big Data
Etude sur le Big Data
 
BigData_TP5 : Neo4J
BigData_TP5 : Neo4JBigData_TP5 : Neo4J
BigData_TP5 : Neo4J
 
Spark Autotuning Talk - Strata New York
Spark Autotuning Talk - Strata New YorkSpark Autotuning Talk - Strata New York
Spark Autotuning Talk - Strata New York
 
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital KediaTuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
 
Virtual Flink Forward 2020: Netflix Data Mesh: Composable Data Processing - J...
Virtual Flink Forward 2020: Netflix Data Mesh: Composable Data Processing - J...Virtual Flink Forward 2020: Netflix Data Mesh: Composable Data Processing - J...
Virtual Flink Forward 2020: Netflix Data Mesh: Composable Data Processing - J...
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIs
 
NiFi 시작하기
NiFi 시작하기NiFi 시작하기
NiFi 시작하기
 
Azure Synapse Analytics Overview (r1)
Azure Synapse Analytics Overview (r1)Azure Synapse Analytics Overview (r1)
Azure Synapse Analytics Overview (r1)
 
Optimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsOptimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL Joins
 
BigData_TP1: Initiation à Hadoop et Map-Reduce
BigData_TP1: Initiation à Hadoop et Map-ReduceBigData_TP1: Initiation à Hadoop et Map-Reduce
BigData_TP1: Initiation à Hadoop et Map-Reduce
 
Microsoft Data Integration Pipelines: Azure Data Factory and SSIS
Microsoft Data Integration Pipelines: Azure Data Factory and SSISMicrosoft Data Integration Pipelines: Azure Data Factory and SSIS
Microsoft Data Integration Pipelines: Azure Data Factory and SSIS
 
How to deploy Apache Spark in a multi-tenant, on-premises environment
How to deploy Apache Spark in a multi-tenant, on-premises environmentHow to deploy Apache Spark in a multi-tenant, on-premises environment
How to deploy Apache Spark in a multi-tenant, on-premises environment
 
Practical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & PigPractical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & Pig
 
Overview on Azure Machine Learning
Overview on Azure Machine LearningOverview on Azure Machine Learning
Overview on Azure Machine Learning
 
SQL on everything, in memory
SQL on everything, in memorySQL on everything, in memory
SQL on everything, in memory
 
Data Science : Méthodologie, Outillage et Application - MS Cloud Summit Paris...
Data Science : Méthodologie, Outillage et Application - MS Cloud Summit Paris...Data Science : Méthodologie, Outillage et Application - MS Cloud Summit Paris...
Data Science : Méthodologie, Outillage et Application - MS Cloud Summit Paris...
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
 
Apache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the CoversApache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the Covers
 
Azure Data Explorer deep dive - review 04.2020
Azure Data Explorer deep dive - review 04.2020Azure Data Explorer deep dive - review 04.2020
Azure Data Explorer deep dive - review 04.2020
 

Ähnlich wie Best Practices for Building Robust Data Platform with Apache Spark and Delta

Sql server-performance-hafi
Sql server-performance-hafiSql server-performance-hafi
Sql server-performance-hafi
zabi-babi
 
Ssis Best Practices Israel Bi U Ser Group Itay Braun
Ssis Best Practices   Israel Bi U Ser Group   Itay BraunSsis Best Practices   Israel Bi U Ser Group   Itay Braun
Ssis Best Practices Israel Bi U Ser Group Itay Braun
sqlserver.co.il
 
Infrastructure Considerations : Design : "webops"
Infrastructure Considerations : Design : "webops"Infrastructure Considerations : Design : "webops"
Infrastructure Considerations : Design : "webops"
Piyush Kumar
 
Black Friday and Cyber Monday- Best Practices for Your E-Commerce Database
Black Friday and Cyber Monday- Best Practices for Your E-Commerce DatabaseBlack Friday and Cyber Monday- Best Practices for Your E-Commerce Database
Black Friday and Cyber Monday- Best Practices for Your E-Commerce Database
Tim Vaillancourt
 

Ähnlich wie Best Practices for Building Robust Data Platform with Apache Spark and Delta (20)

Datadog: a Real-Time Metrics Database for One Quadrillion Points/Day
Datadog: a Real-Time Metrics Database for One Quadrillion Points/DayDatadog: a Real-Time Metrics Database for One Quadrillion Points/Day
Datadog: a Real-Time Metrics Database for One Quadrillion Points/Day
 
11g R2
11g R211g R2
11g R2
 
ADV Slides: Comparing the Enterprise Analytic Solutions
ADV Slides: Comparing the Enterprise Analytic SolutionsADV Slides: Comparing the Enterprise Analytic Solutions
ADV Slides: Comparing the Enterprise Analytic Solutions
 
Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...
Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...
Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...
 
Microsoft Azure Big Data Analytics
Microsoft Azure Big Data AnalyticsMicrosoft Azure Big Data Analytics
Microsoft Azure Big Data Analytics
 
DEVNET-1140 InterCloud Mapreduce and Spark Workload Migration and Sharing: Fi...
DEVNET-1140	InterCloud Mapreduce and Spark Workload Migration and Sharing: Fi...DEVNET-1140	InterCloud Mapreduce and Spark Workload Migration and Sharing: Fi...
DEVNET-1140 InterCloud Mapreduce and Spark Workload Migration and Sharing: Fi...
 
Estimating the Total Costs of Your Cloud Analytics Platform
Estimating the Total Costs of Your Cloud Analytics PlatformEstimating the Total Costs of Your Cloud Analytics Platform
Estimating the Total Costs of Your Cloud Analytics Platform
 
Sql server-performance-hafi
Sql server-performance-hafiSql server-performance-hafi
Sql server-performance-hafi
 
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
 
Hadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedInHadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedIn
 
Migration to ClickHouse. Practical guide, by Alexander Zaitsev
Migration to ClickHouse. Practical guide, by Alexander ZaitsevMigration to ClickHouse. Practical guide, by Alexander Zaitsev
Migration to ClickHouse. Practical guide, by Alexander Zaitsev
 
FSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the Cloud
FSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the CloudFSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the Cloud
FSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the Cloud
 
Ssis Best Practices Israel Bi U Ser Group Itay Braun
Ssis Best Practices   Israel Bi U Ser Group   Itay BraunSsis Best Practices   Israel Bi U Ser Group   Itay Braun
Ssis Best Practices Israel Bi U Ser Group Itay Braun
 
Big Data on Cloud Native Platform
Big Data on Cloud Native PlatformBig Data on Cloud Native Platform
Big Data on Cloud Native Platform
 
Big Data on Cloud Native Platform
Big Data on Cloud Native PlatformBig Data on Cloud Native Platform
Big Data on Cloud Native Platform
 
Building a High Performance Analytics Platform
Building a High Performance Analytics PlatformBuilding a High Performance Analytics Platform
Building a High Performance Analytics Platform
 
Infrastructure Considerations : Design : "webops"
Infrastructure Considerations : Design : "webops"Infrastructure Considerations : Design : "webops"
Infrastructure Considerations : Design : "webops"
 
Black Friday and Cyber Monday- Best Practices for Your E-Commerce Database
Black Friday and Cyber Monday- Best Practices for Your E-Commerce DatabaseBlack Friday and Cyber Monday- Best Practices for Your E-Commerce Database
Black Friday and Cyber Monday- Best Practices for Your E-Commerce Database
 
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data ArchitectureADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
 
Serverless SQL
Serverless SQLServerless SQL
Serverless SQL
 

Mehr von Databricks

Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 

Mehr von Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Kürzlich hochgeladen

Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
HyderabadDolls
 
Computer science Sql cheat sheet.pdf.pdf
Computer science Sql cheat sheet.pdf.pdfComputer science Sql cheat sheet.pdf.pdf
Computer science Sql cheat sheet.pdf.pdf
SayantanBiswas37
 
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
nirzagarg
 
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Bertram Ludäscher
 
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
nirzagarg
 
Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...
Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...
Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...
HyderabadDolls
 
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
gajnagarg
 
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi ArabiaIn Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
ahmedjiabur940
 
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
HyderabadDolls
 
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
gajnagarg
 
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
nirzagarg
 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptx
chadhar227
 

Kürzlich hochgeladen (20)

Fun all Day Call Girls in Jaipur 9332606886 High Profile Call Girls You Ca...
Fun all Day Call Girls in Jaipur   9332606886  High Profile Call Girls You Ca...Fun all Day Call Girls in Jaipur   9332606886  High Profile Call Girls You Ca...
Fun all Day Call Girls in Jaipur 9332606886 High Profile Call Girls You Ca...
 
Digital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham WareDigital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham Ware
 
Statistics notes ,it includes mean to index numbers
Statistics notes ,it includes mean to index numbersStatistics notes ,it includes mean to index numbers
Statistics notes ,it includes mean to index numbers
 
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
 
Computer science Sql cheat sheet.pdf.pdf
Computer science Sql cheat sheet.pdf.pdfComputer science Sql cheat sheet.pdf.pdf
Computer science Sql cheat sheet.pdf.pdf
 
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
 
Dubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls DubaiDubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls Dubai
 
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
 
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
 
Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...
Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...
Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...
 
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
 
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi ArabiaIn Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
 
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With OrangePredicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
 
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
 
Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...
Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...
Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
 
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
 
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptx
 

Best Practices for Building Robust Data Platform with Apache Spark and Delta

  • 1. Best Practices for building Robust Data Platform with Apache Spark & Delta Vini Jaiswal Spark+AI Summit - June 2020 https://www.linkedin.com/in/vinijaiswal/
  • 2. ▪ Data Strategy Optimizing the cost to drive Business value ▪ Performance and tuning with Delta Lake & Apache Spark ▪ Governance and security controls Bringing it all together - A reference architecture Agenda
  • 4. Data Challenges Data Warehouse limits the potential of intelligence Data Volume is growing rapidly More Variety of data -> Different applications Need for faster processing and scalability Data silos limits innovation Promise of the Data Lake 1. Collect Everything 2. Store it all in the Data Lake 🔥 🔥🔥 3. Data Science & Machine Learning 🔥 🔥
  • 7. Ideal data lakes with No atomicity No quality enforcement No consistency / isolation ✗ Reliability - High Quality Data ● Schema Enforcement ● ACID Transactions ● Time Travel ● Open Standards, Open Source ● Powered by ● Unifies Streaming / Batch Usual Data Lake References: https://youtu.be/qtCxNSmTejk
  • 8. Getting the Data Right Audience Segmentation CSV, JSON, TXT… Data Types Business-level Aggregates Filtered, Cleaned Augmented Raw Ingestion Bronze Silver Gold Table Categorization Align with Business Outcomes Is my data use case worthy? Is my data ready for Analytics / ML?
  • 9. Optimizing the Cost to Drive Business Value
  • 10. Best Practices for Cluster Sizing & Selection 1. Selection of Instance Types a. Workload type b. Use cases 2. Selection of node size a. Observe Metrics b. Tweak workloads
  • 11. Best Practices for Cluster Sizing & Selection 1. Selection of Instance Types a. Workload type b. Use cases 2. Selection of node size a. Observe Metrics b. Tweak workloads
  • 12. Workload Type AWS Type Azure Type Recommended Use Case Memory Optimized r5 Dsv2 Memory-intensive applications Use Case: ML workload with data caching Compute Optimized c5 Fsv2 Structured Streaming, Distributed Analytics, Data Science Applications Use Case: ETL with full file scans and no data reuse Storage Optimized i3 Lsv2 Use cases that require higher disk throughput and IO Use Case: Analytics - Storage Optimized i3 class with Delta IO Cache Selection of Instance Types Reference for Azure Type:: https://docs.microsoft.com/en-us/azure/virtual-machines/windows/sizes Reference for AWS Type::https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instance-types.html
  • 13. Best Practices for Cluster Sizing & Selection 1. Selection of Instance Types a. Workload type b. Use cases 2. Selection of node size a. Observe Metrics b. Tweak workloads
  • 14. Selection of node size Rule of thumb 1. Fewer big instances > more small instances a. (larger heap = larger GC) b. Multiple executors per machine 2. Size based on the number of tasks initially, tweak later a. Run the job with a small cluster to get idea of # of tasks b. Observe Cluster metrics for CPU, memory and network utilization
  • 15. Observe Spark UI & tweak the workloads Fully cached with room to spare? > decrease instances Almost completely cached? > Increase cluster size Not even close to cached? > Consider instance with SSD instead of EBS or use R class Reference: https://docs.databricks.com/delta/optimizations/delta-cache.html#delta-and-rdd-cache-comparison
  • 16. Observe Ganglia Metrics & tweak the workloads ○ Are we compute bound? ○ Are we network bound? ○ Are we spilling a ton?
  • 17. Performance and Tuning with Delta Lake & Apache Spark
  • 18. Performance Symptoms Look for these 4 symptoms Shuffle Spill Skew Small Files Can I make Spark application run faster?
  • 19. Use broadcast join Review Join order I found Shuffle, now what? Query completion time 28 Minutes Sort Merge Join rows output: 2,509,189,31 3 Before 1.8 Minutes rows output: 1023 After Reference: https://spark.apache.org/docs/latest/sql- performance-tuning.html#broadcast-hint-for-sql-queries
  • 20. ● Increase Shuffle Partitions (for this example: 48) ● Reduce the number of cores spark.executor.cores < total cores per worker ● Larger cluster - faster disk SSDs Shuffle Partitions = 16 I found Spill, now what? set spark.sql.shuffle.partitions=48 More spill you can remove, larger the impact!
  • 21. Symptom ● Ganglia CPU usage becomes low for long time after initial high usage ● Task duration -> Significant difference in max than 75% and 25% values ● Input Size/Records What to do? ● Use broadcast join ● Use Skew Join ● Filter out large keys/salt keys and set up multiple reduce steps ● Explicitly repartition the data on a different field I found Skew, now what? Reference: https://docs.databricks.com/delta/join-performance/skew-join.html
  • 22. Adaptive Query Execution Reduced manual effort of tuning spark.sql.shuffle.partitions By default it is turned off, Set spark.sql.adaptive.enabled=true Dynamically change sort-merge join into broadcast-hash join ▪ Dynamically optimizing skew joins *Available in DBR 7.x/Spark 3.0
  • 23. Upstream ● Fix the upstream application building tons of files ● Use a seperate tool to compact them before processing with Spark Changes in Spark Application ● Write your own compaction job ● Delta solves this problem! I found a lot of small files, now what?
  • 25. Compaction ● Improves the Read Performance ● Solves Small Files problem Reference: https://docs.delta.io/latest/best-practices.html#compact-files
  • 26. ● Optimizes Apache Spark partition ● Maximizes the throughput of data being written ● Compacts files for partitions Auto Optimize Auto Optimize consists of two complementary features: Optimized Writes and Auto Compaction. Reference: https://docs.databricks.com/delta/optimizations/auto-optimize.html#auto-optimize
  • 27. Reference:https://docs.databricks.com/delta/optimizations/file-mgmt.html#z-ordering-multi-dimensional-clustering Z-order Sorting 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 Z-Ordering A technique to colocate related information in the same set of files ● Safely skips more data ● Faster queries
  • 29. Data Governance with Delta Lake Create retention policy to age out and erase raw data that may contain personal information High Level Aggregates (e.g. # of users that took an action) Historical Data Repository ● Easy to navigate ● Pseudonymization Data Lake Satisfy Compliance requests using UPDATE / DELETE commands Create tables that don't contain personal data Reference: https://www.youtube.com/watch?v=tCPslvUjG1w
  • 30. Audit & Monitoring ▪ Use cluster tags for chargeback ▪ Audit logs ▪ Monitor Databricks DBU usage ▪ Delta Transactional Logs
  • 31. Governance - The Who/What/Where Perform standard extraction, transformation and loading tasks (ETL) and apply best coding practices including source control, unit test, and automation drives product innovation with state-of-the-art Machine Learning models applied to big data Improves business process through providing standardized and ad-hoc business analysis. Acts as intermediary between Analytics and Business team Performs automated jobs based on Data Engineering configs. Data Scientist Data Engineer Data/Business Analyst Automated Jobs Many players in the Org. Managing Access, roles and responsibilities, as well as managing usage is a must.
  • 32. Business Unit Serving Operations & Security Data Science & MLIngest OrchestrationCI/CD Bringing it together - A reference pipeline APIs Jobs Models Notebooks Dashboards ML Runtime Delta Pipelines BLOB DB/DW Streaming Massively scalable data cleansing & transformation ETL/Data Processing Bronze Silver Gold Execution Databricks Runtime Reliability & Performance Optimized Spark Clusters Storage
  • 33. Business Unit Serving Operations & Security Data Science & MLIngest OrchestrationCI/CD Bringing it together - A reference pipeline APIs Jobs Models Notebooks Dashboards ML Runtime Delta Pipelines BLOB DB/DW Streaming Massively scalable data cleansing & transformation ETL/Data Processing Bronze Silver Gold Execution Databricks Runtime Reliability & Performance Optimized Spark Clusters Storage Data Strategy Cost Optimization & Performance Tuning Business Value Security
  • 35. Feedback Your feedback is important to us. Don’t forget to rate and review the sessions.