SlideShare ist ein Scribd-Unternehmen logo
1 von 38
Migration from Redshift to
Spark
Sky Yin
Data scientist @ Stitch Fix
What’s Stitch Fix?
• An online personal clothes styling service
since 2011
A loop of recommendation
• We recommend clothes through a
combination of human stylists and
algorithms
Inventory
Clients
Humans
Algorithms
What do I do?
• As part of a ~80 people data team, I work on
inventory data infrastructure and analysis
• The biggest table I manage adds ~500-800M
rows per day
Inventory
Data infrastructure: computation
• Data science jobs are mostly done in Python
or R, and later deployed through Docker
Data infrastructure: computation
• Data science jobs are mostly done in Python
or R, and later deployed through Docker
– Gap between data scientist daily work flow and
production deployment
– Not every job requires big data
Data infrastructure: computation
• Data science jobs are mostly done in Python
or R, and later deployed through Docker
• As business grows, Redshift was brought in
to help extract relevant data
Redshift: the good parts
• Can be very fast
• Familiar interface: SQL
• Managed by AWS
• Can scale up and down on demand
• Cost effective*
Redshift: the troubles
• Congestion: too many people running
queries in the morning (~80 people data
team)
Redshift: the troubles
• Congestion: too many people running
queries in the morning
– Scale up and down on demand?
Redshift: the troubles
• Congestion: too many people running
queries in the morning
– Scale up and down on demand?
– Scalable =/= easy to scale
Redshift: the troubles
• Congestion: too many people running
queries in the morning
• Production pipelines and exploratory queries
share the same cluster
Redshift: the troubles
• Congestion: too many people running
queries in the morning
• Production pipelines and ad-hoc queries
share the same cluster
– Yet another cluster to isolate dev from prod?
Redshift: the troubles
• Congestion: too many people running
queries in the morning
• Production pipelines and dev queries share
the same cluster
– Yet another cluster to isolate dev from prod?
• Sync between two clusters
• Will hit scalability problem again (even on prod)
Redshift: the troubles
• Congestion: too many people running queries in
the morning
• Production pipelines and ad-hoc queries share
the same cluster
• There’s no isolation between computation and
storage
– Can’t scale computation without paying for storage
– Can’t scale storage without paying for unneeded CPUs
Data Infrastructure: storage
• Single source of truth: S3, not Redshift
• Compatible interface: Hive metastore
• With the foundation in storage, the transition
from Redshift to Spark is much easier
S3
Journey to Spark
• Netflix Genie as a “job server” on top of a
group of Spark clusters in EMR
Internal Python/R
libs to return DF
Job execution
service: Flotilla
Dev
Prod
On-demand
Manual job
submission
Journey to Spark: SQL
• Heavy on SparkSQL with a mix of PySpark
• Lower migration cost and gentle learning
curve
• Data storage layout is transparent to users
– Every write we create a new folder with timestamp
as batch_id, to avoid consistency problem in S3
Journey to Spark: SQL
• Difference in functions and syntax
– Redshift
– SparkSQL
Journey to Spark: SQL
• Difference in functions and syntax
• Partition
– In Redshift, it’s defined by dist_key in schema
– In Spark, the term is a bit confusing
• There’s another kind of partition on storage level
• In our case, it’s defined in Hive metastore
– S3://bucket/table/country=US/year=2012/
Journey to Spark: SQL
• Functions and syntax
• Partition
• SparkSQL
– DataFrame API
Journey to Spark: SQL
• Functions and syntax
• Partition
• Sorting
– Redshift defines sort_key in schema
– Spark can specify that in runtime
OR != ORDER BY
Journey to Spark: Performance
• There’s a price in performance if we naively
treat Spark as yet another SQL data
warehouse
Journey to Spark: Performance
• Key difference: Spark is an in-memory
computing platform while Redshift is not
Journey to Spark: Performance
• Key difference: Spark is an in-memory
computing platform while Redshift is not
• FAQ #2: “Why is my Spark job so slow?”
Journey to Spark: Performance
• Key difference: Spark is an in-memory
computing platform while Redshift is not
• FAQ #2: “Why is my Spark job so slow?”
– Aggressive caching
Journey to Spark: Performance
• Key difference: Spark is an in-memory
computing platform while Redshift is not
• FAQ #2: “Why is my Spark job so slow?”
– Aggressive cache
– Use partition and filtering
– Detect data skew
– Small file problem in S3
Journey to Spark: Performance
• FAQ #1: “Why did my job fail? Hmm, the log
says some executors were killed by
YARN???”
Journey to Spark: Performance
• FAQ #1: “Why did my job fail?”
– Executor memory and GC tuning
• Adding more memory doesn’t always work
Journey to Spark: Performance
• FAQ #1: “Why did my job fail? “
– Executor memory and GC tuning
– Data skew in shuffling
• Repartition and salting
Journey to Spark: Performance
• FAQ #1: “Why did my job fail? “
– Executor memory and GC tuning
– Data skew in shuffling
• Repartition and salting
• spark.sql.shuffle.partitions
– On big data, 200 is too small: 2G limit on shuffle partition
– On small data, 200 is too big: small file problem in S3
Journey to Spark: Performance
• FAQ #1: “Why did my job fail? “
– Executor memory and GC tuning
– Data skew in shuffling
– Broadcasting a large table unfortunately
• Missing table statistics in Hive metastore
Journey to Spark: Productivity
• SQL, as a language, doesn’t scale well
– Nested structure in sub-queries
– Lack of tools to manage complexity
Journey to Spark: Productivity
• SQL, as a language, doesn’t scale well
• CTE: one trick pony
Journey to Spark: Productivity
• SQL, as a language, doesn’t scale well
• CTE: one trick pony
=> Spark temp tables
Journey to Spark: Productivity
• Inspiration from data flow languages
– Spark DataFrames API
– dplyr in R
– Pandas pipe
• Need for a higher level abstraction
Journey to Spark: Productivity
• Dataset.transform() to chain functions
Thank You.
Contact me @piggybox or syin@stitchfix.com

Weitere ähnliche Inhalte

Was ist angesagt?

Magnet Shuffle Service: Push-based Shuffle at LinkedIn
Magnet Shuffle Service: Push-based Shuffle at LinkedInMagnet Shuffle Service: Push-based Shuffle at LinkedIn
Magnet Shuffle Service: Push-based Shuffle at LinkedIn
Databricks
 
Low latency high throughput streaming using Apache Apex and Apache Kudu
Low latency high throughput streaming using Apache Apex and Apache KuduLow latency high throughput streaming using Apache Apex and Apache Kudu
Low latency high throughput streaming using Apache Apex and Apache Kudu
DataWorks Summit
 
Migration to Redshift from SQL Server
Migration to Redshift from SQL ServerMigration to Redshift from SQL Server
Migration to Redshift from SQL Server
joeharris76
 

Was ist angesagt? (20)

FiloDB - Breakthrough OLAP Performance with Cassandra and Spark
FiloDB - Breakthrough OLAP Performance with Cassandra and SparkFiloDB - Breakthrough OLAP Performance with Cassandra and Spark
FiloDB - Breakthrough OLAP Performance with Cassandra and Spark
 
Stumbling stones when migrating from Oracle
 Stumbling stones when migrating from Oracle Stumbling stones when migrating from Oracle
Stumbling stones when migrating from Oracle
 
From 100s to 100s of Millions
From 100s to 100s of MillionsFrom 100s to 100s of Millions
From 100s to 100s of Millions
 
Scaling Apache Spark at Facebook
Scaling Apache Spark at FacebookScaling Apache Spark at Facebook
Scaling Apache Spark at Facebook
 
An introduction into Spark ML plus how to go beyond when you get stuck
An introduction into Spark ML plus how to go beyond when you get stuckAn introduction into Spark ML plus how to go beyond when you get stuck
An introduction into Spark ML plus how to go beyond when you get stuck
 
Magnet Shuffle Service: Push-based Shuffle at LinkedIn
Magnet Shuffle Service: Push-based Shuffle at LinkedInMagnet Shuffle Service: Push-based Shuffle at LinkedIn
Magnet Shuffle Service: Push-based Shuffle at LinkedIn
 
Low latency high throughput streaming using Apache Apex and Apache Kudu
Low latency high throughput streaming using Apache Apex and Apache KuduLow latency high throughput streaming using Apache Apex and Apache Kudu
Low latency high throughput streaming using Apache Apex and Apache Kudu
 
Migration to Redshift from SQL Server
Migration to Redshift from SQL ServerMigration to Redshift from SQL Server
Migration to Redshift from SQL Server
 
Cassandra Introduction & Features
Cassandra Introduction & FeaturesCassandra Introduction & Features
Cassandra Introduction & Features
 
IBM Spark Technology Center: Real-time Advanced Analytics and Machine Learnin...
IBM Spark Technology Center: Real-time Advanced Analytics and Machine Learnin...IBM Spark Technology Center: Real-time Advanced Analytics and Machine Learnin...
IBM Spark Technology Center: Real-time Advanced Analytics and Machine Learnin...
 
Applications in the Cloud
Applications in the CloudApplications in the Cloud
Applications in the Cloud
 
Scaling Pinterest
Scaling PinterestScaling Pinterest
Scaling Pinterest
 
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
 
(BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift
(BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift(BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift
(BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift
 
Migration From Oracle to PostgreSQL
Migration From Oracle to PostgreSQLMigration From Oracle to PostgreSQL
Migration From Oracle to PostgreSQL
 
How We Optimize Spark SQL Jobs With parallel and sync IO
How We Optimize Spark SQL Jobs With parallel and sync IOHow We Optimize Spark SQL Jobs With parallel and sync IO
How We Optimize Spark SQL Jobs With parallel and sync IO
 
Spark Summit EU talk by Kent Buenaventura and Willaim Lau
Spark Summit EU talk by Kent Buenaventura and Willaim LauSpark Summit EU talk by Kent Buenaventura and Willaim Lau
Spark Summit EU talk by Kent Buenaventura and Willaim Lau
 
AWS July Webinar Series: Amazon redshift migration and load data 20150722
AWS July Webinar Series: Amazon redshift migration and load data 20150722AWS July Webinar Series: Amazon redshift migration and load data 20150722
AWS July Webinar Series: Amazon redshift migration and load data 20150722
 
Deep Dive on Amazon Redshift
Deep Dive on Amazon RedshiftDeep Dive on Amazon Redshift
Deep Dive on Amazon Redshift
 
What's New in Amazon Aurora
What's New in Amazon AuroraWhat's New in Amazon Aurora
What's New in Amazon Aurora
 

Ähnlich wie Migration from Redshift to Spark

2013 CPM Conference, Nov 6th, NoSQL Capacity Planning
2013 CPM Conference, Nov 6th, NoSQL Capacity Planning2013 CPM Conference, Nov 6th, NoSQL Capacity Planning
2013 CPM Conference, Nov 6th, NoSQL Capacity Planning
asya999
 
Spark summit 2019 infrastructure for deep learning in apache spark 0425
Spark summit 2019 infrastructure for deep learning in apache spark 0425Spark summit 2019 infrastructure for deep learning in apache spark 0425
Spark summit 2019 infrastructure for deep learning in apache spark 0425
Wee Hyong Tok
 

Ähnlich wie Migration from Redshift to Spark (20)

Spark SQL
Spark SQLSpark SQL
Spark SQL
 
Data Science at Scale: Using Apache Spark for Data Science at Bitly
Data Science at Scale: Using Apache Spark for Data Science at BitlyData Science at Scale: Using Apache Spark for Data Science at Bitly
Data Science at Scale: Using Apache Spark for Data Science at Bitly
 
Hadoop for the Absolute Beginner
Hadoop for the Absolute BeginnerHadoop for the Absolute Beginner
Hadoop for the Absolute Beginner
 
ETL for the masses with Power Query and M
ETL for the masses with Power Query and METL for the masses with Power Query and M
ETL for the masses with Power Query and M
 
Creating Reusable Geospatial Pipelines
Creating Reusable Geospatial PipelinesCreating Reusable Geospatial Pipelines
Creating Reusable Geospatial Pipelines
 
2013 CPM Conference, Nov 6th, NoSQL Capacity Planning
2013 CPM Conference, Nov 6th, NoSQL Capacity Planning2013 CPM Conference, Nov 6th, NoSQL Capacity Planning
2013 CPM Conference, Nov 6th, NoSQL Capacity Planning
 
The Polyglot Data Scientist - Exploring R, Python, and SQL Server
The Polyglot Data Scientist - Exploring R, Python, and SQL ServerThe Polyglot Data Scientist - Exploring R, Python, and SQL Server
The Polyglot Data Scientist - Exploring R, Python, and SQL Server
 
Relational data modeling trends for transactional applications
Relational data modeling trends for transactional applicationsRelational data modeling trends for transactional applications
Relational data modeling trends for transactional applications
 
Presto Summit 2018 - 02 - LinkedIn
Presto Summit 2018  - 02 - LinkedInPresto Summit 2018  - 02 - LinkedIn
Presto Summit 2018 - 02 - LinkedIn
 
Migrating ETL Workflow to Apache Spark at Scale in Pinterest
Migrating ETL Workflow to Apache Spark at Scale in PinterestMigrating ETL Workflow to Apache Spark at Scale in Pinterest
Migrating ETL Workflow to Apache Spark at Scale in Pinterest
 
Big Data Introduction - Solix empower
Big Data Introduction - Solix empowerBig Data Introduction - Solix empower
Big Data Introduction - Solix empower
 
Cassandra Core Concepts
Cassandra Core ConceptsCassandra Core Concepts
Cassandra Core Concepts
 
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和SparkEtu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
 
Spark summit 2019 infrastructure for deep learning in apache spark 0425
Spark summit 2019 infrastructure for deep learning in apache spark 0425Spark summit 2019 infrastructure for deep learning in apache spark 0425
Spark summit 2019 infrastructure for deep learning in apache spark 0425
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark Fundamentals
 
Cheetah:Data Warehouse on Top of MapReduce
Cheetah:Data Warehouse on Top of MapReduceCheetah:Data Warehouse on Top of MapReduce
Cheetah:Data Warehouse on Top of MapReduce
 
Big Data for Data Scientists - WeCloudData
Big Data for Data Scientists - WeCloudDataBig Data for Data Scientists - WeCloudData
Big Data for Data Scientists - WeCloudData
 
Five Years of EC2 Distilled
Five Years of EC2 DistilledFive Years of EC2 Distilled
Five Years of EC2 Distilled
 
[262] netflix 빅데이터 플랫폼
[262] netflix 빅데이터 플랫폼[262] netflix 빅데이터 플랫폼
[262] netflix 빅데이터 플랫폼
 
Navigating SAP’s Integration Options (Mastering SAP Technologies 2013)
Navigating SAP’s Integration Options (Mastering SAP Technologies 2013)Navigating SAP’s Integration Options (Mastering SAP Technologies 2013)
Navigating SAP’s Integration Options (Mastering SAP Technologies 2013)
 

Kürzlich hochgeladen

Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
WSO2
 

Kürzlich hochgeladen (20)

Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
A Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source MilvusA Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source Milvus
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 

Migration from Redshift to Spark

  • 1. Migration from Redshift to Spark Sky Yin Data scientist @ Stitch Fix
  • 2. What’s Stitch Fix? • An online personal clothes styling service since 2011
  • 3. A loop of recommendation • We recommend clothes through a combination of human stylists and algorithms Inventory Clients Humans Algorithms
  • 4. What do I do? • As part of a ~80 people data team, I work on inventory data infrastructure and analysis • The biggest table I manage adds ~500-800M rows per day Inventory
  • 5. Data infrastructure: computation • Data science jobs are mostly done in Python or R, and later deployed through Docker
  • 6. Data infrastructure: computation • Data science jobs are mostly done in Python or R, and later deployed through Docker – Gap between data scientist daily work flow and production deployment – Not every job requires big data
  • 7. Data infrastructure: computation • Data science jobs are mostly done in Python or R, and later deployed through Docker • As business grows, Redshift was brought in to help extract relevant data
  • 8. Redshift: the good parts • Can be very fast • Familiar interface: SQL • Managed by AWS • Can scale up and down on demand • Cost effective*
  • 9. Redshift: the troubles • Congestion: too many people running queries in the morning (~80 people data team)
  • 10. Redshift: the troubles • Congestion: too many people running queries in the morning – Scale up and down on demand?
  • 11. Redshift: the troubles • Congestion: too many people running queries in the morning – Scale up and down on demand? – Scalable =/= easy to scale
  • 12. Redshift: the troubles • Congestion: too many people running queries in the morning • Production pipelines and exploratory queries share the same cluster
  • 13. Redshift: the troubles • Congestion: too many people running queries in the morning • Production pipelines and ad-hoc queries share the same cluster – Yet another cluster to isolate dev from prod?
  • 14. Redshift: the troubles • Congestion: too many people running queries in the morning • Production pipelines and dev queries share the same cluster – Yet another cluster to isolate dev from prod? • Sync between two clusters • Will hit scalability problem again (even on prod)
  • 15. Redshift: the troubles • Congestion: too many people running queries in the morning • Production pipelines and ad-hoc queries share the same cluster • There’s no isolation between computation and storage – Can’t scale computation without paying for storage – Can’t scale storage without paying for unneeded CPUs
  • 16. Data Infrastructure: storage • Single source of truth: S3, not Redshift • Compatible interface: Hive metastore • With the foundation in storage, the transition from Redshift to Spark is much easier S3
  • 17. Journey to Spark • Netflix Genie as a “job server” on top of a group of Spark clusters in EMR Internal Python/R libs to return DF Job execution service: Flotilla Dev Prod On-demand Manual job submission
  • 18. Journey to Spark: SQL • Heavy on SparkSQL with a mix of PySpark • Lower migration cost and gentle learning curve • Data storage layout is transparent to users – Every write we create a new folder with timestamp as batch_id, to avoid consistency problem in S3
  • 19. Journey to Spark: SQL • Difference in functions and syntax – Redshift – SparkSQL
  • 20. Journey to Spark: SQL • Difference in functions and syntax • Partition – In Redshift, it’s defined by dist_key in schema – In Spark, the term is a bit confusing • There’s another kind of partition on storage level • In our case, it’s defined in Hive metastore – S3://bucket/table/country=US/year=2012/
  • 21. Journey to Spark: SQL • Functions and syntax • Partition • SparkSQL – DataFrame API
  • 22. Journey to Spark: SQL • Functions and syntax • Partition • Sorting – Redshift defines sort_key in schema – Spark can specify that in runtime OR != ORDER BY
  • 23. Journey to Spark: Performance • There’s a price in performance if we naively treat Spark as yet another SQL data warehouse
  • 24. Journey to Spark: Performance • Key difference: Spark is an in-memory computing platform while Redshift is not
  • 25. Journey to Spark: Performance • Key difference: Spark is an in-memory computing platform while Redshift is not • FAQ #2: “Why is my Spark job so slow?”
  • 26. Journey to Spark: Performance • Key difference: Spark is an in-memory computing platform while Redshift is not • FAQ #2: “Why is my Spark job so slow?” – Aggressive caching
  • 27. Journey to Spark: Performance • Key difference: Spark is an in-memory computing platform while Redshift is not • FAQ #2: “Why is my Spark job so slow?” – Aggressive cache – Use partition and filtering – Detect data skew – Small file problem in S3
  • 28. Journey to Spark: Performance • FAQ #1: “Why did my job fail? Hmm, the log says some executors were killed by YARN???”
  • 29. Journey to Spark: Performance • FAQ #1: “Why did my job fail?” – Executor memory and GC tuning • Adding more memory doesn’t always work
  • 30. Journey to Spark: Performance • FAQ #1: “Why did my job fail? “ – Executor memory and GC tuning – Data skew in shuffling • Repartition and salting
  • 31. Journey to Spark: Performance • FAQ #1: “Why did my job fail? “ – Executor memory and GC tuning – Data skew in shuffling • Repartition and salting • spark.sql.shuffle.partitions – On big data, 200 is too small: 2G limit on shuffle partition – On small data, 200 is too big: small file problem in S3
  • 32. Journey to Spark: Performance • FAQ #1: “Why did my job fail? “ – Executor memory and GC tuning – Data skew in shuffling – Broadcasting a large table unfortunately • Missing table statistics in Hive metastore
  • 33. Journey to Spark: Productivity • SQL, as a language, doesn’t scale well – Nested structure in sub-queries – Lack of tools to manage complexity
  • 34. Journey to Spark: Productivity • SQL, as a language, doesn’t scale well • CTE: one trick pony
  • 35. Journey to Spark: Productivity • SQL, as a language, doesn’t scale well • CTE: one trick pony => Spark temp tables
  • 36. Journey to Spark: Productivity • Inspiration from data flow languages – Spark DataFrames API – dplyr in R – Pandas pipe • Need for a higher level abstraction
  • 37. Journey to Spark: Productivity • Dataset.transform() to chain functions
  • 38. Thank You. Contact me @piggybox or syin@stitchfix.com