SlideShare ist ein Scribd-Unternehmen logo
1 von 41
Ashwin Shankar
Nezih Yigitbasi
Productionizing
Spark on Yarn
for ETL
Scale
81+ million
members
Global 1000+ devices
supported
125 million
hours / day
Netflix Key Business Metrics
40 PB DW Read 3PB Write 300TB700B Events
Netflix Key Platform Metrics
Outline
● Big Data Platform Architecture
● Technical Challenges
● ETL
Big Data Platform Architecture
Cloud
apps
Kafka Ursula
Cassandra
Aegisthus
Dimension Data
Event Data
~ 1 min
Daily
S3
SS
Tables
Data Pipeline
Storage
Compute
Service
Tools
Big Data APIBig Data Portal
S3 Parquet
Transport VisualizationQuality PigWorkflowVis Job/ClusterVis
Interface
Execution Metadata
Notebooks
• 3000 EC2 nodes on two clusters (d2.4xlarge)
• Multiple Spark versions
• Share the same infrastructure with MapReduce jobs
S M
S M
S M
M
…
16 vcores
120 GB
M
S
MapReduceS
S M
S M
S M
MS
Spark
S M
S M
S M
MS
S M
S M
S M
MS
Spark on YARN at Netflix
$ spark-shell --ver 1.6 …
s3://…/spark-1.6.tar.gz
s3://…/spark-2.0.tar.gz
s3://…/spark-2.0-unstable.tar.gz
s3://…/1.6/spark-defaults.conf
…
s3://…/prod/yarn-site.xml
s3://../prod/core-site.xml
ConfigurationApplication
Multi-version Support
Technical Challenges
YARN
Resource
Manager
Node
Manager
Spark
AM
RDD
Custom Coalescer Support [SPARK-14042]
• coalesce() can only “merge” using the given number of partitions
– how to merge by size?
• CombineFileInputFormat with Hive
• Support custom partition coalescing strategies
• Parent RDD partitions are listed sequentially
• Slow for tables with lots of partitions
• Parallelize listing of parent RDD partitions
UnionRDD Parallel Listing [SPARK-9926]
YARN
Resource
Manager
S3Filesystem
RDD
Node
Manager
Spark
AM
• Unnecessary getFileStatus() call
• SPARK-9926 and HADOOP-12810 yield faster startup
• ~20x speedup in input split calculation
Optimize S3 Listing Performance [HADOOP-12810]
• Each task writes output to a temp directory
• Rename first successful task’s temp directory to final destination
• Problems with S3
• S3 rename => copy + delete
• S3 is eventually consistent
Hadoop Output Committer
• Each task writes output to local disk
• Copy first successful task’s output to S3
• Advantages
• avoid redundant S3 copy
• avoid eventual consistency
S3 Output Committer
YARN
Resource
Manager
Dynamic
Allocation
S3Filesystem
RDD
Node
Manager
Spark
AM
• Broadcast joins/variables
• Replicas can be removed with dynamic allocation
Poor Broadcast Read Performance [SPARK-13328]
...
16/02/13 01:02:27WARN BlockManager:
Failed to fetch remote block broadcast_18_piece0 (failed attempt 70)
...
16/02/13 01:02:27 INFOTorrentBroadcast:
Reading broadcast variable 18 took 1051049 ms
• Refresh replica locations from the driver on multiple failures
• Cancel & resend pending container requests
• if the locality preference is no longer needed
• if no locality preference is set
• No locality information with S3
• Do not cancel requests without locality preference
Incorrect Locality Optimization [SPARK-13779]
YARN
Resource
Manager
Parquet R/W
Dynamic
Allocation
S3Filesystem
RDD
Node
Manager
Spark
AM
A B C D
a1 b1 c1 d1
… … … …
aN bN cN dN
A B C D
dictionary
from “Analytic Data Storage in Hadoop”, Ryan Blue
Parquet Dictionary Filtering [PARQUET-384*]
0
10
20
30
40
50
60
70
80
DF disabled DF enabled
64MB split
DF enabled
1G split
DF disabled
DF enabled
64MB split
DF enabled
1G split
~8x ~18x
Parquet Dictionary Filtering [PARQUET-384*]
Avg.CompletionTime[m]
Property Value Description
spark.sql.hive.convertMetastoreParquet true enable native Parquet read path
parquet.filter.statistics.enabled true enable stats filtering
parquet.filter.dictionary.enabled true enable dictionary filtering
spark.sql.parquet.filterPushdown true enable Parquet filter pushdown optimization
spark.sql.parquet.mergeSchema false disable schema merging
spark.sql.hive.convertMetastoreParquet.mergeSchema false use Hive SerDe instead of built-in Parquet support
How to Enable Dictionary Filtering?
Efficient Dynamic Partition Inserts [SPARK-15420*]
• Parquet buffers row group data for each file during writes
• Spark already sorts before writes, but has some limitations
• Detect if the data is already sorted
• Expose the ability to repartition data before write
YARN
Resource
Manager
Parquet R/W
Dynamic
Allocation
Spark
HistoryServer
S3Filesystem
RDD
Node
Manager
Spark
AM
Spark History Server – Where is My Job?
• A large application can prevent new applications from showing up
• not uncommon to see event logs of GBs
• SPARK-13988 makes the processing multi-threaded
• GC tuning helps further
• move from CMS to G1 GC
• allocate more space to young generation
Spark History Server – Where is My Job?
Extract
Transform
Load
group
foreach
join
foreach + filter + store
join
foreach foreach
join
join
join
load + filter load + filter load + filter load + filter load + filter load + filter
Pig vs. Spark: Job #1
0
100
200
300
400
Pig Spark PySpark
~2.4x ~2x
Avg.CompletionTime[s]
Pig vs. Spark (Scala) vs. PySpark
load + filter
cogroup
join
order by + store
foreach
load + filter load + filter
join
foreach
foreach
foreach foreach foreach
Pig vs. Spark: Job #2
0
100
200
300
400
500
Pig Spark PySpark
~3.2x ~1.6x
Avg.CompletionTime[s]
Pig vs. Spark (Scala) vs. PySpark
Prototype DeployBuild Run
S3
Production Workflow
• A rapid innovation platform for targeting algorithms
• 5 hours (vs. 10s of hours) to compute similarity for all Netflix
profiles for 30-day window of new arrival titles
• 10 minutes to score 4M profiles for 14-day window of new
arrival titles
Production Spark Application #1: Yogen
• Personalized ordering of rows of titles
• Enrich page/row/title features with play history
• 14 stages, ~10Ks of tasks, severalTBs
Production Spark Application #2: ARO
What’s Next?
• Improved Parquet support
• Better visibility
• Explore new use cases
Questions?

Weitere ähnliche Inhalte

Was ist angesagt?

Managing Apache Spark Workload and Automatic Optimizing
Managing Apache Spark Workload and Automatic OptimizingManaging Apache Spark Workload and Automatic Optimizing
Managing Apache Spark Workload and Automatic OptimizingDatabricks
 
Running Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and PitfallsRunning Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and PitfallsDatabricks
 
Sputnik: Airbnb’s Apache Spark Framework for Data Engineering
Sputnik: Airbnb’s Apache Spark Framework for Data EngineeringSputnik: Airbnb’s Apache Spark Framework for Data Engineering
Sputnik: Airbnb’s Apache Spark Framework for Data EngineeringDatabricks
 
Scale-Out Using Spark in Serverless Herd Mode!
Scale-Out Using Spark in Serverless Herd Mode!Scale-Out Using Spark in Serverless Herd Mode!
Scale-Out Using Spark in Serverless Herd Mode!Databricks
 
Elastify Cloud-Native Spark Application with Persistent Memory
Elastify Cloud-Native Spark Application with Persistent MemoryElastify Cloud-Native Spark Application with Persistent Memory
Elastify Cloud-Native Spark Application with Persistent MemoryDatabricks
 
Spark on Yarn
Spark on YarnSpark on Yarn
Spark on YarnQubole
 
Apache Spark Development Lifecycle @ Workday - ApacheCon 2020
Apache Spark Development Lifecycle @ Workday - ApacheCon 2020Apache Spark Development Lifecycle @ Workday - ApacheCon 2020
Apache Spark Development Lifecycle @ Workday - ApacheCon 2020Eren Avşaroğulları
 
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...Databricks
 
Spark Summit EU talk by Sital Kedia
Spark Summit EU talk by Sital KediaSpark Summit EU talk by Sital Kedia
Spark Summit EU talk by Sital KediaSpark Summit
 
TuneIn: How to Get Your Hadoop/Spark Jobs Tuned While You’re Sleeping with Ma...
TuneIn: How to Get Your Hadoop/Spark Jobs Tuned While You’re Sleeping with Ma...TuneIn: How to Get Your Hadoop/Spark Jobs Tuned While You’re Sleeping with Ma...
TuneIn: How to Get Your Hadoop/Spark Jobs Tuned While You’re Sleeping with Ma...Databricks
 
Spark Summit EU talk by William Benton
Spark Summit EU talk by William BentonSpark Summit EU talk by William Benton
Spark Summit EU talk by William BentonSpark Summit
 
Spark-on-Yarn: The Road Ahead-(Marcelo Vanzin, Cloudera)
Spark-on-Yarn: The Road Ahead-(Marcelo Vanzin, Cloudera)Spark-on-Yarn: The Road Ahead-(Marcelo Vanzin, Cloudera)
Spark-on-Yarn: The Road Ahead-(Marcelo Vanzin, Cloudera)Spark Summit
 
Data Processing with Apache Spark Meetup Talk
Data Processing with Apache Spark Meetup TalkData Processing with Apache Spark Meetup Talk
Data Processing with Apache Spark Meetup TalkEren Avşaroğulları
 
End-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks Delta
End-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks DeltaEnd-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks Delta
End-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks DeltaDatabricks
 
Apache Spark on Kubernetes Anirudh Ramanathan and Tim Chen
Apache Spark on Kubernetes Anirudh Ramanathan and Tim ChenApache Spark on Kubernetes Anirudh Ramanathan and Tim Chen
Apache Spark on Kubernetes Anirudh Ramanathan and Tim ChenDatabricks
 
Apache Spark Performance is too hard. Let's make it easier
Apache Spark Performance is too hard. Let's make it easierApache Spark Performance is too hard. Let's make it easier
Apache Spark Performance is too hard. Let's make it easierDatabricks
 
Spark Summit EU talk by Brij Bhushan Ravat
Spark Summit EU talk by Brij Bhushan RavatSpark Summit EU talk by Brij Bhushan Ravat
Spark Summit EU talk by Brij Bhushan RavatSpark Summit
 
Next Generation Big Data Platform at Netflix 2014
Next Generation Big Data Platform at Netflix 2014Next Generation Big Data Platform at Netflix 2014
Next Generation Big Data Platform at Netflix 2014Eva Tse
 
Analyzing IOT Data in Apache Spark Across Data Centers and Cloud with NetApp ...
Analyzing IOT Data in Apache Spark Across Data Centers and Cloud with NetApp ...Analyzing IOT Data in Apache Spark Across Data Centers and Cloud with NetApp ...
Analyzing IOT Data in Apache Spark Across Data Centers and Cloud with NetApp ...Databricks
 

Was ist angesagt? (20)

Managing Apache Spark Workload and Automatic Optimizing
Managing Apache Spark Workload and Automatic OptimizingManaging Apache Spark Workload and Automatic Optimizing
Managing Apache Spark Workload and Automatic Optimizing
 
Running Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and PitfallsRunning Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and Pitfalls
 
Sputnik: Airbnb’s Apache Spark Framework for Data Engineering
Sputnik: Airbnb’s Apache Spark Framework for Data EngineeringSputnik: Airbnb’s Apache Spark Framework for Data Engineering
Sputnik: Airbnb’s Apache Spark Framework for Data Engineering
 
Scale-Out Using Spark in Serverless Herd Mode!
Scale-Out Using Spark in Serverless Herd Mode!Scale-Out Using Spark in Serverless Herd Mode!
Scale-Out Using Spark in Serverless Herd Mode!
 
Elastify Cloud-Native Spark Application with Persistent Memory
Elastify Cloud-Native Spark Application with Persistent MemoryElastify Cloud-Native Spark Application with Persistent Memory
Elastify Cloud-Native Spark Application with Persistent Memory
 
Spark on Yarn
Spark on YarnSpark on Yarn
Spark on Yarn
 
Apache Spark Development Lifecycle @ Workday - ApacheCon 2020
Apache Spark Development Lifecycle @ Workday - ApacheCon 2020Apache Spark Development Lifecycle @ Workday - ApacheCon 2020
Apache Spark Development Lifecycle @ Workday - ApacheCon 2020
 
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
 
Spark Summit EU talk by Sital Kedia
Spark Summit EU talk by Sital KediaSpark Summit EU talk by Sital Kedia
Spark Summit EU talk by Sital Kedia
 
TuneIn: How to Get Your Hadoop/Spark Jobs Tuned While You’re Sleeping with Ma...
TuneIn: How to Get Your Hadoop/Spark Jobs Tuned While You’re Sleeping with Ma...TuneIn: How to Get Your Hadoop/Spark Jobs Tuned While You’re Sleeping with Ma...
TuneIn: How to Get Your Hadoop/Spark Jobs Tuned While You’re Sleeping with Ma...
 
Spark Summit EU talk by William Benton
Spark Summit EU talk by William BentonSpark Summit EU talk by William Benton
Spark Summit EU talk by William Benton
 
Cassandra and Spark SQL
Cassandra and Spark SQLCassandra and Spark SQL
Cassandra and Spark SQL
 
Spark-on-Yarn: The Road Ahead-(Marcelo Vanzin, Cloudera)
Spark-on-Yarn: The Road Ahead-(Marcelo Vanzin, Cloudera)Spark-on-Yarn: The Road Ahead-(Marcelo Vanzin, Cloudera)
Spark-on-Yarn: The Road Ahead-(Marcelo Vanzin, Cloudera)
 
Data Processing with Apache Spark Meetup Talk
Data Processing with Apache Spark Meetup TalkData Processing with Apache Spark Meetup Talk
Data Processing with Apache Spark Meetup Talk
 
End-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks Delta
End-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks DeltaEnd-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks Delta
End-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks Delta
 
Apache Spark on Kubernetes Anirudh Ramanathan and Tim Chen
Apache Spark on Kubernetes Anirudh Ramanathan and Tim ChenApache Spark on Kubernetes Anirudh Ramanathan and Tim Chen
Apache Spark on Kubernetes Anirudh Ramanathan and Tim Chen
 
Apache Spark Performance is too hard. Let's make it easier
Apache Spark Performance is too hard. Let's make it easierApache Spark Performance is too hard. Let's make it easier
Apache Spark Performance is too hard. Let's make it easier
 
Spark Summit EU talk by Brij Bhushan Ravat
Spark Summit EU talk by Brij Bhushan RavatSpark Summit EU talk by Brij Bhushan Ravat
Spark Summit EU talk by Brij Bhushan Ravat
 
Next Generation Big Data Platform at Netflix 2014
Next Generation Big Data Platform at Netflix 2014Next Generation Big Data Platform at Netflix 2014
Next Generation Big Data Platform at Netflix 2014
 
Analyzing IOT Data in Apache Spark Across Data Centers and Cloud with NetApp ...
Analyzing IOT Data in Apache Spark Across Data Centers and Cloud with NetApp ...Analyzing IOT Data in Apache Spark Across Data Centers and Cloud with NetApp ...
Analyzing IOT Data in Apache Spark Across Data Centers and Cloud with NetApp ...
 

Andere mochten auch

Distributed tracing - get a grasp on your production
Distributed tracing - get a grasp on your productionDistributed tracing - get a grasp on your production
Distributed tracing - get a grasp on your productionnklmish
 
Big Data at Riot Games – Using Hadoop to Understand Player Experience - Stamp...
Big Data at Riot Games – Using Hadoop to Understand Player Experience - Stamp...Big Data at Riot Games – Using Hadoop to Understand Player Experience - Stamp...
Big Data at Riot Games – Using Hadoop to Understand Player Experience - Stamp...StampedeCon
 
Getting The Best Performance With PySpark
Getting The Best Performance With PySparkGetting The Best Performance With PySpark
Getting The Best Performance With PySparkSpark Summit
 
Airstream: Spark Streaming At Airbnb
Airstream: Spark Streaming At AirbnbAirstream: Spark Streaming At Airbnb
Airstream: Spark Streaming At AirbnbJen Aman
 
Netflix - Productionizing Spark On Yarn For ETL At Petabyte Scale
Netflix - Productionizing Spark On Yarn For ETL At Petabyte ScaleNetflix - Productionizing Spark On Yarn For ETL At Petabyte Scale
Netflix - Productionizing Spark On Yarn For ETL At Petabyte ScaleJen Aman
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Databricks
 

Andere mochten auch (6)

Distributed tracing - get a grasp on your production
Distributed tracing - get a grasp on your productionDistributed tracing - get a grasp on your production
Distributed tracing - get a grasp on your production
 
Big Data at Riot Games – Using Hadoop to Understand Player Experience - Stamp...
Big Data at Riot Games – Using Hadoop to Understand Player Experience - Stamp...Big Data at Riot Games – Using Hadoop to Understand Player Experience - Stamp...
Big Data at Riot Games – Using Hadoop to Understand Player Experience - Stamp...
 
Getting The Best Performance With PySpark
Getting The Best Performance With PySparkGetting The Best Performance With PySpark
Getting The Best Performance With PySpark
 
Airstream: Spark Streaming At Airbnb
Airstream: Spark Streaming At AirbnbAirstream: Spark Streaming At Airbnb
Airstream: Spark Streaming At Airbnb
 
Netflix - Productionizing Spark On Yarn For ETL At Petabyte Scale
Netflix - Productionizing Spark On Yarn For ETL At Petabyte ScaleNetflix - Productionizing Spark On Yarn For ETL At Petabyte Scale
Netflix - Productionizing Spark On Yarn For ETL At Petabyte Scale
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
 

Ähnlich wie Spark on Yarn for ETL at Netflix

Running Presto and Spark on the Netflix Big Data Platform
Running Presto and Spark on the Netflix Big Data PlatformRunning Presto and Spark on the Netflix Big Data Platform
Running Presto and Spark on the Netflix Big Data PlatformEva Tse
 
(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data Platform(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data PlatformAmazon Web Services
 
SF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at Lyft
SF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at LyftSF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at Lyft
SF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at LyftChester Chen
 
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Databricks
 
Spark: The State of the Art Engine for Big Data Processing
Spark: The State of the Art Engine for Big Data ProcessingSpark: The State of the Art Engine for Big Data Processing
Spark: The State of the Art Engine for Big Data ProcessingRamaninder Singh Jhajj
 
Highlights and Challenges from Running Spark on Mesos in Production by Morri ...
Highlights and Challenges from Running Spark on Mesos in Production by Morri ...Highlights and Challenges from Running Spark on Mesos in Production by Morri ...
Highlights and Challenges from Running Spark on Mesos in Production by Morri ...Spark Summit
 
Scaling spark on kubernetes at Lyft
Scaling spark on kubernetes at LyftScaling spark on kubernetes at Lyft
Scaling spark on kubernetes at LyftLi Gao
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksAnyscale
 
What we unlearned_and_learned_by_moving_from_m9000_to_ssc_ukoug2014
What we unlearned_and_learned_by_moving_from_m9000_to_ssc_ukoug2014What we unlearned_and_learned_by_moving_from_m9000_to_ssc_ukoug2014
What we unlearned_and_learned_by_moving_from_m9000_to_ssc_ukoug2014Philippe Fierens
 
Build Large-Scale Data Analytics and AI Pipeline Using RayDP
Build Large-Scale Data Analytics and AI Pipeline Using RayDPBuild Large-Scale Data Analytics and AI Pipeline Using RayDP
Build Large-Scale Data Analytics and AI Pipeline Using RayDPDatabricks
 
Extending Spark Streaming to Support Complex Event Processing
Extending Spark Streaming to Support Complex Event ProcessingExtending Spark Streaming to Support Complex Event Processing
Extending Spark Streaming to Support Complex Event ProcessingOh Chan Kwon
 
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...Databricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
 
Tracing the Breadcrumbs: Apache Spark Workload Diagnostics
Tracing the Breadcrumbs: Apache Spark Workload DiagnosticsTracing the Breadcrumbs: Apache Spark Workload Diagnostics
Tracing the Breadcrumbs: Apache Spark Workload DiagnosticsDatabricks
 
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...DataWorks Summit/Hadoop Summit
 
Big Telco - Yousun Jeong
Big Telco - Yousun JeongBig Telco - Yousun Jeong
Big Telco - Yousun JeongSpark Summit
 
Big Telco Real-Time Network Analytics
Big Telco Real-Time Network AnalyticsBig Telco Real-Time Network Analytics
Big Telco Real-Time Network AnalyticsYousun Jeong
 
Apache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupApache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupNed Shawa
 

Ähnlich wie Spark on Yarn for ETL at Netflix (20)

Running Presto and Spark on the Netflix Big Data Platform
Running Presto and Spark on the Netflix Big Data PlatformRunning Presto and Spark on the Netflix Big Data Platform
Running Presto and Spark on the Netflix Big Data Platform
 
(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data Platform(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data Platform
 
SF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at Lyft
SF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at LyftSF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at Lyft
SF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at Lyft
 
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
 
Spark: The State of the Art Engine for Big Data Processing
Spark: The State of the Art Engine for Big Data ProcessingSpark: The State of the Art Engine for Big Data Processing
Spark: The State of the Art Engine for Big Data Processing
 
Highlights and Challenges from Running Spark on Mesos in Production by Morri ...
Highlights and Challenges from Running Spark on Mesos in Production by Morri ...Highlights and Challenges from Running Spark on Mesos in Production by Morri ...
Highlights and Challenges from Running Spark on Mesos in Production by Morri ...
 
Scaling spark on kubernetes at Lyft
Scaling spark on kubernetes at LyftScaling spark on kubernetes at Lyft
Scaling spark on kubernetes at Lyft
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on Databricks
 
What we unlearned_and_learned_by_moving_from_m9000_to_ssc_ukoug2014
What we unlearned_and_learned_by_moving_from_m9000_to_ssc_ukoug2014What we unlearned_and_learned_by_moving_from_m9000_to_ssc_ukoug2014
What we unlearned_and_learned_by_moving_from_m9000_to_ssc_ukoug2014
 
Build Large-Scale Data Analytics and AI Pipeline Using RayDP
Build Large-Scale Data Analytics and AI Pipeline Using RayDPBuild Large-Scale Data Analytics and AI Pipeline Using RayDP
Build Large-Scale Data Analytics and AI Pipeline Using RayDP
 
Spark cep
Spark cepSpark cep
Spark cep
 
Extending Spark Streaming to Support Complex Event Processing
Extending Spark Streaming to Support Complex Event ProcessingExtending Spark Streaming to Support Complex Event Processing
Extending Spark Streaming to Support Complex Event Processing
 
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Tracing the Breadcrumbs: Apache Spark Workload Diagnostics
Tracing the Breadcrumbs: Apache Spark Workload DiagnosticsTracing the Breadcrumbs: Apache Spark Workload Diagnostics
Tracing the Breadcrumbs: Apache Spark Workload Diagnostics
 
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
 
Big Telco - Yousun Jeong
Big Telco - Yousun JeongBig Telco - Yousun Jeong
Big Telco - Yousun Jeong
 
Big Telco Real-Time Network Analytics
Big Telco Real-Time Network AnalyticsBig Telco Real-Time Network Analytics
Big Telco Real-Time Network Analytics
 
Apache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupApache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetup
 
Spark on yarn
Spark on yarnSpark on yarn
Spark on yarn
 

Kürzlich hochgeladen

Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 

Kürzlich hochgeladen (20)

Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 

Spark on Yarn for ETL at Netflix

Hinweis der Redaktion

  1. Good afternoon everybody. I'm Ashwin and this is Nezih. We work for the big data platform team at Netflix. Today we are going to talk about our experience productionizing spark for running ETL workloads on YARN
  2. I'm sure all of you have seen this netflix page before. Do you also know that spark contributes pipeline that does personalized ordering of the rows in this page. We will talk about this application in the second part of the presentation. Not just that, spark is also used to train some of machine learning models that powers our recommendations alogorithms.
  3. Netflix is a data driven company collect a lot of data from user interactions, platforms, services One example where we use this data is A/B testing. Lets say we have UI change actually improves the experience The biggest challenge that we face…
  4. Just to give an idea of at what scale we operate, I'm going to share with you some numbers.
  5. this would generate a lot of data. But how much
  6. Today, we'll walk you through three sections
  7. We need to answer two questions. First ques how do we get data in platform? It looks like this
  8. two pipelines. event data, so this is all the data from devices and microservices running out in the cloud. This goes into a kakfa clusters,ursula. This pipeline runs under a min. At the bottom, dimension data pipeline. live data in casssandra. Everyday this data is backed up and a tool called Aegisthus processes these backups and writes into our data warehouse. In the end, we have all of event and dimension data on s3, ready for processing.
  9. At this point, our data is on s3 which is our source of truth, parquet .. At the very top in our interface. hundreds of internal users using platform. bigdataportal, one stop shop to access our big data services. From this portal, users pick a compute engine and run their jobs. cool features like schema browsing and accessing s3. The portal uses the Bigdata api under the hood..
  10. Here is how our spark on yarn deployment looks like Thanks to genie, multiple spark versions. This helps experiment with new features, bug fixes To make best use of our resources, spark alongside mapreduce
  11. Now lets move on to the technical challenges. We are going to walk through the lifecycle of a spark job, as we do that we'll talk about the challenges that we had.
  12. resource manager, master process, managing the cluster resources. driver via the AM requests from initial set of executors from the RM, and NM launches In a typical spark job, probably use the RDD interface.. We faced two technical challenges here. the first one is a limitation of RDD interface and second one is perf optimization. T : Here is the first one
  13. Recently we had a usecase where we wanted to merge small files. merges partitions according to a given number of paritions, but we want to merge by file size added support for passing a custom coalescer to this interface. and we implemented a size based coalescing strategy T : Now lets look at the optimization in UnionRDD.
  14. We noticed select * from table limit 10, slow against partitioned hive table Found each hive partition, spark hadoopRDD,UnionRDD. reason poor performance sequential listing of parent hadoopRDD partitions. resolved this by computing the partitions of unionRDD in parallel. T : This fix was good but we found another important problem in s3 filesystem that was affecting the listing performance.
  15. in hadoop we found, Filesystem making an unnecssary rpc call to s3, which is expensive. We fixed this issue by removing this redundant rpc call. This along with Union RDD fix improved in split calculation performance by 20x. T: : This problem was on the read path, we hit another issue on the write path, which was with Output committers.
  16. Output committer writes the output of single task in an all or nothing fashion. Each task writes output to a temp directory. First successful.. But this doesn't work well with S3. To solve this we use our internal s3 output committer, which we plan to open source in the future.
  17. At this point, driver has finished listing and it starts running tasks on the executors. dynamic allocation will kick in, and scale job's execturos up and down based on the workload. We had several major problems with dyn alloc and I'll now talk about two of them here. T : Here is the first one.
  18. When dynamic allocation is enabled, we found that reading broadcast data can take a long time. reason is executors retrieve block locations only once from the driver during the read. But with dynamic allocation , replicas can be removed resulting in stale entries in the retrived block locations. Production job example We fixed this problem on the executor side, now the executors refresh the block locations from the driver after multiple failed attempts. T: The second problem we had is with locality optimization.
  19. noticed a spark application failed due to timeouts. On debugging we found that AM would cancel and resend container requests if the one, locality preference is no longer needed or two if locality preference is not set. Problem is we run on s3 and s3 doesn't have locality information set, which was why we see this thrashing. We fixed this by not cancelling container requests that don't have a locality pref set.
  20. See also https://issues.apache.org/jira/browse/SPARK-8890 Inserting a repartition minimizes the number of output files and the memory consumption. In progress, not merged.
  21. Due to sequential processing one large application event log can block new applications from showing up. Other configs which helped fix OOM errors Move from CMS to G1GC Garbage collector -XX:NewRatio=1 (give more space to YoungGen)
  22. Available in Spark 2.0