Netflix - Productionizing Spark On Yarn For ETL At Petabyte Scale

Ashwin Shankar
Nezih Yigitbasi
Productionizing
Spark onYarn
for ETL

81+ million
members
Global 1000+ devices
supported
125 million
hours / day
Netflix Key Business Metrics

40 PB DW Read 3PB Write 300TB700B Events
Netflix Key Platform Metrics

Outline
● Big Data Platform Architecture
● Technical Challenges
● ETL

Big Data Platform
Architecture

Cloud
apps
Kafka Ursula
Cassandra
Aegisthus
Dimension Data
Event Data
~ 1 min
Daily
S3
SS
Tables
Data Pipeline

Storage
Compute
Service
Tools
Big Data APIBig Data Portal
S3 Parquet
Transport VisualizationQuality Pig Workflow Vis Job/Cluster Vis
Interface
Execution Metadata
Notebooks

• 3000 EC2 nodes on two clusters (d2.4xlarge)
• Multiple Spark versions
• Share the same infrastructure with MapReduce jobs
S M
S M
S M
M
…
16 vcores
120 GB
M
S
MapReduceS
S M
S M
S M
MS
Spark
S M
S M
S M
MS
S M
S M
S M
MS
Spark onYARN at Netflix

YARN
Resource
Manager
Node
Manager
Spark
AM
RDD

Custom Coalescer Support [SPARK-14042]
• coalesce() can only“merge” using the given number of partitions
– how to merge by size?
• CombineFileInputFormat with Hive
• Support custom partition coalescingstrategies

• Parent RDD partitions are listed sequentially
• Slow for tables with lots of partitions
• Parallelizelistingof parent RDD partitions
UnionRDD Parallel Listing [SPARK-9926]

YARN
Resource
Manager
S3Filesystem
RDD
Node
Manager
Spark
AM

• UnnecessarygetFileStatus() call
• SPARK-9926 and HADOOP-12810 yield faster startup
• ~20x speedup in input split calculation
Optimize S3 Listing Performance [HADOOP-12810]

Output Committers
Hadoop Output Committer
• Write to a temp directory and rename to destination on success
• S3 rename => copy + delete
• S3 is eventually consistent
S3 Output Committer
• Write to localdisk and upload to S3 on success
• avoid redundant S3 copy
• avoid eventual consistency

YARN
Resource
Manager
Dynamic
Allocation
S3Filesystem
RDD
Node
Manager
Spark
AM

• Broadcast joins/variables
• Replicascan be removed with dynamicallocation
Poor Broadcast Read Performance [SPARK-13328]
...
16/02/13 01:02:27WARN BlockManager:
Failed to fetch remote block broadcast_18_piece0 (failed attempt70)
...
16/02/13 01:02:27INFOTorrentBroadcast:
Reading broadcast variable 18 took 1051049 ms
• Refresh replicalocations from the driver on multiple failures

• Cancel & resend pending container requests
• if the locality preference is no longer needed
• if no locality preference is set
• No localityinformationwith S3
• Do not cancelrequests without localitypreference
Incorrect Locality Optimization [SPARK-13779]

YARN
Resource
Manager
Parquet R/W
Dynamic
Allocation
S3Filesystem
RDD
Node
Manager
Spark
AM

A B C D
a1 b1 c1 d1
… … … …
aN bN cN dN
A B C D
dictionary
from “Analytic Data Storage in Hadoop”, Ryan Blue
Parquet Dictionary Filtering [PARQUET-384*]

0
10
20
30
40
50
60
70
80
DF disabled DF enabled
64MB split
DF enabled
1G split
DF disabled
DF enabled
64MB split
DF enabled
1G split
~8x ~18x
Parquet Dictionary Filtering [PARQUET-384*]
Avg. Completion Time [m]

Property Value Description
spark.sql.hive.convertMetastoreParquet true enable native Parquet read path
parquet.filter.statistics.enabled true enable stats filtering
parquet.filter.dictionary.enabled true enable dictionary filtering
spark.sql.parquet.filterPushdown true enable Parquet filter pushdown optimization
spark.sql.parquet.mergeSchema false disable schema merging
spark.sql.hive.convertMetastoreParquet.mergeSchema false use Hive SerDe instead of built-in Parquet support
How to Enable Dictionary Filtering?

Efficient Dynamic Partition Inserts [SPARK-15420*]
• Parquet buffers row group data for each file during writes
• Spark alreadysorts before writes, but has some limitations
• Detect if the data is already sorted
• Expose the abilityto repartition data before write

YARN
Resource
Manager
Parquet R/W
Dynamic
Allocation
Spark
HistoryServer
S3Filesystem
RDD
Node
Manager
Spark
AM

Spark History Server – Where is My Job?

• A large applicationcan prevent new applicationsfrom showing up
• not uncommon to see event logs of GBs
• SPARK-13988 makes the processingmulti-threaded
• GC tuning helps further
• move fromCMS to G1GC
• allocate more space to young generation
Spark History Server – Where is My Job?

group
foreach
join
foreach + filter + store
join
foreach foreach
join
join
join
load + filter load + filter load + filter load + filter load + filter load + filter
Pig vs. Spark

0
100
200
300
400
Pig Spark PySpark
~2.4x ~2x
Avg. Completion Time [s]
Pig vs. Spark (Scala) vs. PySpark

Prototype DeployBuild Run
S3
Production Workflow

• A rapid innovation platform for targeting algorithms
• 5 hours (vs. 10s of hours) to compute similarityfor all Netflix
profilesfor 30-day window of new arrival titles
• 10 minutes to score 4M profilesfor 14-day window of new
arrival titles
Production Spark Application #1: Yogen

• Personalizedordering of rows of titles
• Enrich page/row/title features with playhistory
• 14 stages, ~10Ks of tasks, severalTBs
Production Spark Application #2: ARO

What’s Next?
• Improved Parquet support
• Better visibility
• Explore new use cases

Netflix - Productionizing Spark On Yarn For ETL At Petabyte Scale

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Netflix - Productionizing Spark On Yarn For ETL At Petabyte Scale

Similar to Netflix - Productionizing Spark On Yarn For ETL At Petabyte Scale (20)

More from Jen Aman

More from Jen Aman (20)

Recently uploaded

Recently uploaded (20)

Netflix - Productionizing Spark On Yarn For ETL At Petabyte Scale