Netflix is a data-driven organization that emphasizes data quality, availability, and the agility to capture and process data. Some of our recommendation algorithms are computed as events happen in real time. Such streaming applications are long-running tasks that need to be resilient, especially in a cloud deployment, where resources are ephemeral. In this talk, we cover the what, the why, and the how of our resiliency exercise with Spark Streaming in an AWS cloud deployment. To simulate failures, we used an approach based on Netflix's Chaos Monkey, which randomly terminates instances or processes. We hope this exercise helps build confidence in the resiliency of Spark Streaming in similar contexts.
4. Agenda
● Background
● Use cases for Real Time Stream Processing
● Motivations for Spark
● Creating Chaos
● Spark Streaming Primer
● Deployment Setup
● Injecting Chaos in Spark
● Future
7. Scale at Netflix
● 400 Billion events per day
● 8 Million events/sec during peak
● Numerous types of events (UI events, play events, impression events, etc.)
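To put the two figures above side by side, here is a small back-of-the-envelope calculation. It uses only the numbers on the slide; the function names are illustrative. It shows that 400 billion events per day works out to roughly 4.6 million events/sec on average, so the stated peak of 8 million events/sec is about 1.7 times the average rate.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def average_events_per_second(events_per_day: float) -> float:
    """Average rate implied by a daily event volume."""
    return events_per_day / SECONDS_PER_DAY


def peak_to_average_ratio(peak_per_second: float, events_per_day: float) -> float:
    """How much higher the peak rate is than the daily average rate."""
    return peak_per_second / average_events_per_second(events_per_day)


if __name__ == "__main__":
    daily = 400e9  # 400 billion events per day (from the slide)
    peak = 8e6     # 8 million events/sec during peak (from the slide)
    print(f"average: {average_events_per_second(daily):,.0f} events/sec")
    print(f"peak / average: {peak_to_average_ratio(peak, daily):.2f}")
```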
8. What do we do with it?
● Event logs are captured into Hadoop (EMR)
● Run ETL jobs using Hive/Presto to
○ Provide input for pre-computing recommendations
○ Analyze user behavior
○ Support data analysis and reporting
10. Use Cases for Stream Processing
Recommendations based on collective real time signals
11. Use Cases for Stream Processing
Faster identification of Data Anomalies and Regressions
[Figure annotation: Bad iPhone push]
13. Motivations for Spark
● Popular compute engine for batch processing
● Widely used for offline experimentation at Netflix
● Improves agility with interactive queries
[Figure: Interactive Experimenter’s Notebook]
14. Motivations for Spark
Single platform to build batch and real-time applications
[Diagram labels: S3, Micro Services, Batch Data, Streaming Data, Spark, Spark Streaming, Recommender Systems]
16. Challenges in Cloud
● Ephemeral Resources
● Cannot rely on local state
● No fixed IP
17. Chaos Monkey Approach
● Simulate failures by randomly killing components
● Failures inevitably happen when least desired
● Lather, Rinse, Repeat!
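The kill-and-repeat loop described above can be sketched as a toy simulation. This is not the Chaos Monkey tool itself and uses none of its APIs. `Component`, `chaos_round`, `run_chaos`, the component names, and the `supervisor` recovery policy are all hypothetical, chosen only to illustrate the idea: kill a random component, give the system a chance to recover, check its health, and repeat.

```python
import random


class Component:
    """Hypothetical stand-in for an instance or process that can be killed."""

    def __init__(self, name):
        self.name = name
        self.alive = True

    def kill(self):
        self.alive = False

    def restart(self):
        self.alive = True


def chaos_round(components, rng):
    """Kill one randomly chosen live component; return it, or None if none are alive."""
    live = [c for c in components if c.alive]
    if not live:
        return None
    victim = rng.choice(live)
    victim.kill()
    return victim


def run_chaos(components, rounds, recover, seed=None):
    """Repeat: kill a random component, let `recover` react, record system health."""
    rng = random.Random(seed)
    history = []
    for _ in range(rounds):
        victim = chaos_round(components, rng)
        recover(components)  # the system under test is expected to heal itself here
        healthy = all(c.alive for c in components)
        history.append((victim.name if victim else None, healthy))
    return history


if __name__ == "__main__":
    # Component names are illustrative only.
    cluster = [Component(n) for n in ("master", "worker-1", "worker-2", "driver")]

    def supervisor(components):
        # Toy recovery policy: restart anything that is down.
        for c in components:
            if not c.alive:
                c.restart()

    for victim, healthy in run_chaos(cluster, rounds=5, recover=supervisor, seed=42):
        print(f"killed {victim}: healthy after recovery = {healthy}")
```

Passing a `recover` function that does nothing shows the opposite case: every round leaves the system unhealthy, which is the kind of gap such an exercise is meant to expose.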
24. How does streaming work?
● Data streams are processed in batches
● Each batch is processed in Spark
● Results are pushed out in batches
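The three steps above can be illustrated with a minimal, framework-free sketch of micro-batching. It does not use the Spark Streaming API. `micro_batches`, `run_stream`, the `(timestamp, payload)` event shape, and the one-second interval are assumptions made for illustration. A time-ordered stream is cut into fixed-interval batches, each batch is handed to a processing function, and each result is pushed to a sink.

```python
from collections import Counter
from typing import Callable, Iterable, Iterator, List, Tuple

Event = Tuple[float, str]  # (timestamp in seconds, payload); hypothetical shape


def micro_batches(events: Iterable[Event], interval: float) -> Iterator[List[Event]]:
    """Group a time-ordered stream into consecutive batches of `interval` seconds.

    Intervals with no events produce an empty batch, so downstream code
    sees one batch per interval.
    """
    batch: List[Event] = []
    window_end = None
    for ts, payload in events:
        if window_end is None:
            window_end = ts + interval
        while ts >= window_end:
            yield batch
            batch = []
            window_end += interval
        batch.append((ts, payload))
    if batch:
        yield batch


def run_stream(events: Iterable[Event], interval: float,
               process: Callable[[List[Event]], object],
               sink: Callable[[object], None]) -> None:
    """Process each micro-batch and push its result out, one batch at a time."""
    for batch in micro_batches(events, interval):
        sink(process(batch))


if __name__ == "__main__":
    stream = [(0.0, "play"), (0.4, "click"), (1.2, "play"), (3.1, "click")]

    def count_by_payload(batch):
        return Counter(payload for _, payload in batch)

    run_stream(stream, interval=1.0, process=count_by_payload, sink=print)
```

With the sample stream, the four intervals hold 2, 1, 0, and 1 events respectively, so the sink receives four results, one of them empty.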
26. Application Details
● Process subset of UI Events from Kafka
● Compute aggregate metrics
● Publish metrics to Atlas
● Spark 1.2.0
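As a rough illustration of the per-batch work listed above, the following sketch filters a batch down to a subset of UI events, computes aggregate metrics, and hands them to a publisher. It is not the application's actual code and uses no Kafka, Spark, or Atlas APIs. The event fields (`type`, `latency_ms`), the chosen metrics (count and mean latency), and the `publish` callback are all hypothetical.

```python
from collections import defaultdict


def select_ui_events(events, wanted_types):
    """Keep only the subset of UI events the application cares about."""
    return [e for e in events if e["type"] in wanted_types]


def aggregate(events):
    """Compute a count and a mean latency per event type for one batch."""
    totals = defaultdict(lambda: {"count": 0, "latency_sum": 0.0})
    for e in events:
        t = totals[e["type"]]
        t["count"] += 1
        t["latency_sum"] += e["latency_ms"]
    return {
        name: {"count": t["count"],
               "mean_latency_ms": t["latency_sum"] / t["count"]}
        for name, t in totals.items()
    }


def process_batch(events, wanted_types, publish):
    """Filter, aggregate, and publish one batch; `publish` is a hypothetical callback."""
    metrics = aggregate(select_ui_events(events, wanted_types))
    for name, values in sorted(metrics.items()):
        publish(name, values)
    return metrics


if __name__ == "__main__":
    batch = [
        {"type": "click", "latency_ms": 10.0},
        {"type": "click", "latency_ms": 30.0},
        {"type": "scroll", "latency_ms": 5.0},
        {"type": "hover", "latency_ms": 1.0},  # not in the wanted subset
    ]
    process_batch(batch, {"click", "scroll"},
                  publish=lambda name, values: print(name, values))
```

Keeping the publisher as a plain callback makes the aggregation testable without a metrics backend, and a real client could be substituted at that one point.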
27. Standalone Cluster Manager
● Provides resource management and resiliency
● All in one package
○ Built-in, easy to deploy
○ Troubleshoot issues with a single team (Databricks)