(BDT318) How Netflix Handles Up To 8 Million Events Per Second

© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Peter Bakas, Director of Engineering, Event and Data Pipelines, Netflix
October 2015
BDT318
Netflix Keystone
How Netflix Handles Data Streams Up to 8 Million Events Per Second

Peter Bakas
@ Netflix : Cloud Platform Engineering - Event and Data Pipelines
@ Ooyala : Analytics, Discovery, Platform Engineering & Infrastructure
@ Yahoo : Display Advertising, Behavioral Targeting, Payments
@ PayPal : Site Engineering and Architecture
@ Play : Advisor to various Startups (Data, Security, Containers)
Who is this guy?

What to Expect from the Session
• Architectural design and principles for Keystone
• Current state of technologies that Keystone is leveraging
• Best practices in operating Kafka and Samza

Publish, Collect, Process, Aggregate & Move Data

• 550 billion events per day
• 8.5 million events & 21 GB per second during peak
• 1+ PB per day
• Hundreds of event types
By the numbers

Split Fronting Kafka Clusters
Normal-priority (majority)
• 2 copies, 12 hour retention
High-priority (streaming activities etc.)
• 3 copies, 24 hour retention

Split Fronting Kafka Clusters
Instance type - D2XL
• Large disk (6TB) with 450-475MB/s of sequential I/O
throughput measured
• Large memory (30GB)
• Medium network capability (~ 700Mbps)
• Replication lag starts to show when bytes in above
18MB/second per broker with thousands of partition

• PR is available to Apache Kafka
• https://github.com/apache/kafka/pull/132
• https://issues.apache.org/jira/browse/KAFKA-1215
• Improved availability
• Reduce cost of maintenance
Kafka Zone Aware Replica Assignment

Control Plane + Data Plane
• Control plane for router is job manager
• Infrastructure is data plane
• Declarative, reconciliation driven
• Smart scheduling managing tradeoffs
• Auto Scaling based on traffic
• Fault tolerance
• Application (router) faults
• AWS hardware faults

Routing Service - Samza
Amazon S3 Routing
• ~5800 long running containers for Amazon S3 routing
• ~500 C3-4XL AWS instances for Amazon S3 routing
Elasticsearch Routing
• ~850 long running containers for Elasticsearch routing
• ~70 C3-4XL AWS instances for Elasticsearch routing
Kafka Routing
• ~3200 long running containers for Kafka routing
• ~280 C3-4XL AWS instances for Kafka routing

Container Footprint:
• 2G - 5G memory
• 160 mbps max network bandwidth
• 1 CPU Share
• 20G disk for buffer & logs
• Processes 1-12 partitions
• Periodically reports health to infrastructure

Observed Numbers:
• Avg memory usage of ~1.8G per container
• Avg memory usage per node ~20G(Range: 7G - 25G)
• Avg CPU utilization of 8% per node
• Avg NetworkIn 256Mbps per node
• Avg NetworkOut 156Mbps per node

Publish to Amazon S3 sink:
• Every 200mb or 2 mins
• S3 average upload latency 200ms
Producer to Router latency:
• 30 percentile topics under 500 ms
• 70 percentile topics under 1 sec
• 90 percentile under 2 sec
• Overall average about 2.5 seconds
Kafka to Router consumer lag (est time to catch up):
• 65 percentile under 500ms
• 90 percentile under 5 seconds

+ Alterations
• Internal build of Samza version 0.9.1
• Fixed SAMZA-41 in 0.9.1
• Support static partition range assignment
• Added SAMZA-775 in 0.9.1
• Prefetch buffer specified based on heap to use
• Backported SAMZA-655 to 0.9.1
• Environment variable configuration rewriter
• Backported SAMZA-540 to version 0.9.1
• Expose latency related metrics in OffsetManager
• Integration with Netflix Alert & Monitoring systems

• Streaming jobs to analyze movie plays, A/B tests, etc.
• Direct API for Kafka in 1.3
• Observed 2x performance improvement compared to 1.2
• Additional improvement possible with prefetching and connection pooling
(not available yet)
• Campaigned for backpressure support
• Result - Spark 1.5 release has community developed back pressure
support SPARK-7398

Annotation-based event definition
@Resource(type = ConsumerStorageType.DB, name =
"S3Diagnostics")
public class S3Diagnostics implements Annotatable {
....
S3Diagnostics s3Diagnostics = new S3Diagnostics();
....
LogManager.logEvent(s3Diagnostics); // log this diagnostic
event
Java

{
"eventName" : "ksproxytest",
"payload" : {
"k1" : "v1",
"k2" : "v2"
}
}
Non-Java : Keystone Proxy

Wire format
• Extensible
• Currently supports JSON
• Will support Avro
• Encapsulated as a shareable jar
• Immutable message through the pipeline

Producer Resilience
• Outage should never disrupt existing instances from serving business
purpose
• Outage should never prevent new instances from starting up
• After service is restored, event producing should resume
automatically

Fail, but never block
block.on.buffer.full=false
handle potential blocking of first metadata request

• Broker monitoring
• Alert on offline broker from ZooKeeper
• Consumer monitoring
• Alert on consumer lag/stuck and unconsumed partitions
• Heart-beating
• Produce/consume messages and measure latency
• Broker performance testing
• Produce tens of thousands messages per second on single instance
• Create multiple consumer groups to test consumer impact on broker
Auditor

Consumer Offset
Stuck Consumer Unconsumed Partitions
Auditor - Consumer Monitoring
Consumer Lag

New Internal Automation Engine:
• Collect diagnostic information based on alerts & other operational
events
• Help services self heal
• Reduce MTTR
• Reduce pager fatigue
• Improve productivity for developer
Winston

• Performance tuning + optimizations
• Self service
• Schemas + registry
• Event discovery + visualization
• Open Source Auditor/Kaffee
Near Term

Global real-time data stream + stream processing network

Office Hours
Wed 4:00PM – 5:30PM
@ Booth
pbakas@netflix.com
@peter_bakas

Remember to complete
your evaluations!

(BDT318) How Netflix Handles Up To 8 Million Events Per Second

Empfohlen

Empfohlen

Weitere ähnliche Inhalte

Was ist angesagt?

Was ist angesagt? (20)

Andere mochten auch

Andere mochten auch (11)

Ähnlich wie (BDT318) How Netflix Handles Up To 8 Million Events Per Second

Ähnlich wie (BDT318) How Netflix Handles Up To 8 Million Events Per Second (20)

Mehr von Amazon Web Services

Mehr von Amazon Web Services (20)

Kürzlich hochgeladen

Kürzlich hochgeladen (20)

(BDT318) How Netflix Handles Up To 8 Million Events Per Second