Netflix Data Pipeline with Kafka
Allen Wang & Steven Wu
Agenda
● Introduction
● Evolution of Netflix data pipeline
● How do we use Kafka
What is Netflix?
Netflix is a logging company
that occasionally streams video
Numbers
● 400 billion events per day
● 8 million events & 17 GB per second during peak
● hundreds of event types
Agenda
● Introduction
● Evolution of Netflix data pipeline
● How do we use Kafka
Mission of Data Pipeline
Publish, Collect, Aggregate, Move Data @ Cloud Scale
In the old days ...
[Diagram: Event Producer → S3 → EMR]
Nowadays ...
Existing Data Pipeline
[Diagram: Event Producer → Router → S3/EMR, Druid, Stream Consumers]
Into the Future ...
New Data Pipeline
[Diagram: Event Producer → Fronting Kafka → Router → S3/EMR, Druid; Consumer Kafka → Stream Consumers]
Serving Consumers off Diff Clusters
[Diagram: Event Producer → Fronting Kafka → Router → S3/EMR, Druid; Consumer Kafka → Stream Consumers]
Split Fronting Kafka Clusters
● Low-priority (error log, request trace, etc.)
o 2 copies, 1-2 hour retention
● Medium-priority (majority)
o 2 copies, 4 hour retention
● High-priority (streaming activities etc.)
o 3 copies, 12-24 hour retention
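These tiers translate naturally into per-topic replication and retention settings. As a rough illustration only (the AdminClient API used here postdates this talk, and the topic name, partition count, and bootstrap address are placeholders, not Netflix's configuration):

import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

// Hypothetical setup of a high-priority fronting topic: 3 copies, 12 hour retention.
public class FrontingTopicSetup {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder address
        try (AdminClient admin = AdminClient.create(props)) {
            NewTopic highPriority = new NewTopic("streaming-activities", 32, (short) 3)
                .configs(Map.of("retention.ms", String.valueOf(12 * 60 * 60 * 1000L)));
            admin.createTopics(List.of(highPriority)).all().get();
        }
    }
}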
Producer Resilience
● Kafka outage should never disrupt existing instances from serving their business purpose
● Kafka outage should never prevent new instances from starting up
● After the Kafka cluster is restored, event producing should resume automatically
Fail but Never Block
● block.on.buffer.full=false
● Handle potential blocking of the first metadata request
● Periodically check whether the KafkaProducer was opened successfully
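Putting these bullets together, a minimal sketch of the pattern (not the actual Netflix producer wrapper; the class name and drop-on-error policy are illustrative, and block.on.buffer.full belongs to the producer client of that era):

import java.util.Properties;
import java.util.concurrent.Executors;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// Sketch of a wrapper that fails fast instead of ever blocking the caller.
public class NonBlockingEventProducer {
    private volatile KafkaProducer<byte[], byte[]> producer; // null until opened

    public NonBlockingEventProducer(String bootstrapServers) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        props.put("block.on.buffer.full", "false"); // fail instead of blocking when the buffer fills
        props.put("key.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
        // Open the producer off the caller's thread: the first metadata request can
        // block when Kafka is down, and that must never stall application startup.
        Executors.newSingleThreadExecutor().submit(() -> producer = new KafkaProducer<>(props));
    }

    public boolean isOpen() { // for the periodic "opened successfully?" check
        return producer != null;
    }

    public void send(String topic, byte[] event) {
        KafkaProducer<byte[], byte[]> p = producer;
        if (p == null) {
            return; // Kafka outage at startup: drop the event, never block
        }
        try {
            p.send(new ProducerRecord<>(topic, event), (metadata, e) -> {
                // e != null means the send failed (e.g. buffer exhausted):
                // count the drop in metrics and move on, never propagate.
            });
        } catch (Exception e) {
            // e.g. BufferExhaustedException: drop and move on
        }
    }
}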
Agenda
● Introduction
● Evolution of Netflix data pipeline
● How do we use Kafka
What Does It Take to Run In Cloud
● Support elasticity
● Respond to scaling events
● Resilience to failures
o Favors architecture without a single point of failure
o Retries, smart routing, fallback ...
Kafka in AWS - How do we make it happen
● Inside our Kafka JVM
● Services supporting Kafka
● Challenges/Solutions
● Our roadmap
Netflix Kafka Container
[Diagram: Kafka JVM containing the Kafka broker plus metric reporting, health check service, and bootstrap]
Bootstrap
● Broker ID assignment
o Instances obtain sequential numeric IDs using Curator’s locks recipe persisted in ZK
o Cleans up the entry for a terminated instance and reuses its ID
o Same ID upon restart
● Bootstrap Kafka properties from Archaius
o Files
o System properties/environment variables
o Persisted properties service
● Service registration
o Register with Eureka for internal service discovery
o Register with the AWS Route53 DNS service
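A minimal sketch of the ID-claiming idea with Curator's lock recipe (illustrative only: the ZK path is made up, and the real bootstrap additionally persists the instance-to-ID mapping so an instance gets the same ID back after a restart):

import java.util.concurrent.TimeUnit;
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.recipes.locks.InterProcessSemaphoreMutex;

// Claim the lowest free numeric broker ID by holding a per-ID ZooKeeper lock.
// The lock is backed by an ephemeral node, so a terminated instance's ID
// becomes reclaimable once its ZK session expires.
public class BrokerIdAssigner {
    private static final String IDS_PATH = "/kafka/broker-ids"; // hypothetical path

    public static int claimId(CuratorFramework zk) throws Exception {
        for (int id = 0; ; id++) {
            InterProcessSemaphoreMutex lock =
                new InterProcessSemaphoreMutex(zk, IDS_PATH + "/" + id);
            if (lock.acquire(0, TimeUnit.SECONDS)) {
                return id; // hold the lock for the broker's lifetime
            }
        }
    }
}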
Metric Reporting
● We use Servo and Atlas from NetflixOSS
[Diagram: Kafka → MetricReporter (Yammer → Servo adaptor) → JMX → Atlas Service]
Kafka Atlas Dashboard
Health check service
● Use Curator to periodically read ZooKeeper data to find signs of unhealthiness
● Export metrics to Servo/Atlas
● Expose the service via embedded Jetty
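One concrete probe such a service might run, sketched here as an assumption (/brokers/ids is Kafka's standard broker registry path in ZooKeeper; the health rule itself is illustrative):

import org.apache.curator.framework.CuratorFramework;

// A live broker keeps an ephemeral registration node under /brokers/ids;
// its disappearance is one sign of unhealthiness.
public class BrokerHealthCheck {
    public static boolean isRegistered(CuratorFramework zk, int brokerId) throws Exception {
        return zk.checkExists().forPath("/brokers/ids/" + brokerId) != null;
    }
}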
Kafka in AWS - How do we make it happen
● Inside our Kafka JVM
● Services supporting Kafka
● Challenges/Solutions
● Our roadmap
ZooKeeper
● Dedicated 5-node cluster for our data pipeline services
● EIP based
● SSD instances
Auditor
● Highly configurable producers and consumers with their own set of topics and metadata in messages
● Built as a service deployable on single or multiple instances
● Runs as producer, consumer, or both
● Supports replay of a preconfigured set of messages
Auditor
● Broker monitoring (Heartbeating)
Auditor
● Broker performance testing
o Produce tens of thousands of messages per second on a single instance
o Run as consumers to test consumer impact
Kafka admin UI
● Still searching …
● Currently trying out KafkaManager
Kafka in AWS - How do we make it happen
● Inside our Kafka JVM
● Services supporting Kafka
● Challenges/Solutions
● Our roadmap
Challenges
● ZooKeeper client issues
● Cluster scaling
● Producer/consumer/broker tuning
ZooKeeper Client
● Challenges
o Brokers/consumers cannot survive a ZooKeeper cluster rolling push due to caching of private IPs
o A temporary DNS lookup failure at new session initialization kills future communication
ZooKeeper Client
● Solutions
o Created our internal fork of the Apache ZooKeeper client
o Periodically refresh private IP resolution
o Fall back to the last good private IP resolution upon DNS lookup failure
Scaling
● Provisioned for peak traffic
o … and we have regional fail-over
Strategy #1: Add Partitions to New Brokers
● Caveat
o Most of our topics do not use keyed messages
o Number of partitions is still small
o Requires the high-level consumer
Strategy #1: Add Partitions to New Brokers
● Challenge: existing admin tools do not support atomically adding partitions and assigning them to new brokers
Strategy #1: Add Partitions to New Brokers
● Solution: created our own tool to do it in one ZooKeeper change, repeated for all or selected topics (see the sketch below)
● Reduced the time to scale up from a few hours to a few minutes
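The "one ZooKeeper change" works because Kafka of that era stores a topic's entire partition-to-replica assignment as a single JSON znode. A hedged sketch of the idea with a toy assignment (the real tool would read and extend the existing assignment rather than hard-code it):

import java.nio.charset.StandardCharsets;
import org.apache.curator.framework.CuratorFramework;

// Rewrite /brokers/topics/<topic> in one ZK update: keep existing partitions
// where they are and place the new partition on the new brokers.
public class AtomicPartitionAdd {
    public static void addPartition(CuratorFramework zk, String topic) throws Exception {
        String assignment =
            "{\"version\":1,\"partitions\":{"
            + "\"0\":[1,2],"   // existing partition stays on brokers 1 and 2
            + "\"1\":[3,4]}}"; // new partition lands on new brokers 3 and 4
        zk.setData().forPath("/brokers/topics/" + topic,
                             assignment.getBytes(StandardCharsets.UTF_8));
    }
}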
Strategy #2: Move Partitions
● Should work without precondition, but ...
● Huge increase in network I/O affecting incoming traffic
● A much longer process than adding partitions
● Sometimes confusing error messages
● Would work if the pace of replication could be controlled
Scale down strategy
● There is none
● Looking for more support to automatically move all partitions from a set of brokers to a different set
Client tuning
● Producer
o Batching is important to reduce CPU and network I/O on brokers
o Stick to one partition for a while when producing non-keyed messages (see the sketch after this list)
o “linger.ms” works well with the sticky partitioner
● Consumer
o With a huge number of consumers, set a proper fetch.wait.max.ms to reduce polling traffic on the broker
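A minimal sketch of the sticky idea for non-keyed messages (not Netflix's implementation, and distinct from the sticky partitioner Apache Kafka shipped years later; the rotation interval is a placeholder):

import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;

// Stay on one randomly chosen partition long enough for batches to fill,
// then rotate, so "linger.ms" has whole batches to work with.
public class StickyPartitioner implements Partitioner {
    private static final long STICK_MS = 1000; // placeholder stickiness interval
    private volatile int current = -1;
    private volatile long lastRotation = 0;

    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionsForTopic(topic).size();
        long now = System.currentTimeMillis();
        if (current < 0 || now - lastRotation > STICK_MS) { // benign race in this sketch
            current = ThreadLocalRandom.current().nextInt(numPartitions);
            lastRotation = now;
        }
        return current;
    }

    @Override public void close() {}
    @Override public void configure(Map<String, ?> configs) {}
}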
Effect of batching

partitioner                   | batched records per request | broker CPU util [1]
------------------------------|-----------------------------|--------------------
random, without lingering     | 1.25                        | 75%
sticky, without lingering     | 2.0                         | 50%
sticky, with 100 ms lingering | 15                          | 33%

[1] 10 MB & 10K msgs per second per broker, 1 KB per message
Broker tuning
● Use G1 collector
● Use large page cache and memory
● Increase the max file descriptor limit if you have thousands of producers or consumers
Kafka in AWS - How do we make it happen
● Inside our Kafka JVM
● Services supporting Kafka
● Challenges/Solutions
● Our roadmap
Road map
● Work with the Kafka community on rack/zone aware replica assignment
● Failure resilience testing
o Chaos Monkey
o Chaos Gorilla
● Contribute to open source
o Kafka
o Schlep -- our messaging library including SQS and Kafka support
o Auditor
Thank you!
http://netflix.github.io/
http://techblog.netflix.com/
@NetflixOSS
@allenxwang
@stevenzwu