Running Kafka for Maximum Pain

Kafka makes so many things easier to do, from managing metrics to processing streams of data. Yet it seems that so many things we have done to this point in configuring and managing it have been object lessons in how to make our lives, as the plumbers who keep the data flowing, more difficult than they have to be. What are some of our favorites?
* Kafka without access controls
* Multitenant clusters with no capacity controls
* Worrying about message schemas
* MirrorMaker inefficiencies
* Hope and pray log compaction
* Configurations as shared secrets
* One-way upgrades

We’ve made a lot of progress over the last few years improving the situation, in part by focusing some of this incredibly talented community toward operational concerns. We’ll talk about the big mistakes you can avoid when setting up multi-tenant Kafka, and some that you still can’t. And we will talk about how to continue down the path of marrying hot new features with operational stability so we can all continue to come back here every year to talk about it.

  1. Running Kafka For Maximum Pain
     Todd Palino, Senior Staff Engineer, Site Reliability, LinkedIn
  2. To All The Tech Debt I’ve Loved Before
     Todd Palino, Senior Staff Engineer, Site Reliability, LinkedIn
  3. Technical Debt
     The cost of the rework required by choosing an easy solution now.
  4. SRE vs SWE
  5. SRE ❤️s SWE
     • Both roles are critical
     • Work together to balance operability and features
     • SRE’s job is to enable SWE to move as quickly as possible while meeting SLOs
  6. How Big?
     2 Trillion Messages
     • Produced
     • Every day
     5 Gbps Inbound
     • Single cluster
     • Unique data
     18 Gbps Outbound
     • Average 3x consumption
     • Before mirroring
     2.5M Partitions
     • Largest clusters are 250k
     • Up to 10k partitions per broker
  7. Sources of Pain
     Multitenancy: Exponentially increase your problems by sharing them
     Infrastructure: Kafka’s great! Everything else around it sucks
     Management: What do you mean I have to do it myself?
  8. Multitenancy
  9. Sharing is Caring
     • Reduces the hardware footprint
     • Less administrative overhead
     • One bad actor makes everyone’s life hard
  10. Types of Data
      Tracking
      • Member-related activity
      • Data schemas are managed by DMRC
      • Aggregated to some datacenters
      Metrics
      • Application metrics, service calls, logs
      • Mostly produced by application containers
      • Only aggregated to backend datacenters
      Queuing
      • Internal application data, messaging
      • Largest users are Samza and Search
      • Limited aggregation in production only
      Logging
      • Dedicated cluster for application logs going to ELK
      • High volume, low retention
      • Not aggregated
  11. Multitenancy Woes
      Ownership
      • Auto topic creation means nobody knows who created it
      • Multiple producers further cloud the issue
      • Who makes decisions?
      • Who is responsible for problems?
      Capacity
      • No controls means it’s free!
      • Getting one person to project growth is hard
      • Getting 100 people to do it is impossible
      • Storage hardware is not commodity
      Security
      • Started with zero security
      • Impossible to handle sensitive data
  12. Improvements
      Ownership
      • Added an ownership metadata service
      • One committee with control over shared data schemas
      • Moving to disable automatic topic creation
      Capacity
      • Quotas to limit bandwidth
      • Retention by both time and bytes to restrict disk usage
      • Also forces customers to talk to us about data usage
      Security
      • Move all clients to SSL
      • Add ACLs for existing usage (after review)
      • Starting to evaluate encryption
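The capacity and security items on slide 12 map onto standard Kafka admin calls. The following is a minimal sketch, not LinkedIn's actual tooling: it creates a topic with both time- and size-based retention and grants a produce ACL to one principal using the Java AdminClient. The topic name, principal, broker address, and limits are invented for illustration.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.acl.AccessControlEntry;
import org.apache.kafka.common.acl.AclBinding;
import org.apache.kafka.common.acl.AclOperation;
import org.apache.kafka.common.acl.AclPermissionType;
import org.apache.kafka.common.resource.PatternType;
import org.apache.kafka.common.resource.ResourcePattern;
import org.apache.kafka.common.resource.ResourceType;

public class TopicWithLimitsAndAcl {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092"); // hypothetical bootstrap server

        try (AdminClient admin = AdminClient.create(props)) {
            // Retention by both time and bytes, so a runaway producer hits a
            // disk ceiling instead of filling the broker (hypothetical limits).
            Map<String, String> configs = new HashMap<>();
            configs.put("retention.ms", "345600000");       // 4 days
            configs.put("retention.bytes", "107374182400"); // 100 GB per partition

            NewTopic topic = new NewTopic("page-view-event", 16, (short) 3).configs(configs);
            admin.createTopics(Collections.singleton(topic)).all().get();

            // Explicit produce ACL for the owning application (hypothetical principal),
            // so the topic is not writable by every client on the cluster.
            AclBinding produceAcl = new AclBinding(
                new ResourcePattern(ResourceType.TOPIC, "page-view-event", PatternType.LITERAL),
                new AccessControlEntry("User:CN=page-view-frontend", "*",
                                       AclOperation.WRITE, AclPermissionType.ALLOW));
            admin.createAcls(Collections.singleton(produceAcl)).all().get();
        }
    }
}
```

The bandwidth quotas from the same slide are applied per client or per user rather than per topic (the producer_byte_rate and consumer_byte_rate quota settings), so they are not shown here.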
  13. Infrastructure
  14. Mirror Maker
      • Every change requires a restart
      • Grows n² with the number of sites
      • Inefficient since 0.8
      • Loses key-to-partition affinity
  15. Mirror Maker Performance
      • Added identity handler for fixed partition mapping
      • Eliminated compression
      • Finally off the old consumer
      • Coming soon to a KIP near you
  16. Message Auditing
      • Required to ensure mirroring works
      • Makes infrastructure care about data schema
      • Only tracks producers (mostly)
      • Relational database doesn’t cut it for storing audit data
  17. Streaming Audit
      • Moving audit data to headers
      • Utilizing Samza for processing counts
      • Adding “cost to serve” information
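The headers on slide 17 are the per-record headers Kafka gained in 0.11. Below is a minimal sketch of stamping audit metadata at produce time; the header names, topic, cluster label, and broker address are invented, and this is not LinkedIn's actual audit format.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.ByteArraySerializer;
import org.apache.kafka.common.serialization.StringSerializer;

public class AuditHeadersProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092"); // hypothetical
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class.getName());

        try (KafkaProducer<String, byte[]> producer = new KafkaProducer<>(props)) {
            ProducerRecord<String, byte[]> record =
                new ProducerRecord<>("page-view-event", "member-123", new byte[] {1, 2, 3});

            // Audit metadata rides alongside the payload, so the pipeline that
            // counts messages never has to deserialize the message schema.
            record.headers()
                  .add("audit.origin-cluster", "dc1-tracking".getBytes(StandardCharsets.UTF_8))
                  .add("audit.produce-ts",
                       ByteBuffer.allocate(Long.BYTES).putLong(System.currentTimeMillis()).array());

            producer.send(record);
        }
    }
}
```

A downstream consumer, Samza job or otherwise, can read the same fields from ConsumerRecord.headers() and aggregate counts per origin without ever touching the payload, which is what relieves the "infrastructure cares about data schema" pain from slide 16.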
  18. Management
  19. Topic Configuration
      • No way to manage configs across multiple clusters
      • Creating a new datacenter is a manual process
      • Changes need to be propagated in a specific order
      • Administrative commands are not protected
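There is still no single switch for slide 19's problem, but the repetitive part can at least be scripted. The sketch below applies one topic config change to a list of clusters with the Java AdminClient; the bootstrap servers, topic, and value are placeholders, and incrementalAlterConfigs needs clients and brokers newer than the ones this talk describes.

```java
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class PropagateTopicConfig {
    public static void main(String[] args) throws Exception {
        // Hypothetical bootstrap servers, one per datacenter.
        List<String> clusters = List.of("dc1-kafka:9092", "dc2-kafka:9092", "dc3-kafka:9092");

        ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "page-view-event");
        AlterConfigOp setRetention =
            new AlterConfigOp(new ConfigEntry("retention.ms", "345600000"), AlterConfigOp.OpType.SET);
        Map<ConfigResource, Collection<AlterConfigOp>> change = Map.of(topic, List.of(setRetention));

        // Apply the same change to every cluster instead of editing each one by hand.
        for (String bootstrap : clusters) {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrap);
            try (AdminClient admin = AdminClient.create(props)) {
                admin.incrementalAlterConfigs(change).all().get();
            }
        }
    }
}
```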
  20. Nuage
      • One-stop shop for Data Infrastructure
      • Allows creation of topics with ownership and ACLs
      • Uses our Kafka REST interface for CRUD
  21. Cluster Membership
      • No tool to remove brokers
      • New brokers take no traffic
      • Partition reassignment is basic
      • Automatic leader election kills the cluster
  22. Round 1: kafka-tools
      • kafka-assigner:
        • Remove broker
        • Rebalance replicas
        • Fix replication factor
      • Protocol CLI tool
      • Adding an admin client
      • github.com/linkedin/kafka-tools
  23. Round 2: Cruise Control
      • Dynamic workload rebalancing
      • Self-healing clusters
      • Manages multiple goals (network, disk, CPU, rack)
      • Requires no additional code
      • Open source now!
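Cruise Control is driven over its REST API rather than by code against the brokers. The sketch below asks a Cruise Control instance for a dry-run rebalance proposal using Java's built-in HttpClient; the host and port are placeholders, and the endpoint path and dryrun parameter are taken from the Cruise Control wiki, so verify them against the version you deploy.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CruiseControlDryRun {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();

        // dryrun=true only computes and returns the proposed partition moves;
        // it does not execute them. Host and port are placeholders.
        HttpRequest rebalance = HttpRequest.newBuilder()
            .uri(URI.create("http://cruise-control.example.com:9090/kafkacruisecontrol/rebalance?dryrun=true"))
            .POST(HttpRequest.BodyPublishers.noBody())
            .build();

        HttpResponse<String> response = client.send(rebalance, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode());
        System.out.println(response.body());
    }
}
```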
  24. What Else?
  25. What Needs Attention?
      Log Compaction
      • Very few metrics
      • One bad partition breaks it
      Client Config
      • Client and broker cannot negotiate
      • Configurations are essentially shared secrets
      • No information on the version of clients connecting
      Upgrading
      • Message format changes are still troubling
      • Broker upgrades must be carefully ordered
      • Often no clear way to roll back
  26. Make It Easier
      LinkedIn Open Source
      • Cruise Control: https://github.com/linkedin/cruise-control
      • Kafka Monitor: https://github.com/linkedin/kafka-monitor
      • Burrow: https://github.com/linkedin/Burrow
      • kafka-tools: https://github.com/linkedin/kafka-tools
      Get Involved
      • Community: users@kafka.apache.org, dev@kafka.apache.org
      • Bugs and work: https://issues.apache.org/jira/projects/KAFKA
  27. Thank you
