Storage Capacity Management
@Booking.com
Nurettin OMEROGLU
What happens when broker disk is FULL?
A) Only some producers fail
B) All producers fail
C) Kafka service fails
Streaming Infra Team
Nurettin OMEROGLU
Senior Software Engineer
I am a member of the Streaming Infra Team (10 people) and have more than
4 years of experience with Apache Kafka client- and server-side
components. We manage an on-prem Kafka solution serving clients
running on a variety of platforms such as bare metal, Kubernetes,
and cloud.
Agenda
1. Introduction
2. Before the project
3. Step by step capacity project
4. Future plans
Introduction
100M monthly active app users
155,000 destinations around the world
Car hire available in 140+ countries and pre-booked taxis in over 500 cities across 120+ countries
243M+ verified guest reviews and 24/7 customer service in 45 languages and dialects
Since 2010, Booking.com has welcomed 4.5B+ guest arrivals
28M total reported listings worldwide
6.6M options in homes, apartments and other unique places to stay
30 different types of places to stay, including homes, apartments, B&Bs, hostels, farm stays, bungalows, even boats, igloos and treehouses
140 offices in 70 countries
Over 5,000 employees in Amsterdam
Data Streaming Platform
[Diagram labels: Payments, A/B Tests, MySQL, Cassandra, Hadoop, Cloud, ...; Events, Logs; Online ML, Fraud detection, Personalization, Bookings FPA reporting, Real-time analytics]
● Transports and transposes data via pub/sub
● Connects applications through data pipelines
● Resilient, scalable, fault tolerant, secure, with SLO guarantees
Scale of Streaming @Booking.com
How much data? ~2.2PB produced and consumed per day
How many clusters? 62
How many topics? ~34K
How many partitions? ~138K
How many servers? 900 Kafka brokers + 75 ZooKeeper (zk)
Before the project
Setup
● On-premise multi-tenant Kafka clusters running on bare metal
● Local SSD storage (~3.5TB per broker)
● 32-thread CPU / 256MB memory / 10 Gb network
Existing Components
● Custom Configuration validations
● Custom Quota validations
○ Topics per principal
○ Partitions per principal
…
● Topics
● Custom quotas (booking-specific)
…
● Specific Configurations
● Custom Quotas
…
● Custom PrincipalBuilder
● Custom Policies (AbstractPolicy)
○ AlterConfigPolicy
○ CreateTopicPolicy
…
MySQL (Metadata Store)
Bkstreaming CLI (Self-service, home-built)
Kontrole (Control Center, home-built)
Kafka Cluster
Example Scenario for Custom Quota Validations
[Diagram: MySQL (Metadata Store), Kontrole (Control Center), Kafka Cluster]
(1) Add topic for a service
(2) Auth: OK
(3) Topics per principal quota: OK
(4) Partitions per principal quota: OK
(5) Create topic
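The validation sequence in this scenario could be sketched roughly as follows. This is an illustration only, not Kontrole's actual code: the `PrincipalQuota` and `PrincipalUsage` types, their field names, and the `validate_topic_request` function are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PrincipalQuota:
    """Hypothetical per-principal limits (names are illustrative)."""
    max_topics: int
    max_partitions: int


@dataclass
class PrincipalUsage:
    """Hypothetical current usage, e.g. as read from the metadata store."""
    topics: int
    partitions: int


def validate_topic_request(authorized: bool, quota: PrincipalQuota,
                           usage: PrincipalUsage, new_partitions: int) -> list[str]:
    """Return the list of violations; an empty list means the topic may be created."""
    violations = []
    if not authorized:  # step (2): auth check
        violations.append("principal is not authorized")
    if usage.topics + 1 > quota.max_topics:  # step (3): topics per principal
        violations.append("topics per principal quota exceeded")
    if usage.partitions + new_partitions > quota.max_partitions:  # step (4): partitions per principal
        violations.append("partitions per principal quota exceeded")
    return violations
```

Only when the returned list is empty would step (5), the actual topic creation, proceed.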
Reactive Approach
● Clients use the retention.ms configuration, which deletes messages
after a certain amount of time.
● Dangerous situations if traffic spikes
● We were the middleman handling the toil /
issues between multiple tenants
○ Increase number of brokers, or
○ Determine noisy neighbors and
■ Throttle, or
■ Communicate with clients (night?)
● Lack of visibility and forecasting to plan ahead
[Diagram: Shared broker disk among topics (Topic 1 to Topic 6), with reserved space for safety]
Step by Step
Capacity Project
IDEA?
retention.bytes - which deletes the oldest messages
when the total size of a partition exceeds a threshold.
● Reserve storage per principal (quota)
● Let the clients manage their reserved storage
● Make retention.bytes mandatory on topic
● Feedback to clients around their usage/growth
Discarded Options:
● Kubernetes elasticity
● Network attached or remote storage options
[Diagram: Reserved quotas per principal (one quota per principal), with reserved space for safety]
Determine cluster capacity
1) Periodically fetch information about the cluster from Cruise Control: number of available brokers, disk information …
2) Use the minimum disk capacity among brokers to calculate cluster capacity
3) Target 90% disk usage (headroom)
Total capacity = (min broker disk * number of brokers) * 0.9
[Diagram: Kontrole, Cruise Control, Graphite. (1) Periodic cron job; (2) Available brokers, disk information; (3) Calculate capacity, publish metrics]
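As a worked example of the capacity formula, here is a minimal sketch. It assumes the per-broker disk capacities have already been fetched (the Cruise Control call itself is not shown), and the function name `cluster_capacity_bytes` is hypothetical. The 90% target is expressed as an integer percentage to avoid floating-point rounding.

```python
def cluster_capacity_bytes(broker_disk_bytes: list[int],
                           target_usage_percent: int = 90) -> int:
    """Total capacity = (min broker disk * number of brokers) * target usage."""
    if not broker_disk_bytes:
        raise ValueError("no available brokers reported")
    # Use the smallest disk so that every broker can hold its share.
    smallest_disk = min(broker_disk_bytes)
    return smallest_disk * len(broker_disk_bytes) * target_usage_percent // 100


# Three brokers, one of them with a smaller disk (values in bytes).
disks = [3_500 * 10**9, 3_500 * 10**9, 3_000 * 10**9]
print(cluster_capacity_bytes(disks))  # 8100000000000 (8.1 TB)
```

Using the minimum disk is conservative: capacity on the larger brokers beyond the smallest disk size is left out of the total.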
New Quota + Topic level configuration
● Reserve storage per principal (quota) (default 500MB)
● Add a `topic_capacity_bytes` property per Kafka topic (not visible to
Kafka brokers) to manage retention.bytes
● All calculations are based on this value (including retention.bytes)
topic_capacity_bytes = retention.bytes * partition_count * replica_count
● Whenever the partition count increases (i.e. done via Kontrole),
retention.bytes (per partition) is recalculated accordingly.
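Rearranging the formula gives the per-partition retention.bytes for a given topic capacity. The sketch below illustrates the recalculation on a partition count increase; the function name is hypothetical and rounding down is an assumption, not something the slides specify.

```python
def retention_bytes_per_partition(topic_capacity_bytes: int,
                                  partition_count: int,
                                  replica_count: int) -> int:
    """Derive retention.bytes from topic_capacity_bytes (rounded down)."""
    if partition_count <= 0 or replica_count <= 0:
        raise ValueError("partition_count and replica_count must be positive")
    return topic_capacity_bytes // (partition_count * replica_count)


capacity = 6 * 1024**3  # 6 GiB reserved for the topic across all replicas

# 4 partitions, 3 replicas: each partition may retain 512 MiB.
print(retention_bytes_per_partition(capacity, 4, 3))  # 536870912

# After increasing to 8 partitions, retention.bytes halves to 256 MiB.
print(retention_bytes_per_partition(capacity, 8, 3))  # 268435456
```

The topic's total footprint stays the same when partitions are added; only the per-partition limit shrinks.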
New Quota Creation
[Diagram: Kontrole, Cruise Control, MySQL]
(1) Create principal quota
(2) Get available brokers, disk information
(3) Get existing quotas
(4) Validate if new quota fits into cluster
(5) Save quota
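Step (4) can be read as a simple sum check against the cluster capacity. The sketch below is one plausible interpretation and not the actual Kontrole logic; `quota_fits` and its parameters are hypothetical names.

```python
def quota_fits(new_quota_bytes: int,
               existing_quota_bytes: list[int],
               cluster_capacity_bytes: int) -> bool:
    """True if all existing quotas plus the new one fit into the cluster capacity."""
    return sum(existing_quota_bytes) + new_quota_bytes <= cluster_capacity_bytes


capacity = 8_100 * 10**9                  # e.g. the result of the capacity calculation
existing = [2_000 * 10**9, 3_000 * 10**9]  # quotas already saved in the metadata store

print(quota_fits(3_000 * 10**9, existing, capacity))  # True  (8.0 TB <= 8.1 TB)
print(quota_fits(3_500 * 10**9, existing, capacity))  # False (8.5 TB >  8.1 TB)
```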
New Topic Creation
[Diagram: Kontrole, MySQL, Kafka Cluster]
(1) Create topic with topic_capacity_bytes
(2) Get principal’s quota
(3) Enough space for the new topic?
(4) No: reject and ask for a quota increase
(4) Yes: topic fits, go on! Create topic with relevant retention.bytes
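The topic creation decision could look roughly like this. It is a sketch under assumptions: `plan_topic_creation` is a hypothetical name, and the principal's already-allocated bytes are assumed to be the sum of `topic_capacity_bytes` over that principal's existing topics.

```python
from typing import Optional


def plan_topic_creation(topic_capacity_bytes: int,
                        partition_count: int,
                        replica_count: int,
                        principal_quota_bytes: int,
                        principal_allocated_bytes: int) -> Optional[int]:
    """Return the retention.bytes to create the topic with, or None to reject."""
    # Step (3): does the new topic fit into what is left of the principal's quota?
    if principal_allocated_bytes + topic_capacity_bytes > principal_quota_bytes:
        return None  # Step (4), "No": reject and ask for a quota increase
    # Step (4), "Yes": derive the per-partition retention.bytes.
    return topic_capacity_bytes // (partition_count * replica_count)


quota = 500 * 10**6  # the default 500MB principal quota

print(plan_topic_creation(240 * 10**6, 4, 3, quota, 200 * 10**6))  # 20000000
print(plan_topic_creation(240 * 10**6, 4, 3, quota, 400 * 10**6))  # None
```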
Dashboards for Admins
Dashboards for Clients
Add Alerting
● Warn/notify before the topic_capacity_bytes configuration kicks in and
starts deleting data.
● Actions:
○ reduce the retention.ms configuration, or
○ increase the topic capacity.
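A warning check of this kind might compare the largest partition against its retention.bytes limit. In the sketch below the 80% threshold and the `should_warn` name are assumptions for illustration; the slides do not state the actual threshold or where partition sizes come from.

```python
def should_warn(partition_sizes_bytes: list[int],
                retention_bytes: int,
                warn_percent: int = 80) -> bool:
    """True if any partition has reached warn_percent of its retention.bytes limit."""
    if not partition_sizes_bytes:
        return False
    # Integer arithmetic: size / retention_bytes >= warn_percent / 100
    return max(partition_sizes_bytes) * 100 >= warn_percent * retention_bytes


print(should_warn([500, 850], retention_bytes=1000))  # True  (85% used)
print(should_warn([500, 700], retention_bytes=1000))  # False (70% used)
```

Checking the largest partition rather than the topic total reflects that retention.bytes applies per partition, so a single skewed partition can start losing data before the topic as a whole looks full.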
Onboard Existing Clusters
● Simulating scenarios on test cluster
● Operational documentation
● Stakeholder management
● Documentation for clients
● Enable capacity project on a cluster
○ Calculate / Add topic_capacity_bytes to each topic (with extra)
○ Calculate / Add quotas per principal
Migration Challenges
● Revert strategy
○ Dynamic flag to disable the project on cluster
● Sanity check if cluster is suitable
○ Brokers may have non-uniform storage capacity
○ With extras, all quotas may not fit into the available capacity
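The two sanity checks could be combined into a small pre-flight function run before enabling the project on a cluster. This is a hypothetical sketch: `cluster_suitability_issues` is an invented name, and it reuses the capacity formula from the slides (minimum broker disk times broker count, at 90% target usage).

```python
def cluster_suitability_issues(broker_disk_bytes: list[int],
                               quota_bytes: list[int],
                               target_usage_percent: int = 90) -> list[str]:
    """Return the reasons a cluster may not be ready for onboarding (empty if none)."""
    if not broker_disk_bytes:
        raise ValueError("no brokers reported")
    issues = []
    # Check 1: non-uniform disks mean capacity on the larger brokers goes uncounted.
    if len(set(broker_disk_bytes)) > 1:
        issues.append("brokers have non-uniform storage capacity")
    # Check 2: do all calculated quotas (including extras) fit into the capacity?
    capacity = min(broker_disk_bytes) * len(broker_disk_bytes) * target_usage_percent // 100
    if sum(quota_bytes) > capacity:
        issues.append("quotas do not fit into the available capacity")
    return issues


print(cluster_suitability_issues([100, 100, 100], [90, 90, 80]))  # []
print(cluster_suitability_issues([100, 100, 80], [90, 90, 80]))
# ['brokers have non-uniform storage capacity', 'quotas do not fit into the available capacity']
```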
Future Work
What is next?
● Allow teams to extend their quota if there is enough capacity (self service)
● Send usage report to the teams, with the capacity allocated to the
principal vs. their usage (cost attribution)
Booking.com
Facebook: facebook.com/booking.com
Instagram: @bookingcom
Twitter: @booking.com; @bookingcomnews
Linkedin: nl.linkedin.com/company/booking.com
Youtube: youtube.com/booking
Join Booking.com as a partner
join.booking.com
Join the Booking.com team
careers.booking.com
Questions?
Thank you!
