©2017 LinkedIn Corporation. All Rights Reserved.
Kafka at half the price
Dong Lin
Streams Infrastructure
Agenda
▪ Motivation
– Why switch from RAID-10 to JBOD?
– Tradeoff between cost and fault-tolerance
▪ Design
– How to run Kafka with disk failure
– How to move replicas between disks
▪ Alternatives
▪ Evaluation
▪ Changes in operational procedures
▪ Future work
▪ Reference
RAID-10 setup with RF=2
[Diagram: a producer writes to Broker 1 and Broker 2; each broker stores two mirrored copies of partitions A, B, C]
- Tolerate only one broker failure
RAID-10 setup with RF=3
[Diagram: a producer writes to Broker 1, Broker 2, and Broker 3; each broker stores two mirrored copies of partitions A, B, C]
- Tolerate up to two broker failures
- 50% more storage cost than RAID-10 with RF=2
JBOD setup with RF=2
[Diagram: a producer writes to Broker 1 and Broker 2; each broker stores a single copy of partitions A, B, C]
- Tolerate only one broker failure
- 50% less storage cost than RAID-10 with RF=2
JBOD setup with RF=3
[Diagram: a producer writes to Broker 1, Broker 2, and Broker 3; each broker stores a single copy of partitions A, B, C]
- Tolerate up to two broker failures
- 25% less storage cost than RAID-10 with RF=2
RAID vs. JBOD
Setup    Replication   Storage cost    Broker failure tolerance   Disk failure tolerance
RAID-10  2 (baseline)  4X              1 (too small)              3
RAID-10  3             6X (50% up)     2                          5
JBOD     2             2X (50% down)   1 (too small)              1 (too small)
JBOD     3 (future)    3X (25% down)   2 (100% up)                2 (33% down)
JBOD     4             4X              3 (200% up)                3
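The RAID vs. JBOD numbers follow from simple counting. Below is a minimal sketch, assuming RAID-10 keeps two mirrored copies of every replica while JBOD keeps one, and assuming data survives as long as at least one copy is left. The names `Tradeoff`, `tradeoff`, and `percent_change` are illustrative and do not come from the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tradeoff:
    storage_cost: int      # total copies of the data, in units of X
    broker_failures: int   # broker failures tolerated
    disk_failures: int     # disk failures tolerated in the worst case


def tradeoff(setup: str, rf: int) -> Tradeoff:
    """Count copies for a setup ("RAID-10" or "JBOD") and replication factor."""
    # Assumption: RAID-10 mirrors each replica onto two disks, JBOD stores it once.
    copies_per_replica = {"RAID-10": 2, "JBOD": 1}[setup]
    total_copies = copies_per_replica * rf
    return Tradeoff(
        storage_cost=total_copies,
        broker_failures=rf - 1,            # one broker must survive
        disk_failures=total_copies - 1,    # one copy must survive
    )


def percent_change(value: int, baseline: int) -> int:
    """Relative change of value against baseline, in whole percent."""
    return round(100 * (value - baseline) / baseline)


baseline = tradeoff("RAID-10", 2)
for setup, rf in [("RAID-10", 2), ("RAID-10", 3), ("JBOD", 2), ("JBOD", 3), ("JBOD", 4)]:
    t = tradeoff(setup, rf)
    cost_delta = percent_change(t.storage_cost, baseline.storage_cost)
    print(f"{setup:8} RF={rf}  cost={t.storage_cost}X ({cost_delta:+d}%)  "
          f"brokers={t.broker_failures}  disks={t.disk_failures}")
```

The sketch treats disk failure tolerance as a worst-case count, which is why JBOD needs RF=4 to match the 3 disk failures that RAID-10 with RF=2 tolerates.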
Problem 1: All replicas go offline if any log directory fails
[Diagram: a broker with disks A, B, and C hits an IOException when accessing disk B; afterwards the replicas on all three disks are offline]
Solution: Only replicas on the failed disk go offline
[Diagram: a broker with disks A, B, and C hits an IOException when accessing disk B; afterwards only the replicas on disk B are offline]
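The idea of isolating a failure to one log directory can be sketched as follows. This is a simplified illustration, not Kafka's broker code: the class `LogDirManager`, its methods, and the directory and replica names are all hypothetical.

```python
from __future__ import annotations


class LogDirManager:
    """Tracks which replicas live in which log directory (hypothetical helper)."""

    def __init__(self, replicas_by_dir: dict[str, list[str]]):
        self._replicas_by_dir = {d: list(rs) for d, rs in replicas_by_dir.items()}
        self._offline_dirs = set()

    def handle_io_error(self, log_dir: str) -> list[str]:
        """Mark only the failing directory offline and return its replicas."""
        self._offline_dirs.add(log_dir)
        return list(self._replicas_by_dir[log_dir])

    def online_replicas(self) -> list[str]:
        """Replicas on directories that have not failed."""
        return sorted(
            replica
            for log_dir, replicas in self._replicas_by_dir.items()
            if log_dir not in self._offline_dirs
            for replica in replicas
        )

    def offline_dir_count(self) -> int:
        return len(self._offline_dirs)


manager = LogDirManager({
    "/disk-a": ["topic-0"],
    "/disk-b": ["topic-1"],
    "/disk-c": ["topic-2"],
})
lost = manager.handle_io_error("/disk-b")   # simulated IOException on disk B
print("offline:", lost)
print("still online:", manager.online_replicas())
```

The broker keeps serving replicas on disks A and C, and the offline directory count is the kind of value an operator would monitor.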
Problem 2: Controller does not recognize disk failure
[Diagram: Zookeeper, Controller, and Broker 1 hosting Partition 1 and Partition 2; the disk holding Partition 2 has failed]
STEP 1: Broker 1 to Zookeeper: I am online
STEP 2: Controller reads from Zookeeper:
- Broker -> is alive?
- Broker -> partition list
STEP 3: Controller to Broker 1: Become leader for partitions 1 and 2
STEP 4: Broker 1 to Controller: partition 2 is offline
Result: no further leader election
Solution: Broker notifies and provides partition list to controller
[Diagram: Zookeeper, Controller, and Broker 1 hosting Partition 1 and Partition 2; the disk holding Partition 2 has failed]
STEP 1: Broker 1 to Zookeeper: Notify disk failure
STEP 2: Zookeeper to Controller: Broker 1 has new disk failure
STEP 3: Controller to Broker 1: Become leader for partitions 1 and 2
STEP 4: Broker 1 to Controller: partition 2 is offline
STEP 5: Controller: Elect another broker as leader for partition 2
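The controller-side reaction in STEP 5 can be sketched as below. This is a deliberately simplified model that ignores the in-sync replica set and all request plumbing; the function `elect_leaders` and its data shapes are assumptions made for illustration.

```python
from __future__ import annotations


def elect_leaders(
    assignment: dict[int, list[int]],
    offline_replicas: set[tuple[int, int]],
) -> dict[int, int | None]:
    """Pick a leader per partition, skipping replicas reported offline.

    assignment maps partition id -> ordered list of broker ids.
    offline_replicas holds (broker id, partition id) pairs that a broker
    reported as offline after a disk failure.
    """
    leaders = {}
    for partition, brokers in assignment.items():
        live = [b for b in brokers if (b, partition) not in offline_replicas]
        # First live replica wins; None means the partition has no leader.
        leaders[partition] = live[0] if live else None
    return leaders


assignment = {1: [1, 2], 2: [1, 2]}   # both partitions prefer broker 1
offline = {(1, 2)}                    # broker 1 reports partition 2 offline
print(elect_leaders(assignment, offline))
```

With the offline report in hand, the controller keeps broker 1 as leader for partition 1 and moves leadership of partition 2 to broker 2.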
Problem 3: Broker always creates a log for a partition if it does not exist
[Diagram: Zookeeper, Controller, and Broker 1 hosting Partition 1 and Partition 2; the disk holding Partition 2 has failed, and a second copy of Partition 2 appears on the good disk]
STEP 1: Broker 1 to Zookeeper: I am online
STEP 2: Controller reads from Zookeeper:
- Broker -> is alive?
- Broker -> partition list
STEP 3: Controller to Broker 1: Become follower for partition 2. Create partition 2 if non-existent
STEP 4: Broker 1: Created partition 2 (problematic)
Result: the good disk may become overloaded
Solution: Controller specifies whether to create log for partition
[Diagram: Zookeeper, Controller, and Broker 1 hosting Partition 1 and Partition 2; the disk holding Partition 2 has failed]
STEP 1: Broker 1 to Zookeeper: I am online
STEP 2: Controller reads from Zookeeper:
- Broker -> is alive?
- Broker -> partition list
- Broker -> is new partition?
STEP 3: Controller to Broker 1: Become follower for partition 2. This is NOT a new partition
STEP 4: Broker 1 to Controller: Partition 2 is not available and there is offline log dir
STEP 5: Controller: Exclude broker 1 from leader election for partition 2
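The broker-side decision in STEP 3 and STEP 4 can be sketched as follows. The flag name `is_new`, the exception `OfflineLogDirError`, and the function `become_follower` are hypothetical; the rule encoded here is only what the slide states: refuse to create a log when the partition is not new, the log is missing, and some log directory is offline.

```python
from __future__ import annotations


class OfflineLogDirError(Exception):
    """Raised when a missing log may simply sit on an offline directory."""


def become_follower(
    partition: str,
    is_new: bool,
    local_logs: set[str],
    has_offline_dir: bool,
) -> str:
    """Decide whether to create a log for the partition on this broker."""
    if partition in local_logs:
        return "existing"
    if not is_new and has_offline_dir:
        # The log probably lives on the failed disk; recreating it on a
        # good disk would overload that disk, so report an error instead.
        raise OfflineLogDirError(f"{partition} is not available")
    local_logs.add(partition)
    return "created"


logs = {"partition-1"}
try:
    become_follower("partition-2", is_new=False, local_logs=logs, has_offline_dir=True)
except OfflineLogDirError as err:
    print("reported to controller:", err)
print(become_follower("partition-3", is_new=True, local_logs=logs, has_offline_dir=True))
```

A genuinely new partition is still created on a healthy disk, so topic creation keeps working on a broker with a failed disk.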
Problem 4: No mechanism to move replicas between disks
[Diagram: Broker 1 with partitions P1 to P7 placed on Disk 1 and Disk 2]
Example workflow to move replicas between disks
[Diagram: a Client talks to a Broker that has Disk 1 and Disk 2]
STEP 1: Client to Broker: DescribeDirRequest
STEP 2: Broker to Client: DescribeDirResponse (partition list and size)
STEP 3: Client to Broker: ChangeDirRequest
STEP 4: Broker: create p1.move on the destination disk
STEP 5: Broker to Client: ChangeDirResponse (in progress)
STEP 6: Broker: copy data from p1.log to p1.move
STEP 7: Broker: delete p1.log and rename p1.move to p1.log
STEP 8: Client: Verify new assignment via DescribeDirRequest
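STEP 4, STEP 6, and STEP 7 amount to a copy-then-rename pattern, sketched below with plain files. This only illustrates the file handling; a real broker also has to keep up with data appended during the copy, which this sketch ignores. The function `move_replica` is hypothetical.

```python
import os
import shutil
import tempfile


def move_replica(src_dir: str, dst_dir: str, name: str = "p1") -> str:
    """Move <name>.log from src_dir to dst_dir via a temporary .move file."""
    src_log = os.path.join(src_dir, f"{name}.log")
    tmp_log = os.path.join(dst_dir, f"{name}.move")
    dst_log = os.path.join(dst_dir, f"{name}.log")

    shutil.copyfile(src_log, tmp_log)   # STEP 4 and STEP 6: create and fill p1.move
    os.remove(src_log)                  # STEP 7: delete the old p1.log
    os.replace(tmp_log, dst_log)        # STEP 7: rename p1.move to p1.log
    return dst_log


with tempfile.TemporaryDirectory() as disk1, tempfile.TemporaryDirectory() as disk2:
    with open(os.path.join(disk1, "p1.log"), "wb") as f:
        f.write(b"records")
    moved = move_replica(disk1, disk2)
    print(os.path.basename(moved), "on destination:", os.path.exists(moved))
    print("source removed:", not os.path.exists(os.path.join(disk1, "p1.log")))
```

Writing to a `.move` file first means a half-copied replica is never mistaken for a complete log on the destination disk.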
Alternatives
▪ RAID-0 doesn’t provide disk fault tolerance
– Assume each broker has 10 disks and RF = 2
– RAID-0 has 100X higher probability of unavailability due to disk failure than JBOD
▪ RAID-5 and RAID-6 have poor performance
▪ Hardware RAID is expensive
▪ One broker per disk
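The 100X figure for RAID-0 can be checked with a short calculation. This is a sketch under assumptions the slide does not spell out: disks fail independently with a small probability `p` over some time window, a RAID-0 broker is lost when any of its disks fails, and a partition with RF = 2 becomes unavailable when both of its replicas are lost. The function name is illustrative.

```python
def raid0_vs_jbod_ratio(p: float, disks_per_broker: int = 10) -> float:
    """Ratio of partition unavailability, RAID-0 over JBOD, for RF = 2."""
    # RAID-0: one failed disk takes down the whole broker volume.
    p_broker_lost = 1 - (1 - p) ** disks_per_broker
    p_raid0 = p_broker_lost ** 2   # both brokers holding the partition are lost

    # JBOD: a replica is lost only if its own disk fails.
    p_jbod = p ** 2                # both disks holding the partition fail

    return p_raid0 / p_jbod


for p in (1e-2, 1e-3, 1e-4):
    print(f"p={p:g}  ratio={raid0_vs_jbod_ratio(p):.1f}")
```

For small `p` the broker loss probability is close to 10p, so the ratio approaches 10 squared, which is the 100X on the slide.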
one-broker-per-machine vs. one-broker-per-disk
[Diagram: one-broker-per-machine runs a single Broker 1 over Disk 1, Disk 2, and Disk 3 of a physical machine; one-broker-per-disk runs Broker 1, Broker 2, and Broker 3 on the same machine, one broker per disk]
one-broker-per-machine vs. one-broker-per-disk
▪ Both solutions use JBOD as disk configuration
▪ Main drawbacks of one-broker-per-disk (assume 10 disks per machine)
– 100X threads and 100X sockets per machine
– 10X control plane traffic from the controller to brokers (e.g. MetadataRequest)
– 10X broker instances and configuration files to manage
– 10X time to bounce a cluster if we bounce one broker at a time
– 10X load on external service (e.g. a service used to query per-topic ACL)
– Less efficient quota enforcement
– Less efficient rebalance across disks on the same machine
– Lower throughput
Experimental setup
▪ Brokers deployed on 15 machines with 10 disks per machine
  Setup                   IO threads  Network threads  Replica-fetcher threads
  One-broker-per-machine  160         120              140
  One-broker-per-disk     16          12               14
▪ Producers deployed on 15 machines
  acks  threads  sync  retries  retry backoff  message size  batch size  request timeout
  all   50       true  MAX_INT  60 sec         100 KB        1 MB        MAX_INT
▪ Topic configuration
  partitions  replication factor  min-insync-replicas
  512         3                   3
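For readers who want to reproduce the producer side, the settings above map onto standard Kafka producer configuration keys roughly as sketched here. The mapping is an assumption: MAX_INT is taken to be the largest 32-bit signed integer, 1 MB is taken as 1024 * 1024 bytes, and the `threads`, `sync`, and `message size` columns are treated as load-generator parameters rather than producer configs, so they are left out. The helper `render_properties` is hypothetical.

```python
MAX_INT = 2**31 - 1  # assumption: largest 32-bit signed integer

# Assumed mapping from the slide's columns to producer configuration keys.
producer_props = {
    "acks": "all",
    "retries": MAX_INT,
    "retry.backoff.ms": 60 * 1000,    # 60 sec
    "batch.size": 1024 * 1024,        # 1 MB
    "request.timeout.ms": MAX_INT,
}


def render_properties(props: dict) -> str:
    """Render a dict as key=value lines, as used in a .properties file."""
    return "\n".join(f"{key}={value}" for key, value in props.items())


print(render_properties(producer_props))
```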
One-broker-per-machine throughput
Average throughput is 2.3 GBps
One-broker-per-disk throughput
Average throughput is 2 GBps, about 13% lower than one-broker-per-machine
Changes in operational procedure
▪ Adjust replication factor and min.insync.replicas
▪ Configure num.replica.move.threads for broker
▪ Monitor disk failure via the OfflineLogDirectoriesCount metric
Future work
▪ Use a more intelligent solution to select the log directory for a new replica
▪ Automatic load balancing across log directories on the same broker
– Reduced operational overhead
▪ Distribute segments of a given replica across multiple log directories
– Less overhead for rebalance between disks
– Higher partition size limit
▪ Handle partial disk failure, e.g. a disk with degraded performance
References
▪ KIP-112: Handle disk failure for JBOD
▪ KIP-113: Support replicas movement between log directories
Robotics Group 10 (Control Schemes) cse.pdf
 
input buffering in lexical analysis in CD
input buffering in lexical analysis in CDinput buffering in lexical analysis in CD
input buffering in lexical analysis in CD
 
Novel 3D-Printed Soft Linear and Bending Actuators
Novel 3D-Printed Soft Linear and Bending ActuatorsNovel 3D-Printed Soft Linear and Bending Actuators
Novel 3D-Printed Soft Linear and Bending Actuators
 
FUNCTIONAL AND NON FUNCTIONAL REQUIREMENT
FUNCTIONAL AND NON FUNCTIONAL REQUIREMENTFUNCTIONAL AND NON FUNCTIONAL REQUIREMENT
FUNCTIONAL AND NON FUNCTIONAL REQUIREMENT
 
Indian Tradition, Culture & Societies.pdf
Indian Tradition, Culture & Societies.pdfIndian Tradition, Culture & Societies.pdf
Indian Tradition, Culture & Societies.pdf
 
Secure Key Crypto - Tech Paper JET Tech Labs
Secure Key Crypto - Tech Paper JET Tech LabsSecure Key Crypto - Tech Paper JET Tech Labs
Secure Key Crypto - Tech Paper JET Tech Labs
 
Introduction to Artificial Intelligence: Intelligent Agents, State Space Sear...
Introduction to Artificial Intelligence: Intelligent Agents, State Space Sear...Introduction to Artificial Intelligence: Intelligent Agents, State Space Sear...
Introduction to Artificial Intelligence: Intelligent Agents, State Space Sear...
 

Kafka at half the price with JBOD setup

  • 1. Kafka at half the price. Dong Lin, Streams Infrastructure. ©2017 LinkedIn Corporation. All Rights Reserved.
  • 2. Agenda ▪ Motivation – Why switch from RAID-10 to JBOD? – Tradeoff between cost and fault-tolerance ▪ Design – How to run Kafka with disk failure – How to move replicas between disks ▪ Alternatives ▪ Evaluation ▪ Changes in operational procedures ▪ Future work ▪ Reference
  • 3. RAID-10 setup with RF=2. [Diagram: a producer writes partitions A, B, and C to Broker 1 and Broker 2; each broker stores two mirrored copies of A, B, and C.]
  • 4. RAID-10 setup with RF=2. [Diagram: a producer writes partitions A, B, and C to Broker 1 and Broker 2; each broker stores two mirrored copies of A, B, and C.] Tolerates only one broker failure.
  • 5. RAID-10 setup with RF=3. [Diagram: a producer writes partitions A, B, and C to Broker 1, Broker 2, and Broker 3; each broker stores two mirrored copies of A, B, and C.] Tolerates up to two broker failures; 50% more storage cost.
  • 6. JBOD setup with RF=2. [Diagram: a producer writes partitions A, B, and C to Broker 1 and Broker 2; each broker stores a single copy of A, B, and C.] Tolerates only one broker failure; 50% less storage cost.
  • 7. JBOD setup with RF=3. [Diagram: a producer writes partitions A, B, and C to Broker 1, Broker 2, and Broker 3; each broker stores a single copy of A, B, and C.] Tolerates up to two broker failures; 25% less storage cost.
  • 8. RAID vs. JBOD
        Setup    Replication   Storage cost   Broker failure tolerance  Disk failure tolerance
        RAID-10  2 (baseline)  4X             1 (too small)             3
        RAID-10  3             6X (50% up)    2                         5
  • 9. RAID vs. JBOD
        Setup    Replication   Storage cost   Broker failure tolerance  Disk failure tolerance
        RAID-10  2 (baseline)  4X             1 (too small)             3
        RAID-10  3             6X (50% up)    2                         5
        JBOD     2             2X (50% down)  1 (too small)             1 (too small)
        JBOD     3 (future)    3X (25% down)  2 (100% up)               2 (33% down)
  • 10. RAID vs. JBOD
        Setup    Replication   Storage cost   Broker failure tolerance  Disk failure tolerance
        RAID-10  2 (baseline)  4X             1 (too small)             3
        RAID-10  3             6X (50% up)    2                         5
        JBOD     2             2X (50% down)  1 (too small)             1 (too small)
        JBOD     3 (future)    3X (25% down)  2 (100% up)               2 (33% down)
        JBOD     4             4X             3 (200% up)               3
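The figures in the RAID vs. JBOD comparison follow from simple counting. The sketch below reproduces them under two assumptions that the slides do not state explicitly: RAID-10 keeps two mirrored copies per broker and JBOD keeps one, and "disk failure tolerance" means the number of disks holding a partition that can fail while one copy still survives. The names `Layout`, `storage_cost`, and the other helpers are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layout:
    name: str
    copies_per_broker: int  # assumption: 2 for RAID-10 (mirroring), 1 for JBOD


RAID10 = Layout("RAID-10", 2)
JBOD = Layout("JBOD", 1)


def storage_cost(layout: Layout, rf: int) -> int:
    """Physical copies of each partition, in multiples of the raw data size."""
    return layout.copies_per_broker * rf


def broker_failure_tolerance(rf: int) -> int:
    """Brokers that can fail while one replica of a partition remains."""
    return rf - 1


def disk_failure_tolerance(layout: Layout, rf: int) -> int:
    """Disks holding a partition that can fail while one copy remains."""
    return storage_cost(layout, rf) - 1


for layout, rf in [(RAID10, 2), (RAID10, 3), (JBOD, 2), (JBOD, 3), (JBOD, 4)]:
    print(
        f"{layout.name:8} RF={rf} "
        f"storage={storage_cost(layout, rf)}X "
        f"broker_tolerance={broker_failure_tolerance(rf)} "
        f"disk_tolerance={disk_failure_tolerance(layout, rf)}"
    )
```

Relative to the RAID-10, RF=2 baseline (4X storage), JBOD with RF=3 uses 3X, which is the 25% saving quoted on the slides.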
  • 11. Agenda ▪ Motivation – Why switch from RAID-10 to JBOD? – Tradeoff between cost and fault-tolerance ▪ Design – How to run Kafka with disk failure – How to move replicas between disks ▪ Alternatives ▪ Evaluation ▪ Changes in operational procedures ▪ Future work ▪ Reference
  • 12. Problem 1: All replicas become offline if any log directory fails. [Diagram: a broker with disks A, B, and C hits an IOException when accessing disk B; the replicas on all three disks go offline.]
  • 13. Solution: Only replicas on the failed disk become offline. [Diagram: a broker with disks A, B, and C hits an IOException when accessing disk B; the replicas on disks A and C stay online.]
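The behaviour in the solution slide can be pictured with a small model. This is a minimal sketch, not Kafka's implementation: the `Broker` class, its fields, and `handle_io_error` are hypothetical names chosen for illustration.

```python
class Broker:
    """Toy broker that tracks which replicas live in which log directory."""

    def __init__(self, replicas_by_dir):
        # Hypothetical layout: log directory name -> replicas stored there.
        self.replicas_by_dir = {d: set(r) for d, r in replicas_by_dir.items()}
        self.offline_dirs = set()

    def handle_io_error(self, log_dir):
        """Mark one directory offline and return the replicas it takes down."""
        self.offline_dirs.add(log_dir)
        return set(self.replicas_by_dir[log_dir])

    def online_replicas(self):
        """Replicas on directories that have not reported an IOException."""
        online = set()
        for log_dir, replicas in self.replicas_by_dir.items():
            if log_dir not in self.offline_dirs:
                online |= replicas
        return online


broker = Broker({"A": ["p1"], "B": ["p2", "p3"], "C": ["p4"]})
lost = broker.handle_io_error("B")
print(sorted(lost), sorted(broker.online_replicas()))
```

The point of the sketch is that the failure is scoped to one directory: the broker process keeps serving everything on the healthy disks.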
  • 14. Problem 2: Controller does not recognize disk failure. [Diagram: ZooKeeper, the controller, and Broker 1 holding partitions 1 and 2; partition 2 is marked as failed.] STEP 1: I am online. STEP 2: Broker -> is alive? Broker -> partition list. STEP 3: Become leader for partitions 1 and 2. STEP 4: Partition 2 is offline. Result: no further leader election.
  • 15. Solution: Broker notifies and provides partition list to controller. [Diagram: ZooKeeper, the controller, and Broker 1 holding partitions 1 and 2; partition 2 is marked as failed.] STEP 1: Notify disk failure. STEP 2: Broker 1 has new disk failure. STEP 3: Become leader for partitions 1 and 2. STEP 4: Partition 2 is offline. STEP 5: Elect another broker as leader for partition 2.
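Step 5 of the solution can be sketched as a re-election that skips replicas reported offline. This is a simplified illustration, assuming the controller knows the replica assignment and picks the first healthy replica; real Kafka elects from the in-sync replica set, which this sketch ignores. The function name and data shapes are hypothetical.

```python
def elect_leaders(assignment, leaders, offline):
    """Return a new leader map that avoids replicas reported offline.

    assignment: partition -> list of broker ids, in preference order
    leaders:    partition -> current leader broker id
    offline:    set of (broker id, partition) pairs reported by brokers
    """
    new_leaders = {}
    for partition, replicas in assignment.items():
        current = leaders.get(partition)
        if current is not None and (current, partition) not in offline:
            new_leaders[partition] = current  # current leader is still healthy
            continue
        candidates = [b for b in replicas if (b, partition) not in offline]
        # None means no healthy replica is left, so the partition is offline.
        new_leaders[partition] = candidates[0] if candidates else None
    return new_leaders


assignment = {"partition-1": [1, 2], "partition-2": [1, 2]}
leaders = {"partition-1": 1, "partition-2": 1}
# Broker 1 reports that its replica of partition 2 sits on a failed disk.
print(elect_leaders(assignment, leaders, {(1, "partition-2")}))
```

Broker 1 keeps leadership of partition 1, because only the replica on its failed disk is excluded.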
  • 16. Problem 3: Broker always creates log for partition if not exist. [Diagram: ZooKeeper, the controller, and Broker 1 holding partitions 1 and 2; partition 2 is marked as failed.] STEP 1: I am online. STEP 2: Broker -> is alive? Broker -> partition list. STEP 3: Become follower for partition 2; create partition 2 if non-existent.
  • 17. Problem 3: Broker always creates log for partition if not exist. [Diagram: ZooKeeper, the controller, and Broker 1 holding partition 1, the failed partition 2, and a newly created second copy of partition 2.] STEP 1: I am online. STEP 2: Broker -> is alive? Broker -> partition list. STEP 3: Become follower for partition 2; create partition 2 if non-existent. STEP 4: Created partition 2 (problematic).
  • 18. Problem 3: Broker always creates log for partition if not exist. [Diagram: ZooKeeper, the controller, and Broker 1 holding partition 1, the failed partition 2, and a newly created second copy of partition 2.] STEP 1: I am online. STEP 2: Broker -> is alive? Broker -> partition list. STEP 3: Become follower for partition 2; create partition 2 if non-existent. STEP 4: Created partition 2 (problematic). Good disk may become overloaded.
  • 19. Solution: Controller specifies whether to create log for partition. [Diagram: ZooKeeper, the controller, and Broker 1 holding partitions 1 and 2; partition 2 is marked as failed.] STEP 1: I am online. STEP 2: Broker -> is alive? Broker -> partition list. Broker -> is new partition? STEP 3: Become follower for partition 2; this is NOT a new partition. STEP 4: Partition 2 is not available and there is offline log dir. STEP 5: Exclude broker 1 from leader election for partition 2.
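The broker-side decision in steps 3 and 4 of the solution can be sketched as follows. This is an illustrative model only: `become_follower`, `OfflineLogDirError`, and the "first online directory" placement rule are assumptions, not Kafka's actual code.

```python
class OfflineLogDirError(Exception):
    """Raised when a missing log may simply be sitting on a failed disk."""


def become_follower(partition, is_new, existing_logs, offline_dirs, online_dirs):
    """Return the log directory for `partition`, creating the log only if safe.

    existing_logs: partition -> directory of logs the broker can currently see
    is_new:        flag supplied by the controller (hypothetical field name)
    """
    if partition in existing_logs:
        return existing_logs[partition]
    if not is_new and offline_dirs:
        # The log is probably on the failed disk. Re-creating it on a good
        # disk would silently overload that disk, so refuse and report back.
        raise OfflineLogDirError(f"{partition} unavailable, offline dirs: {sorted(offline_dirs)}")
    target = online_dirs[0]  # placeholder placement policy
    existing_logs[partition] = target
    return target


logs = {"partition-1": "/disk1"}
try:
    become_follower("partition-2", False, logs, {"/disk2"}, ["/disk1"])
except OfflineLogDirError as err:
    print("reported to controller:", err)
```

A genuinely new partition (`is_new=True`) is still created on an online directory, so topic creation keeps working while a disk is down.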
  • 20. Problem 4: No mechanism to move replicas between disks. [Diagram: Broker 1 with partitions P1 to P7 spread across Disk 1 and Disk 2.]
  • 21. Example workflow to move replicas between disks. [Diagram: a client talking to a broker that has Disk 1 and Disk 2.] STEP 1: DescribeDirRequest. STEP 2: DescribeDirResponse (partition list and size). STEP 3: ChangeDirRequest. STEP 4: Create p1.move. STEP 5: ChangeDirResponse (in progress). STEP 6: Copy data from p1.log to p1.move. STEP 7: Delete p1.log and rename p1.move to p1.log. STEP 8: Verify new assignment via DescribeDirRequest.
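Steps 4, 6, and 7 of the workflow amount to a copy-then-swap on the file system. The sketch below shows that pattern under strong simplifications: a replica is modelled as a single file, nothing is appended during the copy, and `move_replica` is a hypothetical helper rather than a Kafka API.

```python
import os
import shutil
import tempfile


def move_replica(src_dir, dst_dir, name):
    """Move `<name>.log` from src_dir to dst_dir through a `<name>.move` file."""
    src = os.path.join(src_dir, name + ".log")
    tmp = os.path.join(dst_dir, name + ".move")
    dst = os.path.join(dst_dir, name + ".log")
    shutil.copyfile(src, tmp)  # steps 4 and 6: create p1.move and copy the data
    os.remove(src)             # step 7: delete the old p1.log
    os.replace(tmp, dst)       # step 7: rename p1.move to p1.log
    return dst


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as disk1, tempfile.TemporaryDirectory() as disk2:
        with open(os.path.join(disk1, "p1.log"), "wb") as f:
            f.write(b"records")
        print(move_replica(disk1, disk2, "p1"))
```

Writing to a `.move` name first means a half-copied replica is never mistaken for a complete log if the broker stops mid-copy.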
  • 22. Agenda ▪ Motivation – Why switch from RAID-10 to JBOD? – Tradeoff between cost and fault-tolerance ▪ Design – How to run Kafka with disk failure – How to move replicas between disks ▪ Alternatives ▪ Evaluation ▪ Changes in operational procedures ▪ Future work ▪ Reference
  • 23. Alternatives ▪ RAID-0 doesn’t provide disk fault tolerance – Assume each broker has 10 disks and RF = 2 – RAID-0 has 100X higher probability of unavailability due to disk failure than JBOD ▪ RAID-5 and RAID-6 have poor performance ▪ Hardware RAID is expensive ▪ One broker per disk
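The 100X figure for RAID-0 can be checked with a back-of-the-envelope model. The sketch assumes each disk fails independently with the same small probability `p`, that one failed disk takes down a whole RAID-0 broker, and that a partition is unavailable only when all of its replicas are lost. These assumptions and the function names are not from the slides.

```python
def p_unavailable_raid0(p, disks_per_broker, rf):
    """All RF brokers must each lose at least one of their striped disks."""
    p_broker_down = 1 - (1 - p) ** disks_per_broker
    return p_broker_down ** rf


def p_unavailable_jbod(p, rf):
    """All RF disks that hold the partition must fail."""
    return p ** rf


p = 0.001  # illustrative per-disk failure probability
ratio = p_unavailable_raid0(p, disks_per_broker=10, rf=2) / p_unavailable_jbod(p, rf=2)
print(f"RAID-0 is about {ratio:.0f}x more likely to lose a partition than JBOD")
```

For small `p` the broker failure probability is roughly `10 * p`, so the ratio approaches `10 ** 2 = 100`, matching the slide.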
  • 24. one-broker-per-machine vs. one-broker-per-disk. [Diagram: one physical machine with Disk 1, Disk 2, and Disk 3 served by a single Broker 1 (one-broker-per-machine) vs. the same machine running Broker 1, Broker 2, and Broker 3, one per disk (one-broker-per-disk).]
  • 25. one-broker-per-machine vs. one-broker-per-disk ▪ Both solutions use JBOD as disk configuration ▪ Main drawbacks of one-broker-per-disk (assuming 10 disks per machine) – 100X threads and 100X sockets per machine – 10X control plane traffic from the controller to brokers (e.g. MetadataRequest) – 10X broker instances and configuration files to manage – 10X time to bounce a cluster if we bounce one broker at a time – 10X load on external services (e.g. a service used to query per-topic ACL) – Less efficient quota enforcement – Less efficient rebalance across disks on the same machine – Lower throughput
  • 26. Experimental setup
        ▪ Brokers deployed on 15 machines with 10 disks per machine
                                  IO threads  Network threads  Replica-fetcher threads
          One-broker-per-machine  160         120              140
          One-broker-per-disk     16          12               14
        ▪ Producers deployed on 15 machines
          acks  threads  sync  retries  retry backoff  message size  batch size  request timeout
          all   50       true  MAX_INT  60 sec         100 KB        1 MB        MAX_INT
        ▪ Topic configuration
          partition  replication factor  min-insync-replicas
          512        3                   3
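As a rough guide to reproducing the producer side of this setup, the settings in the producer table map onto standard Kafka producer configuration keys as sketched below. The mapping is an assumption: the slides do not name the keys, `threads`, `sync`, and `message size` look like test-harness settings rather than producer configs and are left out, and 1 MB is taken to mean 1024 * 1024 bytes.

```python
MAX_INT = 2**31 - 1  # Java Integer.MAX_VALUE

# Assumed mapping from the slide's column names to producer config keys.
producer_props = {
    "acks": "all",
    "retries": MAX_INT,
    "retry.backoff.ms": 60 * 1000,   # 60 sec
    "batch.size": 1024 * 1024,       # 1 MB, assuming binary megabytes
    "request.timeout.ms": MAX_INT,
}


def to_properties(props):
    """Render the dict as Java-style key=value lines."""
    return "\n".join(f"{key}={value}" for key, value in sorted(props.items()))


print(to_properties(producer_props))
```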
  • 27. One-broker-per-machine throughput: average throughput is 2.3 GBps.
  • 28. One-broker-per-disk throughput: average throughput is 2 GBps.
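The gap between the two averages is easier to compare as a percentage. The snippet below only restates the two figures reported on the slides; the function name is illustrative.

```python
def relative_gain(new, baseline):
    """Fractional improvement of `new` over `baseline`."""
    return (new - baseline) / baseline


per_machine_gbps = 2.3  # one-broker-per-machine average from the slides
per_disk_gbps = 2.0     # one-broker-per-disk average from the slides

gain = relative_gain(per_machine_gbps, per_disk_gbps)
print(f"one-broker-per-machine is {gain:.0%} faster in this experiment")
```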
  • 29. Agenda ▪ Motivation – Why switch from RAID-10 to JBOD? – Tradeoff between cost and fault-tolerance ▪ Design – How to run Kafka with disk failure – How to move replicas between disks ▪ Alternatives ▪ Evaluation ▪ Changes in operational procedures ▪ Future work ▪ Reference
  • 30. Changes in operational procedure ▪ Adjust replication factor and min.insync.replicas ▪ Configure num.replica.move.threads for broker ▪ Monitor disk failure via the OfflineLogDirectoriesCount metric
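When adjusting the replication factor and min.insync.replicas, a quick check of how many replicas can be offline before `acks=all` producers start failing may help. With `acks=all`, a write succeeds only while at least min.insync.replicas replicas are in sync, so the headroom is the difference between the two settings. The helper below is a hypothetical sketch of that arithmetic.

```python
def tolerable_replica_failures(replication_factor, min_insync_replicas):
    """Replicas that may be offline while acks=all writes still succeed."""
    if not 1 <= min_insync_replicas <= replication_factor:
        raise ValueError("min.insync.replicas must be between 1 and the replication factor")
    return replication_factor - min_insync_replicas


for rf, min_isr in [(2, 1), (3, 2), (3, 3)]:
    print(f"RF={rf} min.insync.replicas={min_isr} -> "
          f"{tolerable_replica_failures(rf, min_isr)} offline replica(s) tolerated")
```

Under JBOD a single disk failure takes one replica offline, so a topic with RF=3 and min.insync.replicas=3 would stop accepting `acks=all` writes as soon as one of its disks fails.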
  • 31. Future work ▪ Use a more intelligent solution to select the log directory for a new replica ▪ Automatic load balancing across log directories on the same broker – Reduced operational overhead ▪ Distribute segments of a given replica across multiple log directories – Less overhead for rebalance between disks – Higher partition size limit ▪ Handle partial disk failure, e.g. a disk with degraded performance
  • 32. References ▪ KIP-112: Handle disk failure for JBOD ▪ KIP-113: Support replicas movement between log directories