Kafka at half the price with JBOD setup

This talk introduces the JBOD setup for Apache Kafka and shows how LinkedIn can save more than 30% of its Kafka storage cost by adopting it. The talk was given at the LinkedIn Streaming meetup in May 2017.



1. Kafka at half the price. Dong Lin, Streams Infrastructure. ©2017 LinkedIn Corporation. All Rights Reserved.
2. Agenda
   ▪ Motivation
     – Why switch from RAID-10 to JBOD?
     – Tradeoff between cost and fault-tolerance
   ▪ Design
     – How to run Kafka with disk failure
     – How to move replicas between disks
   ▪ Alternatives
   ▪ Evaluation
   ▪ Changes in operational procedures
   ▪ Future work
   ▪ Reference
3. RAID-10 setup with RF=2
   [diagram: a producer writes partitions A, B, C to Broker 1 and Broker 2; each broker's RAID-10 array stores a mirrored copy of A, B, C]
4. RAID-10 setup with RF=2
   [same diagram as slide 3]
   - Tolerates only one broker failure
5. RAID-10 setup with RF=3
   [diagram: a producer writes partitions A, B, C to Brokers 1, 2, and 3; each broker stores a mirrored copy of A, B, C]
   - Tolerates up to two broker failures
   - 50% more storage cost
6. JBOD setup with RF=2
   [diagram: a producer writes partitions A, B, C to Broker 1 and Broker 2; each broker stores a single, unmirrored copy]
   - Tolerates only one broker failure
   - 50% less storage cost
7. JBOD setup with RF=3
   [diagram: a producer writes partitions A, B, C to Brokers 1, 2, and 3; each broker stores a single, unmirrored copy]
   - Tolerates up to two broker failures
   - 25% less storage cost
8. RAID vs. JBOD

   Setup     Replication    Storage cost    Broker failure tolerance    Disk failure tolerance
   RAID-10   2 (baseline)   4X              1 (too small)               3
   RAID-10   3              6X (50% up)     2                           5
9. RAID vs. JBOD

   Setup     Replication    Storage cost    Broker failure tolerance    Disk failure tolerance
   RAID-10   2 (baseline)   4X              1 (too small)               3
   RAID-10   3              6X (50% up)     2                           5
   JBOD      2              2X (50% down)   1 (too small)               1 (too small)
   JBOD      3 (future)     3X (25% down)   2 (100% up)                 2 (33% down)
10. RAID vs. JBOD

   Setup     Replication    Storage cost    Broker failure tolerance    Disk failure tolerance
   RAID-10   2 (baseline)   4X              1 (too small)               3
   RAID-10   3              6X (50% up)     2                           5
   JBOD      2              2X (50% down)   1 (too small)               1 (too small)
   JBOD      3 (future)     3X (25% down)   2 (100% up)                 2 (33% down)
   JBOD      4              4X              3 (300% up)                 3
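The storage-cost column follows from a simple product: bytes on disk per byte of user data equal the Kafka replication factor times the RAID write amplification (2 for the mirrored RAID-10 arrays, 1 for JBOD). A minimal sketch of that arithmetic, assuming this model (the write-amplification factors are standard RAID-10 behavior, not stated on the slide):

```java
public class StorageCost {

    static int storageCost(int replicationFactor, int raidWriteAmplification) {
        return replicationFactor * raidWriteAmplification;
    }

    public static void main(String[] args) {
        System.out.println("RAID-10, RF=2: " + storageCost(2, 2) + "X"); // 4X (baseline)
        System.out.println("RAID-10, RF=3: " + storageCost(3, 2) + "X"); // 6X (50% up)
        System.out.println("JBOD,    RF=2: " + storageCost(2, 1) + "X"); // 2X (50% down)
        System.out.println("JBOD,    RF=3: " + storageCost(3, 1) + "X"); // 3X (25% down)
        System.out.println("JBOD,    RF=4: " + storageCost(4, 1) + "X"); // 4X
    }
}
```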
11. Agenda
   ▪ Motivation
     – Why switch from RAID-10 to JBOD?
     – Tradeoff between cost and fault-tolerance
   ▪ Design
     – How to run Kafka with disk failure
     – How to move replicas between disks
   ▪ Alternatives
   ▪ Evaluation
   ▪ Changes in operational procedures
   ▪ Future work
   ▪ Reference
12. Problem 1: All replicas become offline if any log directory fails
   [diagram: a broker with Disks A, B, C; an IOException while accessing Disk B takes every replica on the broker offline, not just those on Disk B]
13. Solution: Only replicas on the failed disk become offline
   [diagram: the same broker keeps serving replicas on Disks A and C; only the replicas on the failed Disk B go offline]
14. Problem 2: Controller does not recognize disk failure
   STEP 1: Broker 1 registers in Zookeeper ("I am online")
   STEP 2: Controller reads from Zookeeper whether the broker is alive and its partition list
   STEP 3: Controller tells Broker 1 to become leader for partitions 1 and 2
   STEP 4: Partition 2 is offline on Broker 1 (its disk failed), yet no further leader election takes place
15. Solution: Broker notifies and provides partition list to controller
   STEP 1: Broker 1 notifies Zookeeper of the disk failure
   STEP 2: Controller learns that Broker 1 has a new disk failure
   STEP 3: Controller tells Broker 1 to become leader for partitions 1 and 2
   STEP 4: Broker 1 reports that partition 2 is offline
   STEP 5: Controller elects another broker as leader for partition 2
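A minimal sketch of how a broker can raise such a notification through Zookeeper. The znode path and JSON payload below are assumptions modeled on how the feature eventually shipped (a sequential child under /log_dir_event_notification that the controller watches); they are not taken from the slide, and the parent znode is assumed to already exist:

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

import java.nio.charset.StandardCharsets;

public class LogDirFailureNotification {
    public static void main(String[] args) throws Exception {
        // Assumed Zookeeper address.
        ZooKeeper zk = new ZooKeeper("zk-host:2181", 30_000, event -> { });

        // Assumed payload: the controller only needs to learn which broker had the failure,
        // then it asks that broker which partitions are now offline (STEP 2 above).
        byte[] payload = "{\"version\":1,\"broker\":1,\"event\":1}".getBytes(StandardCharsets.UTF_8);

        zk.create("/log_dir_event_notification/log_dir_event_", payload,
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT_SEQUENTIAL);
        zk.close();
    }
}
```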
16. Problem 3: Broker always creates the log for a partition if it does not exist
   STEP 1: Broker 1 registers in Zookeeper ("I am online")
   STEP 2: Controller reads from Zookeeper whether the broker is alive and its partition list
   STEP 3: Controller tells Broker 1 to become follower for partition 2, creating partition 2 if it does not exist
17. Problem 3: Broker always creates the log for a partition if it does not exist
   STEPS 1-3: same as slide 16
   STEP 4: Broker 1 creates partition 2 again on a good disk, even though the existing copy sits on the failed disk (problematic)
18. Problem 3: Broker always creates the log for a partition if it does not exist
   STEPS 1-4: same as slide 17
   Consequence: the good disk may become overloaded
19. Solution: Controller specifies whether to create the log for a partition
   STEP 1: Broker 1 registers in Zookeeper ("I am online")
   STEP 2: Controller reads from Zookeeper whether the broker is alive, its partition list, and whether each partition is new
   STEP 3: Controller tells Broker 1 to become follower for partition 2 and that this is NOT a new partition
   STEP 4: Broker 1 replies that partition 2 is not available and that it has an offline log dir
   STEP 5: Controller excludes Broker 1 from leader election for partition 2
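A minimal, illustrative sketch (not Kafka's actual code) of the broker-side rule this solution implies: create a log only when the controller marks the partition as new; if the log is missing, the partition is not new, and some log directory is offline, the data may live on the failed disk, so the broker reports the partition as unavailable instead of creating an empty log.

```java
import java.util.Set;

public class BecomeFollowerRule {

    enum Outcome { ALREADY_ONLINE, CREATE_LOG, REPORT_OFFLINE }

    // Hypothetical decision rule; names and structure are illustrative only.
    static Outcome onBecomeFollower(String partition, boolean isNewPartition,
                                    Set<String> partitionsOnGoodDisks, boolean hasOfflineLogDir) {
        if (partitionsOnGoodDisks.contains(partition)) return Outcome.ALREADY_ONLINE;
        if (isNewPartition) return Outcome.CREATE_LOG;        // safe: no prior data anywhere
        if (hasOfflineLogDir) return Outcome.REPORT_OFFLINE;  // data may sit on the failed disk
        return Outcome.CREATE_LOG;                            // no offline dirs, log truly absent
    }

    public static void main(String[] args) {
        // Broker 1 from the slide: partition 2's log is on the failed disk and the controller
        // marked it as not new, so the broker must not recreate it.
        System.out.println(onBecomeFollower("partition-2", false, Set.of("partition-1"), true));
    }
}
```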
20. Problem 4: No mechanism to move replicas between disks
   [diagram: Broker 1 with partitions P1-P7 spread unevenly across Disk 1 and Disk 2]
21. Example workflow to move replicas between disks
   STEP 1: Client sends DescribeDirRequest to the broker
   STEP 2: Broker returns DescribeDirResponse with the partition list and size per log directory
   STEP 3: Client sends ChangeDirRequest naming the target disk
   STEP 4: Broker creates p1.move on the destination disk (Disk 2)
   STEP 5: Broker returns ChangeDirResponse (in progress)
   STEP 6: Broker copies data from p1.log to p1.move
   STEP 7: Broker deletes p1.log and renames p1.move to p1.log
   STEP 8: Client verifies the new assignment via DescribeDirRequest
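The request names above are from the KIP-113 draft. In the Kafka Java AdminClient that this work eventually produced, the same workflow maps onto describeLogDirs() and alterReplicaLogDirs() (Kafka 1.1 and later). A minimal sketch, assuming a topic named my-topic, broker 1, a bootstrap server of localhost:9092, and a target log directory /mnt/disk2/kafka-logs:

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.common.TopicPartitionReplica;

import java.util.Collections;
import java.util.Properties;

public class MoveReplicaBetweenDisks {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address

        try (AdminClient admin = AdminClient.create(props)) {
            int brokerId = 1;

            // STEPS 1-2: describe the broker's log directories (replica list and size per dir).
            admin.describeLogDirs(Collections.singletonList(brokerId)).all().get()
                 .forEach((broker, dirs) -> System.out.println("Broker " + broker + ": " + dirs));

            // STEPS 3-7: ask the broker to move one replica to another log directory; the broker
            // copies the data in the background and swaps the new copy in once it has caught up.
            TopicPartitionReplica replica = new TopicPartitionReplica("my-topic", 0, brokerId);
            admin.alterReplicaLogDirs(
                    Collections.singletonMap(replica, "/mnt/disk2/kafka-logs")).all().get();

            // STEP 8: describe the log directories again to verify the new assignment.
            admin.describeLogDirs(Collections.singletonList(brokerId)).all().get()
                 .forEach((broker, dirs) -> System.out.println("Broker " + broker + ": " + dirs));
        }
    }
}
```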
22. Agenda
   ▪ Motivation
     – Why switch from RAID-10 to JBOD?
     – Tradeoff between cost and fault-tolerance
   ▪ Design
     – How to run Kafka with disk failure
     – How to move replicas between disks
   ▪ Alternatives
   ▪ Evaluation
   ▪ Changes in operational procedures
   ▪ Future work
   ▪ Reference
23. Alternatives
   ▪ RAID-0 does not provide disk fault tolerance
     – Assume each broker has 10 disks and RF = 2
     – RAID-0 has a 100X higher probability of unavailability due to disk failure than JBOD (see the sketch below)
   ▪ RAID-5 and RAID-6 have poor performance
   ▪ Hardware RAID is expensive
   ▪ One broker per disk
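A back-of-envelope sketch of where the 100X figure comes from; the probability model is an assumption, not spelled out on the slide. With RF = 2, a partition is unavailable only when both of its replicas are lost: under JBOD that requires the two specific disks holding them to fail, while under RAID-0 any one of a broker's 10 disks takes the whole broker down.

```java
public class Raid0VsJbodUnavailability {
    public static void main(String[] args) {
        double p = 0.01;          // assumed probability that a given disk fails in some window
        int disksPerBroker = 10;

        // JBOD: both replicas lost only if the two specific disks holding them fail.
        double jbod = p * p;

        // RAID-0: each broker is lost if ANY of its disks fails (~disksPerBroker * p for small p).
        double raid0 = (disksPerBroker * p) * (disksPerBroker * p);

        System.out.printf("JBOD   ~ %.6f%n", jbod);
        System.out.printf("RAID-0 ~ %.6f%n", raid0);
        System.out.printf("ratio  ~ %.0fX%n", raid0 / jbod);  // ~100X
    }
}
```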
24. one-broker-per-machine vs. one-broker-per-disk
   [diagram: one physical machine with Disks 1-3 served by a single Broker 1, versus the same machine with Brokers 1-3, one per disk]
25. one-broker-per-machine vs. one-broker-per-disk
   ▪ Both solutions use JBOD as the disk configuration
   ▪ Main drawbacks of one-broker-per-disk (assuming 10 disks per machine)
     – 100X threads and 100X sockets per machine
     – 10X control plane traffic from the controller to brokers (e.g. MetadataRequest)
     – 10X broker instances and configuration files to manage
     – 10X time to bounce a cluster if we bounce one broker at a time
     – 10X load on external services (e.g. a service used to query per-topic ACLs)
     – Less efficient quota enforcement
     – Less efficient rebalance across disks on the same machine
     – Lower throughput
26. Experimental setup
   ▪ Brokers deployed on 15 machines with 10 disks per machine

     Deployment               IO threads   Network threads   Replica-fetcher threads
     One-broker-per-machine   160          120               140
     One-broker-per-disk      16           12                14

   ▪ Producers deployed on 15 machines

     acks   threads   sync   retries   retry backoff   message size   batch size   request timeout
     all    50        true   MAX_INT   60 sec          100 KB         1 MB         MAX_INT

   ▪ Topic configuration

     partitions   replication factor   min.insync.replicas
     512          3                    3
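As a sketch of how the producer table maps onto the Java client: the bootstrap servers, serializers, and the delivery.timeout.ms line are assumptions needed to make it runnable, and "threads", "sync", and "message size" belong to the benchmark harness rather than to ProducerConfig.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.ByteArraySerializer;

import java.util.Properties;

public class BenchmarkProducer {
    public static KafkaProducer<byte[], byte[]> create(String bootstrapServers) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class.getName());

        // Settings from the slide's producer table.
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        props.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE);
        props.put(ProducerConfig.RETRY_BACKOFF_MS_CONFIG, 60_000);              // 60 sec
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 1024 * 1024);               // 1 MB
        props.put(ProducerConfig.REQUEST_TIMEOUT_MS_CONFIG, Integer.MAX_VALUE);

        // Not on the slide: newer clients require delivery.timeout.ms >= linger.ms + request.timeout.ms.
        props.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, Integer.MAX_VALUE);

        return new KafkaProducer<>(props);
    }

    public static void main(String[] args) {
        try (KafkaProducer<byte[], byte[]> producer = create("localhost:9092")) {
            // Each benchmark message was ~100 KB; sending is left to the benchmark harness.
        }
    }
}
```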
27. One-broker-per-machine throughput
   [throughput chart] Average throughput is 2.3 GBps
28. One-broker-per-disk throughput
   [throughput chart] Average throughput is 2 GBps
29. Agenda
   ▪ Motivation
     – Why switch from RAID-10 to JBOD?
     – Tradeoff between cost and fault-tolerance
   ▪ Design
     – How to run Kafka with disk failure
     – How to move replicas between disks
   ▪ Alternatives
   ▪ Evaluation
   ▪ Changes in operational procedures
   ▪ Future work
   ▪ Reference
30. Changes in operational procedure
   ▪ Adjust replication factor and min.insync.replicas
   ▪ Configure num.replica.move.threads for the broker
   ▪ Monitor disk failure via the OfflineLogDirectoriesCount metric (a monitoring sketch follows)
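A minimal monitoring sketch, assuming the broker exposes JMX on port 9999 and that the gauge is registered under the MBean name the feature eventually shipped with (kafka.log:type=LogManager,name=OfflineLogDirectoryCount; the slide uses the KIP-112 draft name OfflineLogDirectoriesCount):

```java
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class OfflineLogDirMonitor {
    public static void main(String[] args) throws Exception {
        // Assumed JMX endpoint of one broker.
        JMXServiceURL url =
                new JMXServiceURL("service:jmx:rmi:///jndi/rmi://broker-host:9999/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            ObjectName gauge =
                    new ObjectName("kafka.log:type=LogManager,name=OfflineLogDirectoryCount");
            // Any value above zero means at least one log directory has failed on this broker.
            Object offlineDirs = mbs.getAttribute(gauge, "Value");
            System.out.println("Offline log directories: " + offlineDirs);
        } finally {
            connector.close();
        }
    }
}
```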
31. Future work
   ▪ Use a more intelligent solution to select the log directory for a new replica
   ▪ Automatic load balancing across log directories on the same broker
     – Reduced operational overhead
   ▪ Distribute segments of a given replica across multiple log directories
     – Less overhead for rebalance between disks
     – Higher partition size limit
   ▪ Handle partial disk failure, e.g. a disk with degraded performance
32. References
   ▪ KIP-112: Handle disk failure for JBOD (link)
   ▪ KIP-113: Support replicas movement between log directories (link)
