Apache BookKeeper is a high-performance distributed log service that provides durability and ordering guarantees. It addresses common distributed-systems problems such as failures, inconsistencies, and split-brain, and it exposes an immutable data abstraction: ledgers composed of segments and blocks. DistributedLog, Pulsar, and Salesforce Distributed Store use BookKeeper as a building block for logs, queues, and streams. DistributedLog handles 1.5 trillion records per day at Twitter, and Pulsar serves over 100 billion messages per day at Yahoo!.
16. BookKeeper - Durable Storage
A Durable Storage Optimized for Immutable Data
Serve as a building block for reliable systems
Commodity Hardware
Durability
Replication, Consistency, Recovery
Client Library
19. Guarantees
◉ If an entry has been acknowledged, it must be readable
◉ If an entry is read once, it must always be readable
20. History
◉ Initial Use Case - Hadoop NameNode HA
◉ 2008: Open-sourced as a contrib module of ZooKeeper
◉ 2011: Sub-Project of ZooKeeper
◉ 2012: Yahoo! Push Notification
◉ 2012~Now: DistributedLog, Pulsar, Majordodo
◉ 2015~Now: Salesforce Distributed Store
23. Reliable Writes
◉ Store checksum along with entry
◉ Fsync entries before responding
◉ Ack when accepted by a quorum of bookies:
○ All previous entries
○ This entry
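The write path on this slide (a checksum stored with each entry, and an acknowledgement only once the entry and all earlier entries have been accepted by a quorum of bookies) can be illustrated with a toy in-memory model. This is a minimal sketch and not BookKeeper's client API: `ToyBookie`, `ToyLedgerWriter`, `write_to`, and `ack_quorum` are hypothetical names, CRC32 is an assumed checksum choice, and the fsync step is only simulated.

```python
import zlib
from dataclasses import dataclass, field


@dataclass
class ToyBookie:
    """In-memory stand-in for a bookie (hypothetical, not the real server)."""
    store: dict = field(default_factory=dict)

    def add_entry(self, entry_id: int, payload: bytes) -> bool:
        # Keep a checksum next to the payload so corruption is detectable on read.
        self.store[entry_id] = (zlib.crc32(payload), payload)
        # A real bookie would fsync before responding; here the write is in memory only.
        return True

    def read_entry(self, entry_id: int) -> bytes:
        checksum, payload = self.store[entry_id]
        if zlib.crc32(payload) != checksum:
            raise IOError(f"checksum mismatch for entry {entry_id}")
        return payload


class ToyLedgerWriter:
    """Tracks per-entry bookie responses and acknowledges entries in order."""

    def __init__(self, bookies, ack_quorum):
        self.bookies = bookies
        self.ack_quorum = ack_quorum
        self.acks = {}            # entry_id -> number of bookies that accepted it
        self.last_confirmed = -1  # highest entry id acknowledged to the caller

    def write_to(self, bookie_index, entry_id, payload):
        """Send one entry to one bookie; return entry ids that became acknowledged."""
        if self.bookies[bookie_index].add_entry(entry_id, payload):
            self.acks[entry_id] = self.acks.get(entry_id, 0) + 1
        return self._advance()

    def _advance(self):
        # An entry is acknowledged only when it AND every earlier entry
        # have been accepted by at least ack_quorum bookies.
        newly_acked = []
        while self.acks.get(self.last_confirmed + 1, 0) >= self.ack_quorum:
            self.last_confirmed += 1
            newly_acked.append(self.last_confirmed)
        return newly_acked
```

With three toy bookies and `ack_quorum=2`, writing entry 1 to two bookies acknowledges nothing while entry 0 is still short of its quorum. Once entry 0 reaches two bookies, both entries are acknowledged together, in order.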
36. Scale DistributedLog at Twitter
◉ 1.5 trillion records/day, 17.5 petabytes/day
◉ O(10) thousand streams, O(1) million live ledgers
◉ O(10^2) bookies, O(10^3) proxies
◉ Record sizes range from 100 bytes to 20 KB and beyond
◉ Data is kept from hours to days, and up to a year
◉ Replication factor is 3 or 5; 9 or 15 for the global use case
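To put the daily totals on this slide in per-second terms, the arithmetic can be sketched as below. The helper name `daily_to_rates` is hypothetical, and two assumptions are baked in: a petabyte is taken as 10^15 bytes, and load is spread evenly over the day. The slide does not say whether 17.5 petabytes/day counts replicated copies, so the derived average record size is only a rough figure.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def daily_to_rates(records_per_day: float, bytes_per_day: float) -> dict:
    """Convert daily totals into average per-second rates and mean record size."""
    return {
        "records_per_second": records_per_day / SECONDS_PER_DAY,
        "bytes_per_second": bytes_per_day / SECONDS_PER_DAY,
        "avg_record_bytes": bytes_per_day / records_per_day,
    }


# Figures from the slide: 1.5 trillion records/day, 17.5 petabytes/day.
rates = daily_to_rates(records_per_day=1.5e12, bytes_per_day=17.5e15)
print(f"{rates['records_per_second'] / 1e6:.1f} million records/s")
print(f"{rates['bytes_per_second'] / 1e9:.1f} GB/s")
print(f"{rates['avg_record_bytes'] / 1e3:.1f} KB per record on average")
```

Under these assumptions the totals work out to roughly 17.4 million records per second and about 11.7 KB per record on average.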
37. DistributedLog Resources
◉ Website - https://distributedlog.io
◉ Mailing List - dev@distributedlog.incubator.apache.org
◉ Project Ideas - https://cwiki.apache.org/confluence/display/DL/Project+Ideas
◉ Paper - “DistributedLog: A high performance replicated log service” (ICDE 2017)
41. Scale Pulsar at Yahoo!
◉ 100 billion messages per day
◉ More than 1.4 million topics
◉ Average publish latency across services under 5 ms
◉ 10+ data centers, cross-region replication