Dhruba Borthakur, Facebook
Dhruba Borthakur is an engineer at Facebook. He is one of the founding engineers of RocksDB, an open-source key-value store optimized for storing data in flash and main-memory storage. He was one of the founding architects of the Apache Hadoop Distributed File System and has been instrumental in scaling Facebook's Hadoop cluster to multiple petabytes. Dhruba has contributed code to the Apache HBase project. Earlier, he contributed to the development of the Andrew File System (AFS). He has an M.S. in Computer Science from the University of Wisconsin, Madison and a B.S. in Computer Science from BITS, Pilani, India.
5. RocksDB API
▪ Keys and values are arbitrary byte arrays.
▪ Data are stored sorted by key.
▪ Update Operations: Put/Delete/Merge
▪ Queries: Get/Iterator
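The operations on this slide can be illustrated with a small in-memory model. This is a sketch only, not the RocksDB API: the class `ToyKV`, its method names, and `append_operator` are hypothetical, and the merge here is applied eagerly at write time, which a real engine need not do.

```python
import bisect


class ToyKV:
    """In-memory stand-in for a sorted, byte-keyed store (not the RocksDB API)."""

    def __init__(self, merge_operator=None):
        self._keys = []      # keys kept in sorted order
        self._values = {}    # key -> value
        self._merge = merge_operator

    def put(self, key: bytes, value: bytes) -> None:
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def delete(self, key: bytes) -> None:
        if key in self._values:
            del self._values[key]
            self._keys.pop(bisect.bisect_left(self._keys, key))

    def merge(self, key: bytes, operand: bytes) -> None:
        # Read-modify-write folded into one call; the operator decides how
        # the existing value (or None) and the operand combine.
        self.put(key, self._merge(self._values.get(key), operand))

    def get(self, key: bytes):
        return self._values.get(key)

    def iterate(self, start: bytes = b""):
        # Yield (key, value) pairs in key order, from the first key >= start.
        i = bisect.bisect_left(self._keys, start)
        for k in self._keys[i:]:
            yield k, self._values[k]


def append_operator(existing, operand):
    """Hypothetical merge operator: comma-separated append."""
    return operand if existing is None else existing + b"," + operand
```

With `db = ToyKV(merge_operator=append_operator)`, a `db.put(b"a", b"1")` followed by `db.merge(b"a", b"x")` leaves `b"1,x"` under `b"a"`, and `db.iterate()` walks the keys in byte order.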
6. RocksDB Architecture
[Figure: architecture diagram. Labels: Write Request, Read Request; Memory: Active MemTable, ReadOnly MemTable; Persistent Storage: log files, LSM of sst files; transitions: Switch, Flush, Compaction.]
11. RocksDB: Open & Pluggable
[Figure: pluggable components. Inputs: Write Request from Application; Get or Scan Request from Application. In RAM: pluggable Memtable format. On storage: customizable WAL (transaction log), pluggable sst data format, pluggable Compaction, Blooms.]
12. RocksDB is a tool, not a solution by itself.
Embed it into your software solution.
16. Want to extend RocksDB functionality?
Use StackableDB
e.g. TTL Support, Geo-index, Redis-style
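The stacking idea can be sketched as a wrapper that forwards every call to an inner store and overrides only what it needs. Everything below is hypothetical (`BaseDB`, `StackableWrapper`, `TtlDB`, the injectable `clock`); it models the pattern rather than the RocksDB classes, and a real TTL layer may expire entries at a different point, for example during compaction.

```python
import time


class BaseDB:
    """Minimal stand-in for the underlying key-value store."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)

    def delete(self, key):
        self._data.pop(key, None)


class StackableWrapper:
    """Forwards every call to the wrapped DB; subclasses override selectively."""

    def __init__(self, db):
        self._db = db

    def put(self, key, value):
        self._db.put(key, value)

    def get(self, key):
        return self._db.get(key)

    def delete(self, key):
        self._db.delete(key)


class TtlDB(StackableWrapper):
    """Adds time-to-live by storing a write timestamp next to each value."""

    def __init__(self, db, ttl_seconds, clock=time.time):
        super().__init__(db)
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping

    def put(self, key, value):
        super().put(key, (self._clock(), value))

    def get(self, key):
        entry = super().get(key)
        if entry is None:
            return None
        written_at, value = entry
        if self._clock() - written_at >= self._ttl:
            super().delete(key)  # lazily drop the expired entry
            return None
        return value
```

Because `TtlDB` only talks to the wrapper interface, another layer can be stacked on top of it in the same way.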
17. Vision for the future
• Most performant database engine on RAM, SSD and disks
  • Optimize for next-generation storage hardware
• Flexibility to be deployed on varied environments
  • RocksDB’s components are pluggable
  • Enables software vendors to customize their solution
18. Roadmap for 2016
What do you want from RocksDB?
Here are some probable enhancements for RocksDB in 2016
How many of you have looked at the code internals?
How many of you have changed rocksdb code (for fun or for work)?
Describe the RocksDB architecture at a high level
Why RocksDB is different from other database engines
Roadmap discussion
A persistent store
C/C++ application servers can link in the RocksDB library
The database software is not the bottleneck while serving data on SSD and RAM
Server workloads: working set data does not fit entirely in RAM
10K lines of code modified
Sometimes I ask our clients within Facebook why they chose RocksDB. The answer is not that RocksDB rocks, or that it is solid or fast. It is simply that it’s easy to code with. Code tends to be cleaner. “I don’t have to maintain that ad hoc storage module anymore!”
The Runtime DB state is composed of the in-memory portion (memtables) and the on-disk sst (sorted string table) files. Depending on the compaction strategy, the sst files could be structured in different ways. Each memtable is backed by a WAL. During crash recovery, outstanding logs are replayed to reconstruct the in-memory state.
Writes go to the active memtable. When the current memtable is full, it will be moved to a readonly memtable list and a new active memtable will be created together with a new log file. A background job will be triggered immediately after the memtable switch, to flush the readonly memtable(s) to persistent storage. The log file that’s backing the readonly memtable is eligible for purging when the flush job is done.
A background compaction job is also running to maintain the LSM tree structure and limit the steady state DB size. Overwritten/deleted values will be purged during this process.
Reads consult both in-memory state and persistent state.
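The write, switch, flush, compaction, read and recovery steps described above can be sketched with a toy model. This is an illustration under simplifying assumptions, not RocksDB code: `ToyLSM` and all its members are hypothetical, the memtable limit counts keys rather than bytes, flush and compaction run only when called instead of in background jobs, and compaction merges everything into a single sst.

```python
TOMBSTONE = object()  # marker written in place of a deleted value


class ToyLSM:
    def __init__(self, memtable_limit=2):
        self.memtable_limit = memtable_limit
        self.active = {}       # active memtable
        self.active_log = []   # WAL backing the active memtable
        self.readonly = []     # (memtable, log) pairs awaiting flush, newest first
        self.ssts = []         # sorted (key, value) lists, newest first

    def put(self, key, value):
        self.active_log.append((key, value))  # log first, then memtable
        self.active[key] = value
        if len(self.active) >= self.memtable_limit:
            self._switch()

    def delete(self, key):
        self.put(key, TOMBSTONE)

    def _switch(self):
        # The full memtable becomes read-only; start a new memtable and log.
        self.readonly.insert(0, (self.active, self.active_log))
        self.active, self.active_log = {}, []

    def flush(self):
        # Write read-only memtables out as ssts, oldest first.
        while self.readonly:
            memtable, _log = self.readonly.pop()
            self.ssts.insert(0, sorted(memtable.items(), key=lambda kv: kv[0]))
            # _log is dropped here: it is no longer needed once the flush is done.

    def compact(self):
        # Merge all ssts; newer entries overwrite older ones, tombstones vanish.
        merged = {}
        for sst in reversed(self.ssts):
            merged.update(sst)
        live = sorted(((k, v) for k, v in merged.items() if v is not TOMBSTONE),
                      key=lambda kv: kv[0])
        self.ssts = [live] if live else []

    def get(self, key):
        # Newest data wins: active memtable, read-only memtables, then ssts.
        sources = [self.active]
        sources += [m for m, _ in self.readonly]
        sources += [dict(s) for s in self.ssts]
        for src in sources:
            if key in src:
                value = src[key]
                return None if value is TOMBSTONE else value
        return None

    def crash_and_recover(self):
        # Memtables are lost in a crash; logs and ssts survive. Replay the
        # outstanding logs, oldest first, to rebuild the in-memory state.
        logs = [log for _, log in reversed(self.readonly)] + [self.active_log]
        self.active, self.active_log, self.readonly = {}, [], []
        for log in logs:
            for key, value in log:
                self.put(key, value)
```

Keeping the sources ordered newest to oldest is what lets `get` stop at the first hit, including a tombstone.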
All major components are configurable, if not already totally pluggable.
Pluggability is the key here because we want to use RocksDB for a variety of products, with different workloads, different requirements and different trade-offs. For RocksDB, something that sits at the bottom of the application stack, the way to serve this vast variety without distorting the architecture is to make each component pluggable. I will give you a lot of examples to show what I mean.
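As one illustration of a pluggable component, the sketch below lets an engine accept any memtable implementation that offers the same three methods. The names (`HashMemtable`, `SortedMemtable`, `Engine`, `memtable_factory`) are hypothetical and do not correspond to RocksDB classes; the point is only that the trade-off (cheap writes versus cheap scans) is chosen by the caller, not hard-wired into the engine.

```python
import bisect


class HashMemtable:
    """Cheap point writes and lookups; sorts only when a scan is requested."""

    def __init__(self):
        self._d = {}

    def insert(self, key, value):
        self._d[key] = value

    def lookup(self, key):
        return self._d.get(key)

    def scan(self):
        return sorted(self._d.items(), key=lambda kv: kv[0])


class SortedMemtable:
    """Keeps keys ordered on every insert so that scans need no sorting."""

    def __init__(self):
        self._keys, self._vals = [], []

    def _find(self, key):
        i = bisect.bisect_left(self._keys, key)
        found = i < len(self._keys) and self._keys[i] == key
        return i, found

    def insert(self, key, value):
        i, found = self._find(key)
        if found:
            self._vals[i] = value
        else:
            self._keys.insert(i, key)
            self._vals.insert(i, value)

    def lookup(self, key):
        i, found = self._find(key)
        return self._vals[i] if found else None

    def scan(self):
        return list(zip(self._keys, self._vals))


class Engine:
    """Depends only on the insert/lookup/scan interface, not on a concrete class."""

    def __init__(self, memtable_factory):
        self._mem = memtable_factory()

    def put(self, key, value):
        self._mem.insert(key, value)

    def get(self, key):
        return self._mem.lookup(key)

    def scan(self):
        return self._mem.scan()
```

`Engine(HashMemtable)` and `Engine(SortedMemtable)` return the same results for the same operations; only their cost profile differs.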
1. disks, flash, ram
2. lower write amplification
3. tradeoff higher space amplification
4. lower write endurance
Just about every component is configurable, if not already completely pluggable.
Envs are a great route to pluggability.
1. does fsync on a single file also sync data from other files?
2. reduce fs-mutex contention for reads and writes
3. priority of writes: log writes are hi-pri, compaction writes are lo-pri
4.
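Point 3 in the list above can be sketched as an environment layer that queues writes and services them by priority. This is a hypothetical model (`PriorityEnv`, `submit_write`, `drain`, and the file names are made up for illustration) and says nothing about how a real Env schedules I/O; it only shows that the policy can live behind the environment interface.

```python
import heapq
import itertools

HIGH, LOW = 0, 1  # smaller number is served first


class PriorityEnv:
    """Toy environment: buffers writes and applies high-priority ones first."""

    def __init__(self):
        self.files = {}                 # filename -> bytearray, stands in for storage
        self._queue = []                # heap of pending writes
        self._seq = itertools.count()   # tie-breaker keeps FIFO order per priority

    def submit_write(self, filename, data, priority):
        heapq.heappush(self._queue, (priority, next(self._seq), filename, data))

    def drain(self):
        """Apply all pending writes; return the file names in service order."""
        order = []
        while self._queue:
            _priority, _seq, filename, data = heapq.heappop(self._queue)
            self.files.setdefault(filename, bytearray()).extend(data)
            order.append(filename)
        return order
```

If a low-priority compaction write is submitted before two high-priority log writes, `drain()` still services both log writes first, in the order they were submitted.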
The Redis-style stackable DB has bugs
database engine vs database system
1. “rockscheck”, similar to “fsck” in the filesystem world. It can detect inconsistencies, report them, quarantine them, and also fix some of them. Options to validate all metadata, to do a full scan of all block headers, to validate compression/decompression of all or specific blocks, etc.
2. ‘rocksdiff’, similar to the Linux diff, that can compare two DBs
3. ‘rockssync’, similar to rsync, that finds and applies relevant diffs from one DB to another
4. ‘rocksdfind’, similar to the Linux find, that can find a specified key (with regular expression match) and list a few key-values that are adjacent to that key
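Because data are stored sorted by key, a diff or sync between two DBs could be computed with a single merge-style pass over both key orders. The sketch below is only an assumption about how such tools might work, since the slides do not describe an implementation; `diff_sorted` and `apply_ops` are hypothetical helpers operating on plain sorted lists rather than on real databases.

```python
def diff_sorted(source, target):
    """Return the operations that turn `target` into `source`.

    Both arguments are lists of (key, value) pairs sorted by unique key.
    """
    ops = []
    i = j = 0
    while i < len(source) or j < len(target):
        if j == len(target) or (i < len(source) and source[i][0] < target[j][0]):
            # Key exists only in source: target needs it.
            ops.append(("put", source[i][0], source[i][1]))
            i += 1
        elif i == len(source) or target[j][0] < source[i][0]:
            # Key exists only in target: target must drop it.
            ops.append(("delete", target[j][0], None))
            j += 1
        else:
            # Same key on both sides: rewrite only if the values differ.
            if source[i][1] != target[j][1]:
                ops.append(("put", source[i][0], source[i][1]))
            i += 1
            j += 1
    return ops


def apply_ops(target, ops):
    """Apply the operations to a copy of `target` and return it sorted by key."""
    data = dict(target)
    for op, key, value in ops:
        if op == "put":
            data[key] = value
        else:
            data.pop(key, None)
    return sorted(data.items(), key=lambda kv: kv[0])
```

The list of operations plays the role of the diff; applying it to the target is the sync step, and an empty list means the two inputs already match.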