Ceph is an open-source distributed storage system addressing file, block, and object storage use cases. Next-generation storage devices require a change in strategy, so the community has been developing crimson-osd, an eventual replacement for ceph-osd intended to minimize CPU overhead and improve throughput and latency. SeaStore is a new backing store for crimson-osd targeted at emerging storage technologies, including persistent memory and ZNS devices.
Seastore: Next Generation Backing Store for Ceph
1. Brought to you by
Seastore: Next Generation
Backing Store for Ceph
Samuel Just
Senior Principal Software Engineer at Red Hat
2. Sam Just
Senior Principal Software Engineer
■ Currently Senior Principal Software Engineer at Red Hat
■ Crimson Lead
■ Veteran Ceph Contributor
3. Crimson!
What
■ Rewrite IO path in Seastar
● Preallocate cores
● One thread per core
● Explicitly shard all data structures
and work over cores
● No locks and no blocking
● Message passing between cores
● Polling for all IO
■ DPDK, SPDK
● Kernel bypass for network and
storage IO
■ Multi-year effort
Why
■ Not just about how many IOPS we do…
■ More about IOPS per CPU core
■ Current Ceph is based on a traditional
multi-threaded programming model
■ Context switching is too expensive
when storage is almost as fast as
memory
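The shard-per-core model above can be illustrated with a toy sketch (this is not the Seastar API; `ShardedMap` and its queue-draining loop are hypothetical stand-ins for Seastar's reactor and cross-core message passing):

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

// Toy model of crimson's shard-per-core design: every key is owned by
// exactly one shard, and operations are routed to the owning shard's
// queue instead of taking a lock.
class ShardedMap {
public:
  explicit ShardedMap(std::size_t nshards)
    : shards_(nshards), queues_(nshards) {}

  // Hash the key to its owning shard, mirroring how crimson-osd
  // explicitly shards data structures over reactor cores.
  std::size_t shard_of(const std::string& key) const {
    return std::hash<std::string>{}(key) % shards_.size();
  }

  // "Message passing": enqueue work on the owner's queue; no locks are
  // needed because only that shard ever touches its partition.
  void put(const std::string& key, int value) {
    std::size_t s = shard_of(key);
    queues_[s].push_back([this, s, key, value] { shards_[s][key] = value; });
  }

  // Drain each queue in turn, standing in for each core's reactor loop.
  void run_reactors() {
    for (std::size_t s = 0; s < queues_.size(); ++s) {
      while (!queues_[s].empty()) {
        queues_[s].front()();
        queues_[s].pop_front();
      }
    }
  }

  int get(const std::string& key) const {
    return shards_[shard_of(key)].at(key);
  }

private:
  std::vector<std::unordered_map<std::string, int>> shards_;
  std::vector<std::deque<std::function<void()>>> queues_;
};
```

In real Seastar the queues are lock-free cross-core channels polled by each reactor; the point of the sketch is only the ownership rule: data never needs a lock because it is never touched from a foreign core.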
8. ObjectStore
■ Transactional
■ Composed of flat object namespace
■ Object names may be large (>1k)
■ Each object contains a key->value mapping (string->bytes) as well as a data payload.
■ Supports COW object clones
■ Supports ordered listing of both the omap and the object namespace
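The data model in the bullets above can be sketched as a minimal in-memory store (hypothetical `MemStore`/`Object` types, not Ceph's actual ObjectStore API; `std::map` stands in for the ordered listings):

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Each object carries a data payload plus an ordered string->bytes omap.
struct Object {
  std::string data;                        // data payload
  std::map<std::string, std::string> omap; // ordered key->value mapping
};

class MemStore {
public:
  Object& get_or_create(const std::string& name) { return objects_[name]; }

  // Ordered listing of the flat object namespace (std::map keeps names
  // sorted, standing in for the ordered listing the real store supports).
  std::vector<std::string> list_objects() const {
    std::vector<std::string> out;
    for (const auto& [name, obj] : objects_) out.push_back(name);
    return out;
  }

  // Clone: here just a deep copy; the real store shares extents
  // copy-on-write rather than duplicating them.
  void clone(const std::string& src, const std::string& dst) {
    objects_[dst] = objects_.at(src);
  }

private:
  std::map<std::string, Object> objects_;
};
```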
9. SeaStore
■ New ObjectStore implementation designed natively for crimson’s threading/callback model
■ Avoids CPU-heavy metadata designs like RocksDB
■ Intended to exploit emerging technologies like ZNS, fast nvme, and persistent memory.
10. Seastore - ZNS
■ New NVMe Specification
■ Intended to address challenges with conventional FTL designs
● High write amplification, bad for QLC flash
● Background garbage collection tends to impact tail latencies
■ Different interface
● Drive divided into zones
● Zones can only be opened, written sequentially, closed, and released.
■ As it happens, this kind of write pattern tends to be good for conventional ssds as well.
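The zone interface described above can be modeled as a small state machine (a toy `Zone` type, not the NVMe ZNS command set): a zone is opened, appended to strictly at its write pointer, finished, and later reset as a whole.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>

class Zone {
public:
  enum class State { Empty, Open, Closed };

  explicit Zone(std::size_t capacity) : capacity_(capacity) {}

  void open() {
    if (state_ != State::Empty) throw std::logic_error("zone not empty");
    state_ = State::Open;
  }

  // Writes must land exactly at the write pointer: no overwrite in place.
  void append(std::size_t len) {
    if (state_ != State::Open) throw std::logic_error("zone not open");
    if (write_pointer_ + len > capacity_) throw std::logic_error("zone full");
    write_pointer_ += len;
  }

  void finish() { state_ = State::Closed; }

  // Reset releases the whole zone at once; there is no per-block delete,
  // which is why GC must relocate live data before a zone can be reused.
  void reset() {
    state_ = State::Empty;
    write_pointer_ = 0;
  }

  std::size_t write_pointer() const { return write_pointer_; }
  State state() const { return state_; }

private:
  std::size_t capacity_;
  std::size_t write_pointer_ = 0;
  State state_ = State::Empty;
};
```

A log-structured store like SeaStore fits this interface naturally: it already writes segments sequentially and frees them whole, which is also why the same write pattern helps conventional SSDs.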
11. Seastore - Persistent Memory
■ Characteristics:
● Almost DRAM like read latencies
● Write latency drastically lower than flash
● Very high write endurance
● Seems like a good fit for persistently caching data and metadata!
● Reads from persistent memory can simply return a ready future without waiting at all.
■ SeaStore approach currently being discussed:
● Keep caching layer in persistent memory.
● Update by copy-on-write with a paddr->extent mapping maintained via a write-ahead journal.
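The "ready future" point above can be sketched with `std::optional` standing in for a Seastar future (the `PmemCache` type and its paddr->extent map are hypothetical, per the design still under discussion):

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>
#include <unordered_map>

class PmemCache {
public:
  void insert(std::uint64_t paddr, std::string extent) {
    cache_[paddr] = std::move(extent);
  }

  // A hit is analogous to returning a ready future: the caller continues
  // immediately without suspending. A miss (nullopt) would instead
  // trigger an asynchronous device read.
  std::optional<std::string> try_read(std::uint64_t paddr) const {
    auto it = cache_.find(paddr);
    if (it == cache_.end()) return std::nullopt;
    return it->second;
  }

private:
  // paddr -> extent mapping, as in the copy-on-write scheme above.
  std::unordered_map<std::uint64_t, std::string> cache_;
};
```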
14. Seastore - Why use an LBA indirection?
[Diagram: a tree in which A points to B and B points to leaf C; GC rewrites C to a new location C’.]
When GCing node C, we need to do 2 things:
1. Find all incoming references (B in this case)
2. Write out a transaction updating (and dirtying) those
references as well as writing out the new block C’
Using direct references means we still need to maintain some
means of finding the parent references, and we pay the cost of
updating the relatively low fanout onode and omap trees.
By contrast, using an LBA indirection requires extra reads in the
lookup path, but potentially decreases read and write
amplification during gc.
It also makes refcounting and subtree sharing for COW clones
easier, and we can use it to do sparse mapping for object data
extents.
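The trade-off above can be made concrete with a tiny sketch (hypothetical structures; `std::unordered_map` stands in for the real LBA btree): because the parent references its child by logical address, GC relocates a block by updating a single LBA entry, and the parent is never dirtied.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

using laddr_t = std::uint64_t; // logical address
using paddr_t = std::uint64_t; // physical address

struct LbaTree {
  std::unordered_map<laddr_t, paddr_t> map; // stand-in for the real btree

  // GC relocation: copy the extent to a fresh segment and repoint the
  // mapping; no parent (onode/omap tree) blocks are touched.
  void relocate(laddr_t l, paddr_t new_p) { map[l] = new_p; }

  paddr_t resolve(laddr_t l) const { return map.at(l); }
};

struct Parent {
  laddr_t child; // B references C by logical address, so B never changes
};
```

With direct physical references, relocating C would instead require finding and rewriting B itself, dirtying the relatively low-fanout onode and omap trees; the indirection costs an extra lookup on reads but avoids that write amplification during GC.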
15. Seastore - Layout
[Diagram: journal segments composed of records. Each record contains a header, deltas (delta D’, delta E’), a logical block B, and a physical block A; the LBA tree references the extents D, E, A, and B before and after the record commits.]
16. Seastore - ZNS
[Diagram: the same record/segment layout as the previous slide; under ZNS, journal segments map onto zones that are written sequentially and released as a unit.]
17. Architecture
[Diagram: SeaStore layered over OnodeManager, OmapManager, ObjectDataHandler, etc.; these sit on TransactionManager, which is in turn built on Cache, Journal, LBAManager, and SegmentCleaner, rooted at RootBlock.]
■ Broadly, SeaStore is divided into components above and below TransactionManager:
● TransactionManager supplies a transactional interface in terms of logically addressed blocks used by
data extents and metadata structures like the ghobject->onode index and omap trees.
● Components under TransactionManager (mainly the logical address tree) deal in terms of physically
addressed extents.
18. OnodeManager -- FLTree
■ crimson/os/seastore/onode_manager/staged-fltree/
■ Btree based structure for mapping ghobject_t -> onode
■ Splits ghobject_t keys into a fixed prefix (shard, pool, key), variable middle (object name and
namespace), and fixed suffix (snapshot, generation)
■ Internal nodes drop components that are uniform over the whole node to reduce space overhead.
■ Internal nodes can also drop components that differ between adjacent keys.
● Allows nodes close to the root to avoid recording name and namespace, improving density.
■ Contributed mainly by Yingxin Cheng <yingxin.cheng@intel.com>
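The staged key split above can be sketched as follows (field names follow the slide; the types, `GhobjectKey` struct, and stage accessors are illustrative, not the actual staged-fltree encoding):

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <tuple>

struct GhobjectKey {
  // fixed prefix
  std::uint32_t shard;
  std::uint64_t pool;
  std::uint64_t key;
  // variable middle
  std::string nspace;
  std::string name;
  // fixed suffix
  std::uint64_t snap;
  std::uint64_t gen;

  auto prefix() const { return std::tie(shard, pool, key); }
  auto middle() const { return std::tie(nspace, name); }
  auto suffix() const { return std::tie(snap, gen); }

  // Staged comparison: an internal node whose keys all agree on a stage
  // can drop that stage entirely, which is the density win the slide
  // describes for nodes near the root.
  bool operator<(const GhobjectKey& o) const {
    if (prefix() != o.prefix()) return prefix() < o.prefix();
    if (middle() != o.middle()) return middle() < o.middle();
    return suffix() < o.suffix();
  }
};
```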
19. OmapManager -- BtreeOmapManager
■ crimson/os/seastore/omap_manager/btree/
■ Implements omap storage for each object.
■ Fairly straightforward btree mapping string keys to string values.
■ Contributed mainly by chunmei-liu <chunmei.liu@intel.com>
20. ObjectDataHandler
■ crimson/os/seastore/object_data_handler.h/cc
■ Each onode maps a contiguous, sparse range of logical addresses to that object’s offsets
■ Leverages the LBAManager to avoid a secondary extent map.
■ Clone support requires work to enable relocating logical extent addresses (TODO)
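The scheme above can be sketched with a hypothetical `SparseObjectData` helper (not the real ObjectDataHandler): each object reserves a contiguous logical-address range, so offset-to-laddr translation is pure arithmetic and no secondary extent map is needed, while unwritten ranges simply have no LBA entries.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

using laddr_t = std::uint64_t;

struct SparseObjectData {
  laddr_t base; // start of the object's reserved, contiguous laddr range

  // offset -> laddr is pure arithmetic; the LBAManager does the rest.
  laddr_t offset_to_laddr(std::uint64_t offset) const { return base + offset; }

  // Sparse laddr -> data mapping; holes simply have no entry.
  std::map<laddr_t, std::string> extents;

  void write(std::uint64_t offset, std::string data) {
    extents[offset_to_laddr(offset)] = std::move(data);
  }

  bool mapped(std::uint64_t offset) const {
    return extents.count(offset_to_laddr(offset)) > 0;
  }
};
```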
21. TransactionManager
■ crimson/os/seastore/TransactionManager.h/cc
■ Implements a uniform transactional interface for allocating, reading, and mutating logically
addressed extents.
■ Mutations to extents can be expressed as a compact, type-dependent “delta”, which is included
transparently in the commit journal record.
● For example, BtreeOmapManager represents the insertion of a key into a block by encoding
the key/value pair rather than needing to rewrite the full block.
■ Components using TransactionManager can ignore extent relocation entirely.
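The delta mechanism can be sketched as follows (hypothetical `Block`/`OmapInsertDelta`/`replay` names; a `std::map` stands in for an omap leaf block): a key insertion is journaled as just the key/value pair and reapplied to the cached extent, rather than rewriting the whole block.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

using Block = std::map<std::string, std::string>; // stand-in omap leaf block

struct OmapInsertDelta {
  std::string key, value;
  // Commit (or recovery replay) applies the delta to the in-memory extent.
  void apply(Block& b) const { b[key] = value; }
};

// Replay: on recovery, journaled deltas are reapplied in order to
// reconstruct the current state of the extent.
Block replay(Block base, const std::vector<OmapInsertDelta>& journal) {
  for (const auto& d : journal) d.apply(base);
  return base;
}
```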
22. LBAManager -- BtreeLBAManager
■ crimson/os/seastore/lba_manager/btree/
■ Btree mapping logical addresses to physical offsets
■ Includes reference counts to be used with clone.
■ Will include extent checksums (TODO)
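A BtreeLBAManager-style entry can be sketched like this (illustrative `LbaManager` type with a `std::map` in place of the on-disk btree): each logical extent records its physical location, length, and a reference count, and clone shares the extent by bumping the count instead of copying.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

struct LbaEntry {
  std::uint64_t paddr;
  std::uint32_t len;
  std::uint32_t refcount; // >1 once a clone shares the extent
};

struct LbaManager {
  std::map<std::uint64_t, LbaEntry> tree; // laddr -> entry

  void alloc(std::uint64_t laddr, std::uint64_t paddr, std::uint32_t len) {
    tree[laddr] = LbaEntry{paddr, len, 1};
  }

  // Clone shares the extent rather than copying it.
  void clone(std::uint64_t laddr) { ++tree.at(laddr).refcount; }

  // Returns true when the last reference is dropped and the extent is freed.
  bool release(std::uint64_t laddr) {
    auto& e = tree.at(laddr);
    if (--e.refcount == 0) {
      tree.erase(laddr);
      return true;
    }
    return false;
  }
};
```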
24. SegmentCleaner
■ crimson/os/seastore/lba_manager/segment_cleaner
■ Tracks usage status of segments.
■ Runs a background process (within the same reactor) to choose a segment, relocate any live
extents, and release it.
● Logical extents are simply remapped within the LBAManager
● Physical extents (mainly BtreeLBAManager extents) need any references updated as well
(mainly btree parents)
■ Also responsible for throttling foreground work based on pending gc work -- goal is to avoid abrupt
gc pauses and smooth out latency.
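The segment-selection step can be illustrated with a greedy heuristic (a toy sketch, not SeaStore's actual policy): pick the segment with the fewest live bytes, since it costs the least relocation work before the segment can be released.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Segment {
  std::size_t live_bytes = 0; // usage tracked by the cleaner
};

// Greedy selection: lowest utilization first. The real SegmentCleaner
// additionally throttles foreground work against pending GC work to
// smooth out latency.
std::size_t pick_segment_to_clean(const std::vector<Segment>& segs) {
  std::size_t best = 0;
  for (std::size_t i = 1; i < segs.size(); ++i) {
    if (segs[i].live_bytes < segs[best].live_bytes) best = i;
  }
  return best;
}
```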
25. Current Status and Future Work
■ Can boot an OSD via vstart without snapshots and complete upwards of several minutes of IO before
crashing!
■ Performance/stability: stabilize, add to ceph upstream testing infrastructure
■ Add ability to remap location of logical extents
■ In place mutation support for fast nvme devices
■ Persistent memory support in Cache
■ Tiering
26. Brought to you by
Samuel Just
Senior Principal Software Engineer at Red Hat
Email: sjust@redhat.com