2. Design goals
• A standalone SSD caching library that can be re-used between librbd and RGW
• Use cases:
• Librbd read-only cache: caching block contents on SSD
• Librbd parent/clone images: caching parent RBD contents on SSD so that all cloned images can read
from the parent image cache before COW happens
• RGW immutable caching: caching rados objects on SSD
• A small CDN farm behind the RGW cluster
3. Cache daemon – general architecture
• Libcachestore: common library that does reads/writes on SSD
• Sparse-file-based cache
• Cache Daemon: controls cache promotion/demotion and sizing of the cache
• Simple LRU based
• librbd/librgw hooks: call APIs from libcachefile
[Architecture diagram: RBD_0/1/2 instances via librbd hooks (FileImageCache) and RGW_civetweb instances via librgw hooks (RGW_DataCache) all use libCacheStore and its policy on local SSD, with RADOS as the backing store via librbd/librados.]
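As a rough illustration of the libcachestore role above, here is a minimal C++ sketch of a sparse-file-backed store doing block reads/writes on an SSD-mounted path. The class and method names (SparseFileStore, open/read/write) are illustrative assumptions, not the actual library API.

    // Minimal sketch (not the actual libcachestore API): a sparse-file-backed
    // cache store that reads/writes byte ranges in a file on the SSD.
    #include <fcntl.h>
    #include <unistd.h>
    #include <cstdint>
    #include <string>

    class SparseFileStore {
      int fd = -1;
    public:
      bool open(const std::string& cache_file, uint64_t image_size) {
        fd = ::open(cache_file.c_str(), O_RDWR | O_CREAT, 0644);
        if (fd < 0) return false;
        // ftruncate creates a sparse file: blocks consume space only when written
        return ::ftruncate(fd, image_size) == 0;
      }
      ssize_t read(uint64_t off, size_t len, char* buf) {
        return ::pread(fd, buf, len, off);     // cache-hit path
      }
      ssize_t write(uint64_t off, size_t len, const char* buf) {
        return ::pwrite(fd, buf, len, off);    // promotion path
      }
      void close() { if (fd >= 0) ::close(fd); }
    };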
5. PR #16788
• A generic file-based persistent cache store
• Sparse-file-based cache
• Sync interfaces provided
• A generic read-only caching framework
• Cache promotion on reads
• Cache invalidation on writes; write requests go to RADOS directly
• A simple shared read-only cache implementation (“happy” data path)
• The shared cache will be fully promoted when the 1st child is opened
• What is missing:
• A standalone cache daemon that controls the cache state
• A configurable policy to control promotion/demotion of the shared cache
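A minimal sketch of the read/write hook behavior this slide describes, assuming illustrative stand-in types (CacheStore, RadosBackend) with in-memory maps in place of the SSD file and RADOS; it is not PR #16788's actual code.

    // Reads promote into the cache; writes invalidate and go to RADOS directly.
    #include <cstdint>
    #include <map>
    #include <string>

    struct CacheStore {                         // stand-in for the SSD cache
      std::map<uint64_t, std::string> blocks;
      bool lookup(uint64_t off, std::string* out) {
        auto it = blocks.find(off);
        if (it == blocks.end()) return false;
        *out = it->second;
        return true;
      }
      void promote(uint64_t off, const std::string& data) { blocks[off] = data; }
      void invalidate(uint64_t off) { blocks.erase(off); }
    };

    struct RadosBackend {                       // stand-in for the RADOS cluster
      std::map<uint64_t, std::string> objects;
      std::string read(uint64_t off) { return objects[off]; }
      void write(uint64_t off, const std::string& d) { objects[off] = d; }
    };

    std::string handle_read(CacheStore& c, RadosBackend& r, uint64_t off) {
      std::string data;
      if (c.lookup(off, &data)) return data;    // cache hit: serve from SSD
      data = r.read(off);                       // cache miss: fetch from RADOS
      c.promote(off, data);                     // promote on read
      return data;
    }

    void handle_write(CacheStore& c, RadosBackend& r, uint64_t off,
                      const std::string& data) {
      c.invalidate(off);                        // invalidate on write
      r.write(off, data);                       // write-through to RADOS
    }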
7. Shared read-only cache for RBD – RBD clone flow
[Diagram: parent image RBD_0 → protected snapshot RBD_0@snap1 → cloned images RBD_1, RBD_2, …, RBD_N; the protected snapshot is the shared image content.]
8. Shared read-only cache for RBD – Cache Daemon
• Read-only blocks from parent image(s) are cached in a shared area on compute node(s)
• Reads are served from the shared cache until the first COW request
• A Cache Daemon
• Runs on each compute node to control the shared cache state
• Policy thread – owns a policy to control promotion/demotion of the shared cache
• RBD instances use IPC with the daemon to take read/write locks on a shared cache block
• Upon recovery from a crash or reboot, the daemon tries to rebuild the shared cache state from persistent metadata
• The rebuild process is simple – read the persistent metafile, check the existence of the image and the corresponding cache file
• If the rebuild fails (for example, on a meta/cache-file read error), reset to an empty cache
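Below is a hedged sketch of the rebuild step just described, assuming a simple metafile format of one "image_id cache_file" pair per line (an assumption for illustration, not the actual on-disk format): load the persistent metafile, verify that each cache file still exists, and fall back to an empty cache on any error.

    #include <fstream>
    #include <map>
    #include <string>
    #include <sys/stat.h>

    static bool file_exists(const std::string& path) {
      struct stat st;
      return ::stat(path.c_str(), &st) == 0;
    }

    // Returns the rebuilt image_id -> cache_file map, or an empty map on failure.
    std::map<std::string, std::string>
    rebuild_cache_state(const std::string& metafile) {
      std::map<std::string, std::string> state;
      std::ifstream in(metafile);
      if (!in) return {};                        // no/unreadable metafile: empty cache
      std::string image_id, cache_file;
      while (in >> image_id >> cache_file) {
        if (!file_exists(cache_file)) return {}; // inconsistent entry: reset to empty
        state[image_id] = cache_file;
      }
      return state;
    }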
[Diagram: on each compute node, RBD instances talk to the Cache Daemon over IPC; the daemon's policy thread promotes/demotes blocks in the shared RO cache on SSD and persists state to a metafile; write I/O and post-COW read I/O go to the RADOS OSDs.]
13. Issues/Corner cases
• How to handle VM migration? VM crash?
• We could rebuild the cache state on RBD re-open
• RBD removed on other nodes?
• The policy thread in the cache daemon periodically checks the local cache and eventually removes stale cache entries
• Cache daemon crash?
• The shared cache state is persisted to a local metafile
• The daemon is stateless; we only need to restart the daemon process and rebuild the cache state
• Cache file inconsistency?
• We rely on the filesystem to do the check; if a read error happens, we simply re-issue the read from RADOS
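A minimal sketch of the read-error fallback described in the last bullet, with the SSD read and the RADOS read passed in as callables; these names are illustrative placeholders for the real hooks.

    #include <cstdint>
    #include <functional>
    #include <sys/types.h>

    using ReadFn = std::function<ssize_t(uint64_t off, size_t len, char* buf)>;

    ssize_t read_with_fallback(const ReadFn& cache_read, const ReadFn& rados_read,
                               uint64_t off, size_t len, char* buf) {
      ssize_t r = cache_read(off, len, buf);
      if (r == static_cast<ssize_t>(len))
        return r;                         // clean cache hit
      return rados_read(off, len, buf);   // read error/short read: re-read from RADOS
    }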
15. Shared Read-only cache for RGW
Example mapping entry (chunk_id → RGW instance id, Cache_chunk_id):
chunk_id:        7e21a6b2-89b9-4de6-869e-1ddc0198a82b.5228.1__shadow_.TzkbVV_syqJ2vumnFe8uAaiL9j6ghtC_34
RGW instance id: Rgw_1
Cache_chunk_id:  7e21a6b2-89b9-4de6-869e-1ddc0198a82b.5228.1__shadow_.TzkbVV_syqJ2vumnFe8uAaiL9j6ghtC_34
• A CDN cluster behind the RGW clusters
• L1 cache: allows reads from the SSD cache of the local RGW instance
• L2 cache (configurable): allows reads from the SSD cache of other remote RGW instances
• Each object/chunk has a unique ID
• Need a centralized/distributed K/V store to hold the mapping, as chunks may be spread across different RGW instances
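The lookup against that mapping could look roughly like the sketch below; the K/V store is modeled here with a local std::map, and all type and parameter names are illustrative assumptions, not the actual RGW code.

    #include <map>
    #include <string>

    struct CacheLocation {
      std::string rgw_instance;   // which RGW instance holds the cached chunk
      std::string cache_chunk_id; // key of the chunk inside that instance's cache
    };

    enum class Source { L1_LOCAL, L2_REMOTE, RADOS };

    Source locate_chunk(const std::map<std::string, CacheLocation>& kv,
                        const std::string& chunk_id,
                        const std::string& my_instance,
                        bool l2_enabled) {
      auto it = kv.find(chunk_id);
      if (it == kv.end()) return Source::RADOS;              // not cached anywhere
      if (it->second.rgw_instance == my_instance)
        return Source::L1_LOCAL;                              // local SSD cache
      return l2_enabled ? Source::L2_REMOTE : Source::RADOS;  // remote RGW's SSD cache
    }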
16. Shared Read-only cache for RGW
[Diagram: rgw_1 and rgw_2 each serve S3/Swift APIs through rgw_frontend → rgw_rados → rgw_cache; a datacache policy manages each instance's local immutable cache (L1) and allows reads from the other instance's cache (L2), with RADOS as the backing store via librados.]
17. Issues
• Different caching semantics for block and object?
• Promoting at block level (default 8k) for librbd
• Promoting at object level for RGW
• PR #13144 does not compile
• https://github.com/maniaabdi/engage1.git
• Jewel-based; needs to be rebased against master
• Currently the logic is inside rgw_rados; it needs to be decoupled to fit our design (libcachefile + policy)
18. RGW datacache (PR #13144)
[Diagram: same RGW architecture as slide 16; PR #13144 places the datacache and its policy inside rgw_cache/rgw_rados on each rgw instance, with local immutable caches (L1/L2) backed by RADOS via librados.]
20. Shared read-only cache for RBD – overview
• Read-only blocks from parent image(s) are cached in a shared area on compute node(s)
• Cloned images read from the shared cache until COW happens
[Diagram: each compute node has a local cache and a shared cache on an SSD backend, managed by the cache daemon; read I/O is served from the caches while write I/O goes to the RADOS OSDs.]
21. Shared read-only cache for RBD – fast cache warmup
• The state of the shared cache will be persisted to a local metadata file along with the cache file
• The state of the local cache will be persisted to RBD metadata
• On restart the cache controller will load the cache metadata file and reuse the shared cache file
• On RBD instance restart the cache state will be loaded as an in-memory map to tell which parts have been COWed
Each cloned image will have its own COW cache mapping:
- Each read hits either the shared cache or the image's own cache
- Cache mapping bits mark COWed data
- Updated when COW happens
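A small sketch of the in-memory COW map idea described above: a set of (lba, length) extents marking COWed data, so a read can quickly tell whether it must bypass the shared cache. The extent encoding and class name are assumptions for illustration only.

    #include <cstdint>
    #include <iterator>
    #include <map>

    class CowMap {
      std::map<uint64_t, uint64_t> extents;   // lba -> length of COWed extent
    public:
      void mark_cowed(uint64_t lba, uint64_t len) { extents[lba] = len; } // on COW
      // True if [lba, lba+len) overlaps any COWed extent, i.e. the shared
      // (parent) cache no longer holds the current data for this range.
      bool is_cowed(uint64_t lba, uint64_t len) const {
        auto it = extents.upper_bound(lba);
        if (it != extents.begin()) {
          auto prev = std::prev(it);
          if (prev->first + prev->second > lba) return true;
        }
        return it != extents.end() && it->first < lba + len;
      }
    };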
23. Shared read-only cache for RBD – IO flow (private cache)
[Diagram: on a compute node, cloned images RBD_1 and RBD_2 each run librbd with a FileImageCache and a private cache file holding COW data, alongside a shared cache file holding parent data on SSD; an image_store with policy and GC manages the cache files; reads do a cache lookup against the COW cache mapping (step 1) and writes (steps 2/2') go to RADOS.]
• Write path when the chunk is already in the COW cache:
- Invalidate the chunk in the cache file
- Write to RADOS
• Write path when the chunk is in the shared cache:
- Create an entry in the COW mapping table
- Write to RADOS
• COW cache mapping (example):
rbd_id   lba       length
rbd_1    8192      4096
rbd_1    1048576   4096
rbd_2    81920     4096
rbd_2    1048576   4096
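A hedged sketch of the clone write path shown on this slide, using illustrative stand-in types (CowMapping, PrivateCache, Rados) rather than librbd's actual classes: a write either invalidates an already-COWed chunk in the private cache file or records a new COW-mapping entry, and in both cases goes to RADOS directly.

    #include <cstdint>
    #include <map>
    #include <set>
    #include <string>
    #include <utility>

    struct CowMapping {                       // extents that have already been COWed
      std::set<std::pair<uint64_t, uint64_t>> extents;   // (lba, length)
      bool contains(uint64_t lba, uint64_t len) const {
        return extents.count({lba, len}) != 0;
      }
      void add(uint64_t lba, uint64_t len) { extents.insert({lba, len}); }
    };
    struct PrivateCache {                     // the clone's own cache file
      std::set<uint64_t> chunks;
      void invalidate(uint64_t lba, uint64_t) { chunks.erase(lba); }
    };
    struct Rados {                            // authoritative store
      std::map<uint64_t, std::string> data;
      void write(uint64_t lba, uint64_t, const std::string& buf) { data[lba] = buf; }
    };

    void clone_write(CowMapping& cow_map, PrivateCache& private_cache, Rados& rados,
                     uint64_t lba, uint64_t len, const std::string& buf) {
      if (cow_map.contains(lba, len)) {
        private_cache.invalidate(lba, len);   // chunk already in the COW cache: drop it
      } else {
        cow_map.add(lba, len);                // chunk still served by the shared cache:
      }                                       // record the new COW extent
      rados.write(lba, len, buf);             // either way, the write goes to RADOS
    }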
Editor's Notes
How to maintain the librbd parent/clone image table?
On promotion, a Writer lock will be held
On demotion, a Writer lock will be held
On promotion, a Reader lock will be held
Maybe we could make the local cache use a writeback policy?
Is it possible to tell the COWed parts quickly from RBD?
When to promote the shared cache file?
-> when the first cloned image is opened, the cache will be promoted locally; this could be optimized
What data should we promote? parent_image@snapshot
librbd caching will promote at the block-size level (4k default)
What is the cache file format?
-> sparse-file based
Only promote on reads
Writes go to the OSD directly and invalidate the cache on a cache hit