2. 2
WHAT IS CEPH?
The buzzwords
● “Software defined storage”
● “Unified storage system”
● “Scalable distributed storage”
● “The future of storage”
● “The Linux of storage”
The substance
● Ceph is open source software
● Runs on commodity hardware
○ Commodity servers
○ IP networks
○ HDDs, SSDs, NVMe, NV-DIMMs, ...
● A single cluster can serve object, block, and file workloads
3. 3
CEPH IS FREE AND OPEN SOURCE
● Freedom to use (free as in beer)
● Freedom to introspect, modify, and share (free as in speech)
● Freedom from vendor lock-in
● Freedom to innovate
4. 4
CEPH IS RELIABLE
● Reliable storage service built out of unreliable components
○ No single point of failure
○ Data durability via replication or erasure coding
○ No interruption of service from rolling upgrades, online expansion, etc.
● Favor consistency and correctness over performance
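As an illustration of the durability options above, an erasure-coded pool is created from a profile. A minimal sketch (the profile and pool names are made up for the example; the right k/m values depend on the cluster):

```shell
# Hypothetical profile: 4 data chunks + 2 coding chunks per object,
# placed so that no host holds more than one chunk.
# Tolerates the loss of any 2 hosts at ~1.5x space overhead
# (vs 3x for 3-way replication).
ceph osd erasure-code-profile set ec-4-2 k=4 m=2 crush-failure-domain=host
ceph osd pool create ecpool erasure ec-4-2
```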
5. 5
CEPH IS SCALABLE
● Ceph is elastic storage infrastructure
○ Storage cluster may grow or shrink
○ Add or remove hardware while the system is online and under load
● Scale up with bigger, faster hardware
● Scale out within a single cluster for capacity and performance
● Federate multiple clusters across sites with asynchronous replication and disaster recovery capabilities
6. 6
CEPH IS A UNIFIED STORAGE SYSTEM
● OBJECT: RGW, S3 and Swift object storage
● BLOCK: RBD, virtual block device
● FILE: CEPHFS, distributed network file system
● LIBRADOS: low-level storage API
● RADOS: reliable, elastic, distributed storage layer with replication and erasure coding
7. 7
RELEASE SCHEDULE
● Nautilus (14.2.z), Mar 2019
● Octopus (15.2.z), Mar 2020
● Pacific (16.2.z), Mar 2021 ← WE ARE HERE
● Quincy (17.2.z), Mar 2022
● Stable, named release every 12 months
● Backports for 2 releases
○ Nautilus reaches EOL shortly after Pacific is released
● Upgrade up to 2 releases at a time
○ Nautilus → Pacific, Octopus → Quincy
9. 9
CEPHADM
New Features
● Automated upgrade from Octopus
○ (for clusters deployed with cephadm)
● Automated login to private registries
● iSCSI and NFS are now stable
● Automated HA for RGW
○ haproxy and keepalived
● Host maintenance mode
● cephadm exporter/agent for increased performance/scalability
Robustness
● Lots of small usability improvements
● Lots of bug fixes
○ Already backported into Octopus
● Ongoing cleanup of docs.ceph.com
○ Removed ceph-deploy
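The automated upgrade above is driven from the CLI; a rough sketch (the target version is illustrative):

```shell
# Start a rolling upgrade of a cephadm-managed cluster
ceph orch upgrade start --ceph-version 16.2.0
# Check on it; daemons are restarted in a safe order while the
# cluster stays online
ceph orch upgrade status
```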
10. 10
DASHBOARD
● Robust and responsive management GUI for cluster operations
○ All core Ceph services (object, block, file) and extensions (iSCSI, NFS Ganesha)
○ Monitoring, metrics, management
● Full OSD management
○ Bulk creation with DriveGroups (filter by host, device properties: size/type/model)
○ Disk replacement and SMART diagnostics
● Multisite capabilities
○ RBD mirroring
○ RGW multisite sync monitoring
● Orchestrator/cephadm integration
● Official Management REST API for Ceph
○ Stable, versioned, and fully documented
● Production-ready security
○ RBAC, account policies (including account lock-out), secure cookies, sanitized logs, …
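A DriveGroups OSD spec of the kind the dashboard builds might look roughly like this (the service id and filters are examples; the exact YAML layout should be checked against the cephadm docs for your release):

```shell
cat > osd_spec.yml <<'EOF'
service_type: osd
service_id: example_drive_group
placement:
  host_pattern: '*'        # apply on every host
spec:
  data_devices:
    rotational: 1          # HDDs hold the data
  db_devices:
    rotational: 0          # SSDs hold the RocksDB/WAL
EOF
ceph orch apply -i osd_spec.yml
```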
11. 11
RADOS USABILITY
● Improved hands-off defaults
○ Upmap balancer on by default
○ PG autoscaler has improved out-of-the-box experience
● Automatically detect and report daemon version mismatches
○ Associated health alert
○ Can be muted during upgrades and on demand
● Ability to cancel ongoing scrubs
● ceph -s simplified
○ Recovery progress shown as one progress bar; use ‘ceph progress’ to see more
● Framework for distributed tracing in the OSD (work in progress)
○ OpenTracing tracepoints in the OSD I/O path
○ Can be collected and viewed via Jaeger's web UI
○ To help with end-to-end performance analysis
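For example (assuming the health code below, which is what Pacific uses for the version-mismatch alert):

```shell
ceph -s          # condensed status with a single recovery progress bar
ceph progress    # full list of ongoing progress events
# Silence the version-mismatch alert for an hour during an upgrade
ceph health mute DAEMON_OLD_VERSION 1h
```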
12. 12
CEPHFS USABILITY
● MultiFS is marked stable!
○ Automated file system creation: use ‘ceph fs volume create NAME’
○ MDS automatically deployed with cephadm
● MDS autoscaler (start/stop MDS based on file system max_mds, standby count)
● cephfs-top (preview)
○ See client sessions and performance of the file system
● Continued improvements to cephfs-shell
● Scheduled snapshots via new snap_schedule mgr module
● First-class NFS gateway support
○ active/active configurations
○ automatically deployed via the Ceph orchestrator (Rook and cephadm)
● MDS-side encrypted file support (kernel-side development ongoing)
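Putting the volume and snapshot-schedule features together, a minimal sketch (the file system name and schedule are illustrative):

```shell
ceph fs volume create newfs      # creates pools and deploys MDS automatically
ceph mgr module enable snap_schedule
ceph fs snap-schedule add / 1h   # snapshot the root every hour
ceph fs snap-schedule list /
```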
13. 13
OTHER FEATURES AND USABILITY
RBD
● “Instant” clone/recover from external (file/HTTP/S3) data source
● Built-in support for LUKS1/LUKS2 encryption
● Native Windows driver
○ Signed, prebuilt driver available soon
● Restartable rbd-nbd daemon support
RGW
● S3 Select MVP (CSV-only)
● Lua scripting in the RGW request path
● D3N (*)
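The LUKS support is exposed through the rbd CLI; a hedged sketch (pool/image names and map options are illustrative and worth checking against the RBD encryption docs):

```shell
rbd create --size 10G rbdpool/secure-img
# Format the image with LUKS2; the passphrase file is supplied by the admin
rbd encryption format rbdpool/secure-img luks2 passphrase.bin
# rbd-nbd can open the encrypted image directly:
rbd device map -t nbd \
  -o encryption-format=luks2,encryption-passphrase-file=passphrase.bin \
  rbdpool/secure-img
```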
15. 15
RADOS ROBUSTNESS
● Improved PG deletion performance
● More controlled osdmap trimming in the monitor
● Msgr2.1
○ New wire format for msgr2 (both crc and secure modes)
● More efficient manager modules
○ Ability to turn off progress module
○ Efficient use of large C++ structures in the codebase
● Monitor/display SSD wear levels
○ ‘ceph device ls’ output
16. 16
CEPHFS ROBUSTNESS
● Feature bit support for turning on/off required file system features
○ Clients not supporting required features will be rejected
● Multiple-MDS file system scrub (online integrity check)
● Kernel client (and mount.ceph) support for msgr2[.1]
○ Kernel mount option -o ms_mode=crc|secure|prefer-crc|prefer-secure
● Support for recovering mounts from blocklisting
○ Kernel reconnects with -o recover_session=clean
○ ceph-fuse reconnects with --client_reconnect_stale=1; page cache should be disabled
● Improved test coverage (test matrix doubled, 2500 → 5000)
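A kernel mount using the new options might look like this (monitor addresses come from ceph.conf; the credential name is illustrative):

```shell
# Prefer the secure (encrypted) msgr2.1 mode, falling back to crc,
# and recover cleanly if the session is blocklisted
mount -t ceph :/ /mnt/cephfs \
  -o name=admin,ms_mode=prefer-secure,recover_session=clean
```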
17. 17
TELEMETRY AND CRASH REPORTS
● Public dashboards!
○ https://telemetry-public.ceph.com/
○ Clusters, devices
● Opt-in
○ Will require re-opt-in if telemetry content is expanded in the future
○ Explicitly acknowledge data sharing license
● Telemetry channels
○ basic - cluster size, version, etc.
○ crash - anonymized crash metadata
○ device - device health (SMART) data
○ ident - contact info (off by default!)
● Initial focus on crash reports
○ Integration with bug tracker
○ Daily reports on top crashes in the wild
○ Fancy (internal) dashboard
● Extensive device dashboard
○ See which HDD and SSD models Ceph users are deploying
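Opting in looks roughly like this (the license flag is how Pacific makes the data-sharing acknowledgement explicit):

```shell
ceph telemetry show                      # preview exactly what would be sent
ceph telemetry on --license sharing-1-0  # opt in, acknowledging the license
```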
19. 19
RADOS: BLUESTORE
● RocksDB sharding
○ Reduced disk space requirements
● Hybrid allocator
○ Lower memory use and disk fragmentation
● Better space utilization for small objects
○ 4K min_alloc_size for SSDs and HDDs
● More efficient caching
○ Better use of available memory
● Finer-grained memory tracking
○ Improved accounting of current usage
20. 20
RADOS: QoS
● Phase 1: QoS between recovery and client I/O using mclock scheduler
○ Different profiles to prioritize client I/O, recovery and background tasks
■ Config sets to hide complexity of tuning dmclock and recovery parameters
○ Better default values for Ceph parameters to get improved performance out of the system, based on extensive testing on SSDs
○ Pacific!
● Phase 2: Quincy
○ Optimize performance for HDDs
○ Account for background activities like scrubbing, PG deletion, etc.
○ Further testing across different types of workloads
● Phase 3: client vs client QoS
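Selecting the mclock scheduler and a profile is a config change; a sketch (profile names per the Pacific documentation):

```shell
ceph config set osd osd_op_queue mclock_scheduler
# Profiles trade off client I/O vs recovery:
# high_client_ops, balanced, or high_recovery_ops
ceph config set osd osd_mclock_profile high_client_ops
```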
21. 21
CRIMSON PROJECT
● High-performance rewrite of the OSD
● Recovery/backfill implemented
● Scrub state machine added to lay the groundwork for a scrub implementation in the Crimson OSD
● Initial prototype of SeaStore in place
○ Targets both ZNS (zone-based) and traditional SSDs
○ Onode tree implementation
○ Omap
○ LBA mappings
● Ability to run simple RBD workloads today
● Compatibility layer to run legacy BlueStore code
22. 22
CEPHFS PERFORMANCE
● Ephemeral pinning (policy-based subtree pinning)
○ Distributed pins automatically shard sub-directories (think: /home)
○ Random pins shard descendant directories probabilistically
● Improved capability/cache management by MDS for large clusters
○ Cap recall defaults improved based on larger production clusters (CERN)
○ Capability acquisition throttling for some client workloads
● Asynchronous unlink/create (partial MDS support since Octopus)
○ Miscellaneous fixes and added testing
○ Kernel v5.7 and downstream in RHEL 8 / CentOS 8 (Stream)
○ libcephfs/ceph-fuse support in progress
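Ephemeral pins are set as extended attributes on directories; a sketch (the mount point and values are illustrative):

```shell
# Automatically shard each user's home directory across the active MDS ranks
setfattr -n ceph.dir.pin.distributed -v 1 /mnt/cephfs/home
# Probabilistically shard ~1% of descendant directories
setfattr -n ceph.dir.pin.random -v 0.01 /mnt/cephfs/tmp
```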
23. 23
MISC PERFORMANCE
RGW
● Avoid omap where unnecessary
○ FIFO queues for garbage collection
○ FIFO queues for data sync log
● Negative caching for bucket metadata
○ Significant reduction in request latency for many workloads
● Sync process performance improvements
○ Better state tracking and performance when bucket sharding is enabled
● nginx authenticated HTTP front-cache
○ Dramatically accelerates read-mostly workloads
RBD
● librbd migration to boost::asio reactor
○ Event driven; uses neorados
○ May eventually allow tighter integration with SPDK
25. 25
CEPHFS: SNAPSHOT-BASED MIRRORING
● Replication targets (remote clusters) configured on any directory
● New cephfs-mirror daemon to replicate data
○ Managed by Rook or cephadm
● Snapshot-based
○ When a snapshot is created on the source cluster, it is replicated to the remote cluster
● Initial implementation in Pacific
○ Single daemon
○ Inefficient incremental updates
○ Improvements will be backported
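Configuring mirroring on the source cluster looks roughly like this (the file system, peer, and path names are invented for the example):

```shell
ceph mgr module enable mirroring
ceph fs snapshot mirror enable cephfs
# Register the remote cluster as a peer, then pick directories to mirror
ceph fs snapshot mirror peer_add cephfs client.mirror_remote@site-b cephfs
ceph fs snapshot mirror add cephfs /projects
```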
26. 26
RGW: PER-BUCKET REPLICATION
● Current multisite support
○ Federation of multiple sites
○ Global bucket/user namespace
○ Async data replication at site/zone granularity
○ Bucket-granularity replication
● Pacific adds:
○ Testing and QA to move bucket-granularity replication out of experimental status
○ Foundation to support bucket resharding in multisite environments
● Large-scale refactoring
○ Extensive multisite roadmap… lots of goodness should land in Quincy
28. 28
ROOK
● Stretch clusters
○ Configure storage in two datacenters, with a mon in a third location with higher latency
○ 5 mons
○ Pools with replication 4 (2 replicas in each datacenter)
● CephFS mirroring
○ Manage CephFS mirroring via CRDs
○ New, simpler snapshot-based mirroring
29. 29
CSI / OPENSTACK MANILA
● RWX/ROX -- CephFS via mgr/volumes
○ New PV snapshot stabilization
■ Limits in place for snapshots on subvolumes.
○ New authorization API support (Manila)
○ New ephemeral pinning for volumes
● RWO/RWX/ROX -- RBD
○ dm-crypt encryption with Vault key management
○ PV snapshots and clones
○ Topology-aware provisioning
○ Integration with snapshot-based mirroring in-progress
30. 30
MISC
● Removing instances of racially charged terms
○ blacklist/whitelist
○ master/slave
○ Some librados APIs/CLI affected: deprecated old calls, will remove in future
● Ongoing documentation improvements
● https://ceph.io website redesign
○ Static site generator, built from git (no more wordpress)
○ Should launch this spring
31. 31
ARM: AARCH64
● (Consistent) CI and release builds
○ Thank you to Ampere for donated build hardware
● Container builds
● Limited QA/regression testing coverage
33. 33
VIRTUAL EVENTS THIS SPRING
● Ceph Developer Summit - March or April
○ Quincy planning
○ Traditional format (scheduled topical sessions, video + chat, recorded)
● Ceph Month - April or May
○ Topic per week (Core, RGW, RBD, CephFS)
○ 2-3 scheduled talks spread over the week
■ Including developers presenting what is new, what’s coming
○ Each talk followed by semi-/un-structured discussion meetup
■ Etherpad agenda
■ Open discussion
■ Opportunity for new/existing users/operators to compare notes
○ Lightning talks
○ Fully virtual (video + chat, recorded)
38. 38
ORCHESTRATION
● cephadm improvements
○ CIFS/SMB support
○ Resource-aware service placement (memory, CPU)
○ Moving services off failed hosts
○ Improved scalability/responsiveness for large clusters
● Rook improvements
○ Better integration with orchestrator API
○ Parity with cephadm
39. 39
MISC USABILITY AND FEATURES
RGW
● Deduplicated storage
RBD
● Expose snapshots via RGW (object)
● Expose RBD via NVMe-oF target gateway
● Improved rbd-nbd support
○ Expose kernel block device with full librbd feature set
○ Improved integration with ceph-csi for Kubernetes environments
CephFS
● ‘fs top’
● NFS and SMB support via orchestrator
42. 42
STABILITY AND ROBUSTNESS
RADOS
● Enable ‘upmap’ balancer by default
○ More precise than ‘crush-compat’ mode
○ Hands-off by default
○ Improve balancing of ‘primary’ role
● Dynamically adjust recovery priority based on load
● Automatic periodic security key rotation
● Distributed tracing framework
○ For end-to-end performance analysis
CephFS
● MultiMDS metadata balancing improvements
● Minor version upgrade improvements
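On an existing cluster, the upmap balancer described above can also be enabled by hand; a sketch:

```shell
# upmap requires that all clients speak Luminous or later
ceph osd set-require-min-compat-client luminous
ceph balancer mode upmap
ceph balancer on
ceph balancer status
```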
43. 43
TELEMETRY
● Work continues on backend analysis of telemetry data
○ Tools for developers to use crash reports to identify and prioritize bug fixes
● Adjustments in collected data
○ Adjust what data is collected for Pacific
○ Periodic backports to Octopus (require re-opt-in)
○ e.g., which orchestrator module is in use (if any)
● Drive failure prediction
○ Building improved models for predicting drive failures
○ Expanding data set via Ceph collector, standalone collector, and other data sources
45. 45
MISC PERFORMANCE
CephFS
● Async metadata operations
○ Support in libcephfs
○ Async rmdir/mkdir
● ceph-fuse performance
○ Take advantage of recent libfuse changes
RGW
● Data sync optimizations, sync fairness
● Sync metadata improvements
○ omap → cls_fifo
○ Bucket index, metadata+data logs
● Ongoing async refactoring of RGW
○ Based on boost::asio
46. 46
RADOS: BLUESTORE
● Sharded RocksDB
○ Improve compaction performance
○ Reduce disk space requirements
● In-memory cache improvements
● SMR
○ Support for host-managed SMR HDDs
○ Targeting cold-storage workloads (e.g., RGW) only
47. 47
PROJECT CRIMSON
What
● Rewrite IO path using Seastar
○ Preallocate cores
○ One thread per core
○ Explicitly shard all data structures
and work over cores
○ No locks and no blocking
○ Message passing between cores
○ Polling for IO
● DPDK, SPDK
○ Kernel bypass for network and
storage IO
● Goal: Working prototype for Pacific
Why
● Not just about how many IOPS we do…
● More about IOPS per CPU core
● Current Ceph is based on traditional
multi-threaded programming model
● Context switching is too expensive when
storage is almost as fast as memory
● New hardware devices coming
○ DIMM form-factor persistent memory
○ ZNS - zone-based SSDs
49. 49
CEPHFS MULTI-SITE REPLICATION
● Automate periodic snapshot + sync to remote cluster
○ Arbitrary source tree, destination in remote cluster
○ Sync snapshots via rsync
○ May support non-CephFS targets
● Discussing more sophisticated models
○ Bidirectional, loosely/eventually consistent sync
○ Simple conflict resolution behavior?
50. 50
MOTIVATION, OBJECT
● Nodes scale up (faster, bigger)
● Clusters scale out
○ Bigger clusters within a site
● Organizations scale globally
○ Multiple sites, data centers
○ Multiple public and private clouds
○ Multiple units within an organization
● Universal, global connectivity
○ Access your data from anywhere
● API consistency
○ Write apps to a single object API (e.g., S3)
regardless of which site, cloud it is
deployed on
● Disaster recovery
○ Replicate object data across sites
○ Synchronously or asynchronously
○ Failover application and reattach
○ Active/passive and active/active
● Migration
○ Migrate data set between sites, tiers
○ While it is being used
● Edge scenarios (caching and buffering)
○ Cache remote bucket locally
○ Buffer new data locally
51. 51
RGW MULTISITE FOR QUINCY
● Project Zipper
○ Internal abstractions to allow alternate storage backends (e.g., store data in an external object store)
○ Policy layer based on Lua
○ Initial targets: database and file-based stores, tiering to cloud (e.g., S3)
● Dynamic resharding vs multisite support
● Sync from external sources
○ AWS
● Lifecycle transition to cloud
54. 54
OTHER ECOSYSTEM EFFORTS
Windows
● Windows port for RBD is underway
● Lightweight kernel pass-through to librbd
● CephFS to follow (based on Dokan)
Performance testing hardware
● Intel test cluster: officinalis
● AMD / Samsung / Mellanox cluster
● High-end ARM-based system?
ARM (aarch64)
● Loads of new build and test hardware arriving in the lab
● CI and release builds for aarch64
IBM Z
● Collaboration with the IBM Z team
● Build and test