2. 2
WHAT IS CEPH?
The buzzwords
● “Software defined storage”
● “Unified storage system”
● “Scalable distributed storage”
● “The future of storage”
● “The Linux of storage”
The substance
● Ceph is open source software
● Runs on commodity hardware
○ Commodity servers
○ IP networks
○ HDDs, SSDs, NVMe, NV-DIMMs, ...
● A single cluster can serve object, block, and file workloads
3. 3
CEPH IS FREE AND OPEN SOURCE
● Freedom to use (free as in beer)
● Freedom to introspect, modify, and share (free as in speech)
● Freedom from vendor lock-in
● Freedom to innovate
4. 4
CEPH IS RELIABLE
● Reliable storage service built out of unreliable components
○ No single point of failure
○ Data durability via replication or erasure coding
○ No interruption of service from rolling upgrades, online expansion, etc.
● Favor consistency and correctness over performance
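As an illustration of the durability options above, an erasure-coded pool is created from a profile. A minimal sketch (the profile and pool names are made up for the example; the right k/m values depend on the cluster):

```shell
# Hypothetical profile: 4 data chunks + 2 coding chunks per object,
# placed so that no host holds more than one chunk.
# Tolerates the loss of any 2 hosts at ~1.5x space overhead
# (vs 3x for 3-way replication).
ceph osd erasure-code-profile set ec-4-2 k=4 m=2 crush-failure-domain=host
ceph osd pool create ecpool erasure ec-4-2
```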
5. 5
CEPH IS SCALABLE
● Ceph is elastic storage infrastructure
○ Storage cluster may grow or shrink
○ Add or remove hardware while the system is online and under load
● Scale up with bigger, faster hardware
● Scale out within a single cluster for capacity and performance
● Federate multiple clusters across sites with asynchronous replication and disaster recovery capabilities
6. 6
CEPH IS A UNIFIED STORAGE SYSTEM
● OBJECT: RGW, S3 and Swift object storage
● BLOCK: RBD, virtual block device
● FILE: CEPHFS, distributed network file system
● LIBRADOS: low-level storage API
● RADOS: reliable, elastic, distributed storage layer with replication and erasure coding
7. 7
RELEASE SCHEDULE
● Nautilus (14.2.z), Mar 2019
● Octopus (15.2.z), Mar 2020
● Pacific (16.2.z), Mar 2021 ← WE ARE HERE
● Quincy (17.2.z), Mar 2022
● Stable, named release every 12 months
● Backports for 2 releases
○ Nautilus reaches EOL shortly after Pacific is released
● Upgrade up to 2 releases at a time
○ Nautilus → Pacific, Octopus → Quincy
9. 9
CEPHADM
New Features
● Automated upgrade from Octopus
○ (for clusters deployed with cephadm)
● Automated login to private registries
● iSCSI and NFS are now stable
● Automated HA for RGW
○ haproxy and keepalived
● Host maintenance mode
● cephadm exporter/agent for increased performance/scalability
Robustness
● Lots of small usability improvements
● Lots of bug fixes
○ Already backported into Octopus
● Ongoing cleanup of docs.ceph.com
○ Removed ceph-deploy
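The automated upgrade above is driven from the CLI; a rough sketch (the target version is illustrative):

```shell
# Start a rolling upgrade of a cephadm-managed cluster
ceph orch upgrade start --ceph-version 16.2.0
# Check on it; daemons are restarted in a safe order while the
# cluster stays online
ceph orch upgrade status
```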
10. 10
DASHBOARD
● Robust and responsive management GUI for cluster operations
○ All core Ceph services (object, block, file) and extensions (iSCSI, NFS Ganesha)
○ Monitoring, metrics, management
● Full OSD management
○ Bulk creation with DriveGroups (filter by host, device properties: size/type/model)
○ Disk replacement and SMART diagnostics
● Multisite capabilities
○ RBD mirroring
○ RGW multisite sync monitoring
● Orchestrator/cephadm integration
● Official Management REST API for Ceph
○ Stable, versioned, and fully documented
● Production-ready security
○ RBAC, account policies (including account lock-out), secure cookies, sanitized logs, …
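A DriveGroups OSD spec of the kind the dashboard builds might look roughly like this (the service id and filters are examples; the exact YAML layout should be checked against the cephadm docs for your release):

```shell
cat > osd_spec.yml <<'EOF'
service_type: osd
service_id: example_drive_group
placement:
  host_pattern: '*'        # apply on every host
spec:
  data_devices:
    rotational: 1          # HDDs hold the data
  db_devices:
    rotational: 0          # SSDs hold the RocksDB/WAL
EOF
ceph orch apply -i osd_spec.yml
```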
11. 11
RADOS USABILITY
● Improved hands-off defaults
○ Upmap balancer on by default
○ PG autoscaler has improved out-of-the-box experience
● Automatically detect and report daemon version mismatches
○ Associated health alert
○ Can be muted during upgrades and on demand
● Ability to cancel ongoing scrubs
● ceph -s simplified
○ Recovery progress shown as one progress bar; use ‘ceph progress’ to see more
● Framework for distributed tracing in the OSD (work in progress)
○ OpenTracing tracepoints in the OSD I/O path
○ Can be collected and viewed via Jaeger's web UI
○ To help with end-to-end performance analysis
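For example (assuming the health code below, which is what Pacific uses for the version-mismatch alert):

```shell
ceph -s          # condensed status with a single recovery progress bar
ceph progress    # full list of ongoing progress events
# Silence the version-mismatch alert for an hour during an upgrade
ceph health mute DAEMON_OLD_VERSION 1h
```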
12. 12
CEPHFS USABILITY
● MultiFS is marked stable!
○ Automated file system creation: use ‘ceph fs volume create NAME’
○ MDS automatically deployed with cephadm
● MDS autoscaler (start/stop MDS based on file system max_mds, standby count)
● cephfs-top (preview)
○ See client sessions and performance of the file system
● Continued improvements to cephfs-shell
● Scheduled snapshots via new snap_schedule mgr module
● First-class NFS gateway support
○ active/active configurations
○ automatically deployed via the Ceph orchestrator (Rook and cephadm)
● MDS-side encrypted file support (kernel-side development ongoing)
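Putting the volume and snapshot-schedule features together, a minimal sketch (the file system name and schedule are illustrative):

```shell
ceph fs volume create newfs      # creates pools and deploys MDS automatically
ceph mgr module enable snap_schedule
ceph fs snap-schedule add / 1h   # snapshot the root every hour
ceph fs snap-schedule list /
```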
13. 13
OTHER FEATURES AND USABILITY
RBD
● “Instant” clone/recover from external (file/HTTP/S3) data source
● Built-in support for LUKS1/LUKS2 encryption
● Native Windows driver
○ Signed, prebuilt driver available soon
● Restartable rbd-nbd daemon support
RGW
● S3 Select MVP (CSV-only)
● Lua scripting in the RGW request path
● D3N (*)
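The LUKS support is exposed through the rbd CLI; a hedged sketch (pool/image names and map options are illustrative and worth checking against the RBD encryption docs):

```shell
rbd create --size 10G rbdpool/secure-img
# Format the image with LUKS2; the passphrase file is supplied by the admin
rbd encryption format rbdpool/secure-img luks2 passphrase.bin
# rbd-nbd can open the encrypted image directly:
rbd device map -t nbd \
  -o encryption-format=luks2,encryption-passphrase-file=passphrase.bin \
  rbdpool/secure-img
```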
15. 15
RADOS ROBUSTNESS
● Improved PG deletion performance
● More controlled osdmap trimming in the monitor
● Msgr2.1
○ New wire format for msgr2 (both crc and secure modes)
● More efficient manager modules
○ Ability to turn off progress module
○ Efficient use of large C++ structures in the codebase
● Monitor/display SSD wear levels
○ ‘ceph device ls’ output
16. 16
CEPHFS ROBUSTNESS
● Feature bit support for turning on/off required file system features
○ Clients not supporting required features will be rejected
● Multiple-MDS file system scrub (online integrity check)
● Kernel client (and mount.ceph) support for msgr2[.1]
○ Kernel mount option -o ms_mode=crc|secure|prefer-crc|prefer-secure
● Support for recovering mounts from blocklisting
○ Kernel reconnects with -o recover_session=clean
○ ceph-fuse reconnects with --client_reconnect_stale=1; page cache should be disabled
● Improved test coverage (test matrix doubled, 2500 → 5000)
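A kernel mount using the new options might look like this (monitor addresses come from ceph.conf; the credential name is illustrative):

```shell
# Prefer the secure (encrypted) msgr2.1 mode, falling back to crc,
# and recover cleanly if the session is blocklisted
mount -t ceph :/ /mnt/cephfs \
  -o name=admin,ms_mode=prefer-secure,recover_session=clean
```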
17. 17
TELEMETRY AND CRASH REPORTS
● Public dashboards!
○ https://telemetry-public.ceph.com/
○ Clusters, devices
● Opt-in
○ Will require re-opt-in if telemetry content is expanded in the future
○ Explicitly acknowledge data sharing license
● Telemetry channels
○ basic - cluster size, version, etc.
○ crash - anonymized crash metadata
○ device - device health (SMART) data
○ ident - contact info (off by default!)
● Initial focus on crash reports
○ Integration with bug tracker
○ Daily reports on top crashes in the wild
○ Fancy (internal) dashboard
● Extensive device dashboard
○ See which HDD and SSD models Ceph users are deploying
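Opting in looks roughly like this (the license flag is how Pacific makes the data-sharing acknowledgement explicit):

```shell
ceph telemetry show                      # preview exactly what would be sent
ceph telemetry on --license sharing-1-0  # opt in, acknowledging the license
```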
19. 19
RADOS: BLUESTORE
● RocksDB sharding
○ Reduced disk space requirements
● Hybrid allocator
○ Lower memory use and disk fragmentation
● Better space utilization for small objects
○ 4K min_alloc_size for SSDs and HDDs
● More efficient caching
○ Better use of available memory
● Finer-grained memory tracking
○ Improved accounting of current usage
20. 20
RADOS: QoS
● Phase 1: QoS between recovery and client I/O using mclock scheduler
○ Different profiles to prioritize client I/O, recovery and background tasks
■ Config sets to hide complexity of tuning dmclock and recovery parameters
○ Better default values for Ceph parameters to get improved performance out of the system, based on extensive testing on SSDs
○ Pacific!
● Phase 2: Quincy
○ Optimize performance for HDDs
○ Account for background activities like scrubbing, PG deletion, etc.
○ Further testing across different types of workloads
● Phase 3: client vs client QoS
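Selecting the mclock scheduler and a profile is a config change; a sketch (profile names per the Pacific documentation):

```shell
ceph config set osd osd_op_queue mclock_scheduler
# Profiles trade off client I/O vs recovery:
# high_client_ops, balanced, or high_recovery_ops
ceph config set osd osd_mclock_profile high_client_ops
```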
21. 21
CRIMSON PROJECT
● High-performance rewrite of the OSD
● Recovery/backfill implemented
● Scrub state machine added to lay the groundwork for a scrub implementation in the Crimson OSD
● Initial prototype of SeaStore in place
○ Targets both ZNS (zone-based) and traditional SSDs
○ Onode tree implementation
○ Omap
○ LBA mappings
● Ability to run simple RBD workloads today
● Compatibility layer to run legacy BlueStore code
22. 22
CEPHFS PERFORMANCE
● Ephemeral pinning (policy-based subtree pinning)
○ Distributed pins automatically shard sub-directories (think: /home)
○ Random pins shard descendant directories probabilistically
● Improved capability/cache management by MDS for large clusters
○ Cap recall defaults improved based on larger production clusters (CERN)
○ Capability acquisition throttling for some client workloads
● Asynchronous unlink/create (partial MDS support since Octopus)
○ Miscellaneous fixes and added testing
○ Kernel v5.7 and downstream in RHEL 8 / CentOS 8 (Stream)
○ libcephfs/ceph-fuse support in progress
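Ephemeral pins are set as extended attributes on directories; a sketch (the mount point and values are illustrative):

```shell
# Automatically shard each user's home directory across the active MDS ranks
setfattr -n ceph.dir.pin.distributed -v 1 /mnt/cephfs/home
# Probabilistically shard ~1% of descendant directories
setfattr -n ceph.dir.pin.random -v 0.01 /mnt/cephfs/tmp
```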
23. 23
MISC PERFORMANCE
RGW
● Avoid omap where unnecessary
○ FIFO queues for garbage collection
○ FIFO queues for data sync log
● Negative caching for bucket metadata
○ Significant reduction in request latency for many workloads
● Sync process performance improvements
○ Better state tracking and performance when bucket sharding is enabled
● nginx authenticated HTTP front-cache
○ Dramatically accelerates read-mostly workloads
RBD
● librbd migration to boost::asio reactor
○ Event driven; uses neorados
○ May eventually allow tighter integration with SPDK
25. 25
CEPHFS: SNAPSHOT-BASED MIRRORING
● Replication targets (remote clusters) configured on any directory
● New cephfs-mirror daemon to replicate data
○ Managed by Rook or cephadm
● Snapshot-based
○ When a snapshot is created on the source cluster, it is replicated to the remote cluster
● Initial implementation in Pacific
○ Single daemon
○ Inefficient incremental updates
○ Improvements will be backported
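Configuring mirroring on the source cluster looks roughly like this (the file system, peer, and path names are invented for the example):

```shell
ceph mgr module enable mirroring
ceph fs snapshot mirror enable cephfs
# Register the remote cluster as a peer, then pick directories to mirror
ceph fs snapshot mirror peer_add cephfs client.mirror_remote@site-b cephfs
ceph fs snapshot mirror add cephfs /projects
```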
26. 26
RGW: PER-BUCKET REPLICATION
● Current multisite support
○ Federation of multiple sites
○ Global bucket/user namespace
○ Async data replication at site/zone granularity
○ Bucket-granularity replication
● Pacific adds:
○ Testing and QA to move bucket-granularity replication out of experimental status
○ Foundation to support bucket resharding in multisite environments
● Large-scale refactoring
○ Extensive multisite roadmap… lots of goodness should land in Quincy
28. 28
ROOK
● Stretch clusters
○ Configure storage in two datacenters, with a mon in a third location with higher latency
○ 5 mons
○ Pools with replication 4 (2 replicas in each datacenter)
● CephFS mirroring
○ Manage CephFS mirroring via CRDs
○ New, simpler snapshot-based mirroring
29. 29
CSI / OPENSTACK MANILA
● RWX/ROX -- CephFS via mgr/volumes
○ New PV snapshot stabilization
■ Limits in place for snapshots on subvolumes.
○ New authorization API support (Manila)
○ New ephemeral pinning for volumes
● RWO/RWX/ROX -- RBD
○ dm-crypt encryption with Vault key management
○ PV snapshots and clones
○ Topology-aware provisioning
○ Integration with snapshot-based mirroring in-progress
30. 30
MISC
● Removing instances of racially charged terms
○ blacklist/whitelist
○ master/slave
○ Some librados APIs/CLI affected: deprecated old calls, will remove in future
● Ongoing documentation improvements
● https://ceph.io website redesign
○ Static site generator, built from git (no more wordpress)
○ Should launch this spring
31. 31
ARM: AARCH64
● (Consistent) CI and release builds
○ Thank you to Ampere for donated build hardware
● Container builds
● Limited QA/regression testing coverage
33. 33
VIRTUAL EVENTS THIS SPRING
● Ceph Developer Summit - March or April
○ Quincy planning
○ Traditional format (scheduled topical sessions, video + chat, recorded)
● Ceph Month - April or May
○ Topic per week (Core, RGW, RBD, CephFS)
○ 2-3 scheduled talks spread over the week
■ Including developers presenting what is new, what’s coming
○ Each talk followed by semi-/un-structured discussion meetup
■ Etherpad agenda
■ Open discussion
■ Opportunity for new/existing users/operators to compare notes
○ Lightning talks
○ Fully virtual (video + chat, recorded)
38. 38
ORCHESTRATION
● cephadm improvements
○ CIFS/SMB support
○ Resource-aware service placement (memory, CPU)
○ Moving services off failed hosts
○ Improved scalability/responsiveness for large clusters
● Rook improvements
○ Better integration with orchestrator API
○ Parity with cephadm
39. 39
MISC USABILITY AND FEATURES
RGW
● Deduplicated storage
RBD
● Expose snapshots via RGW (object)
● Expose RBD via NVMe-oF target gateway
● Improved rbd-nbd support
○ Expose kernel block device with full librbd feature set
○ Improved integration with ceph-csi for Kubernetes environments
CephFS
● ‘fs top’
● NFS and SMB support via orchestrator
42. 42
STABILITY AND ROBUSTNESS
RADOS
● Enable ‘upmap’ balancer by default
○ More precise than ‘crush-compat’ mode
○ Hands-off by default
○ Improve balancing of ‘primary’ role
● Dynamically adjust recovery priority based on load
● Automatic periodic security key rotation
● Distributed tracing framework
○ For end-to-end performance analysis
CephFS
● MultiMDS metadata balancing improvements
● Minor version upgrade improvements
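On an existing cluster, the upmap balancer described above can also be enabled by hand; a sketch:

```shell
# upmap requires that all clients speak Luminous or later
ceph osd set-require-min-compat-client luminous
ceph balancer mode upmap
ceph balancer on
ceph balancer status
```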
43. 43
TELEMETRY
● Work continues on backend analysis of telemetry data
○ Tools for developers to use crash reports to identify and prioritize bug fixes
● Adjustments in collected data
○ Adjust what data is collected for Pacific
○ Periodic backports to Octopus (require re-opt-in)
○ e.g., which orchestrator module is in use (if any)
● Drive failure prediction
○ Building improved models for predicting drive failures
○ Expanding data set via Ceph collector, standalone collector, and other data sources
45. 45
MISC PERFORMANCE
CephFS
● Async metadata operations
○ Support in libcephfs
○ Async rmdir/mkdir
● ceph-fuse performance
○ Take advantage of recent libfuse changes
RGW
● Data sync optimizations, sync fairness
● Sync metadata improvements
○ omap → cls_fifo
○ Bucket index, metadata+data logs
● Ongoing async refactoring of RGW
○ Based on boost::asio
46. 46
RADOS: BLUESTORE
● Sharded RocksDB
○ Improve compaction performance
○ Reduce disk space requirements
● In-memory cache improvements
● SMR
○ Support for host-managed SMR HDDs
○ Targeting cold-storage workloads (e.g., RGW) only
47. 47
PROJECT CRIMSON
What
● Rewrite IO path using Seastar
○ Preallocate cores
○ One thread per core
○ Explicitly shard all data structures
and work over cores
○ No locks and no blocking
○ Message passing between cores
○ Polling for IO
● DPDK, SPDK
○ Kernel bypass for network and
storage IO
● Goal: Working prototype for Pacific
Why
● Not just about how many IOPS we do…
● More about IOPS per CPU core
● Current Ceph is based on traditional
multi-threaded programming model
● Context switching is too expensive when
storage is almost as fast as memory
● New hardware devices coming
○ DIMM form-factor persistent memory
○ ZNS - zone-based SSDs
49. 49
CEPHFS MULTI-SITE REPLICATION
● Automate periodic snapshot + sync to remote cluster
○ Arbitrary source tree, destination in remote cluster
○ Sync snapshots via rsync
○ May support non-CephFS targets
● Discussing more sophisticated models
○ Bidirectional, loosely/eventually consistent sync
○ Simple conflict resolution behavior?
50. 50
MOTIVATION, OBJECT
● Nodes scale up (faster, bigger)
● Clusters scale out
○ Bigger clusters within a site
● Organizations scale globally
○ Multiple sites, data centers
○ Multiple public and private clouds
○ Multiple units within an organization
● Universal, global connectivity
○ Access your data from anywhere
● API consistency
○ Write apps to a single object API (e.g., S3)
regardless of which site, cloud it is
deployed on
● Disaster recovery
○ Replicate object data across sites
○ Synchronously or asynchronously
○ Failover application and reattach
○ Active/passive and active/active
● Migration
○ Migrate data set between sites, tiers
○ While it is being used
● Edge scenarios (caching and buffering)
○ Cache remote bucket locally
○ Buffer new data locally
51. 51
RGW MULTISITE FOR QUINCY
● Project Zipper
○ Internal abstractions to allow alternate storage backends (e.g., store data in an external object store)
○ Policy layer based on Lua
○ Initial targets: database and file-based stores, tiering to cloud (e.g., S3)
● Dynamic resharding vs multisite support
● Sync from external sources
○ AWS
● Lifecycle transition to cloud
54. 54
OTHER ECOSYSTEM EFFORTS
Windows
● Windows port for RBD is underway
● Lightweight kernel pass-through to librbd
● CephFS to follow (based on Dokan)
Performance testing hardware
● Intel test cluster: officinalis
● AMD / Samsung / Mellanox cluster
● High-end ARM-based system?
ARM (aarch64)
● Loads of new build and test hardware arriving in the lab
● CI and release builds for aarch64
IBM Z
● Collaboration with the IBM Z team
● Build and test