Ceph, Now and Later: Our Plan for Open Unified Cloud Storage


Ceph is a highly scalable open source distributed storage system that provides object, block, and file interfaces on a single platform. Although Ceph RBD block storage has dominated OpenStack deployments for several years, maturing object (S3, Swift, and librados) interfaces and stable CephFS (file) interfaces now make Ceph the only fully open source unified storage platform.

This talk will cover Ceph's architectural vision and project mission and how our approach differs from alternative approaches to storage in the OpenStack ecosystem. In particular, we will look at how our open development model dovetails well with OpenStack, how major contributors are advancing Ceph capabilities and performance at a rapid pace to adapt to new hardware types and deployment models, and what major features we are prioritizing for the next few years to meet the needs of expanding cloud workloads.

Published in: Software


OUR VISION FOR OPEN UNIFIED CLOUD STORAGE
SAGE WEIL – RED HAT
OPENSTACK SUMMIT BARCELONA – 2016.10.27

OUTLINE
● Now
– why we develop ceph
– why it's open
– why it's unified
– hardware
– openstack
– why not ceph
● Later
– roadmap highlights
– the road ahead
– community

JEWEL

● RADOS – a reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes
● LIBRADOS – a library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP
● RBD – a reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver
● RADOSGW – a bucket-based REST gateway, compatible with S3 and Swift
● CEPH FS – a POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE
(diagram: apps, hosts/VMs, and clients consuming these interfaces; maturity labels range from "NEARLY AWESOME" to "AWESOME")

2016 = FULLY AWESOME
● RGW (OBJECT) – S3- and Swift-compatible object storage with object versioning, multi-site federation, and replication
● RBD (BLOCK) – a virtual block device with snapshots, copy-on-write clones, and multi-site replication
● CEPHFS (FILE) – a distributed POSIX file system with coherent caches and snapshots on any directory
● LIBRADOS – a library allowing apps to directly access RADOS (C, C++, Java, Python, Ruby, PHP)
● RADOS – a software-based, reliable, autonomic, distributed object store comprised of self-healing, self-managing, intelligent storage nodes (OSDs) and lightweight monitors (Mons)

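As a quick illustration of the LIBRADOS layer listed above, here is a minimal sketch (not from the talk) that stores and reads back one object using the official Python binding; the pool name and ceph.conf path are assumptions about a local test cluster.

```python
# Minimal LIBRADOS sketch: write one object and read it back.
# Pool name 'rbd' and the conffile path are assumptions about the local cluster.
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
try:
    ioctx = cluster.open_ioctx('rbd')                # any existing pool will do
    try:
        ioctx.write_full('greeting', b'hello ceph')  # store an object
        print(ioctx.read('greeting'))                # read it back
    finally:
        ioctx.close()
finally:
    cluster.shutdown()
```

The RBD, RGW, and CephFS interfaces on the slide are all built on top of exactly this kind of RADOS object access.
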
WHAT IS CEPH ALL ABOUT
● Distributed storage
● All components scale horizontally
● No single point of failure
● Software
● Hardware agnostic, commodity hardware
● Object, block, and file in a single cluster
● Self-manage whenever possible
● Open source (LGPL)

WHY OPEN SOURCE
● Avoid software vendor lock-in
● Avoid hardware lock-in
– and enable hybrid deployments
● Lower total cost of ownership
● Transparency
● Self-support
● Add, improve, or extend

WHY UNIFIED STORAGE
● Simplicity of deployment
– single cluster serving all APIs
● Efficient utilization of storage
● Simplicity of management
– single set of management skills, staff, tools
...is that really true?

WHY UNIFIED STORAGE
● Simplicity of deployment
– single cluster serving all APIs
● Efficient utilization of space
● Only really true for small clusters
– Large deployments optimize for workload
● Simplicity of management
– single set of skills, staff, tools
● Functionality in RADOS exploited by all
– Erasure coding, tiering, performance, monitoring

SPECIALIZED CLUSTERS
– Multiple Ceph clusters, one per use case
(diagram: separate RADOS clusters, each with its own monitors, for HDD, SSD, and NVMe hardware)

HYBRID CLUSTER
– Single Ceph cluster
– Separate RADOS pools for different storage hardware types
(diagram: one RADOS cluster containing HDD, SSD, and NVMe pools)

CEPH ON FLASH
● Samsung
– up to 153 TB in 2U
– 700K IOPS, 30 GB/s
● SanDisk
– up to 512 TB in 3U
– 780K IOPS, 7 GB/s

CEPH ON SSD+HDD
● SuperMicro
● Dell
● HP
● Quanta / QCT
● Fujitsu (Eternus)
● Penguin OCP

CEPH ON HDD (LITERALLY)
● Microserver on 8 TB He WD HDD
● 1 GB RAM
● 2-core 1.3 GHz Cortex A9
● 2x 1GbE SGMII instead of SATA
● 504-OSD, 4 PB cluster
– SuperMicro 1048-RT chassis
● Next gen will be aarch64

FOSS → FAST AND OPEN INNOVATION
● Free and open source software (FOSS)
– enables innovation in software
– ideal testbed for new hardware architectures
● Proprietary software platforms are full of friction
– harder to modify, experiment with closed-source code
– require business partnerships and NDAs to be negotiated
– dual investment (of time and/or $$$) from both parties

PERSISTENT MEMORY
● Persistent (non-volatile) RAM is coming
– e.g., 3D XPoint from Intel/Micron any year now
– (and NV-DIMMs are a bit $$$ but already here)
● These are going to be
– fast (~1000x), dense (~10x), high-endurance (~1000x)
– expensive (~1/2 the cost of DRAM)
– disruptive
● Intel
– PMStore – OSD prototype backend targeting 3D XPoint
– NVM-L (pmem.io/nvml)

CREATE THE ECOSYSTEM
TO BECOME THE LINUX
OF DISTRIBUTED STORAGE
OPEN – COLLABORATIVE – GENERAL PURPOSE

CEPH + OPENSTACK
(diagram: OpenStack services consuming a single RADOS cluster – Keystone and the Swift API via RADOSGW, Glance/Cinder/Nova via librbd and KVM, Manila via CephFS)

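The Glance/Cinder/Nova integrations in the diagram above lean on RBD's copy-on-write clones. The sketch below, using the Python rbd binding, shows the general shape of that workflow; the pool names ('images', 'volumes'), image name, and sizes are illustrative assumptions, not the drivers' actual configuration.

```python
# Hedged sketch of the copy-on-write clone workflow behind the
# Glance -> Cinder/Nova integration. Pool and image names are made up.
import rados
import rbd

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
images = cluster.open_ioctx('images')     # Glance-style image pool (assumed name)
volumes = cluster.open_ioctx('volumes')   # Cinder-style volume pool (assumed name)
r = rbd.RBD()

# "Upload" a base image, snapshot it, and protect the snapshot so it can
# serve as a clone parent (layering must be enabled on the image).
r.create(images, 'base-image', 1 * 1024**3,
         old_format=False, features=rbd.RBD_FEATURE_LAYERING)
with rbd.Image(images, 'base-image') as img:
    img.create_snap('golden')
    img.protect_snap('golden')

# Each volume or VM disk becomes a thin copy-on-write clone of that snapshot.
r.clone(images, 'base-image', 'golden', volumes, 'vol-0001',
        features=rbd.RBD_FEATURE_LAYERING)

volumes.close()
images.close()
cluster.shutdown()
```
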
CINDER DRIVER USAGE
(chart: Cinder block storage driver usage; Ceph RBD has dominated OpenStack deployments)

WHY NOT CEPH?
● inertia
● performance
● functionality
● stability
● ease of use

WHAT ARE WE DOING ABOUT IT?

RELEASE CADENCE
● Hammer (LTS, 0.94.z) – Spring 2015
● Infernalis – Fall 2015
● Jewel (LTS, 10.2.z) – Spring 2016 ← WE ARE HERE
● Kraken – Fall 2016
● Luminous (LTS, 12.2.z) – Spring 2017

KRAKEN

BLUESTORE
● BlueStore = Block + NewStore
– key/value database (RocksDB) for metadata
– all data written directly to raw block device(s)
– can combine HDD, SSD, NVMe, NVRAM
● Full data checksums (crc32c, xxhash)
● Inline compression (zlib, snappy, zstd)
● ~2x faster than FileStore
– better parallelism, efficiency on fast devices
– no double writes for data
– performs well with very small SSD journals
(diagram: OSD → ObjectStore → BlueStore, with data written straight to a BlockDevice and metadata going through RocksDB → BlueRocksEnv → BlueFS to a BlockDevice; Allocator and Compressor sit alongside)

BLUESTORE – WHEN CAN HAZ?
● Current master
– nearly finalized disk format, good performance
● Kraken
– stable code, stable disk format
– probably still flagged 'experimental'
● Luminous
– fully stable and ready for broad usage
– maybe the default... we will see how it goes
● Migrating from FileStore
– evacuate and fail old OSDs
– reprovision with BlueStore
– let Ceph recover normally

ASYNCMESSENGER
● New implementation of network layer
– replaces aging SimpleMessenger
– fixed size thread pool (vs 2 threads per socket)
– scales better to larger clusters
– more healthy relationship with tcmalloc
– now the default!
● Pluggable backends
– PosixStack – Linux sockets, TCP (default, supported)
– Two experimental backends!

TCP OVERHEAD IS SIGNIFICANT
(charts: write and read measurements showing the TCP overhead)

DPDK STACK

RDMA STACK
● RDMA for data path (ibverbs)
● Keep TCP for control path
● Working prototype in ~1 month

LUMINOUS

MULTI-MDS CEPHFS

ERASURE CODE OVERWRITES
● Current EC RADOS pools are append only
– simple, stable; suitable for RGW, or behind a cache tier
● EC overwrites will allow RBD and CephFS to consume EC pools directly
● It's hard
– implementation requires two-phase commit
– complex to avoid full-stripe update for small writes
– relies on efficient implementation of a "move ranges" operation by BlueStore
● It will be huge
– tremendous impact on TCO
– make RBD great again

CEPH-MGR
● ceph-mon monitor daemons currently do a lot
– more than they need to (PG stats to support things like 'df')
– this limits cluster scalability
● ceph-mgr moves non-critical metrics into a separate daemon
– that is more efficient
– that can stream to graphite, influxdb
– that can efficiently integrate with external modules (even Python!)
● Good host for
– integrations, like Calamari REST API endpoint
– coming features like 'ceph top' or 'rbd top'
– high-level management functions and policy

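To make the "even Python!" point concrete, here is a hypothetical sketch of what a ceph-mgr module might look like; the module body and the data name it asks for are assumptions based on the MgrModule interface, not a shipping module.

```python
# Hypothetical ceph-mgr module sketch (not a shipping module): periodically
# pull cluster state and log it; the same loop could push to Graphite/InfluxDB.
import time
from mgr_module import MgrModule


class Module(MgrModule):
    def __init__(self, *args, **kwargs):
        super(Module, self).__init__(*args, **kwargs)
        self._run = True

    def serve(self):
        # ceph-mgr calls serve() in its own thread; loop until shutdown().
        while self._run:
            osd_map = self.get('osd_map')   # decoded OSDMap as a dict (assumed data name)
            self.log.info('osdmap epoch %d, %d OSDs',
                          osd_map['epoch'], len(osd_map['osds']))
            time.sleep(10)

    def shutdown(self):
        self._run = False
```
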
QUALITY OF SERVICE
● Set policy for both
– reserved/minimum IOPS
– proportional sharing of excess capacity
● by
– type of IO (client, scrub, recovery)
– pool
– client (e.g., VM)
● Based on the mClock paper from OSDI '10
– IO scheduler
– distributed enforcement with cooperating clients

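As a back-of-the-envelope illustration of the reservation-plus-weight idea (not Ceph's scheduler, just arithmetic in the spirit of mClock), the sketch below splits a device's IOPS budget among clients; all names and numbers are invented.

```python
# Illustrative arithmetic only: reservations are satisfied first, then the
# remaining capacity is shared in proportion to each client's weight.
def share_iops(capacity, clients):
    """clients: {name: (reservation_iops, weight)} -> {name: iops}"""
    excess = capacity - sum(r for r, _ in clients.values())
    total_weight = sum(w for _, w in clients.values())
    return {name: r + excess * w / total_weight
            for name, (r, w) in clients.items()}

# A 1000-IOPS device shared by two VMs and background recovery:
print(share_iops(1000, {
    'vm-a':     (100, 3),   # reserve 100 IOPS, weight 3
    'vm-b':     (100, 3),
    'recovery': (50,  1),   # keep recovery alive but deprioritized
}))
# -> vm-a and vm-b each get ~421 IOPS, recovery ~157, out of the 750 excess IOPS
```
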
SHARED STORAGE → LATENCY
(diagram: every VM write travels over Ethernet to the shared cluster, adding latency)

LOCAL SSD → LOW LATENCY
(diagram: the VM writes to a local SSD over SATA instead, avoiding the Ethernet round trip)

LOCAL SSD → FAILURE → SADNESS
(diagram: when the host or SSD fails, data that only lived on the local SSD is lost)

WRITEBACK IS UNORDERED
(diagram: writes A', B', C, D sit in the local SSD cache and are written back to the cluster in no particular order)

WRITEBACK IS UNORDERED
(diagram, continued)
A' B' C D → never happened

RBD ORDERED WRITEBACK CACHE
(diagram: writes flow through the local SSD cache and are written back in order, with some "cleverness" in between)
A B C D → CRASH CONSISTENT!

RBD ORDERED WRITEBACK CACHE
● Low latency writes to local SSD
● Persistent cache (across reboots etc)
● Fast reads from cache
● Ordered writeback to the cluster, with batching, etc.
● RBD image is always point-in-time consistent
Comparison:
– Local SSD: low latency, but a SPoF with no data surviving failure
– Ceph RBD: higher latency, no SPoF, perfect durability
– RBD + ordered writeback cache: low latency, SPoF only for recent writes, image is stale but crash consistent

IN THE FUTURE
MOST BYTES WILL BE STORED
IN OBJECTS

RADOSGW
S3, Swift, erasure coding, multisite federation, multisite replication, WORM, encryption, tiering, deduplication, compression

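Because RGW speaks the S3 wire protocol, standard S3 SDKs work against it. A small sketch with boto3 follows; the endpoint URL, credentials, and bucket name are placeholders for a real RGW deployment.

```python
# Talk to RADOSGW through its S3-compatible API with boto3.
# Endpoint, credentials, and bucket name below are placeholders.
import boto3

s3 = boto3.client(
    's3',
    endpoint_url='http://rgw.example.com:7480',
    aws_access_key_id='ACCESS_KEY',
    aws_secret_access_key='SECRET_KEY',
)
s3.create_bucket(Bucket='demo')
s3.put_object(Bucket='demo', Key='hello.txt', Body=b'stored via RGW')
print(s3.get_object(Bucket='demo', Key='hello.txt')['Body'].read())
```
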
RADOSGW INDEXING
(diagram: zones A, B, and C, each a RADOS cluster fronted by RADOSGW over LIBRADOS, federated with one another over REST)

GROWING DEVELOPMENT COMMUNITY
Red Hat, Mirantis, SUSE, SanDisk, XSKY, ZTE, LETV, Quantum, EasyStack, H3C, UnitedStack, Digiware, Mellanox, Intel, Walmart Labs, DreamHost, Tencent, Deutsche Telekom, Igalia, Fujitsu, DigitalOcean, University of Toronto

TAKEAWAYS
● Nobody should have to use proprietary storage systems out of necessity
● The best storage solution should be an open source solution
● Ceph is growing up, but we're not done yet
– performance, scalability, features, ease of use
● We are highly motivated

HOW TO HELP
Operator
● File bugs: http://tracker.ceph.com/
● Document: http://github.com/ceph/ceph
● Blog
● Build a relationship with the core team
Developer
● Fix bugs
● Help design missing functionality
● Implement missing functionality
● Integrate
● Help make it easier to use
● Participate in monthly Ceph Developer Meetings
– video chat, EMEA- and APAC-friendly times

THANK YOU!
Sage Weil
CEPH PRINCIPAL ARCHITECT
sage@redhat.com
@liewegas
