Ceph, Now and Later: Our Plan for Open Unified Cloud Storage


Ceph is a highly scalable open source distributed storage system that provides object, block, and file interfaces on a single platform. Although Ceph RBD block storage has dominated OpenStack deployments for several years, maturing object (S3, Swift, and librados) interfaces and stable CephFS (file) interfaces now make Ceph the only fully open source unified storage platform.

This talk will cover Ceph's architectural vision and project mission and how our approach differs from alternative approaches to storage in the OpenStack ecosystem. In particular, we will look at how our open development model dovetails well with OpenStack, how major contributors are advancing Ceph capabilities and performance at a rapid pace to adapt to new hardware types and deployment models, and what major features we are prioritizing for the next few years to meet the needs of expanding cloud workloads.

Published in: Software


OUR VISION FOR OPEN UNIFIED CLOUD STORAGE
SAGE WEIL – RED HAT
OPENSTACK SUMMIT BARCELONA – 2016.10.27

OUTLINE
● Now
– why we develop ceph
– why it's open
– why it's unified
– hardware
– openstack
– why not ceph
● Later
– roadmap highlights
– the road ahead
– community

JEWEL

● RADOS – a reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes
● LIBRADOS – a library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP
● RBD – a reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver
● RADOSGW – a bucket-based REST gateway, compatible with S3 and Swift
● CEPH FS – a POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE
(diagram: apps, hosts/VMs, and clients consuming these interfaces; maturity labels range from "NEARLY AWESOME" to "AWESOME")

2016 = FULLY AWESOME
● RGW (OBJECT) – S3- and Swift-compatible object storage with object versioning, multi-site federation, and replication
● RBD (BLOCK) – a virtual block device with snapshots, copy-on-write clones, and multi-site replication
● CEPHFS (FILE) – a distributed POSIX file system with coherent caches and snapshots on any directory
● LIBRADOS – a library allowing apps to directly access RADOS (C, C++, Java, Python, Ruby, PHP)
● RADOS – a software-based, reliable, autonomic, distributed object store comprised of self-healing, self-managing, intelligent storage nodes (OSDs) and lightweight monitors (Mons)

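As a quick illustration of the LIBRADOS layer listed above, here is a minimal sketch (not from the talk) that stores and reads back one object using the official Python binding; the pool name and ceph.conf path are assumptions about a local test cluster.

```python
# Minimal LIBRADOS sketch: write one object and read it back.
# Pool name 'rbd' and the conffile path are assumptions about the local cluster.
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
try:
    ioctx = cluster.open_ioctx('rbd')                # any existing pool will do
    try:
        ioctx.write_full('greeting', b'hello ceph')  # store an object
        print(ioctx.read('greeting'))                # read it back
    finally:
        ioctx.close()
finally:
    cluster.shutdown()
```

The RBD, RGW, and CephFS interfaces on the slide are all built on top of exactly this kind of RADOS object access.
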
WHAT IS CEPH ALL ABOUT
● Distributed storage
● All components scale horizontally
● No single point of failure
● Software
● Hardware agnostic, commodity hardware
● Object, block, and file in a single cluster
● Self-manage whenever possible
● Open source (LGPL)

WHY OPEN SOURCE
● Avoid software vendor lock-in
● Avoid hardware lock-in
– and enable hybrid deployments
● Lower total cost of ownership
● Transparency
● Self-support
● Add, improve, or extend

WHY UNIFIED STORAGE
● Simplicity of deployment
– single cluster serving all APIs
● Efficient utilization of storage
● Simplicity of management
– single set of management skills, staff, tools
...is that really true?

WHY UNIFIED STORAGE
● Simplicity of deployment
– single cluster serving all APIs
● Efficient utilization of space
● Only really true for small clusters
– Large deployments optimize for workload
● Simplicity of management
– single set of skills, staff, tools
● Functionality in RADOS exploited by all
– Erasure coding, tiering, performance, monitoring

SPECIALIZED CLUSTERS
– Multiple Ceph clusters, one per use case
(diagram: separate RADOS clusters, each with its own monitors, for HDD, SSD, and NVMe hardware)

HYBRID CLUSTER
– Single Ceph cluster
– Separate RADOS pools for different storage hardware types
(diagram: one RADOS cluster containing HDD, SSD, and NVMe pools)

CEPH ON FLASH
● Samsung
– up to 153 TB in 2U
– 700K IOPS, 30 GB/s
● SanDisk
– up to 512 TB in 3U
– 780K IOPS, 7 GB/s

CEPH ON SSD+HDD
● SuperMicro
● Dell
● HP
● Quanta / QCT
● Fujitsu (Eternus)
● Penguin OCP

CEPH ON HDD (LITERALLY)
● Microserver on 8 TB He WD HDD
● 1 GB RAM
● 2-core 1.3 GHz Cortex A9
● 2x 1GbE SGMII instead of SATA
● 504-OSD, 4 PB cluster
– SuperMicro 1048-RT chassis
● Next gen will be aarch64

FOSS → FAST AND OPEN INNOVATION
● Free and open source software (FOSS)
– enables innovation in software
– ideal testbed for new hardware architectures
● Proprietary software platforms are full of friction
– harder to modify, experiment with closed-source code
– require business partnerships and NDAs to be negotiated
– dual investment (of time and/or $$$) from both parties

PERSISTENT MEMORY
● Persistent (non-volatile) RAM is coming
– e.g., 3D XPoint from Intel/Micron any year now
– (and NV-DIMMs are a bit $$$ but already here)
● These are going to be
– fast (~1000x), dense (~10x), high-endurance (~1000x)
– expensive (~1/2 the cost of DRAM)
– disruptive
● Intel
– PMStore – OSD prototype backend targeting 3D XPoint
– NVM-L (pmem.io/nvml)

CREATE THE ECOSYSTEM
TO BECOME THE LINUX
OF DISTRIBUTED STORAGE
OPEN – COLLABORATIVE – GENERAL PURPOSE

CEPH + OPENSTACK
(diagram: OpenStack services consuming a single RADOS cluster – Keystone and the Swift API via RADOSGW, Glance/Cinder/Nova via librbd and KVM, Manila via CephFS)

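The Glance/Cinder/Nova integrations in the diagram above lean on RBD's copy-on-write clones. The sketch below, using the Python rbd binding, shows the general shape of that workflow; the pool names ('images', 'volumes'), image name, and sizes are illustrative assumptions, not the drivers' actual configuration.

```python
# Hedged sketch of the copy-on-write clone workflow behind the
# Glance -> Cinder/Nova integration. Pool and image names are made up.
import rados
import rbd

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
images = cluster.open_ioctx('images')     # Glance-style image pool (assumed name)
volumes = cluster.open_ioctx('volumes')   # Cinder-style volume pool (assumed name)
r = rbd.RBD()

# "Upload" a base image, snapshot it, and protect the snapshot so it can
# serve as a clone parent (layering must be enabled on the image).
r.create(images, 'base-image', 1 * 1024**3,
         old_format=False, features=rbd.RBD_FEATURE_LAYERING)
with rbd.Image(images, 'base-image') as img:
    img.create_snap('golden')
    img.protect_snap('golden')

# Each volume or VM disk becomes a thin copy-on-write clone of that snapshot.
r.clone(images, 'base-image', 'golden', volumes, 'vol-0001',
        features=rbd.RBD_FEATURE_LAYERING)

volumes.close()
images.close()
cluster.shutdown()
```
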
CINDER DRIVER USAGE
(chart: Cinder block storage driver usage; Ceph RBD has dominated OpenStack deployments)

WHY NOT CEPH?
● inertia
● performance
● functionality
● stability
● ease of use

WHAT ARE WE DOING ABOUT IT?

RELEASE CADENCE
● Hammer (LTS, 0.94.z) – Spring 2015
● Infernalis – Fall 2015
● Jewel (LTS, 10.2.z) – Spring 2016 ← WE ARE HERE
● Kraken – Fall 2016
● Luminous (LTS, 12.2.z) – Spring 2017

KRAKEN

BLUESTORE
● BlueStore = Block + NewStore
– key/value database (RocksDB) for metadata
– all data written directly to raw block device(s)
– can combine HDD, SSD, NVMe, NVRAM
● Full data checksums (crc32c, xxhash)
● Inline compression (zlib, snappy, zstd)
● ~2x faster than FileStore
– better parallelism, efficiency on fast devices
– no double writes for data
– performs well with very small SSD journals
(diagram: OSD → ObjectStore → BlueStore, with data written straight to a BlockDevice and metadata going through RocksDB → BlueRocksEnv → BlueFS to a BlockDevice; Allocator and Compressor sit alongside)

BLUESTORE – WHEN CAN HAZ?
● Current master
– nearly finalized disk format, good performance
● Kraken
– stable code, stable disk format
– probably still flagged 'experimental'
● Luminous
– fully stable and ready for broad usage
– maybe the default... we will see how it goes
● Migrating from FileStore
– evacuate and fail old OSDs
– reprovision with BlueStore
– let Ceph recover normally

ASYNCMESSENGER
● New implementation of network layer
– replaces aging SimpleMessenger
– fixed size thread pool (vs 2 threads per socket)
– scales better to larger clusters
– more healthy relationship with tcmalloc
– now the default!
● Pluggable backends
– PosixStack – Linux sockets, TCP (default, supported)
– Two experimental backends!

TCP OVERHEAD IS SIGNIFICANT
(charts: write and read measurements showing the TCP overhead)

DPDK STACK

RDMA STACK
● RDMA for data path (ibverbs)
● Keep TCP for control path
● Working prototype in ~1 month

LUMINOUS

MULTI-MDS CEPHFS

ERASURE CODE OVERWRITES
● Current EC RADOS pools are append only
– simple, stable; suitable for RGW, or behind a cache tier
● EC overwrites will allow RBD and CephFS to consume EC pools directly
● It's hard
– implementation requires two-phase commit
– complex to avoid full-stripe update for small writes
– relies on efficient implementation of a "move ranges" operation by BlueStore
● It will be huge
– tremendous impact on TCO
– make RBD great again

CEPH-MGR
● ceph-mon monitor daemons currently do a lot
– more than they need to (PG stats to support things like 'df')
– this limits cluster scalability
● ceph-mgr moves non-critical metrics into a separate daemon
– that is more efficient
– that can stream to graphite, influxdb
– that can efficiently integrate with external modules (even Python!)
● Good host for
– integrations, like Calamari REST API endpoint
– coming features like 'ceph top' or 'rbd top'
– high-level management functions and policy

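To make the "even Python!" point concrete, here is a hypothetical sketch of what a ceph-mgr module might look like; the module body and the data name it asks for are assumptions based on the MgrModule interface, not a shipping module.

```python
# Hypothetical ceph-mgr module sketch (not a shipping module): periodically
# pull cluster state and log it; the same loop could push to Graphite/InfluxDB.
import time
from mgr_module import MgrModule


class Module(MgrModule):
    def __init__(self, *args, **kwargs):
        super(Module, self).__init__(*args, **kwargs)
        self._run = True

    def serve(self):
        # ceph-mgr calls serve() in its own thread; loop until shutdown().
        while self._run:
            osd_map = self.get('osd_map')   # decoded OSDMap as a dict (assumed data name)
            self.log.info('osdmap epoch %d, %d OSDs',
                          osd_map['epoch'], len(osd_map['osds']))
            time.sleep(10)

    def shutdown(self):
        self._run = False
```
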
QUALITY OF SERVICE
● Set policy for both
– reserved/minimum IOPS
– proportional sharing of excess capacity
● by
– type of IO (client, scrub, recovery)
– pool
– client (e.g., VM)
● Based on the mClock paper from OSDI '10
– IO scheduler
– distributed enforcement with cooperating clients

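As a back-of-the-envelope illustration of the reservation-plus-weight idea (not Ceph's scheduler, just arithmetic in the spirit of mClock), the sketch below splits a device's IOPS budget among clients; all names and numbers are invented.

```python
# Illustrative arithmetic only: reservations are satisfied first, then the
# remaining capacity is shared in proportion to each client's weight.
def share_iops(capacity, clients):
    """clients: {name: (reservation_iops, weight)} -> {name: iops}"""
    excess = capacity - sum(r for r, _ in clients.values())
    total_weight = sum(w for _, w in clients.values())
    return {name: r + excess * w / total_weight
            for name, (r, w) in clients.items()}

# A 1000-IOPS device shared by two VMs and background recovery:
print(share_iops(1000, {
    'vm-a':     (100, 3),   # reserve 100 IOPS, weight 3
    'vm-b':     (100, 3),
    'recovery': (50,  1),   # keep recovery alive but deprioritized
}))
# -> vm-a and vm-b each get ~421 IOPS, recovery ~157, out of the 750 excess IOPS
```
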
SHARED STORAGE → LATENCY
(diagram: every VM write travels over Ethernet to the shared cluster, adding latency)

LOCAL SSD → LOW LATENCY
(diagram: the VM writes to a local SSD over SATA instead, avoiding the Ethernet round trip)

LOCAL SSD → FAILURE → SADNESS
(diagram: when the host or SSD fails, data that only lived on the local SSD is lost)

WRITEBACK IS UNORDERED
(diagram: writes A', B', C, D sit in the local SSD cache and are written back to the cluster in no particular order)

WRITEBACK IS UNORDERED
(diagram, continued)
A' B' C D → never happened

RBD ORDERED WRITEBACK CACHE
(diagram: writes flow through the local SSD cache and are written back in order, with some "cleverness" in between)
A B C D → CRASH CONSISTENT!

RBD ORDERED WRITEBACK CACHE
● Low latency writes to local SSD
● Persistent cache (across reboots etc)
● Fast reads from cache
● Ordered writeback to the cluster, with batching, etc.
● RBD image is always point-in-time consistent
Comparison:
– Local SSD: low latency, but a SPoF with no data surviving failure
– Ceph RBD: higher latency, no SPoF, perfect durability
– RBD + ordered writeback cache: low latency, SPoF only for recent writes, image is stale but crash consistent

IN THE FUTURE
MOST BYTES WILL BE STORED
IN OBJECTS

RADOSGW
S3, Swift, erasure coding, multisite federation, multisite replication, WORM, encryption, tiering, deduplication, compression

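Because RGW speaks the S3 wire protocol, standard S3 SDKs work against it. A small sketch with boto3 follows; the endpoint URL, credentials, and bucket name are placeholders for a real RGW deployment.

```python
# Talk to RADOSGW through its S3-compatible API with boto3.
# Endpoint, credentials, and bucket name below are placeholders.
import boto3

s3 = boto3.client(
    's3',
    endpoint_url='http://rgw.example.com:7480',
    aws_access_key_id='ACCESS_KEY',
    aws_secret_access_key='SECRET_KEY',
)
s3.create_bucket(Bucket='demo')
s3.put_object(Bucket='demo', Key='hello.txt', Body=b'stored via RGW')
print(s3.get_object(Bucket='demo', Key='hello.txt')['Body'].read())
```
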
RADOSGW INDEXING
(diagram: zones A, B, and C, each a RADOS cluster fronted by RADOSGW over LIBRADOS, federated with one another over REST)

GROWING DEVELOPMENT COMMUNITY
Red Hat, Mirantis, SUSE, SanDisk, XSKY, ZTE, LETV, Quantum, EasyStack, H3C, UnitedStack, Digiware, Mellanox, Intel, Walmart Labs, DreamHost, Tencent, Deutsche Telekom, Igalia, Fujitsu, DigitalOcean, University of Toronto

TAKEAWAYS
● Nobody should have to use proprietary storage systems out of necessity
● The best storage solution should be an open source solution
● Ceph is growing up, but we're not done yet
– performance, scalability, features, ease of use
● We are highly motivated

HOW TO HELP
Operator
● File bugs: http://tracker.ceph.com/
● Document: http://github.com/ceph/ceph
● Blog
● Build a relationship with the core team
Developer
● Fix bugs
● Help design missing functionality
● Implement missing functionality
● Integrate
● Help make it easier to use
● Participate in monthly Ceph Developer Meetings
– video chat, EMEA- and APAC-friendly times

THANK YOU!
Sage Weil
CEPH PRINCIPAL ARCHITECT
sage@redhat.com
@liewegas
