2. outline
● the scale-out world
● what is it, what it's for, how it works
● how you can use it
● librados
● radosgw
● RBD, the ceph block device
● distributed file system
● how do you deploy and manage it
● roadmap
● why we do this, who we are
4. the old reality
● direct-attached storage
● raw disks
● RAID controller
● network-attached storage
● a few large arrays
● NFS, CIFS, maybe iSCSI
● SAN
● expensive enterprise array
● scalable capacity and performance, to a point
● poor management scalability
5. new user demands
● “cloud”
● many disk volumes, with automated provisioning
● larger data sets
– RESTful object storage (e.g., S3, Swift)
– NoSQL distributed key/value stores (Cassandra, Riak)
● “big data”
● distributed processing
● “unstructured data”
● ...and the more traditional workloads
6. requirements
● diverse use-cases
● object storage
● block devices (for VMs) with snapshots, cloning
● shared file system with POSIX, coherent caches
● structured data... files, block devices, or objects?
● scale
● terabytes, petabytes, exabytes
● heterogeneous hardware
● reliability and fault tolerance
7. cost
● near-linear function of size or performance
● incremental expansion
● no fork-lift upgrades
● no vendor lock-in
● choice of hardware
● choice of software
● open
8. time
● ease of administration
● no manual data migration, load balancing
● painless scaling
● expansion and contraction
● seamless migration
● devops friendly
9. being a good devops citizen
● seamless integration with infrastructure tools
● Chef, Juju, Puppet, Crowbar, or home-grown
● collectd, statsd, graphite, nagios
● OpenStack, CloudStack
● simple processes for
● expanding or contracting the storage cluster
● responding to failure scenarios: initial healing, or mop-up
● evolution of “Unix philosophy”
● everything is just a daemon
● CLI, REST APIs, JSON
● solve one problem, and solve it well
12. [diagram: the Ceph stack]
LIBRADOS – a library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP
RADOSGW – a bucket-based REST gateway, compatible with S3 and Swift
RBD – a reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver
CEPH FS – a POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE
RADOS – the Ceph object store: a reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes
13. distributed storage system
● data center scale
● 10s to 10,000s of machines
● terabytes to exabytes
● fault tolerant
● no single point of failure
● commodity hardware
● self-managing, self-healing
14. ceph object model
● pools
● 1s to 100s
● independent object namespaces or collections
● replication level, placement policy
● objects
● bazillions
● blob of data (bytes to gigabytes)
● attributes (e.g., “version=12”; bytes to kilobytes)
● key/value bundle (bytes to gigabytes)
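To make the object model above concrete, here is a minimal python-rados sketch (pool and object names are illustrative, and it assumes a reachable cluster with a standard /etc/ceph/ceph.conf):

    import rados

    # connect using the standard config file (path is an assumption)
    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()

    # a pool is an independent object namespace; 'data' is illustrative
    ioctx = cluster.open_ioctx('data')

    # an object is a named blob of data plus attributes
    ioctx.write_full('greeting', b'hello, RADOS')     # data blob
    ioctx.set_xattr('greeting', 'version', b'12')     # attribute, e.g. "version=12"

    print(ioctx.read('greeting'))                     # b'hello, RADOS'
    print(ioctx.get_xattr('greeting', 'version'))     # b'12'

    ioctx.close()
    cluster.shutdown()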
15. why start with objects?
● more useful than (disk) blocks
● names in a simple flat namespace
● variable size
● simple API with rich semantics
● more scalable than files
● no hard-to-distribute hierarchy
● update semantics do not span objects
● workload is trivially parallel
16. How do you design a storage system that scales?
17. [diagram: a human, one computer, many disks]
18. [diagram: many humans, one computer, many disks]
19. [diagram: many humans and many disks, with the (COMPUTER) barely visible among them] (actually more like this…)
20. [diagram: many computer+disk pairs serving a few humans]
21. [diagram: five OSDs, each backed by a local file system (btrfs, xfs, or ext4) on a disk, plus three monitors (M)]
22. Monitors (M)
● maintain cluster membership and state
● provide consensus for distributed decision-making
● small, odd number
● do not serve stored objects to clients
Object Storage Daemons (OSDs)
● 10s to 10,000s in a cluster
● one per disk, SSD, or RAID group
● hardware agnostic
● serve stored objects to clients
● intelligently peer to perform replication and recovery tasks
24. data distribution
● all objects are replicated N times
● objects are automatically placed, balanced, and migrated in a dynamic cluster
● must consider physical infrastructure
● ceph-osds on hosts in racks in rows in data centers
● three approaches
● pick a spot; remember where you put it
● pick a spot; write down where you put it
● calculate where to put it, where to find it
25. CRUSH
● pseudo-random placement algorithm
● fast calculation, no lookup (see the toy sketch below)
● repeatable, deterministic
● statistically uniform distribution
● stable mapping
● limited data migration on change
● rule-based configuration
● infrastructure topology aware
● adjustable replication
● allows weighting
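CRUSH itself walks a weighted, rule-driven hierarchy; the toy Python sketch below is not the real algorithm, just an illustration of deterministic, lookup-free placement and stable mapping using a highest-random-weight hash:

    import hashlib

    def toy_place(obj, osds, replicas=3):
        # rank every OSD by a hash of (object, osd) and take the top N;
        # any client computes the same answer with no lookup table
        def score(osd):
            return int(hashlib.md5(('%s:%d' % (obj, osd)).encode()).hexdigest(), 16)
        return sorted(osds, key=score, reverse=True)[:replicas]

    osds = list(range(10))
    print(toy_place('myobject', osds))   # deterministic, repeatable

    # stable mapping: adding an OSD remaps only a fraction of objects
    osds.append(10)
    print(toy_place('myobject', osds))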
28. RADOS
● monitors publish the osd map that describes cluster state
● ceph-osd node status (up/down, weight, IP)
● CRUSH function specifying desired data distribution
● object storage daemons (OSDs)
● safely replicate and store objects
● migrate data as the cluster changes over time
● coordinate based on a shared view of reality
● decentralized, distributed approach allows
● massive scales (10,000s of servers or more)
● the illusion of a single copy with consistent behavior
33. [diagram: the Ceph stack from slide 12, highlighting LIBRADOS]
35. librados
● direct access to RADOS from applications
● C, C++, Python, PHP, Java, Erlang
● direct access to storage nodes
● no HTTP overhead
36. rich librados API
● atomic single-object transactions
● update data and attributes together
● atomic compare-and-swap
● efficient key/value storage inside an object (see the sketch below)
● object-granularity snapshot infrastructure
● embed code in the ceph-osd daemon via a plugin API
● inter-client communication via objects
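A hedged python-rados sketch of two of these features, a compound write applied to one object as a single transaction and key/value (omap) storage inside the object; pool and object names are illustrative:

    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('data')

    # apply several updates to one object as a single transaction
    with rados.WriteOpCtx() as op:
        ioctx.set_omap(op, ('state', 'owner'), (b'active', b'sam'))  # k/v inside the object
        ioctx.operate_write_op(op, 'job.1')

    # read the key/value pairs back
    with rados.ReadOpCtx() as op:
        it, rc = ioctx.get_omap_vals(op, '', '', 10)
        ioctx.operate_read_op(op, 'job.1')
        for key, value in it:
            print(key, value)

    ioctx.close()
    cluster.shutdown()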
37. [diagram: the Ceph stack, highlighting RADOSGW]
38. [diagram: apps speak REST to radosgw instances, which use librados to talk natively to the RADOS cluster and its monitors (M)]
39. RADOS Gateway
● REST-based object storage proxy
● uses RADOS to store objects
● API supports buckets, accounting
● usage accounting for billing purposes
● compatible with S3, Swift APIs (see the sketch below)
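Because the gateway is S3-compatible, stock S3 clients work against it; a sketch using the boto library, with placeholder endpoint and credentials:

    import boto
    import boto.s3.connection

    # endpoint and credentials are placeholders for a radosgw instance
    conn = boto.connect_s3(
        aws_access_key_id='ACCESS_KEY',
        aws_secret_access_key='SECRET_KEY',
        host='radosgw.example.com',
        is_secure=False,
        calling_format=boto.s3.connection.OrdinaryCallingFormat(),
    )

    bucket = conn.create_bucket('my-bucket')
    key = bucket.new_key('hello.txt')
    key.set_contents_from_string('hello from radosgw')
    print([b.name for b in conn.get_all_buckets()])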
40. [diagram: the Ceph stack, highlighting RBD]
41. [diagram: many computer+disk pairs]
42. [diagram: VMs scattered among many computer+disk pairs]
43. [diagram: VM → virtualization container (KVM) → LIBRBD → LIBRADOS → cluster monitors (M)]
44. [diagram: containers and a VM, each stacked on LIBRBD and LIBRADOS, talking to the cluster monitors (M)]
45. [diagram: HOST → KRBD (kernel module) → LIBRADOS → cluster monitors (M)]
46. RADOS Block Device
● storage of disk images in RADOS
● decouple VM from host
● images striped across entire cluster (pool)
● snapshots
● copy-on-write clones (see the sketch below)
● support in
● QEMU/KVM
● mainline Linux kernel (2.6.39+)
● OpenStack, CloudStack
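A python-rbd sketch of the image, snapshot, and clone workflow; pool and image names are illustrative:

    import rados
    import rbd

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('rbd')

    # create a 1 GB image, striped across the pool;
    # the layering feature enables copy-on-write clones
    rbd.RBD().create(ioctx, 'vm-disk', 1024 ** 3,
                     old_format=False, features=rbd.RBD_FEATURE_LAYERING)

    with rbd.Image(ioctx, 'vm-disk') as image:
        image.create_snap('golden')    # point-in-time snapshot
        image.protect_snap('golden')   # required before cloning

    # copy-on-write clone of the snapshot
    rbd.RBD().clone(ioctx, 'vm-disk', 'golden', ioctx, 'vm-disk-clone',
                    features=rbd.RBD_FEATURE_LAYERING)

    ioctx.close()
    cluster.shutdown()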
47. [diagram: the Ceph stack, highlighting CEPH FS]
48. [diagram: the client's metadata and data take separate paths into the cluster (M)]
50. Metadata Server (MDS)
● manages metadata for the POSIX shared file system
● directory hierarchy
● file metadata (size, owner, timestamps)
● stores metadata in RADOS
● does not serve file data to clients
● only required for the shared file system
61. installation
● work closely with distros
● packages in Debian, Fedora, Ubuntu, OpenSUSE
● build releases for all distros, recent releases
● e.g., point to our deb src and apt-get install
● allow mixed-version clusters
● upgrade each component individually, each host or rack one-by-one
● protocol feature bits, safe data type encoding
● facilitate testing, bleeding edge
● automatic build system generates packages for all branches in git
● easy to test new code, push hot-fixes
62. configuration
● minimize local configuration
● logging, local data paths, tuning options
● cluster state is managed centrally by monitors
● specify options via
● config file (global or local)
● command line
● adjustable on a running daemon
● flexible configuration management options
● global synced/copied config file
● generated by management tools
– Chef, Juju, Crowbar
● be flexible
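For illustration, a minimal ceph.conf along these lines (host names and paths are placeholders; most cluster state lives with the monitors, not here):

    [global]
        ; where clients and daemons find the monitors
        mon host = mon-a.example.com, mon-b.example.com, mon-c.example.com

    [osd]
        ; local details only: data path, journal, logging
        osd data = /var/lib/ceph/osd/$cluster-$id
        osd journal = /var/lib/ceph/osd/$cluster-$id/journal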
63. provisioning
● embrace the dynamic nature of the cluster
● disks, hosts, racks may come online at any time
● anything may fail at any time
● simple scriptable sequences (driven from a script in the sketch below)
'ceph osd create <uuid>' → allocate osd id
'ceph-osd --mkfs -i <id>' → initialize local data dir
'ceph osd crush …' → add to crush map
● identify minimal amount of central coordination
● monitor cluster membership/quorum
● provide hooks for external tools to do the rest
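The same sequence is easy to drive from any tool; a hedged Python sketch that shells out to the commands above (the crush weight and location are illustrative):

    import subprocess
    import uuid

    def sh(*cmd):
        # run a command and return its trimmed stdout, raising on failure
        return subprocess.check_output(cmd).decode().strip()

    osd_uuid = str(uuid.uuid4())
    osd_id = sh('ceph', 'osd', 'create', osd_uuid)   # allocate osd id
    sh('ceph-osd', '--mkfs', '-i', osd_id)           # initialize local data dir
    # add to the crush map; weight and location here are illustrative
    sh('ceph', 'osd', 'crush', 'set', 'osd.' + osd_id, '1.0',
       'host=node1', 'rack=rack1')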
64. old school HA
● deploy a pair of servers
● heartbeat
● move a floating IP between them
● clients contact same “server”, which floats
● cumbersome to configure, deploy
● the “client/server” model doesn't scale out
65. make client, protocol cluster-aware
● clients learn topology of the distributed cluster
● all OSDs, current IPs/ports, data distribution map
● clients talk to the server they want
● servers scale out, client traffic distributes too
● servers dynamically register with the cluster
● when a daemon starts, it updates its address in the map
● other daemons and clients learn map updates via gossip
● servers manage heartbeat, replication
● no fragile configuration
● no fail-over pairs
66. ceph disk management
● label disks
● GPT partition type (fixed uuid)
● udev
● generate event when disk is added, on boot
● 'ceph-disk-activate /dev/sdb'
● mount the disk in the appropriate location (/var/lib/ceph/osd/NNN)
● possibly adjust cluster metadata about disk location (host, rack)
● upstart, sysvinit, …
● start the daemon
● daemon “joins” the cluster, brings itself online
● no manual per-node configuration
67. distributed system health
● ceph-mon collects, aggregates basic system state
● daemon up/down, current IPs
● disk utilization
● simple hooks query system health
● 'ceph health [detail]'
● trivial plugins for nagios, etc. (sketched below)
● daemon instrumentation
● query state of a running daemon
● use external tools to aggregate
– collectd, graphite
– statsd
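For example, a trivial nagios-style plugin can just wrap 'ceph health'; a sketch, assuming the conventional nagios exit codes:

    import subprocess
    import sys

    # 'ceph health' prints HEALTH_OK, HEALTH_WARN ..., or HEALTH_ERR ...
    status = subprocess.check_output(['ceph', 'health']).decode().strip()
    print(status)

    if status.startswith('HEALTH_OK'):
        sys.exit(0)   # nagios: OK
    elif status.startswith('HEALTH_WARN'):
        sys.exit(1)   # nagios: WARNING
    else:
        sys.exit(2)   # nagios: CRITICAL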
68. many paths
● do it yourself
● use ceph interfaces directly
● Chef
● ceph cookbooks, DreamHost cookbooks
● Crowbar
● Juju
● pre-packaged solutions
● Piston AirFrame, Dell OpenStack-Powered Cloud Solution, Suse Cloud, etc.
70. [diagram: the Ceph stack again; RADOS, LIBRADOS, RADOSGW, and RBD labeled AWESOME, CEPH FS labeled NEARLY AWESOME]
73. why we do this
● limited options for scalable open source storage
● proprietary solutions
● expensive
● don't scale (well or out)
● marry hardware and software
● users hungry for alternatives
● scalability
● cost
● features
● open, interoperable
74. licensing
<yawn>
● promote adoption
● enable community development
● prevent ceph from becoming proprietary
● allow organic commercialization
75. LGPLv2
● “copyleft”
● free distribution
● allow derivative works
● changes you distribute/sell must be shared
● ok to link to proprietary code
● allow proprietary products to include and build on ceph
● does not allow proprietary derivatives of ceph
76. fragmented copyright
● we do not require copyright assignment from contributors
● no single person or entity owns all of ceph
● no single entity can make ceph proprietary
● strong community
● many players make ceph a safe technology bet
● project can outlive any single business
77. why it's important
● ceph is an ingredient
● we need to play nice in a larger ecosystem
● community will be key to success
● truly open source solutions are disruptive
● frictionless integration with projects, platforms, tools
● freedom to innovate on protocols
● leverage community testing, development resources
● open collaboration is efficient way to build technology
78. who we are
● Ceph created at UC Santa Cruz (2004-2007)
● developed by DreamHost (2008-2011)
● supported by Inktank (2012)
● Los Angeles, Sunnyvale, San Francisco, remote
● growing user and developer community
● Linux distros, users, cloud stacks, SIs, OEMs
http://ceph.com/