5. Ceph at a glance
● Software (on commodity hardware): Ceph can run on any infrastructure, metal or virtualized, to provide a cheap and powerful storage cluster.
● All-in-1 (object, block, and file): low overhead doesn't mean just hardware, it means people too!
● CRUSH (awesomesauce): infrastructure-aware placement algorithm allows you to do really cool stuff.
● Scale (huge and beyond): designed for exabytes, current implementations in the multi-petabyte range. HPC, Big Data, Cloud, raw storage.
7. Distributed storage system
● data center scale
  ● 10s to 10,000s of machines
  ● terabytes to exabytes
● fault tolerant
  ● no single point of failure
  ● commodity hardware
  ● self-managing, self-healing
15. UCSC research grant
● “Petascale object storage”
  ● DOE: LANL, LLNL, Sandia
● Scalability
● Reliability
● Performance
  ● Raw IO bandwidth, metadata ops/sec
● HPC file system workloads
  ● Thousands of clients writing to same file, directory
16. Distributed metadata management
● Innovative design
  ● Subtree-based partitioning for locality, efficiency
  ● Dynamically adapt to current workload
  ● Embedded inodes
● Prototype simulator in Java (2004)
● First line of Ceph code
  ● Summer internship at LLNL
  ● High security national lab environment
  ● Could write anything, as long as it was OSS
17. The rest of Ceph
● RADOS – distributed object storage cluster (2005)
● EBOFS – local object storage (2004/2006)
● CRUSH – hashing for the real world (2005)
● Paxos monitors – cluster consensus (2006)
→ emphasis on consistent, reliable storage
→ scale by pushing intelligence to the edges
→ a different but compelling architecture
18. Industry black hole
● Many large storage vendors
  ● Proprietary solutions that don't scale well
● Few open source alternatives (2006)
  ● Limited community and architecture (Lustre)
  ● Very limited scale, or
  ● No enterprise feature sets (snapshots, quotas)
● PhD grads all built interesting systems...
  ● ...and then went to work for Netapp, DDN, EMC, Veritas.
  ● They want you, not your project
19. A different path
● Change the world with open source
  ● Do what Linux did to Solaris, Irix, Ultrix, etc.
  ● What could go wrong?
● License
  ● GPL, BSD...
  ● LGPL: share changes, okay to link to proprietary code
● Avoid unsavory practices
  ● Dual licensing
  ● Copyright assignment
21. [Ceph architecture diagram]
● APP → LIBRADOS: a library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP (a Python sketch follows this diagram)
● APP → RADOSGW: a bucket-based REST gateway, compatible with S3 and Swift
● HOST/VM → RBD: a reliable and fully distributed block device, with a Linux kernel client and a QEMU/KVM driver
● CLIENT → CEPH FS: a POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE
● RADOS: a reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes
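
To make the LIBRADOS layer concrete, here is a minimal sketch using the python-rados binding; the config file path, pool name, and object name are placeholders for illustration, not part of the original slides:

    import rados

    # Connect to the cluster (config path and pool name are placeholders).
    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('data')            # pool must already exist

    ioctx.write_full('hello_object', b'Hello, RADOS!')   # store an object
    print(ioctx.read('hello_object'))                    # read it back

    ioctx.close()
    cluster.shutdown()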
22. Why start with objects?
● more useful than (disk) blocks
  ● variable size
  ● names in a single flat namespace
  ● simple API with rich semantics
● more scalable than files
  ● no hard-to-distribute hierarchy
  ● update semantics do not span objects
  ● workload is trivially parallel
23. Ceph object model
● pools
  ● independent namespaces or object collections
  ● 1s to 100s
  ● replication level, placement policy
● objects (see the sketch after this list)
  ● bazillions
  ● blob of data (bytes to gigabytes)
  ● attributes (e.g., “version=12”; bytes to kilobytes)
  ● key/value bundle (bytes to gigabytes)
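
To make the three facets of an object concrete, here is a hedged python-rados sketch; the pool and object names are placeholders, and the omap calls assume a python-rados version recent enough to expose WriteOpCtx:

    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('data')        # a pool: namespace + replication/placement policy

    # 1) blob of data
    ioctx.write_full('img_12', b'...image bytes...')

    # 2) small attributes (xattrs)
    ioctx.set_xattr('img_12', 'version', b'12')

    # 3) key/value bundle (omap)
    with rados.WriteOpCtx() as op:
        ioctx.set_omap(op, ('owner', 'created'), (b'sage', b'2013-01-01'))
        ioctx.operate_write_op(op, 'img_12')

    ioctx.close()
    cluster.shutdown()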
25. Object Storage Daemons (OSDs)
● 10s to 10,000s in a cluster
● one per disk, SSD, or RAID group, or ...
  ● hardware agnostic
● serve stored objects to clients
● intelligently peer to perform replication and recovery tasks

Monitors (M)
● maintain cluster membership and state (a query sketch follows this list)
● provide consensus for distributed decision-making
● small, odd number
● these do not serve stored objects to clients
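
As an illustration, the cluster state kept by the monitors can be queried programmatically; this minimal sketch uses mon_command from python-rados, and the command and config path are assumptions for illustration:

    import json
    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()

    # Ask the monitors for the OSD tree; commands are JSON-encoded strings.
    ret, outbuf, errs = cluster.mon_command(
        json.dumps({'prefix': 'osd tree', 'format': 'json'}), b'')
    if ret == 0:
        print(json.loads(outbuf))

    cluster.shutdown()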
26. Data distribution
● all objects are replicated N times
● objects are automatically placed, balanced, migrated in a dynamic cluster
● must consider physical infrastructure
  ● ceph-osds on hosts in racks in rows in data centers
● three approaches
  ● pick a spot; remember where you put it
  ● pick a spot; write down where you put it
  ● calculate where to put it, where to find it (see the sketch below)
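
Ceph takes the third approach. As a deliberately simplified illustration of calculated placement (a toy stand-in, not the real CRUSH algorithm, which also descends a hierarchy of failure domains and respects device weights), any client can derive the same placement from the object name and the current OSD list, so nothing has to be remembered or looked up:

    import hashlib

    def place(object_name, osds, replicas=3):
        # Toy hash-based placement: map an object name deterministically to
        # `replicas` distinct OSDs. Real CRUSH also walks the infrastructure
        # hierarchy (hosts, racks, rows) and handles weights and failures.
        h = int(hashlib.md5(object_name.encode()).hexdigest(), 16)
        start = h % len(osds)
        return [osds[(start + i) % len(osds)] for i in range(replicas)]

    osds = ['osd.0', 'osd.1', 'osd.2', 'osd.3', 'osd.4']
    print(place('img_12', osds))   # every client computes the same answer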
37. [Ceph architecture diagram]
● APP → LIBRADOS: a library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP
● APP → RADOSGW: a bucket-based REST gateway, compatible with S3 and Swift
● HOST/VM → RBD: a reliable and fully distributed block device, with a Linux kernel client and a QEMU/KVM driver
● CLIENT → CEPH FS: a POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE
● RADOS: a reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes
38. RADOS Gateway
● REST-based object storage proxy
● uses RADOS to store objects
● API supports buckets, accounting
  ● usage accounting for billing purposes
● compatible with S3, Swift APIs (see the sketch below)
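
Because the gateway speaks the S3 API, any standard S3 client can talk to it. Here is a hedged Python sketch using boto3, where the endpoint URL, credentials, and bucket name are placeholders for illustration:

    import boto3

    # Point an ordinary S3 client at the RADOS Gateway endpoint (placeholder values).
    s3 = boto3.client(
        's3',
        endpoint_url='http://rgw.example.com:7480',
        aws_access_key_id='ACCESS_KEY',
        aws_secret_access_key='SECRET_KEY',
    )

    s3.create_bucket(Bucket='demo-bucket')
    s3.put_object(Bucket='demo-bucket', Key='hello.txt', Body=b'Hello from RGW')
    print(s3.get_object(Bucket='demo-bucket', Key='hello.txt')['Body'].read())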
39. RADOS Block Device
● storage of disk images in RADOS (a librbd sketch follows this list)
● decouple VM from host
● images striped across entire cluster (pool)
● snapshots
● copy-on-write clones
● support in
  ● Qemu/KVM, Xen
  ● mainline Linux kernel (2.6.34+)
  ● OpenStack, CloudStack
  ● Ganeti, OpenNebula
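
As an illustration, images can also be created and written through the librbd Python binding; the pool, image name, and size below are placeholders:

    import rados
    import rbd

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('rbd')                     # conventional RBD pool

    rbd.RBD().create(ioctx, 'vm-disk-1', 10 * 1024 ** 3)  # 10 GiB image
    with rbd.Image(ioctx, 'vm-disk-1') as image:
        image.write(b'...bootloader bytes...', 0)         # write at offset 0
        image.create_snap('base')                         # point-in-time snapshot

    ioctx.close()
    cluster.shutdown()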
40. Metadata Server (MDS)
● manages metadata for POSIX shared file system (see the example below)
  ● directory hierarchy
  ● file metadata (size, owner, timestamps)
● stores metadata in RADOS
● does not serve file data to clients
● only required for the shared file system
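
Since CephFS is POSIX-compliant, a client that has mounted it (via the kernel client or FUSE) just uses ordinary file system calls; the mount point below is a placeholder:

    import os

    # Assume CephFS is already mounted at /mnt/cephfs (kernel client or FUSE).
    path = '/mnt/cephfs/projects/readme.txt'
    os.makedirs(os.path.dirname(path), exist_ok=True)

    with open(path, 'w') as f:        # plain POSIX I/O: the MDS manages the namespace,
        f.write('hello cephfs\n')     # while file data goes directly to the OSDs

    print(os.stat(path).st_size)      # size, owner, timestamps come via the MDS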
43. Package Building
● GitBuilder
  ● Very basic tool, works well though
  ● Push to git → .deb && .rpm for all platforms
  ● Build packages / tarballs
  ● autosigned
● Pbuilder
  ● Used for major releases
  ● Clean room type of build
  ● Signed using different key
  ● As many distros as possible (all Debian, Ubuntu, Fedora 17 & 18, CentOS, SLES, OpenSUSE, tarball)
44. Teuthology
● Our own custom test framework
  ● Allocates the machines you define to the cluster you define
  ● Runs tasks against that environment
    ● Set up Ceph
    ● Run workloads
    ● Pull from SAMBA gitbuilder → mount Ceph → run NFS client → mount it → run workload
  ● Makes it easy to stack things up, map them to hosts, run
● Suite of tests on our github
● Active community development underway!
48. 4M Relative Performance
[Charts omitted. All workloads: QEMU/KVM or XFS. Read-heavy workloads only: kernel RBD, BTRFS, or EXT4.]
49. What does this mean?
● Anecdotal results are great!
  ● Concrete evidence coming soon
● Performance is complicated
  ● Workload
  ● Hardware (especially network!)
  ● Lots and lots of tuneables
● Mark Nelson (nhm on #ceph) is the performance geek
  ● Lots of reading on the ceph.com blog
● We welcome testing and comparison help
  ● Anyone want to help with an EMC / NetApp showdown?
51. Future plans
● Geo-Replication (an ongoing process): while the first pass for disaster recovery is done, we want to get to built-in, world-wide replication.
● Erasure Coding (reception efficiency): currently underway in the community!
● Tiering (headed to dynamic): can already do this in a static pool-based setup. Looking to get to a use-based migration.
● Governance (making it open-er): been talking about it forever. The time is coming!
52. Get Involved!
● CDS (quarterly online summit): the online summit puts the core devs together with the Ceph community.
● Ceph Day (not just for NYC): more planned, including Santa Clara and London. Keep an eye out: http://inktank.com/cephdays/
● IRC (geek-on-duty): during the week there are times when Ceph experts are available to help. Stop by oftc.net/ceph
● Lists (e-mail makes the world go): our mailing lists are very active; check out ceph.com for details on how to join in!
53. E-MAIL: patrick@inktank.com
WEBSITE: Ceph.com
SOCIAL: @scuttlemonkey, @ceph, Facebook.com/cephstorage
Questions?