"Scaling Storage with Ceph", Ross Turk, VP of Community, Inktank
Ceph is an open source distributed object store, network block device, and file system designed for reliability, performance, and scalability. It runs on commodity hardware, has no single point of failure, and is supported by the Linux kernel. This talk will describe the Ceph architecture, share its design principles, and discuss how it can be part of a cost-effective, reliable cloud stack.
1. SCALING STORAGE WITH CEPH
Ross Turk, Inktank
4. APP, APP, HOST/VM, CLIENT
§ LIBRADOS: A library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP
§ RADOSGW: A bucket-based REST gateway, compatible with S3 and Swift
§ RBD: A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver
§ CEPH FS: A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE
§ RADOS: A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes
5. IN THE BEGINNING
Magic Madzik, Flickr / CC BY 2.0
6. EARLY INFORMATION STORAGE
Chico.Ferreira, Flickr / CC BY 2.0
7. WRITING > CAVE PAINTINGS
kevingessner, Flickr / CC BY-SA 2.0
24. [Diagram: entries labeled aa, ab, ac, ba, bb, bc, da, db, dc, shown alongside short binary strings]
25. WE OUTGROW THE HARD DRIVE
Mr. T in DC, Flickr / CC BY 2.0
26. [Diagram: one HUMAN, one COMPUTER, seven DISKs]
27. PEOPLE NEED SIMULTANEOUS ACCESS
wFourier, Flickr / CC BY 2.0
28. [Diagram: three HUMANs, one COMPUTER, seven DISKs]
29. [Diagram: a crowd of HUMANs, one (COMPUTER), twelve DISKs]
(actually more like this…)
30. [Diagram: three HUMANs, twelve COMPUTER/DISK pairs]
31. [Diagram: the labeled entries and binary strings from slide 24, marked with an X]
33. [Diagram: one APP, twelve COMPUTER/DISK pairs]
34. [Diagram: one COMPUTER in front of a grid of COMPUTER/DISK pairs]
35. [Diagram: three VMs, twelve COMPUTER/DISK pairs]
36. STORAGE THROUGHOUT HISTORY
[Timeline, most recent first: Ceph, Cloud computing, Distributed storage, Shared storage, Computers, Writing, Painting]
Time-scale: Roughly logarithmic. Content: Whatever the opposite of “scientific” is.
37. [Diagram: three HUMANs, twelve COMPUTER/DISK pairs]
38. [Diagram: twelve COMPUTER/DISK pairs]
39. [Diagram: twelve C/D pairs (COMPUTER/DISK, abbreviated)]
40. [Diagram: three HUMANs, twelve C/D pairs]
41. STORAGE APPLIANCES
Michael Moll, Wikipedia / CC BY-SA 2.0
42. 6.4 MILLION SQ FT OF FACTORIES
Dude94111, Flickr / CC BY 2.0
43. STORAGE VENDORS HAVE BIG BILLS
CarbonNYC, Flickr / CC BY 2.0
44. STORAGE APPLIANCES ARE EXPENSIVE
401K 2012, Flickr / CC BY-SA 2.0
45. TECHNOLOGY IS A COMMODITY
RaeAllen, Flickr / CC BY 2.0
46. COMMODITY PRICES FLUCTUATE
[Chart: x-axis from May-07 to May-12]
47. GROWING WITH HARDWARE APPLIANCES
First PB:
§ Proprietary storage hardware
§ Well-known storage vendor
§ $14 b’zillion
Second PB:
§ Proprietary storage hardware
§ Same storage vendor
§ Another $14 b’zillion
48. APPLIANCES ARE OLD TECHNOLOGY
Paul Keller, Flickr / CC BY 2.0
78. NEW MONTHLY CODE COMMITS
[Chart: y-axis from 0 to 700 commits, x-axis from 2004-06 to 2011-07]
79. CEPH STARTS POPPING UP!
(sorry about all the logo tampering)
82. [Diagram: five OSDs, each running on a file system (btrfs, xfs, or ext4) on top of its own DISK, alongside three monitors (M)]
84. Monitors (M):
§ Maintain the cluster map
§ Provide consensus for distributed decision-making
§ Must have an odd number
§ Do not serve stored objects to clients
OSDs:
§ One per disk (recommended)
§ At least three in a cluster
§ Serve stored objects to clients
§ Intelligently peer to perform replication tasks
§ Support object classes
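One way to see why monitors are deployed in odd numbers is to work through the majority arithmetic. The sketch below assumes the monitors decide by strict majority vote (Ceph's monitors use a Paxos-based protocol); the function names are hypothetical and chosen only for illustration.

```python
def quorum_size(monitors: int) -> int:
    """Smallest strict majority of `monitors` voters."""
    if monitors < 1:
        raise ValueError("need at least one monitor")
    return monitors // 2 + 1


def tolerated_failures(monitors: int) -> int:
    """How many monitors can be lost while a majority can still form."""
    return monitors - quorum_size(monitors)


if __name__ == "__main__":
    # Print the quorum size and failure tolerance for small clusters.
    for n in range(1, 8):
        print(f"{n} monitors: quorum of {quorum_size(n)}, "
              f"tolerates {tolerated_failures(n)} failure(s)")
```

Under this assumption, four monitors tolerate no more failures than three do, so the extra even-numbered monitor adds cost without adding resilience.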
87. LIBRADOS
§ Provides direct access to RADOS for applications
§ C, C++, Python, PHP, Java
§ No HTTP overhead
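As a rough illustration of direct access from an application, here is a minimal sketch using the Python binding (`rados`). The function name `round_trip`, the pool and object names in the usage note, and the configuration path are assumptions made for the example; running it requires a reachable cluster and valid credentials.

```python
def round_trip(pool, key, data, conffile="/etc/ceph/ceph.conf"):
    """Write `data` (bytes) to object `key` in `pool`, then read it back."""
    # Imported lazily so this module can be loaded on machines
    # that do not have the Ceph Python binding installed.
    import rados

    cluster = rados.Rados(conffile=conffile)
    cluster.connect()
    try:
        ioctx = cluster.open_ioctx(pool)  # I/O context bound to one pool
        try:
            ioctx.write_full(key, data)   # replace the whole object
            return ioctx.read(key)        # reads up to the binding's default length
        finally:
            ioctx.close()
    finally:
        cluster.shutdown()
```

A call such as `round_trip("data", "greeting", b"hello")` would talk to the cluster over its native protocol, with no HTTP gateway in between; the pool name `data` is hypothetical and must already exist.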
89. [Diagram: two APPs speak REST to two RADOSGW instances, each of which uses LIBRADOS to talk natively to the cluster and its monitors (M)]
90. RADOS Gateway:
§ REST-based interface to RADOS
§ Supports buckets, accounting
§ Compatible with S3 and Swift applications
95. RADOS Block Device:
§ Storage of virtual disks in RADOS
§ Allows decoupling of VMs and containers
§ Live migration!
§ Images are striped across the cluster
§ Boot support in QEMU, KVM, and OpenStack Nova
§ Mount support in the Linux kernel
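The idea that images are striped across the cluster can be sketched as simple fixed-size chunking: the image is cut into equal objects, and a byte offset determines which object holds it. The 4 MiB object size and the function names below are assumptions for illustration, not a description of the actual RBD implementation.

```python
from typing import List, Tuple

OBJECT_SIZE = 4 * 1024 * 1024  # assumed object size of 4 MiB


def locate(offset: int, object_size: int = OBJECT_SIZE) -> Tuple[int, int]:
    """Map a byte offset in the image to (object index, offset inside that object)."""
    if offset < 0:
        raise ValueError("offset must be non-negative")
    return divmod(offset, object_size)


def objects_touched(offset: int, length: int,
                    object_size: int = OBJECT_SIZE) -> List[int]:
    """List the object indices covered by a read or write of `length` bytes."""
    if length <= 0:
        return []
    first, _ = locate(offset, object_size)
    last, _ = locate(offset + length - 1, object_size)
    return list(range(first, last + 1))
```

Because each object can be placed on a different storage node, a large sequential read touches many nodes rather than a single disk, which is what makes the image independent of any one host.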
98. Metadata Server:
§ Manages metadata for a POSIX-compliant shared filesystem
§ Directory hierarchy
§ File metadata (owner, timestamps, mode, etc.)
§ Stores metadata in RADOS
§ Does not serve file data to clients
§ Only required for shared filesystem
132. [Architecture diagram from slide 4, with a status label stamped on each of the five components: four read AWESOME and one reads NEARLY AWESOME]
134. CEPH AND CLOUDSTACK
tableatny, Flickr / CC BY 2.0
135. RBD SUPPORT IN CLOUDSTACK
§ Just announced two weeks ago!
§ Allows storage of virtual disks inside RADOS
§ Works with KVM only right now
§ No volume snapshots yet
§ Requires the latest version of, um, everything
§ More information can be found on the mailing list (ceph-devel / incubator-cloudstack-dev):
http://article.gmane.org/gmane.comp.file-systems.ceph.devel/7505