This document introduces Ceph storage. It discusses the evolution of storage systems from individual disks to clustered storage. Ceph is introduced as a clustered, software-defined storage system that provides object, block, and file storage. Key Ceph components like RADOS, monitors, OSDs, and the CRUSH algorithm are explained. CRUSH provides a scalable and reliable way to determine object locations across multiple nodes. Python and command line interfaces for Ceph are also summarized. Finally, Yahoo's Ceph cluster architecture is briefly described.
10. Storage Cluster - 1
[Diagram: Client → Controller → two Storage nodes]
1. The client sends data to the controller.
2. The controller stores the data on the storage nodes.
11. Storage Cluster - 2
[Diagram: Client ↔ Controller; Client → two Storage nodes]
1. The client asks the controller where the data should be stored.
2. The client stores the data on the storage nodes directly.
12. Storage Cluster - 3
[Diagram: Client ↔ Monitor; Client → Storage nodes]
1. The client gets the cluster information from the monitor.
2. The client computes where the data should be placed based on the cluster information.
3. The client stores the data on the storage nodes directly.
13. Ceph
● A clustered storage system
● Software defined storage
○ Cost-performance tradeoff
○ Flexible interfaces
○ Different storage abstractions
14. Ceph
● A distributed object store and file system
designed to provide excellent performance,
reliability and scalability.
● Open source and freely-available, and it
always will be.
● Object Storage (rados)
● Block Storage (rbd)
● File System (cephfs)
15. ● Object Storage:
○ You get/put objects by key through the
interface the object store provides.
○ Example: S3
● Block Storage:
○ Block storage provides a virtual disk that
behaves just like a real disk.
● File System:
○ Just like a NAS
16. Ceph Motivating Principles
● Everything must scale horizontally
● No single point of failure
● Commodity hardware
● Self-manage whenever possible
17. Ceph Architectural Features
● Object locations get computed.
○ CRUSH algorithm
○ Ceph OSD Daemon uses it to compute where
replicas of objects should be stored (and for
rebalancing).
○ Ceph clients use the CRUSH algorithm to efficiently
compute object locations themselves.
● No centralized interface.
● OSDs serve clients directly
21. RADOS
● ceph-osd:
○ Stores objects on a local file system
○ Provides access to objects over the network to
clients directly
○ Uses the CRUSH algorithm to determine where objects belong
○ One per disk (or RAID group)
○ Peering
○ Checks its own state and the state of other OSDs
and reports back to the monitors
22. RADOS
● librados:
○ The client retrieves the latest copy of the cluster map,
so it knows about all of the monitors, OSDs, and
metadata servers in the cluster
○ The client uses CRUSH and the cluster map to compute
where the object lives
○ Accesses the OSDs directly (see the sketch below)
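As a concrete illustration of this flow, here is a minimal sketch with the python-rados bindings; the conffile path and the pool name "bar" are assumptions, not from the slides.

    import rados

    # Connect and pull the latest cluster map from the monitors.
    cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
    cluster.connect()

    # Open an I/O context bound to the (assumed) pool "bar".
    ioctx = cluster.open_ioctx("bar")

    # CRUSH decides which OSDs hold "foo"; the client talks to them directly.
    ioctx.write_full("foo", b"hello ceph")
    print(ioctx.read("foo"))

    ioctx.close()
    cluster.shutdown()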
23. CRUSH
● Controlled Replication Under Scalable
Hashing
● A hash-based placement algorithm
● Generates positions based on:
○ PG (placement group)
○ cluster map
○ rule set
25. How to Decide Where to Put the Object?
● Way 1: look up a table
○ Easy to implement
○ Hard to scale horizontally
key1 → node1
key2 → node1
key3 → node2
key4 → node1
26. How to Decide Where to Put the Object?
● Way 2: hash
○ Easy to implement
○ But too much data movement when rebalancing
(illustrated in the sketch below)
Before: A: 0~33, B: 33~66, C: 66~99
After adding node D: A: 0~25, B: 25~50, D: 50~75, C: 75~100
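A toy illustration (plain Python, not Ceph code) of why "hash(key) % n" placement rebalances badly: changing the node count changes the modulus and remaps most keys.

    # How many objects move when a 3-node cluster grows to 4?
    keys = [f"key{i}" for i in range(10000)]

    before = {k: hash(k) % 3 for k in keys}   # nodes A, B, C
    after = {k: hash(k) % 4 for k in keys}    # node D added

    moved = sum(before[k] != after[k] for k in keys)
    print(f"{moved / len(keys):.0%} of objects change node")  # roughly 75%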
28. How to Decide Where to Put the Object?
● Way 3: hash with a static table
○ Look up a table after hashing
○ Used by OpenStack Swift (rough sketch below)
[Diagram: data1~data5 → hash → virtual partitions 1~4 → map → node1, node2, node3]
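A rough sketch of the idea, not Swift's actual ring format: the hash picks a virtual partition, and a small static table maps partitions to nodes, so a rebalance only edits the table. The partition table and object names below are hypothetical.

    import hashlib

    # Hypothetical mapping of 4 virtual partitions to 3 nodes.
    partition_map = {0: "node1", 1: "node2", 2: "node3", 3: "node1"}

    def place(object_name: str) -> str:
        h = int(hashlib.md5(object_name.encode()).hexdigest(), 16)
        partition = h % len(partition_map)   # hash picks the virtual partition
        return partition_map[partition]      # the static table picks the node

    print(place("data1"), place("data4"))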
29. How to Decide Where to Put the Object?
● Way 4: CRUSH
○ Fast calculation, no lookup
○ Repeatable, deterministic
○ Statistically uniform distribution
○ Stable mapping
○ Rule-based configuration
30. Why We Need Placement Group
● A layer of indirection between the Ceph
OSD Daemon and the Ceph Client.
○ Decouples OSDs and clients.
○ Easy to rebalance.
32. How to Compute Object Location
object id: foo
pool: bar
hash(“foo”) % 256 = 0x23
“bar” => pool id 3
Placement Group: 3.23
(see OSDMap.h)
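To make the object-to-PG step concrete, here is a toy version in Python. Ceph's real implementation (see OSDMap.h) uses its own rjenkins hash; MD5 is substituted here purely for illustration, so the computed PG seed will differ from the slide's 0x23.

    import hashlib

    def object_to_pg(pool_id: int, object_name: str, pg_num: int = 256) -> str:
        # hash("foo") % pg_num gives the PG seed (0x23 on the slide).
        h = int(hashlib.md5(object_name.encode()).hexdigest(), 16)
        pg_seed = h % pg_num
        # The pool name maps to a numeric pool id ("bar" => 3 on the slide).
        return f"{pool_id}.{pg_seed:x}"

    print(object_to_pg(3, "foo"))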
33. How to Compute Object Location
[Diagram: Placement Group 3.23 → CRUSH → OSD, OSD, OSD]
34. How Does the Client Access an Object?
[Diagram: Client ↔ Monitor; Client → Storage nodes]
1. The client gets the cluster information from the monitor.
2. The client computes where the data should be placed based on the cluster information.
3. The client stores the data on the storage nodes directly.
36. How Does the Client Read an Object?
● Read from the primary OSD, or
● Send reads to all replicas; the quickest
reply "wins" and the others are ignored.
[Diagram: Client → Primary OSD; Client → OSD, OSD, OSD]
37. Rados Gateway
● HTTP REST gateway for the RADOS object
store
● Rich APIs (example below)
○ S3 API
○ Swift API
● Integrates with OpenStack Keystone
● Stateless
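Because radosgw exposes an S3-compatible API, ordinary S3 clients work against it. A hedged sketch with boto3; the endpoint URL, bucket name, and credentials are placeholders.

    import boto3

    # Point a standard S3 client at the (placeholder) RGW endpoint.
    s3 = boto3.client(
        "s3",
        endpoint_url="http://rgw.example.com:7480",
        aws_access_key_id="ACCESS_KEY",
        aws_secret_access_key="SECRET_KEY",
    )

    s3.create_bucket(Bucket="demo")
    s3.put_object(Bucket="demo", Key="foo", Body=b"hello via rgw")
    print(s3.get_object(Bucket="demo", Key="foo")["Body"].read())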
40. Why Do We Need radosgw?
● We want to use a RESTful API
○ S3 API
○ Swift API
● We don't want outside clients to have to know the
cluster status.
41. RBD
● RADOS Block Devices
○ Provides a virtual disk that behaves just like a real disk
● Image striping (handled by librbd; see the sketch below)
● Integrates with the Linux kernel and KVM
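A minimal sketch with the python-rbd bindings: create an image in an assumed pool "rbd" and read/write it at byte offsets, with librbd striping the image across RADOS objects behind this interface. The pool and image names are placeholders.

    import rados
    import rbd

    cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
    cluster.connect()
    ioctx = cluster.open_ioctx("rbd")            # assumed pool name

    rbd.RBD().create(ioctx, "disk1", 1024 ** 3)  # 1 GiB virtual disk
    with rbd.Image(ioctx, "disk1") as image:
        image.write(b"hello block device", 0)    # random access by byte offset
        print(image.read(0, 18))

    ioctx.close()
    cluster.shutdown()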
42. Why Does RBD Need to Stripe Images?
● Avoids one huge object per image
● Parallelism
● Random access
[Table: offset ranges of a striped image mapped to OSDs]
           0~2500  2500~5000  5000~7500  7500~10000
01XXXX     OSD1    OSD3       OSD4       OSD6
02XXXX     OSD8    OSD2       OSD3       OSD5
03XXXX     OSD1    OSD6       OSD2       OSD3
[Diagram: librbd → OSD1, OSD4, OSD3]
43. Does librados Support Striping?
● No, librados doesn't support striping.
● But you can use libradosstriper
○ Poorly documented
44. OpenStack Uses RBD
● You can use RBD with Nova, Glance, or
Cinder.
● Cinder uses RBD to provide volumes.
● Glance uses RBD to store images.
● Why does Glance use RBD instead of librados or
radosgw?
○ Copy-on-write (see the sketch below)
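The copy-on-write path Glance benefits from can be sketched with the python-rbd bindings: snapshot a parent image, protect the snapshot, and clone it. The pool and image names are placeholders, and the parent image needs the layering feature enabled.

    import rados
    import rbd

    cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
    cluster.connect()
    ioctx = cluster.open_ioctx("rbd")                  # assumed pool name

    with rbd.Image(ioctx, "golden-image") as parent:   # placeholder parent image
        parent.create_snap("base")
        parent.protect_snap("base")                    # clones require a protected snapshot

    # The clone shares the parent's data until blocks are overwritten (copy-on-write).
    rbd.RBD().clone(ioctx, "golden-image", "base", ioctx, "vm-disk-1")

    ioctx.close()
    cluster.shutdown()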