This document introduces Ceph storage. It discusses the evolution of storage systems from individual disks to clustered storage. Ceph is introduced as a clustered, software-defined storage system that provides object, block, and file storage. Key Ceph components like RADOS, monitors, OSDs, and the CRUSH algorithm are explained. CRUSH provides a scalable and reliable way to determine object locations across multiple nodes. Python and command line interfaces for Ceph are also summarized. Finally, Yahoo's Ceph cluster architecture is briefly described.
10. Storage Cluster - 1
[Diagram: Client → Controller → two Storage nodes]
1. The client sends data to the controller.
2. The controller stores the data on the storage nodes.
11. Storage Cluster - 2
[Diagram: Client ↔ Controller; Client → two Storage nodes]
1. The client asks the controller where the data should be stored.
2. The client stores the data on the storage nodes directly.
12. Storage Cluster - 3
[Diagram: Client ↔ Monitor; Client → Storage nodes]
1. The client gets the cluster information from the monitor.
2. The client computes where the data should be placed based on the cluster information.
3. The client stores the data on the storage nodes directly.
13. Ceph
● A clustered storage system
● Software defined storage
○ Cost-performance tradeoff
○ Flexible interfaces
○ Different storage abstractions
14. Ceph
● A distributed object store and file system
designed to provide excellent performance,
reliability and scalability.
● Open source and freely-available, and it
always will be.
● Object Storage (rados)
● Block Storage (rbd)
● File System (cephfs)
15. ● Object Storage:
○ You get/put objects by key through the
interface the object store provides.
○ Example: S3
● Block Storage:
○ Block storage provides a virtual disk that
behaves just like a real disk.
● File System:
○ Just like a NAS
16. Ceph Motivating Principles
● Everything must scale horizontally
● No single point of failure
● Commodity hardware
● Self-manage whenever possible
17. Ceph Architectural Features
● Object locations get computed.
○ CRUSH algorithm
○ Ceph OSD Daemon uses it to compute where
replicas of objects should be stored (and for
rebalancing).
○ Ceph clients use the CRUSH algorithm to efficiently
compute object locations themselves.
● No centralized interface.
● OSDs serve clients directly
21. RADOS
● ceph-osd:
○ Stores objects on a local file system
○ Provides access to objects over the network to
clients directly
○ Uses the CRUSH algorithm to determine where objects belong
○ One per disk (or RAID group)
○ Peering
○ Checks its own state and the state of other OSDs
and reports back to the monitors
22. RADOS
● librados:
○ The client retrieves the latest copy of the cluster map,
so it knows about all of the monitors, OSDs, and
metadata servers in the cluster
○ The client uses CRUSH and the cluster map to compute
where the object lives
○ Accesses the OSDs directly (see the sketch below)
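As a concrete illustration of this flow, here is a minimal sketch with the python-rados bindings; the conffile path and the pool name "bar" are assumptions, not from the slides.

    import rados

    # Connect and pull the latest cluster map from the monitors.
    cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
    cluster.connect()

    # Open an I/O context bound to the (assumed) pool "bar".
    ioctx = cluster.open_ioctx("bar")

    # CRUSH decides which OSDs hold "foo"; the client talks to them directly.
    ioctx.write_full("foo", b"hello ceph")
    print(ioctx.read("foo"))

    ioctx.close()
    cluster.shutdown()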
23. CRUSH
● Controlled Replication Under Scalable
Hashing
● A hash-based placement algorithm
● Generates positions based on:
○ PG (placement group)
○ cluster map
○ rule set
25. How to Decide Where to Put the Object?
● Way 1: look up a table
○ Easy to implement
○ Hard to scale horizontally
key1 → node1
key2 → node1
key3 → node2
key4 → node1
26. How to Decide Where to Put the Object?
● Way 2: hash
○ Easy to implement
○ But too much data movement when rebalancing
(illustrated in the sketch below)
Before: A: 0~33, B: 33~66, C: 66~99
After adding node D: A: 0~25, B: 25~50, D: 50~75, C: 75~100
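A toy illustration (plain Python, not Ceph code) of why "hash(key) % n" placement rebalances badly: changing the node count changes the modulus and remaps most keys.

    # How many objects move when a 3-node cluster grows to 4?
    keys = [f"key{i}" for i in range(10000)]

    before = {k: hash(k) % 3 for k in keys}   # nodes A, B, C
    after = {k: hash(k) % 4 for k in keys}    # node D added

    moved = sum(before[k] != after[k] for k in keys)
    print(f"{moved / len(keys):.0%} of objects change node")  # roughly 75%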
28. How to Decide Where to Put the Object?
● Way 3: hash with a static table
○ Look up a table after hashing
○ Used by OpenStack Swift (rough sketch below)
[Diagram: data1~data5 → hash → virtual partitions 1~4 → map → node1, node2, node3]
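A rough sketch of the idea, not Swift's actual ring format: the hash picks a virtual partition, and a small static table maps partitions to nodes, so a rebalance only edits the table. The partition table and object names below are hypothetical.

    import hashlib

    # Hypothetical mapping of 4 virtual partitions to 3 nodes.
    partition_map = {0: "node1", 1: "node2", 2: "node3", 3: "node1"}

    def place(object_name: str) -> str:
        h = int(hashlib.md5(object_name.encode()).hexdigest(), 16)
        partition = h % len(partition_map)   # hash picks the virtual partition
        return partition_map[partition]      # the static table picks the node

    print(place("data1"), place("data4"))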
29. How to Decide Where to Put the Object?
● Way 4: CRUSH
○ Fast calculation, no lookup
○ Repeatable, deterministic
○ Statistically uniform distribution
○ Stable mapping
○ Rule-based configuration
30. Why We Need Placement Group
● A layer of indirection between the Ceph
OSD Daemon and the Ceph Client.
○ Decouples OSDs and clients.
○ Easy to rebalance.
32. How to Compute Object Location
object id: foo
pool: bar
hash(“foo”) % 256 = 0x23
“bar” => pool id 3
Placement Group: 3.23
(see OSDMap.h)
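To make the object-to-PG step concrete, here is a toy version in Python. Ceph's real implementation (see OSDMap.h) uses its own rjenkins hash; MD5 is substituted here purely for illustration, so the computed PG seed will differ from the slide's 0x23.

    import hashlib

    def object_to_pg(pool_id: int, object_name: str, pg_num: int = 256) -> str:
        # hash("foo") % pg_num gives the PG seed (0x23 on the slide).
        h = int(hashlib.md5(object_name.encode()).hexdigest(), 16)
        pg_seed = h % pg_num
        # The pool name maps to a numeric pool id ("bar" => 3 on the slide).
        return f"{pool_id}.{pg_seed:x}"

    print(object_to_pg(3, "foo"))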
33. How to Compute Object Location
[Diagram: Placement Group 3.23 → CRUSH → OSD, OSD, OSD]
34. How Does the Client Access an Object?
[Diagram: Client ↔ Monitor; Client → Storage nodes]
1. The client gets the cluster information from the monitor.
2. The client computes where the data should be placed based on the cluster information.
3. The client stores the data on the storage nodes directly.
36. How Does the Client Read an Object?
● Read from the primary OSD, or
● Send reads to all replicas; the quickest
reply "wins" and the others are ignored.
[Diagram: Client → Primary OSD; Client → OSD, OSD, OSD]
37. Rados Gateway
● HTTP REST gateway for the RADOS object
store
● Rich APIs (example below)
○ S3 API
○ Swift API
● Integrates with OpenStack Keystone
● Stateless
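Because radosgw exposes an S3-compatible API, ordinary S3 clients work against it. A hedged sketch with boto3; the endpoint URL, bucket name, and credentials are placeholders.

    import boto3

    # Point a standard S3 client at the (placeholder) RGW endpoint.
    s3 = boto3.client(
        "s3",
        endpoint_url="http://rgw.example.com:7480",
        aws_access_key_id="ACCESS_KEY",
        aws_secret_access_key="SECRET_KEY",
    )

    s3.create_bucket(Bucket="demo")
    s3.put_object(Bucket="demo", Key="foo", Body=b"hello via rgw")
    print(s3.get_object(Bucket="demo", Key="foo")["Body"].read())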
40. Why Do We Need radosgw?
● We want to use a RESTful API
○ S3 API
○ Swift API
● We don't want outside clients to have to know the
cluster status.
41. RBD
● RADOS Block Devices
○ Provides a virtual disk that behaves just like a real disk
● Image striping (handled by librbd; see the sketch below)
● Integrates with the Linux kernel and KVM
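A minimal sketch with the python-rbd bindings: create an image in an assumed pool "rbd" and read/write it at byte offsets, with librbd striping the image across RADOS objects behind this interface. The pool and image names are placeholders.

    import rados
    import rbd

    cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
    cluster.connect()
    ioctx = cluster.open_ioctx("rbd")            # assumed pool name

    rbd.RBD().create(ioctx, "disk1", 1024 ** 3)  # 1 GiB virtual disk
    with rbd.Image(ioctx, "disk1") as image:
        image.write(b"hello block device", 0)    # random access by byte offset
        print(image.read(0, 18))

    ioctx.close()
    cluster.shutdown()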
42. Why Does RBD Need to Stripe Images?
● Avoids one huge object per image
● Parallelism
● Random access
[Table: offset ranges of a striped image mapped to OSDs]
           0~2500  2500~5000  5000~7500  7500~10000
01XXXX     OSD1    OSD3       OSD4       OSD6
02XXXX     OSD8    OSD2       OSD3       OSD5
03XXXX     OSD1    OSD6       OSD2       OSD3
[Diagram: librbd → OSD1, OSD4, OSD3]
43. Does librados Support Striping?
● No, librados doesn't support striping.
● But you can use libradosstriper
○ Poorly documented
44. OpenStack Uses RBD
● You can use RBD with Nova, Glance, or
Cinder.
● Cinder uses RBD to provide volumes.
● Glance uses RBD to store images.
● Why does Glance use RBD instead of librados or
radosgw?
○ Copy-on-write (see the sketch below)
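The copy-on-write path Glance benefits from can be sketched with the python-rbd bindings: snapshot a parent image, protect the snapshot, and clone it. The pool and image names are placeholders, and the parent image needs the layering feature enabled.

    import rados
    import rbd

    cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
    cluster.connect()
    ioctx = cluster.open_ioctx("rbd")                  # assumed pool name

    with rbd.Image(ioctx, "golden-image") as parent:   # placeholder parent image
        parent.create_snap("base")
        parent.protect_snap("base")                    # clones require a protected snapshot

    # The clone shares the parent's data until blocks are overwritten (copy-on-write).
    rbd.RBD().clone(ioctx, "golden-image", "base", ioctx, "vm-disk-1")

    ioctx.close()
    cluster.shutdown()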