Build a High-Performance and High-Durable
Block Storage Service Based on Ceph
Rongze & Haomai 
[ rongze | haomai ]@unitedstack.com
CONTENTS 
Block Storage Service 
1 
3 
High Durability 
2 
High Performance 
4 
Operation Experience
THE FIRST PART 
01 Block Storage 
Service
Block Storage Service Highlight 
• 6000 IOPS, 170 MB/s, 95% of requests < 2 ms (SLA)
• 3 copies, strong consistency, 99.99999999% durability
• All management ops in seconds 
• Real-time snapshot 
• Performance volume type and capacity volume type 
Software used 
OpenStack   Essex    Folsom    Havana            Icehouse/Juno     Juno (Now)
Ceph        0.42     0.67.2    based on 0.67.5   based on 0.67.5   based on 0.80.7
CentOS      6.4      6.5       6.5               6.6
Qemu        0.12     0.12      based on 1.2      based on 1.2      2.0
Kernel      2.6.32   3.12.21   3.12.21           ?
Deployment Architecture 
minimum deployment 
12 OSD nodes and 3 monitor nodes
[Diagram: 12 Compute/Storage nodes attached to a pair of 40 Gb switches]
Scale-out 
the minimum scale deployment: 12 OSD nodes
[Diagram: 12 Compute/Storage nodes on a pair of 40 Gb switches]
[Diagram: scale-out to 144 OSD nodes — more groups of Compute/Storage nodes, each group behind its own pair of 40 Gb switches]
OpenStack
Boot Storm
[Diagram: Nova VMs booting from a 20 GB Glance image; Cinder backends (LVM, SAN, Ceph, LocalFS, NFS, GlusterFS) fetch it from Glance backends (LocalFS, Swift, Ceph, GlusterFS) over HTTP]
1 Gb network: 20 GB / 100 MB/s = 200 s ≈ 3 mins
10 Gb network: 20 GB / 1000 MB/s = 20 s
Nova, Glance and Cinder use the same Ceph pool
All actions complete in seconds
No boot storm
[Diagram: Nova VMs, Cinder and Glance all backed by one Ceph cluster]
QoS 
• Nova 
• Libvirt 
• Qemu(throttle) 
Two Volume Types 
• Cinder multi-backend 
• Ceph SSD Pool 
• Ceph SATA Pool 
Shared Volume 
• Read Only 
• Multi-attach
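A hedged sketch of how the two volume types and the front-end QoS could be wired together with the Cinder CLI; the backend names (ceph-ssd / ceph-sata) and the exact QoS values are assumptions based on the figures earlier in the talk:

# two volume types via Cinder multi-backend
cinder type-create performance
cinder type-key performance set volume_backend_name=ceph-ssd
cinder type-create capacity
cinder type-key capacity set volume_backend_name=ceph-sata
# front-end QoS, enforced by Nova/libvirt through the Qemu throttle
cinder qos-create performance-qos consumer=front-end total_iops_sec=6000 total_bytes_sec=178257920
cinder qos-associate <qos-spec-id> <performance-type-id>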
THE SECOND PART 
02 High 
Performance
OS configure
• CPU:
• Get the CPU out of power-save mode:
• echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor > /dev/null
• Cgroup:
• Bind ceph-osd processes to fixed cores (1-2 cores per OSD)
• Memory:
• Turn off NUMA (if the platform supports it) in /etc/grub.conf
• Set vm.swappiness = 0
• Block:
• echo deadline > /sys/block/sd[x]/queue/scheduler
• FileSystem:
• Mount with “noatime, nobarrier”
(These settings are consolidated in the sketch below.)
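A minimal consolidated sketch of the settings above, assuming the OSD data disks are sdb-sdd, the OSD mount point is /var/lib/ceph/osd/ceph-0 and cores 2-3 are reserved for osd.0; adjust to the real layout:

# CPU: leave power-save mode
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor > /dev/null
# Memory: avoid swapping (NUMA itself is disabled on the kernel command line via /etc/grub.conf)
sysctl -w vm.swappiness=0
# Block: deadline elevator on the OSD data disks (device names assumed)
for dev in sdb sdc sdd; do echo deadline > /sys/block/$dev/queue/scheduler; done
# FileSystem: mount OSD data with noatime,nobarrier (path assumed)
mount -o noatime,nobarrier /dev/sdb1 /var/lib/ceph/osd/ceph-0
# Cgroup/affinity: pin each ceph-osd to 1-2 fixed cores (pid lookup pattern assumed)
taskset -pc 2,3 $(pgrep -f 'ceph-osd -i 0')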
Qemu
• Throttle: Smooth IO limit algorithm(backport) 
• RBD enhance: Discard and flush enhance(backport) 
• Burst: 
• Virt-scsi: Multi-queue support
Ceph IO Stack
[Diagram: Qemu-to-OSD data flow — Qemu side: VCPU thread, Qemu main thread, RadosClient finisher; network: Pipe::Writer, Pipe::Reader, DispatchThreader; OSD side: OSD::OpWQ, FileJournal::Writer, FileJournal finisher, FileStore::OpWQ, FileStore::SyncThread, FileStore ondisk/op finishers]
RBD Image → RADOS objects → object files
Ceph Optimization
Rule 1: Keep FD 
• Facts:
• FileStore bottleneck: performance degrades remarkably when the FD cache misses
• SSD = 480 GB = 122880 objects (4 MB) = 30720 objects (16 MB), in theory
• Action: 
• Increase FDCache/OMapHeader to very large to hold all objects 
• Increase object size to 16MB instead of 4MB(default) 
• Improve default OS fd limits 
• Configuration: 
• “filestore_omap_header_cache_size” 
• “filestore_fd_cache_size” 
• “rbd_chunk_size”(OpenStack Cinder)
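A hedged ceph.conf sketch for this rule; the cache sizes are illustrative (they only need to exceed the ~30720 16 MB objects of a 480 GB SSD), and the fd-limit line assumes limits.conf is how the OS limit is raised:

cat >> /etc/ceph/ceph.conf <<'EOF'
[osd]
filestore_omap_header_cache_size = 40960
filestore_fd_cache_size = 40960
EOF
# raise the OS fd limit for the OSD processes as well
echo 'root - nofile 131072' >> /etc/security/limits.conf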
Rule 2: Sparse Read/Write 
• Facts:
• Often only a few KB of data exists in an object for RBD usage
• Clone/Recovery copies the full object, which is harmful to performance and capacity
• Action:
• Use sparse read/write
• Problem:
• XFS and other local filesystems have existing bugs in fiemap
• Configuration: 
• “filestore_fiemap=true”
Rule 3: Drop default limits 
• Facts:
• The default configuration values are tuned for HDD backends
• Action:
• Change all throttle-related configuration values
• Configuration: 
• “filestore_wbthrottle_*” 
• “filestore_queue_*” 
• “journal_queue_*” 
• “…” More related configs(recovery, scrub)
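A hedged sketch of the kind of overrides this rule implies; the option names are real FileStore/journal throttles, but the values are illustrative, not the talk's:

cat >> /etc/ceph/ceph.conf <<'EOF'
[osd]
filestore_queue_max_ops = 5000
filestore_queue_max_bytes = 1048576000
journal_queue_max_ops = 5000
journal_queue_max_bytes = 1048576000
# filestore_wbthrottle_* and the recovery/scrub throttles are raised the same way
EOF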
Rule 4: Use RBD cache 
• Facts: 
• RBD cache gives a remarkable performance improvement for sequential read/write
• Action: 
• Enable RBD cache 
• Configuration: 
• “rbd_cache = true”
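A hedged client-side sketch; the cache sizes and the nova.conf cache mode are assumptions:

cat >> /etc/ceph/ceph.conf <<'EOF'
[client]
rbd_cache = true
rbd_cache_size = 33554432        # 32 MB per RBD client, illustrative
rbd_cache_max_dirty = 25165824
EOF
# and let Qemu open rbd disks with writeback caching, e.g. in nova.conf:
#   disk_cachemodes="network=writeback"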
Rule 5: Speed Cache 
• Facts: 
• The default cache container implementation isn’t suitable for large cache capacity
• Temporary Action: 
• Change cache container to “RandomCache” (Out of Master Branch) 
• FDCache, OMap header cache, ObjectCacher 
• Next:
• RandomCache isn’t suitable for generic situations
• Implement an effective ARC to replace RandomCache
Rule 6: Keep Thread Running 
• Facts: 
• Ineffective thread wakeup 
• Action: 
• Make OSD queue running 
• Configuration: 
• Still in Pull Request(https://github.com/ceph/ceph/pull/2727) 
• “osd_op_worker_wake_time”
Rule 7: Async Messenger (experimental)
• Facts: 
• Each client needs two threads on the OSD side
• Painful context switch latency 
• Action: 
• Use Async Messenger 
• Configuration: 
• “ms_type = async” 
• “ms_async_op_threads = 5”
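The corresponding ceph.conf fragment, using the two options named above (the [global] section is an assumption; the async messenger was experimental at the time):

cat >> /etc/ceph/ceph.conf <<'EOF'
[global]
ms_type = async
ms_async_op_threads = 5
EOF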
Result: IOPS 
Based on Ceph 0.67.5 Single OSD
Result: Latency 
• 4K random write for 1TB rbd image: 1.05 ms per IO 
• 4K random read for 1TB rbd image: 0.5 ms per IO 
• 1.5x latency performance improvement 
• Outstanding large dataset performance
THE THIRD PART 
03 High 
Durability
Ceph Reliability Model 
• https://wiki.ceph.com/Development/Reliability_model 
• 《CRUSH: Controlled, Scalable, Decentralized Placement of 
Replicated Data》 
• 《Copysets: Reducing the Frequency of Data Loss in Cloud Storage》 
• Ceph CRUSH code 
Durability Formula 
• P = func(N, R, S, AFR) 
• P = Pr * M / C(R, N)
Where do we need to optimize?
• DataPlacement decides Durability 
• CRUSH-MAP decides DataPlacement 
• CRUSH-MAP decides Durability
What do we need to optimize?
• Durability depends on OSD recovery time
• Durability depends on the number of copy sets in the Ceph pool
A copy set is a possible OSD set for a PG.
Data loss in Ceph means losing any PG, which is actually losing any copy set.
If the replication number is 3 and we lose 3 OSDs, the probability of data loss
depends on the number of copy sets, because those 3 OSDs may not form a copy set.
• The shorter the recovery time, the higher the durability
• The fewer copy sets, the higher the durability
How do we optimize it?
• The CRUSH-MAP setting decides the OSD recovery time
• The CRUSH-MAP setting decides the number of copy sets
Default CRUSH-MAP setting
[Diagram: default CRUSH hierarchy — root → rack-01/02/03 → server-01..24 → OSDs; 24 OSDs per rack]
3 racks, 24 nodes, 72 OSDs
if R = 3, M = 24 * 24 * 24 = 13824
default crush setting
N = 72, S = 3

          R = 1    R = 2     R = 3
C(R, N)   72       2556      119280
M         72       1728      13824
Pr        0.99     2.1e-4    4.6e-8
P         0.99     1.4e-4    5.4e-9
Nines     -        3         8
Reduce recovery time
With the default CRUSH-MAP setting, if one OSD goes out, only the two other OSDs in the same host (e.g. server-08) can do data recovery, so the recovery time is too long.
We need more OSDs taking part in recovery to reduce the recovery time.
Use an osd-domain bucket instead of the host bucket to reduce recovery time.
[Diagram: the 24 servers in 3 racks regrouped into 6 osd-domain buckets, each spanning 4 servers (12 OSDs)]
if R = 3, M = 24 * 24 * 24 = 13824
new crush map
N = 72, S = 12

          R = 1    R = 2     R = 3
C(R, N)   72       2556      119280
M         72       1728      13824
Pr        0.99     7.8e-5    6.7e-9
P         0.99     5.4e-5    7.7e-10
Nines     -        4         9
Reduce the number of copy sets
Use a replica-domain bucket instead of the rack bucket.
[Diagram: the 6 osd-domains grouped into 2 replica-domains under one failure-domain (root)]
if R = 3, M = (12*12*12) * 2 = 3456
new crush map
N = 72, S = 12

          R = 1    R = 2     R = 3
C(R, N)   72       2556      119280
M         72       864       3456
Pr        0.99     7.8e-5    6.7e-9
P         0.99     2.7e-5    1.9e-10
Nines     0        4         ≈ 10
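A hedged sketch of what the custom hierarchy and rule could look like in a decompiled CRUSH map. The talk only names the bucket types; the type ids, bucket names and exact rule steps below are assumptions chosen to match M = (12*12*12) * 2 (one replica-domain chosen per PG, replicas spread over its osd-domains):

cat > crush-fragment.txt <<'EOF'
# custom bucket types replacing root/rack/host
type 0 osd
type 1 osd-domain
type 2 replica-domain
type 3 failure-domain

rule replicated_replica_domain {
    ruleset 1
    type replicated
    min_size 1
    max_size 10
    step take failure-domain-0
    step choose firstn 1 type replica-domain
    step chooseleaf firstn 0 type osd-domain
    step emit
}
EOF
# merge the fragment into the full decompiled map, then compile and load it:
#   ceph osd getcrushmap -o map.bin && crushtool -d map.bin -o map.txt
#   crushtool -c map.txt -o new.bin && ceph osd setcrushmap -i new.bin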
THE FOURTH PART 
04 Operation Experience
deploy 
• eNovance: puppet-ceph 
• Stackforge: puppet-ceph 
• UnitedStack: puppet-ceph 
• shorter deploy time
• supports all Ceph options
• supports multiple disk types
• wwn-id instead of disk label (see the example below)
• hieradata
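A small example of the wwn-id point: address data disks by their stable by-id path instead of the sdX label (the wwn value below is made up):

ls -l /dev/disk/by-id/ | grep wwn
# sdb/sdc can be renamed across reboots; the wwn-based path stays stable,
# so hieradata refers to /dev/disk/by-id/wwn-0x50014ee2b1263c4d instead of /dev/sdb
readlink -f /dev/disk/by-id/wwn-0x50014ee2b1263c4d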
Operation goal: Availability 
• reduce unnecessary data migration 
• reduce slow requests
upgrade ceph 
1. noout: ceph osd set noout 
2. mark down: ceph osd down x 
3. restart: service ceph restart osd.x
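A minimal per-node sketch of those three steps as a rolling restart (sysvinit-style service names and OSD paths assumed; it waits for HEALTH_OK before touching the next OSD):

ceph osd set noout
for dir in /var/lib/ceph/osd/ceph-*; do      # the OSDs hosted on this node
    id=${dir##*-}
    ceph osd down "$id"
    service ceph restart osd."$id"
    while ! ceph health | grep -q HEALTH_OK; do sleep 10; done
done
ceph osd unset noout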
reboot host 
1. migrate vm 
2. mark down osd 
3. reboot host
expand ceph capacity 
1. setting crushmap 
2. setting recovery options 
3. trigger data migration 
4. observe data recovery rate 
5. observe slow requests
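A hedged sketch of steps 2-5; the recovery values and the new OSD's crush location and weight are assumptions:

# 2. soften recovery before triggering migration
ceph tell osd.\* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 1'
# 3. trigger data migration by placing the new OSD into the crush map
ceph osd crush add osd.72 1.0 osd-domain=osd-domain-6
# 4./5. watch recovery rate and slow requests
ceph -w
ceph health detail | grep -i slow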
replace disk 
• be careful 
• ensure the replica-domain’s weight stays unchanged,
otherwise data (PGs) will migrate to another replica-domain (see the sketch below)
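A hedged sketch of the weight bookkeeping; osd ids, weights and bucket names are assumptions, and the only point is that the replica-domain's total weight ends up exactly where it started:

ceph osd tree                    # note the failed osd's crush weight, e.g. 1.0
ceph osd crush remove osd.12
ceph auth del osd.12
ceph osd rm 12
# ... swap the disk, recreate osd.12 ...
ceph osd crush add osd.12 1.0 osd-domain=osd-domain-2   # re-use the old weight
ceph osd tree                    # verify the replica-domain weight is unchanged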
monitoring 
• diamond: add new collector, ceph perf dump, ceph status 
• graphite: store data 
• grafana: display 
• alert: Zabbix with the ceph health command
throttle model
add new collector in diamond 
redefine metric name in graphite 
[process].[what].[component].[attr]
components: osd_client_messenger, osd_dispatch_client, osd_dispatch_cluster, osd_pg, osd_pg_client_w, osd_pg_client_r, osd_pg_client_rw, osd_pg_cluster_w, filestore_op_queue, filestore_journal_queue, filestore_journal, filestore_wb, filestore_leveldb, filestore_commit
attrs: max_bytes, max_ops, ops, bytes, op/s, in_b/s, out_b/s, lat
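A hedged example of the raw data behind these metrics, pulled the same way the diamond collector does (admin socket path assumed):

ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok perf dump | python -m json.tool | head -n 30
ceph -s    # cluster-level status, also collected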
Accidents 
• SSD GC 
• network failure 
• Ceph bug 
• XFS bug 
• SSD corruption 
• PG inconsistent 
• recovery traffic filling the network bandwidth
@ UnitedStack 
THANK YOU 
FOR WATCHING 
2014/11/05
https://www.ustack.com/jobs/
About UnitedStack
UnitedStack - The Leading OpenStack Cloud Service 
Solution Provider in China
VM deployment in seconds 
WYSIWYG network topology 
High performance cloud storage 
Billing by seconds
Public Cloud 
devel 
Beijing1 
Guangdong1 
Managed Cloud 
Cloud Storage 
Mobile Social
Financial 
U Center 
Unified Cloud Service Platform 
Unified Ops 
Unified SLA 
IDC 
test 
…… ……
The details of the durability formula
• DataPlacement decides Durability 
• CRUSH-MAP decides DataPlacement 
• CRUSH-MAP decides Durability
Default crush setting
[Diagram: root → rack-01/02/03 → server-01..24]
3 racks 
24 nodes 
72 osds 
How to compute Durability?
Ceph Reliability Model
• https://wiki.ceph.com/Development/Reliability_model 
• 《CRUSH: Controlled, Scalable, Decentralized 
Placement of Replicated Data》 
• 《Copysets: Reducing the Frequency of Data Loss in 
Cloud Storage》 
• Ceph CRUSH code
Durability Formula
P = func(N, R, S, AFR) 
• P: the probability of losing all copies
• N: the number of OSDs in the Ceph pool
• R: the number of copies
• S: the number of OSDs in a bucket (it decides the recovery time)
• AFR: disk annualized failure rate
Failure events are 
considered to be Poisson 
• Failure rates are characterized in units of failures per billion hours (FITs), so all periodicities are represented in FITs and all times in hours:
fit = failures in time = 1/MTTF ~= 1/MTBF = AFR / (24*365)
• Event probabilities: with failure rate λ, the probability of n failure events during time t is:
Pn(λ, t) = (λt)^n · e^(−λt) / n!
The probability of data loss 
• OSD set: a copy set that any PG resides in
• data loss: losing any OSD set
• ignore Non-Recoverable Errors; assume NREs never happen, which may be true on scrubbed OSDs
Non-Recoverable Errors 
NREs are read errors that cannot be 
corrected by retries or ECC. 
• media noise 
• high-fly 
• off-track writes
The probability of R OSDs loss
1. The probability of an initial OSD loss incident.
2. Having suffered this loss, the probability of losing R−1 more OSDs is based on the recovery time.
3. Multiplying these together gives Pr.
The probability of Copy sets 
loss 
1. M = the number of copy sets in the Ceph pool
2. the number of ways to pick any R OSDs is C(R, N)
3. the probability of copy set loss is Pr * M / C(R, N)
P = Pr * M / C(R, N) 
If R = 3, one PG → (osd.x, osd.y, osd.z)
(osd.x, osd.y, osd.z) is a copy set
All copy sets are in line with the rules of the CRUSH map
M = the number of copy sets in the Ceph pool
Pr = the probability of losing R OSDs
C(R, N) = the number of ways to pick any R OSDs out of N
P = the probability of losing any copy set (data loss)
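For example, plugging the default-CRUSH values from the tables in this talk into the formula for R = 3: P = Pr * M / C(R, N) = 4.6e-8 * 13824 / 119280 ≈ 5.4e-9, i.e. about 8 nines of durability.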
default crush setting
[Diagram: root → rack-01/02/03 → server-01..24; 24 OSDs per rack]
if R = 3, M = 24 * 24 * 24 = 13824
default crush setting
AFR = 0.04 
One Disk Recovery Rate = 100 MB/s 
Mark Out Time = 10 mins
N = 72, S = 3

          R = 1    R = 2     R = 3
C(R, N)   72       2556      119280
M         72       1728      13824
Pr        0.99     2.1e-4    4.6e-8
P         0.99     1.4e-4    5.4e-9
Nines     -        3         8
Trade-off
trade-off between durability and availability

new crush   N    R   S    Nines   Recovery time
Ceph        72   3   3    11      31 mins
Ceph        72   3   6    10      13 mins
Ceph        72   3   12   10      6 mins
Ceph        72   3   24   9       3 mins
Shorter recovery time
Minimize the impact on the SLA

Final crush map
old map: root → rack → host → osd
new map: failure-domain → replica-domain → osd-domain → osd