[OpenStack Days Korea 2016] Track 1 - All-Flash Ceph Configuration and Optimization
1. All-Flash Ceph Configuration and Optimization
Feb. 18, 2016
SDS Tech. Lab, Corporate R&D Center
SK Telecom
OpenStack Days in Korea
2. Why are we focusing on all-flash Ceph?
Tech. Trends of Storage Systems
• Evolution from hybrid scale-up storage systems, through hybrid scale-out and all-flash scale-up systems, toward all-flash scale-out storage systems
• Drivers: increasing effective capacity and rising performance
Requirements for All-IT Network/Infra Storage System
• Scalability, Availability, Performance
3. What is Ceph?
[Figure: the Ceph stack (http://docs.ceph.com/docs/master/_images/stack.png) — object storage for apps, virtual disks for hosts/VMs, and files & dirs for clients, all on top of RADOS]
• Ceph is a unified, distributed, massively scalable open-source storage solution: object, block, and file storage
• Mostly LGPL open-source project
• Failure is normal
• Self-managing
• Scales out on commodity hardware
• Everything runs in software
4. Ceph Architecture
[Diagram: clients (KVM with librbd, krbd, applications with librados) reach the Ceph storage system over the service network; three monitors distribute cluster maps; I/O flows directly between clients and OSDs; OSDs replicate among themselves over the storage network]
5. Ceph Operation: Ceph Block Device
[Diagram: an application's I/O passes through librbd and librados to the OSD service; the primary OSD (#0) replicates synchronously to OSD #1; each OSD holds PGs (PG#0–PG#3) backed by FileStore on XFS over a disk or RAID group, writing the journal with O_DIRECT and O_DSYNC and the data with buffered I/O]
Data placement: CRUSH algorithm
Ceph Block Device
• A sequence of fixed-size objects (default: 4MB)
  e.g., 1GB block image = 256 objects
• Hash: object to PG
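The image-to-object and object-to-PG mapping above can be sketched in a few lines. This is a simplification: md5 stands in for Ceph's rjenkins hash, and `PG_NUM` is a hypothetical pool setting.

```python
import hashlib

OBJECT_SIZE = 4 * 2**20      # default RBD object size: 4 MB
PG_NUM = 128                 # hypothetical pool pg_num

def objects_in_image(image_bytes):
    # A block image is striped over fixed-size objects.
    return image_bytes // OBJECT_SIZE

def object_to_pg(object_name, pg_num=PG_NUM):
    # Ceph hashes the object name and maps the hash onto a PG;
    # md5 stands in for the real rjenkins hash in this sketch.
    h = int(hashlib.md5(object_name.encode()).hexdigest(), 16)
    return h % pg_num

print(objects_in_image(2**30))             # 1GB image -> 256
print(object_to_pg("rbd_data.1234.0000"))  # some PG in [0, PG_NUM)
```

CRUSH then maps each PG onto a set of OSDs, which is what lets clients compute data locations without a central lookup.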
6. Ceph OSD Node Configuration
Journal / Data disk configuration
• Common combinations (Journal / Data):
  SSD / HDD
  no external journal disk / SSD
  PCIe SSD / SATA SSD
  NVRAM / SSD
[Chart: 4KB random write IOPS and latency (ms) compared for SSD vs. NVRAM journal]
OSDs per node
• 1 OSD per disk or RAID group
• Ceph OSD daemons are CPU-intensive processes
[Chart: 4KB random write IOPS and latency (ms) as OSDs per node vary (3, 4, 6, 8, 12 OSDs)]
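A journal placement like the NVRAM/SSD combination above is set per OSD in ceph.conf; a minimal sketch, with a hypothetical device path:

```ini
[osd]
# point the FileStore journal at a fast external device (hypothetical path)
osd journal = /dev/nvram0p1
# journal size in MB; must absorb write bursts between filestore syncs
osd journal size = 10240
```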
10. Ceph Write I/O Flow in FileStore
FileStore splits each write between the file journal and the data store:
1. Queue the transaction to the writeq (under PG lock)
2. The writer thread operates on the journal transactions
3. Write to the journal disk
4. AIO completes
5. Queue the op to the Operation WQ
6. Queue to the finisher thread → Committed (PG lock; send a RepOp reply to the primary if this is a secondary OSD; PG unlock)
7. Operation threads write the data to the data disk
8. Queue to the finisher thread → Applied
Separate finisher threads track journal and data completion.
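The eight steps above can be sketched as a small producer/consumer pipeline. This models only the thread handoffs (writeq → journal → op queue → data write); the names follow the slide, and nothing here is actual Ceph code.

```python
import threading
import queue

class FileStoreSketch:
    """Toy model of FileStore's journal-then-apply write pipeline."""

    def __init__(self):
        self.writeq = queue.Queue()   # 1. transactions queued (PG lock held)
        self.op_wq = queue.Queue()    # 5. ops queued after journal commit
        self.journal = []             # stands in for the journal disk
        self.data = {}                # stands in for the data disk (XFS)
        self.committed = []           # ops acked as "Committed"
        self.applied = []             # ops acked as "Applied"
        self.done = threading.Event()

    def writer_thread(self):
        # Steps 2-6: journal each transaction, ack commit, hand off the op.
        while True:
            tx = self.writeq.get()
            if tx is None:
                self.op_wq.put(None)
                return
            self.journal.append(tx)          # 3. journal disk write
            self.committed.append(tx["op"])  # 6. commit ack via finisher
            self.op_wq.put(tx)               # 5. queue op for data write

    def op_thread(self):
        # Steps 7-8: apply the op to the data store, then ack "Applied".
        while True:
            tx = self.op_wq.get()
            if tx is None:
                self.done.set()
                return
            self.data[tx["key"]] = tx["value"]
            self.applied.append(tx["op"])

    def run(self, txs):
        t1 = threading.Thread(target=self.writer_thread)
        t2 = threading.Thread(target=self.op_thread)
        t1.start(); t2.start()
        for tx in txs:
            self.writeq.put(tx)   # step 1 (PG lock would be held here)
        self.writeq.put(None)     # sentinel: no more transactions
        self.done.wait()
        t1.join(); t2.join()

fs = FileStoreSketch()
fs.run([{"op": 1, "key": "a", "value": b"x"},
        {"op": 2, "key": "b", "value": b"y"}])
print(fs.committed, fs.applied)  # [1, 2] [1, 2]
```

The point the sketch makes concrete is that a write is acknowledged as Committed after the journal write, before the (buffered) data write completes, which is why journal latency dominates small-write latency.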
11. Optimization
Item / Issue
• PG Lock — over 30% of total latency is spent acquiring the PG lock
  - OP-processing worker threads block, delaying unrelated OPs
  - Large PG-lock critical section
  - Delayed ACK processing on the secondary OSD increases I/O latency
• Ceph & System Tuning — measured performance fluctuates widely
  - Ceph configuration parameters: changing them individually has little effect; an optimal combination is needed
  - High CPU usage in the memory allocator
  - TCP/IP Nagle algorithm
• Log — performance changes greatly depending on whether logging is disabled
  - Time spent on logging in the OSD I/O path
• Transaction — transaction processing strongly affects performance
  - Inefficient transaction processing: unnecessary operations, lock contention
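Two of the issues above (Nagle, logging) map directly onto ceph.conf options; the values below are illustrative, not the tuned combination from the talk:

```ini
[global]
# disable Nagle's algorithm on Ceph messenger sockets
ms tcp nodelay = true
# silence per-subsystem logging in the OSD I/O path
debug osd = 0/0
debug filestore = 0/0
debug journal = 0/0
debug ms = 0/0
```

The memory-allocator overhead is typically addressed outside ceph.conf, e.g. by preloading jemalloc or a newer tcmalloc for the OSD processes.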
12. VM Performance: Test Environment
Service Network (10GbE) / Storage Network (10GbE)

Physical Client (x 5)
• Vendor / Model: DELL R720XD
• Processor: Intel® Xeon® E5-2670v3 @ 2.60GHz x 2 (10-core)
• Memory: 128GB
• OS: CentOS 7.0

OSD Node / Monitor (x 4)
• Vendor / Model: DELL R630
• Processor: Intel® Xeon® E5-2690v3 @ 2.60GHz x 2 (12-core)
• Memory: 128GB; NIC: 10GbE
• OS: CentOS 7.0; Journal: RAMDISK

Switch (x 2)
• Vendor / Model: Cisco Nexus 5548UP 10G

Disk
• SSD: SK Hynix SSD 480GB, 10 ea per OSD node
• RAID: RAID 0 groups of 3, 3, 2, and 2 SSDs (4 RAID groups) → 4 devices & 4 daemons per OSD node

Ceph
• Version: SKT Ceph vs. Community (0.94.4)

VM (up to 4 per physical client)
• Guest OS spec: 2 cores, 4GB memory; librbd

FIO Test Configuration
• Run time 300s, ramp time 10s, threads 8, queue depth 8
• Sustained performance: 2x write (80% of usable space)
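A fio job matching this configuration might look as follows; the device path is a hypothetical RBD-backed volume attached inside the guest:

```ini
[global]
ioengine=libaio
direct=1
time_based
runtime=300
ramp_time=10
numjobs=8
iodepth=8
group_reporting

[4k-randwrite]
rw=randwrite
bs=4k
filename=/dev/vdb   ; hypothetical attached RBD volume
```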
13. VM Performance Comparison: Random Workload
[Chart: random-workload results — KIOPS (bars) and latency in ms (diamonds), SKT Ceph vs. Community]

Workload | SKT Ceph KIOPS | Community KIOPS | SKT Ceph latency (ms) | Community latency (ms)
4KB RW   | 71  | 3   | 3.4 | 5.7
32KB RW  | 43  | 3   | 2.7 | 5.5
4KB RR   | 185 | 114 | 3.5 | 3.4
4KB RR   | 118 | 71  | 2.0 | 2.5
14. VM Performance Comparison: Sequential Workload
[Chart: sequential-workload results — bandwidth in MB/s (bars) and latency in ms (diamonds), SKT Ceph vs. Community]

Workload | SKT Ceph MB/s | Community MB/s | SKT Ceph latency (ms) | Community latency (ms)
1MB SW   | 2,669 | 2,729 | 59.7  | 28.3
4MB SW   | 2,768 | 2,948 | 172.4 | 425.2
1MB SR   | 4,287 | 4,281 | 73.2  | 36.7
4MB SR   | 4,281 | 4,296 | 293.6 | 292.7
15. SKT AF-Ceph
AFC-S: 4 Data Nodes + 1 Management Node (commodity server & SSD based)
• Monitor node (management server)
• Data node (OSD node): NVRAM journal, 10 ea SATA SSD data store

System Configuration
• Configuration: 4 data nodes + 1 monitor node
• Rack space: 5U
• SSD: 40 ea SATA SSD (in 4U)
• NVRAM: 8GB NVRAM
• Capacity: total 40TB / usable 20TB (w/ 1TB SSDs); total 80TB / usable 40TB (w/ 2TB SSDs)
• Node H/W: CPU Intel Xeon E5-2690v3 2-socket; RAM 128GB (DDR3 1866MHz); Network 10GbE x 2 for service & storage
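The usable figures are consistent with 2x replication over the raw pool (an assumption here; the slide does not state the replica count), which a quick check confirms:

```python
ssds_per_node = 10
nodes = 4
replication = 2   # assumed replica count implied by usable = total / 2

for ssd_tb in (1, 2):
    total = ssds_per_node * nodes * ssd_tb
    usable = total // replication
    print(f"{ssd_tb}TB SSDs: total {total}TB / usable {usable}TB")
# -> 1TB SSDs: total 40TB / usable 20TB
# -> 2TB SSDs: total 80TB / usable 40TB
```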
…
AFC-N: 2U MicroServer (4 Data Nodes) + 1U NVMe All-Flash JBOF
…
• NV-Array (all-flash JBOF) with NV-Drives (NVMe SSDs)
• E5 2-socket server (4 nodes in 2U)
• High performance (PCIe 3.0)
• High density (24 ea 2.5" NVMe SSD: up to 96TB)
• Scheduled for 4Q 2016
16. SKT AF-Ceph
• Real-time monitoring
• Multi dashboard
• Rule-based alarm
• Drag & drop admin
• REST API
• Real-time graph
• Graph merge
• Drag & zooming
• Auto configuration
• Cluster management
• RBD management
• Object storage management