Ceph Day Seoul - AFCeph: SKT Scale Out Storage Ceph
1. AFCeph: SKT Scale Out Storage Ceph
Problem, Progress, and Plan
Myoungwon Oh, Byung-Su Park
SDS Tech. Lab,
Network IT Convergence R&D Center
SK Telecom
2. 1
Why we care about All-Flash Storage …
5G, UHD, and 4K workloads
Flash device: High Performance, Low Latency, SLA
3. 2
Transforming to 5G Network
New ICT infrastructure should be Programmable, Scalable, Flexible, and Cost Effective
Software-Defined Technologies based on Open Software & Open Hardware
5G: Massive Connectivity, 10x Lower Latency, 100x-1000x Higher Speed, Efficiency & Reliability, Virtualization
4. 3
Open HW & SW Projects @ SKT
Open Software
OpenStack, ONOS, Ceph, Cloud Foundry, Hadoop …
Open Hardware
Open Compute Project (OCP), Telecom Infra Project (TIP)
All-flash Storage, Server Switch, Telco Specific H/W…
Software-Defined Technologies
6. 5
Agenda
1. All-flash Ceph Storage Cluster Environment
2. Optimizing All-flash Ceph over the Last Year
3. Overview of the All-flash Ceph Approach This Year
4. All-flash Ceph in Detail
• BlueStore
• Quality of Service (QoS)
• Deduplication for Ceph
• Op Lock Completion
5. Ceph Deployment in SKT Private Cloud
6. Operations & Maintenance Tool
7. The Future of All-flash Ceph Storage
7. 6
All-flash Ceph Storage Cluster Environment
Ceph Clients
• CPU: 2x E5-2660v3
• DRAM: 256GB, Network: Intel 10GbE NIC
• Linux: CentOS 7.0 (w/ KRBD), Kernel: 3.16.4 or 4.1.6
Ceph Cluster Nodes (4)
• CPU: 2x E5-2690v3
• DRAM: 128GB, Network: Intel 2x 10GbE NIC
• Linux: CentOS 7.0, Kernel: 3.10.0-123.el7.x86_64
• Ceph Version: Hammer based
• Journal: NVRAM, Data Store: 10x SATA SSD
Network: 10GbE switches for the service network and the storage network
8. 7
Optimizing All-flash Ceph over the Last Year (Hammer based)
Write path (Client → Messenger → OSD → Journal (NVRAM/DIMM) → Data Store (SSD)):
1. Client sends the write request
2. Primary OSD sends replication to the replica OSDs
3. Write I/O to the journal
4. Update metadata (LevelDB, xattr)
5. Write data to the file system buffers (synced to the data store every 5 secs)
6. Receive the replication reply
7. Send the write ACK to the client
Performance optimization across the Ceph S/W stack and the flash device interface
• Coarse-grained lock overhead improvement
• Logging changed to run asynchronously
• Memory allocator change to reduce CPU usage
• All-flash based system configuration
• Metadata transaction improvement
• Merged writes, removal of useless system calls, cache extension
Performance improvement!!
• Random 4K Write: 42K → 89K IOPS (47K ↑)
• Random 4K Read: 154K → 321K IOPS (167K ↑)
9. 8
Overview of the All-flash Ceph Approach This Year
• Performance: BlueStore, Op Lock Completion, …
• Storage Efficiency: data reduction techniques for flash devices
• Differentiation Feature: Quality of Service (QoS) in a distributed environment
• Management / Operation: SKT Operations & Maintenance Tool
Directions
• Differentiated use of flash: making the best use of flash H/W, e.g. NVDIMM, JBOF
• SK private infra application: improvement of marketability based on internal requirements
• Avoiding duplicated development with the upstream: joint development and a stronger presence in the upstream
10. 9
BlueStore: Next-Generation OSD Backend
FileStore problems
• No transaction atomicity, so journaling is necessary (double write)
• Inefficient object enumeration
BlueStore is an alternative to FileStore with many interesting features
• No file system: data goes straight to the block device, metadata (onode, omap/xattr) to RocksDB on BlueFS
• No journaling, so no double write
• Lighter than FileStore
Layout: OSD → BlueStore → Allocator + RocksDB/BlueFS; WAL/DB can sit on fast SSD/NVM while data goes to the slow device
We are putting much effort into BlueStore in order to improve I/O performance
• Modification of the Allocator
• Exploiting new devices
• Analysis of the internal structure for all-flash
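As a rough illustration of the "no double write" point above, here is a toy C++ model (not Ceph code; the 4 KiB payload and 256-byte metadata sizes are made-up assumptions) comparing the bytes a single object write pushes to the device under a journaled FileStore-style backend and a BlueStore-style backend:

// Toy model only (not Ceph code): bytes pushed to the device for one object
// write under a journaled FileStore-style backend vs. a BlueStore-style
// backend. Payload and metadata sizes below are assumptions.
#include <cstddef>
#include <iostream>

constexpr std::size_t DATA = 4096;  // payload of one object write (assumed)
constexpr std::size_t META = 256;   // size of the metadata update (assumed)

// FileStore style: the whole transaction (data + metadata) is written to the
// journal first, then the data is written again into the backing file system.
std::size_t filestore_bytes() { return (DATA + META) + DATA + META; }

// BlueStore style: the data is written once to the raw device and only the
// metadata goes through the RocksDB WAL, so there is no data double write.
std::size_t bluestore_bytes() { return DATA + META; }

int main() {
    std::cout << "FileStore-style bytes per write: " << filestore_bytes() << "\n"
              << "BlueStore-style bytes per write: " << bluestore_bytes() << "\n";
}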
12. 11
Quality of Service (QoS): dmClock
Ensures specific performance levels for applications and workloads
Three controls
• Reservation: minimum guarantee
• Limit: maximum allowed
• Shares (weight): proportional allocation
Why is storage I/O allocation (QoS) hard?
• Storage workload characteristics are variable
• Available throughput changes with time, so the allocation must be adjusted dynamically
• Distributed, shared access
dmClock algorithm (OSDI '10)
• Real-time tags: separate tags for reservation, shares & limit
• Dynamic tag selection & synchronization
• Accounts for the aggregate service the VM receives from all the servers in the system
* From Ajay Gulati et al., "mClock: Handling Throughput Variability for Hypervisor IO Scheduling", OSDI '10
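For reference, the per-request tag assignment from the dmClock paper looks roughly as follows (our transcription; r_i, w_i, l_i are client i's reservation, weight and limit, t is the arrival time of the k-th request, and rho_i / delta_i are the adjustments described on the next slide):

R_i(k) = max( R_i(k-1) + rho_i / r_i , t )
P_i(k) = max( P_i(k-1) + delta_i / w_i , t )
L_i(k) = max( L_i(k-1) + delta_i / l_i , t )

With rho_i = delta_i = 1 this reduces to the single-server mClock tags.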
13. 12
Quality of Service (QoS)
Each VM (VM 1, VM 2, VM 3) issues I/O through LIBRBD/LIBRADOS in the virtualization container with its own QoS spec (R1/L1/W1, R2/L2/W2, R3/L3/W3):
• Reservation – minimum IOPS
• Limit – maximum IOPS
• Weight – proportional share of capacity
Each request carries its delta/rho counters and is tagged and scheduled by a dmClock queue on every OSD (OSD #1 … OSD #12), replacing the existing ShardData weighted prioritized queue
• Delta: the number of completed IOs
• Rho: the number of IOs completed as part of the reservation (R) phase
Current progress in the Ceph community: two types of dmClock queue
• mClockOpClassQueue: classes are op types (client op, OSD sub op, snap, recovery, scrub)
• mClockClientQueue: classes are the client entity instance plus the op type (client op, OSD sub op, snap, recovery, scrub)
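To make the delta/rho bookkeeping above concrete, below is a minimal self-contained C++ sketch (not the ceph/dmclock library; names and values are illustrative) of how the counters stretch a client's tag spacing when it has already been served by other OSDs:

// Minimal sketch of dmClock tag assignment (illustrative, not Ceph/dmclock code).
#include <algorithm>
#include <cstdio>

struct ClientInfo { double reservation, weight, limit; };   // r_i, w_i, l_i
struct Tags { double resv = 0, prop = 0, lim = 0; };        // tags of a request

// Compute the tags for the next request of one client.
Tags tag_request(const ClientInfo& c, const Tags& prev,
                 double now, unsigned rho, unsigned delta) {
    Tags t;
    t.resv = std::max(prev.resv + rho   / c.reservation, now);  // rho: reservation phase
    t.prop = std::max(prev.prop + delta / c.weight,      now);  // delta: all completed IOs
    t.lim  = std::max(prev.lim  + delta / c.limit,       now);
    return t;
}

int main() {
    ClientInfo vm{100.0, 2.0, 500.0};   // 100 IOPS reserved, weight 2, limit 500
    Tags last;                          // tags of the previous request
    // Since the last request to this OSD the VM completed 5 IOs cluster-wide,
    // 3 of them served in the reservation phase.
    Tags next = tag_request(vm, last, /*now=*/10.0, /*rho=*/3, /*delta=*/5);
    std::printf("resv=%.3f prop=%.3f lim=%.3f\n", next.resv, next.prop, next.lim);
}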
14. 13
Quality of Service (QoS): QoS on All-flash Ceph
What we have done for Ceph QoS
• Improvement of the dmClock algorithm
• Enhancing the utility of the dmClock simulator
• Verifying the implementation of dmClock
Contributions
• Defer tag calculation so it uses up-to-date delta/rho (#4) – figure: before vs. deferred tag calculation
• Add file-based dmClock configuration parameter functionality (#3)
• Etc.: fixes for some missing parts of the simulator and the dmClock algorithm
• …
The future of QoS on All-flash Ceph
• Completion of the remaining parts of Ceph QoS
• Insertion of a flash-device-aware QoS revision mechanism
• Improvement of the current dmClock-based QoS algorithm
15. 14
Deduplication for Ceph
Why deduplication for Ceph?
• Space saving
• Extending SSD lifetime
• Reducing network traffic
Problems
• Hard to implement within the existing source
• Performance
Goals
• Compatibility & simple design (layered approach)
• Reasonable performance (dedup ratio & I/O performance)
Solution: double CRUSH + tiering
• Easy implementation
• Code reuse
17. 16
Deduplication for Ceph: double CRUSH + tiering in detail
• The CRUSH algorithm, the proxy OSD and the replicated OSDs use the same code as original Ceph; only the primary OSD (dedup OSD) is modified
• A write request is chunked and each chunk is fingerprinted (Hash A, Hash B, Hash C, …); chunks that produce the same fingerprint (Hash A appearing several times) are stored only once and merely referenced
• Lookup table: maps an object to its fingerprint, the dedup OSD holding the chunk, and the source chunk offset
  OID   | Fingerprint | D.OSD | S.ChunkOffset
  OID_1 | aabbcc      | OSD_1 | 2
  OID_2 | aabbcf      | OSD_1 | 4
• Dedup table: one entry per fingerprint collision domain, holding the source object (S.OID), the source chunk offset (S.ChunkOffset) and a reference count, plus the dedup object (D.OID) and chunk offset (D.ChunkOffset)
• Data path / ack path: the dedup OSD returns TRUE if the data can be deduplicated, acks with the updated dedup table info, and stores the original data, metadata and dedup metadata (a simplified sketch of the per-chunk flow follows below)
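The per-chunk decision can be sketched as follows; this is an illustrative stand-in for the layered design (not SKT's implementation): a toy fingerprint and an in-memory map play the role of the dedup table, and CRUSH placement of the chunks is omitted.

// Illustrative dedup write path: fingerprint each chunk and either bump the
// reference count of an existing chunk or store it and create a table entry.
#include <cstdint>
#include <functional>
#include <map>
#include <string>
#include <vector>

struct DedupEntry {               // one fingerprint "collision domain"
    std::string stored_oid;       // S.OID: object actually holding the chunk
    uint64_t    chunk_offset;     // S.ChunkOffset
    uint64_t    ref_count = 0;    // how many logical chunks point here
};

// Dedup table keyed by fingerprint; a stand-in for the table on the dedup OSD.
std::map<std::string, DedupEntry> dedup_table;

// Toy fingerprint; a real system would use a cryptographic hash of the chunk.
std::string fingerprint(const std::vector<uint8_t>& chunk) {
    return std::to_string(std::hash<std::string>{}(
        std::string(chunk.begin(), chunk.end())));
}

// Handle one fixed-size chunk of a client write; returns true if it was
// deduplicated against an existing chunk (no new data written).
bool write_chunk(const std::string& oid, uint64_t offset,
                 const std::vector<uint8_t>& chunk) {
    const std::string fp = fingerprint(chunk);
    auto it = dedup_table.find(fp);
    if (it != dedup_table.end()) {
        ++it->second.ref_count;   // duplicate: only reference the stored chunk
        return true;
    }
    // New fingerprint: store the chunk (omitted) and record where it lives.
    dedup_table[fp] = DedupEntry{oid, offset, 1};
    return false;
}

int main() {
    std::vector<uint8_t> a(4096, 0xAA), b(4096, 0xBB);
    write_chunk("OID_1", 0, a);      // stores the chunk
    write_chunk("OID_2", 0, a);      // deduplicated, ref_count becomes 2
    write_chunk("OID_2", 4096, b);   // new chunk
}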
19. 18
Op Lock Completion
AS IS: completions are handled under the coarse PG lock
• issue_repop → journal completion → commit finisher: update the pg status, count journal completions, and when all commits are finished send the ack to the client
• issue_repop → data completion → applied finisher: update the pg status, count data completions, and when all applies are finished send the ack to the client
• Ack from a secondary OSD (reader) → sharded op worker: update the pg status, count data completions, and when all commits or applies are finished send the ack to the client
TO BE: completions are handled under a per-op lock
• issue_repop → journal completion → commit finisher: count journal completions and send the ack to the client when all commits are finished; the pg status is updated lazily
• issue_repop → data completion → applied finisher: count data completions and send the ack to the client when all applies are finished; the pg status is updated lazily
• Ack from a secondary OSD (reader): count data completions and send the ack to the client when all commits or applies are finished
Counting completions and sending the client ack no longer require the PG lock; only the lazy pg status update does (a simplified sketch follows below)
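A hypothetical C++ sketch of the "TO BE" idea (not Ceph source): each in-flight op counts its own journal/data completions with per-op state (atomics here), so the client ack no longer needs the PG-wide lock and the pg status update is deferred.

// Illustrative only: per-op completion counting decoupled from the PG lock.
#include <atomic>
#include <functional>
#include <mutex>

struct PG {
    std::mutex pg_lock;   // coarse lock, now used only for pg status bookkeeping
    void update_status_lazily() { std::lock_guard<std::mutex> l(pg_lock); /* pg bookkeeping */ }
};

struct InFlightOp {
    std::atomic<int> pending_commits;   // local journal + replica commits
    std::atomic<int> pending_applies;   // local apply + replica applies
    std::function<void()> send_ack;     // reply to the client, fired once

    InFlightOp(int commits, int applies, std::function<void()> ack)
        : pending_commits(commits), pending_applies(applies),
          send_ack(std::move(ack)) {}

    void on_commit(PG& pg) {            // commit finisher path
        if (pending_commits.fetch_sub(1) == 1)
            send_ack();                 // all commits done: ack without the PG lock
        pg.update_status_lazily();      // status update deferred to the PG lock
    }
    void on_apply(PG& pg) {             // applied finisher path
        if (pending_applies.fetch_sub(1) == 1) { /* mark op fully applied */ }
        pg.update_status_lazily();
    }
};

int main() {
    PG pg;
    InFlightOp op(/*commits=*/3, /*applies=*/3, [] { /* send write ack to client */ });
    op.on_commit(pg); op.on_commit(pg); op.on_commit(pg);   // third commit fires the ack
}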
20. 19
Ceph Deployment in SKT Private Cloud
Deployed as high-performance block storage in the private cloud: OpenStack Cinder on top of the Ceph OSDs, running on general servers with SSD arrays and scaling out for capacity & performance
Measurement: 4KB random write with a per-VM cap of 1,000 IOPS
  Number of VMs | IOPS per VM | Latency (msec)
  10            | 1,000       | 1.3
  20            | 1,000       | 1.2
  40            | 1,000       | 1.3
  60            | 898         | 2.2
  80            | 685         | 2.9
21. 20
Operations & Maintenance
• Real Time Monitoring
• Multi Dashboard
• Rule Base Alarm
• Drag & Drop Admin
• Dashboard Configuration
• Rest API
• Real Time Graph
• Graph Merge
• Drag & Zooming
• Auto Configuration
• Cluster Management
• RBD Management
• Object Storage Management
22. 21
The Future of All-flash Ceph
Fully exploit NVRAM/SSD for performance; extend the application of All-flash Ceph in the SKT cloud infrastructure
NV-Array (JBOF)
• High performance (PCIe 3.0, ~6M IOPS)
• High density (20x 2.5" NVMe SSDs: up to 80TB in 1U)
• SSD hot-swap
• Expected 4Q 2016
NVMe SSD (U.2, 2.5" type)
• Global top performance under the 25W standard
• High reliability with LDPC ECC and in-storage RAID
• NVMe 1.2 standard, PCIe Gen 3.0
• 3.2TB (sample, Mar 2016), 6.4TB (Oct 2016, with 3D TLC)
25. 24
Appendix #2: Quality of Service (QoS) Priority Queue in Detail
• The simulated server keeps one ClientRec per client (Client #0, Client #1, … Client #n); each ClientRec holds that client's outstanding requests in std::deque<ClientReq> requests
• requests.front().tags carries the three tags of the next request: resv, limit, prop
• PriorityQueue priority_queue maintains three intrusive heaps over the ClientRecs: ::resv_heap, ::limit_heap and ::ready_heap; each ClientRec embeds IndIntruHeapData reserv_heap_data, lim_heap_data, ready_heap_data, so its node id (::*heap_info) is saved on each heap
• Selection: resv_heap yields the ClientRec with the minimum reservation tag; ready_heap yields the ClientRec with the minimum prop tag whose limit tag < now; limit_heap tracks the ClientRec with the minimum limit tag whose limit tag > now
• The selected ClientRec's front request is sent to the server
• do_clean() periodically erases a ClientRec from the client map or marks it idle
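A minimal sketch of the selection logic above (not the actual dmclock code; the three heaps are replaced by a linear scan for brevity): serve a client whose reservation tag is already due first, otherwise pick by proportional tag among clients whose limit tag has passed.

// Illustrative request selection in the spirit of the dmClock priority queue.
#include <cstdio>
#include <deque>
#include <map>
#include <optional>
#include <string>

struct Tags { double resv, prop, lim; };
struct ClientRec { std::deque<Tags> requests; };   // front() carries the tags

using ClientMap = std::map<std::string, ClientRec>;

// Return the id of the client whose front request should be served next,
// or nothing if every queued request is still blocked by its limit tag.
std::optional<std::string> pull_next(const ClientMap& clients, double now) {
    std::optional<std::string> by_resv, by_prop;
    for (const auto& [id, rec] : clients) {
        if (rec.requests.empty()) continue;
        const Tags& t = rec.requests.front();
        // resv_heap role: smallest reservation tag that is already due.
        if (t.resv <= now &&
            (!by_resv || t.resv < clients.at(*by_resv).requests.front().resv))
            by_resv = id;
        // ready_heap role: smallest prop tag among clients whose limit has passed.
        if (t.lim < now &&
            (!by_prop || t.prop < clients.at(*by_prop).requests.front().prop))
            by_prop = id;
    }
    return by_resv ? by_resv : by_prop;   // reservation phase first, then weights
}

int main() {
    ClientMap m;
    m["vm1"].requests.push_back({5.0, 7.0, 4.0});   // reservation tag already due
    m["vm2"].requests.push_back({9.0, 6.0, 3.0});
    std::optional<std::string> next = pull_next(m, /*now=*/6.0);
    if (next) std::printf("serve %s first\n", next->c_str());   // vm1 (resv 5 <= 6)
}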