Ceph Day Seoul - AFCeph: SKT Scale Out Storage Ceph
1. AFCeph: SKT Scale Out Storage Ceph
Problem, Progress, and Plan
Myoungwon Oh, Byung-Su Park
SDS Tech. Lab,
Network IT Convergence R&D Center
SK Telecom
2. 1
Why we care about All-Flash Storage …
5G, UHD, and 4K workloads
Flash device: High Performance, Low Latency, SLA
3. 2
Transforming to 5G Network
New ICT infrastructure should be Programmable, Scalable, Flexible, and Cost Effective
Software-Defined Technologies based on Open Software & Open Hardware
5G: Massive Connectivity, 10x Lower Latency, 100x-1000x Higher Speed, Efficiency & Reliability, Virtualization
4. 3
Open HW & SW Projects @ SKT
Open Software
OpenStack, ONOS, Ceph, Cloud Foundry, Hadoop …
Open Hardware
Open Compute Project (OCP), Telecom Infra Project (TIP)
All-flash Storage, Server Switch, Telco Specific H/W…
Software-Defined Technologies
6. 5
Agenda
1. All-flash Ceph Storage Cluster Environment
2. Optimizing All-flash Ceph over the Last Year
3. Overview of the All-flash Ceph Approach This Year
4. All-flash Ceph in Detail
• BlueStore
• Quality of Service (QoS)
• Deduplication for Ceph
• Op Lock Completion
5. Ceph Deployment in SKT Private Cloud
6. Operations & Maintenance Tool
7. The Future of All-flash Ceph Storage
7. 6
All-flash Ceph Storage Cluster Environment
Ceph Clients
• CPU: 2x E5-2660v3
• DRAM: 256GB, Network: Intel 10GbE NIC
• Linux: CentOS 7.0 (w/ KRBD), Kernel: 3.16.4 or 4.1.6
Ceph Cluster Nodes (4)
• CPU: 2x E5-2690v3
• DRAM: 128GB, Network: Intel 2x 10GbE NIC
• Linux: CentOS 7.0, Kernel: 3.10.0-123.el7.x86_64
• Ceph Version: Hammer based
• Journal: NVRAM, Data Store: 10x SATA SSD
Network: 10GbE switches for the service network and the storage network
8. 7
Optimizing All-flash Ceph over the Last Year (Hammer based)
Write path (Client → Messenger → OSD → Journal (NVRAM/DIMM) → Data Store (SSD)):
1. Client sends the write request
2. Primary OSD sends replication to the replica OSDs
3. Write I/O to the journal
4. Update metadata (LevelDB, xattr)
5. Write data to the file system buffers (synced to the data store every 5 secs)
6. Receive the replication reply
7. Send the write ACK to the client
Performance optimization across the Ceph S/W stack and the flash device interface
• Coarse-grained lock overhead improvement
• Logging changed to run asynchronously
• Memory allocator change to reduce CPU usage
• All-flash based system configuration
• Metadata transaction improvement
• Merged writes, removal of useless system calls, cache extension
Performance improvement!!
• Random 4K Write: 42K → 89K IOPS (47K ↑)
• Random 4K Read: 154K → 321K IOPS (167K ↑)
9. 8
Overview of the All-flash Ceph Approach This Year
• Performance: BlueStore, Op Lock Completion, …
• Storage Efficiency: data reduction techniques for flash devices
• Differentiation Feature: Quality of Service (QoS) in a distributed environment
• Management / Operation: SKT Operations & Maintenance Tool
Directions
• Differentiated use of flash: making the best use of flash H/W, e.g. NVDIMM, JBOF
• SK private infra application: improvement of marketability based on internal requirements
• Avoiding duplicated development with the upstream: joint development and a stronger presence in the upstream
10. 9
BlueStore: Next-Generation OSD Backend
FileStore problems
• No transaction atomicity, so journaling is necessary (double write)
• Inefficient object enumeration
BlueStore is an alternative to FileStore with many interesting features
• No file system: data goes straight to the block device, metadata (onode, omap/xattr) to RocksDB on BlueFS
• No journaling, so no double write
• Lighter than FileStore
Layout: OSD → BlueStore → Allocator + RocksDB/BlueFS; WAL/DB can sit on fast SSD/NVM while data goes to the slow device
We are putting much effort into BlueStore in order to improve I/O performance
• Modification of the Allocator
• Exploiting new devices
• Analysis of the internal structure for all-flash
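As a rough illustration of the "no double write" point above, here is a toy C++ model (not Ceph code; the 4 KiB payload and 256-byte metadata sizes are made-up assumptions) comparing the bytes a single object write pushes to the device under a journaled FileStore-style backend and a BlueStore-style backend:

// Toy model only (not Ceph code): bytes pushed to the device for one object
// write under a journaled FileStore-style backend vs. a BlueStore-style
// backend. Payload and metadata sizes below are assumptions.
#include <cstddef>
#include <iostream>

constexpr std::size_t DATA = 4096;  // payload of one object write (assumed)
constexpr std::size_t META = 256;   // size of the metadata update (assumed)

// FileStore style: the whole transaction (data + metadata) is written to the
// journal first, then the data is written again into the backing file system.
std::size_t filestore_bytes() { return (DATA + META) + DATA + META; }

// BlueStore style: the data is written once to the raw device and only the
// metadata goes through the RocksDB WAL, so there is no data double write.
std::size_t bluestore_bytes() { return DATA + META; }

int main() {
    std::cout << "FileStore-style bytes per write: " << filestore_bytes() << "\n"
              << "BlueStore-style bytes per write: " << bluestore_bytes() << "\n";
}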
12. 11
Quality of Service (QoS): dmClock
Ensures specific performance levels for applications and workloads
Three controls
• Reservation: minimum guarantee
• Limit: maximum allowed
• Shares (weight): proportional allocation
Why is storage I/O allocation (QoS) hard?
• Storage workload characteristics are variable
• Available throughput changes with time, so the allocation must be adjusted dynamically
• Distributed, shared access
dmClock algorithm (OSDI '10)
• Real-time tags: separate tags for reservation, shares & limit
• Dynamic tag selection & synchronization
• Accounts for the aggregate service the VM receives from all the servers in the system
* From Ajay Gulati et al., "mClock: Handling Throughput Variability for Hypervisor IO Scheduling", OSDI '10
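For reference, the per-request tag assignment from the dmClock paper looks roughly as follows (our transcription; r_i, w_i, l_i are client i's reservation, weight and limit, t is the arrival time of the k-th request, and rho_i / delta_i are the adjustments described on the next slide):

R_i(k) = max( R_i(k-1) + rho_i / r_i , t )
P_i(k) = max( P_i(k-1) + delta_i / w_i , t )
L_i(k) = max( L_i(k-1) + delta_i / l_i , t )

With rho_i = delta_i = 1 this reduces to the single-server mClock tags.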
13. 12
Quality of Service (QoS)
Each VM (VM 1, VM 2, VM 3) issues I/O through LIBRBD/LIBRADOS in the virtualization container with its own QoS spec (R1/L1/W1, R2/L2/W2, R3/L3/W3):
• Reservation – minimum IOPS
• Limit – maximum IOPS
• Weight – proportional share of capacity
Each request carries its delta/rho counters and is tagged and scheduled by a dmClock queue on every OSD (OSD #1 … OSD #12), replacing the existing ShardData weighted prioritized queue
• Delta: the number of completed IOs
• Rho: the number of IOs completed as part of the reservation (R) phase
Current progress in the Ceph community: two types of dmClock queue
• mClockOpClassQueue: classes are op types (client op, OSD sub op, snap, recovery, scrub)
• mClockClientQueue: classes are the client entity instance plus the op type (client op, OSD sub op, snap, recovery, scrub)
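To make the delta/rho bookkeeping above concrete, below is a minimal self-contained C++ sketch (not the ceph/dmclock library; names and values are illustrative) of how the counters stretch a client's tag spacing when it has already been served by other OSDs:

// Minimal sketch of dmClock tag assignment (illustrative, not Ceph/dmclock code).
#include <algorithm>
#include <cstdio>

struct ClientInfo { double reservation, weight, limit; };   // r_i, w_i, l_i
struct Tags { double resv = 0, prop = 0, lim = 0; };        // tags of a request

// Compute the tags for the next request of one client.
Tags tag_request(const ClientInfo& c, const Tags& prev,
                 double now, unsigned rho, unsigned delta) {
    Tags t;
    t.resv = std::max(prev.resv + rho   / c.reservation, now);  // rho: reservation phase
    t.prop = std::max(prev.prop + delta / c.weight,      now);  // delta: all completed IOs
    t.lim  = std::max(prev.lim  + delta / c.limit,       now);
    return t;
}

int main() {
    ClientInfo vm{100.0, 2.0, 500.0};   // 100 IOPS reserved, weight 2, limit 500
    Tags last;                          // tags of the previous request
    // Since the last request to this OSD the VM completed 5 IOs cluster-wide,
    // 3 of them served in the reservation phase.
    Tags next = tag_request(vm, last, /*now=*/10.0, /*rho=*/3, /*delta=*/5);
    std::printf("resv=%.3f prop=%.3f lim=%.3f\n", next.resv, next.prop, next.lim);
}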
14. 13
Quality of Service (QoS): QoS on All-flash Ceph
What we have done for Ceph QoS
• Improvement of the dmClock algorithm
• Enhancing the utility of the dmClock simulator
• Verifying the implementation of dmClock
Contributions
• Defer tag calculation so it uses up-to-date delta/rho (#4) – figure: before vs. deferred tag calculation
• Add file-based dmClock configuration parameter functionality (#3)
• Etc.: fixes for some missing parts of the simulator and the dmClock algorithm
• …
The future of QoS on All-flash Ceph
• Completion of the remaining parts of Ceph QoS
• Insertion of a flash-device-aware QoS revision mechanism
• Improvement of the current dmClock-based QoS algorithm
15. 14
Deduplication for Ceph
Why deduplication for Ceph?
• Space saving
• Extending SSD lifetime
• Reducing network traffic
Problems
• Hard to implement within the existing source
• Performance
Goals
• Compatibility & simple design (layered approach)
• Reasonable performance (dedup ratio & I/O performance)
Solution: double CRUSH + tiering
• Easy implementation
• Code reuse
17. 16
Deduplication for Ceph: double CRUSH + tiering in detail
• The CRUSH algorithm, the proxy OSD and the replicated OSDs use the same code as original Ceph; only the primary OSD (dedup OSD) is modified
• A write request is chunked and each chunk is fingerprinted (Hash A, Hash B, Hash C, …); chunks that produce the same fingerprint (Hash A appearing several times) are stored only once and merely referenced
• Lookup table: maps an object to its fingerprint, the dedup OSD holding the chunk, and the source chunk offset
  OID   | Fingerprint | D.OSD | S.ChunkOffset
  OID_1 | aabbcc      | OSD_1 | 2
  OID_2 | aabbcf      | OSD_1 | 4
• Dedup table: one entry per fingerprint collision domain, holding the source object (S.OID), the source chunk offset (S.ChunkOffset) and a reference count, plus the dedup object (D.OID) and chunk offset (D.ChunkOffset)
• Data path / ack path: the dedup OSD returns TRUE if the data can be deduplicated, acks with the updated dedup table info, and stores the original data, metadata and dedup metadata (a simplified sketch of the per-chunk flow follows below)
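The per-chunk decision can be sketched as follows; this is an illustrative stand-in for the layered design (not SKT's implementation): a toy fingerprint and an in-memory map play the role of the dedup table, and CRUSH placement of the chunks is omitted.

// Illustrative dedup write path: fingerprint each chunk and either bump the
// reference count of an existing chunk or store it and create a table entry.
#include <cstdint>
#include <functional>
#include <map>
#include <string>
#include <vector>

struct DedupEntry {               // one fingerprint "collision domain"
    std::string stored_oid;       // S.OID: object actually holding the chunk
    uint64_t    chunk_offset;     // S.ChunkOffset
    uint64_t    ref_count = 0;    // how many logical chunks point here
};

// Dedup table keyed by fingerprint; a stand-in for the table on the dedup OSD.
std::map<std::string, DedupEntry> dedup_table;

// Toy fingerprint; a real system would use a cryptographic hash of the chunk.
std::string fingerprint(const std::vector<uint8_t>& chunk) {
    return std::to_string(std::hash<std::string>{}(
        std::string(chunk.begin(), chunk.end())));
}

// Handle one fixed-size chunk of a client write; returns true if it was
// deduplicated against an existing chunk (no new data written).
bool write_chunk(const std::string& oid, uint64_t offset,
                 const std::vector<uint8_t>& chunk) {
    const std::string fp = fingerprint(chunk);
    auto it = dedup_table.find(fp);
    if (it != dedup_table.end()) {
        ++it->second.ref_count;   // duplicate: only reference the stored chunk
        return true;
    }
    // New fingerprint: store the chunk (omitted) and record where it lives.
    dedup_table[fp] = DedupEntry{oid, offset, 1};
    return false;
}

int main() {
    std::vector<uint8_t> a(4096, 0xAA), b(4096, 0xBB);
    write_chunk("OID_1", 0, a);      // stores the chunk
    write_chunk("OID_2", 0, a);      // deduplicated, ref_count becomes 2
    write_chunk("OID_2", 4096, b);   // new chunk
}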
19. 18
Op Lock Completion
AS IS: completions are handled under the coarse PG lock
• issue_repop → journal completion → commit finisher: update the pg status, count journal completions, and when all commits are finished send the ack to the client
• issue_repop → data completion → applied finisher: update the pg status, count data completions, and when all applies are finished send the ack to the client
• Ack from a secondary OSD (reader) → sharded op worker: update the pg status, count data completions, and when all commits or applies are finished send the ack to the client
TO BE: completions are handled under a per-op lock
• issue_repop → journal completion → commit finisher: count journal completions and send the ack to the client when all commits are finished; the pg status is updated lazily
• issue_repop → data completion → applied finisher: count data completions and send the ack to the client when all applies are finished; the pg status is updated lazily
• Ack from a secondary OSD (reader): count data completions and send the ack to the client when all commits or applies are finished
Counting completions and sending the client ack no longer require the PG lock; only the lazy pg status update does (a simplified sketch follows below)
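A hypothetical C++ sketch of the "TO BE" idea (not Ceph source): each in-flight op counts its own journal/data completions with per-op state (atomics here), so the client ack no longer needs the PG-wide lock and the pg status update is deferred.

// Illustrative only: per-op completion counting decoupled from the PG lock.
#include <atomic>
#include <functional>
#include <mutex>

struct PG {
    std::mutex pg_lock;   // coarse lock, now used only for pg status bookkeeping
    void update_status_lazily() { std::lock_guard<std::mutex> l(pg_lock); /* pg bookkeeping */ }
};

struct InFlightOp {
    std::atomic<int> pending_commits;   // local journal + replica commits
    std::atomic<int> pending_applies;   // local apply + replica applies
    std::function<void()> send_ack;     // reply to the client, fired once

    InFlightOp(int commits, int applies, std::function<void()> ack)
        : pending_commits(commits), pending_applies(applies),
          send_ack(std::move(ack)) {}

    void on_commit(PG& pg) {            // commit finisher path
        if (pending_commits.fetch_sub(1) == 1)
            send_ack();                 // all commits done: ack without the PG lock
        pg.update_status_lazily();      // status update deferred to the PG lock
    }
    void on_apply(PG& pg) {             // applied finisher path
        if (pending_applies.fetch_sub(1) == 1) { /* mark op fully applied */ }
        pg.update_status_lazily();
    }
};

int main() {
    PG pg;
    InFlightOp op(/*commits=*/3, /*applies=*/3, [] { /* send write ack to client */ });
    op.on_commit(pg); op.on_commit(pg); op.on_commit(pg);   // third commit fires the ack
}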
20. 19
Ceph Deployment in SKT Private Cloud
Deployed as high-performance block storage in the private cloud: OpenStack Cinder on top of the Ceph OSDs, running on general servers with SSD arrays and scaling out for capacity & performance
Measurement: 4KB random write with a per-VM cap of 1,000 IOPS
  Number of VMs | IOPS per VM | Latency (msec)
  10            | 1,000       | 1.3
  20            | 1,000       | 1.2
  40            | 1,000       | 1.3
  60            | 898         | 2.2
  80            | 685         | 2.9
21. 20
Operations & Maintenance
• Real Time Monitoring
• Multi Dashboard
• Rule Base Alarm
• Drag & Drop Admin
• Dashboard Configuration
• Rest API
• Real Time Graph
• Graph Merge
• Drag & Zooming
• Auto Configuration
• Cluster Management
• RBD Management
• Object Storage Management
22. 21
The Future of All-flash Ceph
Fully exploit NVRAM/SSD for performance; extend the application of All-flash Ceph in the SKT cloud infrastructure
NV-Array (JBOF)
• High performance (PCIe 3.0, ~6M IOPS)
• High density (20x 2.5" NVMe SSDs: up to 80TB in 1U)
• SSD hot-swap
• Expected 4Q 2016
NVMe SSD (U.2, 2.5" type)
• Global top performance under the 25W standard
• High reliability with LDPC ECC and in-storage RAID
• NVMe 1.2 standard, PCIe Gen 3.0
• 3.2TB (sample, Mar 2016), 6.4TB (Oct 2016, with 3D TLC)
25. 24
Appendix #2: Quality of Service (QoS) Priority Queue in Detail
• The simulated server keeps one ClientRec per client (Client #0, Client #1, … Client #n); each ClientRec holds that client's outstanding requests in std::deque<ClientReq> requests
• requests.front().tags carries the three tags of the next request: resv, limit, prop
• PriorityQueue priority_queue maintains three intrusive heaps over the ClientRecs: ::resv_heap, ::limit_heap and ::ready_heap; each ClientRec embeds IndIntruHeapData reserv_heap_data, lim_heap_data, ready_heap_data, so its node id (::*heap_info) is saved on each heap
• Selection: resv_heap yields the ClientRec with the minimum reservation tag; ready_heap yields the ClientRec with the minimum prop tag whose limit tag < now; limit_heap tracks the ClientRec with the minimum limit tag whose limit tag > now
• The selected ClientRec's front request is sent to the server
• do_clean() periodically erases a ClientRec from the client map or marks it idle
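A minimal sketch of the selection logic above (not the actual dmclock code; the three heaps are replaced by a linear scan for brevity): serve a client whose reservation tag is already due first, otherwise pick by proportional tag among clients whose limit tag has passed.

// Illustrative request selection in the spirit of the dmClock priority queue.
#include <cstdio>
#include <deque>
#include <map>
#include <optional>
#include <string>

struct Tags { double resv, prop, lim; };
struct ClientRec { std::deque<Tags> requests; };   // front() carries the tags

using ClientMap = std::map<std::string, ClientRec>;

// Return the id of the client whose front request should be served next,
// or nothing if every queued request is still blocked by its limit tag.
std::optional<std::string> pull_next(const ClientMap& clients, double now) {
    std::optional<std::string> by_resv, by_prop;
    for (const auto& [id, rec] : clients) {
        if (rec.requests.empty()) continue;
        const Tags& t = rec.requests.front();
        // resv_heap role: smallest reservation tag that is already due.
        if (t.resv <= now &&
            (!by_resv || t.resv < clients.at(*by_resv).requests.front().resv))
            by_resv = id;
        // ready_heap role: smallest prop tag among clients whose limit has passed.
        if (t.lim < now &&
            (!by_prop || t.prop < clients.at(*by_prop).requests.front().prop))
            by_prop = id;
    }
    return by_resv ? by_resv : by_prop;   // reservation phase first, then weights
}

int main() {
    ClientMap m;
    m["vm1"].requests.push_back({5.0, 7.0, 4.0});   // reservation tag already due
    m["vm2"].requests.push_back({9.0, 6.0, 3.0});
    std::optional<std::string> next = pull_next(m, /*now=*/6.0);
    if (next) std::printf("serve %s first\n", next->c_str());   // vm1 (resv 5 <= 6)
}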