AFCeph: SKT Scale Out Storage Ceph
Problem, Progress, and Plan
Myoungwon Oh, Byung-Su Park
SDS Tech. Lab,
Network IT Convergence R&D Center
SK Telecom
1
Why we care about All-Flash Storage …
5G, UHD, 4K
Flash device: High Performance, Low Latency, SLA
2
Transforming to 5G Network
New ICT infrastructure should be Programmable, Scalable, Flexible, and Cost Effective
Software Defined Technologies based on Open Software & Open Hardware
5G: Massive Connectivity, 10x Lower Latency, 100x-1000x Higher Speed, Efficiency & Reliability, Virtualization
3
Open HW & SW Projects @ SKT
Open Software
OpenStack, ONOS, Ceph, Cloud Foundry, Hadoop …
Open Hardware
Open Compute Project (OCP), Telecom Infra Project (TIP)
All-flash Storage, Server Switch, Telco Specific H/W…
Software-Defined Technologies
4
Why we care about All-Flash Ceph …
Scalable, Available, Reliable, Unified Interface, Open Platform
+ High Performance, Low Latency
= All-flash Ceph!
5
Agenda
1. All-flash Ceph Storage Cluster Environment
2. Optimizing All-flash Ceph during Last Year
3. Overall of All-flash Ceph Approach in This Year
4. All-flash Ceph in Detail
• BlueStore
• Quality of Service (QoS)
• Deduplication for Ceph
• OP Lock Completion
5. Ceph Deployment in SKT Private Cloud
6. Operations & Maintenance Tool
7. The Future of All-flash Ceph Storage
6
All-flash Ceph Storage Cluster Environment
Ceph Node Cluster (4) and Ceph Clients, connected by two 10GbE network switches (Storage Network / Service Network)
• CPU: 2x E5-2660v3, DRAM: 256GB, Network: Intel 10GbE NIC, Linux: CentOS 7.0 (w/ KRBD), Kernel: 3.16.4 or 4.1.6
• CPU: 2x E5-2690v3, DRAM: 128GB, Network: Intel 2x 10GbE NIC, Linux: CentOS 7.0, Kernel: 3.10.0-123.el7.x86_64
• Ceph Version: Hammer version based
• Ceph node storage: NVRAM journal, 10x SATA SSD data store
7
Optimizing All-flash Ceph during Last Year (Ver. Hammer)
[Diagram] Write path through the Ceph S/W stack (Client → Messenger → OSD → flash device interface), with the journal on NVRAM/DIMM, the data store on SSD, and a replica OSD (a minimal sketch of this flow follows the slide):
1. Send write request
2. Send replication to the replica OSD
3. Write IO to journal
4. Update metadata (LevelDB, xattr)
5. Write data (file system buffers in DRAM, synced every 5 secs)
6. Receive replication reply
7. Send write ACK

Performance Optimization
• All-flash based system configuration
• Coarse-grained lock overhead improvement
• Change logging to proceed asynchronously
• Memory allocator change to reduce CPU usage
• Metadata transaction improvement: merged writes, removal of useless system calls, cache extension

Performance Improvement!!
Random 4K Write: 42K → 89K IOPS (47K ↑)
Random 4K Read: 154K → 321K IOPS (167K ↑)
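Below is a minimal, hypothetical C++ sketch of the replicated write flow above (illustrative names only, not the actual Hammer OSD code): the primary fans the write out to the replica and its local journal, and acks the client only once every commit has been counted.

// Hypothetical sketch of the replicated write flow above (steps 1-7).
// Names and structure are illustrative only, not the real Ceph Hammer code.
#include <cstdint>
#include <functional>
#include <memory>
#include <vector>

struct WriteOp { uint64_t oid; std::vector<char> data; };

struct ReplicaStub {
  // Steps 2/6: send replication and invoke the callback when the reply arrives.
  void replicate(const WriteOp&, std::function<void()> on_commit) { on_commit(); }
};

class PrimaryOsdSketch {
public:
  // Step 1: write request arrives from the client via the Messenger.
  void handle_write(const WriteOp& op, std::function<void()> send_ack) {
    auto pending = std::make_shared<int>(static_cast<int>(replicas.size()) + 1);
    auto on_commit = [pending, send_ack] {
      if (--*pending == 0) send_ack();          // Step 7: ack once all commits land
    };
    for (auto& r : replicas) r.replicate(op, on_commit);  // Step 2
    journal_write(op, on_commit);               // Step 3: journal on NVRAM/DIMM
    update_metadata(op);                        // Step 4: LevelDB omap / xattr
    stage_data(op);                             // Step 5: FS buffers, synced ~every 5 s
  }
  std::vector<ReplicaStub> replicas;
private:
  void journal_write(const WriteOp&, std::function<void()> cb) { cb(); }
  void update_metadata(const WriteOp&) {}
  void stage_data(const WriteOp&) {}
};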
8
Overall of All-flash Ceph Approach in This Year
• Performance: BlueStore, Op Lock Completion, …
• Storage Efficiency: data reduction techniques for flash devices
• Differentiation Feature: Quality of Service (QoS) in a distributed environment
• Management / Operation: SKT Operations & Maintenance Tool

Differentiated Use of Flash: making the best use of flash H/W, e.g. NVDIMM, JBOF
SK Private Infra Application: improvement of marketability with internal requirements
Avoiding Duplicated Development with the Upstream: joint development and a strengthened presence in the upstream
9
BlueStore: Next Generation OSD Backend
• Alternative to FileStore, which lacks transaction atomicity (making journaling necessary) and enumerates objects inefficiently
• Has many interesting features
  - No file system
  - No journaling (no double write)
  - Lighter than FileStore
• We are putting much effort into BlueStore in order to improve I/O performance (a simplified write sketch follows this slide)
  - Modification of the Allocator
  - Exploiting new devices
  - Analysis of the internal structure for all-flash
[Diagram] OSD → BlueStore (Data, Omap/XATTR, ONode, Allocator; Slow / WAL / DB devices) → RocksDB + BlueFS → SSD, NVM
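To illustrate the "no double write" point, here is a simplified, hypothetical C++ sketch of a BlueStore-style write (names and structure are assumptions for illustration; real BlueStore also handles deferred writes, WAL, blobs, and caching): data goes once to the raw device via the allocator, and only metadata is committed through a RocksDB-style key-value transaction.

// Simplified, hypothetical sketch of a BlueStore-style write transaction.
#include <cstdint>
#include <map>
#include <string>
#include <vector>

struct Extent { uint64_t offset, length; };          // location on the raw SSD/NVM

class BlueStoreSketch {
public:
  void write(const std::string& object, const std::vector<char>& data) {
    // 1. The Allocator hands out raw block-device space (no file system underneath).
    Extent e = allocate(data.size());
    // 2. Data is written once, directly to the device at the allocated extent.
    device_write(e, data);
    // 3. Metadata (onode -> extent mapping) goes into a RocksDB-style KV batch;
    //    committing the batch makes the operation atomic, so no extra journal
    //    copy of the data is needed.
    kv_batch[object] = e;
    kv_commit();
  }
private:
  uint64_t next_free = 0;
  std::map<std::string, Extent> kv_batch;
  Extent allocate(uint64_t len) { Extent e{next_free, len}; next_free += len; return e; }
  void device_write(const Extent&, const std::vector<char>&) {}
  void kv_commit() {}
};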
10
BlueStore Upstream Contribution
• Contribution: os/bluestore: add multiple finishers to bluestore #10567 (the pattern is sketched below)
[Chart] BlueStore Performance Improvement, kIOPS and Latency (ms), comparing the 07/20 and 08/19 builds: roughly 30% higher IOPS ("30% Up!")
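The idea behind the patch, roughly, is to stop funneling every completion callback through a single finisher thread. A hedged C++ sketch of that pattern follows (illustrative only; the class names here are assumptions, and the real change hooks into BlueStore's own structures).

// Illustrative sketch of spreading completion callbacks over several finisher
// threads instead of one; not the actual code from PR #10567.
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

class Finisher {                          // one worker draining a callback queue
public:
  Finisher() : worker([this] { run(); }) {}
  ~Finisher() { { std::lock_guard<std::mutex> l(m); stop = true; } cv.notify_one(); worker.join(); }
  void queue(std::function<void()> fn) {
    { std::lock_guard<std::mutex> l(m); q.push_back(std::move(fn)); }
    cv.notify_one();
  }
private:
  void run() {
    std::unique_lock<std::mutex> l(m);
    while (!stop || !q.empty()) {
      if (q.empty()) { cv.wait(l); continue; }
      auto fn = std::move(q.front()); q.pop_front();
      l.unlock(); fn(); l.lock();
    }
  }
  std::mutex m;
  std::condition_variable cv;
  bool stop = false;
  std::deque<std::function<void()>> q;
  std::thread worker;
};

class FinisherShards {
public:
  explicit FinisherShards(size_t n) : shards(n) {}
  // Completions for the same key (e.g. one write sequencer) stay ordered on one
  // finisher; completions for different keys run their callbacks in parallel.
  void queue(size_t key, std::function<void()> fn) {
    shards[key % shards.size()].queue(std::move(fn));
  }
private:
  std::vector<Finisher> shards;
};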
11
Quality of Service (QoS): dmClock
• Ensures specific performance levels for applications and workloads
• Three controls
  - Reservation: minimum guarantee
  - Limit: maximum allowed
  - Shares: proportional allocation
• Why is storage IO allocation (QoS) hard?
  - Storage workload characteristics are variable
  - Available throughput changes with time
  - Allocation must be adjusted dynamically
  - Distributed shared access
• dmClock Algorithm (OSDI '10)
  - Real-time tags (see the tag sketch below)
  - Separate tags for reservation, shares & limit
  - Dynamic tag selection & synchronization
  - Aggregates the service received by the VM from all the servers in the system
* From Ajay Gulati, mClock: Handling Throughput Variability for Hypervisor IO Scheduling, OSDI '10
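As a concrete reference, the dmClock tag assignment from the OSDI '10 paper can be sketched as follows (a hedged C++ sketch, not Ceph's dmclock library; names are illustrative). Each arriving request carries rho and delta, and the server advances the client's reservation, limit, and proportional-share tags by rho/r, delta/l, and delta/w respectively, never letting them fall behind the current time.

// Hedged sketch of dmClock tag assignment as described in the OSDI '10 paper.
// For client i with reservation r_i, weight w_i, and limit l_i:
//   R = max(R_prev + rho   / r_i, now)
//   L = max(L_prev + delta / l_i, now)
//   P = max(P_prev + delta / w_i, now)
#include <algorithm>

struct ClientInfo { double reservation, weight, limit; };   // r_i, w_i, l_i
struct Tags { double R = 0, L = 0, P = 0; };

Tags assign_tags(const Tags& prev, const ClientInfo& c,
                 unsigned rho, unsigned delta, double now) {
  Tags t;
  t.R = std::max(prev.R + rho   / c.reservation, now);  // reservation tag
  t.L = std::max(prev.L + delta / c.limit,       now);  // limit tag
  t.P = std::max(prev.P + delta / c.weight,      now);  // proportional-share tag
  return t;
}
// Scheduling then serves requests with R <= now first (reservation phase) and
// otherwise the smallest P among clients whose L tag is not in the future.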
12
Quality of Service (QoS): Current Progress in Ceph Community
[Diagram] VM 1-3 run on LIBRBD / LIBRADOS inside a virtualization container, each with its own settings (R1 L1 W1), (R2 L2 W2), (R3 L3 W3), and issue requests to OSD #1-#12; on each OSD the existing ShardData weighted prioritized queue is replaced by a dmClock queue, and queued requests (1', 6', 6'', 9'', 11', 11'') carry their Delta/Rho values.
• Reservation (R) – minimum IOPS
• Limit (L) – maximum IOPS
• Weight (W) – proportional share of capacity
• Delta – the number of completed IOs
• Rho – the number of IOs completed as part of the reservation (R) phase
Two types of dmClock queue (the client-side Delta/Rho bookkeeping is sketched below):
• mClockOpClassQueue: Client Op, OSD Sub Op, Snap, Recovery, Scrub
• mClockClientQueue: Client Entity Instance & Client Op, OSD Sub Op, Snap, Recovery, Scrub
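A hedged sketch of the client-side Delta/Rho bookkeeping referred to above (hypothetical C++ types, not the dmclock library): Delta counts every IO completed for the client, Rho only those the server reported as served from the reservation phase, and each request to a server carries the increase in both counters since the previous request to that same server.

// Hedged sketch of client-side Delta/Rho tracking; illustrative only.
#include <cstdint>
#include <map>

enum class Phase { Reservation, Priority };

struct DeltaRho { uint32_t delta, rho; };

class ServiceTrackerSketch {
public:
  void on_response(Phase phase) {                 // some server completed an IO
    ++total_delta;
    if (phase == Phase::Reservation) ++total_rho;
  }
  DeltaRho take_for_request(uint64_t server_id) { // piggy-back on the next op
    Snapshot& prev = last_sent[server_id];
    DeltaRho out{total_delta - prev.delta, total_rho - prev.rho};
    prev = {total_delta, total_rho};
    return out;
  }
private:
  struct Snapshot { uint32_t delta = 0, rho = 0; };
  uint32_t total_delta = 0, total_rho = 0;
  std::map<uint64_t, Snapshot> last_sent;         // per-server snapshot at last send
};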
13
Quality of Service (QoS): QoS on All-flash Ceph
• What we have done for Ceph QoS
  - Improvement of the dmClock algorithm
  - Enhancing the utility of the dmClock simulator
  - Verifying the implementation of dmClock
• Contribution
  - Defer tag calculation with up-to-date delta/rho #4
  - Add file based dmclock configuration parameter functionality #3
  - Etc.: modifications for some missing parts of the simulator & dmClock algorithm, …
[Figure] Tag calculation before vs. with deferred tag calculation
• The Future of QoS on All-flash Ceph
  - Completion of the remaining parts of Ceph QoS
  - Insertion of a flash-device-aware QoS revision mechanism
  - Improvement of the current dmClock-based QoS algorithm
14
Deduplication for Ceph
• Deduplication for Ceph
  - Space saving
  - Extending SSD lifetime
  - Reducing network traffic
• Problem
  - Hard to implement within the existing source code
  - Performance
• Goal
  - Compatibility & simple design (layered approach)
  - Reasonable performance (dedup ratio & I/O performance)
• Solution: Double CRUSH + tiering
  - Easy implementation
  - Code reuse
15
Deduplication for Ceph: Double CRUSH
[Diagram] Write(OID, Data) → CRUSH algorithm (OID) → OSD: the proxy OSD in the cache tier receives Write(OID, PG ID, Data) and performs fingerprinting & chunking, checks the dedup metadata, and manages the dedup reference list; it then issues Write(Hash, new PG ID, Data) → CRUSH algorithm (Hash) → OSD: the primary OSD in the storage tier (placement sketched below).
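A minimal C++ sketch of the double-CRUSH placement above, under the assumption that any deterministic key-to-OSD mapping can stand in for CRUSH (illustrative only, not the prototype's code): the first placement is keyed by the object ID, the second by the content fingerprint, so identical chunks converge on the same dedup OSD.

// Illustrative sketch of double CRUSH placement for dedup; not real CRUSH.
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

using OsdId = uint32_t;

// Stand-in for CRUSH: any deterministic mapping from a key to an OSD works here.
OsdId crush(const std::string& key, uint32_t num_osds) {
  return static_cast<OsdId>(std::hash<std::string>{}(key) % num_osds);
}

std::string fingerprint(const std::vector<char>& chunk) {
  // Placeholder for a real content hash (e.g. SHA-1/SHA-256).
  return std::to_string(std::hash<std::string>{}(std::string(chunk.begin(), chunk.end())));
}

struct Placement { OsdId proxy_osd; OsdId dedup_osd; std::string hash; };

Placement place_write(const std::string& oid, const std::vector<char>& chunk,
                      uint32_t cache_osds, uint32_t storage_osds) {
  Placement p;
  p.proxy_osd = crush(oid, cache_osds);        // 1st CRUSH: Write(OID, PG ID, Data)
  p.hash = fingerprint(chunk);                 // fingerprinting & chunking on the proxy
  p.dedup_osd = crush(p.hash, storage_osds);   // 2nd CRUSH: Write(Hash, new PG ID, Data)
  return p;
}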
16
Deduplication for Ceph
[Diagram] Write path with dedup: a write request passes through CRUSH (same code as original Ceph) to the proxy OSD, whose modified parts chunk the object and fingerprint each chunk (Hash A-E); the primary OSD (dedup OSD) returns TRUE if the data can be deduped, the replicated OSDs (same code as original Ceph) store the original data, metadata, and dedup metadata, and the ack path returns the updated dedup table info.

<Table> Lookup Table
OID    Fingerprint  D.OSD  S.ChunkOffset
OID_1  aabbcc       OSD_1  2
OID_2  aabbcf       OSD_1  4

<Figure> Dedup Table: within each collision domain, an entry keeps S.OID, S.ChunkOffset, and a reference count; the lookup side keeps D.OID and D.ChunkOffset.
17
Op Lock Completion
[Diagram] Write path on the primary and secondary OSD: Messenger → PG → Journal → FileStore, each followed by log operation + metadata processing and completion processing; the numbered steps (1-8) mark the measurement points in the table below.

Configuration                                                BW (MB/s)  IOPS   Latency (ms)  CPU
Messenger                                                    845        216K   2.6           61%
PG                                                           785        201K   2.8           73%
Before Journal (simulation)                                  761        194K   2.9           76%
Log + metadata processing                                    664        170K   3.38          77%
Minimizing PG Lock completion + parallel issue & completion  643        164K   3.5           80%
Parallel issue & completion                                  518        132K   4.2           78%
Before Journal (real code)                                   416        106K   5.4           78%
Only journal (only data is written in journal)               363        93K    6.1           80%
No modification                                              357        88K    6.4           80%
18
Op Lock Completion
AS IS (PG Lock)
• Issue_repop → journal completion? → Commit finisher: updating pg status, counting journal completions; if all commits are finished, send ack to client
• Issue_repop → data completion? → Applied finisher: updating pg status, counting data completions; if all applies are finished, send ack to client
• Ack from secondary OSD (reader) → Sharded Op worker: updating pg status, counting data completions; if all commits or applies are finished, send ack to client

TO BE (OP Lock)
• Issue_repop → journal completion? → Commit finisher: lazy updating of pg status, counting journal completions; if all commits are finished, send ack to client
• Issue_repop → data completion? → Applied finisher: lazy updating of pg status, counting data completions; if all applies are finished, send ack to client
• Ack from secondary OSD (reader) → counting data completions; if all commits or applies are finished, send ack to client

(A sketch of the per-op completion counting follows.)
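Below is a hedged C++ sketch of the per-op completion counting in the TO BE path (hypothetical types, not the actual OSD patch): completions are counted under a small per-op lock, and the shared PG state is only touched lazily, so commit/apply finishers and replica acks no longer serialize on the PG lock.

// Hypothetical sketch of per-op (OP Lock) completion counting.
#include <functional>
#include <mutex>

class InFlightOp {
public:
  InFlightOp(int commits_expected, int applies_expected,
             std::function<void()> ack_commit, std::function<void()> ack_apply)
    : commits_left(commits_expected), applies_left(applies_expected),
      on_all_commit(std::move(ack_commit)), on_all_apply(std::move(ack_apply)) {}

  void commit_done() {                  // journal completion or secondary-OSD commit ack
    bool done;
    { std::lock_guard<std::mutex> l(op_lock); done = (--commits_left == 0); }
    if (done) on_all_commit();          // send commit ack to the client
  }
  void apply_done() {                   // data completion or secondary-OSD apply ack
    bool done;
    { std::lock_guard<std::mutex> l(op_lock); done = (--applies_left == 0); }
    if (done) on_all_apply();           // send apply ack; PG status is updated lazily
  }
private:
  std::mutex op_lock;                   // per-op lock instead of the PG lock
  int commits_left, applies_left;
  std::function<void()> on_all_commit, on_all_apply;
};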
19
Ceph Deployment in SKT Private Cloud
• Deployed for high performance block storage in the private cloud
[Diagram] OpenStack Cinder on top of Ceph OSDs running on general servers with SSD arrays; scale-out for capacity & performance.

4KB Random Write, CAP = 1,000 IOPS per VM:
Number of VMs    10     20     40     60     80
IOPS per VM      1,000  1,000  1,000  898    685
Latency (msec)   1.3    1.2    1.3    2.2    2.9
20
Operations & Maintenance
• Real Time Monitoring
• Multi Dashboard
• Rule-Based Alarm
• Drag & Drop Admin
• Dashboard Configuration
• REST API
• Real Time Graph
• Graph Merge
• Drag & Zooming
• Auto Configuration
• Cluster Management
• RBD Management
• Object Storage Management
21
The Future of All-flash Ceph
• Fully exploits NVRAM/SSD for performance
• Extended application of All-flash Ceph in the SKT cloud infrastructure

NV-Array (JBOF)
• High Performance (PCIe 3.0, ~6M IOPS)
• High Density (20x 2.5" NVMe SSDs: up to 80TB in 1U)
• SSD Hot-Swap
• '16 4Q expected

NVMe SSD, U.2 (2.5") Type
• Global top performance under the 25W standard
• High reliability w/ LDPC ECC, In-Storage RAID
• NVMe 1.2 standard, PCIe Gen 3.0
• 3.2TB ('16.3 sample), 6.4TB ('16.10 w/ 3D TLC)
22
Thank you
Contact Info.
Myoungwon Oh, omwmw@sk.com
Byung-Su Park, bspark8@sk.com
23
Appendix #1: BlueStore – bluestore_use_multiple_finishers

[Before] op_tp (*) → kv_sync_thread (1) → fn_anonymous (1) → Ack
kv_sync_thread work: flush, kv transactions, balancefs, wal_apply
Default options:
bluestore_sync_transaction = false
bluestore_sync_submit_transaction = false
bluestore_sync_wal_apply = true
bluestore_use_multiple_finishers = false

[After] op_tp (*) → kv_sync_thread (1) → fn_anonymous (*) → Ack
kv_sync_thread work: flush, kv transactions, balancefs, wal_apply
Options:
bluestore_sync_transaction = false
bluestore_sync_submit_transaction = false
bluestore_sync_wal_apply = true
bluestore_use_multiple_finishers = true
24
Appendix #2: Quality of Service (QoS) Priority Queue in detail

SimulatedServer #
• ClientRec #0, #1, #2, … #n, each holding std::deque<ClientReq> requests
• requests.front().tags: resv, limit, prop
• IndIntruHeapData reserv_heap_data, lim_heap_data, ready_heap_data → save the node id (::*heap_info) on each heap
• PriorityQueue priority_queue with ::resv_heap, ::limit_heap, ::ready_heap; the ClientRec with the min tag value is picked and its request is sent
  - ready_heap: the ClientRec with the min prop tag value && limit tag < now
  - limit_heap: the ClientRec with the min limit tag value && limit tag > now
• do_clean(): erase a ClientRec from the client map or mark it idle periodically
• Client #0, Client #1, Client #2, … Client #n send requests to the server
(The heap-selection logic is sketched below.)
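A hedged sketch of how the three heaps cooperate when the next request is pulled (illustrative C++ only, not the dmclock PriorityQueue implementation; linear scans stand in for the intrusive heaps): reservation-due clients are served first, then the smallest proportional-share tag among clients whose limit tag is not in the future.

// Illustrative heap-selection logic for the dmClock priority queue sketch.
#include <optional>
#include <vector>

struct ClientRecSketch {
  int id;
  double resv_tag, prop_tag, limit_tag;
  bool has_request;
};

std::optional<int> pull_next(const std::vector<ClientRecSketch>& clients, double now) {
  const ClientRecSketch* best = nullptr;
  // 1. resv_heap equivalent: smallest reservation tag that is already due.
  for (const auto& c : clients)
    if (c.has_request && c.resv_tag <= now && (!best || c.resv_tag < best->resv_tag))
      best = &c;
  if (best) return best->id;
  // 2. ready_heap equivalent: smallest prop tag among clients under their limit.
  for (const auto& c : clients)
    if (c.has_request && c.limit_tag <= now && (!best || c.prop_tag < best->prop_tag))
      best = &c;
  if (best) return best->id;
  // 3. Everyone is limited (limit_heap holds them); nothing to serve right now.
  return std::nullopt;
}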