Ceph Day Beijing - Ceph all-flash array design based on NUMA architecture
1. Ceph All-Flash Array Design
Based on NUMA Architecture
QCT (Quanta Cloud Technology)
Marco Huang, Technical Manager
Becky Lin, Program Manager
2. • All-flash Ceph and Use Cases
• QCT QxStor All-flash Ceph for IOPS
• QCT Lab Environment Overview & Detailed Architecture
• Importance of NUMA and Proof Points
Agenda
3. QCT Powers Most of Cloud Services
Global Tier 1 Hyperscale Datacenters, Telcos and Enterprises
• QCT (Quanta Cloud Technology) is a subsidiary of Quanta Computer
• Quanta Computer is a Fortune Global 500 company with over $32B in revenue
4. Why All-flash Storage?
• Falling flash prices: flash prices fell as much as 75% over the 18 months leading up to mid-2016, and the trend continues. ("TechRepublic: 10 storage trends to watch in 2016")
• Flash is 10x cheaper than DRAM: with persistence and high capacity ("NetApp")
• Flash is 100x cheaper than disk: pennies per IOPS vs. dollars per IOPS ("NetApp")
• Flash is 1000x faster than disk: latency drops from milliseconds to microseconds ("NetApp")
• Flash performance advantage: HDDs have an advantage in $/GB, while flash has an advantage in $/IOPS. ("TechTarget: Hybrid storage arrays vs. all-flash arrays: A little flash or a lot?")
• NVMe-based storage trend: 60% of enterprise storage appliances will have NVMe bays by 2020 ("G2M Research")
Mission-critical apps need performance-optimized storage and require sub-millisecond latency; flash capacity keeps gaining while the price drops.
6. QCT QxStor Red Hat Ceph Storage Edition
Optimized for workloads

Throughput Optimized: QxStor RCT-200 (D51PH-1ULH)
• Densest 1U Ceph building block
• Smaller failure domain
• 3x SSD S3710 journal
• 12x HDD 7.2krpm

Throughput Optimized: QxStor RCT-400 (T21P-4U)
• Obtain best throughput & density at once
• Scale at high scale: 700TB
• 2x 2x NVMe P3700 journal
• 2x 35x HDD

USE CASE: Block or Object Storage, Video, Audio, Image, Streaming media, Big Data; 3x replication

Cost/Capacity Optimized: QxStor RCC-400 (T21P-4U)
• Maximize storage capacity
• Highest density: 560TB* raw capacity per chassis
• 2x 35x HDD

USE CASE: Object storage, Archive, Backup, Enterprise Dropbox; Erasure coding
* Optional model, one MB per chassis, can support 620TB raw capacity

IOPS Optimized: QxStor RCI-300 (D51BP-1U)
• All Flash Design
• Lowest latency
• 4x P3520 2TB or 4x P3700 1.6TB

USE CASE: Database, HPC, Mission Critical Applications; 2x replication
7.
QCT QxStor RCI-300
All-Flash Ceph Design for I/O-Intensive Workloads
SKU1: All-flash Ceph - the Best IOPS SKU
• Ceph Storage Server: D51BP-1U
• CPU: 2x E5-2995 v4 or higher
• RAM: 128GB
• NVMe SSD: 4x P3700 1.6TB
• NIC: 10GbE dual port or 40GbE dual port
SKU2: All-flash Ceph - IOPS/Capacity Balanced SKU (best TCO, as of today)
• Ceph Storage Server: D51BP-1U
• CPU: 2x E5-2980 v4 or higher core count
• RAM: 128GB
• NVMe SSD: 4x P3520 2TB
• NIC: 10GbE dual port or 40GbE dual port
NUMA-Balanced Ceph Hardware
Highest IOPS & Lowest Latency
Optimized Ceph & HW Integration for IOPS-Intensive Workloads
8.
NVMe: Best-in-Class IOPS, Lower/Consistent Latency
Lowest Latency of Standard Storage Interfaces
[Chart: IOPS, 4K random workloads (100% read, 70% read, 0% read), PCIe/NVMe vs. SAS 12Gb/s]
3x better IOPS vs. SAS 12Gb/s. For the same number of CPU cycles, NVMe delivers over 2x the IOPS of SAS!
Gen1 NVMe has 2 to 3x better latency consistency vs. SAS.
Test and system configurations: PCI Express* (PCIe*)/NVM Express* (NVMe) measurements made on an Intel® Core™ i7-3770S system @ 3.1GHz with 4GB memory running the Windows* Server 2012 Standard OS, Intel PCIe/NVMe SSDs, data collected with the IOmeter* tool. SAS measurements from HGST Ultrastar* SSD800M/1000M (SAS), SATA S3700 Series. For more complete information about performance and benchmark results, visit http://www.intel.com/performance. Source: Intel internal testing.
10. 5-Node All-NVMe Ceph Cluster
Ceph storage nodes: 5x QuantaGrid D51BP-1U, each with dual Xeon E5-2699 v4 @ 2.3GHz (88 HT) and 128GB DDR4, running RHEL 7.3 (kernel 3.10) and Red Hat Ceph Storage 2.1; 4x NVMe SSDs (NVMe1-4) and 16 Ceph OSDs (OSD1-16) per node.
Cluster totals: 20x 2TB P3520 SSDs, 80 OSDs, 2x replication, 19TB effective capacity; tests run at 82% cluster fill level.
Client systems: 10x clients, each with dual Xeon E5-2699 v4 @ 2.3GHz, 88 HT, 128GB DDR4, acting as Ceph RBD clients; Docker1 and Docker2 (krbd) run Percona DB servers, Docker3 and Docker4 run Sysbench clients.
Sysbench containers: 16 vCPUs, 32GB RAM, FIO 2.8, Sysbench 0.5.
DB containers: 16 vCPUs, 32GB RAM, 200GB RBD volume, 100GB MySQL dataset, InnoDB buffer cache 25GB (25%).
Network: 10GbE public network, 10GbE cluster network.
Detailed System Architecture in QCT Lab
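For context, a Sysbench 0.5 OLTP run against one of the Percona containers looks roughly like the sketch below. The host, credentials, Lua script path and table sizes are illustrative placeholders, not the exact values used in the QCT lab.

  # Populate the test schema, then drive the mixed read/write OLTP workload
  sysbench --test=/usr/share/doc/sysbench/tests/db/oltp.lua \
      --mysql-host=<percona-container-ip> --mysql-user=sbtest --mysql-password=<password> \
      --oltp-tables-count=10 --oltp-table-size=20000000 \
      --num-threads=16 --max-time=300 --max-requests=0 prepare
  # ...then replace "prepare" with "run" to execute the benchmark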
11.
Stage / Test Subject / Benchmark Tool / Major Task
• I/O Baseline / Raw disk / FIO: determine maximum server IO backplane bandwidth
• Network Baseline / NIC / iPerf: ensure consistent network bandwidth between all nodes
• Bare-Metal RBD Baseline / librbd / FIO (CBT): use the FIO RBD engine to test performance through librbd
• Docker Container OLTP Baseline / Percona DB + Sysbench / Sysbench/OLTP: establish the number of workload-driver VMs desired per client
Benchmark criteria:
1. Default: ceph.conf
2. Software-level tuning: ceph.conf tuned
3. Software + NUMA CPU pinning: ceph.conf tuned + NUMA CPU pinning
Benchmark Methodology
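As an illustration of the first three stages, minimal FIO and iPerf invocations of the kind used for these baselines are sketched below. Device names, pool/image names and the peer address are placeholders, not the exact QCT lab values.

  # Stage 1: raw-disk baseline, 4K random read directly against one NVMe device
  fio --name=raw-baseline --filename=/dev/nvme0n1 --ioengine=libaio --direct=1 \
      --rw=randread --bs=4k --iodepth=32 --numjobs=4 --runtime=300 --time_based

  # Stage 2: network baseline between a client and an OSD node (run "iperf -s" on the peer)
  iperf -c <osd-node-ip> -P 4 -t 60

  # Stage 3: bare-metal RBD baseline using FIO's RBD engine (librbd), as orchestrated by CBT
  fio --name=rbd-baseline --ioengine=rbd --clientname=admin --pool=rbd --rbdname=testimage \
      --rw=randread --bs=4k --iodepth=32 --runtime=300 --time_based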
12.
• Use faster media for journals, metadata
• Use recent Linux kernels
– blk-mq support packs big performance gains with NVMe media
– optimizations for non-rotational media
• Use tuned where available
– adaptive latency performance tuning [2]
• Virtual memory, network and storage tweaks
– use commonly recommended VM, network settings [1-4]
– enable rq_affinity, read ahead for NVMe devices
• BIOS and CPU performance governor settings
– disable C-states and enable Turbo-boost
– use “performance” CPU governor
Configuring All-flash Ceph
System Tuning for Low-latency Workloads
[1] https://wiki.mikejung.biz/Ubuntu_Performance_Tuning
[2] https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Power_Management_Guide/tuned-adm.html
[3] http://www.brendangregg.com/blog/2015-03-03/performance-tuning-linux-instances-on-ec2.html
[4] https://www.suse.com/documentation/ses-4/singlehtml/book_storage_admin/book_storage_admin.html
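A minimal sketch of how these host-level knobs are commonly applied on a RHEL 7 node is shown below. The profile name, NVMe device name and read-ahead value are illustrative assumptions; C-states and Turbo Boost are set in the BIOS and are not shown here.

  # tuned: adaptive latency-oriented profile [2]
  tuned-adm profile latency-performance

  # CPU frequency scaling: "performance" governor on all cores
  cpupower frequency-set -g performance

  # Per-NVMe block-queue settings: complete I/O on the submitting CPU, keep read-ahead enabled
  echo 2 > /sys/block/nvme0n1/queue/rq_affinity
  echo 128 > /sys/block/nvme0n1/queue/read_ahead_kb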
13.
Parameter (default value → tuned value):
• objecter_inflight_ops: 1024 → 102400
• objecter_inflight_op_bytes: 104857600 → 1048576000
  The Objecter is responsible for sending requests to the OSDs. objecter_inflight_ops / objecter_inflight_op_bytes tell the Objecter to throttle outgoing ops according to a budget (values based on experiments in the Dumpling timeframe).
• ms_dispatch_throttle_bytes: 104857600 → 1048576000
  ms_dispatch_throttle_bytes throttles the dispatched message size for the simple messenger (values based on experiments in the Dumpling timeframe).
• filestore_queue_max_ops: 50 → 5000
• filestore_queue_max_bytes: 104857600 → 1048576000
  filestore_queue_max_ops / filestore_queue_max_bytes throttle in-flight ops for the filestore. These throttles are checked before ops are sent to the journal, so if the filestore does not get enough budget for the current op, the OSD op thread will block.
Configuring All-flash Ceph
Ceph Tunables
14.
Parameter (default value → tuned value):
• filestore_max_sync_interval: 5 → 10
  filestore_max_sync_interval controls the interval (in seconds) at which the sync thread flushes data from memory to disk. By default the filestore writes data to the page cache and the sync thread is responsible for flushing it to disk, after which journal entries can be trimmed. Note that a large filestore_max_sync_interval can cause performance spikes.
• filestore_op_threads: 2 → 6
  filestore_op_threads controls the number of filesystem operation threads that execute in parallel. If the storage backend is fast enough and has enough queues to support parallel operations, it is recommended to increase this parameter, given there is enough CPU available.
• osd_op_threads: 2 → 32
  osd_op_threads controls the number of threads that service Ceph OSD daemon operations. Setting this to 0 disables multi-threading; increasing it may increase the request processing rate. If the storage backend is fast enough and has enough queues to support parallel operations, it is recommended to increase this parameter, given there is enough CPU available.
Configuring All-flash Ceph
Ceph Tunables
15.
Parameter (default value → tuned value):
• journal_queue_max_ops: 300 → 3000
• journal_queue_max_bytes: 33554432 → 1048576000
  journal_queue_max_ops / journal_queue_max_bytes throttle in-flight ops for the journal. If the journal does not get enough budget for the current op, it will block the OSD op thread.
• journal_max_write_entries: 100 → 1000
• journal_max_write_bytes: 10485760 → 1048576000
  journal_max_write_entries / journal_max_write_bytes throttle the ops or bytes for every journal write. Tweaking these two parameters may be helpful for small writes.
Configuring All-flash Ceph
Ceph Tunables
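Taken together, the tuned values from slides 13-15 correspond to a ceph.conf fragment along the following lines. This is only an illustrative sketch; the split between the [global] and [osd] sections is an assumption, not taken from the original deck.

  [global]
  objecter_inflight_ops       = 102400
  objecter_inflight_op_bytes  = 1048576000
  ms_dispatch_throttle_bytes  = 1048576000

  [osd]
  filestore_queue_max_ops     = 5000
  filestore_queue_max_bytes   = 1048576000
  filestore_max_sync_interval = 10
  filestore_op_threads        = 6
  osd_op_threads              = 32
  journal_queue_max_ops       = 3000
  journal_queue_max_bytes     = 1048576000
  journal_max_write_entries   = 1000
  journal_max_write_bytes     = 1048576000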
16. • Leverage the latest Intel NVMe technology to reach high performance and bigger capacity with lower $/GB
– Intel DC P3520 2TB raw performance: 375K read IOPS, 26K write IOPS
• By using multiple OSD partitions per device, Ceph performance scales linearly
– Reduces lock contention within a single OSD process
– Lower latency at all queue depths, with the biggest impact on random reads
• Introduces the concept of multiple OSDs on the same physical device
– Conceptually similar CRUSH map data placement rules to managing disks in an enclosure
Multi-partitioned NVMe SSDs
[Diagram: one NVMe SSD carved into four partitions, OSD1-OSD4]
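As a sketch of how one NVMe device can be carved into four OSD data partitions (the device name is a placeholder; in the Red Hat Ceph Storage 2.x timeframe the partitions would then be prepared with ceph-disk or ceph-ansible, and journal placement is omitted for brevity):

  # Label the device and create four equal partitions
  parted -s /dev/nvme0n1 mklabel gpt
  parted -s /dev/nvme0n1 mkpart osd-data-1 0% 25%
  parted -s /dev/nvme0n1 mkpart osd-data-2 25% 50%
  parted -s /dev/nvme0n1 mkpart osd-data-3 50% 75%
  parted -s /dev/nvme0n1 mkpart osd-data-4 75% 100%

  # Prepare one OSD per partition (one possible option; ceph-ansible can drive the same layout)
  for p in 1 2 3 4; do
      ceph-disk prepare /dev/nvme0n1p${p}
  done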
17. Multi-partitioned NVMe SSDs
[Chart: multiple OSDs per device comparison, 4K random read average latency (ms) vs. IOPS, 5 nodes, 20/40/80 OSDs, 1 vs. 2 vs. 4 OSDs per NVMe]
[Chart: single-node CPU utilization (%) comparison, 4K random read @ QD32, 4/8/16 OSDs, 1 vs. 2 vs. 4 OSDs per NVMe]
These measurements were done on a Ceph node based on Intel P3700 NVMe SSDs but are equally applicable to other NVMe SSDs.
20. • NUMA-balance network and storage devices across CPU sockets
• Bind IO devices to local CPU socket (IRQ pinning)
• Align OSD data and Journals to the same NUMA node
• Pin OSD processes to local CPU socket (NUMA node pinning)
NUMA Considerations
[Diagram: dual-socket server with two NUMA nodes joined by QPI; Ceph OSD processes, memory, NICs and storage attached to socket 0 are local to NUMA node 0, while devices and memory on socket 1 are remote to it, and vice versa]
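One possible way to apply these rules on a node is sketched below; the device names, CPU ranges and the OSD ID are placeholders, not the exact QCT lab values.

  # Find the NUMA node an NVMe device and a NIC are attached to
  cat /sys/block/nvme0n1/device/numa_node
  cat /sys/class/net/ens1f0/device/numa_node

  # Pin the NVMe device's IRQs to cores of its local socket (here node 0 = cores 0-21)
  for irq in $(awk -F: '/nvme0/ {gsub(/ /, "", $1); print $1}' /proc/interrupts); do
      echo 0-21 > /proc/irq/${irq}/smp_affinity_list
  done

  # Restrict an OSD daemon (and all its threads) to the cores of its local socket
  taskset -acp 0-21 $(pgrep -f 'ceph-osd -i 3')
  # Alternatively, launch it under numactl to also bind its memory to the local node:
  # numactl --cpunodebind=0 --membind=0 ceph-osd -i 3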
21.
NUMA-Balanced Config on QCT QuantaGrid D51BP-1U
[Diagram: QCT QuantaGrid D51BP-1U; each of the two CPU sockets has its own RAM, 4 NVMe drive slots (PCIe Gen3 x4 each) and 1 NIC slot (PCIe Gen3 x8), with the sockets linked by QPI; Ceph OSD 1-8 run on CPU 0 and Ceph OSD 9-16 on CPU 1]
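A quick way to verify this balanced layout on a running node, assuming the numactl and hwloc packages are installed:

  # NUMA nodes with their CPUs and memory
  numactl --hardware

  # Full topology, including which socket each NVMe drive and NIC hangs off
  lstopo-no-graphics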
22.
[Chart: 70/30 4K OLTP performance, before vs. after NUMA balance; average latency (ms) vs. IOPS, IO depth scaling 4-128, 5 nodes, 10 clients x 10 RBD volumes, Red Hat Ceph 2.1; series: SW Tuned vs. SW + NUMA CPU Pinning]
40% better IOPS and 100% better latency at QD=8 with NUMA balance.
At QD=8:
• 100% better average latency
• 15-20% better 90th percentile latency
• 10-15% better 99th percentile latency
Performance Testing Results
Latency improvements after NUMA optimizations
23. • All-NVMe Ceph enables high-performance workloads
• NUMA-balanced architecture
• Small footprint (1U), lower overall TCO
• Million IOPS with very low latency
24. Visit www.QCT.io for the QxStor Red Hat Ceph Storage Edition:
• Reference Architecture: Red Hat Ceph Storage on QCT Servers
• Datasheet: QxStor Red Hat Ceph Storage
• Solution Brief: QCT and Intel Hadoop Over Ceph Architecture
• Solution Brief: Deploying Red Hat Ceph Storage on QCT Servers
• Solution Brief: Containerized Ceph for On-Demand, Hyperscale Storage
For Other Information…