Ceph Day San Jose - All-Flahs Ceph on NUMA-Balanced Server

All-Flash Ceph on NUMA-
Balanced Server
Veda Shankar, QCT
Tushar Gohad, Intel
2017 Ceph Day San Jose

• All-flash Ceph and Use Cases
• QCT QxStor All-flash Ceph for IOPS
• QCT Lab Environment Overview & Detailed Architecture
• Importance of NUMA and proof points
Agenda
2

Ceph – A robust, distributed, petabyte-scale software-
defined storage platform
QCT QxStor Red Hat Ceph Storage Edition
Throughput Optimized
• Densest 1U Ceph building blk
• Smaller failure domain
• Dual socket Xeon E5v4
• 3xSSD S3710 ceph journal
• 12xHDD 7.2krpm
• 1x10 GbE NIC
• Obtain best throughput &
density at once (2x 280TB)
• 2x Dual socket Xeon E5v4
• 2x 2x NVMe P3700 journal
• 2x 35xHDD 7.2krpm
• 2x 40 GbE NIC
• Block or Object Storage, Video, Audio, Image,
Streaming media, Big Data
• 3x replication
USECASE
QxStor RCT-400QxStor RCT-200
Cost/Capacity
Optimized
QxStor RCC-400
• Maximize storage capacity
• Highest density 560TB* raw
capacity per chassis
• 2x Dual socket Xeon E5v4
• 2x 35xHDD 7.2krpm
• 2x 10 GbE NIC
• Object storage, Archive,
Backup, Enterprise Dropbox
• Erasure coding
D51PH-1ULH T21P-4U
* Optional model, one MB per chassis, can support 620TB raw capacity
IOPS Optimized
QxStor RCI-300
• All Flash Design, low latency
• Performance & capacity SKUs
• 2x Dual socket Xeon E5v4 higher
cores
• 4x P3520 2TB or 4x P3700 1.6TB
• 1x 10 GbE NIC
• Database, HPC, Mission
Critical Applications
• 3x replication
D51BP-1U
4
T21P-4U

5
QCT QxStor RCI-300
All Flash Design Ceph for I/O Intensive Workloads
SKU 1: All flash Ceph - the Best IOPS SKU
• Ceph Storage Server: D51BP-1U
• CPU: 2x E5-2995 v4 or plus
• RAM: 128GB
• NVMe SSD: 4x P3700 1.6TB
• NIC: 10GbE dual port or 40GbE dual port
SKU 2: All flash Ceph - IOPS/Capacity Balanced SKU
(best TCO, as of today)
• Ceph Storage Server: D51BP-1U
• CPU: 2x E5-2980 v4 or higher cores
• RAM: 128GB
• NVMe SSD: 4x P3520 2TB
• NIC: 10GbE dual port or 40GbE dual port
NUMA Balanced Ceph Hardware
Highest IOPS & Lowest Latency
Optimized Ceph & HW Integration for
IOPS intensive workloads

6
Intel NVMe SSD Options
P3700 P3520
Lithography MLC NAND 3D NAND G1 MLC
Endurance Rating 17 DWPD 0.7 DWPD
Capacity Available 2.0 TB 2.0 TB
Sequential Read 2,800 MB/s 1,700 MB/s
Sequential Write 2,000 MB/s 1,350 MB/s
Random 4K Read 450,000 IOPS 375,000 IOPS
Random 4K Write 175,000 IOPS 26,000 IOPS
Comparative Price $ 2.6x $ x

Why All-flash Storage?
• Falling flash prices: Flash prices fell
as much as 75% over the 18 months
leading up to mid-2016 and the trend
continues.
“TechRepublic: 10 storage trends to watch in
2016”
• Flash is 10x cheaper than DRAM:
with persistence and high capacity
“NetApp”
• Flash is 100x cheaper than disk:
pennies per IOPS vs. dollars per IOPS
“NetApp”
• Flash is 1000x faster than disk:
latency drops from milliseconds to
microseconds
“NetApp”
• Flash performance advantage:
HDDs have an advantage in $/GB,
while flash has an advantage in
$/IOPS.
“TechTarget: Hybrid storage arrays vs. all-
flash arrays: A little flash or a lot?”
• NVMe-based storage trend: 60%
of enterprise storage appliances
will have NVMe bays by 2020
“G2M Research”
Require sub-millisecond latencyNeed performance-optimized
storage for mission-critical apps
Flash capacity gains while the
price drops

8
Storage Evolution
Technology claims are based on comparisons of latency, density and write cycling metrics amongst memory technologies recorded on published
specifications of in-market memory products against internal Intel specifications.

NVM Express (NVMe)
Standardized interface for non-volatile memory, http://nvmexpress.org
9

10
NVMe: Best-in-Class IOPS, Lower/Consistent Latency
Lowest Latency of Standard Storage Interfaces
0
500000
100% Read 70% Read 0% Read
IOPS
IOPS - 4K Random Workloads
PCIe/NVMe SAS 12Gb/s
3x better IOPS vs SAS 12Gbps For the same #CPU cycles, NVMe delivers over 2X the IOPs of SAS!
Gen1 NVMe has 2 to 3x better Latency Consistency vs SAS
Test and System Configurations: PCI Express* (PCIe*)/NVM Express* (NVMe) Measurements made on Intel® Core™ i7-3770S system @ 3.1GHz and 4GB Mem running Windows* Server 2012 Standard O/S, Intel
PCIe/NVMe SSDs, data collected by IOmeter* tool. SAS Measurements from HGST Ultrastar* SSD800M/1000M (SAS), SATA S3700 Series. For more complete information about performance and benchmark results, visit
http://www.intel.com/performance. Source: Intel Internal Testing.

QCT CONFIDENTIAL
RADOS
LIBRADOS
QCT Lab Environment Overview
Ceph 1
….
Monitors
Storage Clusters
Interfaces RBD
(Block Storage)
Cluster Network
10GbE
Public Network
10GbE
Clients
Client 1 Client 2 Client 9 Client 10
Ceph 2 Ceph 5
….

QCT CONFIDENTIAL
Stage Test Subject Benchmark tools Major Task
I/O Baseline Raw Disk FIO
Determine maximum server IO
backplane bandwidth
Network Baseline NIC iPerf
Ensure consistent network
bandwidth between all nodes
Bare Metal RBD Baseline LibRBD FIO CBT
Use FIO RBD engine to test
performance using libRBD
Docker Container OLTP
Baseline
Percona DB + Sysbench Sysbench/OLTP
Establish number of workload-
driver VMs desired per client
Benchmark criteria:
1. Default: ceph.conf
2. Software Level Tuning: ceph.conf tuned
3. Software + NUMA CPU Pinning: ceph.conf tuned + NUMA CPU Pinning
Benchmark Methodology
QCT CONFIDENTIAL

5-Node all-NVMe Ceph Cluster
Dual-Xeon E5 2699v4@2.3GHz, 88 HT, 128GB DDR4
RHEL 7.3, 3.10, Red Hat Ceph 2.1
10x Client Systems
Dual-Xeon E5 2699v4@2.3GHz
88 HT, 128 GB DDR4
Ceph OSD1
Ceph OSD2
Ceph OSD3
Ceph OSD4
Ceph OSD16
…
NVMe3
NVMe2
NVMe4
NVMe1
20x 2TB P3520 SSDs
80 OSDs
2x Replication
19TB Effective Capacity
Tests at cluster fill-level
82%
Ceph RBD Client
Docker3
Sysbench Client
Docker4
Sysbench Client
Docker2 (krbd)
Percona DB Server
Docker1 (krbd)
Percona DB Server
Cluster NW 10 GbE
Sysbench Containers
16 vCPUs, 32GB RAM
FIO 2.8, Sysbench 0.5
DB Containers
16 vCPUs, 32GB RAM,
200GB RBD volume,
100GB MySQL dataset
InnoDB buf cache 25GB(25%)
Public NW 10 GbE
QuantaGrid D51BP-1U
QCT CONFIDENTIAL
Detailed System Architecture in QCT Lab

QCT CONFIDENTIAL
• use faster media for journals, metadata
• use recent Linux kernels
– blk-mq support packs big performance gains with NVMe media
– optimizations for non-rotational media
• use tuned where available
– adaptive latency performance tuning [2]
• virtual memory, network and storage tweaks
– use commonly recommended VM, network settings [1-4]
– enable rq_affinity, read ahead for NVMe devices
• BIOS and CPU performance governor settings
– disable C-states and enable Turbo-boost
– use “performance” CPU governor
Configuring All-flash Ceph
System Tuning for Low-latency Workloads
[1] https://wiki.mikejung.biz/Ubuntu_Performance_Tuning
[2] https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Power_Management_Guide/tuned-adm.html
[3] http://www.brendangregg.com/blog/2015-03-03/performance-tuning-linux-instances-on-ec2.html
[4] https://www.suse.com/documentation/ses-4/singlehtml/book_storage_admin/book_storage_admin.html

Parameter Default value Tuned value
objecter_inflight_ops 1024 102400 Objecter is responsible for sending requests to OSD.
objecter_inflight_ops_bytes 104857600 1048576000
Objecter_inflight_ops/objecter_inflight_op_bytes tell objecter to
throttle outgoing ops according to budget (values based on
experiments in the Dumpling timeframe)
ms_dispatch_throttle_bytes 104857600 1048576000
ms_dispatch_throttle_bytes throttle is to dispatch message size for
simple messenger (values based on experiments in the Dumpling
timeframe)
filestore_queue_max_ops 50 5000
filestore_queue_max_ops/filestore_queue_max_bytes throttle are
used to throttle inflight ops for filestore
filestore_queue_max_bytes 104857600 1048576000
These throttles are checked before sending ops to journal, so if filestore
does not get enough budget for current op, OSD op thread will be
blocked
Ceph Tunables

16
Parameter
Default
value
Tuned
value
filestore_max_sync_interval 5 10
filestore_max_sync_interval controls the interval (in seconds) that sync thread flush
data from memory to disk. Use page cache - by default filestore writes data to
memory and sync thread is responsible for flushing data to disk, then journal entries
can be trimmed. Note that large filestore_max_sync_interval can cause performance
spike
filestore_op_threads 2 6
filestore_op_threads controls the number of filesystem operation threads that
execute in parallel.
If the storage backend is fast enough and has enough queues to support parallel
operations, it’s recommended to increase this parameter, given there is enough CPU
available
osd_op_threads 2 32
osd_op_threads controls the number of threads to service Ceph OSD Daemon
operations.
Setting this to 0 will disable multi-threading.
Increasing this number may increase the request processing rate
If the storage backend is fast enough and has enough queues to support parallel
operations, it’s recommended to increase this parameter, given there is enough CPU
available
Ceph Tunables

Parameter
Default
value
Tuned value
journal_queue_max_ops 300 3000
journal_queue_max_bytes/journal_queue_max_op throttles are
to throttle inflight ops for journal
If journal does not get enough budget for current op, it will block
OSD op thread
journal_queue_max_bytes 33554432 1048576000
journal_max_write_entries 100 1000
journal_max_write_entries/journal_max_write_bytes throttles
are used to throttle ops or bytes for every journal write
Tweaking these two parameters maybe helpful for small writes
journal_max_write_bytes 10485760 1048576000
Ceph Tunables

• Leverage latest Intel NVMe technology to reach high
performance, bigger capacity, with lower $/GB
– Intel DC P3520 2TB raw performance: 375K read IOPS, 26K write IOPS
• By using multiple OSD partitions, Ceph performance scales
linearly
– Reduces lock contention within a single OSD process
– Lower latency at all queue-depths, biggest impact to random reads
• Introduces the concept of multiple OSD’s on the same physical
device
– Conceptually similar crush map data placement rules as managing disks
in an enclosure
Multi-partitioned NVMe SSDs
OSD 1
OSD 2
OSD 3
OSD 4

Multi-partitioned NVMe SSDs
0
2
4
6
8
10
12
0 200,000 400,000 600,000 800,000 1,000,000 1,200,000
AvgLatency (ms)
IOPS
Multiple OSD's per Device comparison
4K Random Read (Latency vs. IOPS)
5 nodes, 20/40/80 OSDs
1 OSD/NVMe 2 OSD/NVMe 4 OSD/NVMe
0
10
20
30
40
50
60
70
80
90
% CPU Utilization
Single Node CPU Utilization Comparison
4K Random Read @QD32
4/8/16 OSDs
1 OSD/NVMe 2 OSD/NVMe 4 OSD/NVMe
These measurements were done on a Ceph node based Intel P3700 NVMe SSDs but are equally applicable to other

0.00
1.00
2.00
3.00
4.00
5.00
6.00
7.00
8.00
9.00
600000 700000 800000 900000 1000000 1100000 1200000 1300000 1400000 1500000 1600000
Average Latency (ms)
IOPS
4K Random Read (Latency vs. IOPS), IODepth scaling 4-128
5 nodes, 10 clients x 10 RBD volumes, Red Hat Ceph Storage 2.1
Default Tuned
~1.57M IOPS @~4ms
200% improvement in IOPS and Latency
Performance Testing Results
4K 100% Random Read
~1.34M IOPS @~1ms, QD=16
200% improvement in IOPS and Latency

0.00
2.00
4.00
6.00
8.00
10.00
12.00
14.00
16.00
18.00
20.00
0 50000 100000 150000 200000 250000 300000 350000 400000 450000 500000
IOPS
Latency vs IOPS (100% wr, 70/30 rd/rw)
IODepth scaling 4-128K, 5 nodes, 10 clients x 10 RBD volumes, Red Hat Ceph 2.1
100% write 70/30 OLTP mix
4K 100% Random Write, 70/30 OLTP Mix
~450k 70/30 OLTP IOPS
@~1ms, QD=4
~165k Write IOPS
@~2ms, QD=4

• NUMA-balance network and storage devices across CPU sockets
• Bind IO devices to local CPU socket (IRQ pinning)
• Align OSD data and Journals to the same NUMA node
• Pin OSD processes to local CPU socket (NUMA node pinning)
NUMA Considerations
CORE CORE
CORE CORE
…
Socket 0
Ceph OSD
CORE CORE
Ceph OSD
CORE
Ceph OSD
CORE
…
Socket 1
Ceph OSD
Memory
Memory
QPI
NUMA Node 0 NUMA Node 1
Storage NICs NICs Storage
Remote
Local

QCT CONFIDENTIAL
NUMA-Balanced Config on QCT QuantaGrid D51BP-1U
CPU 0 CPU 1
RAM RAM
4 NVMe drive slots
1 NIC slot
QPI
PCIe Gen3 x4
Ceph OSD 1-8 Ceph OSD 9-16
PCIe Gen3 x8PCIe Gen3 x8 PCIe Gen3 x4
QCT QuantaGrid D51BP-1U
4 NVMe drive slots
1 NIC slot

QCT CONFIDENTIAL24
0.00
2.00
4.00
6.00
8.00
10.00
12.00
14.00
200000 250000 300000 350000 400000 450000 500000
IOPS
70/30 4k OLTP Performance, Before vs After NUMA balance
IODepth scaling 4-128K, 5 nodes, 10 clients x 10 RBD volumes, Red Hat Ceph 2.1
SW Tuned SW+NUMA CPU Pinning
40% better IOPS,
100% better latency at QD=8
with NUMA balance
At QD=8
100% better average latency
15-20% better 90th pct latency
10-15% better 99th pct latency
Latency improvements after NUMA optimizations

NUMA Partitioning and Hyper-converged
More than one way to slice the system
25
CORE
…
Socket 0
Ceph OSD
CORE
COMPUTE
CORE
Ceph OSD
…
Socket 1
Memory
Memory
QPI
NUMA Node 0 NUMA Node 1
Storage NICs
CORE
Ceph OSD
CORE
Ceph OSD
CORE
COMPUTE
CORE
COMPUTE
CORE
COMPUTE

NUMA Partitioning and Hyper-converged (NFV)
Cluster-on-Die (Xeon Processors with > 10 cores)
26
…
Socket 0
CORE
COMPUTE
CORE
Ceph OSD
…
Socket 1
Memory
Memory
QPI
NUMA Node 0
Storage NIC
CORE
Ceph OSD
CORE
COMPUTE
CORE
COMPUTE
CORE
COMPUTE
NIC NIC
NUMA Node 1 NUMA Node 2 NUMA Node 3
NIC
CORE
COMPUTE
CORE
COMPUTE

§ All-NVMe Ceph enables high performance workloads
§ NUMA balanced architecture
§ Small footprint (1U), lower overall TCO
§ Million IOPS with very low latency

Visit www.QCT.iofor QxStor Red Hat Ceph Storage Edition:
• Reference Architecture Red Hat Ceph Storage on QCT Servers
• Datasheet QxStor Red Hat Ceph Storage
• Solution Brief QCT and Intel Hadoop Over Ceph Architecture
• Solution Brief Deploying Red Hat Ceph Storage on QCT servers
• Solution Brief Containerized Ceph for On-Demand, Hyperscale Storage
For Other Information…

# Please do not change this file directly since it is managed by Ansible and will be overwritten
[global]
fsid = 7e191449-3592-4ec3-b42b-e2c4d01c0104
max open files = 131072
crushtool = /usr/bin/crushtool
debug_lockdep = 0/1
debug_context = 0/1
debug_crush = 1/1
debug_buffer = 0/1
debug_timer = 0/0
debug_filer = 0/1
debug_objecter = 0/1
debug_rados = 0/5
debug_rbd = 0/5
debug_ms = 0/5
debug_monc = 0/5
debug_tp = 0/5
debug_auth = 1/5
debug_finisher = 1/5
debug_heartbeatmap = 1/5
debug_perfcounter = 1/5
debug_rgw = 1/5
debug_asok = 1/5
debug_throttle = 1/1
debug_journaler = 0/0
debug_objectcatcher = 0/0
debug_client = 0/0
debug_osd = 0/0
debug_optracker = 0/0
debug_objclass = 0/0
debug_filestore = 0/0
debug_journal = 0/0
debug_mon = 0/0
debug_paxos = 0/0
osd_crush_chooseleaf_type = 0
filestore_xattr_use_omap = true
osd_pool_default_size = 1
osd_pool_default_min_size = 1
Configuration Detail – ceph.conf (1/2)

rbd_cache = true
mon_compact_on_trim = false
log_to_syslog = false
log_file = /var/log/ceph/$name.log
mutex_perf_counter = true
throttler_perf_counter = false
ms_nocrc = true
[client]
admin socket = /var/run/ceph/$cluster-$type.$id.$pid.$cctid.asok # must be writable by
QEMU and allowed by SELinux or AppArmor
log file = /var/log/ceph/qemu-guest-$pid.log # must be writable by QEMU and allowed by
SELinux or AppArmor
rbd_cache = true
rbd_cache_writethrough_until_flush = false
[mon]
[mon.qct50]
host = qct50
# we need to check if monitor_interface is defined in the inventory per host or if it's set in a
group_vars file
mon addr = 10.5.15.50
mon_max_pool_pg_num = 166496
mon_osd_max_split_count = 10000
[osd]
osd mkfs type = xfs
osd mkfs options xfs = -f -i size=2048
osd_mount_options_xfs = "rw,noatime,inode64,logbsize=256k,delaylog"
osd journal size = 10240
cluster_network = 10.5.16.0/24
public_network = 10.5.15.0/24
filestore_queue_max_ops = 5000
osd_client_message_size_cap = 0
objecter_infilght_op_bytes = 1048576000
ms_dispatch_throttle_bytes = 1048576000
filestore_wbthrottle_enable = True
filestore_fd_cache_shards = 64
objecter_inflight_ops = 1024000
filestore_max_sync_interval = 10
filestore_op_threads = 16
osd_pg_object_context_cache_count = 10240
journal_queue_max_ops = 3000
filestore_odsync_write = True
journal_queue_max_bytes = 10485760000
journal_max_write_entries = 1000
filestore_queue_committing_max_ops = 5000
journal_max_write_bytes = 1048576000
filestore_fd_cache_size = 10240
osd_client_message_cap = 0
journal_dynamic_throttle = True
osd_enable_op_tracker = False
Configuration Detail – ceph.conf (2/2)

cluster:
head: "root@qct50"
clients: ["root@qct50", "root@qct51", "root@qct52", "root@qct53", "root@qct54",
"root@qct55", "root@qct56", "root@qct57", "root@qct58", "root@qct59"]
osds: ["root@qct62", "root@qct63", "root@qct64", "root@qct65", "root@qct66"]
mons: ["root@qct50"]
osds_per_node: 16
fs: xfs
mkfs_opts: -f -i size=2048 -n size=64k
mount_opts: -o inode64,noatime,logbsize=256k
conf_file: /etc/ceph/ceph.conf
ceph.conf: /etc/ceph/ceph.conf
iterations: 1
rebuild_every_test: False
tmp_dir: "/tmp/cbt"
clusterid: 7e191449-3592-4ec3-b42b-e2c4d01c0104
use_existing: True
pool_profiles:
replicated:
pg_size: 8192
pgp_size: 8192
replication: 2
benchmarks:
librbdfio:
rbdadd_mons: "root@qct50:6789"
rbdadd_options: "noshare"
time: 300
ramp: 100
vol_size: 8192
mode: ['randread']
numjobs: 1
use_existing_volumes: False
procs_per_volume: [1]
volumes_per_client: [10]
op_size: [4096]
concurrent_procs: [1]
iodepth: [4, 8, 16, 32, 64, 128]
osd_ra: [128]
norandommap: True
cmd_path: '/root/cbt_packages/fio/fio'
log_avg_msec: 250
pool_profile: 'replicated'
Configuration Detail - CBT YAML File

QCT CONFIDENTIAL
www.QCT.io
Looking for
innovative cloud solution?
Come to QCT, who else?

Ceph Day San Jose - All-Flahs Ceph on NUMA-Balanced Server

Empfohlen

Empfohlen

Weitere ähnliche Inhalte

Was ist angesagt?

Was ist angesagt? (20)

Andere mochten auch

Andere mochten auch (19)

Ähnlich wie Ceph Day San Jose - All-Flahs Ceph on NUMA-Balanced Server

Ähnlich wie Ceph Day San Jose - All-Flahs Ceph on NUMA-Balanced Server (20)

Kürzlich hochgeladen

Kürzlich hochgeladen (20)

Ceph Day San Jose - All-Flahs Ceph on NUMA-Balanced Server