Deploying flash storage for Ceph without compromising performance
1. Ceph Day LA – July 16, 2015
Deploying Flash Storage For Ceph
2. Leading Supplier of End-to-End Interconnect Solutions
[Portfolio diagram: Virtual Protocol Interconnect spanning server/compute, switch/gateway, storage front/back-end, and Metro/WAN – 56G InfiniBand & FCoIB, 10/40/56GbE & FCoE – delivered as ICs, adapter cards, switches/gateways, cables/modules, and host/fabric software. Comprehensive End-to-End InfiniBand and Ethernet Portfolio.]
3. Scale-Out Architecture Requires A Fast Network
▪ Scale-out grows capacity and performance in parallel
▪ Requires a fast network for replication, sharing, and metadata (file)
• Throughput requires bandwidth
• IOPS requires low latency
▪ Proven in HPC, storage appliances, cloud, and now… Ceph
Interconnect Capabilities Determine Scale Out Performance
4. Solid State Storage Technology Evolution – Lower Latency
Advanced Networking and Protocol Offloads Required to Match Storage Media Performance
[Chart: access time (micro-sec, 0.1–1000) by storage media technology – hard drives, NAND flash, next-gen NVM – and the share of networked-storage access time contributed by the storage media vs. the network and storage protocol (SW).]
5. Ceph and Networks
▪ High performance networks enable maximum cluster availability
• Clients, OSDs, Monitors, and Metadata servers communicate over multiple network layers
• Real-time requirements for heartbeat, replication, recovery, and re-balancing
▪ Cluster ("backend") network performance dictates the cluster's performance and scalability
• "Network load between Ceph OSD Daemons easily dwarfs the network load between Ceph Clients and the Ceph Storage Cluster" (Ceph documentation)
6. Ceph Deployment Using 10GbE and 40GbE
▪ Cluster (Private) Network @ 40/56GbE
• Smooth HA, unblocked heartbeats, efficient data balancing
▪ Throughput Clients @ 40/56GbE
• Guarantees line rate for high ingress/egress clients
▪ IOPS Clients @ 10GbE or 40/56GbE
• 100K+ IOPS/client @ 4K blocks
2.5x higher throughput, 15% higher IOPS with 40Gb Ethernet vs. 10GbE!
(http://www.mellanox.com/related-docs/whitepapers/WP_Deploying_Ceph_over_High_Performance_Networks.pdf)
Throughput testing: fio benchmark, 8MB block, 20GB file, 128 parallel jobs, RBD kernel driver, Linux kernel 3.13.3, RHEL 6.3, Ceph 0.72.2
IOPS testing: fio benchmark, 4KB block, 20GB file, 128 parallel jobs, RBD kernel driver, Linux kernel 3.13.3, RHEL 6.3, Ceph 0.72.2
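For reference, a minimal fio job sketch approximating the throughput run described above, assuming the RBD image is already mapped to /dev/rbd0 (the device path and job-file layout are illustrative, not taken from the deck):

    # throughput-test.fio (illustrative name)
    # Approximates the 8MB-block throughput run above; use bs=4k for the IOPS test.
    [global]
    # async I/O straight to the mapped kernel RBD device, bypassing the page cache
    ioengine=libaio
    direct=1
    bs=8m
    size=20g
    numjobs=128
    group_reporting=1

    [rbd-read]
    # device created beforehand with: rbd map <pool>/<image>
    filename=/dev/rbd0
    rw=read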
[Topology: client nodes reach the Ceph nodes (Monitors, OSDs, MDS) over a 10GbE/40GbE public network; Ceph nodes interconnect over a 40GbE cluster network; an admin node is attached.]
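A minimal ceph.conf sketch separating the two networks in the topology above (the subnets are illustrative assumptions, not values from the deck):

    [global]
    # clients and monitors talk to OSDs on the public (front-end) network
    public network = 192.168.10.0/24
    # replication, recovery, and OSD heartbeats use the cluster (back-end) network
    cluster network = 192.168.20.0/24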
7. Ceph Is Accelerated by A Faster Network – Optimized at 56GbE
[Chart: Ceph fio_rbd random read throughput (MB/s) – 64K: 4,300 at 40Gb/s vs. 5,475 at 56Gb/s; 256K: 4,350 at 40Gb/s vs. 5,495 at 56Gb/s.]
27% More Throughput On Random Reads
8. Ceph Reference Architectures Using Disk
9. Optimizing Ceph For Throughput and Price/Throughput
▪ Red Hat, Supermicro, Seagate, Mellanox, Intel
▪ Extensive performance testing: disk, flash, network, CPU, OS, Ceph
▪ Reference architecture published soon
10GbE Network Setup 40GbE Network Setup
10. Testing 12 to 72 Disks Per Node, 2x10GbE vs. 1x40GbE
▪ Key Test Results
• More disks = more MB/s per server, less per OSD
• More flash is faster (usually)
• All-flash with 2 SSDs is as fast as many disks
▪ 40GbE Advantages
• Up to 2x read throughput per server
• Up to 50% decrease in latency
• Easier than bonding multiple 10GbE links (see the bonding sketch below)
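For comparison, a hedged sketch of the bonding configuration two 10GbE links would need on a RHEL-style system of this era (interface names, addresses, and file layout are illustrative; a single 40GbE port avoids this entirely):

    # /etc/sysconfig/network-scripts/ifcfg-bond0 (illustrative RHEL/CentOS layout)
    # LACP bond of two 10GbE ports
    DEVICE=bond0
    TYPE=Bond
    BONDING_MASTER=yes
    BONDING_OPTS="mode=802.3ad miimon=100 xmit_hash_policy=layer3+4"
    IPADDR=192.168.20.11
    PREFIX=24
    ONBOOT=yes
    BOOTPROTO=none

    # /etc/sysconfig/network-scripts/ifcfg-eth2 (repeat for the second 10GbE port)
    DEVICE=eth2
    MASTER=bond0
    SLAVE=yes
    ONBOOT=yes
    BOOTPROTO=none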
13. Ceph Flash Optimization
Highlights compared to stock Ceph
• Read performance up to 8x better
• Write performance up to 2x better with tuning
Optimizations
• All-flash storage for OSDs
• Enhanced parallelism and lock optimization
• Optimization for reads from flash
• Improvements to Ceph messenger
Test configuration
• InfiniFlash storage with IFOS 1.0 EAP3
• Up to 4 RBDs
• 2 Ceph OSD nodes, connected to InfiniFlash
• 40GbE NICs from Mellanox
SanDisk InfiniFlash
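The deck does not list the actual tuning parameters behind the "up to 2x better with tuning" claim; purely as an illustration, Hammer-era all-flash tuning typically touched ceph.conf knobs like the following (values are placeholders, not recommendations):

    [osd]
    # more sharded op-queue workers to raise parallelism on flash
    osd_op_num_shards = 8
    osd_op_num_threads_per_shard = 2
    # relax FileStore/journal throttles that are sized for spinning disks
    filestore_queue_max_ops = 5000
    filestore_max_sync_interval = 10
    journal_max_write_entries = 1000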
17. RDMA Enables Efficient Data Movement
▪ Hardware network acceleration → higher bandwidth, lower latency
▪ Highest CPU efficiency → more CPU power to run applications
[Diagram: efficient data movement with RDMA – higher bandwidth, lower latency, more CPU power for applications.]
18. RDMA Enables Efficient Data Movement At 100Gb/s
▪ Without RDMA
• 5.7 GB/s throughput
• 20-26% CPU utilization
• 4 cores 100% consumed by moving data
▪ With hardware RDMA
• 11.1 GB/s throughput at half the latency
• 13-14% CPU utilization
• More CPU power for applications, better ROI
[Chart: 100GbE with CPU onload vs. 100GbE with network offload – onload penalties: half the throughput, twice the latency, higher CPU consumption; offload: 2x better bandwidth, half the latency, 33% lower CPU.]
See the demo: https://www.youtube.com/watch?v=u8ZYhUjSUoI
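A hedged sketch of how such a comparison is commonly measured (the deck does not specify its test harness; device name, address, and flags are illustrative). iperf3 exercises the TCP/CPU-onload path, while ib_write_bw from the perftest package exercises the RDMA/offload path:

    # TCP (CPU onload) path: start `iperf3 -s` on the target first
    iperf3 -c 192.168.20.12 -P 4 -t 30

    # RDMA (network offload) path: start `ib_write_bw` with no arguments on the target first
    ib_write_bw -d mlx5_0 --report_gbits 192.168.20.12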
19. Adding RDMA to Ceph
▪ RDMA beta included in Hammer
• Mellanox, Red Hat, CohortFS, and community collaboration
• Full RDMA expected in Infernalis
▪ Refactoring of Ceph messaging layer
• New RDMA messenger layer called XioMessenger
• New class hierarchy allowing multiple transports (the simple one is TCP)
• Async design that leverages Accelio
• Reduced locks; reduced number of threads
▪ XioMessenger built on top of Accelio (RDMA abstraction layer)
• Integrated into all Ceph user-space components: daemons and clients
• Both "public network" and "cloud network"
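A sketch of how the experimental XioMessenger was typically switched on in this era, assuming Ceph is built with Accelio support (option naming reflects the Hammer-era prototype and may differ in later releases):

    [global]
    # switch the messenger from the default TCP "simple" transport to the Accelio/RDMA one
    ms_type = xio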
20. Accelio, High-Performance Reliable Messaging and RPC Library
▪ Open source!
• https://github.com/accelio/accelio/ && www.accelio.org
▪ Faster RDMA integration into applications
▪ Asynchronous
▪ Maximize message and CPU parallelism
• Enable >10GB/s from a single node
• Enable <10usec latency under load
▪ In Giant and Hammer
• http://wiki.ceph.com/Planning/Blueprints/Giant/Accelio_RDMA_Messenger
23. Ceph For Large Scale Storage – Fujitsu Eternus CD10000
▪ Hyperscale storage
• 4 to 224 nodes
• Up to 56 PB raw capacity
▪ Runs Ceph with enhancements
• 3 different storage nodes
• Object, block, and file storage
▪ Mellanox InfiniBand cluster network
• 40Gb InfiniBand cluster network
• 10Gb Ethernet front-end network
24. Media & Entertainment Storage – StorageFoundry Nautilus
▪ Turnkey object storage
• Built on Ceph
• Pre-configured for rapid deployment
• Mellanox 10/40GbE networking
▪ High-capacity configuration
• 6-8TB helium-filled drives
• Up to 2PB in 18U
▪ High-performance configuration
• Single-client read of 2.2 GB/s
• SSD caching + hard drives
• Supports Ethernet, IB, FC, FCoE front-end ports
▪ More information: www.storagefoundry.net
25. SanDisk InfiniFlash
▪ Flash storage system
• Announced March 2015
• 512 TB (raw) in one 3U enclosure
• Tested with 40GbE networking
▪ High throughput
• 8 SAS ports, up to 7GB/s
• Connect to 2 or 4 OSD nodes
• Up to 1M IOPS with two nodes
▪ More information: http://bigdataflash.sandisk.com/infiniflash
26. More Ceph Solutions
▪ Cloud – OnyxCCS ElectraStack
• Turnkey IaaS
• Multi-tenant computing system
• 5x faster node/data restoration
• https://www.onyxccs.com/products/8-series
▪ Flextronics CloudLabs
• OpenStack on CloudX design
• 2 SSD + 20 HDD per node
• Mix of 1Gb/40GbE networks
• http://www.flextronics.com/
▪ ISS Storage Supercore
• Healthcare solution
• 82,000 IOPS on 512B reads
• 74,000 IOPS on 4KB reads
• 1.1GB/s on 256KB reads
• http://www.iss-integration.com/supercore.html
▪ Scalable Informatics Unison
• High availability cluster
• 60 HDD in 4U
• Tier 1 performance at archive cost
• https://scalableinformatics.com/unison.html
27. Even More Ceph Solutions
▪ Keeper Technology – keeperSAFE
• Ceph appliance
• For US Government
• File gateway for NFS, SMB, & StorNext
• Mellanox switches
▪ Monash University – Melbourne, Australia
• 3 Ceph clusters, >6PB total storage
• 8, 17 (27), and 37 nodes
• OpenStack Cinder and S3/Swift object storage
• Mellanox networking, 10GbE nodes, 56GbE ISLs
28. Summary
▪ Ceph scalability and performance benefit from high performance networks
• Especially with lots of disks
▪ Ceph is being optimized for flash storage
▪ End-to-end 40/56 Gb/s transport accelerates Ceph today
• 100Gb/s testing has begun!
• Available in various Ceph solutions and appliances
▪ RDMA is next to optimize flash performance – beta in Hammer
30. SanDisk IF-500 topology on a single 512 TB IF-100
- IF-100 BW is ~8.5GB/s (with 6Gb SAS; 12Gb SAS is coming EOY) and ~1.5M 4K IOPS
- We saw that Ceph is very resource hungry, so we need at least 2 physical nodes on top of the IF-100
- We need to connect all 8 ports of an HBA to saturate the IF-100 for bigger block sizes
31. SanDisk Ceph-InfiniFlash Setup Details
Performance Config – IF-500, 2-node cluster (32 drives shared to each OSD node)
• OSD nodes: 2 servers (Dell R720); 2x E5-2680 12C 2.8GHz; 4x 16GB RDIMM dual rank x4 (64GB); 1x Mellanox ConnectX-3 dual-port 40GbE; 1x LSI 9207 HBA card
• RBD clients: 4 servers (Dell R620); 1x E5-2680 10C 2.8GHz; 2x 16GB RDIMM dual rank x4 (32GB); 1x Mellanox ConnectX-3 dual-port 40GbE
• Storage: IF-100 with 64 Icechips in A2 config – IF-100 is connected to 64 x 1YX2 Icechips in A2 topology; total storage = 64 * 8TB = 512TB
• Network details: 40G switch, NA
• OS details: Ubuntu 14.04 LTS 64-bit, kernel 3.13.0-32; LSI card/driver: SAS2308 (9207), mpt2sas; Mellanox 40Gb/s NIC: MT27500 [ConnectX-3], mlx4_en 2.2-1 (Feb 2014)
Cluster Configuration
• Ceph version: sndk-ifos-1.0.0.04 (0.86.rc.eap2)
• Replication (default): 2 [host] (note: host-level replication)
• Pools, PGs & RBDs: 4 pools; 2048 PGs per pool; 2 RBDs from each pool; RBD size 2TB
• Number of monitors: 1
• Number of OSD nodes: 2
• Number of OSDs per node: 32; total OSDs = 32 * 2 = 64
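A hedged sketch of the CLI that would reproduce the pool and RBD layout above (the deck gives only counts and sizes; pool and image names are made up):

    # 4 pools with 2048 placement groups each
    for p in pool0 pool1 pool2 pool3; do
        ceph osd pool create $p 2048 2048
        # 2x replication; the default CRUSH rule already places replicas on separate hosts
        ceph osd pool set $p size 2
        # 2 RBD images of 2TB each per pool (size is given in MB on rbd of this era)
        rbd create $p/rbd0 --size 2097152
        rbd create $p/rbd1 --size 2097152
    done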
32. SanDisk: 8K Random – 2 RBD/Client with File System
IOPS: 2 LUNs/Client (Total 4 Clients) – Lat(ms): 2 LUNs/Client (Total 4 Clients)
[Charts: IOPS and latency (ms) for Stock Ceph vs. IFOS 1.0 across queue depths 1-32 at read percentages 0/25/50/75/100.]
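A hedged fio job sketch for one cell of these charts (8K random, 75% read, queue depth 16); the device paths and exact fio options SanDisk used are not given in the deck:

    # 8k-random-mix.fio (illustrative name)
    # One (queue depth, read%) cell of the charts above.
    # Sweep iodepth over 1,2,4,8,16,32 and rwmixread over 0,25,50,75,100 to fill the charts;
    # use bs=64k for the next slide's workload.
    [global]
    ioengine=libaio
    direct=1
    bs=8k
    rw=randrw
    rwmixread=75
    iodepth=16
    runtime=300
    time_based=1
    group_reporting=1

    # one job per mapped RBD LUN (2 LUNs per client)
    [lun0]
    filename=/dev/rbd0
    [lun1]
    filename=/dev/rbd1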
33. SanDisk: 64K Random – 2 RBD/Client with File System
IOPS: 2 LUNs/Client (Total 4 Clients) – Lat(ms): 2 LUNs/Client (Total 4 Clients)
[Charts: IOPS and latency (ms) for Stock Ceph vs. IFOS 1.0 across queue depths 1-32 at read percentages 0/25/50/75/100.]